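What are the top bloggers studying? Using a Python crawler to analyze the favorites folders of top CSDN bloggers, so you can learn alongside them and become the next one!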

Yang_River, published 2021-09-06 15:02 / 1284 views


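Preface

The computer industry moves so fast that a few days without studying can leave you behind. For programmers, the most important thing is to keep up with what is changing in the industry and to learn new technologies. Often, though, we do not know what to learn: if a new technology never becomes widely used, it is too niche to help much with study or work. That is when we want to know what the top bloggers are studying, because following them makes a detour much less likely. So let's see what the top bloggers on CSDN have saved in their favorites and follow in their footsteps!

Program overview

The program crawls CSDN to collect the public favorites folders of the site's top-ranked bloggers and writes them to a csv file. The collected data is then used to analyze what these bloggers are studying, and the result is presented as a visualization.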

Data crawling

The requests library is used to request the pages, and the BeautifulSoup4 and json libraries are used to parse them.

Fetching the CSDN overall author ranking

First we need the bloggers who appear on CSDN's ranking, together with their basic information. The data is loaded dynamically (for more on dynamic loading, see the post "Scumbag, why do you have so many photos of girls? Because I'm good at Python crawling!"), so open the browser's developer tools and find the JSON request in the Network tab:

Look at the request URLs:

https://blog.csdn.net/phoenix/web/blog/all-rank?page=0&pageSize=20
https://blog.csdn.net/phoenix/web/blog/all-rank?page=1&pageSize=20
...

Each JSON request returns 20 entries, so to get the top 100 bloggers the requests are built as follows:

import requests
from bs4 import BeautifulSoup

url_rank_pattern = "https://blog.csdn.net/phoenix/web/blog/all-rank?page={}&pageSize=20"
# headers is the request-header dict defined in the complete program
for i in range(5):
    url = url_rank_pattern.format(i)
    # Declare the page encoding
    response = requests.get(url=url, headers=headers)
    response.encoding = "utf-8"
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
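The relationship between "top 100" and "5 requests" can be made explicit. The following is a minimal sketch that only builds the URLs and sends nothing; the helper name rank_urls is made up for illustration, and it assumes the page size stays at the 20 seen in the observed requests.

```python
import math

URL_RANK_PATTERN = "https://blog.csdn.net/phoenix/web/blog/all-rank?page={}&pageSize=20"
PAGE_SIZE = 20  # entries per response, as observed in the request URLs


def rank_urls(top_n):
    """Return the ranking URLs needed to cover the first top_n bloggers."""
    pages = math.ceil(top_n / PAGE_SIZE)
    # Ranking pages are numbered from 0
    return [URL_RANK_PATTERN.format(page) for page in range(pages)]


for url in rank_urls(100):
    print(url)
```

Calling rank_urls(100) yields pages 0 through 4, which matches the range(5) loop above.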

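Once the JSON data has been received, parse it with the json module (the re module works too; pick whichever you prefer) to get the user information. Only the userName is needed here, so only userName is parsed; other fields can be extracted the same way if needed: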

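userNames = []  # create this list once, before the request loop

# Inside the loop, after each request:
information = json.loads(str(soup))
for j in information["data"]["allRankListItem"]:
    # Collect the userName of each ranked blogger
    userNames.append(j["userName"])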
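The parsing step can be tried offline against a hand-written payload. The sketch below assumes the response has the data / allRankListItem / userName shape used above; the sample values and the helper name extract_user_names are made up for illustration.

```python
import json


def extract_user_names(payload_text):
    """Return the userName of every entry in a ranking response."""
    information = json.loads(payload_text)
    return [item["userName"] for item in information["data"]["allRankListItem"]]


# Hypothetical sample payload, trimmed to the fields used here
sample = json.dumps({
    "data": {
        "allRankListItem": [
            {"userName": "user_a", "nickName": "A"},
            {"userName": "user_b", "nickName": "B"},
        ]
    }
})

print(extract_user_names(sample))
```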

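Fetching the list of favorites folders

With the userName values in hand, look at a profile page to see how the list of favorites folders is requested. This article uses the author's own profile as the example (a little self-promotion). The method is the same as in the previous step: switch to the "Favorites" tab on the profile page and again use the Network tab of the developer tools:

Look at the URL that requests the folder list: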

https://blog.csdn.net/community/home-api/v1/get-favorites-created-list?page=1&size=20&noMore=false&blogUsername=LOVEmy134611

This is where the userName from the previous step comes in: replacing the value of blogUsername fetches the folder list of each ranked blogger. When a user has more than 20 folders, changing the page value retrieves the remaining ones:

collections = "https://blog.csdn.net/community/home-api/v1/get-favorites-created-list?page=1&size=20&noMore=false&blogUsername={}"
for userName in userNames:
    url = collections.format(userName)
    # Declare the page encoding
    response = requests.get(url=url, headers=headers)
    response.encoding = "utf-8"
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

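After the JSON data is received, parse it with the json module to get the folder information. Only the folder id is needed here, so only id is parsed; other fields can be extracted as needed (for example, the number of followers, to find the most popular folders):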

file_id_list = []
information = json.loads(str(soup))
# Total number of favorites folders
collection_number = information["data"]["total"]
# Collect the folder ids
for j in information["data"]["list"]:
    file_id_list.append(j["id"])
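Paging through the folder list can be checked without touching the network by passing the fetch step in as a function. This is a sketch under the assumption that every page carries the data / total and data / list fields used above; collect_folder_ids and fake_fetch are hypothetical names, and fake_fetch merely imitates a user with 45 folders.

```python
def collect_folder_ids(fetch_page, size=20):
    """Collect folder ids from every page.

    fetch_page(page) must return a dict shaped like the parsed response:
    {"data": {"total": <int>, "list": [{"id": ...}, ...]}}
    """
    ids = []
    page = 1
    while True:
        information = fetch_page(page)
        items = information["data"]["list"]
        ids.extend(item["id"] for item in items)
        total = information["data"]["total"]
        # Stop once all pages are covered, or if a page comes back empty
        if page * size >= total or not items:
            break
        page += 1
    return ids


def fake_fetch(page, total=45, size=20):
    """Stand-in for the real request: serves ids 0..total-1 in pages."""
    start = (page - 1) * size
    stop = min(start + size, total)
    return {"data": {"total": total, "list": [{"id": n} for n in range(start, stop)]}}


print(len(collect_folder_ids(fake_fetch)))
```

In real use, fetch_page would wrap the requests.get call and the JSON parsing shown earlier.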

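You may ask: CSDN currently has an old and a new profile page, so is the request the same for both? It is not: in the browser, the old version uses a different endpoint. However, the new-version request also works for those users, so there is no need to distinguish between the two endpoints. The same holds when fetching the favorites data.

Fetching the favorites data

Finally, click a folder's expand button to see its contents, and analyze the request in the Network tab of the developer tools as before:

Look at the URL that requests the folder contents: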

https://blog.csdn.net/community/home-api/v1/get-favorites-item-list?blogUsername=LOVEmy134611&folderId=9406232&page=1&pageSize=200

The userName and the folder id obtained earlier are all that is needed to build the request for the items in a folder:

file_url = "https://blog.csdn.net/community/home-api/v1/get-favorites-item-list?blogUsername={}&folderId={}&page=1&pageSize=200"
for file_id in file_id_list:
    url = file_url.format(userName, file_id)
    # Declare the page encoding
    response = requests.get(url=url, headers=headers)
    response.encoding = "utf-8"
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
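As an alternative to str.format, the query string can be assembled with the standard library, which also escapes any unusual characters in a user name. The sketch below only builds the URL; item_list_url is a made-up helper name, and the parameter names are the ones visible in the observed request.

```python
from urllib.parse import urlencode

BASE_URL = "https://blog.csdn.net/community/home-api/v1/get-favorites-item-list"


def item_list_url(user_name, folder_id, page=1, page_size=200):
    """Build the URL that lists the items of one favorites folder."""
    query = urlencode({
        "blogUsername": user_name,
        "folderId": folder_id,
        "page": page,
        "pageSize": page_size,
    })
    return BASE_URL + "?" + query


print(item_list_url("LOVEmy134611", 9406232))
```

With the values from the example above, the result is identical to the URL observed in the developer tools.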

Finally, parse the result with the re module:

    user = user_dict[userName]
    user = preprocess(user)
    # Titles
    title_list = analysis(r'"title":"(.*?)",', str(soup))
    # Links
    url_list = analysis(r'"url":"(.*?)"', str(soup))
    # Authors
    nickname_list = analysis(r'"nickname":"(.*?)",', str(soup))
    # Dates on which the items were saved
    date_list = analysis(r'"dateTime":"(.*?)",', str(soup))
    for i in range(len(title_list)):
        title = preprocess(title_list[i])
        url = preprocess(url_list[i])
        nickname = preprocess(nickname_list[i])
        date = preprocess(date_list[i])
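The same four fields can also be read with the json module, which keeps each item's fields together instead of relying on four parallel lists staying aligned. This is only a sketch: it assumes the items sit under data / list, as in the folder-list response, which the article does not show for this endpoint. The helper name parse_items and the sample payload are made up.

```python
import json

FIELDS = ("title", "url", "nickname", "dateTime")


def parse_items(payload_text):
    """Return [title, url, nickname, dateTime] rows, skipping incomplete items."""
    information = json.loads(payload_text)
    rows = []
    for item in information["data"]["list"]:
        row = [str(item.get(field, "")) for field in FIELDS]
        if all(row):  # keep only items where every field is non-empty
            rows.append(row)
    return rows


# Hypothetical payload: the second item lacks a title and is dropped
sample = json.dumps({
    "data": {
        "total": 2,
        "list": [
            {"title": "Post A", "url": "https://example.com/a",
             "nickname": "someone", "dateTime": "2021-09-01"},
            {"url": "https://example.com/b",
             "nickname": "someone", "dateTime": "2021-09-02"},
        ],
    }
})

print(parse_items(sample))
```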

Complete crawler code

import time
import requests
from bs4 import BeautifulSoup
import os
import json
import re
import csv

if not os.path.exists("col_infor.csv"):
    # Create the csv file that stores the data
    file = open("col_infor.csv", "w", encoding="utf-8-sig", newline="")
    csv_head = csv.writer(file)
    # Header row
    header = ["userName", "title", "url", "anthor", "date"]
    csv_head.writerow(header)
    file.close()

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"
}

def preprocess(string):
    # Commas would break the hand-written csv rows, so replace them
    return string.replace(",", " ")

url_rank_pattern = "https://blog.csdn.net/phoenix/web/blog/all-rank?page={}&pageSize=20"
userNames = []
user_dict = {}
for i in range(5):
    url = url_rank_pattern.format(i)
    # Declare the page encoding
    response = requests.get(url=url, headers=headers)
    response.encoding = "utf-8"
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    information = json.loads(str(soup))
    for j in information["data"]["allRankListItem"]:
        # Record the userName and map it to the nickName
        userNames.append(j["userName"])
        user_dict[j["userName"]] = j["nickName"]

def get_col_list(page, userName):
    collections = "https://blog.csdn.net/community/home-api/v1/get-favorites-created-list?page={}&size=20&noMore=false&blogUsername={}"
    url = collections.format(page, userName)
    # Declare the page encoding
    response = requests.get(url=url, headers=headers)
    response.encoding = "utf-8"
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    information = json.loads(str(soup))
    return information

def analysis(item, results):
    pattern = re.compile(item, re.I | re.M)
    result_list = pattern.findall(results)
    return result_list

def get_col(userName, file_id, col_page):
    file_url = "https://blog.csdn.net/community/home-api/v1/get-favorites-item-list?blogUsername={}&folderId={}&page={}&pageSize=200"
    url = file_url.format(userName, file_id, col_page)
    # Declare the page encoding
    response = requests.get(url=url, headers=headers)
    response.encoding = "utf-8"
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # Parse the JSON so the caller can read the item total
    information = json.loads(response.text)
    user = user_dict[userName]
    user = preprocess(user)
    # Titles
    title_list = analysis(r'"title":"(.*?)",', str(soup))
    # Links
    url_list = analysis(r'"url":"(.*?)"', str(soup))
    # Authors
    nickname_list = analysis(r'"nickname":"(.*?)",', str(soup))
    # Dates on which the items were saved
    date_list = analysis(r'"dateTime":"(.*?)",', str(soup))
    for i in range(len(title_list)):
        title = preprocess(title_list[i])
        url = preprocess(url_list[i])
        nickname = preprocess(nickname_list[i])
        date = preprocess(date_list[i])
        if title and url and nickname and date:
            with open("col_infor.csv", "a+", encoding="utf-8-sig") as f:
                f.write(user + "," + title + "," + url + "," + nickname + "," + date + "\n")
    return information

for userName in userNames:
    page = 1
    file_id_list = []
    information = get_col_list(page, userName)
    # Total number of favorites folders
    collection_number = information["data"]["total"]
    # Collect the folder ids
    for j in information["data"]["list"]:
        file_id_list.append(j["id"])
    while collection_number > 20:
        page = page + 1
        collection_number = collection_number - 20
        information = get_col_list(page, userName)
        # Collect the folder ids
        for j in information["data"]["list"]:
            file_id_list.append(j["id"])
    collection_number = 0
    # Fetch the items in each folder
    for file_id in file_id_list:
        col_page = 1
        information = get_col(userName, file_id, col_page)
        number_col = information["data"]["total"]
        while number_col > 200:
            col_page = col_page + 1
            number_col = number_col - 200
            get_col(userName, file_id, col_page)
    number_col = 0
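The complete program keeps rows intact by replacing commas with spaces before writing them. A different option, not what the program above does, is to let the standard csv module quote such fields so that titles keep their commas. The sketch below writes to an in-memory buffer; the helper name append_rows and the sample row are made up.

```python
import csv
import io


def append_rows(handle, rows):
    """Write rows to an open text handle, quoting fields that contain commas."""
    writer = csv.writer(handle)
    writer.writerows(rows)


# Hypothetical row in the column order userName, title, url, author, date
sample_row = ["some_user", "Title, with a comma", "https://example.com/a",
              "someone", "2021-09-06"]

buffer = io.StringIO()
append_rows(buffer, [sample_row])
print(buffer.getvalue())
```

With a real file, the handle would be opened with newline="" as in the header-writing code above, so that the csv module controls the line endings.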

Crawl results

A sample of the crawled results:

Data analysis and visualization

Finally, the wordcloud library is used to draw a word cloud of what the top bloggers have saved.

from os import path
from PIL import Image
import matplotlib.pyplot as plt
import jieba
from wordcloud import WordCloud, STOPWORDS
import pandas as pd
import matplotlib.ticker as ticker
import numpy as np
import math
import re

# Directory of the current file
d = path.dirname(__file__)

df = pd.read_csv("col_infor.csv", encoding="utf-8-sig", usecols=["userName", "title", "url", "anthor", "date"])
# Drop missing titles so that join() only sees strings
place_array = df["title"].dropna().astype(str).values
place_list = "，".join(place_array)
# Overwrite rather than append, so repeated runs do not duplicate the text
with open(path.join(d, "text.txt"), "w", encoding="utf-8") as f:
    f.write(place_list)

# Read the whole text.
with open(path.join(d, "text.txt"), encoding="utf-8") as f:
    file = f.read()

# Word segmentation
# Stop words
stopwords = ["的", "與", "和", "建議", "收藏", "使用", "了", "實(shí)現(xiàn)", "我", "中", "你", "在", "之"]
# Segmentation result before the stop words are removed
text_split = jieba.cut(file)
# Segmentation result with the stop words removed
text_split_no = []
for word in text_split:
    if word not in stopwords:
        text_split_no.append(word)
text = " ".join(text_split_no)

# Background image used as the mask
picture_mask = np.array(Image.open(path.join(d, "path.jpg")))
stopwords = set(STOPWORDS)
stopwords.add("said")
wc = WordCloud(
    # Font path; a font with Chinese glyphs is required
    font_path=r"C:/Windows/Fonts/simsun.ttc",
    # font_path=r"/usr/share/fonts/wps-office/simsun.ttc",
    background_color="white",
    max_words=2000,
    mask=picture_mask,
    stopwords=stopwords)
# Generate the word cloud
wc.generate(text)
# Save the image
wc.to_file(path.join(d, "result.jpg"))
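Before drawing the cloud, it can help to look at the raw counts. The sketch below counts tokens that have already been segmented (for example by jieba.cut) and drops stop words and whitespace. The helper name top_words and the sample tokens are illustrative only, not output from the crawl.

```python
from collections import Counter


def top_words(tokens, stopwords, n=10):
    """Return the n most frequent tokens, ignoring stop words and blanks."""
    counts = Counter(
        token for token in tokens
        if token.strip() and token not in stopwords
    )
    return counts.most_common(n)


# Hypothetical tokens, as a segmenter might produce them
tokens = ["Python", "crawler", "的", "Python", "algorithm", " ", "crawler", "Python"]
print(top_words(tokens, {"的"}, n=2))
```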

