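About the data sources
This project was written in early July 2017. It mainly uses Python to crawl data from Wangdaizhijia (网贷之家) and Renrendai (人人贷) and then analyzes it.
Wangdaizhijia is the largest P2P data platform in China, and Renrendai is one of the top twenty P2P platforms in China.
Source code link
For packet capture I mainly used the Network panel of Chrome DevTools. All of Wangdaizhijia's data is returned as JSON via AJAX, while Renrendai both returns data via AJAX and renders data directly into HTML pages.
Request example
From the captured request you can see the request method (GET or POST), the request headers, and the request parameters.
From the captured response you can see the format of the returned data (JSON in this example), its structure, and the actual values.
Note: this is Wangdaizhijia's current backend API endpoint. The data endpoints that existed when the crawler was written differ from today's, so the Wangdaizhijia part of the crawler no longer works.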
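To make the "format and structure" step concrete, here is a minimal sketch of how I would inspect a captured JSON response before writing any crawler code. The response body below is hypothetical: only the "data" -> "platOuterVo" nesting comes from the crawler code in this post, and the field values are made up.

```python
import json

# Hypothetical response body. Only the "data" -> "platOuterVo" nesting is taken
# from the crawler code in this post; the values are invented for illustration.
sample_response = """
{"data": {"platOuterVo": {"platName": "ExamplePlat",
                          "locationAreaName": "Shanghai",
                          "onlineDate": "2014-05-01"}}}
"""


def describe(value, indent=0):
    """Return one line per key, showing the nesting and the type of each value."""
    lines = []
    if isinstance(value, dict):
        for key, child in value.items():
            lines.append(" " * indent + key + ": " + type(child).__name__)
            lines.extend(describe(child, indent + 2))
    return lines


parsed = json.loads(sample_response)
structure = describe(parsed)
print("\n".join(structure))
```

Printing the key tree like this is usually enough to decide which nested object holds the records worth storing.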
Based on the results of the packet capture analysis, we construct the requests. This project uses Python's requests library to simulate HTTP requests.
The code:
    import requests


    class SessionUtil():
        def __init__(self, headers=None, cookie=None):
            self.session = requests.Session()
            if headers is None:
                # Default headers copied from a real browser request
                headersStr = {
                    "Accept": "application/json, text/javascript, */*; q=0.01",
                    "X-Requested-With": "XMLHttpRequest",
                    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36",
                    "Accept-Encoding": "gzip, deflate, sdch, br",
                    "Accept-Language": "zh-CN,zh;q=0.8"
                }
                self.headers = headersStr
            else:
                self.headers = headers
            self.cookie = cookie

        # Send a GET request and return the response body as text
        def getReq(self, url):
            return self.session.get(url, headers=self.headers).text

        def addCookie(self, cookie):
            self.headers["cookie"] = cookie

        # Send a POST request (now sends the same headers as getReq)
        def postReq(self, url, param):
            return self.session.post(url, param, headers=self.headers).text
Of the request headers, the only key field I set was "User-Agent". Wangdaizhijia and Renrendai had no anti-crawling measures, and it was not even necessary to set the "Referer" field to avoid cross-origin errors.
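For sites that do check "Referer", the header can be added in the same place as "User-Agent". The sketch below only builds the request object and never sends it, so the method, URL, and headers can be inspected offline. It uses the standard library's urllib instead of requests to stay dependency-free; the function name build_request and the example.invalid URL are my own placeholders, not part of the project.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_request(url, params=None, data=None, referer=None):
    """Build (but do not send) a GET or POST request with browser-like headers."""
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        "X-Requested-With": "XMLHttpRequest",
    }
    if referer:
        # Only needed for sites that reject requests without a Referer
        headers["Referer"] = referer
    if params:
        url = url + "?" + urlencode(params)
    # urllib switches to POST automatically when a body is supplied
    body = urlencode(data).encode("utf-8") if data else None
    return Request(url, data=body, headers=headers)


# Placeholder URL; the real endpoints are not reproduced here
get_req = build_request("https://example.invalid/api", params={"platId": "59"})
post_req = build_request("https://example.invalid/api", data={"page": "1"},
                         referer="https://example.invalid/")
print(get_req.get_method(), get_req.full_url)
print(post_req.get_method(), post_req.header_items())
```

Note that urllib normalizes header names, so "User-Agent" is stored as "User-agent" on the request object.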
Crawler example
Below is a crawler example.
    import json
    import time
    import traceback

    from databaseUtil import DatabaseUtil
    from sessionUtil import SessionUtil
    from dictUtil import DictUtil
    from logUtil import LogUtil

    # Columns of the problemPlatDetail table; each one is also a key in the returned JSON
    FIELDS = [
        "actualCapital", "aliasName", "association", "associationDetail", "autoBid", "autoBidCode",
        "bankCapital", "bankFunds", "bidSecurity", "bindingFlag", "businessType", "companyName",
        "credit", "creditLevel", "delayScore", "delayScoreDetail", "displayFlg", "drawScore",
        "drawScoreDetail", "equityVoList", "experienceScore", "experienceScoreDetail", "fundCapital",
        "gjlhhFlag", "gjlhhTime", "gruarantee", "inspection", "juridicalPerson", "locationArea",
        "locationAreaName", "locationCity", "locationCityName", "manageExpense", "manageExpenseDetail",
        "newTrustCreditor", "newTrustCreditorCode", "officeAddress", "onlineDate", "payment", "paymode",
        "platBackground", "platBackgroundDetail", "platBackgroundDetailExpand", "platBackgroundExpand",
        "platEarnings", "platEarningsCode", "platName", "platStatus", "platUrl", "problem", "problemTime",
        "recordId", "recordLicId", "registeredCapital", "riskCapital", "riskFunds", "riskReserve",
        "riskcontrol", "securityModel", "securityModelCode", "securityModelOther", "serviceScore",
        "serviceScoreDetail", "startInvestmentAmout", "term", "termCodes", "termWeight",
        "transferExpense", "transferExpenseDetail", "trustCapital", "trustCreditor",
        "trustCreditorMonth", "trustFunds", "tzjPj", "vipExpense", "withTzj", "withdrawExpense",
    ]


    def handleData(returnStr):
        # Pull the platform detail object out of the JSON response
        jsonData = json.loads(returnStr)
        platData = jsonData.get("data").get("platOuterVo")
        return platData


    def storeData(jsonOne, conn, cur, platId):
        # Parameterized insert instead of string concatenation, so the driver handles quoting.
        # Assumes a MySQL DB-API driver that uses "%s" placeholders (e.g. pymysql).
        columns = ",".join(FIELDS + ["platId"])
        placeholders = ",".join(["%s"] * (len(FIELDS) + 1))
        sql = "insert into problemPlatDetail (" + columns + ") values (" + placeholders + ")"
        values = [jsonOne.get(name) for name in FIELDS] + [platId]
        cur.execute(sql, values)
        conn.commit()


    conn, cur = DatabaseUtil().getConn()
    session = SessionUtil()
    logUtil = LogUtil("problemPlatDetail.log")
    cur.execute("select platId from problemPlat")
    data = cur.fetchall()
    print(data)
    mylist = list()
    for i in range(0, len(data)):
        platId = str(data[i].get("platId"))
        mylist.append(platId)
    print(mylist)
    for i in mylist:
        url = "" + i  # the endpoint prefix is left empty, as in the original listing
        try:
            data = session.getReq(url)
            platData = handleData(data)
            dictObject = DictUtil(platData)
            storeData(dictObject, conn, cur, i)
        except Exception as e:
            traceback.print_exc()
    cur.close()
    conn.close()
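The storage step is the part most likely to break on real data, because platform names and addresses can contain quote characters. As a hedged illustration, here is a small self-contained sketch of a parameterized insert. It uses an in-memory sqlite3 database as a stand-in for MySQL, only three of the columns, and a made-up record; the helper name store_row is hypothetical.

```python
import sqlite3

# Hypothetical subset of the problemPlatDetail columns, for illustration only
FIELDS = ["platName", "companyName", "onlineDate"]


def store_row(cur, record, plat_id):
    """Insert one record using placeholders so the driver escapes the values."""
    columns = ",".join(FIELDS + ["platId"])
    # sqlite3 uses "?" placeholders; MySQL drivers such as pymysql use "%s"
    placeholders = ",".join(["?"] * (len(FIELDS) + 1))
    sql = "insert into problemPlatDetail (%s) values (%s)" % (columns, placeholders)
    values = [record.get(name) for name in FIELDS] + [plat_id]
    cur.execute(sql, values)


conn = sqlite3.connect(":memory:")  # stand-in for the MySQL connection
cur = conn.cursor()
cur.execute("create table problemPlatDetail "
            "(platName text, companyName text, onlineDate text, platId text)")

# Made-up record; the apostrophe would break a hand-concatenated SQL string
record = {"platName": "O'Example", "companyName": "Example Co.", "onlineDate": "2014-05-01"}
store_row(cur, record, "59")
conn.commit()

rows = cur.execute("select platName, platId from problemPlatDetail").fetchall()
print(rows)
conn.close()
```

Because the values travel separately from the SQL text, the same function works whether or not a field contains quotes.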
Throughout the whole process we construct requests and then parse the response to each one. JSON responses are parsed with the json library, and HTML pages are parsed with BeautifulSoup (for HTML pages with a complex structure I recommend lxml). The parsed results are stored in a MySQL database.
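The HTML branch is not shown in this post, so here is a rough sketch of the idea. The project itself used BeautifulSoup; this version deliberately swaps in the standard library's html.parser so it runs without extra packages. The HTML snippet and the class name CellCollector are invented for illustration and do not reflect Renrendai's real page structure.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            if self._row:  # header rows contain only <th>, so they stay empty
                self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()


# Invented sample page fragment
sample_html = """
<table>
  <tr><th>Loan</th><th>Rate</th></tr>
  <tr><td>Car loan</td><td>10.5%</td></tr>
  <tr><td>Consumer loan</td><td>9.0%</td></tr>
</table>
"""

parser = CellCollector()
parser.feed(sample_html)
print(parser.rows)
```

With BeautifulSoup or lxml the same extraction is a few selector calls, which is why those libraries are the better choice for complex pages.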
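Crawler code
Crawler code link (Note: the crawler code runs under both Python 2 and Python 3. I deployed it on an Alibaba Cloud server and ran it with Python 2.)
Data analysis
The data analysis mainly uses Python's numpy, pandas, and matplotlib, supplemented by Haizhi BDP (海致BDP).
Time series analysis
Data loading: the data is generally read into a pandas DataFrame for analysis.
Below is an example of reading the problem platform data.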
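    import pandas as pd

    # Problem platforms. parse_dates=True parses the index as dates,
    # so it only takes effect when a date column is used as the index (index_col).
    problemPlat = pd.read_csv("problemPlat.csv", parse_dates=True)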
Data structure
e.g. number of problem platforms over time
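    # Number of P2P platforms that ran into problems, counted per month.
    # resample(..., how="count") was removed from pandas; call .count() on the resampler instead.
    # pandas 2.2+ prefers the alias "ME" over "M" for month-end.
    problemPlat["id"]["2012":"2017"].resample("M").count().plot(title="P2P platforms with problems")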
Chart
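To show what the monthly resample-and-count actually computes, here is a dependency-free sketch of the same grouping. The dates are made up, and the function name monthly_counts is my own. One difference worth noting: pandas also emits the empty months in between with a count of 0, while this sketch only lists months that have at least one record.

```python
from collections import Counter
from datetime import datetime

# Made-up dates on which platforms ran into problems
problem_dates = ["2015-06-03", "2015-06-21", "2015-07-09",
                 "2016-01-15", "2016-01-30", "2016-01-31"]


def monthly_counts(date_strings, start_year, end_year):
    """Count records per (year, month) within an inclusive range of years."""
    counts = Counter()
    for text in date_strings:
        day = datetime.strptime(text, "%Y-%m-%d")
        if start_year <= day.year <= end_year:
            counts[(day.year, day.month)] += 1
    return sorted(counts.items())


result = monthly_counts(problem_dates, 2012, 2017)
for (year, month), count in result:
    print("%d-%02d: %d" % (year, month, count))
```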
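Done with Haizhi BDP (drawing map distributions in Python is fairly involved, and I had not learned it yet at the time).
Number of problem platforms by province
Platform transaction volume by province
Size distribution analysis
e.g. distribution of nationwide platform transaction volume in June
Code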
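    # Probability distribution of June transaction volume.
    # matplotlib removed hist(normed=...); density=True is the replacement.
    juneData["amount"].hist(density=True)
    juneData["amount"].plot(kind="kde", style="k--")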
Kernel density chart
Kernel density of the log of transaction volume
    import numpy as np

    # Probability distribution after taking the base-10 logarithm
    np.log10(juneData["amount"]).hist(density=True)
    np.log10(juneData["amount"]).plot(kind="kde", style="k--")
Chart
After taking the base-10 logarithm, the distribution looks much closer to a regular pyramid shape.
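One way to put a number on "closer to a pyramid shape" is to compare the skewness before and after the log transform. The sketch below is only an illustration under my own assumptions: the amounts are invented, and skewness here is the plain third standardized moment, not something the original analysis computed.

```python
import math


def skewness(values):
    """Third standardized moment: 0 for a symmetric distribution."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Invented transaction volumes spanning several orders of magnitude
amounts = [10, 20, 50, 100, 200, 500, 1000, 5000, 20000, 100000]
log_amounts = [math.log10(v) for v in amounts]

raw_skew = skewness(amounts)
log_skew = skewness(log_amounts)
print("raw skewness:", round(raw_skew, 2))
print("log10 skewness:", round(log_skew, 2))
```

A few very large platforms dominate the raw values, so the raw skewness is large, while the log-transformed values are far more symmetric.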
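    lujinData = platVolume[platVolume["wdzjPlatId"] == 59]
    # pd.rolling_corr was removed from pandas; use Series.rolling(...).corr(...) instead
    corr = lujinData["amount"].rolling(50, min_periods=50).corr(allPlatDayData["amount"])
    corr.plot(title="Rolling correlation between Lufax volume and all-platform volume")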
Chart
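For readers who want to see what a rolling correlation does under the hood, here is a small sketch in plain Python. It assumes the two series are already aligned by date (pandas does that alignment by index for you), the numbers are made up, and the function names are my own.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equally long lists; NaN if either is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


def rolling_corr(xs, ys, window):
    """Correlation over a sliding window; None until a full window is available."""
    out = []
    for end in range(1, len(xs) + 1):
        if end < window:
            out.append(None)
        else:
            out.append(pearson(xs[end - window:end], ys[end - window:end]))
    return out


# Made-up daily volumes for one platform and for the whole market
platform = [1.0, 2.0, 3.0, 4.0, 5.0, 4.0]
market = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]

corr = rolling_corr(platform, market, 3)
print(corr)
```

The leading None values correspond to min_periods: with a window of 50 days, the first 49 points of the real chart are empty for the same reason.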
Comparison of transaction volume: car-loan platforms vs. all platforms
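    import matplotlib.pyplot as plt

    # Daily totals for car-loan platforms, plotted next to the all-platform totals
    carFinanceDayData = carFinanceData.resample("D").sum()["amount"]
    fig, axes = plt.subplots(nrows=1, ncols=2, sharey=True, figsize=(14, 7))
    carFinanceDayData.plot(ax=axes[0], title="Car-loan platform transaction volume")
    allPlatDayData["amount"].plot(ax=axes[1], title="All P2P platform transaction volume")
Trend forecasting
e.g. forecasting the trend of Lufax (陆金所) transaction volume (done with the Facebook Prophet library)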
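    from prophet import Prophet  # older releases: from fbprophet import Prophet

    # .copy() avoids pandas' SettingWithCopyWarning when adding columns to a filtered frame
    lujinAmount = platVolume[platVolume["wdzjPlatId"] == 59].copy()
    # Prophet expects the columns "ds" (date) and "y" (value)
    lujinAmount["y"] = lujinAmount["amount"]
    lujinAmount["ds"] = lujinAmount["date"]
    m = Prophet(yearly_seasonality=True)
    m.fit(lujinAmount)
    future = m.make_future_dataframe(periods=365)  # history plus the next 365 days
    forecast = m.predict(future)
    m.plot(forecast)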
Trend forecast chart
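Prophet's input format trips up a lot of first-time users, so here is a tiny sketch of just the data preparation, written without Prophet or pandas. It mirrors two ideas from the code above: renaming the columns to "ds" and "y", and extending the history by a number of future days, which is roughly what make_future_dataframe does with its default settings. The rows and the helper names are made up.

```python
from datetime import date, timedelta

# Made-up daily transaction volumes
rows = [
    {"date": "2017-06-28", "amount": 120.0},
    {"date": "2017-06-29", "amount": 135.5},
    {"date": "2017-06-30", "amount": 128.0},
]


def to_prophet_rows(records):
    """Rename date -> ds and amount -> y, the column names Prophet expects."""
    return [{"ds": r["date"], "y": r["amount"]} for r in records]


def future_dates(history, periods):
    """Return the historical dates followed by `periods` additional daily dates."""
    last = date.fromisoformat(history[-1]["ds"])
    past = [r["ds"] for r in history]
    ahead = [(last + timedelta(days=k)).isoformat() for k in range(1, periods + 1)]
    return past + ahead


history = to_prophet_rows(rows)
dates = future_dates(history, 365)
print(history[0])
print(dates[:4], "...", dates[-1])
```

The forecast is then made for every date in that list, which is why the Prophet plot covers both the historical period and the year after it.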
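Data analysis code link (Note: the data analysis code only runs under Python 3.)
Sample output after running the code (the code and charts can be viewed without installing a Python environment)
This is the first project I wrote after moving from Java web development to data work, and it is also my first Python project. I did not run into many pitfalls along the way. Overall, the barrier to entry for crawlers, data analysis, and Python as a language is very low.
If you want to get started with Python crawlers, I recommend 《Python网络数据采集》 (Web Scraping with Python).
If you want to get started with Python data analysis, I recommend 《利用Python进行数据分析》 (Python for Data Analysis).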
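Copyright of this article belongs to the author. Please do not repost without permission. If this article violates any rules, you can contact the administrator to have it removed.
When reposting, please cite this article's address: http://m.hztianpu.com/yun/41378.html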