Scraping Baidu Baike Data with Python

loostudy, published 2019-07-25 11:26 / 1924 views


Preface

This article is based on the imooc course "Developing a Simple Crawler in Python" and records the whole process of crawling the pages related to the "python" entry on Baidu Baike.

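Crawling strategy

Define the target: decide which part of which pages on which site to scrape. This example scrapes the title and summary of the Baidu Baike python entry page and of the entry pages related to it.
Analyze the target: work out the format of the URLs to crawl, which limits the scope of the crawl; the format of the data to extract, which here means the tags that hold the title and the summary; and the page encoding, which the HTML parser must be told in order to parse the page correctly.
Write the code: the HTML parser relies on the results of the target analysis.
Run the crawler: perform the actual scraping.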

Analyzing the target

1. URL format
On the Baidu Baike python entry page, links to related entries follow a fairly uniform pattern: most are of the form /view/xxx.htm.
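
As a minimal sketch of how that URL pattern can be used to limit the crawl, the snippet below filters a list of hrefs with a regular expression and turns the matches into absolute URLs. The sample hrefs are made up for illustration; only the /view/xxx.htm shape and the root URL come from this article.

```python
import re
from urllib.parse import urljoin

# Matches relative entry links such as /view/21087.htm.
# \d+ means "one or more digits" and \. is a literal dot.
ENTRY_LINK = re.compile(r"^/view/\d+\.htm$")

page_url = "http://baike.baidu.com/view/21087.htm"
# Sample hrefs, invented for this example
hrefs = ["/view/10812319.htm", "/view/abc.htm", "/item/Python", "/view/592974.htm"]

# Keep only entry links and resolve them against the current page
entry_urls = [urljoin(page_url, href) for href in hrefs if ENTRY_LINK.match(href)]
print(entry_urls)
```

Anchoring the pattern with ^ and $ is a choice made here so that an href which merely contains /view/123.htm somewhere in the middle is not picked up.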

2. Data format
The title is in the h1 child tag under the element with class lemmaWgt-lemmaTitle-title, and the summary is under the element with class lemma-summary.
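
To make the two locations concrete, here is a rough sketch that pulls the title and summary out of a small HTML sample. It deliberately uses the standard library's html.parser instead of BeautifulSoup so that it runs without third-party packages; the class name LemmaExtractor and the sample markup are assumptions for illustration, not part of the course code.

```python
from html.parser import HTMLParser

class LemmaExtractor(HTMLParser):
    """Collects the h1 text under dd.lemmaWgt-lemmaTitle-title and the text of div.lemma-summary."""

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.summary_parts = []
        self._in_title_dd = False
        self._in_h1 = False
        self._summary_depth = 0  # greater than 0 while inside div.lemma-summary

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "dd" and "lemmaWgt-lemmaTitle-title" in classes:
            self._in_title_dd = True
        elif tag == "h1" and self._in_title_dd:
            self._in_h1 = True
        elif tag == "div":
            if self._summary_depth:            # a div nested inside the summary
                self._summary_depth += 1
            elif "lemma-summary" in classes:   # the summary container itself
                self._summary_depth = 1

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_h1 = False
        elif tag == "dd":
            self._in_title_dd = False
        elif tag == "div" and self._summary_depth:
            self._summary_depth -= 1

    def handle_data(self, data):
        if self._in_h1:
            self.title_parts.append(data)
        elif self._summary_depth:
            self.summary_parts.append(data)

# Sample markup, invented to match the structure described above
sample = """
<dd class="lemmaWgt-lemmaTitle-title"><h1>Python</h1></dd>
<div class="lemma-summary"><div class="para">Python is a programming language.</div></div>
"""
extractor = LemmaExtractor()
extractor.feed(sample)
title = "".join(extractor.title_parts).strip()
summary = "".join(extractor.summary_parts).strip()
print(title, "|", summary)
```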

3. Encoding
The page encoding is utf-8.
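
Downloaded pages arrive as bytes, so the encoding found here decides how they are turned into text. The following is a small sketch, not part of the course code: the helper name decode_page is hypothetical, and looking for charset= in the first 1024 bytes is a simplifying assumption about where a page declares its encoding.

```python
import re

def decode_page(raw, default="utf-8"):
    """Decode downloaded bytes, using a declared charset when one is found."""
    # Look for charset=... near the top of the page (simplified heuristic)
    match = re.search(rb"""charset=["']?([\w-]+)""", raw[:1024], re.IGNORECASE)
    encoding = match.group(1).decode("ascii") if match else default
    # errors="replace" keeps going if a few bytes are invalid
    return raw.decode(encoding, errors="replace")

raw = '<meta charset="utf-8"><h1>百度百科</h1>'.encode("utf-8")
text = decode_page(raw)
print(text)
```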

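The analysis gives the following results: entry links have the form /view/xxx.htm, the title is the h1 under class lemmaWgt-lemmaTitle-title, the summary is under class lemma-summary, and the page encoding is utf-8.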

Writing the code

Project structure

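In Sublime, create a folder named baike-spider as the project root.
Create spider_main.py as the crawler's main scheduler.
Create url_manager.py as the URL manager.
Create html_downloader.py as the HTML downloader.
Create html_parser.py as the HTML parser.
Create html_outputer.py as the tool that writes out the data.
The final project structure:

baike-spider/
    spider_main.py
    url_manager.py
    html_downloader.py
    html_parser.py
    html_outputer.py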

spider_main.py

# coding:utf-8
import url_manager, html_downloader, html_parser, html_outputer

class SpiderMain(object):
    def __init__(self):
        # One instance of each component: URL manager, downloader, parser, outputer
        self.urls = url_manager.UrlManager()
        self.downloader = html_downloader.HtmlDownloader()
        self.parser = html_parser.HtmlParser()
        self.outputer = html_outputer.HtmlOutputer()

    def craw(self, root_url):
        count = 1
        self.urls.add_new_url(root_url)
        while self.urls.has_new_url():
            try:
                new_url = self.urls.get_new_url()
                print("craw %d : %s" % (count, new_url))
                html_cont = self.downloader.download(new_url)
                new_urls, new_data = self.parser.parse(new_url, html_cont)
                self.urls.add_new_urls(new_urls)
                self.outputer.collect_data(new_data)

                # Stop once 10 pages have been crawled successfully
                if count == 10:
                    break

                count = count + 1
            except Exception as e:
                # Report the failure and carry on with the next URL
                print("craw failed: %s" % e)

        self.outputer.output_html()


if __name__=="__main__":
    root_url = "http://baike.baidu.com/view/21087.htm"
    obj_spider = SpiderMain()
    obj_spider.craw(root_url)
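
The scheduler's control flow is easier to see without the network involved. The sketch below replays the same loop against a hypothetical in-memory "site" (the FAKE_SITE dictionary and its contents are invented), so the download and parse steps collapse into a dictionary lookup.

```python
# Hypothetical in-memory pages: URL -> (title, linked URLs).
# They stand in for real downloads so the sketch runs offline.
FAKE_SITE = {
    "/view/1.htm": ("Python", ["/view/2.htm", "/view/3.htm"]),
    "/view/2.htm": ("Guido van Rossum", ["/view/1.htm"]),
    "/view/3.htm": ("Interpreter", ["/view/1.htm", "/view/2.htm"]),
}

def crawl(root_url, max_pages=10):
    """Crawl from root_url until no URLs are left or max_pages is reached."""
    new_urls, old_urls = {root_url}, set()
    collected = {}
    while new_urls and len(collected) < max_pages:
        url = new_urls.pop()              # arbitrary order, as with a set-based manager
        old_urls.add(url)
        title, links = FAKE_SITE[url]     # "download" and "parse" in one step
        collected[url] = title            # "collect" step
        # Queue only links that have not been crawled yet
        new_urls.update(link for link in links if link not in old_urls)
    return collected

pages = crawl("/view/1.htm")
print(sorted(pages.items()))
```

Because every crawled URL moves into old_urls, the cycle between the three pages does not cause an endless loop.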

url_manager.py

# coding:utf-8
class UrlManager(object):
    def __init__(self):
        self.new_urls = set()  # URLs waiting to be crawled
        self.old_urls = set()  # URLs already crawled

    def add_new_url(self, url):
        if url is None:
            return
        # Skip URLs that are already queued or already crawled
        if url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def add_new_urls(self, urls):
        if urls is None or len(urls) == 0:
            return
        for url in urls:
            self.add_new_url(url)

    def has_new_url(self):
        return len(self.new_urls) != 0

    def get_new_url(self):
        # set.pop() removes an arbitrary element, so the crawl order is not fixed
        new_url = self.new_urls.pop()
        self.old_urls.add(new_url)
        return new_url
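
Since a set hands URLs back in no particular order, two runs may visit different pages before the 10-page limit is hit. If a predictable, first-in-first-out order is wanted, one possible variant (the class name OrderedUrlManager is hypothetical, and this is not the course's implementation) pairs a deque with a set:

```python
from collections import deque

class OrderedUrlManager:
    """Variant of the URL manager that returns URLs in the order they were added."""

    def __init__(self):
        self._queue = deque()  # pending URLs, oldest first
        self._seen = set()     # every URL ever added, pending or crawled

    def add_new_url(self, url):
        if url is None or url in self._seen:
            return
        self._seen.add(url)
        self._queue.append(url)

    def add_new_urls(self, urls):
        for url in urls or ():
            self.add_new_url(url)

    def has_new_url(self):
        return bool(self._queue)

    def get_new_url(self):
        return self._queue.popleft()

manager = OrderedUrlManager()
manager.add_new_urls(["/view/1.htm", "/view/2.htm", "/view/1.htm"])
order = []
while manager.has_new_url():
    order.append(manager.get_new_url())
print(order)
```

The method names match those of UrlManager, so the scheduler would not need to change if this variant were swapped in.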

html_downloader.py

# coding:utf-8
import urllib.request

class HtmlDownloader(object):
    def download(self, url):
        if url is None:
            return None
        response = urllib.request.urlopen(url)
        # Treat anything other than HTTP 200 as a failed download
        if response.getcode() != 200:
            return None
        # Raw bytes; the parser takes care of decoding
        return response.read()
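
A downloader like this can hang on a slow server, and some sites respond differently to requests that carry no User-Agent header. The sketch below shows one way to add both a timeout and a header with urllib.request; the helper names and the User-Agent string are made-up examples, and whether Baidu Baike needs either is not something this article establishes.

```python
import urllib.request

def build_request(url, user_agent="Mozilla/5.0 (compatible; baike-spider)"):
    """Build a Request carrying a User-Agent header (the default value is an invented example)."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})

def download(url, timeout=10):
    """Return the response body as bytes, or None on a non-200 status."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        if response.getcode() != 200:
            return None
        return response.read()

# Building the request does not touch the network
request = build_request("http://baike.baidu.com/view/21087.htm")
print(request.get_header("User-agent"))
```

Note that urlopen raises an exception (for example urllib.error.HTTPError or urllib.error.URLError) for many failures rather than returning a non-200 response, so the caller still needs its try/except.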

html_parser.py

# coding:utf-8
from bs4 import BeautifulSoup
import re
from urllib.parse import urljoin

class HtmlParser(object):
    def _get_new_urls(self, page_url, soup):
        new_urls = set()
        # Entry links look like /view/123.htm
        links = soup.find_all("a", href=re.compile(r"/view/\d+\.htm"))
        for link in links:
            new_url = link["href"]
            new_full_url = urljoin(page_url, new_url)
            # print(new_full_url)
            new_urls.add(new_full_url)
        #print(new_urls)
        return new_urls

    def _get_new_data(self, page_url, soup):
        res_data = {}
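        # Record the URL the data came from
        res_data["url"] = page_url
        # Title: <dd class="lemmaWgt-lemmaTitle-title"><h1>Python</h1></dd>
        title_node = soup.find("dd", class_="lemmaWgt-lemmaTitle-title").find("h1")
        res_data["title"] = title_node.get_text()
        # Summary: <div class="lemma-summary">...</div>
        summary_node = soup.find("div", class_="lemma-summary")
        res_data["summary"] = summary_node.get_text()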
        # print(res_data)
        return res_data

    def parse(self, page_url, html_cont):
        if page_url is None or html_cont is None:
            # Return a pair so the caller's tuple unpacking still works
            return None, None
        soup = BeautifulSoup(html_cont, "html.parser")
        # print(soup.prettify())
        new_urls = self._get_new_urls(page_url, soup)
        new_data = self._get_new_data(page_url, soup)
        # print("mark")
        return new_urls, new_data

html_outputer.py

# coding:utf-8
class HtmlOutputer(object):
    def __init__(self):
        self.datas = []

    def collect_data(self, data):
        if data is None:
            return
        self.datas.append(data)

    def output_html(self):
        # Open with an explicit utf-8 encoding so every character can be written
        with open("output.html", "w", encoding="utf-8") as fout:
            fout.write("<html>")
            fout.write("<body>")
            fout.write("<table>")

            # One table row per crawled page
            for data in self.datas:
                fout.write("<tr>")
                fout.write("<td>%s</td>" % data["url"])
                fout.write("<td>%s</td>" % data["title"])
                fout.write("<td>%s</td>" % data["summary"])
                fout.write("</tr>")

            fout.write("</table>")
            fout.write("</body>")
            fout.write("</html>")
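
Scraped text can itself contain characters such as < and &, which would break the generated table if written as-is. As a hedged sketch (the function name render_table and the sample row are invented here), the output step could build the page as a string and escape each value with html.escape from the standard library:

```python
from html import escape

def render_table(rows):
    """Return an HTML document with one table row per crawled entry."""
    parts = ["<html><head><meta charset='utf-8'></head><body><table>"]
    for row in rows:
        # Escape each value so markup characters in the data are shown literally
        cells = "".join("<td>%s</td>" % escape(row[key]) for key in ("url", "title", "summary"))
        parts.append("<tr>%s</tr>" % cells)
    parts.append("</table></body></html>")
    return "\n".join(parts)

# Sample row, made up for illustration
rows = [{"url": "http://baike.baidu.com/view/21087.htm", "title": "Python", "summary": "a < b & c"}]
html_text = render_table(rows)
print(html_text)
```

The meta charset tag is an addition in this sketch; it tells the browser to read the file as utf-8, matching the encoding the file is written in.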

Running

From the command line, run python spider_main.py.

Encoding issues

Symptom: UnicodeEncodeError: 'gbk' codec can't encode character '\xa0' in position ...

You are likely to hit this error when writing files in Python, and in particular when writing data fetched from the network to a local file. Many articles online discuss it, but nearly all of them come down to calling encode and decode. Is that the real cause? No. Often you can try decode and encode with every encoding you can think of (utf8, utf-8, gbk, gb2312 and so on) and still get the same error, which is maddening.

Encoding problems are especially common when writing Python scripts on Windows. Writing network data to a file involves several encodings:
1. # encoding=XXX
This declaration, on the first line of the Python file, names the encoding of the script file itself and has no bearing on this error. XXX only needs to match the encoding the file is actually saved in.
For example, Notepad++ lets you choose an encoding in its "Format" menu; the encoding chosen there must match the declared XXX, otherwise an error is raised.

2. The encoding of the network data
When fetching a web page, the network data is encoded in the page's encoding. It has to be decoded with decode into Unicode text (a str in Python 3).

3. The encoding of the target file
The network data is written to a new file with code like this:

fout = open("output.html","w")
fout.write(text)

On a Chinese-locale Windows system, a file opened this way defaults to gbk, so Python tries to encode text with gbk as it writes. But text is decoded Unicode and can contain characters, such as '\xa0', that gbk cannot represent, which raises the error above. The fix is to set the target file's encoding explicitly:

fout = open("output.html","w", encoding="utf-8")
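
The difference between the two encodings can be reproduced on any platform without touching the locale. This small sketch encodes a string containing '\xa0' (a non-breaking space, common in web pages) with gbk and with utf-8, then writes it to a temporary file; the sample string and the temporary path are made up for the demonstration.

```python
import os
import tempfile

text = "Python\xa0entry"  # \xa0 is a non-breaking space

# gbk has no mapping for \xa0, so encoding fails
try:
    text.encode("gbk")
    gbk_ok = True
except UnicodeEncodeError as error:
    gbk_ok = False
    print("gbk failed:", error.reason)

# utf-8 can represent every Unicode character
utf8_bytes = text.encode("utf-8")

# Writing with an explicit encoding works regardless of the system default
path = os.path.join(tempfile.mkdtemp(), "output.html")
with open(path, "w", encoding="utf-8") as fout:
    fout.write(text)
with open(path, encoding="utf-8") as fin:
    round_trip = fin.read()
print(round_trip == text)
```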

Results

Running the crawler writes output.html, a table with one row (URL, title, summary) per crawled entry.

Source code

https://github.com/voidking/b...

Bookmarks

Developing a Simple Crawler in Python (imooc course)
http://www.imooc.com/learn/563

The Python Standard Library
https://docs.python.org/3/lib...

Beautiful Soup 4.2.0 documentation
https://www.crummy.com/softwa...

Python entry
http://baike.baidu.com/view/2...
http://baike.baidu.com/item/P...

Python 3.x crawler tutorial: crawling pages, crawling images, automatic login
http://www.2cto.com/kf/201507...

Elegant crawling with Python 3 (part 1): crawling images
http://www.jianshu.com/p/6969...

Python UnicodeEncodeError: 'gbk' codec can't encode character: how to fix it
http://www.jb51.net/article/6...

Scrapy documentation
https://doc.scrapy.org/en/lat...
