Python Web Scraping in Practice (1): Using requests and BeautifulSoup

Posted by jokester on 2019-07-30 15:10 / 1917 views

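Python basics

My earlier "Python 3 Minimal Tutorial" PDF (Python 3 極簡教程.pdf) is a quick start for readers who already have some programming background. After working through that series you should be able to write an API on your own and build small projects without trouble.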

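requests

requests is an HTTP library for Python, roughly what Retrofit is on Android. Its features include Keep-Alive and connection pooling, persistent cookies, automatic content decompression, HTTP proxies, SSL verification, connection timeouts, and sessions, among many others. At the time of writing it is compatible with both Python 2 and Python 3. GitHub: https://github.com/requests/r... .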

Installation

Mac:

pip3 install requests

Windows:

pip install requests

Sending requests

The most commonly used HTTP request methods are GET, POST, PUT, and DELETE, and requests provides a function for each.

import requests

# GET request
response = requests.get("http://127.0.0.1:1024/developer/api/v1.0/all")

# POST request
response = requests.post("http://127.0.0.1:1024/developer/api/v1.0/insert")

# PUT request
response = requests.put("http://127.0.0.1:1024/developer/api/v1.0/update")

# DELETE request
response = requests.delete("http://127.0.0.1:1024/developer/api/v1.0/delete")

Each call returns a Response object, which wraps the data the server sends back in its HTTP response. Its main elements include the status code, reason phrase, response headers, response URL, response encoding, and response body.

# Status code
print(response.status_code)

# Response URL
print(response.url)

# Reason phrase
print(response.reason)

# Response body parsed as JSON
print(response.json())
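
As a small aside on the reason phrase: it is the standard text that normally accompanies a status code, although a server is free to send something else. The sketch below uses only the standard library's http.HTTPStatus, not requests itself, to show the phrases for a few common codes. The helper name reason_for is made up for this illustration.

```python
from http import HTTPStatus


def reason_for(status_code):
    """Return the standard reason phrase for an HTTP status code.

    Hypothetical helper for illustration; unknown codes give an empty string.
    """
    try:
        return HTTPStatus(status_code).phrase
    except ValueError:
        return ""


for code in (200, 404, 500):
    print(code, reason_for(code))
```

In practice it is usually enough to check response.status_code before calling response.json(), since an error page is often not valid JSON.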

Custom request headers

To add HTTP headers to a request, pass a dict to the headers keyword argument.

header = {"Application-Id": "19869a66c6",
          "Content-Type": "application/json"
          }
response = requests.get("http://127.0.0.1:1024/developer/api/v1.0/all/", headers=header)

Building query parameters

To pass data in the URL's query string, for example http://127.0.0.1:1024/developer/api/v1.0/all?key1=value1&key2=value2 , Requests lets you supply the parameters as a dict of strings through the params keyword argument.

payload = {"key1": "value1", "key2": "value2"}
response = requests.get("http://127.0.0.1:1024/developer/api/v1.0/all", params=payload)

A list can also be passed as a value:

payload = {"key1": "value1", "key2": ["value2", "value3"]}
response = requests.get("http://127.0.0.1:1024/developer/api/v1.0/all", params=payload)

# Response URL
print(response.url)  # prints: http://127.0.0.1:1024/developer/api/v1.0/all?key1=value1&key2=value2&key2=value3
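
To see how a dict with a list value turns into a query string without sending anything, here is a minimal sketch with the standard library's urllib.parse.urlencode. It is a stand-in for illustration, on the assumption that requests encodes params in a comparable way. The URL is the same local placeholder used above.

```python
from urllib.parse import urlencode

base_url = "http://127.0.0.1:1024/developer/api/v1.0/all"
payload = {"key1": "value1", "key2": ["value2", "value3"]}

# doseq=True expands a list value into one key=value pair per item.
query = urlencode(payload, doseq=True)
full_url = base_url + "?" + query

print(full_url)
```

The list under key2 becomes two separate key2=... pairs, which matches the response URL printed above.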

Sending data with POST

If the server expects form data, use the data keyword argument.

payload = {"key1": "value1", "key2": "value2"}
response = requests.post("http://127.0.0.1:1024/developer/api/v1.0/insert", data=payload)

If the server expects a JSON string instead, use the json keyword argument. The value can be passed as a dict.

obj = {
    "article_title": "The Death of a Government Clerk 2"
}
response = requests.post("http://127.0.0.1:1024/developer/api/v1.0/insert", json=obj)
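
The difference between the two kinds of request body is easier to see side by side. This sketch builds both bodies with the standard library only (urllib.parse.urlencode and json.dumps), as an approximation of what ends up on the wire rather than a description of how requests is implemented. The payload values are placeholders.

```python
import json
from urllib.parse import urlencode

form_payload = {"key1": "value1", "key2": "value2"}
json_payload = {"article_title": "example title"}

# Form data: key=value pairs joined by "&",
# normally sent with Content-Type: application/x-www-form-urlencoded.
form_body = urlencode(form_payload)

# JSON data: a serialized JSON document,
# normally sent with Content-Type: application/json.
json_body = json.dumps(json_payload)

print(form_body)
print(json_body)
```

Which one to use depends entirely on what the server-side endpoint parses.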

Response content

Requests automatically decodes content from the server. Most Unicode charsets are decoded seamlessly. After a request is sent, Requests makes an educated guess about the response's encoding based on the HTTP headers.

# Response content
# Body as str (text is a property, not a method)
# print(response.text)
# Body parsed as JSON
print(response.json())
# Body as bytes (content is also a property)
# print(response.content)
# Raw response stream; requires stream=True on the original request
# response = requests.get("http://127.0.0.1:1024/developer/api/v1.0/all", stream=True)
# print(response.raw.read(10))
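
The relationship between the byte content and the decoded text comes down to which charset is used for decoding. The following standard-library sketch, with a made-up sample string, shows what happens when the guess is right and when it is wrong.

```python
original = "每日一文"

# What travels over the network is bytes (what response.content holds).
raw_bytes = original.encode("utf-8")

# Decoding with the right charset recovers the text.
decoded_ok = raw_bytes.decode("utf-8")

# Decoding with the wrong charset does not fail here, it just yields mojibake.
decoded_wrong = raw_bytes.decode("latin-1")

print(decoded_ok == original)
print(decoded_wrong == original)
```

If response.text comes out garbled, setting response.encoding to the correct charset before reading response.text is the usual remedy.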

Timeouts

Unless you explicitly set a timeout, requests does not time out on its own. If the server never responds, the whole program stays blocked and cannot get on with anything else.

response = requests.get("http://127.0.0.1:1024/developer/api/v1.0/all", timeout=5)  # in seconds
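
What a timeout buys you can be shown at the socket level with the standard library alone. In this sketch a listening socket that never answers stands in for a hung server, and the helper read_with_timeout (a hypothetical name) gives up after half a second instead of blocking forever. With requests, the comparable situation raises requests.exceptions.Timeout, which you would catch in the same way.

```python
import socket


def read_with_timeout(host, port, timeout):
    """Connect, then wait at most `timeout` seconds for the first bytes.

    Returns the bytes received, or None if the server stayed silent.
    """
    with socket.create_connection((host, port), timeout=timeout) as conn:
        try:
            return conn.recv(1024)
        except socket.timeout:
            return None


# A listening socket that never sends anything stands in for a hung server.
server = socket.socket()
server.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
server.listen(1)
host, port = server.getsockname()

result = read_with_timeout(host, port, timeout=0.5)
server.close()

print(result)
```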

Proxy settings

If you hit a site frequently, the server may well block you. requests has built-in support for proxies.

# Proxies
proxies = {
    "http": "http://127.0.0.1:1024",
    "https": "http://127.0.0.1:4000",
}
response = requests.get("http://127.0.0.1:1024/developer/api/v1.0/all", proxies=proxies)

BeautifulSoup

BeautifulSoup is an HTML parsing library for Python, comparable to jsoup in Java.

Installation

BeautifulSoup 3 is no longer under development, so use BeautifulSoup 4 directly.

Mac:

pip3 install beautifulsoup4

Windows:

pip install beautifulsoup4

Installing a parser

I use html5lib, which is implemented in pure Python.

Mac:

pip3 install html5lib

Windows:

pip install html5lib

Basic usage

BeautifulSoup turns a complex HTML document into a tree structure in which every node is a Python object.

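Parsing

from bs4 import BeautifulSoup

html_doc = """
<html>
<head>
<title>WuXiaolong</title>
</head>
<body>
<p>Sharing Android technology, and also following Python and other popular technologies.</p>
<p>Why I blog: to sum up experience and record my own growth.</p>
<p>You have to work hard enough to make it look effortless! Focus! Refinement!</p>

<a href="http://wuxiaolong.me/">WuXiaolong's blog</a>
<p class="WeChat">WeChat official account: 吳小龍同學</p>
<p>GitHub</p>

</body>
</html>
"""
soup = BeautifulSoup(html_doc, "html5lib")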

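Tags

tag = soup.head
print(tag)  # the whole <head> element, including <title>WuXiaolong</title>
print(tag.name)  # head
print(tag.title)  # <title>WuXiaolong</title>
print(soup.p)  # the first <p> element: the line about sharing Android technology
print(soup.a["href"])  # the href attribute of the first a tag: http://wuxiaolong.me/

Note: when several tags match, only the first is returned, as with the p tag here.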

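Searching

print(soup.find("p"))  # the first <p> element: the line about sharing Android technology

find also returns only the first matching tag, and returns None when nothing matches. To look up a specific element, such as the WeChat official account here, specify an attribute value such as the tag's class:

# class is a reserved word in Python, so the keyword argument is class_.
print(soup.find("p", class_="WeChat"))
# the <p> element for the WeChat official account

Find all p tags:

for p in soup.find_all("p"):
    print(p.string)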
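
The difference between "first match" and "all matches" can also be illustrated without BeautifulSoup. The sketch below is a standard-library stand-in built on html.parser.HTMLParser, not BeautifulSoup's API. The class name ParagraphCollector and the sample markup are made up for this example.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text of every <p> element, in document order."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._inside_p = False

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._inside_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self._inside_p = False

    def handle_data(self, data):
        if self._inside_p:
            self.paragraphs[-1] += data


html_doc = "<body><p>first</p><p class='WeChat'>second</p><div>not a paragraph</div></body>"
collector = ParagraphCollector()
collector.feed(html_doc)

# Every match, comparable to find_all("p").
all_paragraphs = collector.paragraphs
# Only the first match, or None when there is none, comparable to find("p").
first_paragraph = all_paragraphs[0] if all_paragraphs else None

print(first_paragraph)
print(all_paragraphs)
```

Because a lookup can come back as None, it is worth checking the result before reading attributes from it.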

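Hands-on project

A while ago, users reported that my personal app had stopped working. I no longer maintain the app, but I should at least keep it running. Most people know its data is scraped (see "手把手教你做個人app", a step-by-step guide to building a personal app). One advantage of scraped data is that I don't have to manage the data myself. The downside is that when the other site goes down or its HTML nodes change, my parsing fails and the app has no data. After this report I wondered whether I should simply crawl the site's data and store it myself. I was teaching myself Python anyway, so it would be good practice. My first plan was to scrape with Python, insert into a local MySQL database, write my own API with Flask, call it from Android with Retrofit, and then push the data into bmob with the bmob SDK... That was too laborious and didn't seem workable. Then I learned that bmob provides a RESTful API, which solved the big problem: the Python scraper can insert the data directly. What I demonstrate here inserts into a local database. With bmob, you would call its RESTful API to insert the data instead.

Choosing a site

The demo site I picked is https://meiriyiwen.com/random . As you can see, every request returns a different article, which is exactly what I take advantage of: I request it on a schedule, parse out the data I need, and insert it into the database.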

Creating the database

I created it directly with Navicat Premium. The command line works too.

Creating the table

The article table is created with pymysql. It needs the fields id, article_title, article_author, and article_content. The code is below and only needs to be run once.

import pymysql


def create_table():
    # Open the connection
    db = pymysql.connect(host="localhost",
                         user="root",
                         password="root",
                         db="python3learn")
    # SQL statement that creates the article table
    sql = """create table if not exists article (
    id int NOT NULL AUTO_INCREMENT,
    article_title text,
    article_author text,
    article_content text,
    PRIMARY KEY (`id`)
    )"""
    # Create a cursor object with the cursor() method
    cursor = db.cursor()
    try:
        # Execute the SQL statement
        cursor.execute(sql)
        # Commit the transaction
        db.commit()
        print("create table success")
    except Exception as e:  # Roll back if an error occurs
        db.rollback()
        print(e)

    finally:
        # Close the cursor
        cursor.close()
        # Close the database connection
        db.close()


if __name__ == "__main__":
    create_table()

Parsing the site

First request the site with requests, then use BeautifulSoup to parse out the nodes you need.

import requests
from bs4 import BeautifulSoup


def get_html_data():
    # GET request; the timeout keeps a hung server from blocking the scraper
    response = requests.get("https://meiriyiwen.com/random", timeout=10)

    soup = BeautifulSoup(response.content, "html5lib")
    article = soup.find("div", id="article_show")
    article_title = article.h1.string
    print("article_title=%s" % article_title)
    article_author = article.find("p", class_="article_author").string
    print("article_author=%s" % article_author)
    article_contents = article.find("div", class_="article_text").find_all("p")
    # Concatenate the <p> tags, markup included, into one string
    article_content = "".join(str(content) for content in article_contents)
    print("article_content=%s" % article_content)

Inserting into the database

There is a filter here: assuming article titles on this site are unique, an article is not inserted if one with the same title already exists.

import pymysql


def insert_table(article_title, article_author, article_content):
    # Open the connection
    db = pymysql.connect(host="localhost",
                         user="root",
                         password="root",
                         db="python3learn",
                         charset="utf8")
    # Query for an existing title, and the insert statement
    query_sql = "select * from article where article_title=%s"
    sql = "insert into article (article_title,article_author,article_content) values (%s, %s, %s)"
    # Create a cursor object with the cursor() method
    cursor = db.cursor()
    try:
        query_value = (article_title,)
        # Execute the query
        cursor.execute(query_sql, query_value)
        results = cursor.fetchall()
        if len(results) == 0:
            value = (article_title, article_author, article_content)
            cursor.execute(sql, value)
            # Commit the transaction
            db.commit()
            print("--------------《%s》 insert table success-------------" % article_title)
            return True
        else:
            print("--------------《%s》 already exists-------------" % article_title)
            return False

    except Exception as e:  # Roll back if an error occurs
        db.rollback()
        print(e)
        return False

    finally:  # Close the cursor
        cursor.close()
        # Close the database connection
        db.close()
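
The check-then-insert logic can be tried out without a MySQL server. The sketch below swaps in the standard library's sqlite3 with an in-memory database, so it is not the pymysql code above: the placeholder style is ? rather than %s, and the function name insert_if_new and the sample values are made up for illustration.

```python
import sqlite3


def insert_if_new(conn, title, author, content):
    """Insert the article unless a row with the same title already exists.

    Returns True when a row was inserted, False when the title was a duplicate.
    """
    exists = conn.execute(
        "SELECT 1 FROM article WHERE article_title = ?", (title,)
    ).fetchone()
    if exists is not None:
        return False
    conn.execute(
        "INSERT INTO article (article_title, article_author, article_content) "
        "VALUES (?, ?, ?)",
        (title, author, content),
    )
    conn.commit()
    return True


# In-memory database so the sketch leaves nothing behind.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE IF NOT EXISTS article (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    article_title TEXT,
    article_author TEXT,
    article_content TEXT
    )"""
)

first = insert_if_new(conn, "Sample title", "Sample author", "<p>Sample body</p>")
second = insert_if_new(conn, "Sample title", "Sample author", "<p>Sample body</p>")
row_count = conn.execute("SELECT COUNT(*) FROM article").fetchone()[0]

print(first, second, row_count)
```

Passing values as query parameters, as both versions do, also keeps scraped text from being interpreted as SQL.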

Scheduling

I set up a timer so the scraper runs again every so often.

import sched
import time


# Initialize the scheduler class from the sched module.
# The first argument is a function that returns a timestamp; the second is a function that blocks until the scheduled time arrives.
schedule = sched.scheduler(time.time, time.sleep)


# Function triggered by the periodic schedule
def print_time(inc):
    # to do something
    print("to do something")
    schedule.enter(inc, 0, print_time, (inc,))


# Default interval: 60 s
def start(inc=60):
    # enter takes four arguments: the delay, the priority (used to order events that fall due at the same time),
    # the function to call, and the arguments for that function (as a tuple)
    schedule.enter(0, 0, print_time, (inc,))
    schedule.run()


if __name__ == "__main__":
    # Print once every 5 s
    start(5)
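
The re-scheduling pattern above can be checked quickly by giving sched a fake clock, so no real waiting happens. This is a sketch under stated assumptions: FakeClock and run_periodically are hypothetical names, the run is capped at max_runs so it terminates, and a failing job is reported rather than allowed to stop the schedule.

```python
import sched


class FakeClock:
    """A manually advanced clock so the sketch finishes instantly."""

    def __init__(self):
        self.now = 0.0

    def time(self):
        return self.now

    def sleep(self, seconds):
        # Instead of really sleeping, jump the clock forward.
        self.now += seconds


def run_periodically(interval, job, max_runs):
    """Run job every `interval` seconds, at most max_runs times.

    Returns the (fake) timestamps at which the job ran.
    """
    clock = FakeClock()
    scheduler = sched.scheduler(clock.time, clock.sleep)
    run_times = []

    def tick():
        run_times.append(clock.time())
        try:
            job()
        except Exception as error:
            # A failed run is reported but does not stop the schedule.
            print("job failed:", error)
        if len(run_times) < max_runs:
            scheduler.enter(interval, 0, tick)

    scheduler.enter(0, 0, tick)
    scheduler.run()
    return run_times


calls = []
times = run_periodically(5, lambda: calls.append("scraped"), max_runs=3)
print(times)
```

Swapping FakeClock for time.time and time.sleep gives the real-time behavior of the code above.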

Complete code

import pymysql
import requests
from bs4 import BeautifulSoup
import sched
import time


def create_table():
    # Open the connection
    db = pymysql.connect(host="localhost",
                         user="root",
                         password="root",
                         db="python3learn")
    # SQL statement that creates the article table
    sql = """create table if not exists article (
    id int NOT NULL AUTO_INCREMENT,
    article_title text,
    article_author text,
    article_content text,
    PRIMARY KEY (`id`)
    )"""
    # Create a cursor object with the cursor() method
    cursor = db.cursor()
    try:
        # Execute the SQL statement
        cursor.execute(sql)
        # Commit the transaction
        db.commit()
        print("create table success")
    except Exception as e:  # Roll back if an error occurs
        db.rollback()
        print(e)

    finally:
        # Close the cursor
        cursor.close()
        # Close the database connection
        db.close()


def insert_table(article_title, article_author, article_content):
    # Open the connection
    db = pymysql.connect(host="localhost",
                         user="root",
                         password="root",
                         db="python3learn",
                         charset="utf8")
    # Query for an existing title, and the insert statement
    query_sql = "select * from article where article_title=%s"
    sql = "insert into article (article_title,article_author,article_content) values (%s, %s, %s)"
    # Create a cursor object with the cursor() method
    cursor = db.cursor()
    try:
        query_value = (article_title,)
        # Execute the query
        cursor.execute(query_sql, query_value)
        results = cursor.fetchall()
        if len(results) == 0:
            value = (article_title, article_author, article_content)
            cursor.execute(sql, value)
            # Commit the transaction
            db.commit()
            print("--------------《%s》 insert table success-------------" % article_title)
            return True
        else:
            print("--------------《%s》 already exists-------------" % article_title)
            return False

    except Exception as e:  # Roll back if an error occurs
        db.rollback()
        print(e)
        return False

    finally:  # Close the cursor
        cursor.close()
        # Close the database connection
        db.close()


def get_html_data():
    # GET request; the timeout keeps a hung server from blocking the scraper
    response = requests.get("https://meiriyiwen.com/random", timeout=10)

    soup = BeautifulSoup(response.content, "html5lib")
    article = soup.find("div", id="article_show")
    article_title = article.h1.string
    print("article_title=%s" % article_title)
    article_author = article.find("p", class_="article_author").string
    print("article_author=%s" % article_author)
    article_contents = article.find("div", class_="article_text").find_all("p")
    # Concatenate the <p> tags, markup included, into one string
    article_content = "".join(str(content) for content in article_contents)
    print("article_content=%s" % article_content)

    # Insert into the database
    insert_table(article_title, article_author, article_content)


# Initialize the scheduler class from the sched module.
# The first argument is a function that returns a timestamp; the second is a function that blocks until the scheduled time arrives.
schedule = sched.scheduler(time.time, time.sleep)


# Function triggered by the periodic schedule
def print_time(inc):
    try:
        get_html_data()
    except Exception as e:  # Keep the schedule alive if one run fails
        print(e)
    schedule.enter(inc, 0, print_time, (inc,))


# Default interval: 60 s
def start(inc=60):
    # enter takes four arguments: the delay, the priority (used to order events that fall due at the same time),
    # the function to call, and the arguments for that function (as a tuple)
    schedule.enter(0, 0, print_time, (inc,))
    schedule.run()


if __name__ == "__main__":
    # Scrape once every 5 minutes
    start(60*5)

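Open question: this only scrapes a single article. What about an article list where clicking an item opens the article's detail page? Presumably you fetch the list first, then loop through it, parsing each detail page and inserting it into the database. I haven't worked out the best way to do this yet, so I'll leave it for a later post.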
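
One possible approach to the list-then-detail question, offered as a rough sketch rather than a worked-out answer: collect the detail-page links from the list page first, then process them one by one. The example below uses only the standard library (html.parser and urllib.parse.urljoin) on a hard-coded HTML string. The class name LinkCollector, the example.com URL, and the markup are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute, de-duplicated URLs from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the list page's own URL.
        absolute = urljoin(self.base_url, href)
        if absolute not in self.links:
            self.links.append(absolute)


# Hypothetical list page; a real one would come from response.text.
list_page_url = "https://example.com/articles/"
list_page_html = """
<ul>
  <li><a href="/article/1">First</a></li>
  <li><a href="/article/2">Second</a></li>
  <li><a href="/article/1">First again</a></li>
</ul>
"""

collector = LinkCollector(list_page_url)
collector.feed(list_page_html)
detail_urls = collector.links

for url in detail_urls:
    # In the real scraper, each URL would be fetched and parsed the same way
    # get_html_data() handles a single article.
    print(url)
```

With BeautifulSoup, the same link collection could be done with find_all("a") and reading each tag's href attribute.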

Final words

Python is purely a hobby for me, but what I learn still needs to be put to use or I will soon forget it. Look out for my next Python post.

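Copyright belongs to the author. Please do not repost without permission. If this article violates any rules, you can contact the administrator to have it removed.

When reposting, please cite this article's address: http://m.hztianpu.com/yun/41082.html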
