Koa2 + Mongo + Crawler: Building a Novel-Reading WeChat Mini Program (Local Development Part)

Kross, published 2019-08-23 14:15 / 1279 reads


Preface: this project is an adaptation of what I learned from the imooc course on building the front end and back end of a movie WeChat official account with Koa2.

I read novels on my commute, but the reading sites are full of ads and charge fees, so I built my own reader based on the course and published it as my own WeChat Mini Program.

github

The overall approach:
1. Connect to the database
2. Run scheduled tasks to update the database
3. Start the API service
4. Call the API from the WeChat Mini Program

1. Connect to the database

Connect to the local MongoDB database:

const mongoose = require("mongoose")
const db = "mongodb://localhost/story-bookShelf"
const MAX_CONNECT_TIMES = 5

exports.connect = () => {
  let connectTimes = 0

  return new Promise((resolve, reject) => {
    if (process.env.NODE_ENV !== "production") {
      mongoose.set("debug", false)
    }

    // Reconnect after a failure, and give up after too many attempts
    const retry = () => {
      connectTimes++

      if (connectTimes < MAX_CONNECT_TIMES) {
        mongoose.connect(db)
      } else {
        // Reject instead of throwing: an error thrown inside an event
        // handler would never reach the caller of connect()
        reject(new Error("The database seems to be down, go fix it"))
      }
    }

    mongoose.connect(db)

    mongoose.connection.on("disconnected", retry)

    mongoose.connection.on("error", err => {
      console.log(err)
      retry()
    })

    mongoose.connection.once("open", () => {
      resolve()
      console.log("MongoDB Connected successfully!")
    })
  })
}
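The "try up to five times, then give up" idea in the listing above can also be written as a generic promise helper. This is only a sketch: `withRetry`, `maxAttempts`, and `delayMs` are hypothetical names that do not appear in the original project.

```javascript
// Hypothetical helper (not part of the original project): run an async
// operation and retry it until it succeeds or maxAttempts is reached.
async function withRetry(operation, maxAttempts = 5, delayMs = 0) {
  let lastError
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await operation(attempt)
    } catch (err) {
      lastError = err
      // Optionally wait before the next attempt
      if (attempt < maxAttempts && delayMs > 0) {
        await new Promise(resolve => setTimeout(resolve, delayMs))
      }
    }
  }
  // Every attempt failed: surface the last error to the caller
  throw lastError
}
```

Assuming a Mongoose version whose `mongoose.connect` returns a promise, it could be used as `withRetry(() => mongoose.connect(db), 5, 1000)`.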

Then initialize the predefined Schemas:

const mongoose = require("mongoose")
const Schema = mongoose.Schema

const bookSchema = new Schema({
  name: {
    type: String
  },
  bookId: {
    unique: true,
    type: Number
  }
})
......
mongoose.model("Book", bookSchema)

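2. Run scheduled tasks to update the database

This step periodically updates the novel chapters stored in the database, using node-schedule to run the tasks on a schedule.

First check whether the novel's chapter count has increased; if it has not, there is nothing to crawl. When crawling, start 5 chapters before the last stored one, because some authors post placeholder previews to hold a slot and fill in the real text later.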
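The rule in the paragraph above can be sketched as a small pure function. The name `computeStartNum` and its parameters are assumptions made for illustration; the original project may compute this differently.

```javascript
// Hypothetical helper: decide where a crawl should start.
// storedCount: chapters already stored in the database
// remoteCount: chapters currently listed on the target site
// lookback:    how many stored chapters to re-crawl, in case they
//              were author placeholders (5 in the article)
function computeStartNum(storedCount, remoteCount, lookback = 5) {
  // No new chapters: nothing to crawl
  if (remoteCount <= storedCount) return null
  // Step back a few chapters, but never below the first one
  return Math.max(0, storedCount - lookback)
}
```

With 100 stored chapters and 103 on the site, the crawl would start at index 95; with equal counts the function returns `null` and the task can be skipped.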

Each novel gets its own child process (child_process) that runs the crawl and stores the data in Mongo. The child process handle is stored as well, because it is needed later.

A scheduled run can start while the previous one is still running, so before every run the stored child processes are killed and the store is cleared.
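The article does not show `lib/child_process_store`, so the following is only a guess at what such a store could look like. The `set` method matches the call used in the article's code; `killAll` and the internal `Map` are assumptions.

```javascript
// Hypothetical sketch of a child process store (the real module is not shown).
class ChildProcessStore {
  constructor() {
    this.store = new Map() // key -> array of child processes
  }

  // Remember a child process under a key such as "chapter"
  set(key, child) {
    const list = this.store.get(key) || []
    list.push(child)
    this.store.set(key, list)
  }

  // Kill every child stored under the key, then forget them.
  // Returns how many children were stored.
  killAll(key) {
    const list = this.store.get(key) || []
    for (const child of list) {
      if (!child.killed) child.kill()
    }
    this.store.delete(key)
    return list.length
  }
}

const childProcessStore = new ChildProcessStore()
```

A scheduled job would call `childProcessStore.killAll("chapter")` before forking new crawlers, so leftovers from the previous run do not pile up.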

The chapter task:

// chapter.js

const cp = require("child_process")
const { resolve } = require("path")
const mongoose = require("mongoose")
const { childProcessStore } = require("../lib/child_process_store") // global store of child processes

/**
 * @param {number} bookId   ID of the book
 * @param {number} startNum index of the chapter to start crawling from
 */
exports.taskChapter = async (bookId, startNum = 0) => {
  const Chapter = mongoose.model("Chapter")

  const script = resolve(__dirname, "../crawler/chapter.js") // module that actually runs the crawler
  const child = cp.fork(script, []) // fork opens an IPC channel for passing data

  // Wait for the child process to send data back, then store it in Mongo
  // (the crawling itself is in the next listing)
  child.on("message", async data => {
    // Check whether this chapter is already stored
    let chapterData = await Chapter.findOne({
      chapterId: data.chapterId
    })

    // Not stored yet: save it as a new chapter
    if (!chapterData) {
      chapterData = new Chapter(data)
      await chapterData.save()
      return
    }

    // Already stored: compare with the stored text to catch author placeholders.
    // Update only when the new text is more than 50 characters longer.
    if ((data.content.length - 50 >= 0) && (data.content.length - 50 > chapterData.content.length)) {
      // await is required: a Mongoose query does not run until it is awaited or executed
      await Chapter.updateOne(
        { chapterId: data.chapterId },
        { content: data.content }
      )
    }
  })

  child.send({ // tell the child process what to crawl
    bookId, // which novel
    startNum // which chapter to start from
  })
  // Store the chapter crawler so it can be killed before the next scheduled run
  childProcessStore.set("chapter", child)
}
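The 50-character comparison in the listing above is easier to read and test when pulled out into a function. The name `shouldReplaceContent` is hypothetical; the condition itself is the one used in the listing.

```javascript
// Hypothetical helper: should a re-crawled chapter overwrite the stored one?
// It returns true only when the incoming text is at least minGain characters
// long and exceeds the stored text by more than minGain characters.
function shouldReplaceContent(storedContent, incomingContent, minGain = 50) {
  const gain = incomingContent.length - storedContent.length
  return incomingContent.length >= minGain && gain > minGain
}
```

A placeholder of 100 characters is replaced by a 151-character chapter, but not by a 150-character one, since the difference must be strictly greater than 50.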

The actual crawling is done with puppeteer, which drives a Chromium browser and is very powerful.
It works in two steps:
1. Crawl the novel's table of contents to get the array of chapters
2. Crawl each chapter from the position given by startNum onward

// crawler/chapter.js

const puppeteer = require("puppeteer")
const baseUrl = "http://www.mytxt.cc/read/" // target site

const sleep = time => new Promise(resolve => {
  setTimeout(resolve, time)
})

process.on("message", async book => {
  const url = `${baseUrl}${book.bookId}/`

  console.log("Start visit the target page --- chapter", url)
  // Open the novel's page and get its chapter array
  let browser
  try {
    browser = await puppeteer.launch({
      args: ["--no-sandbox"],
      dumpio: false
    })
  } catch (err) {
    // The browser never started, so there is nothing to close
    console.log("browser--error:", err)
    process.exit(1)
  }

  const page = await browser.newPage()
  await page.goto(url, {
    waitUntil: "networkidle2"
  })

  await sleep(3000)

  await page.waitForSelector(".story_list_m62topxs") // class of the chapter list container

  let result = await page.evaluate((book) => {
    let list = document.querySelectorAll(".cp_dd_m62topxs li")
    // Backslashes are doubled because the pattern is built from a string
    let reg = new RegExp(`${book.bookId}/(\\S*)\\.html`)
    let chapter = Array.from(list).map((item, index) => {
      return {
        title: item.innerText,
        chapterId: item.innerHTML.match(reg)[1]
      }
    })
    return chapter
  }, book)

  // Keep only the chapters from startNum onward
  let tempResult = result.slice(book.startNum, result.length)

  for (let i = 0; i < tempResult.length; i++) {
    let chapterId = tempResult[i].chapterId
    console.log("Start crawling url:", `${url}${chapterId}.html`)

    await page.goto(`${url}${chapterId}.html`, {
      waitUntil: "networkidle2"
    })

    await sleep(2000)

    const content = await page.evaluate(() => {
      return document.querySelectorAll(".detail_con_m62topxs p")[0].innerText
    })

    tempResult[i].content = content
    tempResult[i].bookId = book.bookId

    process.send(tempResult[i]) // send the data back over IPC, which triggers child.on("message")
  }
  await browser.close()
  process.exit(0)
})
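The chapter ID extraction can be tried outside the browser. The sketch below assumes each list item contains a link of the form `/read/<bookId>/<chapterId>.html`, as the pattern in the listing suggests; `extractChapterId` is a hypothetical name, and unlike the listing it returns `null` instead of throwing when nothing matches.

```javascript
// Hypothetical helper: pull the chapter ID out of a list item's HTML.
// Assumes links look like /read/<bookId>/<chapterId>.html
function extractChapterId(linkHtml, bookId) {
  // The pattern is built from a string, so \S and \. need doubled backslashes
  const pattern = new RegExp(`${bookId}/(\\S*?)\\.html`)
  const match = linkHtml.match(pattern)
  return match ? match[1] : null
}
```

Without the backslashes the pattern would look for literal `S` characters and would not match a numeric chapter ID, so the `[1]` lookup would fail on the first list item.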

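3. Start the API

The main job here is to read data from MongoDB and expose it through routes with koa-router.

First define the route decorators so they can be reused later; see decorator.js for details.

The bottom layer reads the data from the database: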

// service/book.js: read the values stored in the database
import mongoose from "mongoose"

const Chapter = mongoose.model("Chapter")

// Get the content of a specific chapter
export const getDetailChapter = async (data) => {
  const chapter = await Chapter.findOne({
    chapterId: data.chapterId,
    bookId: data.bookId
  }, {
    content: 1,
    title: 1,
    chapterId: 1
  })
  // console.log("getDetailChapter::", chapter)
  return chapter
}
...

Route definition. The resulting endpoint is '/api/book/chapter':

@controller("/api/book")
export class bookController {
  @post("/chapter")
  async getDetailChapter (ctx, next) {
    const { chapterId, bookId } = ctx.request.body.data
    const list = await getDetailChapter({ 
      chapterId, 
      bookId 
    })

    ctx.body = {
      success: true,
      data: list
    }
  }
}
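Because the controller reads `ctx.request.body.data`, the client has to wrap its parameters in a `data` field of the POST body. The sketch below builds the options for such a call; `buildChapterRequest` and `baseUrl` are hypothetical, and it assumes the server parses JSON request bodies.

```javascript
// Hypothetical helper: build request options for the chapter endpoint.
// The parameters are nested under "data" because the controller reads
// ctx.request.body.data
function buildChapterRequest(baseUrl, bookId, chapterId) {
  return {
    url: `${baseUrl}/api/book/chapter`,
    method: "POST",
    header: { "content-type": "application/json" },
    data: { data: { bookId, chapterId } }
  }
}
```

In the Mini Program these options could be passed to `wx.request` together with `success` and `fail` callbacks; the response body would then carry the chapter under its own `data` field, as set by `ctx.body` above.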

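4. WeChat Mini Program

The Mini Program is built with wepy and its features are simple. See the Mini Program code for the details, which are not covered here.
It supports remembering the reading progress of each chapter as well as global settings. You can extend it from there.
Once you have found a novel's ID on the target site, you can look the novel up.
The next part covers the details of deploying to a server.

Finally, special thanks to @汪江江 for his help. I puzzled over this on and off for two months, and he needed only three days. Thank you for your tireless help; it is a pleasure working with you.
The above is just my own immature technique, so comments and corrections are welcome.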
