Using Node.js to Develop an Information Crawler
A recent project needed to gather some information. Since the project is written in Node.js, it was natural to write the crawler in Node.js as well.
Project address: github.com/mrtanweijie... The project crawls information from the Readhub, Open Source China, Developer Headlines, and 36Kr websites. It does not handle multiple pages for now, because the crawler runs once a day and fetching only the latest items already meets the need; multi-page support may be added later.
The crawling process boils down to downloading the target website's HTML to the local machine and then extracting data from it.
I. Downloading the Page
Node.js has many HTTP request libraries; here, request is used. The code is as follows:
requestDownloadHTML () {
  const options = {
    url: this.url,
    headers: { 'User-Agent': this.randomUserAgent() }
  }
  return new Promise((resolve, reject) => {
    request(options, (err, response, body) => {
      if (!err && response.statusCode === 200) {
        return resolve(body)
      } else {
        return reject(err)
      }
    })
  })
}
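The randomUserAgent() helper referenced above is not shown in the post; a minimal sketch of what it could look like, assuming it just picks from a hard-coded list of common User-Agent strings:

// Hypothetical helper (not part of the original post): picks a random
// User-Agent from a preset list so requests look less uniform.
randomUserAgent () {
  const userAgents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'
  ]
  return userAgents[Math.floor(Math.random() * userAgents.length)]
}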
The result is wrapped in a Promise so that async/await can be used later. Because many websites are rendered on the client, the downloaded page does not necessarily contain the desired HTML content. We can use Google's Puppeteer to download pages that are rendered on the client. As everyone knows, npm install may fail for Puppeteer because it needs to download the Chromium binary; just retry a few times :)
puppeteerDownloadHTML () {
  return new Promise(async (resolve, reject) => {
    try {
      const browser = await puppeteer.launch({ headless: true })
      const page = await browser.newPage()
      await page.goto(this.url)
      const bodyHandle = await page.$('body')
      const bodyHTML = await page.evaluate(body => body.innerHTML, bodyHandle)
      // Close the browser so headless Chromium processes do not pile up
      await browser.close()
      return resolve(bodyHTML)
    } catch (err) {
      console.log(err)
      return reject(err)
    }
  })
}
Of course, for client-rendered pages it is best to request the data interface directly, so that no HTML parsing is needed at all. Perform a simple encapsulation, and then it can be used like this:
await new Downloader('http://36kr.com/newsflashes', DOWNLOADER.puppeteer).downloadHTML()
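The Downloader encapsulation itself is not shown in the post; a minimal sketch of what such a wrapper could look like, assuming it contains the two download methods shown above and a DOWNLOADER enum that switches between them (the names here are illustrative; see the project repository for the real implementation):

// Hypothetical sketch of the Downloader wrapper.
const DOWNLOADER = { request: 0, puppeteer: 1 }

class Downloader {
  constructor (url, type = DOWNLOADER.request) {
    this.url = url
    this.type = type
  }

  // Pick the download strategy: a plain HTTP request for server-rendered
  // pages, Puppeteer for pages rendered on the client.
  downloadHTML () {
    if (this.type === DOWNLOADER.puppeteer) {
      return this.puppeteerDownloadHTML()
    }
    return this.requestDownloadHTML()
  }
}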
II. HTML Content Extraction
Of course, cheerio is used for HTML content extraction. cheerio exposes the same interface as jQuery, which makes it very easy to use. Press F12 in the browser to inspect the page's element nodes, and then extract the content as needed.
readHubExtract () {
  let nodeList = this.$('#itemList').find('.enableVisited')
  nodeList.each((i, e) => {
    let a = this.$(e).find('a')
    this.extractData.push(
      this.extractDataFactory(
        a.attr('href'),
        a.text(),
        '',
        SOURCECODE.Readhub
      )
    )
  })
  return this.extractData
}
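For this.$ to work, the downloaded HTML first has to be loaded into cheerio. A minimal sketch, assuming the extractor receives the HTML string from the downloader (the class and helper names are assumptions, not the project's actual code):

import cheerio from 'cheerio'

// Hypothetical setup: load the downloaded HTML into cheerio so that
// this.$ offers the familiar jQuery-style selector API.
class Extractor {
  constructor (html) {
    this.$ = cheerio.load(html)
    this.extractData = []
  }

  // Hypothetical helper: packs the extracted fields into a plain object.
  extractDataFactory (url, title, summary, source) {
    return { url, title, summary, source }
  }
}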
III. Scheduled Tasks
Use cron to run the crawler on a schedule:

function job () {
  let cronJob = new cron.CronJob({
    cronTime: cronConfig.cronTime,
    onTick: () => {
      spider()
    },
    start: false
  })
  cronJob.start()
}
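Since the crawler only needs to run once a day, cronTime can be an ordinary cron expression; a minimal sketch of what cronConfig might look like (the exact schedule below is an assumption):

// Hypothetical cronConfig: the six-field expression fires every day at
// 08:00:00 (the cron package supports a leading seconds field).
const cronConfig = {
  cronTime: '00 00 08 * * *'
}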
IV. Data Persistence
In theory, data persistence should not be a concern of the crawler itself. Use mongoose to create a Model:
import mongoose from 'mongoose'

const Schema = mongoose.Schema

const NewsSchema = new Schema(
  {
    title: { type: 'String', required: true },
    url: { type: 'String', required: true },
    summary: String,
    recommend: { type: Boolean, default: false },
    source: { type: Number, required: true, default: 0 },
    status: { type: Number, required: true, default: 0 },
    createdTime: { type: Date, default: Date.now }
  },
  { collection: 'news' }
)

export default mongoose.model('news', NewsSchema)
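One thing the snippet above leaves out is the database connection, which must be established before any save can succeed. A minimal sketch, assuming a local MongoDB instance (the connection string is an assumption):

import mongoose from 'mongoose'

// Hypothetical connection setup: the URL and database name are
// placeholders; adjust them to the actual deployment.
mongoose.connect('mongodb://localhost:27017/spider')
mongoose.connection.on('error', err => console.error(err))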
Basic operations
import { OBJ_STATUS } from '../../Constants'

class BaseService {
  constructor (ObjModel) {
    this.ObjModel = ObjModel
  }

  saveObject (objData) {
    return new Promise((resolve, reject) => {
      this.ObjModel(objData).save((err, result) => {
        if (err) {
          return reject(err)
        }
        return resolve(result)
      })
    })
  }
}

export default BaseService
The news service simply inherits from BaseService:
import BaseService from './BaseService'
import News from '../models/News'

class NewsService extends BaseService {}

export default new NewsService(News)
Then store the data happily:
await newsService.batchSave(newsListTem)
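The batchSave method is not shown in the snippets above; a minimal sketch of how it could be added to BaseService, assuming Model.create is used to insert the whole list in one call:

// Hypothetical batch insert on BaseService: Model.create accepts an
// array of documents and returns a promise.
batchSave (objDataList) {
  return this.ObjModel.create(objDataList)
}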
For more information, go to GitHub and clone the project.
Summary

In short: download the page (with request, or Puppeteer for client-rendered sites), extract the content with cheerio, trigger the run with cron, and persist the results with mongoose.