Using Node.js to Develop an Information Crawler


A recent project needed some news-feed information. Since the project is written in Node.js, it was natural to write the crawler in Node.js as well.

Project address: github.com/mrtanweijie... The project crawls information from Readhub, Open Source China, Developer Headlines, and 36Kr. It does not handle multiple pages yet, because the crawler runs once a day and fetching only the latest items already meets our needs; this may be improved later.

The crawling process boils down to downloading the HTML of the target website locally and then extracting the data from it.

I. Download Page

Node.js has many HTTP request libraries. This project uses request; the code is as follows:

requestDownloadHTML () {
  const options = {
    url: this.url,
    headers: {
      'User-Agent': this.randomUserAgent()
    }
  }
  return new Promise((resolve, reject) => {
    request(options, (err, response, body) => {
      if (!err && response.statusCode === 200) {
        return resolve(body)
      } else {
        return reject(err)
      }
    })
  })
}
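The method above lives on the downloader class and assumes the request package has been imported and that a randomUserAgent() helper exists. A minimal standalone sketch of those assumptions (the User-Agent strings and the helper are illustrative, not the project's actual code):

import request from 'request'

// Hypothetical helper: pick a random desktop User-Agent so successive
// requests do not all look identical.
const USER_AGENTS = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'
]

function randomUserAgent () {
  return USER_AGENTS[Math.floor(Math.random() * USER_AGENTS.length)]
}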

The request is wrapped in a Promise to make it easy to use with async/await later. Because many websites render on the client, the downloaded page does not necessarily contain the desired HTML content. For those pages we can use Google's Puppeteer to download the fully rendered page. As we all know, npm install of puppeteer may fail because it needs to download Chromium; just try it a few times :)

puppeteerDownloadHTML () {
  return new Promise(async (resolve, reject) => {
    try {
      const browser = await puppeteer.launch({ headless: true })
      const page = await browser.newPage()
      await page.goto(this.url)
      const bodyHandle = await page.$('body')
      const bodyHTML = await page.evaluate(body => body.innerHTML, bodyHandle)
      // Close the browser so each run does not leak a Chromium process
      await browser.close()
      return resolve(bodyHTML)
    } catch (err) {
      console.log(err)
      return reject(err)
    }
  })
}

Of course, for client-rendered pages it is even better to call the site's API directly, so no HTML parsing is needed at all. With a simple encapsulation, the downloader can then be used like this:

await new Downloader('http://36kr.com/newsflashes', DOWNLOADER.puppeteer).downloadHTML()
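The Downloader wrapper itself is not shown in this article; a minimal sketch of what it might look like, assuming a DOWNLOADER constant that selects between the two strategies above and that request and puppeteer are already imported (names are illustrative, the project's actual implementation may differ):

const DOWNLOADER = {
  request: 'request',
  puppeteer: 'puppeteer'
}

class Downloader {
  constructor (url, type = DOWNLOADER.request) {
    this.url = url
    this.type = type
  }

  downloadHTML () {
    // Dispatch to whichever download strategy the caller asked for
    return this.type === DOWNLOADER.puppeteer
      ? this.puppeteerDownloadHTML()
      : this.requestDownloadHTML()
  }

  // requestDownloadHTML () { ... }   shown earlier
  // puppeteerDownloadHTML () { ... } shown earlier
}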

II. HTML Content Extraction

For HTML content extraction, cheerio is the obvious choice. cheerio exposes the same interface as jQuery and is very easy to use. Open the page in the browser and press F12 to inspect the element nodes you need, then extract the content accordingly.

readHubExtract () {
  let nodeList = this.$('#itemList').find('.enableVisited')
  nodeList.each((i, e) => {
    let a = this.$(e).find('a')
    this.extractData.push(
      this.extractDataFactory(
        a.attr('href'),
        a.text(),
        '',
        SOURCECODE.Readhub
      )
    )
  })
  return this.extractData
}
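The this.$ used above is assumed to be a cheerio instance loaded with the downloaded HTML; a minimal sketch of that setup (the Extractor class and the shape returned by extractDataFactory are assumptions based on the News schema shown later, not the project's exact code):

import cheerio from 'cheerio'

class Extractor {
  constructor (html) {
    // Load the downloaded HTML so this.$ behaves just like jQuery
    this.$ = cheerio.load(html)
    this.extractData = []
  }

  extractDataFactory (url, title, summary, source) {
    // Normalize one extracted item into the shape persisted later
    return { url, title, summary, source }
  }

  // readHubExtract () { ... } shown above
}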

III. Scheduled Tasks

Cron is used to run the crawler on a schedule:

function job () {
  let cronJob = new cron.CronJob({
    cronTime: cronConfig.cronTime,
    onTick: () => {
      spider()
    },
    start: false
  })
  cronJob.start()
}
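This assumes the cron package has been imported and that cronConfig.cronTime holds a standard cron expression; spider() is the function that ties the steps together. A minimal sketch of those assumptions (the 08:00 daily schedule and the spider body are illustrative, not the project's exact code):

import cron from 'cron'

const cronConfig = {
  // Run once a day at 08:00 (seconds minutes hours day month weekday)
  cronTime: '0 0 8 * * *'
}

async function spider () {
  // Download -> extract -> persist, reusing the pieces shown in this article;
  // newsService comes from the persistence section below, the URL is a placeholder.
  const html = await new Downloader('https://readhub.cn', DOWNLOADER.puppeteer).downloadHTML()
  const newsList = new Extractor(html).readHubExtract()
  await newsService.batchSave(newsList)
}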

IV. Data Persistence

In theory, data persistence should not be within the crawler's scope of concern. Use mongoose to create a Model:

import mongoose from 'mongoose'

const Schema = mongoose.Schema

const NewsSchema = new Schema(
  {
    title: { type: 'String', required: true },
    url: { type: 'String', required: true },
    summary: String,
    recommend: { type: Boolean, default: false },
    source: { type: Number, required: true, default: 0 },
    status: { type: Number, required: true, default: 0 },
    createdTime: { type: Date, default: Date.now }
  },
  { collection: 'news' }
)

export default mongoose.model('news', NewsSchema)
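The model assumes a MongoDB connection has already been established somewhere during startup; a minimal sketch of that step (the connection string is a placeholder, not the project's configuration):

import mongoose from 'mongoose'

// Placeholder connection string; point it at your own MongoDB instance
mongoose.connect('mongodb://localhost:27017/spider')
  .then(() => console.log('MongoDB connected'))
  .catch(err => console.error(err))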

Basic operations

import { OBJ_STATUS } from '../../Constants'

class BaseService {
  constructor (ObjModel) {
    this.ObjModel = ObjModel
  }

  saveObject (objData) {
    return new Promise((resolve, reject) => {
      this.ObjModel(objData).save((err, result) => {
        if (err) {
          return reject(err)
        }
        return resolve(result)
      })
    })
  }
}

export default BaseService

The news service:

import BaseService from './BaseService'
import News from '../models/News'

class NewsService extends BaseService {}

export default new NewsService(News)

Then store the data happily:

await newsService.batchSave(newsListTem)
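batchSave is not part of the BaseService shown above; a minimal sketch of what it might look like, assuming Mongoose's standard insertMany API (the method body is illustrative, the project's version may differ):

// Added to BaseService
batchSave (objList) {
  // Insert all extracted items in a single round trip
  return this.ObjModel.insertMany(objList)
}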

Summary

For the full code, head over to GitHub and clone the project.

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.