Using Node.js to Develop an Information Crawler
A recent project needed to gather some information. Since the project is written in Node.js, it was natural to write the crawler in Node.js as well.
Project address: github.com/mrtanweijie... The project crawls information from the Readhub, Open Source China, Developer Headlines, and 36Kr websites. It does not handle multiple pages for now, because the crawler runs once a day and fetching only the latest items already meets the need; multi-page support may be added later.
The crawling process boils down to downloading the target website's HTML to the local machine and then extracting data from it.
I. Downloading the Page
Node.js has many HTTP request libraries; here, request is used. The code is as follows:
requestDownloadHTML () {
  const options = {
    url: this.url,
    headers: { 'User-Agent': this.randomUserAgent() }
  }
  return new Promise((resolve, reject) => {
    request(options, (err, response, body) => {
      if (!err && response.statusCode === 200) {
        return resolve(body)
      } else {
        return reject(err)
      }
    })
  })
}
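The randomUserAgent() helper referenced above is not shown in the post; a minimal sketch of what it could look like, assuming it just picks from a hard-coded list of common User-Agent strings:

// Hypothetical helper (not part of the original post): picks a random
// User-Agent from a preset list so requests look less uniform.
randomUserAgent () {
  const userAgents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'
  ]
  return userAgents[Math.floor(Math.random() * userAgents.length)]
}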
The result is wrapped in a Promise so that async/await can be used later. Because many websites are rendered on the client, the downloaded page does not necessarily contain the desired HTML content. We can use Google's Puppeteer to download pages that are rendered on the client. As everyone knows, npm install may fail for Puppeteer because it needs to download the Chromium binary; just retry a few times :)
puppeteerDownloadHTML () {
  return new Promise(async (resolve, reject) => {
    try {
      const browser = await puppeteer.launch({ headless: true })
      const page = await browser.newPage()
      await page.goto(this.url)
      const bodyHandle = await page.$('body')
      const bodyHTML = await page.evaluate(body => body.innerHTML, bodyHandle)
      // Close the browser so headless Chromium processes do not pile up
      await browser.close()
      return resolve(bodyHTML)
    } catch (err) {
      console.log(err)
      return reject(err)
    }
  })
}
Of course, for client-rendered pages it is best to request the data interface directly, so that no HTML parsing is needed at all. Perform a simple encapsulation, and then it can be used like this:
await new Downloader('http://36kr.com/newsflashes', DOWNLOADER.puppeteer).downloadHTML()
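The Downloader encapsulation itself is not shown in the post; a minimal sketch of what such a wrapper could look like, assuming it contains the two download methods shown above and a DOWNLOADER enum that switches between them (the names here are illustrative; see the project repository for the real implementation):

// Hypothetical sketch of the Downloader wrapper.
const DOWNLOADER = { request: 0, puppeteer: 1 }

class Downloader {
  constructor (url, type = DOWNLOADER.request) {
    this.url = url
    this.type = type
  }

  // Pick the download strategy: a plain HTTP request for server-rendered
  // pages, Puppeteer for pages rendered on the client.
  downloadHTML () {
    if (this.type === DOWNLOADER.puppeteer) {
      return this.puppeteerDownloadHTML()
    }
    return this.requestDownloadHTML()
  }
}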
II. HTML Content Extraction
Of course, cheerio is used for HTML content extraction. cheerio exposes the same interface as jQuery, which makes it very easy to use. Press F12 in the browser to inspect the page's element nodes, and then extract the content as needed.
readHubExtract () {
  let nodeList = this.$('#itemList').find('.enableVisited')
  nodeList.each((i, e) => {
    let a = this.$(e).find('a')
    this.extractData.push(
      this.extractDataFactory(
        a.attr('href'),
        a.text(),
        '',
        SOURCECODE.Readhub
      )
    )
  })
  return this.extractData
}
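For this.$ to work, the downloaded HTML first has to be loaded into cheerio. A minimal sketch, assuming the extractor receives the HTML string from the downloader (the class and helper names are assumptions, not the project's actual code):

import cheerio from 'cheerio'

// Hypothetical setup: load the downloaded HTML into cheerio so that
// this.$ offers the familiar jQuery-style selector API.
class Extractor {
  constructor (html) {
    this.$ = cheerio.load(html)
    this.extractData = []
  }

  // Hypothetical helper: packs the extracted fields into a plain object.
  extractDataFactory (url, title, summary, source) {
    return { url, title, summary, source }
  }
}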
III. Scheduled Tasks
Use cron to run the crawler on a schedule:

function job () {
  let cronJob = new cron.CronJob({
    cronTime: cronConfig.cronTime,
    onTick: () => {
      spider()
    },
    start: false
  })
  cronJob.start()
}
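Since the crawler only needs to run once a day, cronTime can be an ordinary cron expression; a minimal sketch of what cronConfig might look like (the exact schedule below is an assumption):

// Hypothetical cronConfig: the six-field expression fires every day at
// 08:00:00 (the cron package supports a leading seconds field).
const cronConfig = {
  cronTime: '00 00 08 * * *'
}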
IV. Data Persistence
In theory, data persistence should not be a concern of the crawler itself. Use mongoose to create a Model:
import mongoose from 'mongoose'

const Schema = mongoose.Schema

const NewsSchema = new Schema(
  {
    title: { type: 'String', required: true },
    url: { type: 'String', required: true },
    summary: String,
    recommend: { type: Boolean, default: false },
    source: { type: Number, required: true, default: 0 },
    status: { type: Number, required: true, default: 0 },
    createdTime: { type: Date, default: Date.now }
  },
  { collection: 'news' }
)

export default mongoose.model('news', NewsSchema)
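One thing the snippet above leaves out is the database connection, which must be established before any save can succeed. A minimal sketch, assuming a local MongoDB instance (the connection string is an assumption):

import mongoose from 'mongoose'

// Hypothetical connection setup: the URL and database name are
// placeholders; adjust them to the actual deployment.
mongoose.connect('mongodb://localhost:27017/spider')
mongoose.connection.on('error', err => console.error(err))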
Basic operations
import { OBJ_STATUS } from '../../Constants'

class BaseService {
  constructor (ObjModel) {
    this.ObjModel = ObjModel
  }

  saveObject (objData) {
    return new Promise((resolve, reject) => {
      this.ObjModel(objData).save((err, result) => {
        if (err) {
          return reject(err)
        }
        return resolve(result)
      })
    })
  }
}

export default BaseService
The news service simply inherits from BaseService:
import BaseService from './BaseService'
import News from '../models/News'

class NewsService extends BaseService {}

export default new NewsService(News)
Then store the data happily:
await newsService.batchSave(newsListTem)
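The batchSave method is not shown in the snippets above; a minimal sketch of how it could be added to BaseService, assuming Model.create is used to insert the whole list in one call:

// Hypothetical batch insert on BaseService: Model.create accepts an
// array of documents and returns a promise.
batchSave (objDataList) {
  return this.ObjModel.create(objDataList)
}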
For more information, go to GitHub and clone the project.
Summary

In short: download the page (with request, or Puppeteer for client-rendered sites), extract the content with cheerio, trigger the run with cron, and persist the results with mongoose.