This article walks through the whole process of building a crawler in NodeJS: setting up the project, analyzing the target website, using superagent to fetch the source data, using cheerio to parse it, and using eventproxy to fetch the content of each topic concurrently. Today I am following alsotang's crawler tutorial and writing a simple crawler for CNode.
Create Project craelr-demo
First, create an Express project and delete all of the content of app.js, since we do not need to serve anything on the web. Of course, we can also simply run npm install express
to get the Express functionality we need.
Target website analysis
This is part of a p tag on the CNode homepage; we use this series of IDs and classes to locate the information we need.
Use superagent to obtain source data
superagent is an HTTP library for making ajax-style requests, and its usage is similar to jQuery's. We use it to initiate a GET request and output the result in the callback function.
The Code is as follows:
var express = require('express');
var url = require('url');               // for parsing and resolving URLs
var superagent = require('superagent'); // do not forget to npm install these
var cheerio = require('cheerio');
var eventproxy = require('eventproxy');

var targetUrl = 'https://cnodejs.org/';

superagent.get(targetUrl)
  .end(function (err, res) {
    console.log(res);
  });
The res result is an object containing the response from the target URL; the page content itself lives mainly in its text property (a string).
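As a quick check, the fields we will actually use can be inspected in the callback. A minimal sketch, reusing targetUrl from above:

superagent.get(targetUrl)
  .end(function (err, res) {
    if (err) {
      return console.error(err);
    }
    console.log(res.status);      // HTTP status code, e.g. 200
    console.log(res.text.length); // res.text holds the raw HTML as a string
  });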
Use cheerio for parsing
cheerio acts like jQuery on the server. We first use its .load() to load the HTML, and then use CSS selectors to filter out the elements we need.
The Code is as follows:
var $ = cheerio.load(res.text);
// use a CSS selector to filter the data
$('#topic_list .topic_title').each(function (idx, element) {
  console.log(element);
});
The result is a cheerio object, and its .each(function (index, element)) method traverses every match; each element it yields is a raw HTML DOM element. Wrapping it as $element = $(element) lets us call jQuery-style methods on it. Outputting console.log($element.attr('title')) gives the topic title, for example "December 06, 2014 NodeParty UC venue in Guangzhou", and console.log($element.attr('href')) gives a relative URL such as /topic/545c395becbcb78265856eb2. We use the NodeJS url.resolve() function to turn it into a complete URL.
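As a quick illustration (using the relative href shown above), url.resolve() joins a base URL with a relative path:

var url = require('url');

// join the base URL with the relative href taken from the title link
var href = url.resolve('https://cnodejs.org/', '/topic/545c395becbcb78265856eb2');
console.log(href); // https://cnodejs.org/topic/545c395becbcb78265856eb2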
The Code is as follows:
superagent.get(tUrl)
  .end(function (err, res) {
    if (err) {
      return console.error(err);
    }
    var topicUrls = [];
    var $ = cheerio.load(res.text);
    // obtain all topic links on the home page
    $('#topic_list .topic_title').each(function (idx, element) {
      var $element = $(element);
      var href = url.resolve(tUrl, $element.attr('href'));
      console.log(href);
      // topicUrls.push(href);
    });
  });
Use eventproxy to concurrently capture the content of each topic
The tutorial shows both the deeply nested (serial) approach and the counter approach; eventproxy solves the same problem with an event-based (parallel) approach. Once it has received the event message the expected number of times, i.e. after all the fetches have finished, eventproxy automatically calls the handler function.
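Before applying it to the crawler, here is a minimal standalone sketch (with made-up values) of how the after/emit pairing behaves:

var eventproxy = require('eventproxy');
var ep = new eventproxy();

// wait until 'done' has been emitted 3 times, then call back
// with the 3 emitted values collected into an array
ep.after('done', 3, function (results) {
  console.log(results); // [ 'a', 'b', 'c' ]
});

ep.emit('done', 'a');
ep.emit('done', 'b');
ep.emit('done', 'c');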
The Code is as follows:
// step 1: get an eventproxy instance
var ep = new eventproxy();

// step 2: define the callback for the listened event
// the after method listens for repeated events
// params: eventname (String), times (Number) how many times to listen, callback function
ep.after('topic_html', topicUrls.length, function (topics) {
  // topics is an array collecting the 40 pairs passed to ep.emit('topic_html', pair) 40 times
  // .map
  topics = topics.map(function (topicPair) {
    // use cheerio
    var topicUrl = topicPair[0];
    var topicHtml = topicPair[1];
    var $ = cheerio.load(topicHtml);
    return ({
      title: $('.topic_full_title').text().trim(),
      href: topicUrl,
      comment1: $('.reply_content').eq(0).text().trim()
    });
  });
  // outcome
  console.log('outcome:');
  console.log(topics);
});

// step 3: emit the event message
topicUrls.forEach(function (topicUrl) {
  superagent.get(topicUrl)
    .end(function (err, res) {
      console.log('fetch ' + topicUrl + ' successful');
      ep.emit('topic_html', [topicUrl, res.text]);
    });
});
Running this prints the resulting array of topic objects to the console.
Extended exercises (challenges)
Get each commenter's username and points
In the source of a topic page, the comment author links carry the class name reply_author. Printing the first matched element with console.log($('.reply_author').get(0)) shows that this element contains everything we need: the link to the user's page and the username itself.
First, let's crawl a single article and grab all of this information in one pass.
The Code is as follows:
var userHref = url.resolve(tUrl, $('.reply_author').get(0).attribs.href);
console.log(userHref);
console.log($('.reply_author').get(0).children[0].data);
We can then request https://cnodejs.org/user/username to capture the points information.
The Code is as follows:
$('.reply_author').each(function (idx, element) {
  var $element = $(element);
  console.log($element.attr('href'));
});
On the user information page, $('.big').text().trim() gives the points.
We use cheerio's .get(0) to obtain the first matched element.
The Code is as follows:
var userHref = url.resolve(tUrl, $('.reply_author').get(0).attribs.href);
console.log(userHref);
This only handles a single article; to cover all 40 topics, the code still needs to be extended.
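As a closing sketch (not the tutorial's official solution), one way to extend this to every topic is to nest a second eventproxy counter: for each topic, fetch the article, follow the first commenter's link, and read the points from the user page. It assumes the topics array and tUrl from the earlier steps are in scope; names such as ep2 and user_points are illustrative.

var ep2 = new eventproxy();

// collect one result per topic once every user page has been fetched
ep2.after('user_points', topics.length, function (users) {
  console.log(users);
});

topics.forEach(function (topic) {
  superagent.get(topic.href)
    .end(function (err, res) {
      if (err) {
        return ep2.emit('user_points', null); // keep the counter moving on errors
      }
      var $ = cheerio.load(res.text);
      var $author = $('.reply_author').first();
      if ($author.length === 0) {
        return ep2.emit('user_points', null); // topic has no comments yet
      }
      var username = $author.text().trim();
      var userHref = url.resolve(tUrl, $author.attr('href'));
      // fetch the user page, e.g. https://cnodejs.org/user/username
      superagent.get(userHref)
        .end(function (err2, userRes) {
          if (err2) {
            return ep2.emit('user_points', null);
          }
          var $user = cheerio.load(userRes.text);
          ep2.emit('user_points', {
            name: username,
            points: $user('.big').text().trim()
          });
        });
    });
});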