This article walks through the whole process of building a crawler in NodeJS: setting up the project, analyzing the target website, using superagent to fetch the source data, using cheerio to parse it, and using eventproxy to fetch the content of each topic concurrently. Today I am following alsotang's crawler tutorial and writing a simple crawler for CNode.
Create Project craelr-demo
First, create an Express project and delete all of the content of app.js, since we do not need to serve anything on the web. Of course, we can also simply run npm install express
to get the Express functionality we need.
Target website analysis
This is part of a p tag on the CNode homepage; we use this series of IDs and classes to locate the information we need.
Use superagent to obtain source data
superagent is an HTTP library for making ajax-style requests, and its usage is similar to jQuery's. We use it to initiate a GET request and output the result in the callback function.
The Code is as follows:
var express = require('express');
var url = require('url');               // for parsing and resolving URLs
var superagent = require('superagent'); // do not forget to npm install these
var cheerio = require('cheerio');
var eventproxy = require('eventproxy');

var targetUrl = 'https://cnodejs.org/';

superagent.get(targetUrl)
  .end(function (err, res) {
    console.log(res);
  });
The res result is an object containing the response from the target URL; the page content itself lives mainly in its text property (a string).
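As a quick check, the fields we will actually use can be inspected in the callback. A minimal sketch, reusing targetUrl from above:

superagent.get(targetUrl)
  .end(function (err, res) {
    if (err) {
      return console.error(err);
    }
    console.log(res.status);      // HTTP status code, e.g. 200
    console.log(res.text.length); // res.text holds the raw HTML as a string
  });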
Use cheerio for parsing
cheerio acts like jQuery on the server. We first use its .load() to load the HTML, and then use CSS selectors to filter out the elements we need.
The Code is as follows:
var $ = cheerio.load(res.text);
// use a CSS selector to filter the data
$('#topic_list .topic_title').each(function (idx, element) {
  console.log(element);
});
The result is a cheerio object, and its .each(function (index, element)) method traverses every match; each element it yields is a raw HTML DOM element. Wrapping it as $element = $(element) lets us call jQuery-style methods on it. Outputting console.log($element.attr('title')) gives the topic title, for example "December 06, 2014 NodeParty UC venue in Guangzhou", and console.log($element.attr('href')) gives a relative URL such as /topic/545c395becbcb78265856eb2. We use the NodeJS url.resolve() function to turn it into a complete URL.
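As a quick illustration (using the relative href shown above), url.resolve() joins a base URL with a relative path:

var url = require('url');

// join the base URL with the relative href taken from the title link
var href = url.resolve('https://cnodejs.org/', '/topic/545c395becbcb78265856eb2');
console.log(href); // https://cnodejs.org/topic/545c395becbcb78265856eb2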
The Code is as follows:
superagent.get(tUrl)
  .end(function (err, res) {
    if (err) {
      return console.error(err);
    }
    var topicUrls = [];
    var $ = cheerio.load(res.text);
    // obtain all topic links on the home page
    $('#topic_list .topic_title').each(function (idx, element) {
      var $element = $(element);
      var href = url.resolve(tUrl, $element.attr('href'));
      console.log(href);
      // topicUrls.push(href);
    });
  });
Use eventproxy to concurrently capture the content of each topic
The tutorial shows both the deeply nested (serial) approach and the counter approach; eventproxy solves the same problem with an event-based (parallel) approach. Once it has received the event message the expected number of times, i.e. after all the fetches have finished, eventproxy automatically calls the handler function.
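Before applying it to the crawler, here is a minimal standalone sketch (with made-up values) of how the after/emit pairing behaves:

var eventproxy = require('eventproxy');
var ep = new eventproxy();

// wait until 'done' has been emitted 3 times, then call back
// with the 3 emitted values collected into an array
ep.after('done', 3, function (results) {
  console.log(results); // [ 'a', 'b', 'c' ]
});

ep.emit('done', 'a');
ep.emit('done', 'b');
ep.emit('done', 'c');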
The Code is as follows:
// step 1: get an eventproxy instance
var ep = new eventproxy();

// step 2: define the callback for the listened event
// the after method listens for repeated events
// params: eventname (String), times (Number) how many times to listen, callback function
ep.after('topic_html', topicUrls.length, function (topics) {
  // topics is an array collecting the 40 pairs passed to ep.emit('topic_html', pair) 40 times
  // .map
  topics = topics.map(function (topicPair) {
    // use cheerio
    var topicUrl = topicPair[0];
    var topicHtml = topicPair[1];
    var $ = cheerio.load(topicHtml);
    return ({
      title: $('.topic_full_title').text().trim(),
      href: topicUrl,
      comment1: $('.reply_content').eq(0).text().trim()
    });
  });
  // outcome
  console.log('outcome:');
  console.log(topics);
});

// step 3: emit the event message
topicUrls.forEach(function (topicUrl) {
  superagent.get(topicUrl)
    .end(function (err, res) {
      console.log('fetch ' + topicUrl + ' successful');
      ep.emit('topic_html', [topicUrl, res.text]);
    });
});
Running this prints the resulting array of topic objects to the console.
Extended exercises (challenges)
Get each commenter's username and points
In the source of a topic page, the comment author links carry the class name reply_author. Printing the first matched element with console.log($('.reply_author').get(0)) shows that this element contains everything we need: the link to the user's page and the username itself.
First, let's crawl a single article and grab all of this information in one pass.
The Code is as follows:
var userHref = url.resolve(tUrl, $('.reply_author').get(0).attribs.href);
console.log(userHref);
console.log($('.reply_author').get(0).children[0].data);
We can then request https://cnodejs.org/user/username to capture the points information.
The Code is as follows:
$('.reply_author').each(function (idx, element) {
  var $element = $(element);
  console.log($element.attr('href'));
});
On the user information page, $('.big').text().trim() gives the points.
We use cheerio's .get(0) to obtain the first matched element.
The Code is as follows:
var userHref = url.resolve(tUrl, $('.reply_author').get(0).attribs.href);
console.log(userHref);
This only handles a single article; to cover all 40 topics, the code still needs to be extended.
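As a closing sketch (not the tutorial's official solution), one way to extend this to every topic is to nest a second eventproxy counter: for each topic, fetch the article, follow the first commenter's link, and read the points from the user page. It assumes the topics array and tUrl from the earlier steps are in scope; names such as ep2 and user_points are illustrative.

var ep2 = new eventproxy();

// collect one result per topic once every user page has been fetched
ep2.after('user_points', topics.length, function (users) {
  console.log(users);
});

topics.forEach(function (topic) {
  superagent.get(topic.href)
    .end(function (err, res) {
      if (err) {
        return ep2.emit('user_points', null); // keep the counter moving on errors
      }
      var $ = cheerio.load(res.text);
      var $author = $('.reply_author').first();
      if ($author.length === 0) {
        return ep2.emit('user_points', null); // topic has no comments yet
      }
      var username = $author.text().trim();
      var userHref = url.resolve(tUrl, $author.attr('href'));
      // fetch the user page, e.g. https://cnodejs.org/user/username
      superagent.get(userHref)
        .end(function (err2, userRes) {
          if (err2) {
            return ep2.emit('user_points', null);
          }
          var $user = cheerio.load(userRes.text);
          ep2.emit('user_points', {
            name: username,
            points: $user('.big').text().trim()
          });
        });
    });
});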