Full process of making crawlers using NodeJS

This article walks through the whole process of building a crawler in Node.js: setting up the project, analyzing the target website, using superagent to fetch the source data, using cheerio to parse it, and using eventproxy to capture the content of each topic concurrently. I am going to work through alsotang's crawler tutorial and simply crawl CNode.

Create the project craelr-demo
First, create an Express project and delete everything in app.js, because we do not need to display any content on the Web. Of course, we can also just npm install express directly to get the Express functionality we need.

Target website analysis
The part of the CNode homepage markup we target is a container with id topic_list holding the topic links, each with class topic_title. We use this series of ids and classes to locate the information we need.
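To make that structure concrete, here is a toy stand-in for the markup loaded into cheerio (introduced below); the HTML is reconstructed from the selectors we use later, not copied from the site's real source.

var cheerio = require('cheerio');

// A stripped-down stand-in for the homepage topic list
var html =
    '<div id="topic_list">' +
    '<a class="topic_title" title="Topic one" href="/topic/aaa">Topic one</a>' +
    '<a class="topic_title" title="Topic two" href="/topic/bbb">Topic two</a>' +
    '</div>';

var $ = cheerio.load(html);
console.log($('#topic_list .topic_title').length); // => 2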

Use superagent to obtain source data

Superagent is an HTTP library for making ajax-style requests, and its usage is similar to jQuery's. We use it to issue a GET request and output the result in the callback function.

The Code is as follows:


var express = require('express');
var url = require('url'); // used to resolve relative URLs
var superagent = require('superagent'); // do not forget to npm install these
var cheerio = require('cheerio');
var eventproxy = require('eventproxy');

var targetUrl = 'https://cnodejs.org/';

superagent.get(targetUrl)
    .end(function (err, res) {
        console.log(res);
    });

The res result is an object containing information about the target URL; the page content itself lives mainly in its text property (a string).
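Continuing from the snippet above, the two fields we lean on can be inspected directly; .status and .text are standard superagent response properties:

superagent.get(targetUrl)
    .end(function (err, res) {
        if (err) {
            return console.error(err);
        }
        console.log(res.status);      // e.g. 200
        console.log(res.text.length); // size of the raw HTML string
    });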

Use cheerio for parsing

Cheerio acts as a server-side jQuery. We first use its .load() to read in the HTML, and then use CSS selectors to filter out the elements we need.

The Code is as follows:


var $ = cheerio.load(res.text);
// Use a CSS selector to filter the data
$('#topic_list .topic_title').each(function (idx, element) {
    console.log(element);
});

The result is a cheerio object set; we call its .each(function (index, element)) method to traverse every entry, and what it hands back are raw HTML DOM elements.
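Because those raw DOM elements do not carry jQuery-style methods themselves, wrap each one with $ before calling .attr(); a minimal sketch of the pattern, continuing from the block above:

$('#topic_list .topic_title').each(function (idx, element) {
    var $element = $(element);           // wrap the raw DOM node
    console.log($element.attr('title')); // jQuery-style access now works
});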

Outputting console.log($element.attr('title')) prints the title, for example December 06, 2014 NodeParty UC venue in Guangzhou, and console.log($element.attr('href')) prints a relative URL such as /topic/545c395becbcb78265856eb2. We then use Node.js's url.resolve() function to complete the URL.
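url.resolve() joins a base URL and a relative path, which is exactly what we need here; a quick standalone illustration:

var url = require('url');

// Resolve a topic's relative href against the site root
var full = url.resolve('https://cnodejs.org/', '/topic/545c395becbcb78265856eb2');
console.log(full); // => https://cnodejs.org/topic/545c395becbcb78265856eb2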

The Code is as follows:


superagent.get(tUrl)
    .end(function (err, res) {
        if (err) {
            return console.error(err);
        }
        var topicUrls = [];
        var $ = cheerio.load(res.text);
        // Obtain all topic links on the home page
        $('#topic_list .topic_title').each(function (idx, element) {
            var $element = $(element);
            var href = url.resolve(tUrl, $element.attr('href'));
            console.log(href);
            // topicUrls.push(href);
        });
    });

Use eventproxy to concurrently capture the content of each topic
The tutorial shows the deeply nested (serial) approach and the counter approach as examples; eventproxy takes the event-based (parallel) approach to the same problem. Once all the fetches have finished, that is, once eventproxy has received the expected number of event messages, it automatically calls the handler function.
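For contrast, a minimal sketch of the hand-rolled counter approach the tutorial mentions (the fetched and results names and the handleAll handler are illustrative, not from the tutorial):

var fetched = 0;
var results = [];

topicUrls.forEach(function (topicUrl) {
    superagent.get(topicUrl).end(function (err, res) {
        results.push([topicUrl, res.text]);
        fetched += 1;
        // Run the handler only after every request has come back
        if (fetched === topicUrls.length) {
            handleAll(results); // hypothetical aggregate handler
        }
    });
});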

The Code is as follows:


// Step 1: get an eventproxy instance
var ep = new eventproxy();

// Step 2: define the callback for the event being listened for.
// The after method listens for a repeated event.
// Params: eventname (String), times (Number), callback function
ep.after('topic_html', topicUrls.length, function (topics) {
    // topics is an array of the 40 pairs emitted by ep.emit('topic_html', pair) 40 times.
    // Map each pair to the fields we want
    topics = topics.map(function (topicPair) {
        // Parse each topic page with cheerio
        var topicUrl = topicPair[0];
        var topicHtml = topicPair[1];
        var $ = cheerio.load(topicHtml);
        return ({
            title: $('.topic_full_title').text().trim(),
            href: topicUrl,
            comment1: $('.reply_content').eq(0).text().trim()
        });
    });
    // Outcome
    console.log('outcome:');
    console.log(topics);
});

// Step 3: emit the event messages
topicUrls.forEach(function (topicUrl) {
    superagent.get(topicUrl)
        .end(function (err, res) {
            console.log('fetch ' + topicUrl + ' successful');
            ep.emit('topic_html', [topicUrl, res.text]);
        });
});

The result is a "fetch ... successful" log line for each topic, followed by the outcome array of 40 objects, each with title, href, and comment1 fields.

Extended exercises (challenges)

Get the commenters' usernames and points

In the source code of the article page, find the class name used for a comment's author: the classname is reply_author. console.log the first element, $('.reply_author').get(0), and you can see that everything we need to capture is right there.
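Roughly, the node that console.log prints has the shape below (an abbreviated sketch of cheerio's underlying DOM node; it explains the .attribs.href and .children[0].data accesses in the next snippet):

// Abbreviated shape of $('.reply_author').get(0):
// {
//   type: 'tag',
//   name: 'a',
//   attribs: { href: '/user/...', class: 'reply_author' },
//   children: [ { type: 'text', data: 'username' } ],
//   ...
// }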

First, we can crawl one article and obtain all the information we need in a single pass.

The Code is as follows:


var userHref = url.resolve(tUrl, $('.reply_author').get(0).attribs.href); // profile URL of the first commenter
console.log(userHref);
console.log($('.reply_author').get(0).children[0].data); // the commenter's username (a text node)

We can then use https://cnodejs.org/user/username to capture the points information.

The Code is as follows:


$('.reply_author').each(function (idx, element) {
    var $element = $(element);
    console.log($element.attr('href'));
});

On the user information page, $('.big').text().trim() gives the points value.
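Putting those pieces together, a minimal sketch that fetches one user page and prints the points, assuming the superagent and cheerio requires from earlier; the getUserPoints wrapper is my own name for it:

// Fetch a user page and print the points shown in its .big element
function getUserPoints(userHref) {
    superagent.get(userHref)
        .end(function (err, res) {
            if (err) {
                return console.error(err);
            }
            var $ = cheerio.load(res.text);
            console.log(userHref + ' points: ' + $('.big').text().trim());
        });
}

getUserPoints('https://cnodejs.org/user/alsotang'); // any user page works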

Use cheerio's .get(0) to obtain the first element.

The Code is as follows:


var userHref = url.resolve(tUrl, $('.reply_author').get(0).attribs.href);
console.log(userHref);

This only captures a single article; to handle all 40 topics it still needs to be modified.
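One way to finish the exercise is to reuse the eventproxy pattern from above; in this sketch the ep2 instance and the user_href event name are my own choices, and topicUrls, tUrl, and the requires come from the earlier code:

var ep2 = new eventproxy();

// Collect the first commenter's profile URL from every topic page,
// then fetch each user page and print the points
ep2.after('user_href', topicUrls.length, function (userHrefs) {
    userHrefs.forEach(function (userHref) {
        superagent.get(userHref).end(function (err, res) {
            if (err) {
                return console.error(err);
            }
            var $ = cheerio.load(res.text);
            console.log(userHref + ': ' + $('.big').text().trim());
        });
    });
});

topicUrls.forEach(function (topicUrl) {
    superagent.get(topicUrl).end(function (err, res) {
        var $ = cheerio.load(res.text);
        // The sketch assumes every topic has at least one reply
        var author = $('.reply_author').get(0);
        ep2.emit('user_href', url.resolve(tUrl, author.attribs.href));
    });
});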
