Today I'm working through Alsotang's crawler tutorial, following along to do a simple crawl of CNode.
Create the project Craelr-demo
We first create an Express project and then delete all of the app.js file's contents, because we don't need to serve anything on the web side for now. Alternatively, we can just run npm install express in an empty folder and use the Express functions we need directly.
Target site analysis
As pictured, this is part of the div markup of the CNode home page; we locate the information we need through this series of ids and classes.
Using superagent to get source data
superagent is an HTTP library with an Ajax-style API, used much like jQuery. Here we issue a GET request and print the result in a callback function.
The code is as follows:

var express = require('express');
var url = require('url'); // used to parse and resolve URLs
var superagent = require('superagent'); // don't forget to npm install these three external dependencies
var cheerio = require('cheerio');
var eventproxy = require('eventproxy');

var targetUrl = 'https://cnodejs.org/';

superagent.get(targetUrl)
  .end(function (err, res) {
    console.log(res);
  });
The res result is an object containing information fetched from the target URL; the page source lives mainly in its text property (a string).
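For instance, here is a minimal sketch of inspecting just the fields we care about (assuming targetUrl from above, and checking for errors first):

superagent.get(targetUrl)
  .end(function (err, res) {
    if (err) {
      return console.error(err);
    }
    console.log(res.status);      // HTTP status code, e.g. 200
    console.log(res.text.length); // length of the page source string
  });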
Using cheerio to parse
cheerio provides jQuery-like functionality on the server side. We first load the HTML with its .load() method, then filter elements with CSS selectors.
The code is as follows:

var $ = cheerio.load(res.text);
// filter the data with a CSS selector
$('#topic_list .topic_title').each(function (idx, element) {
  console.log(element);
});
The result is a set of matched objects; calling .each(function(index, element)) traverses each one, and each element is an HTML DOM element. Wrapping an element as var $element = $(element) gives a cheerio object, so console.log($element.attr('title')) outputs a title such as 广州 2014年12月06日 NodeParty 之 UC 场, and console.log($element.attr('href')) outputs a relative URL such as /topic/545c395becbcb78265856eb2. We then use Node.js's url.resolve() function to complete the full URL.
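For example, url.resolve() joins the site root and the relative href (a quick sketch using the values above):

var url = require('url');
var href = url.resolve('https://cnodejs.org/', '/topic/545c395becbcb78265856eb2');
console.log(href); // https://cnodejs.org/topic/545c395becbcb78265856eb2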
The code is as follows:

superagent.get(targetUrl)
  .end(function (err, res) {
    if (err) {
      return console.error(err);
    }
    var topicUrls = [];
    var $ = cheerio.load(res.text);
    // get all the topic links on the home page
    $('#topic_list .topic_title').each(function (idx, element) {
      var $element = $(element);
      var href = url.resolve(targetUrl, $element.attr('href'));
      console.log(href);
      topicUrls.push(href);
    });
  });
Using eventproxy to crawl the content of each topic concurrently
The tutorial shows examples of the deeply nested (serial) approach and the counter approach; eventproxy instead uses an event-based (parallel) approach to solve the problem. Once all the fetches are complete, eventproxy receives the event messages and automatically invokes the handler function.
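For comparison, here is a minimal sketch of the counter idea, assuming topicUrls already holds the links collected above (error handling omitted); the eventproxy version follows below:

// counter approach: act once the number of finished fetches reaches the total
var results = [];
topicUrls.forEach(function (topicUrl) {
  superagent.get(topicUrl)
    .end(function (err, res) {
      results.push([topicUrl, res.text]);
      if (results.length === topicUrls.length) {
        console.log('all ' + results.length + ' pages fetched');
        // process results here
      }
    });
});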
The code is as follows:

// Step 1: get an eventproxy instance
var ep = new eventproxy();

// Step 2: define the callback for the listened event.
// The after method listens for a repeated event;
// params: eventname (string), times (number) the listen count, and the callback function.
ep.after('topic_html', topicUrls.length, function (topics) {
  // topics is an array holding the 40 pairs from the 40 ep.emit('topic_html', pair) calls
  topics = topics.map(function (topicPair) {
    // use cheerio on each fetched page
    var topicUrl = topicPair[0];
    var topicHtml = topicPair[1];
    var $ = cheerio.load(topicHtml);
    return ({
      title: $('.topic_full_title').text().trim(),
      href: topicUrl,
      comment1: $('.reply_content').eq(0).text().trim()
    });
  });
  // outcome
  console.log('outcome:');
  console.log(topics);
});

// Step 3: decide when to emit the event message
topicUrls.forEach(function (topicUrl) {
  superagent.get(topicUrl)
    .end(function (err, res) {
      console.log('fetch ' + topicUrl + ' successful');
      ep.emit('topic_html', [topicUrl, res.text]);
    });
});
The result is an array of 40 objects, each with title, href, and comment1 fields.
Extended Practice (Challenge)
Get each commenter's username and points
In the source of an article page, find the class name for the commenting users: the class is reply_author. console.log the first element, $('.reply_author').get(0), and you can see that everything we need to get is in this element. First, let's crawl a single article and grab everything we need in one go.
The code is as follows:

var userHref = url.resolve(targetUrl, $('.reply_author').get(0).attribs.href);
console.log(userHref);
console.log($('.reply_author').get(0).children[0].data);
We can grab the points by fetching the information at https://cnodejs.org/user/username.
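Before iterating over every commenter, here is a minimal sketch of fetching one user page and reading the points, assuming userHref is the user page URL resolved above (the .big selector is the one described below):

superagent.get(userHref)
  .end(function (err, res) {
    if (err) {
      return console.error(err);
    }
    var $ = cheerio.load(res.text);
    // the points live in the .big element on the user page
    console.log($('.big').text().trim());
  });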
The code is as follows:

$('.reply_author').each(function (idx, element) {
  var $element = $(element);
  console.log($element.attr('href'));
});
On the user information page, $('.big').text().trim() gives the points information. cheerio's .get(0) function retrieves the first matched element.
The code is as follows:

var userHref = url.resolve(targetUrl, $('.reply_author').get(0).attribs.href);
console.log(userHref);
This only crawls a single article; to handle all 40 topics, there are still places that need to be modified.
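A minimal sketch of one way to extend it (my own illustration, not from the tutorial): reuse the eventproxy pattern, collecting every commenter's href from the 40 topic pages, then fetching each user page for the points.

// inside the ep.after('topic_html', ...) handler from step two, once the 40 pages arrive:
var userHrefs = [];
topics.forEach(function (topicPair) {
  var $ = cheerio.load(topicPair[1]);
  $('.reply_author').each(function (idx, element) {
    // note: the same user may comment several times; a real crawler might dedupe here
    userHrefs.push(url.resolve(targetUrl, $(element).attr('href')));
  });
});

ep.after('user_html', userHrefs.length, function (userPairs) {
  userPairs.forEach(function (userPair) {
    var $ = cheerio.load(userPair[1]);
    console.log(userPair[0] + ' points: ' + $('.big').text().trim());
  });
});

userHrefs.forEach(function (userHref) {
  superagent.get(userHref)
    .end(function (err, res) {
      ep.emit('user_html', [userHref, res.text]);
    });
});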