The whole process of making a web crawler in Node.js


Today I worked through Alsotang's crawler tutorial and followed along to do a simple crawl of CNode.

Establishing the project Craelr-demo
We first scaffold an Express project and then delete everything in app.js, since for now we don't need to serve any content on the web side. Alternatively, we can just run npm install express in an empty folder and pull in the Express functionality we need directly.
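For reference, a rough sketch of the two setup routes (the generator command depends on your Express version, so treat the exact invocations as assumptions):

express Craelr-demo && cd Craelr-demo && npm install
# or, starting from an empty folder, install just what this article uses:
mkdir Craelr-demo && cd Craelr-demo
npm install express superagent cheerio eventproxy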

Target site analysis
As pictured (screenshot omitted here), this is part of the div markup of the CNode home page; we use this series of ids and classes to locate the information we need.

Using superagent to get the source data

superagent is an HTTP request library that exposes an Ajax-style API, used much the same way as jQuery's. Here we issue a GET request and output the result in the callback.

The code is as follows:

var express = require('express');
var url = require('url');               // for parsing and resolving URLs
var superagent = require('superagent'); // don't forget to npm install these three external dependencies
var cheerio = require('cheerio');
var eventproxy = require('eventproxy');

var targetUrl = 'https://cnodejs.org/';

superagent.get(targetUrl)
  .end(function (err, res) {
    console.log(res);
  });

The res result is an object containing the response from the target URL; the page content is mainly in res.text (a string).
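For instance, a quick sketch using superagent's standard response fields:

superagent.get(targetUrl)
  .end(function (err, res) {
    if (err) {
      return console.error(err);
    }
    console.log(res.status);                  // HTTP status code, e.g. 200
    console.log(res.headers['content-type']); // e.g. 'text/html; charset=utf-8'
    console.log(res.text.length);             // length of the raw HTML body string
  });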

Using cheerio to parse

cheerio gives us server-side jQuery functionality: we first use its .load() to load the HTML, then filter elements with CSS selectors.

The code is as follows:

var $ = cheerio.load(res.text);
// filter the data with a CSS selector
$('#topic_list .topic_title').each(function (idx, element) {
  console.log(element);
});

The result is an object holding a set of matches; calling .each(function (index, element)) traverses every match, and each element is an HTML DOM element.

After wrapping each one as $element = $(element), console.log($element.attr('title')); outputs titles such as 广州 2014年12月06日 NodeParty 之 UC 场, and console.log($element.attr('href')); outputs relative URLs such as /topic/545c395becbcb78265856eb2. We then use Node.js's url.resolve() function to build the full URL.
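For example, resolving against the site root (values taken from the output above):

var url = require('url');

var full = url.resolve('https://cnodejs.org/', '/topic/545c395becbcb78265856eb2');
console.log(full); // https://cnodejs.org/topic/545c395becbcb78265856eb2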

The code is as follows:

superagent.get(targetUrl)
  .end(function (err, res) {
    if (err) {
      return console.error(err);
    }
    var topicUrls = [];
    var $ = cheerio.load(res.text);
    // get all the topic links on the home page
    $('#topic_list .topic_title').each(function (idx, element) {
      var $element = $(element);
      var href = url.resolve(targetUrl, $element.attr('href'));
      console.log(href);
      topicUrls.push(href);
    });
  });


Using eventproxy to crawl each topic's content concurrently
The tutorial shows both the deeply nested (serial) approach and the counter approach; eventproxy solves the same problem with events (in parallel). Once all the fetches have completed, eventproxy receives the event messages and automatically invokes the handler function.
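In isolation, the after/emit pairing works like this (a toy sketch, separate from the crawler):

var eventproxy = require('eventproxy');
var ep = new eventproxy();

// Run the handler once 'tick' has been emitted 3 times;
// list collects the value passed to each emit, in order.
ep.after('tick', 3, function (list) {
  console.log(list); // [1, 2, 3]
});

ep.emit('tick', 1);
ep.emit('tick', 2);
ep.emit('tick', 3);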

Applied to our crawl, the code is as follows:

// Step 1: get an eventproxy instance
var ep = new eventproxy();

// Step 2: define the callback for the listened event.
// after() listens for a repeated event.
// params: eventName (String), times (Number), callback (Function)
ep.after('topic_html', topicUrls.length, function (topics) {
  // topics is an array collecting the 40 pairs passed to
  // ep.emit('topic_html', pair) across the 40 emits
  topics = topics.map(function (topicPair) {
    // use cheerio on each topic page
    var topicUrl = topicPair[0];
    var topicHtml = topicPair[1];
    var $ = cheerio.load(topicHtml);
    return {
      title: $('.topic_full_title').text().trim(),
      href: topicUrl,
      comment1: $('.reply_content').eq(0).text().trim()
    };
  });
  // outcome
  console.log('outcome:');
  console.log(topics);
});

// Step 3: emit the event message for each fetched page
topicUrls.forEach(function (topicUrl) {
  superagent.get(topicUrl)
    .end(function (err, res) {
      console.log('fetch ' + topicUrl + ' successful');
      ep.emit('topic_html', [topicUrl, res.text]);
    });
});


The results are as follows (console screenshot omitted).

Extended Practice (Challenge)

Getting commenters' usernames and points

In the source of an article page, find the class name for the commenting users: reply_author. console.log the first element, $('.reply_author').get(0), and we can see that everything we need to get is right there at the top.

First, let's crawl a single article and grab everything we need in one pass.

The code is as follows:

var userHref = url.resolve(targetUrl, $('.reply_author').get(0).attribs.href);
console.log(userHref);
console.log($('.reply_author').get(0).children[0].data);

We can then grab each user's points by fetching https://cnodejs.org/user/username and capturing the information there.

The code is as follows:

$('.reply_author').each(function (idx, element) {
  var $element = $(element);
  console.log($element.attr('href'));
});

On the user-information page, $('.big').text().trim() gives the points figure.
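Putting those pieces together, a sketch of fetching one user page and reading the points (userHref comes from the snippet above, and it assumes the .big selector from the article still matches the page):

superagent.get(userHref)
  .end(function (err, res) {
    if (err) {
      return console.error(err);
    }
    var $ = cheerio.load(res.text);
    // the points figure sits in an element with class 'big'
    console.log($('.big').text().trim());
  });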

With cheerio, the .get(0) function fetches the first element.

The code is as follows:

var userHref = url.resolve(targetUrl, $('.reply_author').get(0).attribs.href);
console.log(userHref);

This only crawls a single article; to cover all 40, the code still needs some modification.
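One way to finish the challenge is to reuse the eventproxy pattern from above. A hedged sketch (the event name and variable names are mine; it assumes topicUrls has already been filled by the home-page crawl):

// For each topic page, pull out every commenter's profile href and hand it
// to eventproxy; when all 40 pages are done, the handler gets one array of
// href lists, which we flatten and de-duplicate before fetching user pages.
ep.after('author_hrefs', topicUrls.length, function (hrefLists) {
  var seen = {};
  hrefLists.forEach(function (hrefs) {
    hrefs.forEach(function (href) {
      seen[href] = true;
    });
  });
  console.log(Object.keys(seen)); // unique user-page URLs to fetch next
});

topicUrls.forEach(function (topicUrl) {
  superagent.get(topicUrl)
    .end(function (err, res) {
      if (err) {
        return console.error(err);
      }
      var $ = cheerio.load(res.text);
      var hrefs = [];
      $('.reply_author').each(function (idx, element) {
        hrefs.push(url.resolve(targetUrl, $(element).attr('href')));
      });
      ep.emit('author_hrefs', hrefs);
    });
});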
