Today I'm working through Alsotang's crawler tutorial, following along to do a simple crawl of CNode.
Create the project Craelr-demo
We first create an Express project and then delete all of the app.js file's contents, because we don't need to serve anything on the web side for now. Alternatively, we can just run npm install express in an empty folder and use the Express functions we need directly.
Target site analysis
As pictured, this is part of the div markup of the CNode home page; we locate the information we need through this series of ids and classes.
Using superagent to get source data
superagent is an HTTP library with an Ajax-style API, used much like jQuery. Here we issue a GET request and print the result in a callback function.
The code is as follows:

var express = require('express');
var url = require('url'); // used to parse and resolve URLs
var superagent = require('superagent'); // don't forget to npm install these three external dependencies
var cheerio = require('cheerio');
var eventproxy = require('eventproxy');

var targetUrl = 'https://cnodejs.org/';

superagent.get(targetUrl)
  .end(function (err, res) {
    console.log(res);
  });
The res result is an object containing information fetched from the target URL; the page source lives mainly in its text property (a string).
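For instance, here is a minimal sketch of inspecting just the fields we care about (assuming targetUrl from above, and checking for errors first):

superagent.get(targetUrl)
  .end(function (err, res) {
    if (err) {
      return console.error(err);
    }
    console.log(res.status);      // HTTP status code, e.g. 200
    console.log(res.text.length); // length of the page source string
  });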
Using cheerio to parse
cheerio provides jQuery-like functionality on the server side. We first load the HTML with its .load() method, then filter elements with CSS selectors.
The code is as follows:

var $ = cheerio.load(res.text);
// filter the data with a CSS selector
$('#topic_list .topic_title').each(function (idx, element) {
  console.log(element);
});
The result is a set of matched objects; calling .each(function(index, element)) traverses each one, and each element is an HTML DOM element. Wrapping an element as var $element = $(element) gives a cheerio object, so console.log($element.attr('title')) outputs a title such as 广州 2014年12月06日 NodeParty 之 UC 场, and console.log($element.attr('href')) outputs a relative URL such as /topic/545c395becbcb78265856eb2. We then use Node.js's url.resolve() function to complete the full URL.
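For example, url.resolve() joins the site root and the relative href (a quick sketch using the values above):

var url = require('url');
var href = url.resolve('https://cnodejs.org/', '/topic/545c395becbcb78265856eb2');
console.log(href); // https://cnodejs.org/topic/545c395becbcb78265856eb2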
The code is as follows:

superagent.get(targetUrl)
  .end(function (err, res) {
    if (err) {
      return console.error(err);
    }
    var topicUrls = [];
    var $ = cheerio.load(res.text);
    // get all the topic links on the home page
    $('#topic_list .topic_title').each(function (idx, element) {
      var $element = $(element);
      var href = url.resolve(targetUrl, $element.attr('href'));
      console.log(href);
      topicUrls.push(href);
    });
  });
Using eventproxy to crawl the content of each topic concurrently
The tutorial shows examples of the deeply nested (serial) approach and the counter approach; eventproxy instead uses an event-based (parallel) approach to solve the problem. Once all the fetches are complete, eventproxy receives the event messages and automatically invokes the handler function.
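For comparison, here is a minimal sketch of the counter idea, assuming topicUrls already holds the links collected above (error handling omitted); the eventproxy version follows below:

// counter approach: act once the number of finished fetches reaches the total
var results = [];
topicUrls.forEach(function (topicUrl) {
  superagent.get(topicUrl)
    .end(function (err, res) {
      results.push([topicUrl, res.text]);
      if (results.length === topicUrls.length) {
        console.log('all ' + results.length + ' pages fetched');
        // process results here
      }
    });
});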
The code is as follows:

// Step 1: get an eventproxy instance
var ep = new eventproxy();

// Step 2: define the callback for the listened event.
// The after method listens for a repeated event;
// params: eventname (string), times (number) the listen count, and the callback function.
ep.after('topic_html', topicUrls.length, function (topics) {
  // topics is an array holding the 40 pairs from the 40 ep.emit('topic_html', pair) calls
  topics = topics.map(function (topicPair) {
    // use cheerio on each fetched page
    var topicUrl = topicPair[0];
    var topicHtml = topicPair[1];
    var $ = cheerio.load(topicHtml);
    return ({
      title: $('.topic_full_title').text().trim(),
      href: topicUrl,
      comment1: $('.reply_content').eq(0).text().trim()
    });
  });
  // outcome
  console.log('outcome:');
  console.log(topics);
});

// Step 3: decide when to emit the event message
topicUrls.forEach(function (topicUrl) {
  superagent.get(topicUrl)
    .end(function (err, res) {
      console.log('fetch ' + topicUrl + ' successful');
      ep.emit('topic_html', [topicUrl, res.text]);
    });
});
The result is an array of 40 objects, each with title, href, and comment1 fields.
Extended Practice (Challenge)
Get each commenter's username and points
In the source of an article page, find the class name for the commenting users: the class is reply_author. console.log the first element, $('.reply_author').get(0), and you can see that everything we need to get is in this element. First, let's crawl a single article and grab everything we need in one go.
The code is as follows:

var userHref = url.resolve(targetUrl, $('.reply_author').get(0).attribs.href);
console.log(userHref);
console.log($('.reply_author').get(0).children[0].data);
We can grab the points by fetching the information at https://cnodejs.org/user/username.
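Before iterating over every commenter, here is a minimal sketch of fetching one user page and reading the points, assuming userHref is the user page URL resolved above (the .big selector is the one described below):

superagent.get(userHref)
  .end(function (err, res) {
    if (err) {
      return console.error(err);
    }
    var $ = cheerio.load(res.text);
    // the points live in the .big element on the user page
    console.log($('.big').text().trim());
  });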
The code is as follows:

$('.reply_author').each(function (idx, element) {
  var $element = $(element);
  console.log($element.attr('href'));
});
On the user information page, $('.big').text().trim() gives the points information. cheerio's .get(0) function retrieves the first matched element.
The code is as follows:

var userHref = url.resolve(targetUrl, $('.reply_author').get(0).attribs.href);
console.log(userHref);
This only crawls a single article; to handle all 40 topics, there are still places that need to be modified.
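A minimal sketch of one way to extend it (my own illustration, not from the tutorial): reuse the eventproxy pattern, collecting every commenter's href from the 40 topic pages, then fetching each user page for the points.

// inside the ep.after('topic_html', ...) handler from step two, once the 40 pages arrive:
var userHrefs = [];
topics.forEach(function (topicPair) {
  var $ = cheerio.load(topicPair[1]);
  $('.reply_author').each(function (idx, element) {
    // note: the same user may comment several times; a real crawler might dedupe here
    userHrefs.push(url.resolve(targetUrl, $(element).attr('href')));
  });
});

ep.after('user_html', userHrefs.length, function (userPairs) {
  userPairs.forEach(function (userPair) {
    var $ = cheerio.load(userPair[1]);
    console.log(userPair[0] + ' points: ' + $('.big').text().trim());
  });
});

userHrefs.forEach(function (userHref) {
  superagent.get(userHref)
    .end(function (err, res) {
      ep.emit('user_html', [userHref, res.text]);
    });
});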