The full process of building a crawler with Node.js (continued)

Source: Internet
Author: User
This article builds on the earlier walkthrough of the full process of building a crawler with Node.js, supplementing and optimizing it. Interested readers may find it a useful reference. Picking up where the previous article left off, we need to modify the program so that it captures 40 pages in a row. That is, for each article we need to output the title, the link, the first comment, the user who wrote that comment, and that user's forum points.

The value returned by $('.reply_author').eq(0).text().trim(); is the correct first commenting user.


Once eventproxy has collected the comment and the user name, we need to follow the user name to that user's profile page and continue by capturing the user's points.

The code is as follows:


var $ = cheerio.load(topicHtml);
// This URL is the target URL for the next capture
var userHref = 'https://cnodejs.org' + $('.reply_author').eq(0).attr('href');
userHref = url.resolve(tUrl, userHref);
var title = $('.topic_full_title').text().trim().replace(/\n/g, "");
var href = topicUrl;
var comment1 = $('.reply_content').eq(0).text().trim();
var author1 = $('.reply_author').eq(0).text().trim();
// Pass parameters to the next concurrent capture
ep.emit('user_html', [userHref, title, href, comment1, author1]);
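
As a side note on the userHref lines above: url.resolve() already turns a relative href into an absolute URL, and an href that is already absolute passes through unchanged. The following is a minimal sketch using only Node's built-in url module; the topic URL and the /user/alsotang path are made-up sample values, not data from the crawl.

```javascript
const url = require('url');

// Hypothetical sample values standing in for tUrl and the scraped href.
const tUrl = 'https://cnodejs.org/topic/123';
const relativeHref = '/user/alsotang';

// Legacy API used in the article: resolve(base, target).
const viaResolve = url.resolve(tUrl, relativeHref);

// WHATWG URL API, the alternative recommended in current Node.js docs.
const viaWhatwg = new URL(relativeHref, tUrl).href;

// An already-absolute href (as built by string concatenation) is left as-is.
const alreadyAbsolute = url.resolve(tUrl, 'https://cnodejs.org' + relativeHref);

console.log(viaResolve);
console.log(viaWhatwg);
console.log(alreadyAbsolute);
```

All three values should be the same absolute profile URL, which suggests that either the prefix or the resolve call would be enough on its own.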

This time, in the eventproxy handler, we need to find where the score is (class="big").


The class name is easy to find. Let's try to output the result first.

The code is as follows:


var outcome = superagent.get(userUrl)
    .end(function (err, res) {
        if (err) {
            return console.error(err);
        }
        var $ = cheerio.load(res.text);
        var score = $('.big').text().trim();
        console.log(user[1]);
        console.log(user[2]);
        console.log(user[3]);
        console.log(user[4]);
        console.log($('.big').text().trim());
        return ({
            title: user[1],
            href: user[2],
            comment1: user[3],
            author1: user[4],
            score1: score
        });
    });
});

Run the program to see the result of this code.


But a problem arises: we can output the result correctly inside the .end() callback, yet outcome itself does not contain it. On closer inspection, outcome is a Request object. This was a careless mistake: the value returned from the .end() callback is not passed back through the Request object, so the result has to be handed to the outer layer (users) in some other way.
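
This behavior can be reproduced without any network access. The sketch below uses a hypothetical FakeRequest class as a stand-in for the request object (it is not superagent's code); it only mimics what the article observes, namely that .end() hands the response to a callback and gives back the request itself rather than the callback's return value. The body text 'score: 42' is made up.

```javascript
// Hypothetical stand-in for an HTTP request object; not superagent's code.
class FakeRequest {
  constructor(body) {
    this.body = body;
  }

  // Calls back later with a fake response and returns the request itself.
  // Whatever the callback returns is discarded.
  end(callback) {
    setTimeout(() => callback(null, { text: this.body }), 0);
    return this;
  }
}

const outcome = new FakeRequest('score: 42').end(function (err, res) {
  console.log('inside callback:', res.text); // the data is only available here
  return { score1: res.text };               // this return value goes nowhere
});

// outcome is the request object, not the object returned by the callback.
console.log(outcome instanceof FakeRequest);
console.log(outcome.score1);
```

The object returned inside the callback never reaches outcome, which is why the crawler has to pass the data on explicitly.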

The code is as follows:


// Find userDetails
ep.after('user_html', topicUrls.length, function (users) {
    users = users.map(function (user) {
        var userUrl = user[0];
        var score;
        superagent.get(userUrl)
            .end(function (err, res) {
                if (err) {
                    return console.error(err);
                }
                // console.log(res.text);
                var $ = cheerio.load(res.text);
                score = $('.big').text().trim();
            });
        return ({
            title: user[1],
            href: user[2],
            comment1: user[3],
            author1: user[4],
            score1: score
        });
    });

Output users, and score1 turns out to have no value (it is undefined). Careful debugging shows that the program runs console.log() before the work started inside .map() has finished. More precisely, inside the .map() function the .get() callback has not yet assigned score when the return statement executes. This is how asynchronous callbacks behave: the outer synchronous code does not wait for the callback to complete.
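
The ordering problem can be seen in isolation with a timer in place of the HTTP request. In this sketch, fetchScore is a hypothetical helper that calls back asynchronously, and the sample user and the score 100 are made up; the point is only the order in which the lines run.

```javascript
const order = [];

// Hypothetical async helper: delivers a score on a later turn of the event loop.
function fetchScore(userUrl, callback) {
  setTimeout(() => callback(null, 100), 0);
}

const users = [['/user/a', 'Title A']].map(function (user) {
  let score;
  fetchScore(user[0], function (err, value) {
    score = value;               // runs only after map() has already returned
    order.push('callback ran');
  });
  order.push('map returned');
  return { title: user[1], score1: score }; // score is still undefined here
});

// The callback has not run yet at this point.
console.log(users[0].score1);
console.log(order);

// After the timer fires the callback has run, but the object was already built
// with the old value, so score1 stays unset.
setTimeout(() => console.log(order, users[0].score1), 10);
```

Assigning to score later does not update the object that was already returned, because the object captured the value of score at the time it was created.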


My approach is to have eventproxy emit one more layer of messages, passing the required data along with each message to the .after() handler that receives them. Only when all messages have been received are the passed parameters (the results) printed.

The code is as follows:


score = $('.big').text().trim();
// Newly added
ep.emit('got_score', [user[1], user[2], user[3], user[4], score]);
.....
ep.after('got_score', 10, function (users) {
    console.log(users);
});
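
The counting idea behind this pattern can be sketched with Node's built-in EventEmitter. This is an illustration under assumptions, not eventproxy's implementation: the after() helper below is hypothetical and simply collects the payloads of one event until the expected number has arrived, then calls the handler once with all of them. The sample rows are made up.

```javascript
const { EventEmitter } = require('events');

// Hypothetical helper mimicking the "fire after N events" idea.
function after(emitter, eventName, times, handler) {
  const collected = [];
  function onEvent(data) {
    collected.push(data);
    if (collected.length === times) {
      emitter.removeListener(eventName, onEvent);
      handler(collected);
    }
  }
  emitter.on(eventName, onEvent);
}

const ep = new EventEmitter();
let finalUsers = null;

after(ep, 'got_score', 3, function (users) {
  finalUsers = users;            // runs once, when the third message arrives
  console.log(users);
});

// In the crawler, each emit would sit inside a .end() callback.
ep.emit('got_score', ['Title A', '/topic/1', 'first!', 'alice', '10']);
ep.emit('got_score', ['Title B', '/topic/2', 'nice', 'bob', '20']);
const firedEarly = finalUsers !== null; // handler must not fire after two messages
ep.emit('got_score', ['Title C', '/topic/3', 'thanks', 'carol', '30']);
```

Because the data travels with each message, the handler never depends on when the individual callbacks happen to finish.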


That problem is solved, but the score1 value looks too large. Checking the page again, there are two elements with class="big", and the topic favorites count also uses this class. We use cheerio's .slice(start, [end]) to pick out the first element, changing the assignment to score = $('.big').slice(0).eq(0).text().trim();. Note that .slice(0) keeps the whole selection, so it is .eq(0) that actually selects the first match. The results are now correct.

