Asynchronous concurrency control in Node.js: an advanced Node.js crawler tutorial


The little crawler I wrote last time looks quite imperfect now; many things are not handled well. For example, when you open a question on Zhihu, not all of its answers are loaded at once: when you scroll to the bottom of the answers and click "load more", another batch of answers is loaded. So if you simply send a request to the question's URL, the page you get back is incomplete. Also, when we download pictures by requesting their links one after another, it is painfully slow if there are many pictures. We wrote the crawler in Node.js, yet we never used Node.js's most brilliant feature, asynchronous concurrency. What a waste!

Ideas

This crawler is an upgrade of the last one. Although the last one was simple, it is well suited to beginners. The code for this one can be found at => nodespider on my GitHub.

The overall idea of the crawler is this: at the start, we crawl part of the page's data by requesting the question's URL; then we simulate the AJAX request in code to fetch the data of the remaining pages. That part can of course also be driven asynchronously, and for small-scale asynchronous flow control there is a handy module, => eventproxy, though I do not use it here. We parse the fetched pages to extract all the picture links, and then download those pictures in bulk with controlled asynchronous concurrency.
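For reference, here is a minimal sketch of how eventproxy's after() could coordinate several page requests; the event name "page_got" and the page count of 5 are made up for illustration, since the article itself does not use this module:

var eventproxy = require("eventproxy");
var ep = new eventproxy();

// Run the aggregate callback once five "page_got" events have fired.
ep.after("page_got", 5, function (pages) {
    // pages is an array of the five emitted page bodies.
    console.log("All " + pages.length + " pages fetched");
});

// Inside each of the five page-request callbacks you would call:
// ep.emit("page_got", res.text);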

Crawling the initial data of the page is simple, so it needs no lengthy explanation. The snippets that follow assume a shared setup along these lines (a sketch; the exact requires and the config request headers are in the full source on GitHub):
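var request = require("superagent"); // HTTP requests
var cheerio = require("cheerio");    // server-side jQuery-like HTML parsing
var async = require("async");        // asynchronous flow control
var fs = require("fs");
var path = require("path");

var photos = [];   // collected picture links
var config = {};   // request headers copied from the browser (elided here)

With that in place, here is the first-screen crawl: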

/* Get all picture links on the first screen */
var getInitUrlList = function () {
    request
        .get("https://www.zhihu.com/question/") // question URL (the id is elided in the source)
        .end(function (err, res) {
            if (err) {
                console.log(err);
            } else {
                var $ = cheerio.load(res.text);
                var answerList = $(".zm-item-answer");
                answerList.map(function (i, answer) {
                    var images = $(answer).find(".zm-item-rich-text img");
                    images.map(function (i, image) {
                        photos.push($(image).attr("src"));
                    });
                });
                console.log("Successfully crawled " + photos.length + " picture links on the first screen");
                getIAjaxUrlList(20); // the initial offset is 20 (see below)
            }
        });
};

Simulate the AJAX request to get the full page

Next is how to simulate the AJAX request that is issued when you click "load more". Open the question on Zhihu with the browser's developer tools and you will see it!

With this information, you can simulate sending the same request to get the data.

/* Simulate sending the AJAX request at a short interval and collect all picture links from the results */
var getIAjaxUrlList = function (offset) {
    request
        .post("https://www.zhihu.com/node/QuestionAnswerListV2")
        .set(config) // request headers copied from the real request
        // URL-encoded form body copied from the real request; the url_token and _xsrf values are elided here
        .send("method=next&params=%7B%22url_token%22%3A...%2C%22pagesize%22%3A20%2C%22offset%22%3A" + offset + "%7D&_xsrf=...")
        .end(function (err, res) {
            if (err) {
                console.log(err);
            } else {
                /* The response text is a JSON string, so it must be deserialized */
                var response = JSON.parse(res.text);
                if (response.msg && response.msg.length) {
                    /* Join all the array elements into one HTML string; without an argument, join() would separate the elements with commas by default */
                    var $ = cheerio.load(response.msg.join(""));
                    var answerList = $(".zm-item-answer");
                    answerList.map(function (i, answer) {
                        var images = $(answer).find(".zm-item-rich-text img");
                        images.map(function (i, image) {
                            photos.push($(image).attr("src"));
                        });
                    });
                    setTimeout(function () {
                        offset += 20;
                        console.log("Successfully crawled " + photos.length + " picture links");
                        getIAjaxUrlList(offset);
                    }, 20); // pause between requests; the delay value is elided in the source, 20 ms assumed
                } else {
                    console.log("All picture links obtained; there are " + photos.length + " in total");
                    // console.log(photos);
                    return downloadImg(10); // concurrency limit; the value is discussed below
                }
            }
        });
};

In code we POST this request to https://www.zhihu.com/node/QuestionAnswerListV2, copying the original request headers and request parameters as our own. Superagent's set method sets the request headers, and its send method sends the request parameters. In the request parameters, offset starts at 20 and is increased by 20 each time before the request is re-sent; this is equivalent to each AJAX request fetching the latest 20 answers. Each time we get the data, we process it by joining it into one whole chunk of HTML, which makes the subsequent link extraction easy.
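To see what the URL-encoded params string actually carries, you can decode it; a quick illustration, where the url_token value is made up:

// Node.js REPL illustration (hypothetical url_token):
decodeURIComponent("%7B%22url_token%22%3A12345678%2C%22pagesize%22%3A20%2C%22offset%22%3A20%7D");
// => '{"url_token":12345678,"pagesize":20,"offset":20}'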

Asynchronous concurrency control to download the pictures

Once all the picture links have been obtained, that is, once response.msg comes back empty, we start downloading the pictures. Downloading them one by one is clearly not an option, because as you can see we have a lot of pictures: more than 20,000 of them. Luckily, Node.js has its fantastic single-threaded asynchronous model, so we can download many of them at the same time. But then a problem arises: I have heard that sending too many requests at once gets your IP banned by the site! Is that true? I don't know, I haven't tried, because I don't want to try ( ̄ー ̄〃), so we need to put some control on the number of concurrent asynchronous requests.

Here we use a magical module, => async, which not only helps us escape the hard-to-maintain callback pyramid of doom, but also makes it easy to manage asynchronous flow. See its documentation for the rest; since I barely use it here, we only need the powerful async.mapLimit method. It really is powerful.

var requestAndWrite = function (url, callback) {
    request.get(url).end(function (err, res) {
        if (err) {
            console.log(err);
            console.log("A picture request failed...");
        } else {
            var fileName = path.basename(url);
            fs.writeFile("./img/" + fileName, res.body, function (err) {
                if (err) {
                    console.log(err);
                    console.log("A picture write failed...");
                } else {
                    console.log("Picture downloaded successfully");
                    /* callback apparently must be invoked; its second argument is collected into result, an array passed to the final callback */
                    callback(null, "Success!");
                }
            });
        }
    });
};

var downloadImg = function (asyncNum) {
    /* Some picture link addresses are incomplete and lack the "http:" prefix; patch them to full URLs */
    for (var i = 0; i < photos.length; i++) {
        if (photos[i].indexOf("http") === -1) {
            photos[i] = "http:" + photos[i];
        }
    }
    console.log("About to download the pictures asynchronously; current concurrency: " + asyncNum);
    async.mapLimit(photos, asyncNum, function (photo, callback) {
        console.log("A picture entered the download queue");
        requestAndWrite(photo, callback);
    }, function (err, result) {
        if (err) {
            console.log(err);
        } else {
            // console.log(result); // <= outputs an array of "Success!" strings
            console.log("All downloaded!");
        }
    });
};
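With everything defined, a minimal entry point might look like this (a sketch; note that fs.writeFile does not create directories, so ./img must exist first):

// Make sure the output directory exists before any fs.writeFile call.
if (!fs.existsSync("./img")) {
    fs.mkdirSync("./img");
}

// Kick off the whole pipeline: first screen -> paginated AJAX requests -> downloads.
getInitUrlList();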

The first parameter of the mapLimit method, photos, is the array of all picture links, which is also the target of our concurrent requests; asyncNum is the number of concurrent requests. Without this argument, more than 20,000 requests would be fired off at once and, well, your IP would be successfully banned. But with this parameter, say with a value of 10, mapLimit only ever has 10 requests in flight at once: as soon as one of them is answered, the next link is taken from the array and requested. For what it's worth, a concurrency of 100 was no problem for me and the download speed was super fast; if you find out more, come and tell me...
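As a standalone illustration of these semantics (a toy sketch, not part of the crawler):

var async = require("async");

// Square each number, with at most 2 tasks in flight at a time.
async.mapLimit([1, 2, 3, 4, 5], 2, function (n, callback) {
    setTimeout(function () {
        callback(null, n * n); // first argument is the error, second the mapped value
    }, 100);
}, function (err, results) {
    console.log(results); // => [1, 4, 9, 16, 25], in the original order
});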

That covers the asynchronous concurrency control part of this advanced Node.js crawler tutorial. I hope it is helpful to everyone.
