Node.js Crawler Advanced Tutorial: Asynchronous Concurrency Control


The little crawler I wrote before now looks very imperfect; a lot of things weren't handled well. For example, when you open a Zhihu question, not all of its answers are loaded at first: when you scroll to the bottom of the answers and click "Load more", another batch is loaded. So if you simply send a request to the question's link, the page you get back is incomplete. Likewise, we downloaded the images one after another; if there are a lot of them, the download honestly runs until you've fallen asleep. And although we wrote the crawler in Node.js, we never used Node.js's best feature, asynchronous concurrency. What a waste.

Ideas

This crawler is an upgrade of the previous version. That said, although the previous one was simple, it is very suitable for beginners to learn from. The code for this crawler can be found on my GitHub => nodespider.

The idea of the whole crawler: first we request the question's link to crawl part of the page data; then we simulate the AJAX request in code to capture the data of the remaining pages. That part could also be done with asynchronous concurrency; for small-scale async flow control you can use the module => eventproxy, but I don't use it here. After analyzing the fetched pages we extract the links of all the images, and then download the images in bulk with controlled asynchronous concurrency.
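Since eventproxy is only mentioned in passing, here is a minimal sketch of the kind of flow control it offers, for reference; the event name "page" and the count of five are made up for illustration:

var EventProxy = require("eventproxy");
var ep = new EventProxy();

/* run the callback once the "page" event has fired 5 times;
   pages collects the five emitted values in order */
ep.after("page", 5, function (pages) {
    console.log("all " + pages.length + " pages fetched");
});

for (var i = 0; i < 5; i++) {
    /* a real crawler would emit inside each request's response callback */
    ep.emit("page", "page data " + i);
}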

Crawling the initial page data is very simple, so I won't dwell on it here.
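Since that step isn't shown, here is a minimal sketch of what the initial fetch might look like, assuming the same superagent and cheerio modules and the selectors used later in this article; the question URL is a placeholder:

var request = require("superagent");
var cheerio = require("cheerio");
var photos = [];

request
    .get("https://www.zhihu.com/question/QUESTION_ID") /* placeholder URL */
    .end(function (err, res) {
        if (err) return console.log(err);
        var $ = cheerio.load(res.text);
        /* collect the image links from the answers already present in the page */
        $(".zm-item-answer .zm-item-rich-text img").each(function (i, img) {
            photos.push($(img).attr("src"));
        });
        console.log("got " + photos.length + " picture links from the first page");
    });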

Simulating the AJAX request to get the full page

Next up is how to simulate the AJAX request that fires when "Load more" is clicked; if you want to know what it looks like, go and inspect the request in the browser!

With this information, you can simulate sending the same request to get the data.

/* every so often, simulate the AJAX request and collect every image link in the result */
var getIAjaxUrlList = function (offset) {
    request
        .post("https://www.zhihu.com/node/QuestionAnswerListV2")
        .set(config)
        /* params is the URL-encoded JSON {"url_token":…,"pagesize":"20","offset":…};
           QUESTION_ID and XSRF_TOKEN are placeholders for values lost in transcription */
        .send("method=next&params=%7B%22url_token%22%3A" + QUESTION_ID +
              "%2C%22pagesize%22%3A%2220%22%2C%22offset%22%3A" + offset +
              "%7D&_xsrf=" + XSRF_TOKEN)
        .end(function (err, res) {
            if (err) {
                console.log(err);
            } else {
                /* res.text is a JSON string, so deserialize it with JSON.parse */
                var response = JSON.parse(res.text);
                if (response.msg && response.msg.length) {
                    /* join("") stitches all the array elements into one chunk of HTML;
                       a bare join() would separate the elements with commas */
                    var $ = cheerio.load(response.msg.join(""));
                    var answerList = $(".zm-item-answer");
                    answerList.map(function (i, answer) {
                        var images = $(answer).find(".zm-item-rich-text img");
                        images.map(function (i, image) {
                            photos.push($(image).attr("src"));
                        });
                    });
                    setTimeout(function () {
                        offset += 20;
                        console.log("successfully crawled " + photos.length + " picture links");
                        getIAjaxUrlList(offset);
                    }, 2000); /* the original interval was lost in transcription; 2000 ms assumed */
                } else {
                    console.log("picture links all fetched, " + photos.length + " in total");
                    // console.log(photos);
                    return downloadImg(10); /* the original argument was lost; 10 assumed, see below */
                }
            }
        });
};

In the code we POST to https://www.zhihu.com/node/QuestionAnswerListV2, copying the request headers and request parameters of the original request from the browser: superagent's set method sets the request headers, and its send method sends the request parameters. We start offset at 20 in the request parameters, add 20 to it at each interval, and resend the request. So every interval we send one AJAX request and get the latest 20 records; each time a batch of data arrives, we join it into one whole chunk of HTML to make the link extraction that follows easy.

Downloading images with asynchronous concurrency control

Once all the image links have been collected, that is, once response.msg comes back empty, we have to download these images. Downloading them one at a time is simply not feasible, because as you can see, our picture count is... yes, more than 20,000. Fortunately, Node.js has its magical single-threaded asynchronous nature, which lets us download these images concurrently. But then a problem arises: I've heard that if you send too many requests at once, the website will ban your IP! Is that true? I don't know, I haven't tried, because I don't want to try ( ̄ー ̄〃). So this time we need some control over the number of concurrent asynchronous requests.

A magical module => async is used here. It not only helps us escape the hard-to-maintain callback pyramid of doom, it also makes asynchronous flow easy to manage. See its documentation for the details; since I don't know it well myself, I will only use the powerful async.mapLimit method here. It's really awesome.

var requestAndWrite = function (url, callback) {
    request
        .get(url)
        .end(function (err, res) {
            if (err) {
                console.log(err);
                console.log("a picture request failed...");
            } else {
                var fileName = path.basename(url);
                fs.writeFile("./img/" + fileName, res.body, function (err) {
                    if (err) {
                        console.log(err);
                        console.log("a picture failed to be written...");
                    } else {
                        console.log("picture downloaded successfully");
                        /* the callback must be called; its second argument is collected
                           into the result array passed to mapLimit's final callback */
                        callback(null, "successful!");
                    }
                });
            }
        });
};

var downloadImg = function (asyncNum) {
    /* some image links are incomplete and lack the "http:" prefix; patch them up */
    for (var i = 0; i < photos.length; i++) {
        if (photos[i].indexOf("http") === -1) {
            photos[i] = "http:" + photos[i];
        }
    }
    console.log("downloading pictures with asynchronous concurrency, current concurrency: " + asyncNum);
    async.mapLimit(photos, asyncNum, function (photo, callback) {
        console.log("up to " + asyncNum + " pictures are in the download queue");
        requestAndWrite(photo, callback);
    }, function (err, result) {
        if (err) {
            console.log(err);
        } else {
            // console.log(result); <= outputs an array of many "successful!" strings
            console.log("all downloaded!");
        }
    });
};

Look at the mapLimit call first.

The first parameter of mapLimit, photos, is the array of all the image links, which is also the target of our concurrent requests. asyncNum limits the number of concurrent requests: without this parameter, more than 20,000 requests would be fired off at once, and, well, your IP would be successfully banned. With the parameter set to, say, 10, mapLimit only takes 10 links from the array at a time and requests them concurrently, starting the next one as soon as one of those requests gets its response, so there are never more than 10 in flight. Let me tell you, a concurrency of 100 is no problem at all, and the download speed is super fast; beyond that I don't know, you come and tell me...
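To see the limiting behavior in isolation, here is a tiny self-contained sketch of async.mapLimit; the URLs and the setTimeout standing in for a real download are made up for illustration:

var async = require("async");

/* ten fake tasks standing in for the photos array */
var urls = [];
for (var i = 0; i < 10; i++) {
    urls.push("http://example.com/" + i + ".jpg");
}

/* at most 3 tasks run at once; each callback(null, value) result is
   collected, in input order, into the final results array */
async.mapLimit(urls, 3, function (url, callback) {
    setTimeout(function () {
        console.log("finished " + url);
        callback(null, "successful!");
    }, 100);
}, function (err, results) {
    if (err) return console.log(err);
    console.log(results.length + " items done"); // => 10 items done
});

With the limit set to 3, at most three of the fake downloads run at any moment, and each slot is refilled as soon as its task finishes.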

That's all for this Node.js crawler advanced tutorial on asynchronous concurrency control; I hope it's helpful to everyone.
