A front-end researcher at a Beijing startup, focused on sharing tricks for the HTML5 rapid app development tool WeX5 and related front-end technology.
Objective
Crawlers can do a lot of things: single guys can use one to gather intel on girls, girls can use one to collect the little things they want, and the money-minded can analyze the relationship between Weibo posts and stock movements. The possibilities are endless.
Feeling a little tempted?
Setting aside machine learning and other seemingly lofty data-processing techniques, simply writing a crawler to fetch data is easy. For front-enders, being born in the era of Node.js is a real blessing, so below we'll build a crawler with Node.js.
This time we'll practice on CSDN; crawling pictures of girls or anything else works much the same way. The feature to implement: crawl the posts in the front-end section and output each author's information.
First, the tools. Node.js must of course be installed; I'm using the latest version at the time of writing, 6.2.0. Since we'll be fetching content through various HTTP requests, a good HTTP debugging tool is also needed, and I recommend Postman.
Build the Environment
Pick the third-party libraries according to the functionality the crawler needs:
- Back-end service: express
- Making HTTP requests: superagent
- Controlling concurrent requests: async + eventproxy
- Parsing web content: cheerio
For each library's API, please see the corresponding GitHub project page; I won't repeat it here. The corresponding package.json:
{ "name":"Spider", "version":"0.0.0", "Description":"Learn Nodejs on GitHub", "Scripts":{"start": "Node App.js" }, "Dependencies":{"async": "^2.0.0-rc.6", "cheerio": "^0.20.0", "eventproxy":
"^0.3.4", "
Express":
"^4.9.5", "
superagent":
"^2.0.0" }
,}
Once package.json is written, remember to install the dependencies; with that, the development environment is ready.
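For example, from the project directory (npm ships with Node.js, so no extra setup should be needed):

npm install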
Crawler body
Next, let's write the crawler body step by step, following the functions it needs.
Background Services Section
The job here is to receive the request from the front end, start the crawler, and send the crawled information back once it's done. For the back-end service I used the express framework, which is simpler than using the native http module directly. The skeleton looks like this:
var express = require('express');
var app = express();

app.get('/', function (req, res, next) {
    // your code here
});

app.listen(3000, function (req, res) {
    console.log('app is running at port 3000');
});
Our response code goes inside the get handler, including starting the crawler and sending back the resulting information.
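As a rough sketch of how that handler might look, where startCrawler is a hypothetical helper standing in for the crawling logic developed below:

app.get('/', function (req, res, next) {
    // kick off the crawler, then send back whatever it collected
    startCrawler(function (err, results) {
        if (err) {
            return next(err);
        }
        res.send(results);
    });
});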
Crawling of article links
Here we use the superagent library. Its author is a prolific god; most of the libraries we use here were written by him. Tremble, young man!
superagent.get(url).end(function (err, res) {
    if (err) {
        return next(err);
    }
    // your code here
});
Here url is the address we request. A get request has the same effect as opening the URL in a browser; the returned data is put in res, and by parsing res we can extract the data we want.
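As a minimal self-contained sketch (using the same section page that the crawler targets later), fetching one page and inspecting the response:

var superagent = require('superagent');

superagent.get('http://blog.csdn.net/web/index.html').end(function (err, res) {
    if (err) {
        return console.error(err);
    }
    // res.text holds the raw HTML of the page, res.status the HTTP status code
    console.log(res.status, res.text.length);
});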
Processing of data
Here we use the cheerio library, which lets us manipulate the returned data with jQuery-like syntax. Really sweet.
// Extract the authors' blog links, taking care to deduplicate
var $ = cheerio.load(sres.text);
$('.blog_list').each(function (i, e) {
    var u = $('.user_name', e).attr('href');
    if (authorUrls.indexOf(u) === -1) {
        authorUrls.push(u);
    }
});
Feels familiar, doesn't it? That is exactly jQuery's syntax.
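Here is a tiny stand-alone sketch of that jQuery-style API, run against a hard-coded HTML snippet (the markup and link are made up for illustration):

var cheerio = require('cheerio');

var html = '<div class="blog_list"><a class="user_name" href="/someone">someone</a></div>';
var $ = cheerio.load(html);
$('.blog_list').each(function (i, e) {
    // same selector style as the crawler code above
    console.log($('.user_name', e).attr('href')); // "/someone"
});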
Crawling of author information in the article
Here we go to each author's homepage and crawl the corresponding information, in the same way we crawled the article links.
superagent.get(authorUrl).end(function (err, ssres) {
    if (err) {
        callback(err, authorUrl + ' error happened!');
    }
    var $ = cheerio.load(ssres.text);
    var result = {
        userId: url.parse(authorUrl).pathname.substring(1),
        blogTitle: $('#blog_title a').text(),
        visitCount: parseInt($('#blog_rank>li').eq(0).text().split(/[::]/)[1]),
        score: parseInt($('#blog_rank>li').eq(1).text().split(/[::]/)[1]),
        oriCount: parseInt($('#blog_statistics>li').eq(0).text().split(/[::]/)[1]),
        copyCount: parseInt($('#blog_statistics>li').eq(1).text().split(/[::]/)[1]),
        trsCount: parseInt($('#blog_statistics>li').eq(2).text().split(/[::]/)[1]),
        cmtCount: parseInt($('#blog_statistics>li').eq(3).text().split(/[::]/)[1])
    };
    callback(null, result);
});
Here we use the callback to return the result.
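The callback follows the usual Node convention of error first, result second. A hypothetical caller might look like this (fetchAuthor and the field values are only for illustration):

function fetchAuthor(authorUrl, callback) {
    // ...crawl the author page as shown above...
    callback(null, { userId: 'someone', blogTitle: 'A sample blog' });
}

fetchAuthor('http://blog.csdn.net/someone', function (err, result) {
    if (err) {
        return console.error(err);
    }
    console.log(result);
});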
Concurrency Control
Because our requests are asynchronous, the next step has to run inside the success callback, and with multiple requests in flight we need a counter to know when they have all completed. Here the eventproxy library manages the concurrent results for us.
The front-end section of CSDN has 3 pages, so we crawl article links 3 times. With eventproxy it is written like this:
var baseUrl = 'http://blog.csdn.net/web/index.html';
var pageUrls = [];
for (var _i = 1; _i < 4; _i++) {
    pageUrls.push(baseUrl + '?&page=' + _i);
}

ep.after('get_topic_html', pageUrls.length, function (eps) {
    // all article links have been crawled
});

pageUrls.forEach(function (page) {
    superagent.get(page).end(function (err, sres) {
        // crawl the article links here
        ep.emit('get_topic_html', 'get authorUrls successful');
    });
});
Simply put, ep.after listens for the 'get_topic_html' event emitted by ep.emit, and calls its callback after the event has fired the specified number of times.
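A minimal stand-alone sketch of that after/emit pairing, with timers standing in for the HTTP requests:

var eventproxy = require('eventproxy');
var ep = eventproxy();

// the callback runs only after 'done' has been emitted 3 times;
// the values passed to emit are collected into the list argument
ep.after('done', 3, function (list) {
    console.log('all finished:', list);
});

['a', 'b', 'c'].forEach(function (task) {
    setTimeout(function () {
        ep.emit('done', task);
    }, 10);
});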
Concurrent Request Count Control
That could be the end of it, but because we crawl the author information asynchronously, dozens or even hundreds of requests could hit the target site at the same time. For safety's sake the site might reject our requests, so we have to limit the concurrency, and here we use the async library to do it.
// Limit the maximum concurrency to 5; the final callback receives the whole array of results returned via callback
async.mapLimit(authorUrls, 5, function (myurl, callback) {
    // request the author information
    fetchUrl(myurl, callback);
}, function (err, result) {
    console.log('=========== result: ===========\n', result);
    res.send(result);
});
Here authorUrls is the array of author links we crawled in the previous step; async works through the whole array while keeping at most 5 requests running at once. In the author-information section we returned the data through a callback, which is exactly the interface async expects. Once every element of the array has been processed, the data returned through callback is collected into the result array, which is then sent back to the front end.
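A small self-contained sketch of async.mapLimit with dummy data, showing the shape of the calls and of the collected results:

var async = require('async');

var items = ['a', 'b', 'c', 'd'];
// at most 2 items are processed at the same time
async.mapLimit(items, 2, function (item, callback) {
    setTimeout(function () {
        callback(null, item.toUpperCase());
    }, 100);
}, function (err, results) {
    console.log(results); // [ 'A', 'B', 'C', 'D' ]
});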
Effect
Start the service with node app.js, then request http://localhost:3000 in Postman to view the result:
You can see that the data we need is returned in the response body.
At this point our little crawler is finished. Simple, isn't it?
The complete code
/**
 * Created by Justeptech on 2016/7/11.
 */
var cheerio = require('cheerio');
var superagent = require('superagent');
var async = require('async');
var url = require('url');

var express = require('express');
var app = express();

var eventproxy = require('eventproxy');
var ep = eventproxy();

var baseUrl = 'http://blog.csdn.net/web/index.html';
var pageUrls = [];
for (var _i = 1; _i < 4; _i++) {
    pageUrls.push(baseUrl + '?&page=' + _i);
}

app.get('/', function (req, res, next) {
    var authorUrls = [];
    // Tell ep to wait until 'get_topic_html' has been emitted 3 times before moving on
    ep.after('get_topic_html', pageUrls.length, function (eps) {
        var concurrencyCount = 0;
        // Use the callback to return each result; the whole result array is collected at the end
        var fetchUrl = function (myurl, callback) {
            var fetchStart = new Date().getTime();
            concurrencyCount++;
            console.log('Current concurrency is', concurrencyCount, ', now crawling', myurl);
            superagent.get(myurl).end(function (err, ssres) {
                if (err) {
                    callback(err, myurl + ' error happened!');
                }
                var time = new Date().getTime() - fetchStart;
                console.log('Crawled ' + myurl + ' successfully, took ' + time + ' ms');
                concurrencyCount--;
                var $ = cheerio.load(ssres.text);
                var result = {
                    userId: url.parse(myurl).pathname.substring(1),
                    blogTitle: $('#blog_title a').text(),
                    visitCount: parseInt($('#blog_rank>li').eq(0).text().split(/[::]/)[1]),
                    score: parseInt($('#blog_rank>li').eq(1).text().split(/[::]/)[1]),
                    oriCount: parseInt($('#blog_statistics>li').eq(0).text().split(/[::]/)[1]),
                    copyCount: parseInt($('#blog_statistics>li').eq(1).text().split(/[::]/)[1]),
                    trsCount: parseInt($('#blog_statistics>li').eq(2).text().split(/[::]/)[1]),
                    cmtCount: parseInt($('#blog_statistics>li').eq(3).text().split(/[::]/)[1])
                };
                callback(null, result);
            });
        };
        // Limit the maximum concurrency to 5; the final callback receives the whole result array
        async.mapLimit(authorUrls, 5, function (myurl, callback) {
            fetchUrl(myurl, callback);
        }, function (err, result) {
            console.log('=========== result: ===========\n', result);
            res.send(result);
        });
    });

    // Collect the author links from each page; no need to pass them via emit, since authorUrls is already shared
    pageUrls.forEach(function (page) {
        superagent.get(page).end(function (err, sres) {
            // general error handling
            if (err) {
                return next(err);
            }
            // Extract the authors' blog links, taking care to deduplicate
            var $ = cheerio.load(sres.text);
            $('.blog_list').each(function (i, e) {
                var u = $('.user_name', e).attr('href');
                if (authorUrls.indexOf(u) === -1) {
                    authorUrls.push(u);
                }
            });
            console.log('get authorUrls successful!\n', authorUrls);
            ep.emit('get_topic_html', 'get authorUrls successful');
        });
    });
});

app.listen(3000, function (req, res) {
    console.log('app is running at port 3000');
});
That wraps up building a small crawler with Node.js. Writing all this up wasn't easy, so give it a like if it helped!
Build a little crawler with Node.js.