A front-end researcher at a Beijing startup, focused on sharing tricks for the HTML5 rapid app development tool WeX5 and related front-end technology.
Objective
Crawlers can do a lot of things: single guys can use one to gather intel on girls, girls can use one to collect the little things they want, and the money-minded can analyze the relationship between Weibo posts and stock movements. The possibilities are endless.
Feeling a little tempted?
Setting aside machine learning and other seemingly lofty data-processing techniques, simply writing a crawler to fetch data is easy. For front-enders, being born in the era of Node.js is a real blessing, so below we'll build a crawler with Node.js.
This time we'll practice on CSDN; crawling pictures of girls or anything else works much the same way. The feature to implement: crawl the posts in the front-end section and output each author's information.
First, the tools. Node.js must of course be installed; I'm using the latest version at the time of writing, 6.2.0. Since we'll be fetching content through various HTTP requests, a good HTTP debugging tool is also needed, and I recommend Postman.
Build the Environment
Pick the third-party libraries according to the functionality the crawler needs:
- Back-end service: express
- Making HTTP requests: superagent
- Controlling concurrent requests: async + eventproxy
- Parsing web content: cheerio
For each library's API, please see the corresponding GitHub project page; I won't repeat it here. The corresponding package.json:
{ "name":"Spider", "version":"0.0.0", "Description":"Learn Nodejs on GitHub", "Scripts":{"start": "Node App.js" }, "Dependencies":{"async": "^2.0.0-rc.6", "cheerio": "^0.20.0", "eventproxy":
"^0.3.4", "
Express":
"^4.9.5", "
superagent":
"^2.0.0" }
,}
Once package.json is written, remember to install the dependencies; with that, the development environment is ready.
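For example, from the project directory (npm ships with Node.js, so no extra setup should be needed):

npm install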
Crawler body
Next, let's write the crawler body step by step, following the functions it needs.
Background Services Section
The job here is to receive the request from the front end, start the crawler, and send the crawled information back once it's done. For the back-end service I used the express framework, which is simpler than using the native http module directly. The skeleton looks like this:
var express = require('express');
var app = express();

app.get('/', function (req, res, next) {
    // your code here
});

app.listen(3000, function (req, res) {
    console.log('app is running at port 3000');
});
Our response code goes inside the get handler, including starting the crawler and sending back the resulting information.
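As a rough sketch of how that handler might look, where startCrawler is a hypothetical helper standing in for the crawling logic developed below:

app.get('/', function (req, res, next) {
    // kick off the crawler, then send back whatever it collected
    startCrawler(function (err, results) {
        if (err) {
            return next(err);
        }
        res.send(results);
    });
});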
Crawling of article links
Here we use the superagent library. Its author is a prolific god; most of the libraries we use here were written by him. Tremble, young man!
superagent.get(url).end(function (err, res) {
    if (err) {
        return next(err);
    }
    // your code here
});
Here url is the address we request. A get request has the same effect as opening the URL in a browser; the returned data is put in res, and by parsing res we can extract the data we want.
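As a minimal self-contained sketch (using the same section page that the crawler targets later), fetching one page and inspecting the response:

var superagent = require('superagent');

superagent.get('http://blog.csdn.net/web/index.html').end(function (err, res) {
    if (err) {
        return console.error(err);
    }
    // res.text holds the raw HTML of the page, res.status the HTTP status code
    console.log(res.status, res.text.length);
});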
Processing of data
Here we use the cheerio library, which lets us manipulate the returned data with jQuery-like syntax. Really sweet.
// Extract the authors' blog links, taking care to deduplicate
var $ = cheerio.load(sres.text);
$('.blog_list').each(function (i, e) {
    var u = $('.user_name', e).attr('href');
    if (authorUrls.indexOf(u) === -1) {
        authorUrls.push(u);
    }
});
Feels familiar, doesn't it? That is exactly jQuery's syntax.
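Here is a tiny stand-alone sketch of that jQuery-style API, run against a hard-coded HTML snippet (the markup and link are made up for illustration):

var cheerio = require('cheerio');

var html = '<div class="blog_list"><a class="user_name" href="/someone">someone</a></div>';
var $ = cheerio.load(html);
$('.blog_list').each(function (i, e) {
    // same selector style as the crawler code above
    console.log($('.user_name', e).attr('href')); // "/someone"
});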
Crawling of author information in the article
Here we go to each author's homepage and crawl the corresponding information, in the same way we crawled the article links.
superagent.get(authorUrl).end(function (err, ssres) {
    if (err) {
        callback(err, authorUrl + ' error happened!');
    }
    var $ = cheerio.load(ssres.text);
    var result = {
        userId: url.parse(authorUrl).pathname.substring(1),
        blogTitle: $('#blog_title a').text(),
        visitCount: parseInt($('#blog_rank>li').eq(0).text().split(/[::]/)[1]),
        score: parseInt($('#blog_rank>li').eq(1).text().split(/[::]/)[1]),
        oriCount: parseInt($('#blog_statistics>li').eq(0).text().split(/[::]/)[1]),
        copyCount: parseInt($('#blog_statistics>li').eq(1).text().split(/[::]/)[1]),
        trsCount: parseInt($('#blog_statistics>li').eq(2).text().split(/[::]/)[1]),
        cmtCount: parseInt($('#blog_statistics>li').eq(3).text().split(/[::]/)[1])
    };
    callback(null, result);
});
Here we use the callback to return the result.
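The callback follows the usual Node convention of error first, result second. A hypothetical caller might look like this (fetchAuthor and the field values are only for illustration):

function fetchAuthor(authorUrl, callback) {
    // ...crawl the author page as shown above...
    callback(null, { userId: 'someone', blogTitle: 'A sample blog' });
}

fetchAuthor('http://blog.csdn.net/someone', function (err, result) {
    if (err) {
        return console.error(err);
    }
    console.log(result);
});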
Concurrency Control
Because our requests are asynchronous, the next step has to run inside the success callback, and with multiple requests in flight we need a counter to know when they have all completed. Here the eventproxy library manages the concurrent results for us.
The front-end section of CSDN has 3 pages, so we crawl article links 3 times. With eventproxy it is written like this:
var baseUrl = 'http://blog.csdn.net/web/index.html';
var pageUrls = [];
for (var _i = 1; _i < 4; _i++) {
    pageUrls.push(baseUrl + '?&page=' + _i);
}

ep.after('get_topic_html', pageUrls.length, function (eps) {
    // all article links have been crawled
});

pageUrls.forEach(function (page) {
    superagent.get(page).end(function (err, sres) {
        // crawl the article links here
        ep.emit('get_topic_html', 'get authorUrls successful');
    });
});
Simply put, ep.after listens for the 'get_topic_html' event emitted by ep.emit, and calls its callback after the event has fired the specified number of times.
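A minimal stand-alone sketch of that after/emit pairing, with timers standing in for the HTTP requests:

var eventproxy = require('eventproxy');
var ep = eventproxy();

// the callback runs only after 'done' has been emitted 3 times;
// the values passed to emit are collected into the list argument
ep.after('done', 3, function (list) {
    console.log('all finished:', list);
});

['a', 'b', 'c'].forEach(function (task) {
    setTimeout(function () {
        ep.emit('done', task);
    }, 10);
});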
Concurrent Request Count Control
That could be the end of it, but because we crawl the author information asynchronously, dozens or even hundreds of requests could hit the target site at the same time. For safety's sake the site might reject our requests, so we have to limit the concurrency, and here we use the async library to do it.
// Limit the maximum concurrency to 5; the final callback receives the whole array of results returned via callback
async.mapLimit(authorUrls, 5, function (myurl, callback) {
    // request the author information
    fetchUrl(myurl, callback);
}, function (err, result) {
    console.log('=========== result: ===========\n', result);
    res.send(result);
});
Here authorUrls is the array of author links we crawled in the previous step; async works through the whole array while keeping at most 5 requests running at once. In the author-information section we returned the data through a callback, which is exactly the interface async expects. Once every element of the array has been processed, the data returned through callback is collected into the result array, which is then sent back to the front end.
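A small self-contained sketch of async.mapLimit with dummy data, showing the shape of the calls and of the collected results:

var async = require('async');

var items = ['a', 'b', 'c', 'd'];
// at most 2 items are processed at the same time
async.mapLimit(items, 2, function (item, callback) {
    setTimeout(function () {
        callback(null, item.toUpperCase());
    }, 100);
}, function (err, results) {
    console.log(results); // [ 'A', 'B', 'C', 'D' ]
});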
Effect
Start the service with node app.js, then request http://localhost:3000 in Postman to view the result:
You can see that the data we need is returned in the response body.
At this point our little crawler is finished. Simple, isn't it?
The complete code
/**
 * Created by Justeptech on 2016/7/11.
 */
var cheerio = require('cheerio');
var superagent = require('superagent');
var async = require('async');
var url = require('url');

var express = require('express');
var app = express();

var eventproxy = require('eventproxy');
var ep = eventproxy();

var baseUrl = 'http://blog.csdn.net/web/index.html';
var pageUrls = [];
for (var _i = 1; _i < 4; _i++) {
    pageUrls.push(baseUrl + '?&page=' + _i);
}

app.get('/', function (req, res, next) {
    var authorUrls = [];
    // Tell ep to wait until 'get_topic_html' has been emitted 3 times before moving on
    ep.after('get_topic_html', pageUrls.length, function (eps) {
        var concurrencyCount = 0;
        // Use the callback to return each result; the whole result array is collected at the end
        var fetchUrl = function (myurl, callback) {
            var fetchStart = new Date().getTime();
            concurrencyCount++;
            console.log('Current concurrency is', concurrencyCount, ', now crawling', myurl);
            superagent.get(myurl).end(function (err, ssres) {
                if (err) {
                    callback(err, myurl + ' error happened!');
                }
                var time = new Date().getTime() - fetchStart;
                console.log('Crawled ' + myurl + ' successfully, took ' + time + ' ms');
                concurrencyCount--;
                var $ = cheerio.load(ssres.text);
                var result = {
                    userId: url.parse(myurl).pathname.substring(1),
                    blogTitle: $('#blog_title a').text(),
                    visitCount: parseInt($('#blog_rank>li').eq(0).text().split(/[::]/)[1]),
                    score: parseInt($('#blog_rank>li').eq(1).text().split(/[::]/)[1]),
                    oriCount: parseInt($('#blog_statistics>li').eq(0).text().split(/[::]/)[1]),
                    copyCount: parseInt($('#blog_statistics>li').eq(1).text().split(/[::]/)[1]),
                    trsCount: parseInt($('#blog_statistics>li').eq(2).text().split(/[::]/)[1]),
                    cmtCount: parseInt($('#blog_statistics>li').eq(3).text().split(/[::]/)[1])
                };
                callback(null, result);
            });
        };
        // Limit the maximum concurrency to 5; the final callback receives the whole result array
        async.mapLimit(authorUrls, 5, function (myurl, callback) {
            fetchUrl(myurl, callback);
        }, function (err, result) {
            console.log('=========== result: ===========\n', result);
            res.send(result);
        });
    });

    // Collect the author links from each page; no need to pass them via emit, since authorUrls is already shared
    pageUrls.forEach(function (page) {
        superagent.get(page).end(function (err, sres) {
            // general error handling
            if (err) {
                return next(err);
            }
            // Extract the authors' blog links, taking care to deduplicate
            var $ = cheerio.load(sres.text);
            $('.blog_list').each(function (i, e) {
                var u = $('.user_name', e).attr('href');
                if (authorUrls.indexOf(u) === -1) {
                    authorUrls.push(u);
                }
            });
            console.log('get authorUrls successful!\n', authorUrls);
            ep.emit('get_topic_html', 'get authorUrls successful');
        });
    });
});

app.listen(3000, function (req, res) {
    console.log('app is running at port 3000');
});
That wraps up building a small crawler with Node.js. Writing all this up wasn't easy, so give it a like if it helped!
Build a little crawler with Node.js.