To be honest, I hesitated before writing this article: the crawled content comes from Cnblogs, and if some mischievous reader uses it for bad purposes, wouldn't that make me an accomplice?
Okay, let's get to the point.
First, the modules the crawler needs:
express
ejs
superagent (a very convenient client-side HTTP request library for Node.js)
cheerio (a jQuery-like HTML parser for Node.js)
The front-end layout uses Bootstrap.
The pagination plugin is twbsPagination.js.
The complete crawler code can be downloaded from my GitHub; the main logic is in router.js.
1. Crawling the first page of a category
Analysis process:
Open the Cnblogs home page: http://www.cnblogs.com/
The left navigation bar shows all the category links, which can be inspected in the developer tools.
Each category's URL is also very regular: www.cnblogs.com/cate/<category name>. Based on this URL, we can crawl the first page of a category's posts.
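That URL pattern can be captured in a tiny helper. The function name is mine, not from the original code, which simply inlines this concatenation:

```javascript
// Build a category URL from the pattern www.cnblogs.com/cate/<category name>.
// (Helper name is hypothetical; the crawler below concatenates strings directly.)
function cateUrl(cate) {
  return 'http://www.cnblogs.com/cate/' + cate;
}

// cateUrl('javascript') -> 'http://www.cnblogs.com/cate/javascript'
```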
Paste the following code:
app.js (entry file)

    // load the modules
    var express = require('express');
    var app = express();
    var router = require('./router/router');

    // set up the template engine
    app.set('view engine', 'ejs');

    // static resource middleware
    app.use(express.static('./public'));

    // blog home page
    app.get('/cnblogs', router.cnblogs);
    // category page
    app.get('/cnblogs/cate/:cate/', router.cnblogs_cate);

    app.listen(1314, function (err) {
        if (err) console.log('Port 1314 is already in use');
    });
router.js

    var request = require('superagent');
    var cheerio = require('cheerio');

    // categories
    var cate = ['java', 'cpp', 'php', 'delphi', 'python', 'ruby', 'web', 'javascript', 'jquery', 'html5'];

    // render the page
    exports.cnblogs = function (req, res) {
        res.render('cnblogs', { cate: cate });
    };

    // crawl the data of a category
    exports.cnblogs_cate = function (req, res) {
        // category
        var cate = req.params['cate'];

        request
            .get('http://www.cnblogs.com/cate/' + cate)
            .end(function (err, sres) {
                var $ = cheerio.load(sres.text);
                var article = [];

                $('.titlelnk').each(function (index, ele) {
                    var ele = $(ele);
                    var href = ele.attr('href'); // blog link
                    var title = ele.text();      // blog title
                    article.push({ href: href, title: title });
                });

                res.json({ title: cate, cnblogs: article });
            });
    };
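The extraction above relies on cheerio, but the core idea, pulling each `.titlelnk` anchor's `href` and text, can be sketched without any dependency. The sample HTML below is hypothetical and far simpler than the real cnblogs markup; it is only meant to show the shape of the `article` array the route returns:

```javascript
// Dependency-free sketch of the link extraction done with cheerio above.
// The sample HTML is hypothetical; the real page is more complex, which is
// exactly why the crawler uses cheerio instead of a regex.
var html =
  '<a class="titlelnk" href="http://example.com/a">Post A</a>' +
  '<a class="titlelnk" href="http://example.com/b">Post B</a>';

var article = [];
var re = /<a class="titlelnk" href="([^"]+)">([^<]+)<\/a>/g;
var m;
while ((m = re.exec(html)) !== null) {
  article.push({ href: m[1], title: m[2] });
}
// article mirrors the JSON payload returned by the route:
// [{ href: '...', title: 'Post A' }, { href: '...', title: 'Post B' }]
```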
cnblogs.ejs
Only the core code is shown:
    <div class="col-lg-6">
        <select class="form-control" id="cate">
            <option value="0">Please select a category</option>
            <% for (var i = 0; i < cate.length; i++) { %>
            <option value="<%= cate[i] %>"><%= cate[i] %></option>
            <% } %>
        </select>
    </div>
JS Template
    <script type="text/template" id="cnblogs">
        <ul class="list-group">
            <li class="list-group-item">
                <a href="{{= href }}" target="_blank">{{= title }}</a>
            </li>
        </ul>
    </script>
The Ajax request:

    $('#cate').on('change', function () {
        var cate = $(this).val();
        if (cate == 0) return;
        $('.artic').html('');
        $.ajax({
            url: '/cnblogs/cate/' + cate,
            type: 'GET',
            dataType: 'json',
            success: function (data) {
                var cnblogs = data.cnblogs;
                for (var i = 0; i < cnblogs.length; i++) {
                    var compiled = _.template($('#cnblogs').html());
                    var art = compiled(cnblogs[i]);
                    $('.artic').append(art);
                }
            }
        });
    });
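One caveat: `{{= … }}` is not Underscore's default template delimiter (the default, `<%= %>`, would clash with EJS). Presumably `_.templateSettings.interpolate` is overridden somewhere in the page's scripts, e.g. `_.templateSettings = { interpolate: /\{\{=(.+?)\}\}/g };` (an assumption; that override is not shown in the original). A dependency-free sketch of what such interpolation does:

```javascript
// Minimal stand-in for _.template with {{= key }} delimiters.
// Illustration only; the page itself uses Underscore's _.template.
function renderTemplate(tpl, data) {
  return tpl.replace(/\{\{=\s*(\w+)\s*\}\}/g, function (match, key) {
    return data[key];
  });
}

var html = renderTemplate(
  '<a href="{{= href }}" target="_blank">{{= title }}</a>',
  { href: 'http://example.com/a', title: 'Post A' }
);
// html: '<a href="http://example.com/a" target="_blank">Post A</a>'
```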
Visit http://localhost:1314/cnblogs/ and you can see that the first page of the JavaScript category is fetched successfully.
2. Pagination
Take http://www.cnblogs.com/cate/javascript/ as an example.
First, the paged data is returned through an Ajax call to a back-end interface.
In Chrome's developer tools, you can see that two requests are sent to the server when paging:
PostList.aspx requests the data of a specific page.
load.aspx returns the pagination string.
We focus on analyzing the PostList.aspx interface.
You can see that the request method is POST.
The question is: how is the body of this POST request assembled?
Analyzing the page source, we find that each pagination link is bound to an event: AggSite.loadCategoryPostList().
Viewing the page source shows that this function is defined in the aggsite.js file.
Here is the function; the key is this line of code, which uses Ajax to send the request to the back end:

    this.loadPostList("/mvc/AggSite/" + aggSiteModel.ItemListActionName + ".aspx")

Analyzing the loadPostList function, we find that the POST body is the value of the variable aggSiteModel, which is defined in the page itself.
That concludes the front-end analysis. All we have to do now is use Node.js to simulate the browser sending this request.
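To make the parsing step concrete: the page embeds something like `var aggSiteModel = {...};` in an inline script, and the crawler slices out the object literal, parses it, and rewrites `PageIndex` before POSTing it back. A sketch with a hypothetical, heavily simplified model (the real field names on cnblogs may differ, apart from `PageIndex`, which the crawler below actually rewrites):

```javascript
// Hypothetical inline-script text; the real aggSiteModel has more fields.
var scriptText = 'var aggSiteModel = {"CategoryId":108698,"PageIndex":1};';

// Slice out everything between '=' and the trailing ';', then parse it.
var jsonStr = scriptText
  .slice(scriptText.indexOf('=') + 1, scriptText.lastIndexOf(';'))
  .trim();
var postData = JSON.parse(jsonStr);

// Point the model at the page we want before POSTing it back.
postData.PageIndex = 2;
```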
router.js

    exports.cate_page = function (req, res) {

        var cate = req.query.cate;
        var page = req.query.page;

        var url = 'http://www.cnblogs.com/cate/' + cate;

        request
            .get(url)
            .end(function (err, sres) {

                // build the parameters for the POST request
                var $ = cheerio.load(sres.text);
                var post_data_str = $('#pager_bottom').prev().html().trim();
                var post_data_obj = JSON.parse(
                    post_data_str.slice(post_data_str.indexOf('=') + 1,
                                        post_data_str.lastIndexOf(';')));

                // paging interface
                var page_url = 'http://www.cnblogs.com/mvc/AggSite/PostList.aspx';
                // set the requested page
                post_data_obj.PageIndex = page;

                request
                    .post(page_url)
                    .set('Origin', 'http://www.cnblogs.com')                      // forge the origin
                    .set('Referer', 'http://www.cnblogs.com/cate/' + cate + '/')  // forge the referer
                    .send(post_data_obj)                                          // POST data
                    .end(function (err, ssres) {
                        var article = [];
                        var $$ = cheerio.load(ssres.text);
                        $$('.titlelnk').each(function (index, ele) {
                            var ele = $$(ele);
                            var href = ele.attr('href');
                            var title = ele.text();
                            article.push({
                                href: href,
                                title: title
                            });
                        });
                        res.json({
                            title: cate,
                            cnblogs: article
                        });
                    });
            });
    };
cate.ejs (pagination code)

    $('.pagination').twbsPagination({
        totalPages: 20,          // show 20 pages by default
        startPage: 1,
        visiblePages: 5,
        initiateStartPageClick: false,
        first: 'First',
        prev: 'Prev',
        next: 'Next',
        last: 'Last',
        onPageClick: function (evt, page) {
            $.ajax({
                url: '/cnblogs/cate_page?cate=' + cate + '&page=' + page,
                type: 'GET',
                dataType: 'json',
                success: function (data) {
                    $('.artic').html('');
                    var cnblogs = data.cnblogs;
                    for (var i = 0; i < cnblogs.length; i++) {
                        var compiled = _.template($('#cnblogs').html());
                        var art = compiled(cnblogs[i]);
                        $('.artic').append(art);
                    }
                }
            });
        }
    });
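A small hardening note: the Ajax URL above concatenates `cate` and `page` directly, which is fine for the fixed category list here, but encoding the values is safer if category names ever contain special characters. A sketch (the helper name is mine, not from the original code):

```javascript
// Build the paging endpoint URL with encoded query parameters.
// (Hypothetical helper; the original code concatenates the values directly.)
function pageUrl(cate, page) {
  return '/cnblogs/cate_page?cate=' + encodeURIComponent(cate) +
         '&page=' + encodeURIComponent(page);
}

// pageUrl('javascript', 2) -> '/cnblogs/cate_page?cate=javascript&page=2'
```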
Visit http://localhost:1314/cnblogs/cate and you can see:
the first page of the JavaScript category,
and the second page of data.
Final words
At this point, a simple crawler is complete. The crawler itself is actually not difficult; the difficulty lies in analyzing the page structure and handling some of the business logic.
The complete code is on my GitHub; stars are welcome (☆▽☆).
Since this is my first technical blog post, my writing skills are limited and my knowledge is shallow, so if anything is wrong, I welcome corrections from fellow bloggers.
References:
"superagent Chinese documentation"
"Cheerio API read-through"
"Crawling blog posts with Node.js"