Crawling Blog Posts with Node.js

Source: Internet
Author: User
Tags: callback

To be honest, I was a little uneasy writing this article, because the site being crawled is Blog Park (cnblogs) itself. If some mischievous reader uses this to do bad things, wouldn't that make me an accomplice?

Okay, on to the subject.

First, the modules the crawler uses:

express

ejs

superagent (a very convenient HTTP request module for Node.js)

cheerio (a Node.js version of jQuery)

The front-end layout uses Bootstrap.

The paging plugin is twbsPagination.js.

The complete crawler code can be downloaded from my GitHub. The main logic is in router.js.

1. Crawl the data on page 1 of a section

Analysis Process:

Open the homepage of Blog Park: http://www.cnblogs.com/

The left navigation bar displays the categories of all sections, which can be inspected in the developer tools.

Each section's URL is also very regular: www.cnblogs.com/cate/ plus the section name. With such a URL, you can crawl page 1 of a section's blog posts.
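The URL rule above can be sketched as a tiny helper. The function name is mine, not from the original post; only the path pattern comes from the site:

```javascript
// Hypothetical helper: build the page-1 URL for a cnblogs section.
// The www.cnblogs.com/cate/<section> pattern is from the post;
// the function itself is only illustrative.
function sectionUrl(cate) {
    return 'http://www.cnblogs.com/cate/' + cate + '/';
}

console.log(sectionUrl('javascript')); // http://www.cnblogs.com/cate/javascript/
```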

Here is the code:

app.js (the entry file)

```javascript
// load modules
var express = require('express');
var app = express();
var router = require('./router/router');

// set up the template engine
app.set('view engine', 'ejs');

// static resource middleware
app.use(express.static('./public'));

// Blog Park
app.get('/cnblogs', router.cnblogs);
// section
app.get('/cnblogs/cate/:cate/', router.cnblogs_cate);

app.listen(1314, function (err) {
    if (err) console.log('port 1314 is occupied');
});
```

Router.js

```javascript
var request = require('superagent');
var cheerio = require('cheerio');

// sections
var cate = [
    'java', 'cpp', 'php', 'delphi', 'python', 'ruby',
    'web', 'javascript', 'jquery', 'html5'
];

// render the page
exports.cnblogs = function (req, res) {
    res.render('cnblogs', { cate: cate });
};

// crawl a section's data
exports.cnblogs_cate = function (req, res) {
    // section name
    var cate = req.params['cate'];
    request
        .get('http://www.cnblogs.com/cate/' + cate)
        .end(function (err, sres) {
            var $ = cheerio.load(sres.text);
            var article = [];
            $('.titlelnk').each(function (index, ele) {
                var ele = $(ele);
                var href = ele.attr('href'); // blog link
                var title = ele.text();      // blog title
                article.push({ href: href, title: title });
            });
            res.json({ title: cate, cnblogs: article });
        });
};
```

Cnblogs.ejs

Only the core code is posted:

```html
<div class="col-lg-6">
  <select class="form-control" id="cate">
    <option value="0">Please select a category</option>
    <% for (var i = 0; i < cate.length; i++) { %>
      <option value="<%= cate[i] %>"><%= cate[i] %></option>
    <% } %>
  </select>
</div>
```

JS Template

```html
<script type="text/template" id="cnblogs">
  <ul class="list-group">
    <li class="list-group-item">
      <a href="{{= href }}" target="_blank">{{= title }}</a>
    </li>
  </ul>
</script>
```

Ajax requests

```javascript
$('#cate').on('change', function () {
    var cate = $(this).val();
    if (cate == 0) return;
    $('.artic').html('');
    $.ajax({
        url: '/cnblogs/cate/' + cate,
        type: 'GET',
        dataType: 'json',
        success: function (data) {
            var cnblogs = data.cnblogs;
            for (var i = 0; i < cnblogs.length; i++) {
                var compiled = _.template($('#cnblogs').html());
                var art = compiled(cnblogs[i]);
                $('.artic').append(art);
            }
        }
    });
});
```
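One thing worth flagging: the template above uses `{{= }}` delimiters, while Underscore's default interpolation syntax is `<%= %>` (which would clash with EJS on the server side), so the page presumably reconfigures `_.templateSettings`. A minimal, Underscore-free sketch of that interpolation idea looks like this (the `render` helper is mine, for illustration only):

```javascript
// Minimal stand-in for template interpolation with {{= }} delimiters.
// Underscore achieves the same thing via _.templateSettings.interpolate.
function render(tpl, data) {
    // replace every {{= key }} with the matching property of data
    return tpl.replace(/\{\{=\s*(\w+)\s*\}\}/g, function (m, key) {
        return data[key];
    });
}

var tpl = '<a href="{{= href }}">{{= title }}</a>';
console.log(render(tpl, { href: '/post/1', title: 'Hello' }));
// <a href="/post/1">Hello</a>
```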

Visit http://localhost:1314/cnblogs/ and you can see that the page-1 JavaScript data is fetched successfully.

2. Paging function

Take http://www.cnblogs.com/cate/javascript/ as an example:

First, the paging data is returned by an Ajax call to a back-end interface.

In Chrome's developer tools, you can see that two requests are sent to the server when paging:

PostList.aspx requests the data of a specific page.

load.aspx returns the paging string.

We focus on analyzing the PostList.aspx interface:

You can see that the request method is POST.

The key question is: how is the data of the POST request assembled?

Analyzing the page source, we discover that each paging link is bound to an event handler: aggSite.loadCategoryPostList().

Looking further, we find that this function is defined in the aggsite.js file.

The function is shown below.

The key is this line of code, which uses Ajax to send the request to the back end:

this.loadPostList("/mvc/aggsite/" + aggSiteModel.ItemListActionName + ".aspx")
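That path assembly can be sketched as a tiny helper. The function name postListUrl is hypothetical; only the "/mvc/AggSite/…" pattern and the ItemListActionName field come from the page source quoted above:

```javascript
// Hypothetical sketch: rebuild the paging endpoint path the same way
// aggsite.js does, from aggSiteModel.ItemListActionName.
function postListUrl(itemListActionName) {
    return '/mvc/AggSite/' + itemListActionName + '.aspx';
}

console.log(postListUrl('PostList')); // /mvc/AggSite/PostList.aspx
```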

Analyzing the loadPostList function, we find that the POST data is the value of the variable aggSiteModel.

And aggSiteModel is defined right in the page:

This concludes the front-end analysis. All that's left is to use Node.js to simulate the browser sending these requests.
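Before the full route, here is a minimal sketch of the extraction step: pull the aggSiteModel object literal out of the inline script text and change its PageIndex. The helper name and the sample script string are assumptions for illustration; the real page embeds a much larger object:

```javascript
// Hypothetical sketch: parse "var aggSiteModel = {...};" into an object
// by taking everything between the first '=' and the trailing ';'.
function parseAggSiteModel(scriptText) {
    var start = scriptText.indexOf('=') + 1;
    var end = scriptText.lastIndexOf(';');
    return JSON.parse(scriptText.slice(start, end));
}

// Simplified sample of the inline script (the real object has more fields).
var script = 'var aggSiteModel = {"CategoryId":108698,"PageIndex":1};';
var model = parseAggSiteModel(script);
model.PageIndex = 2; // ask the interface for page 2
```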

Router.js

```javascript
exports.cate_page = function (req, res) {

    var cate = req.query.cate;
    var page = req.query.page;

    var url = 'http://www.cnblogs.com/cate/' + cate;

    request
        .get(url)
        .end(function (err, sres) {

            // build the parameters for the POST request
            var $ = cheerio.load(sres.text);
            var post_data_str = $('#pager_bottom').prev().html().trim();
            var post_data_obj = JSON.parse(post_data_str.slice(post_data_str.indexOf('=') + 2, -1));

            // paging interface
            var page_url = 'http://www.cnblogs.com/mvc/AggSite/PostList.aspx';
            // set the current page
            post_data_obj.PageIndex = page;

            request
                .post(page_url)
                .set('Origin', 'http://www.cnblogs.com')                     // forged origin
                .set('Referer', 'http://www.cnblogs.com/cate/' + cate + '/') // forged referer
                .send(post_data_obj)                                         // POST data
                .end(function (err, ssres) {
                    var article = [];
                    var $$ = cheerio.load(ssres.text);
                    $$('.titlelnk').each(function (index, ele) {
                        var ele = $$(ele);
                        var href = ele.attr('href');
                        var title = ele.text();
                        article.push({ href: href, title: title });
                    });
                    res.json({ title: cate, cnblogs: article });
                });
        });
};
```

cate.ejs pagination code

```javascript
$('.pagination').twbsPagination({
    totalPages: 20,   // show 20 pages by default
    startPage: 1,
    visiblePages: 5,
    initiateStartPageClick: false,
    first: 'First',
    prev: 'Prev',
    next: 'Next',
    last: 'Last',
    onPageClick: function (evt, page) {
        $.ajax({
            url: '/cnblogs/cate_page?cate=' + cate + '&page=' + page,
            type: 'GET',
            dataType: 'json',
            success: function (data) {
                $('.artic').html('');
                var cnblogs = data.cnblogs;
                for (var i = 0; i < cnblogs.length; i++) {
                    var compiled = _.template($('#cnblogs').html());
                    var art = compiled(cnblogs[i]);
                    $('.artic').append(art);
                }
            }
        });
    }
});
```

Visit http://localhost:1314/cnblogs/cate and you can see:

Page 1 data of the JavaScript section

Page 2 data

Final words

At this point, a simple crawler is complete. The crawler itself is actually not difficult; the difficulty lies in analyzing the page structure and handling some of the business logic.

The complete code is on my GitHub; stars are welcome (☆▽☆).

Since this is my first technical blog post, my writing skills and knowledge are limited. If anything is wrong, I welcome corrections from fellow bloggers.

Resources:

"superagent Chinese documentation"

"A walkthrough of the cheerio API"

