[JS Master's Road] A simple Node.js crawler: crawling the full post list of a blog

Source: Internet
Author: User

Crawl target: my own blog, http://www.cnblogs.com/ghostwu/

Features to implement:

Crawl the title, hyperlink, summary, and publication time of every article on the blog.

Libraries used:

The http library that ships with Node.js

Third-party library: cheerio. This library handles DOM nodes, and its usage is almost identical to jQuery's, so with it writing a crawler is very simple.

Preparation:

1. Run npm init --yes to initialize package.json

2. Install cheerio: npm install cheerio --save-dev

The goal is to collect the parts of each article that need to be crawled (title, hyperlink, summary, publication time) into an object and put those objects in an array, such as:

    [
        {
            title: "[Top][JS Master's Road] Build a JavaScript open source framework gdom from scratch, with plug-in development: free video tutorial series",
            url: "http://www.cnblogs.com/ghostwu/p/7470038.html",
            entry: "Abstract: Baidu network disk: https://pan.baidu.com/s/1kULNXOF Youku Tudou View address: http://v.youku.com/v_show/id_XMzAwNTY2MTE0MA==.html?spm=a2h0j.8191423.playlist_content.5!3~5~5~a&&f",
            listTime: "2017-09-05 17:08"
        },
        {
            title: "[JS Master's Road] Vue 2.0 with vue-cli + webpack: Vuex usage in detail",
            url: "http://www.cnblogs.com/ghostwu/p/7521097.html",
            entry: "Abstract: Before this, I have shared the communication mechanism between components and components and the communication mechanism between parent and child components, and our vuex is to solve the problem of component communication Vuex what is it? The essence of component communication is to pass the state of the data or component between components (the data and state are collectively referred to as States), but you can see that if we communicate in the most basic way, once we need to manage more states, the code will",
            listTime: "2017-09-14 15:51"
        },
        {
            title: "[JS Master's Road] Vue 2.0 with vue-cli + webpack: sibling component communication tutorial",
            url: "http://www.cnblogs.com/ghostwu/p/7518158.html",
            entry: "Abstract: Let's go on to the above, we'll explain the communication of the sibling components, the project structure is the same as above. Create a file Eventhandler.js in the Src/assets directory that is intended to pass events between sibling components Eventhandler.js Code: 2, create a new component Brother1.vue in the components directory. Through Eve",
            listTime: "2017-09-13 22:49"
        },
    ]

Approach:

1. Fetch all the HTML content of the target address http://www.cnblogs.com/ghostwu/

2. Extract the HTML of every article from that content

3. For each article, extract the corresponding fields (title, hyperlink, summary, publication time)

    var http = require('http');
    var cheerio = require('cheerio');

    var url = 'http://www.cnblogs.com/ghostwu/';

    // Parse one page of HTML and return an array of article objects.
    function filterHtml(html) {
        var $ = cheerio.load(html);
        var arcList = [];
        var aPost = $("#content").find(".post-list-item");
        aPost.each(function () {
            var ele = $(this);
            var title = ele.find("h2 a").text();
            var url = ele.find("h2 a").attr("href");
            // Drop the link inside the summary so only the summary text remains.
            ele.find(".c_b_p_desc a").remove();
            var entry = ele.find(".c_b_p_desc").text();
            // Drop the links in the footer before reading the timestamp text.
            ele.find("small a").remove();
            var listTime = ele.find("small").text();
            // Keep only the "YYYY-MM-DD HH:MM" part.
            var re = /\d{4}-\d{2}-\d{2}\s*\d{2}[:]\d{2}/;
            listTime = listTime.match(re)[0];
            arcList.push({
                title: title,
                url: url,
                entry: entry,
                listTime: listTime
            });
        });
        return arcList;
    }

    http.get(url, function (res) {
        var html = '';
        var arcList = [];
        // Accumulate the response body chunk by chunk.
        res.on('data', function (chunk) {
            html += chunk;
        });
        // The whole page has arrived: parse it and print the result.
        res.on('end', function () {
            arcList = filterHtml(html);
            console.log(arcList);
        });
    });

A few key points are explained below:

1. res.on('data', function () {})

After the http module sends the GET request, the source of the target page arrives piece by piece, so I listen for the data event with on. chunk is the piece of data currently being transferred, and it is appended to the html variable. When the transfer finishes, the end event fires. If you call console.log(html) in the end event, you will see that it holds the complete HTML source of the target address. This solves our first problem: fetching all the HTML content of http://www.cnblogs.com/ghostwu/

2. With the full HTML content available, I wrapped the filtering logic in a function, filterHtml, which extracts the results I need (the information for each article).

3. var $ = cheerio.load(html); loads the HTML content through cheerio's load method, after which cheerio's node operations can be used. To stay close to jQuery usage, I store the returned document object in a variable named $.

4. var aPost = $("#content").find(".post-list-item"); selects the nodes of all articles. Once they are obtained, the each method iterates over them, extracts the required information, organizes it into an object, and pushes the object into an array:

    arcList.push({
        title: title,
        url: url,
        entry: entry,
        listTime: listTime
    });
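The data/end pattern from point 1 can be tried without touching the network. The sketch below is only an illustration: collectBody is a hypothetical helper name, and a hand-driven EventEmitter stands in for the real response object that http.get passes to its callback.

```javascript
var EventEmitter = require('events').EventEmitter;

// Hypothetical helper: append every 'data' chunk to a string and hand the
// complete body to done once 'end' fires.
function collectBody(res, done) {
    var html = '';
    res.on('data', function (chunk) {
        html += chunk;
    });
    res.on('end', function () {
        done(html);
    });
}

// Stand-in for the http response: we emit the events by hand.
var fakeRes = new EventEmitter();
var result = null;
collectBody(fakeRes, function (html) {
    result = html;
});
fakeRes.emit('data', '<html><body>');
fakeRes.emit('data', '</body></html>');
fakeRes.emit('end');
console.log(result); // <html><body></body></html>
```

With a real response the chunks arrive asynchronously, but the order of events is the same: any number of data events followed by a single end event.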

That completes the crawler; the result is the array shown earlier. If your blog uses the same theme as mine, this code should be able to crawl it as well.

Next, add paginated crawling so that the entire blog can be crawled:

    var http = require('http');
    var cheerio = require('cheerio');

    var url = 'http://www.cnblogs.com/ghostwu/';

    // Parse one page of HTML and return an array of article objects.
    function filterHtml(html) {
        var $ = cheerio.load(html);
        var arcList = [];
        var aPost = $("#content").find(".post-list-item");
        aPost.each(function () {
            var ele = $(this);
            var title = ele.find("h2 a").text();
            var url = ele.find("h2 a").attr("href");
            ele.find(".c_b_p_desc a").remove();
            var entry = ele.find(".c_b_p_desc").text();
            ele.find("small a").remove();
            var listTime = ele.find("small").text();
            var re = /\d{4}-\d{2}-\d{2}\s*\d{2}[:]\d{2}/;
            listTime = listTime.match(re)[0];
            arcList.push({
                title: title,
                url: url,
                entry: entry,
                listTime: listTime
            });
        });
        return arcList;
    }

    // Find the link to the next page and crawl it if it lies past the current page.
    function nextPage(html) {
        var $ = cheerio.load(html);
        var nextUrl = $("#pager a:last-child").attr('href');
        if (!nextUrl) return;
        // Convert both page numbers to integers; comparing them as strings
        // would go wrong once the page number reaches two digits.
        var curPage = parseInt($("#pager .current").text(), 10) || 1;
        var nextPageNum = parseInt(nextUrl.substring(nextUrl.indexOf('=') + 1), 10);
        if (curPage < nextPageNum) crawler(nextUrl);
    }

    // Fetch one page, print its articles, then move on to the next page.
    function crawler(url) {
        http.get(url, function (res) {
            var html = '';
            var arcList = [];
            res.on('data', function (chunk) {
                html += chunk;
            });
            res.on('end', function () {
                arcList = filterHtml(html);
                console.log(arcList);
                nextPage(html);
            });
        });
    }
    crawler(url);
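Both the current page (read from the pager text) and the next page (read from the link) start out as strings, so the page check is safest when both are converted to numbers first. The sketch below isolates that check. The helper names pageFromUrl and hasNextPage and the default.html?page=N link format are assumptions for illustration.

```javascript
// Hypothetical helper: read the page number after the first '=' in a pager link.
function pageFromUrl(nextUrl) {
    var raw = nextUrl.substring(nextUrl.indexOf('=') + 1);
    return parseInt(raw, 10);
}

// Hypothetical helper: true when the link points past the current page.
// curPageText is the pager's text; empty text is treated as page 1.
function hasNextPage(curPageText, nextUrl) {
    var curPage = parseInt(curPageText, 10) || 1;
    var next = pageFromUrl(nextUrl);
    return !isNaN(next) && curPage < next;
}

// Assumed link format.
var link = 'http://www.cnblogs.com/ghostwu/default.html?page=10';
console.log('9' < '10');              // false: strings compare character by character
console.log(hasNextPage('9', link));  // true: numbers compare by value
console.log(hasNextPage('10', link)); // false: already on the last page
```

Because the check only recurses while the next page number is larger than the current one, the crawl stops by itself on the last page.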
