Crawl target: my own blog, http://www.cnblogs.com/ghostwu/
Features to implement:
Crawl every article's title, hyperlink, summary, and publish time from the blog.
Libraries used:
Node.js's built-in http module
Third-party library: cheerio. It operates on DOM nodes with an API that is almost identical to jQuery's, so with this tool writing a crawler is very simple.
Preparatory work:
1, npm init --yes to initialize package.json
2, install cheerio: npm install cheerio --save-dev
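After these two steps, package.json should look roughly like this (the package name is made up and the version number is illustrative):

{
  "name": "blog-crawler",
  "version": "1.0.0",
  "devDependencies": {
    "cheerio": "^1.0.0-rc.2"
  }
}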
The goal is to collect, for each article, the fields we want to crawl (title, hyperlink, summary, publish time) into an object, and put those objects into an array, like this:
[
  {
    title: '[Top][JS Master\'s Road] Build a JavaScript open source framework GDom from scratch, plus a free serialized video tutorial on plugin development',
    url: 'http://www.cnblogs.com/ghostwu/p/7470038.html',
    entry: 'Abstract: Baidu net disk: https://pan.baidu.com/s/1kULNXOF Youku/Tudou viewing address: http://v.youku.com/v_show/id_XMzAwNTY2MTE0MA==.html?spm=a2h0j.8191423.playlist_content.5!3~5~5~a&&f',
    listTime: '2017-09-05 17:08'
  },
  {
    title: '[JS Master\'s Road] vue2.0 based on vue-cli + webpack: vuex usage in detail',
    url: 'http://www.cnblogs.com/ghostwu/p/7521097.html',
    entry: 'Abstract: Before this I have shared the communication mechanism between components and between parent and child components; vuex exists to solve the problem of component communication. What is vuex? The essence of component communication is passing data or component state between components (data and state are collectively called state), but as you can see, if we communicate in the most basic way, once more state needs to be managed, the code will',
    listTime: '2017-09-14 15:51'
  },
  {
    title: '[JS Master\'s Road] vue2.0 based on vue-cli + webpack: sibling component communication tutorial',
    url: 'http://www.cnblogs.com/ghostwu/p/7518158.html',
    entry: 'Summary: Continuing from the previous post, we explain communication between sibling components; the project structure is the same as before. 1, create a file eventHandler.js in the src/assets directory, which is used to pass events between sibling components. eventHandler.js code: 2, create a new component Brother1.vue in the components directory. Through Eve',
    listTime: '2017-09-13 22:49'
  }
]
Explanation of the approach:
1, fetch all the HTML content of the target address http://www.cnblogs.com/ghostwu/
2, extract the HTML of every article from that content
3, from each article, extract the corresponding title, hyperlink, summary and publish time
var http = require('http');
var cheerio = require('cheerio');

var url = 'http://www.cnblogs.com/ghostwu/';

function filterHtml(html) {
    var $ = cheerio.load(html);
    var arcList = [];
    var aPost = $("#content").find(".post-list-item");
    aPost.each(function () {
        var ele = $(this);
        var title = ele.find("h2 a").text();
        var url = ele.find("h2 a").attr("href");
        ele.find(".c_b_p_desc a").remove();
        var entry = ele.find(".c_b_p_desc").text();
        ele.find("small a").remove();
        var listTime = ele.find("small").text();
        var re = /\d{4}-\d{2}-\d{2}\s*\d{2}[:]\d{2}/;
        listTime = listTime.match(re)[0];
        arcList.push({
            title: title,
            url: url,
            entry: entry,
            listTime: listTime
        });
    });
    return arcList;
}

http.get(url, function (res) {
    var html = '';
    var arcList = [];
    // var arcInfo = {};
    res.on('data', function (chunk) {
        html += chunk;
    });
    res.on('end', function () {
        arcList = filterHtml(html);
        console.log(arcList);
    });
});
Several key points need to be explained:
1, res.on('data', function () {})
After the http module sends the GET request, it keeps receiving the source of the target web page, so I listen for the 'data' event; chunk is the piece of data currently being transmitted, and I append it to the html variable. When the transfer finishes, the 'end' event fires; if you console.log(html) inside the 'end' handler you will see that it is the complete HTML source of the target address. This solves our first problem: getting all the HTML content of http://www.cnblogs.com/ghostwu/.
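A stripped-down sketch of just this request-and-accumulate pattern, separate from the full program above (same idea, minimal code):

var http = require('http');

http.get('http://www.cnblogs.com/ghostwu/', function (res) {
    var html = '';
    // each 'data' event delivers one chunk of the response body
    res.on('data', function (chunk) {
        html += chunk;
    });
    // 'end' fires once the whole response has arrived
    res.on('end', function () {
        console.log(html); // the complete HTML source of the target address
    });
});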
2, with the complete HTML content in hand, I encapsulate a function, filterHtml, to filter out the results I need (the information for each article).
3, var $ = cheerio.load(html); this loads the HTML content through cheerio's load method, after which the nodes can be operated on with cheerio. To keep the feel close to jQuery, I store the document object in the dollar sign $.
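For example, a tiny self-contained sketch of this cheerio pattern (the HTML string here is made up):

var cheerio = require('cheerio');

var html = '<div id="content"><h2><a href="/p/1.html">Hello</a></h2></div>';
var $ = cheerio.load(html);                    // parse the HTML string into a document
console.log($('#content h2 a').text());        // "Hello"
console.log($('#content h2 a').attr('href'));  // "/p/1.html"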
4, var aPost = $("#content").find(".post-list-item"); this grabs the node information of all the articles. Once we have it, we traverse it with the each method, pull out the required information for each article, organize it into an object, and push it into the array:
arcList.push({
    title: title,
    url: url,
    entry: entry,
    listTime: listTime
});
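One detail in filterHtml worth calling out: the text of the <small> node contains more than just the timestamp, so a regular expression pulls out the "YYYY-MM-DD HH:MM" part. A minimal sketch (the sample string is invented; in the crawler the text comes from ele.find("small").text()):

var listTime = 'posted @ 2017-09-05 17:08 ghostwu';   // invented sample string
var re = /\d{4}-\d{2}-\d{2}\s*\d{2}[:]\d{2}/;         // year-month-day, optional whitespace, hour:minute
console.log(listTime.match(re)[0]);                    // "2017-09-05 17:08"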
That's it; the result is what was shown above. Any blog that uses the same theme (the same DOM structure) as mine should be crawlable in the same way.
Next, let's add paging so the whole blog, not just the first list page, gets crawled: after printing one page's articles, we look in the pager for the next page's URL and, if there is one, crawl it too.
var http = require('http');
var cheerio = require('cheerio');

var url = 'http://www.cnblogs.com/ghostwu/';

function filterHtml(html) {
    var $ = cheerio.load(html);
    var arcList = [];
    var aPost = $("#content").find(".post-list-item");
    aPost.each(function () {
        var ele = $(this);
        var title = ele.find("h2 a").text();
        var url = ele.find("h2 a").attr("href");
        ele.find(".c_b_p_desc a").remove();
        var entry = ele.find(".c_b_p_desc").text();
        ele.find("small a").remove();
        var listTime = ele.find("small").text();
        var re = /\d{4}-\d{2}-\d{2}\s*\d{2}[:]\d{2}/;
        listTime = listTime.match(re)[0];
        arcList.push({
            title: title,
            url: url,
            entry: entry,
            listTime: listTime
        });
    });
    return arcList;
}

function nextPage(html) {
    var $ = cheerio.load(html);
    var nextUrl = $("#pager a:last-child").attr('href');
    if (!nextUrl) return;
    var curPage = $("#pager .current").text();
    if (!curPage) curPage = 1;
    var nextPageNo = nextUrl.substring(nextUrl.indexOf('=') + 1);
    // compare as numbers so the check still works past page 9
    if (parseInt(curPage, 10) < parseInt(nextPageNo, 10)) crawler(nextUrl);
}

function crawler(url) {
    http.get(url, function (res) {
        var html = '';
        var arcList = [];
        res.on('data', function (chunk) {
            html += chunk;
        });
        res.on('end', function () {
            arcList = filterHtml(html);
            console.log(arcList);
            nextPage(html);
        });
    });
}

crawler(url);
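Assuming the script is saved as crawler.js (the file name is mine), run it with node crawler.js. Each list page's articles are printed as an array, and nextPage keeps requesting the next list page until the pager has no further link.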