After a few days with Node.js, I can say it really is good; nothing beats hands-on practice, so I wrote a small crawler tool.
The idea behind any crawler is the same: first fetch the page, then parse it to extract the data you need, and finally do something with that data, either write it to disk or display it on a web page, as you see fit.
The hardest part is parsing the page; without other tools, you can only pick it apart with regular expressions. Here I use the cheerio module, which is very handy, so don't resist it (at first I was reluctant and wanted to manage without anything extra, but I quickly gave up and went back to it). With cheerio you can select by div, by class, by href, and other HTML tags and attributes, and pull out the data inside.
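For contrast, the regular-expression route mentioned above would look something like this. This is a minimal, self-contained sketch; the sample HTML string and the exact pattern are made up for illustration, not taken from the real blog markup.

```javascript
// Extract article titles from an HTML fragment using only core JavaScript.
// The sample HTML is a simplified, made-up stand-in for the blog markup.
var html =
  '<div class="postTitle"><a class="postTitle2" href="/p/1">Common git commands</a></div>' +
  '<div class="postTitle"><a class="postTitle2" href="/p/2">Another post</a></div>';

function extractTitles(html) {
  var titles = [];
  // Match each postTitle anchor and capture the text between <a ...> and </a>.
  var re = /<div class="postTitle"><a[^>]*>([^<]*)<\/a><\/div>/g;
  var m;
  while ((m = re.exec(html)) !== null) {
    titles.push(m[1].trim());
  }
  return titles;
}

console.log(extractTitles(html)); // [ 'Common git commands', 'Another post' ]
```

It works for this toy fragment, but real markup varies enough that patterns like this break constantly, which is exactly why a selector library is the saner choice.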
This example grabs the article titles from my blog home page (http://www.cnblogs.com/juepei/). I hope the site admin doesn't mind; it is only a walkthrough.
First, how do we get the data for this page?
The code is simple (Node.js really is concise):
var request = require('request');

request(url, function (error, res, body) {
    if (!error && res.statusCode == 200) {
        console.log(body);
    }
});
The body parameter holds the page data.
Next, parse that data.
The markup for one article entry on the home page looks like this:
<div class="day">
    <div class="dayTitle">
        <a id="Homepage1_homepagedays_dayslist_ctl00_imagelink" href="http://www.cnblogs.com/juepei/archive/2015/01/09.html">January 9, 2015</a>
    </div>
    <div class="postTitle">
        <a id="Homepage1_homepagedays_dayslist_ctl00_daylist_titleurl_0" class="postTitle2" href="http://www.cnblogs.com/juepei/p/4212595.html">Common git commands</a>
    </div>
    <div class="postCon"><div class="c_b_p_desc">Summary: (1) git branch: view local branches (2) git branch -a: view remote branches (3) git checkout branchname: switch branch (4) git add yourfile (5) git commit -a -m "describe": commit your current work to the staging area, which can be understood as ...<a href="http://www.cnblogs.com/juepei/p/4212595.html" class="c_b_p_desc_readmore">Read the full text</a></div></div>
    <div class="clear"></div>
    <div class="postDesc">Posted @ 2015-01-09 10:06 Schrödinger's Cat Read (4) Comments (0) <a href="http://i.cnblogs.com/EditPosts.aspx?postid=4212595" rel="nofollow">Edit</a></div>
    <div class="clear"></div>
</div>
.....
There are many articles, and each one just repeats the structure above.
What I want is this part:
<div class="postTitle">
    <a id="..." class="postTitle2" href="http://www.cnblogs.com/juepei/p/4212595.html">Common git commands</a>
</div>
It's wrapped in <div class="postTitle">. To pull it out, this is where cheerio shines; I recommend reading the cheerio API docs.
The code is as follows:
var cheerio = require('cheerio');

var $ = cheerio.load(body);

$('div').filter(function (i, e) {
    if ($(this).attr('class') === 'postTitle') {
        console.log($(this).text().trim());
    }
});
Here the div elements are located by their class attribute, and the titles fall right out. The code is that simple. (cheerio also supports CSS class selectors directly, so `$('.postTitle')` would find the same elements without the filter.)
You can then do whatever you like with the extracted data; I saved it to a local text file, using an array in between to collect the titles.
The full code is as follows:
var fs = require('fs');
var request = require('request');
var cheerio = require('cheerio');

var url = 'http://www.cnblogs.com/juepei/';
var result = new Array();

function getDatas() {
    request(url, function (error, res, body) {
        if (!error && res.statusCode == 200) {
            var $ = cheerio.load(body);
            var j = 0;
            $('div').filter(function (i, e) {
                if ($(this).attr('class') === 'postTitle') {
                    j++;
                    //console.log($(this).text().trim());
                    result.push($(this).text().trim());
                }
            });
            console.log(result.toString());
            fs.appendFile('/home/wang/data.txt', result.toString(), function (err) {
                if (err) {
                    console.log('file: ' + err);
                } else {
                    console.log('write ok');
                }
            });
        } else {
            console.log(error);
        }
    });
}

getDatas();
Run this code and data.txt appears under /home/wang. The page is encoded in UTF-8 and so is my system, so no mojibake shows up; on a system with a different encoding, you would need to convert the text accordingly.
At this point, you are done. PS: I have only studied Node.js for a few days myself; at the beginning it was frustrating, good material was hard to find, and everything felt difficult. Read the API docs more; that is the way.
Node.js: writing a crawler tool