Writing a Crawler Tool in Node.js

Source: Internet
Author: User
Tags: git commands

After a few days of looking at Node.js, I can say it really is good. The best practice is hands-on practice, so I wrote a crawler tool.

The idea behind any crawler is the same: first fetch the page, then parse it to extract the data you need, and finally do something with that data, whether writing it to disk or displaying it on a web page; that part is up to you.

The hardest part is parsing the page. Without other tools, you can only fall back on regular expressions. Here I use the cheerio module, which is very handy, so don't resist it (at first I was reluctant and wanted to get by without extra libraries, but that collapsed quickly and I switched to cheerio). With cheerio you can query the page by div, by class, by href, and by other HTML tags and attributes, and pull out the data inside.
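To see why the regex-only route gets painful, here is a minimal sketch of it against a hypothetical, well-behaved snippet (the markup and titles are made up for illustration; real pages are far messier, which is exactly why cheerio is worth the switch):

```javascript
// Regex-only extraction: works on tidy, hypothetical markup,
// but breaks as soon as attribute order or whitespace changes.
var body = '<div class="postTitle"><a href="/p/1">First post</a></div>' +
           '<div class="postTitle"><a href="/p/2">Second post</a></div>';

var titles = [];
var re = /<div class="postTitle"><a[^>]*>([^<]*)<\/a><\/div>/g;
var match;
while ((match = re.exec(body)) !== null) {
    titles.push(match[1]); // the captured link text
}
console.log(titles); // [ 'First post', 'Second post' ]
```

One stray attribute or newline inside the div and this regex silently matches nothing, with no error to tell you why.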

As an example, this post scrapes the article titles from my blog home page (http://www.cnblogs.com/juepei/). I hope the site owner doesn't mind; it's just a field walkthrough.

First, how do we fetch the data from this page?

The code below is simple (Node.js is very concise):

    var request = require('request');

    request(url, function (error, res, body) {
        if (!error && res.statusCode === 200) {
            console.log(body);
        }
    });

Here body holds the page data.

Next, we parse that data.

Looking at the source of the first article block on the page, we see the following:

<Divclass= "Day">    <Divclass= "Daytitle">        <aID= "Homepage1_homepagedays_dayslist_ctl00_imagelink"href= "http://www.cnblogs.com/juepei/archive/2015/01/09.html">January 9, 2015</a>                      </Div>                <Divclass= "Posttitle">                <aID= "Homepage1_homepagedays_dayslist_ctl00_daylist_titleurl_0"class= "PostTitle2"href= "http://www.cnblogs.com/juepei/p/4212595.html">Common git commands</a>            </Div>            <Divclass= "Postcon"><Divclass= "C_b_p_desc">Summary: (1) Git branch view local branch (2) Git branch-a View Remote branch (3) git checkout branchname Switch branch (4) git add yourfile (5) Git commit-a-M&quot;Describe&quot;Submit your current development to staging area, which can be understood as you ...<ahref= "http://www.cnblogs.com/juepei/p/4212595.html"class= "C_b_p_desc_readmore">Read the full text</a></Div></Div>            <Divclass= "Clear"></Div>            <Divclass= "Postdesc">Posted @ 2015-01-09 10:06 Schrödinger's Cat _ Read (4) Comments (0)<ahref= "http://i.cnblogs.com/EditPosts.aspx?postid=4212595"rel= "nofollow">Edit</a></Div>            <Divclass= "Clear"></Div>        </Div>
.....

There are many articles, and each one repeats the structure above.

The part I want is this:

    <div class="postTitle">
        <a id="..." class="postTitle2" href="http://www.cnblogs.com/juepei/p/4212595.html">Common git commands</a>
    </div>

The title is wrapped in <div class="postTitle">. To take it out, it's time for cheerio to shine; I recommend reading through the cheerio API.

The code is as follows:

    var $ = cheerio.load(body);
    $('div').filter(function (i, e) {
        if ($(this).attr('class') === 'postTitle') {
            console.log($(this).text().trim());
        }
    });

Here we locate the elements by div and class, and that gives us the data. The code really is that simple.

You can then process the data however you like; I saved it to a local text file, using an array in between to collect the titles.

The full code is as follows:

    var request = require('request');
    var cheerio = require('cheerio');
    var fs = require('fs');

    var url = 'http://www.cnblogs.com/juepei/';
    var result = [];

    function getDatas() {
        request(url, function (error, res, body) {
            if (!error && res.statusCode === 200) {
                var $ = cheerio.load(body);
                $('div').filter(function (i, e) {
                    if ($(this).attr('class') === 'postTitle') {
                        // console.log($(this).text().trim());
                        result.push($(this).text().trim());
                    }
                });
                console.log(result.toString());
                fs.appendFile('/home/wang/data.txt', result.toString(), function (err) {
                    if (err) {
                        console.log('File: ' + err);
                    } else {
                        console.log('Write OK');
                    }
                });
            } else {
                console.log(error);
            }
        });
    }

    getDatas();

Run this code and data.txt appears in the /home/wang directory. The page encoding is UTF-8 and so is my system environment, so no garbled text is produced; on another system with a different encoding, you'll need to handle the conversion yourself.

At this point, you are done. PS: I've only studied Node.js for a few days myself; at the beginning I struggled to find good material and it felt difficult. My advice is to read the API docs more; that is the way.

