Node.js Crawler Primer

1. Foreword

I have always used Python or .NET to implement crawlers, but now, as a front-end developer, I naturally need to be proficient in Node.js. The following uses Node.js to implement a crawler for Qiushibaike (the "Embarrassing Encyclopedia"). Note that some of the code in this article uses ES6 syntax.

The dependent libraries required to implement the crawler are as follows:

    1. request: sends GET or POST requests to fetch the source code of a web page.
    2. cheerio: parses the page source to extract the required data.

This article first introduces these two libraries and their usage, then uses them to implement a web crawler for Qiushibaike.

2. Request Library

request is a lightweight HTTP library that is powerful yet simple to use. You can use it to make HTTP requests, with support for HTTP authentication, custom request headers, and more. Some of the features of the request library are described below.

Install the request module as follows:

npm install request

After installing request, you can use it. The following uses request to fetch the Baidu home page.

const req = require('request');

req('http://www.baidu.com', (error, response, body) => {
  if (!error && response.statusCode === 200) {
    console.log(body);
  }
});

When the options parameter is not set, the request method defaults to a GET request. I prefer to call the specific method of the request object, like this:

req.get({
  url: 'http://www.baidu.com'
}, (err, res, body) => {
  if (!err && res.statusCode === 200) {
    console.log(body);
  }
});

However, in many cases, directly requesting a URL to get the HTML source does not return the information we need. In general, you need to take two things into account:

    1. The request headers of the web page
    2. The encoding of the web page

Here is how to add request headers and set the correct encoding when making a request.

req.get({
  url: url,
  headers: {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36",
    "Host": "www.zhihu.com",
    "Upgrade-Insecure-Requests": "1"
  },
  encoding: 'utf-8'
}, (err, res, body) => {
  if (!err) console.log(body);
});

Set the options parameter: add a headers property to set the request headers, and add an encoding property to set the page encoding. Note that if encoding: null is used, the content obtained by the GET request is a Buffer object, i.e. body is a Buffer.
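
For example, here is a minimal sketch of fetching binary content with encoding: null; the image URL is only an assumption for illustration.

const req = require('request');

// With encoding: null, request does not decode the response body,
// so body is a raw Buffer (useful for images or non-UTF-8 pages).
req.get({
  url: 'http://www.baidu.com/img/bd_logo1.png',  // assumed example of a binary resource
  encoding: null
}, (err, res, body) => {
  if (!err && res.statusCode === 200) {
    console.log(Buffer.isBuffer(body)); // true
    console.log(body.length);           // size in bytes
  }
});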

The features above are sufficient for what follows; for more functionality, please refer to the official request documentation.

3. Cheerio Library

Cheerio is a server-side version of jQuery, loved by developers for being lightweight, fast, and easy to learn. With a foundation in jQuery, the cheerio library is easy to pick up. It can quickly locate elements in a web page using the same rules as jQuery selectors, and it can also modify the content of HTML elements and read their data in a very convenient way. The following introduces how cheerio quickly locates elements in a page and reads their content.

First, install the cheerio library:

npm install cheerio
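
As a quick sanity check before the real example, here is a minimal sketch that loads an inline HTML string (made up for illustration) and queries it jQuery-style:

const cheerio = require('cheerio');

// Load an HTML string, then query it with jQuery-style selectors.
const $ = cheerio.load('<ul><li class="item">apple</li><li class="item">pear</li></ul>');

console.log($('.item').length);           // 2
console.log($('.item').first().text());   // 'apple'
console.log($('li').eq(1).attr('class')); // 'item'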

Below is a piece of code, followed by an explanation of how the cheerio library is used in it. The example analyzes the home page of the cnblogs blog site and extracts the title of each article on the page.

First, analyze the structure of the blog home page.

After analyzing the HTML source, you will find that every entry can be obtained via .post_item, and that within each .post_item the a tag of the title can be matched with a.titlelnk. The following implements this in code.

const req = require('request');
const cheerio = require('cheerio');

req.get({
  url: 'https://www.cnblogs.com/'
}, (err, res, body) => {
  if (!err && res.statusCode === 200) {
    let cnblogHtmlStr = body;
    let $ = cheerio.load(cnblogHtmlStr);
    $('.post_item').each((index, ele) => {
      let title = $(ele).find('a.titlelnk');
      let titleText = title.text();
      let titleUrl = title.attr('href');
      console.log(titleText, titleUrl);
    });
  }
});

Of course, the cheerio library also supports chained calls, so the above code can also be rewritten as:

let cnblogHtmlStr = body;
let $ = cheerio.load(cnblogHtmlStr);
let titles = $('.post_item').find('a.titlelnk');
titles.each((index, ele) => {
  let titleText = $(ele).text();
  let titleUrl = $(ele).attr('href');
  console.log(titleText, titleUrl);
});

The above code is very simple and needs no line-by-line explanation. Here is a summary of the points I consider most important.

    1. A node collection A obtained with the find() method can use each of its elements as a new root node to locate child nodes again and to read the content and attributes of those child elements, as $(ele) does above.
    2. In the above code, $(ele) could also be written as $(this); however, because I use an ES6 arrow function, which changes the this binding of the each callback, I use $(ele).
    3. The cheerio library supports chained calls, such as $('.post_item').find('a.titlelnk') above. Note that if the cheerio object A is a collection, A.find() runs find() against every element in A and returns the combined results. If A.text() is called, every element in A contributes its text, and a single string is returned that merges the content of all elements (merged directly, with no separator). See the sketch after this list.
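
A minimal sketch of the merging behaviour described in point 3, using an inline HTML string made up for illustration:

const cheerio = require('cheerio');

const $ = cheerio.load(
  '<div class="post_item"><a class="titlelnk" href="/a">First</a></div>' +
  '<div class="post_item"><a class="titlelnk" href="/b">Second</a></div>'
);

// find() on a collection runs against every element and combines the results.
let titles = $('.post_item').find('a.titlelnk');
console.log(titles.length);  // 2

// text() on a collection merges the contents of all elements directly, without a separator.
console.log(titles.text());  // 'FirstSecond'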

Finally, here are some of the methods I use most often.

    1. first()
    2. last()
    3. children([selector]): similar to find(), except that this method searches only direct child nodes, whereas find() searches all descendant nodes (see the sketch after this list).
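
A minimal sketch of the difference between children() and find(), again using an inline HTML string made up for illustration:

const cheerio = require('cheerio');

const $ = cheerio.load(
  '<div id="root"><p>direct child</p><section><p>nested descendant</p></section></div>'
);

// children() only looks at direct child nodes...
console.log($('#root').children('p').length); // 1

// ...while find() searches the entire descendant tree.
console.log($('#root').find('p').length);     // 2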

For more information on the use of the cheerio library, refer to the cheerio documentation.

4. Embarrassing Encyclopedia Crawler

With the request and cheerio libraries introduced above, the following uses these two libraries to crawl pages from Qiushibaike (the Embarrassing Encyclopedia).

1. In the project directory, create a new httpHelper.js file, which fetches the Qiushibaike page source for a given URL, as follows:

// Crawler
const req = require('request');

function getHtml(url) {
  return new Promise((resolve, reject) => {
    req.get({
      url: url,
      headers: {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36",
        "Referer": "https://www.qiushibaike.com/"
      },
      encoding: 'utf-8'
    }, (err, res, body) => {
      if (err) reject(err);
      else resolve(body);
    });
  });
}

exports.getHtml = getHtml;
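
As a quick check of the helper above, here is a minimal usage sketch (a hypothetical test snippet, not part of the project) that calls getHtml with async/await:

// Hypothetical test snippet for httpHelper.js; not part of the project itself.
const httpHelper = require('./httpHelper');

(async () => {
  try {
    const html = await httpHelper.getHtml('https://www.qiushibaike.com/');
    console.log(html.length); // print the size of the fetched source
  } catch (err) {
    console.error(err);
  }
})();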

2. In the project directory, create a new Splider.js file, analyze the web page code of Qiushibaike, extract the required information, and build the logic to crawl different pages by changing the page id in the URL.

const cheerio = require('cheerio');
const httpHelper = require('./httpHelper');

function getQBJok(htmlStr) {
  let $ = cheerio.load(htmlStr);
  let jokList = $('#content-left').children('div');
  let rst = [];
  jokList.each((i, item) => {
    let node = $(item);
    let titleNode = node.find('h2');
    // find() always returns a cheerio object, so check length to detect a missing h2
    let title = titleNode.length ? titleNode.text().trim() : 'anonymous user';
    let content = node.find('.content span').text().trim();
    let likeNumber = node.find('i[class=number]').text().trim();
    rst.push({
      title: title,
      content: content,
      likeNumber: likeNumber
    });
  });
  return rst;
}

async function splider(index = 1) {
  let url = `https://www.qiushibaike.com/8hr/page/${index}/`;
  let htmlStr = await httpHelper.getHtml(url);
  let rst = getQBJok(htmlStr);
  return rst;
}

splider(1);

To get the Qiushibaike page information, first analyze the source code in the browser, locate the tags you need, and then extract their text or attribute values; this completes the parsing of the page.

The entry point of the Splider.js file is the splider method. It first constructs the Qiushibaike URL from the index parameter, then fetches the page source for that URL, and finally passes the source to the getQBJok method for parsing. This article only parses each text joke's author, content, and number of likes.

Run the Splider.js file directly to crawl the joke information on the first page. You can then change the parameter of the splider method to crawl information from different pages.
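
For example, here is a minimal sketch (my own assumption, not code from the project) that reuses the splider method above to crawl the first few pages in sequence:

// Hypothetical extension: crawl pages 1 to 3 in sequence and print the joke count per page.
async function spliderAll(pageCount = 3) {
  for (let i = 1; i <= pageCount; i++) {
    let jokes = await splider(i);  // reuse the splider(index) method defined above
    console.log(`page ${i}: ${jokes.length} jokes`);
  }
}

spliderAll(3).catch(err => console.error(err));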

On the basis of the code above, koa and Vue 2.0 are used to build a page for browsing the jokes.
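
The real server code lives in the repository linked below; purely as an illustration, here is an assumed minimal koa sketch of how such a route might return the crawled jokes as JSON (the module export of splider is my assumption):

// Assumed sketch only; the actual app.js is in the SpliderQB repository linked below.
const Koa = require('koa');
const splider = require('./Splider'); // assumes Splider.js exports the splider method

const app = new Koa();

app.use(async ctx => {
  // Return the first page of jokes as JSON.
  ctx.body = await splider(1);
});

app.listen(3000);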

The source code has been uploaded to GitHub: https://github.com/StartAction/SpliderQB

The project requires Node v7.6.0 or above. First, clone the entire project from GitHub.

git clone https://github.com/StartAction/SpliderQB.git

After cloning, go to the project directory and run the command below.

node app.js

5. Summary

Implementing a complete crawler has deepened my understanding of Node, and parts of the implementation use ES6 syntax, which has sped up my learning of ES6. In addition, this implementation touched on asynchronous control in Node; this article uses the async and await keywords, which is also my favorite approach. However, there are several ways to implement asynchronous control in Node, and when I have time I will summarize the specific approaches and their principles.
