Using Node.js as a Web Crawler


Introduction

The first language that comes to mind is Python, because Python gives the feeling of being capable of anything. But my earlier experience of writing crawlers in Python was quite uncomfortable, mainly for three reasons: first, DOM manipulation on the crawled pages; second, character-encoding handling; third, multi-threading. So Python is not actually that pleasant here. Is there a more pleasant way? Of course there is: Node.js!

The merits of Node.js as a crawler

First, the advantages of Node as a crawler.

The first is that its driving language is JavaScript. Before Node.js was born, JavaScript was a scripting language that ran in the browser, and its strength is manipulating the DOM elements of a web page; for this kind of work it is unmatched by other languages.

The second is that Node.js is single-threaded and asynchronous. That sounds strange: how can a single thread be asynchronous? Think about why a single-core CPU can multitask in an operating system: the CPU hands out time slices, each process occupies only a very short slice, but all processes are cycled through many times, so it looks as if many tasks run at once. JavaScript works on a similar principle: it has an event queue, and the engine loops over the queue handling events whose responses have arrived; an event that has not yet responded is not placed in the queue, so it does not block subsequent operations. The advantage for crawlers is that when pages are fetched concurrently, a page that has not returned does not block subsequent pages from loading, so there is no need for multi-threading as in Python.
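A minimal sketch of this behavior using plain timers (no network involved): both tasks below are scheduled back to back on a single thread, the script moves on immediately, and the faster task finishes first even though nothing runs in parallel.

setTimeout(function () {
    console.log("slow task done");   // fires after ~1000 ms
}, 1000);

setTimeout(function () {
    console.log("fast task done");   // fires after ~10 ms, before the slow task
}, 10);

console.log("both tasks scheduled; the single thread was never blocked");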

Next, the disadvantages of Node.

The first is asynchronous concurrency itself. Handled well it is very convenient; handled badly it is very troublesome. For example, if you crawl 10 pages and Node does no ordering of the asynchronous results, the pages returned will not necessarily come back as 1, 2, 3, 4 ... ; the order is likely to be random. One solution is to add a page sequence stamp, write the crawled data to a CSV file, and then re-sort it, as sketched below.
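A minimal sketch of the sequence-stamp idea (onPageFetched and sortedResults are hypothetical helpers; the actual fetching and CSV writing are omitted): stamp each record with its page number at crawl time, then restore the order afterwards.

var results = [];

// Hypothetical handler: called whenever a page finishes crawling,
// in whatever order the responses happen to arrive.
function onPageFetched(pageNo, data) {
    results.push({ page: pageNo, data: data });  // stamp the record with its page number
}

// Once every page has returned, restore the original page order.
function sortedResults() {
    return results.slice().sort(function (a, b) {
        return a.page - b.page;
    });
}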

The second disadvantage is data processing, where Node is not as good as Python. If you only need to crawl data, Node is perfectly fine; but if you want to go on to statistical analysis of that data, to run regression or clustering, then you cannot get all the way there with Node alone.

How to write a crawler with Node.js

Now let's talk about how to write a crawler with Node.js.

1. Initialize project files

In the project folder, run npm init to initialize a package.json file.

2. Install the request and cheerio dependency packages

request should sound familiar; it works much like Python's requests library. Its job is to establish a connection to the target page and return the corresponding data, which is not hard to understand.

cheerio's job is to manipulate DOM elements: it turns the data returned by request into a DOM that can be operated on. Even better, cheerio's API is just like jQuery's, using $ to select the matching DOM nodes. Isn't that convenient? For a front-end programmer, this is more convenient than Python's XPath and BeautifulSoup by I don't know how much. Haha.

The installation commands are also very simple: npm install request --save and npm install cheerio

3. Import the dependency packages and use them

Next, let's write a crawler with request and cheerio!

First, import the dependencies:

var request = require("request");
var cheerio = require("cheerio");

Let's take crawling our school's news page as an example; the link is http://news.shu.edu.cn/Default.aspx?tabid=446

Then call request's interface:

request('http://news.shu.edu.cn/Default.aspx?tabid=446', function (err, result) {
    if (err) {
        console.log(err);
        return;
    }
    console.log(result.body);
});

Run it, and the raw HTML of the page is printed.

Exciting, isn't it? Haha, the HTML has come back. But that alone is not enough; the next step is to process the returned data and extract the information we want, which is where cheerio makes its debut.

Pass the result returned by request to cheerio and pull out the information you want. Doesn't the code feel just like writing a front-end script?

request('http://news.shu.edu.cn/Default.aspx?tabid=446', function (err, result) {
    if (err) {
        console.log(err);
        return;
    }
    var $ = cheerio.load(result.body);
    $('a[id^="dnn"]').each(function (index, element) {
        console.log($(element).text());
    });
});

Running it prints the text of each matched news link.

And with that, a simple crawler is finished. Simple, isn't it? Of course, this is far from enough.

4. Set the request header

As we all know, in the HTTP protocol a request sends its headers when the connection is established, and crawling some dynamic web pages requires setting the user agent, cookies, and so on. How do we set them? The code looks like this:

var options = {
    url: startUrl + '?page=1',
    method: 'GET',
    encoding: 'utf-8',
    headers: {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.93 Safari/537.36",
        "cookie": cookies
    }
};

request(options, function (err, response, body) {
    //...
});
5. Concurrency control

Crawling one page is fine, but with many pages and unrestricted concurrency you are bound to get banned, hence concurrency control. The package introduced here is async. As before, install it with npm install async --save and import it with var async = require("async").

Specifically, use its concurrency-limiting method, where urls is the list of pages to crawl:

async.mapLimit(urls, 5, function (url, callback) {
    //...
    fetch(url, callback);
});

The 5 here is the concurrency limit and can be set as you like. Finally, do not forget to invoke the callback when each task finishes; otherwise async has no way of knowing that the limited function has completed, the slot is never released, and the queue blocks.
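Putting the pieces together, here is a minimal sketch of a rate-limited crawl (the &page= parameter is an assumed paging scheme, not taken from the real site): async.mapLimit fetches at most five pages at a time, every branch invokes the callback exactly once, and the final results come back in the same order as the input URLs.

var request = require("request");
var async = require("async");
var cheerio = require("cheerio");

// Assumed paging scheme; adjust to the real site.
var urls = [];
for (var i = 1; i <= 10; i++) {
    urls.push("http://news.shu.edu.cn/Default.aspx?tabid=446&page=" + i);
}

async.mapLimit(urls, 5, function (url, callback) {
    request(url, function (err, response, body) {
        if (err) {
            callback(err);  // report the error and release the slot
            return;
        }
        var $ = cheerio.load(body);
        callback(null, $('a[id^="dnn"]').length);  // e.g. count the matched links
    });
}, function (err, results) {
    if (err) {
        console.log(err);
        return;
    }
    // results is ordered like urls, no matter which page returned first.
    console.log(results);
});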

Summary

At this point, the core of the Node.js crawler has been covered; the rest is entirely up to your own creativity. Finally, here is a simple Sina Weibo crawler I wrote: https://github.com/Fazich/nodeSpider
