Using Node.js as a Web Crawler


Introduction

The first language that comes to mind is Python, because Python gives the feeling of being capable of anything. But my earlier experience of writing crawlers in Python was quite uncomfortable, mainly for three reasons: first, DOM manipulation on the crawled pages; second, character-encoding handling; third, multi-threading. So Python is not actually that pleasant here. Is there a more pleasant way? Of course there is: Node.js!

The merits of Node.js as a crawler

First, the advantages of Node as a crawler.

The first is that its driving language is JavaScript. Before Node.js was born, JavaScript was a scripting language that ran in the browser, and its strength is manipulating the DOM elements of a web page; for this kind of work it is unmatched by other languages.

The second is that Node.js is single-threaded and asynchronous. That sounds strange: how can a single thread be asynchronous? Think about why a single-core CPU can multitask in an operating system: the CPU hands out time slices, each process occupies only a very short slice, but all processes are cycled through many times, so it looks as if many tasks run at once. JavaScript works on a similar principle: it has an event queue, and the engine loops over the queue handling events whose responses have arrived; an event that has not yet responded is not placed in the queue, so it does not block subsequent operations. The advantage for crawlers is that when pages are fetched concurrently, a page that has not returned does not block subsequent pages from loading, so there is no need for multi-threading as in Python.
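A minimal sketch of this behavior using plain timers (no network involved): both tasks below are scheduled back to back on a single thread, the script moves on immediately, and the faster task finishes first even though nothing runs in parallel.

setTimeout(function () {
    console.log("slow task done");   // fires after ~1000 ms
}, 1000);

setTimeout(function () {
    console.log("fast task done");   // fires after ~10 ms, before the slow task
}, 10);

console.log("both tasks scheduled; the single thread was never blocked");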

Next, the disadvantages of Node.

The first is asynchronous concurrency itself. Handled well it is very convenient; handled badly it is very troublesome. For example, if you crawl 10 pages and Node does no ordering of the asynchronous results, the pages returned will not necessarily come back as 1, 2, 3, 4 ... ; the order is likely to be random. One solution is to add a page sequence stamp, write the crawled data to a CSV file, and then re-sort it, as sketched below.
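A minimal sketch of the sequence-stamp idea (onPageFetched and sortedResults are hypothetical helpers; the actual fetching and CSV writing are omitted): stamp each record with its page number at crawl time, then restore the order afterwards.

var results = [];

// Hypothetical handler: called whenever a page finishes crawling,
// in whatever order the responses happen to arrive.
function onPageFetched(pageNo, data) {
    results.push({ page: pageNo, data: data });  // stamp the record with its page number
}

// Once every page has returned, restore the original page order.
function sortedResults() {
    return results.slice().sort(function (a, b) {
        return a.page - b.page;
    });
}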

The second disadvantage is data processing, where Node is not as good as Python. If you only need to crawl data, Node is perfectly fine; but if you want to go on to statistical analysis of that data, to run regression or clustering, then you cannot get all the way there with Node alone.

How to write a crawler with Node.js

Now let's talk about how to write a crawler with Node.js.

1. Initialize project files

In the project folder, run npm init to initialize a package.json file.

2. Install the request and cheerio dependency packages

request should sound familiar; it works much like Python's requests library. Its job is to establish a connection to the target page and return the corresponding data, which is not hard to understand.

cheerio's job is to manipulate DOM elements: it turns the data returned by request into a DOM that can be operated on. Even better, cheerio's API is just like jQuery's, using $ to select the matching DOM nodes. Isn't that convenient? For a front-end programmer, this is more convenient than Python's XPath and BeautifulSoup by I don't know how much. Haha.

The installation commands are also very simple: npm install request --save and npm install cheerio

3. Import the dependency packages and use them

Next, let's write a crawler with request and cheerio!

First, import the dependencies:

var request = require("request");
var cheerio = require("cheerio");

Let's take crawling our school's news page as an example; the link is http://news.shu.edu.cn/Default.aspx?tabid=446

Then call request's interface:

request('http://news.shu.edu.cn/Default.aspx?tabid=446', function (err, result) {
    if (err) {
        console.log(err);
        return;
    }
    console.log(result.body);
});

Run it, and the raw HTML of the page is printed.

Exciting, isn't it? Haha, the HTML has come back. But that alone is not enough; the next step is to process the returned data and extract the information we want, which is where cheerio makes its debut.

Pass the result returned by request to cheerio and pull out the information you want. Doesn't the code feel just like writing a front-end script?

request('http://news.shu.edu.cn/Default.aspx?tabid=446', function (err, result) {
    if (err) {
        console.log(err);
        return;
    }
    var $ = cheerio.load(result.body);
    $('a[id^="dnn"]').each(function (index, element) {
        console.log($(element).text());
    });
});

Running it prints the text of each matched news link.

And with that, a simple crawler is finished. Simple, isn't it? Of course, this is far from enough.

4. Set the request header

As we all know, in the HTTP protocol a request sends its headers when the connection is established, and crawling some dynamic web pages requires setting the user agent, cookies, and so on. How do we set them? The code looks like this:

var options = {
    url: startUrl + '?page=1',
    method: 'GET',
    encoding: 'utf-8',
    headers: {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.93 Safari/537.36",
        "cookie": cookies
    }
};

request(options, function (err, response, body) {
    //...
});
5. Concurrency control

Crawling one page is fine, but with many pages and unrestricted concurrency you are bound to get banned, hence concurrency control. The package introduced here is async. As before, install it with npm install async --save and import it with var async = require("async").

Specifically, use its concurrency-limiting method, where urls is the list of pages to crawl:

async.mapLimit(urls, 5, function (url, callback) {
    //...
    fetch(url, callback);
});

The 5 here is the concurrency limit and can be set as you like. Finally, do not forget to invoke the callback when each task finishes; otherwise async has no way of knowing that the limited function has completed, the slot is never released, and the queue blocks.
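Putting the pieces together, here is a minimal sketch of a rate-limited crawl (the &page= parameter is an assumed paging scheme, not taken from the real site): async.mapLimit fetches at most five pages at a time, every branch invokes the callback exactly once, and the final results come back in the same order as the input URLs.

var request = require("request");
var async = require("async");
var cheerio = require("cheerio");

// Assumed paging scheme; adjust to the real site.
var urls = [];
for (var i = 1; i <= 10; i++) {
    urls.push("http://news.shu.edu.cn/Default.aspx?tabid=446&page=" + i);
}

async.mapLimit(urls, 5, function (url, callback) {
    request(url, function (err, response, body) {
        if (err) {
            callback(err);  // report the error and release the slot
            return;
        }
        var $ = cheerio.load(body);
        callback(null, $('a[id^="dnn"]').length);  // e.g. count the matched links
    });
}, function (err, results) {
    if (err) {
        console.log(err);
        return;
    }
    // results is ordered like urls, no matter which page returned first.
    console.log(results);
});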

Summary

At this point, the core of the Node.js crawler has been covered; the rest is entirely up to your own creativity. Finally, here is a simple Sina Weibo crawler I wrote: https://github.com/Fazich/nodeSpider
