Sample Code for an HTTP Crawler Based on Node.js
At every moment, whether you are asleep or awake, a massive amount of data is flowing across the Internet between clients and servers. HTTP GET and POST requests play the two basic roles here: retrieving data and submitting data. Next we will write a simple crawler that crawls the course directory of the Node.js chapter on the runoob (cainiao) tutorial site.
Crawl all data on the homepage of the Node.js tutorial
Create a file named node-http.js with the following code. The code is commented in detail, so read through it yourself:
```javascript
var http = require('http'); // load the http module
var url = 'http://www.runoob.com/nodejs/nodejs-tutorial.html';

http.get(url, function (res) {
    var html = '';
    // the data event fires repeatedly here, appending each new chunk
    // of html until res is complete
    res.on('data', function (data) {
        html += data;
    });
    // when all the data has been received, the end event fires;
    // here we print the HTML of the tutorial page
    res.on('end', function () {
        console.log(html);
    });
}).on('error', function () {
    console.log('error occurred while retrieving the tutorial page data');
});
```
The terminal output shows that all the HTML on the page has been crawled:
```
G:\node-http> node node-http.js
<!DOCTYPE html>
```
Of course, crawling raw HTML by itself is not much use to us; we need to do some filtering. For example, for this Node.js tutorial I want to know what course directories are available, so that I can choose which sections to study. Let's go straight to the code.
Before that, however, we need to install the cheerio module. (cheerio is a page-scraping module for Node.js, customized for the server side: a fast, flexible implementation of the core of jQuery, suitable for all kinds of web crawlers.) You can search for the details yourself. cheerio is used in much the same way as jQuery, so getting started is easy.
```
PS G:\node\node-http> npm install cheerio
```
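Before wiring cheerio into the crawler, it can help to see the shape of the data we are after. The sketch below is a rough, dependency-free preview that pulls each href and link text out of a static HTML snippet with a regular expression. (Regexes are fragile for real-world HTML, which is exactly why the tutorial uses cheerio; the HTML string here is made up for illustration only.)

```javascript
// A hand-written HTML snippet imitating the tutorial's left sidebar.
var html =
    '<div id="leftcolumn">' +
    '<a href="/nodejs/nodejs-tutorial.html">Node.js tutorial</a>' +
    '<a href="/nodejs/nodejs-install-setup.html">Node.js installation</a>' +
    '</div>';

// Collect { id, title } pairs -- the same format the crawler will build.
var chapterData = [];
var re = /<a href="([^"]+)">([^<]+)<\/a>/g;
var match;
while ((match = re.exec(html)) !== null) {
    chapterData.push({ id: match[1], title: match[2] });
}

console.log(chapterData.length); // prints 2
```

With cheerio, the regex loop is replaced by `$('#leftcolumn a').each(...)`, which tolerates attribute order, whitespace, and nesting that would break this regex.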
Create a file named node-http-more.js with the following code:
```javascript
var http = require('http');       // load the http module
var cheerio = require('cheerio'); // load the cheerio module
var url = 'http://www.runoob.com/nodejs/nodejs-tutorial.html';

// filter the Node chapter directory out of the crawled HTML
function filterNodeChapter(html) {
    // load the crawled HTML into cheerio
    var $ = cheerio.load(html);
    // get each directory link in the left sidebar
    var nodeChapter = $('#leftcolumn a');
    // the final data format I want looks like this, so that we know
    // the address and title of each directory entry:
    /*
     * [{ id: , title: }]
     */
    var chapterData = [];
    nodeChapter.each(function (item) {
        // get the address and title of each entry
        var id = $(this).attr('href');
        var title = $(this).text();
        chapterData.push({ id: id, title: title });
    });
    return chapterData;
}

// print each piece of data
function getChapterData(nodeChapter) {
    nodeChapter.forEach(function (item) {
        console.log('[' + item.id + ']' + item.title + '\n');
    });
}

http.get(url, function (res) {
    var html = '';
    // the data event fires repeatedly here, appending each new chunk
    // of html until res is complete
    res.on('data', function (data) {
        html += data;
    });
    // when all the data has been received, the end event fires
    res.on('end', function () {
        // console.log(html)
        // filter out the Node.js course directory
        var nodeChapter = filterNodeChapter(html);
        // loop over and print the retrieved data
        getChapterData(nodeChapter);
    });
}).on('error', function () {
    console.log('error occurred while retrieving the tutorial page data');
});
```
The terminal output prints the course directory:
```
G:\node-http> node node-http-more.js
[/nodejs/nodejs-tutorial.html] Node.js tutorial
[/nodejs/nodejs-install-setup.html] Node.js installation and setup
[/nodejs/nodejs-http-server.html] Node.js creating the first application
[/nodejs/nodejs-npm.html] NPM usage introduction
[/nodejs/nodejs-repl.html] Node.js REPL
[/nodejs/nodejs-callback.html] Node.js callback functions
[/nodejs/nodejs-event-loop.html] Node.js event loop
[/nodejs/nodejs-event.html] Node.js EventEmitter
[/nodejs/nodejs-buffer.html] Node.js Buffer
[/nodejs/nodejs-stream.html] Node.js Stream
[/nodejs/nodejs-module-system.html] Node.js module system
...........
```
Not all the results are shown here; run it yourself to see the full listing.
Now we have finished writing a simple crawler. Please try it yourself. I hope it is helpful to everyone's learning, and I hope you will continue to support this site.