Sample Code for an HTTP Crawler Based on Node.js

Source: Internet
Author: User


Every moment, whether you are asleep or not, a massive amount of data is moving across the Internet between clients and servers. http.get and http.request play the corresponding roles of fetching data and submitting it. Next we will write a simple crawler to crawl the course directory of the Node.js chapter of the cainiao (runoob.com) tutorial.

Crawl all data on the homepage of the Node.js tutorial

Create a file named node-http.js with the following code; the comments explain each step:

var http = require('http'); // load the http module
var url = 'http://www.runoob.com/nodejs/nodejs-tutorial.html';

http.get(url, function (res) {
    var html = '';
    // the 'data' event fires here repeatedly; new chunks of html keep
    // arriving until the response is complete
    res.on('data', function (data) {
        html += data;
    });
    // when all the data has been received, the 'end' event fires;
    // here we print the page's HTML
    res.on('end', function () {
        console.log(html);
    });
}).on('error', function () {
    console.log('error occurred when retrieving the tutorial page');
});

Running it in a terminal shows that all of the HTML on the page has been crawled:

G:\node-http> node node-http.js
<!DOCTYPE html>
...

Of course, raw HTML on its own is not much use to us, so we need to do some filtering. For this Node.js tutorial, for example, I want to know which course directories are available so that I can pick what to study. Straight to the code:

However, before that, we need to install the cheerio module. cheerio is a page-scraping module for Node.js: a fast and flexible implementation of the jQuery core, customized specifically for the server side and suitable for all kinds of web crawlers. You can search for the details yourself; since cheerio is used in much the same way as jQuery, getting started is painless.

PS G:\node\node-http> npm install cheerio

Create a node-http-more.js with the following code:

var http = require('http');       // load the http module
var cheerio = require('cheerio'); // load the cheerio module
var url = 'http://www.runoob.com/nodejs/nodejs-tutorial.html';

// filter the Node chapter directory out of the crawled page
function filterNodeChapter(html) {
    // load the crawled HTML into cheerio
    var $ = cheerio.load(html);
    // get each directory link in the left sidebar
    var nodeChapter = $('#leftcolumn a');
    // the final data format I want looks like this, so that we know
    // the address and title of each directory:
    // [{ id: ..., title: ... }]
    var chapterData = [];
    nodeChapter.each(function () {
        // get the address and title of each item
        var id = $(this).attr('href');
        var title = $(this).text();
        chapterData.push({ id: id, title: title });
    });
    return chapterData;
}

// print each piece of data
function getChapterData(nodeChapter) {
    nodeChapter.forEach(function (item) {
        console.log('[' + item.id + '] ' + item.title + '\n');
    });
}

http.get(url, function (res) {
    var html = '';
    // the 'data' event fires here repeatedly; new chunks of html keep
    // arriving until the response is complete
    res.on('data', function (data) {
        html += data;
    });
    // when all the data has been received, the 'end' event fires
    res.on('end', function () {
        // console.log(html)
        // filter out the Node.js course directory
        var nodeChapter = filterNodeChapter(html);
        // print the retrieved data in a loop
        getChapterData(nodeChapter);
    });
}).on('error', function () {
    console.log('error occurred when retrieving the tutorial page');
});
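To see the target [{id, title}] shape without hitting the network, the same extraction can be sketched against a small in-memory sample (the markup below is invented, and a regex is only good enough for a toy sample; real pages should go through a proper parser such as cheerio):

```javascript
// a tiny HTML sample shaped like the tutorial's left sidebar (made up)
var sample =
    '<div id="leftcolumn">' +
    '<a href="/nodejs/nodejs-tutorial.html">Node.js tutorial</a>' +
    '<a href="/nodejs/nodejs-install-setup.html">Node.js installation</a>' +
    '</div>';

// pull each (href, text) pair into the [{id, title}] shape
var chapterData = [];
var re = /<a href="([^"]+)">([^<]+)<\/a>/g;
var m;
while ((m = re.exec(sample)) !== null) {
    chapterData.push({ id: m[1], title: m[2] });
}

console.log(chapterData);
// [ { id: '/nodejs/nodejs-tutorial.html', title: 'Node.js tutorial' },
//   { id: '/nodejs/nodejs-install-setup.html', title: 'Node.js installation' } ]
```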

Running it in a terminal prints the course directory:

G:\node-http> node node-http-more.js
[/nodejs/nodejs-tutorial.html] Node.js tutorial
[/nodejs/nodejs-install-setup.html] Node.js installation configuration
[/nodejs/nodejs-http-server.html] Node.js creating the first application
[/nodejs/nodejs-npm.html] NPM usage introduction
[/nodejs/nodejs-repl.html] Node.js REPL
[/nodejs/nodejs-callback.html] Node.js callback functions
[/nodejs/nodejs-event-loop.html] Node.js event loop
[/nodejs/nodejs-event.html] Node.js EventEmitter
[/nodejs/nodejs-buffer.html] Node.js Buffer
[/nodejs/nodejs-stream.html] Node.js Stream
[/nodejs/nodejs-module-system.html] Node.js module system
...
Not all of the results are shown here; run it yourself to see them all.

And with that, a simple crawler is finished. Try it out yourself. I hope it is helpful for everyone's learning, and I hope you will continue to support this site.
