Sample Code for an HTTP Crawler Based on Node.js
At every moment, whether you are asleep or awake, a massive amount of data is flowing across the Internet between clients and servers. HTTP GET and POST requests play the two basic roles here: retrieving data and submitting data. Next we will write a simple crawler that crawls the course directory of the Node.js chapter on the runoob (cainiao) tutorial site.
Crawl all data on the homepage of the Node.js tutorial
Create a file named node-http.js with the following code. The code is commented in detail, so read through it yourself:
```javascript
var http = require('http'); // load the http module
var url = 'http://www.runoob.com/nodejs/nodejs-tutorial.html';

http.get(url, function (res) {
    var html = '';
    // the data event fires repeatedly here, appending each new chunk
    // of html until res is complete
    res.on('data', function (data) {
        html += data;
    });
    // when all the data has been received, the end event fires;
    // here we print the HTML of the tutorial page
    res.on('end', function () {
        console.log(html);
    });
}).on('error', function () {
    console.log('error occurred while retrieving the tutorial page data');
});
```
The terminal output shows that all the HTML on the page has been crawled:
```
G:\node-http> node node-http.js
<!DOCTYPE html>
```
Of course, crawling raw HTML by itself is not much use to us; we need to do some filtering. For example, for this Node.js tutorial I want to know what course directories are available, so that I can choose which sections to study. Let's go straight to the code.
Before that, however, we need to install the cheerio module. (cheerio is a page-scraping module for Node.js, customized for the server side: a fast, flexible implementation of the core of jQuery, suitable for all kinds of web crawlers.) You can search for the details yourself. cheerio is used in much the same way as jQuery, so getting started is easy.
```
PS G:\node\node-http> npm install cheerio
```
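Before wiring cheerio into the crawler, it can help to see the shape of the data we are after. The sketch below is a rough, dependency-free preview that pulls each href and link text out of a static HTML snippet with a regular expression. (Regexes are fragile for real-world HTML, which is exactly why the tutorial uses cheerio; the HTML string here is made up for illustration only.)

```javascript
// A hand-written HTML snippet imitating the tutorial's left sidebar.
var html =
    '<div id="leftcolumn">' +
    '<a href="/nodejs/nodejs-tutorial.html">Node.js tutorial</a>' +
    '<a href="/nodejs/nodejs-install-setup.html">Node.js installation</a>' +
    '</div>';

// Collect { id, title } pairs -- the same format the crawler will build.
var chapterData = [];
var re = /<a href="([^"]+)">([^<]+)<\/a>/g;
var match;
while ((match = re.exec(html)) !== null) {
    chapterData.push({ id: match[1], title: match[2] });
}

console.log(chapterData.length); // prints 2
```

With cheerio, the regex loop is replaced by `$('#leftcolumn a').each(...)`, which tolerates attribute order, whitespace, and nesting that would break this regex.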
Create a file named node-http-more.js with the following code:
```javascript
var http = require('http');       // load the http module
var cheerio = require('cheerio'); // load the cheerio module
var url = 'http://www.runoob.com/nodejs/nodejs-tutorial.html';

// filter the Node chapter directory out of the crawled HTML
function filterNodeChapter(html) {
    // load the crawled HTML into cheerio
    var $ = cheerio.load(html);
    // get each directory link in the left sidebar
    var nodeChapter = $('#leftcolumn a');
    // the final data format I want looks like this, so that we know
    // the address and title of each directory entry:
    /*
     * [{ id: , title: }]
     */
    var chapterData = [];
    nodeChapter.each(function (item) {
        // get the address and title of each entry
        var id = $(this).attr('href');
        var title = $(this).text();
        chapterData.push({ id: id, title: title });
    });
    return chapterData;
}

// print each piece of data
function getChapterData(nodeChapter) {
    nodeChapter.forEach(function (item) {
        console.log('[' + item.id + ']' + item.title + '\n');
    });
}

http.get(url, function (res) {
    var html = '';
    // the data event fires repeatedly here, appending each new chunk
    // of html until res is complete
    res.on('data', function (data) {
        html += data;
    });
    // when all the data has been received, the end event fires
    res.on('end', function () {
        // console.log(html)
        // filter out the Node.js course directory
        var nodeChapter = filterNodeChapter(html);
        // loop over and print the retrieved data
        getChapterData(nodeChapter);
    });
}).on('error', function () {
    console.log('error occurred while retrieving the tutorial page data');
});
```
The terminal output prints the course directory:
```
G:\node-http> node node-http-more.js
[/nodejs/nodejs-tutorial.html] Node.js tutorial
[/nodejs/nodejs-install-setup.html] Node.js installation and setup
[/nodejs/nodejs-http-server.html] Node.js creating the first application
[/nodejs/nodejs-npm.html] NPM usage introduction
[/nodejs/nodejs-repl.html] Node.js REPL
[/nodejs/nodejs-callback.html] Node.js callback functions
[/nodejs/nodejs-event-loop.html] Node.js event loop
[/nodejs/nodejs-event.html] Node.js EventEmitter
[/nodejs/nodejs-buffer.html] Node.js Buffer
[/nodejs/nodejs-stream.html] Node.js Stream
[/nodejs/nodejs-module-system.html] Node.js module system
...........
```
Not all the results are shown here; run it yourself to see the full listing.
Now we have finished writing a simple crawler. Please try it yourself. I hope it is helpful to everyone's learning, and I hope you will continue to support this site.