Node.js Crawler System

Source: Internet
Author: User

This crawler uses three packages: express is the server framework; request is the server-side equivalent of a front-end Ajax request; cheerio is the server-side equivalent of jQuery.

First, create a new crawler directory and install Express and its project generator globally:

    npm install express -g
    npm install express-generator -g

Then cd into the crawler directory and install request and cheerio:

    npm install request --save-dev
    npm install cheerio --save-dev

Next, create an Express project in the directory by running the generator from the command line:

    express

The project directory now contains the generated skeleton, including app.js and package.json. Install the project's dependencies:

    npm install

That completes the preliminary work. Now open app.js and replace its contents:

    var express = require('express');
    var app = express();

    app.get('/', function (req, res) {
      res.send('Hello express');
    });

    app.listen(3000, function () {
      console.log('listening on 3000');
    });

In the terminal, run:

    supervisor app.js

(Note: supervisor monitors a Node.js process. If app.js is modified, supervisor restarts it automatically, so you do not need to run node app.js again by hand. Install it with npm install supervisor -g. It is a commonly used tool in Node.js development.)

Open 127.0.0.1:3000 and you will see "Hello express" on the page.

If everything works, move on to request. The request page on npm shows how to use it. Following that example, modify app.js:

    var express = require('express');
    var app = express();
    var request = require('request');

    app.get('/', function (req, res) {
      request('http://www.cnblogs.com/galenyip', function (error, response, body) {
        if (!error && response.statusCode === 200) {
          console.log(body); // print the HTML of the fetched page
          res.send('hello express');
        }
      });
    });

    app.listen(3000, function () {
      console.log('listening on 3000');
    });
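The handler above only replies when the fetch succeeds, so on a network error or a non-200 status the browser keeps waiting. The following is a minimal sketch of one way to cover those branches. The helper name summarizeFetch and the choice of a 502 status are assumptions made for illustration; they are not part of request or express.

```javascript
// Hypothetical helper (not part of request or express): turns the three
// arguments of the request() callback into a status code and a message.
function summarizeFetch(error, response, body) {
  if (error) {
    // Network-level failure: DNS error, refused connection, timeout, etc.
    return { status: 502, message: 'fetch failed: ' + error.message };
  }
  if (response.statusCode !== 200) {
    // The remote server answered, but not with the page we wanted.
    return { status: 502, message: 'upstream returned ' + response.statusCode };
  }
  return { status: 200, message: 'fetched ' + body.length + ' characters' };
}

// Intended wiring inside the route (left as a comment because it needs
// the express and request packages to be installed):
//   request(url, function (error, response, body) {
//     var result = summarizeFetch(error, response, body);
//     res.status(result.status).send(result.message);
//   });

console.log(summarizeFetch(null, { statusCode: 200 }, '<html></html>'));
console.log(summarizeFetch(new Error('ECONNREFUSED'), null, null));
```

Keeping the decision in a plain function means every branch sends something back, and the logic can be checked without starting the server.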

 

The address here is set to my blog. Refresh the page in the browser to crawl the blog; after a short wait, the terminal prints the page's HTML. Next, use cheerio in app.js by adding var cheerio = require('cheerio'):

    var express = require('express');
    var app = express();
    var request = require('request');
    var cheerio = require('cheerio');

    app.get('/', function (req, res) {
      request('http://www.cnblogs.com/galenyip', function (error, response, body) {
        if (!error && response.statusCode === 200) {
          var $ = cheerio.load(body); // load the fetched page as the root selector
          res.send('hello express');  // reply so the browser does not hang
        }
      });
    });

    app.listen(3000, function () {
      console.log('listening on 3000');
    });

 

Here cheerio.load(body) takes the page we fetched and returns it as the root selector, $. From there we can work with the page just as we would with jQuery. The cheerio API is very similar to jQuery's, so it is not covered in detail here. At this point the crawler is essentially complete. What remains is for readers to select the DOM elements they need from the page and filter the results according to their own requirements.
