1. Crawlers and the robots protocol
A crawler is a program that automatically fetches web content. It is an important component of a search engine, so search engine optimization is, to a large extent, optimization for crawlers.
Robots.txt is a text file; robots is a protocol (a convention), not a command. Robots.txt is the first file a crawler looks at: it tells the crawler which files on the server may be fetched, and a well-behaved search robot follows its contents to determine the scope of its access.
For example, we can request a site's robots.txt file directly to see which paths the site forbids and which it allows.
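As an illustration, a typical robots.txt looks something like this (the paths below are hypothetical examples, not from any particular site):

```
User-agent: *
Disallow: /admin/
Allow: /public/
Sitemap: http://www.example.com/sitemap.xml
```

`User-agent` names which crawler the rules apply to (`*` means all), and `Disallow`/`Allow` mark the paths the crawler should skip or may visit.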
2. Modules needed to crawl web pages with Node.js
Express
Express is a minimal, flexible web application development framework for the Node.js platform that provides a range of powerful features for building web and mobile applications.
Chinese API documentation: http://www.expressjs.com.cn/
Request
Request simplifies making HTTP requests.
API: https://www.npmjs.com/package/request
Cheerio
Cheerio lets you process the crawled page with a jQuery-like API.
API: https://www.npmjs.com/package/cheerio
After installing Node.js, these three modules can be installed with the npm command.
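For example (assuming npm is available on your PATH):

```shell
npm install express request cheerio
```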
3. A simple page-crawling example

var express = require('express');
var app = express();
var request = require('request');
var cheerio = require('cheerio');

app.get('/', function (req, res) {
  request('http://blog.csdn.net/lhc1105', function (error, response, body) {
    if (!error && response.statusCode == 200) {
      var $ = cheerio.load(body); // $ now works like a front-end selector over the whole body
      console.log($('.user_name').text()); // get the user name from my blog
    } else {
      console.log('Sorry, did not crawl the user name; try one more time');
    }
  });
});

app.listen(3000);
Then visit http://localhost:3000/ in the browser, and you can see the user name printed to the console.
This feels more convenient than crawling with Python, mainly because cheerio's page-element parsing saves a lot of regular expressions.
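To illustrate the point, here is what the same extraction looks like with a hand-written regular expression over a static HTML string (the markup is made up for illustration); a selector like `$('.user_name').text()` is clearly easier to read and less brittle:

```javascript
// Extracting the text of a class="user_name" element with a raw regex.
var html = '<div class="user_name">lhc1105</div>';

// The pattern must handle the tag name, surrounding attributes, and the
// closing tag by hand -- exactly the bookkeeping a selector library hides.
var match = html.match(/<[^>]*class="user_name"[^>]*>([^<]*)<\//);
console.log(match ? match[1] : null); // prints "lhc1105"
```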
By the by, Happy New Year ~ ~ ~
Writing a small crawler in Node.js