Writing a Small Crawler with Node.js

1. Crawlers and the robots protocol

A crawler is a program that automatically fetches Web content. It is an important part of a search engine, so search engine optimization is, to a large extent, optimization for crawlers.


robots.txt is a text file; robots is a convention (a protocol), not a command. robots.txt is the first file a crawler looks at when it visits a site. The file tells the crawler which files on the server may be viewed, and a well-behaved search robot follows its contents to determine the scope of its access.

For example, we can request a site's robots.txt file directly to see which paths the site disallows and which it allows, as in the sketch below.
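
A minimal sketch, using the request module introduced in the next section, that fetches and prints a site's robots.txt (the URL here is just an illustration):

var request = require('request');

// Fetch the robots.txt of a site; www.example.com is a placeholder.
request('http://www.example.com/robots.txt', function (error, response, body) {
    if (!error && response.statusCode == 200) {
        // Typical contents look like:
        //   User-agent: *
        //   Disallow: /private/
        //   Allow: /public/
        console.log(body);
    }
});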

2. Modules to install for crawling web pages with Node.js

Express
Express is a minimal, flexible web application framework for the Node.js platform that provides a robust set of features for building all kinds of web and mobile applications.
Chinese API: http://www.expressjs.com.cn/
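
As a quick illustration (a minimal sketch, assuming Express 4.x is installed), a server that answers on port 3000 looks like this:

var express = require('express');
var app = express();

// Respond to GET / with a plain-text greeting.
app.get('/', function (req, res) {
    res.send('Hello from Express');
});

app.listen(3000);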

Request
Simplifies making HTTP requests.
API: https://www.npmjs.com/package/request
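
A minimal sketch of a GET request with this module (the URL is illustrative):

var request = require('request');

request('http://www.example.com', function (error, response, body) {
    // body holds the raw HTML of the page when the request succeeds
    if (!error && response.statusCode == 200) {
        console.log(body);
    }
});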


Cheerio
Lets you work with the crawled page using jQuery-like selectors.
API: https://www.npmjs.com/package/cheerio
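
A minimal sketch of loading an HTML fragment and querying it with jQuery-style selectors (the markup is made up for illustration):

var cheerio = require('cheerio');

var $ = cheerio.load('<ul><li class="item">one</li><li class="item">two</li></ul>');

// Iterate over the matched elements, as with jQuery's each().
$('.item').each(function (i, el) {
    console.log($(el).text()); // prints "one", then "two"
});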

After installing Node.js, all three modules can be installed with npm, e.g. npm install express request cheerio.

3. A simple page-crawling example

var express = require('express');
var app = express();
var request = require('request');
var cheerio = require('cheerio');

app.get('/', function (req, res) {
    request('http://blog.csdn.net/lhc1105', function (error, response, body) {
        if (!error && response.statusCode == 200) {
            var $ = cheerio.load(body); // now $ works as a front-end-style selector over the whole page
            var name = $('.user_name').text(); // the user name on my blog page
            console.log(name);
            res.send(name); // also answer the browser so the request completes
        } else {
            console.log("Sorry, didn't crawl the user name, one more time");
            res.send('crawl failed');
        }
    });
});

app.listen(3000);


After that, save the code to a file (say app.js) and run it with node app.js.

Then visit http://localhost:3000/ in the browser, and you can see the user name in the output.

This feels more convenient than crawling with Python, mainly for parsing page elements: the selector syntax saves a lot of regular expressions.

By the by, Happy New Year ~ ~ ~
