Node.js + Express: a simple web crawler tutorial

I recently began learning node.js, then set it aside and forgot most of what I had learned. So I'm starting over, and a simple crawler seems like a good place to begin.

What is a crawler?

Baidu Encyclopedia's explanation:

A crawler is a web spider: a program that automatically fetches the content of web pages. It is an important component of search engines, so search engine optimization is, to a large extent, optimization aimed at crawlers.

In layman's terms:

You fetch information from someone else's website onto your own machine, then do some processing on it: filtering, sorting, extracting pictures, links, and so on, until you are left with the information you need.

If the volume of data is large enough, your algorithms are clever enough, and you can offer search services to other people, then your crawler is a little Baidu or a little Google.

What is the robots protocol?

Now that we know what a crawler is, let's look at the crawler's protocol, which determines what is allowed to be crawled.

The robots protocol (also known as the crawler protocol, robot protocol, etc.), whose full name is the Robots Exclusion Standard, is how a website tells search engines which pages may be crawled and which may not.

robots.txt is a plain text file, and it is a convention rather than a command. It is the first file a search engine looks at when it visits a website, and it tells the spider which files on the server may be viewed.

When a search spider visits a site, it first checks whether robots.txt exists in the site's root directory. If it does, the spider determines the scope of its visit according to the file's contents;

if the file does not exist, the spider can access every page on the site that is not password protected. Baidu's official advice is to use a robots.txt file only when your site contains content you do not want search engines to index; if you want search engines to index everything on your site, do not create a robots.txt file at all.

If you think of the site as a hotel room, robots.txt is the "Do Not Disturb" or "Please Clean My Room" card the occupant hangs on the door. It tells visiting search engines which rooms may be entered and which are off limits because they hold valuables or might involve the privacy of occupants and guests. But robots.txt is not an order, and it is not a firewall: like a doorman, it cannot stop a determined intruder such as a burglar.
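For reference, a minimal robots.txt might look like this (the /admin/ path is made up purely for illustration):

User-agent: *      # the rules below apply to every crawler
Disallow: /admin/  # pages under /admin/ must not be crawled
Allow: /           # everything else may be crawled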

Setting up the environment

Required environment: Node.js

Modules to install: express, request, cheerio

You can find each module's usage at https://www.npmjs.com by typing the module name into the search box, e.g. request

1. Express needs no introduction here; you can read up on it at its Chinese-language site: http://www.expressjs.com.cn/

2. The request module makes HTTP requests simpler. One of the simplest possible examples:

var request = require('request');

request('http://www.google.com', function (error, response, body) {
  if (!error && response.statusCode == 200) {
    console.log(body); // print the HTML of the fetched page
  }
});

Installation: npm install request
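request also works as a stream. As a quick sketch (the file name doc.html here is just an example), you could save a crawled page straight to disk:

var fs = require('fs');
var request = require('request');

// request() returns a readable stream, so the response body
// can be piped directly into a file on disk
request('http://www.cnblogs.com').pipe(fs.createWriteStream('doc.html'));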

3. cheerio is a fast and flexible implementation of jQuery's core, built specifically for the server.

With cheerio, we can handle the content we crawl just as if we were using jQuery. Click here for more details: https://cnodejs.org/topic/5203a71844e76d216a727d2e

var cheerio = require('cheerio'),
    $ = cheerio.load('<h2 class="title">Hello world</h2>');
$('h2.title').text('Hello there!');
$.html(); //=> <h2 class="title">Hello there!</h2>

Installation: npm install cheerio

Crawler in practice

Let's assume you already have node and express installed on your machine. Now we can start on our little crawler program:

1. First pick any drive; say it's the F drive. In a CMD window, run: express mySpider

You will then find a mySpider folder with some files in it on your F drive. cd into the folder and run: npm install

2. Next install request: npm install request --save, and install cheerio: npm install cheerio --save

3. Once the installs finish, run npm start; if you want the app to restart automatically whenever a file changes, run supervisor start app.js instead. Then open localhost:3000 in your browser, and you should see Express's welcome page. (The full command sequence is recapped below.)
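To recap steps 1-3, the whole setup looks roughly like this in the CMD window (assuming the express command-line generator and supervisor are installed globally):

express mySpider
cd mySpider
npm install
npm install request --save
npm install cheerio --save
supervisor start app.js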

4. Open the app.js file. You will find quite a lot of code in it; since this is just a small crawler, most of it isn't needed, so delete it. The Express API docs contain the snippet below; paste it into app.js:

app.js

var express = require('express');
var app = express();

app.get('/', function (req, res) {
  res.send('Hello World');
});

app.listen(3000);

5. Now request makes its entrance. Modify app.js so it reads:

var express = require('express');
var app = express();
var request = require('request');

app.get('/', function (req, res) {
  request('http://www.cnblogs.com', function (error, response, body) {
    if (!error && response.statusCode == 200) {
      res.send('Hello World');
    }
  });
});

app.listen(3000);

The URL passed to request is the website we want to crawl. Since the site we are going to crawl is cnblogs (博客园), that is the URL we pass in.

6. Bring in cheerio so that we can work with the content of the crawled site. Modify app.js again:

var express = require('express');
var app = express();
var request = require('request');
var cheerio = require('cheerio');

app.get('/', function (req, res) {
  request('http://www.cnblogs.com', function (error, response, body) {
    if (!error && response.statusCode == 200) {
      // body holds the HTML content of the crawled page
      var $ = cheerio.load(body); // $ can now be used like jQuery over that HTML
      var navText = $('.post_nav_block').html(); // grab the navigation bar's contents
      res.send(navText);
    }
  });
});

app.listen(3000);

The content we crawl comes back in request's body parameter, and cheerio lets us query it with ordinary DOM selectors. Here we want the content of the navigation bar: the ul whose class is post_nav_block.
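cheerio offers more than .html(). As a small sketch (same selector as above, and assuming the page structure stays as described), here is how you could collect the text and href of every link inside that navigation bar:

var links = [];
$('.post_nav_block a').each(function () {
  // inside .each, `this` refers to the current DOM element,
  // so $(this) wraps it the same way jQuery would
  links.push({
    text: $(this).text(),
    href: $(this).attr('href')
  });
});
console.log(links);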

The browser then displays what we grabbed.

That means our little crawler works. Of course, it is about as simple as a crawler can possibly get, but this article stops here; the goal for today was just to get a rough sense of how crawling works.

The next article will upgrade and rework this crawler: asynchronous requests, concurrency, scheduled crawls, and so on.

Code address: https://github.com/xianyulaodi/mySpider
