I recently started learning Node.js, then set it aside and forgot most of it. So I'm starting over, and a simple crawler seems like a good place to begin.
What is a crawler?
Baidu Encyclopedia's explanation:
A crawler (web crawler, or spider) is a program that automatically fetches the content of web pages. It is an important component of search engines, so search engine optimization is, to a large extent, optimization aimed at crawlers.
In layman's terms:
A crawler pulls information down from someone else's website onto your own machine. Then you process it: screening, sorting, extracting images, links, and so on, until you have the information you need.
If the data volume is large enough, your algorithms are good enough, and you can offer search services to others, then your crawler is a little Baidu or a little Google.
What is the robots protocol?
Now that we know what a crawler is, let's look at the crawler's code of conduct, which governs what may and may not be crawled.
The full name of the robots protocol (also known as the crawler protocol, robot protocol, etc.) is the "Robots Exclusion Standard" (Robots Exclusion Protocol). A website uses it to tell search engines which pages may be crawled and which may not.
robots.txt is a plain text file, and it is a convention, not a command. It is the first file a search engine looks at when it visits a website; it tells the spider which files on the server may be viewed.
When a search spider visits a site, it first checks whether robots.txt exists in the site's root directory. If it does, the spider determines its crawl scope according to the file's contents; if the file does not exist, spiders will access every page on the site that is not password protected. Baidu's official advice: use a robots.txt file only if your site contains content you do not want search engines to index; if you want search engines to index everything on your site, do not create a robots.txt file at all.
If we think of a website as a hotel room, robots.txt is the "Do Not Disturb" or "Please Clean" card the occupant hangs on the door. It tells visiting search engines which rooms may be entered and which are off limits because they hold valuables or involve the privacy of occupants and guests. But robots.txt is neither an order nor a firewall: like a doorman, it cannot stop a determined intruder such as a burglar.
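As a concrete sketch (a made-up example, not taken from any real site), a robots.txt might look like this:

```text
# Rules that apply to all crawlers
User-agent: *
Disallow: /admin/
Disallow: /private/

# An extra rule for one specific spider
User-agent: Baiduspider
Disallow: /draft/
```

A crawler that honors the protocol fetches this file from the site root (e.g. http://example.com/robots.txt) before requesting anything else, and skips the disallowed paths.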
Environment setup
Required environment: Node.js
Modules to install: express, request, cheerio
You can look up each module's usage at https://www.npmjs.com; just enter the module name, e.g. request.
1. express needs no introduction here; you can view its Chinese site at: http://www.expressjs.com.cn/
2. The request module makes HTTP requests simpler. One of the simplest examples:
var request = require('request');
request('http://www.google.com', function (error, response, body) {
  if (!error && response.statusCode == 200) {
    console.log(body); // print the HTML of the Google homepage
  }
});
Installation: npm install request
3. cheerio is a fast, flexible implementation of core jQuery, customized for the server.
With cheerio, we can manipulate the content we crawl just as if we were using jQuery. Click here for more: https://cnodejs.org/topic/5203a71844e76d216a727d2e
var cheerio = require('cheerio'),
    $ = cheerio.load('<h2 class="title">Hello world</h2>');
$('h2.title').text(); // 'Hello world'
Installation: npm install cheerio
Crawler in action
Let's assume you already have Node.js and express installed on your computer. Now we can start our little crawler program:
1. First, pick any drive, say F:, and in a cmd window execute: express mySpider
You will then find a mySpider folder with some files in it on your F: drive. Enter the folder and run: npm install
2. Then install request: npm install request --save
and cheerio: npm install cheerio --save
3. After installation, run npm start. If you want changes to be picked up automatically, you can instead run: supervisor start app.js. Then open localhost:3000 in the browser, and you will see Express's welcome page.
4. Open the app.js file and you will find a lot of code in it. Since this is just a small crawler, most of it is unnecessary, so delete it all and paste in this snippet from the Express API docs:
App.js
var express = require('express');
var app = express();
app.get('/', function (req, res) {
  res.send('Hello World');
});
app.listen(3000);
5. Now request makes its debut. Continue modifying app.js to read:
var express = require('express');
var app = express();
var request = require('request');
app.get('/', function (req, res) {
  request('http://www.cnblogs.com', function (error, response, body) {
    if (!error && response.statusCode == 200) {
      res.send('Hello World');
    }
  });
});
app.listen(3000);
The URL passed to request is the site we want to crawl. Since we want to crawl the cnblogs site, we pass in the cnblogs URL.
6. Bring in cheerio so we can manipulate the crawled site content. Continue modifying app.js:
var express = require('express');
var app = express();
var request = require('request');
var cheerio = require('cheerio');
app.get('/', function (req, res) {
  request('http://www.cnblogs.com', function (error, response, body) {
    if (!error && response.statusCode == 200) {
      // the returned body is the captured HTML of the page
      var $ = cheerio.load(body); // $ now selects into the loaded body, jQuery-style
      var navText = $('.post_nav_block').html(); // get the contents of the navigation bar
      res.send(navText);
    }
  });
});
app.listen(3000);
The content we fetched comes back in request's body parameter, and cheerio lets us run DOM selectors over it. To get the navigation content, whose ul element has the class post_nav_block, we select that class and display what's inside it.
This shows that our little crawler works. Of course, it is about as simple as a crawler can get, but this article stops here; the point was just to get a rough feel for the crawling process.
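To see the extraction step in isolation, here is a minimal stand-alone sketch using only Node's standard library. A hypothetical extractBlockHtml helper plays the role of $('.post_nav_block').html() on a made-up HTML sample; a regex is fine for a demo, but for real pages you should stick with a proper parser like cheerio, since regexes are fragile against arbitrary HTML.

```javascript
// A hypothetical stand-in for cheerio's $('.post_nav_block').html():
// grab the inner HTML of the first element whose class attribute
// contains the given class name. Works only for simple, well-formed markup.
function extractBlockHtml(html, className) {
  var re = new RegExp(
    '<(\\w+)[^>]*class="[^"]*\\b' + className + '\\b[^"]*"[^>]*>([\\s\\S]*?)</\\1>'
  );
  var match = re.exec(html);
  return match ? match[2] : null; // inner HTML, or null when not found
}

// A made-up sample body, standing in for what request would hand back
var body =
  '<html><body>' +
  '<ul class="post_nav_block"><li>Home</li><li>News</li></ul>' +
  '</body></html>';

var navText = extractBlockHtml(body, 'post_nav_block');
console.log(navText); // <li>Home</li><li>News</li>
```

The regex captures the tag name so the closing tag matches the opening one; that is exactly the kind of bookkeeping a real HTML parser does more robustly for nested elements.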
In the next article I will upgrade and revise this crawler: asynchrony, concurrency, scheduled crawls, and so on.
Code address: https://github.com/xianyulaodi/mySpider