Node + express crawler tutorial: a simple node crawler

I recently started learning node.js again, having forgotten everything I learned before, so I am relearning it from scratch. Let's start with a simple crawler.

What is a crawler?

Baidu encyclopedia's explanation:

A web crawler is a program that automatically fetches webpage content. It is an important component of a search engine, so search engine optimization is, to a large extent, crawler optimization.

To put it simply:

You fetch information from other websites onto your own machine, then process it: filter it, sort it, extract images and links, and so on, until you are left with the information you need.

If the amount of data is large and your algorithms are good enough that other people can search through it, your crawler becomes a little Baidu or a little Google.

What is the robots protocol?

Now that we know what a crawler is, let's look at the crawler protocol, that is, the rules about what a crawler is allowed to crawl.

The full name of the Robots protocol (also known as the crawler protocol or robot protocol) is the "Robots Exclusion Protocol". A website uses the Robots protocol to tell search engines which pages may be crawled and which may not.

robots.txt is a plain text file; it is a convention rather than a command. It is the first file a search engine looks at when it visits a website, and it tells the spider which files on the server may be viewed.

When a search spider visits a site, it first checks whether robots.txt exists in the site's root directory. If it does, the spider determines its crawling scope based on the contents of that file; if it does not, the spider will crawl every page of the site that is not password protected. Baidu's official suggestion: use a robots.txt file only when your website contains content that you do not want search engines to index; if you want search engines to index everything on your site, do not create a robots.txt file at all.
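For illustration only (this sample is not from the original article, and the paths are invented), a minimal robots.txt might look like this:

User-agent: *
Disallow: /admin/

User-agent: Baiduspider
Disallow: /private/

Crawlers in general are asked to stay out of /admin/, while Baidu's spider follows its own more specific group and stays out of /private/; anything not disallowed may be crawled.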

If you think of a website as a room, robots.txt is the "Do Not Disturb" or "Welcome to clean" sign the host hangs on the door. It tells visiting search engines which rooms may be entered and which are off limits because they hold valuables or involve the privacy of residents and visitors. But robots.txt is neither a command nor a firewall; like a sign at the door, it cannot by itself stop thieves or other malicious intruders.

Environment setup

Required environment: Node.js

Modules to install: express, request, cheerio

You can look up how to use each module at https://www.npmjs.com by entering the module name directly, for example: request

1. express will not be introduced here; you can check its Chinese website at http://www.expressjs.com.cn/

2. The request module simplifies HTTP requests. The simplest example is as follows:

var request = require('request');

request('http://www.cnblogs.com', function (error, response, body) {
  if (!error && response.statusCode == 200) {
    console.log(body); // print the HTML of the fetched page
  }
});

Installation: npm install request

3. cheerio is a fast, flexible, and lean implementation of core jQuery designed specifically for the server.

With cheerio we can manipulate the crawled content just as if we were using jQuery. You can read more here: https://cnodejs.org/topic/5203a71844e76d216a727d2e

var cheerio = require('cheerio'),
    $ = cheerio.load('<h2 class="title">Hello world</h2>');

$('h2.title').text('Hello there!');
$('h2').addClass('welcome');

$.html();
//=> <h2 class="title welcome">Hello there!</h2>

Installation: npm install cheerio

Crawler practice

Assuming node and express are already installed on your computer, let's start our little crawler:

1. First, switch to a drive, say drive F, and run the following command in cmd: express mySpider

You will then find a mySpider folder with some files on that drive (a sketch of the generated layout follows). Enter the folder and run this command in cmd: npm install
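For reference (this layout is not shown in the original article and varies with the express generator version), the generated mySpider folder typically looks roughly like this:

mySpider/
  app.js          the application entry that wires up middleware and routes
  package.json    project metadata and dependency list
  bin/www         the startup script that npm start runs
  public/         static assets (javascripts, stylesheets, images)
  routes/         route handlers (index.js, users.js)
  views/          view templates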

2. Then install request: npm install request --save, and cheerio: npm install cheerio --save (the resulting package.json dependencies are sketched below).
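After these two installs, the --save flag records both modules in package.json, whose dependencies section should end up looking roughly like this (the exact version numbers are only examples, not taken from the article):

"dependencies": {
  "express": "~4.13.1",
  "request": "^2.67.0",
  "cheerio": "^0.19.0"
}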

3. After installation, run npm start. If you want the server to restart automatically when files change, you can run it with supervisor instead (see the note below). Then enter localhost:3000 in the browser, and you will see the express welcome page.
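A note on supervisor, since the article does not show how to set it up: the usual node-supervisor workflow (an assumption here, not something spelled out in the original) is to install it globally and point it at the entry file:

npm install -g supervisor
supervisor app.js

supervisor then restarts app.js automatically whenever a watched file changes, so you do not have to stop and re-run npm start after every edit.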

4. Open app.js and you will find it already contains quite a lot of code. Since this is just a small crawler, most of it is not needed and can be deleted. Take the following code from the express API and paste it into app.js:

app.js

var express = require('express');
var app = express();

app.get('/', function (req, res) {
  res.send('hello world');
});

app.listen(3000);

5. Now bring in request. Change app.js as follows:

var express = require('express');
var app = express();
var request = require('request');

app.get('/', function (req, res) {
  request('http://www.cnblogs.com', function (error, response, body) {
    if (!error && response.statusCode == 200) {
      res.send('hello world');
    }
  });
});

app.listen(3000);

Here, the URL passed to request is the website we want to crawl. The site we want to crawl is the cnblogs home page, so that is the address we pass in.

6. Introduce cheerio so that we can work with the content of the crawled website, and continue to modify app.js:

var express = require('express');
var app = express();
var request = require('request');
var cheerio = require('cheerio');

app.get('/', function (req, res) {
  request('http://www.cnblogs.com', function (error, response, body) {
    if (!error && response.statusCode == 200) {
      // the returned body is the HTML of the crawled page
      var $ = cheerio.load(body); // $ now works like a jQuery selector over the whole body
      var navText = $('.post_nav_block').html(); // get the content of the navigation bar
      res.send(navText);
    }
  });
});

app.listen(3000);

Everything we crawl comes back in the request body, and cheerio loads it so that we can use DOM selectors on it. Suppose we want the navigation content: the ul whose class is post_nav_block.

Then the crawled navigation content is displayed in the browser.

This shows that our little crawler works. Of course, it is a very simple web crawler, and this article stops here; the goal was just to get familiar with the basic crawling process.
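As a small extension that is not part of the original article, and assuming the navigation really is a ul with class post_nav_block containing li > a elements (an assumption about the cnblogs markup), the text and URL of each navigation link could be pulled out like this:

var request = require('request');
var cheerio = require('cheerio');

request('http://www.cnblogs.com', function (error, response, body) {
  if (!error && response.statusCode == 200) {
    var $ = cheerio.load(body);
    var links = [];
    // assumed markup: <ul class="post_nav_block"><li><a href="...">...</a></li>...</ul>
    $('.post_nav_block li a').each(function () {
      links.push({
        text: $(this).text(),        // the visible label of the navigation item
        href: $(this).attr('href')   // the link target
      });
    });
    console.log(links); // print the extracted navigation links
  }
});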

A follow-up article will extend the crawler, covering things such as asynchronous, concurrent, and scheduled crawling.

Address: https://github.com/xianyulaodi/mySpider
