This project was my first look at Python crawlers and also my graduation design project. At the time, most people chose website-style projects, which are common but usually amount to simple CRUD work, and business-style systems felt like very ordinary system design. I had just read an answer on Zhihu about using computer technology to solve practical problems in everyday life (I won't post the link; those who are interested can search for it), and that inspired me to pick this topic.
Abstract: This Python-based distributed data crawling system collects data whose further application is to support a recommendation system. The project focuses on solving the bottleneck of a single-process, single-machine crawler by building a topic crawler around a Redis-based distributed shared request queue. The system is developed with the Python Scrapy framework, uses XPath to extract and parse the downloaded web pages, uses the Redis database to distribute requests among crawler nodes, uses the MongoDB database for data storage, uses the Django web framework together with the Semantic UI open-source front-end framework to visualize the data in a friendly way, and finally uses Docker to deploy the crawler. The distributed crawler system is designed and implemented for the rental listings of the 58.com (58同城) platform.
I. System Function Architecture
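The abstract above describes a crawler built around a Redis-backed shared request queue. As a rough sketch of how such a distributed setup is typically wired with the scrapy_redis extension (these setting values are illustrative; the project's actual configuration is not shown in this section), the key entries in settings.py would look something like this:

# settings.py -- illustrative scrapy-redis configuration, not taken from the project source

# Replace Scrapy's default scheduler and duplicate filter with the Redis-backed
# versions so that all crawler nodes share one request queue.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep the shared queue in Redis between runs so a crawl can be paused and resumed.
SCHEDULER_PERSIST = True

# Location of the Redis server holding the shared queue (host and port are assumptions).
REDIS_URL = "redis://127.0.0.1:6379"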
Field selection is driven mainly by the intended application of the collected data. Because the machine used for development has a relatively low configuration, image files are not downloaded to the local machine, which reduces the load on a single machine.
(f) Data processing
1) Object Definition Program
Item is the container that holds the crawled data. It is declared by creating a class that inherits from scrapy.item.Item, with each attribute defined as a scrapy.item.Field object; the desired data is collected by instantiating the item and filling in its fields. This system defines eight crawled fields: post title, rent, leasing method, district, community, city, post detail page link, and publish time. The fields are defined according to the needs of the data-processing side. The key code is as follows:
from scrapy import Item, Field

class TcZufangItem(Item):
    # post title
    title = Field()
    # rent
    money = Field()
    # leasing method
    method = Field()
    # district
    area = Field()
    # community
    community = Field()
    # post detail page URL
    targeturl = Field()
    # post publish time
    pub_time = Field()
    # city
    city = Field()
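For context, these fields are filled in by the spider, which (as the abstract notes) parses the downloaded pages with XPath. The actual spider code is not reproduced in this section, so the sketch below is hypothetical: the RedisSpider base class, the redis_key name, and the XPath expressions are assumptions rather than the project's exact implementation.

from scrapy_redis.spiders import RedisSpider
from ..items import TcZufangItem  # the item class defined above

class TcZufangSpider(RedisSpider):
    """Hypothetical sketch of a 58.com rental spider; selectors are illustrative."""
    name = 'tczufang'
    redis_key = 'tczufang:start_urls'  # start-URL queue shared through Redis (assumed key name)

    def parse(self, response):
        item = TcZufangItem()
        # The XPath expressions below are placeholders, not the real 58.com page structure.
        item['title'] = response.xpath('//h1/text()').extract_first()
        item['money'] = response.xpath('//b[@class="house_price"]/text()').extract_first()
        item['targeturl'] = response.url
        yield item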
2) Data Processing Program
The pipeline class defines how the data is saved and output. Items returned from the spider's parse method are passed, in priority order, through the pipeline classes registered in the ITEM_PIPELINES setting, and the output is produced accordingly. In this system the data handed to the pipeline is stored in MongoDB. The key code is as follows:
def process_item(self, item, spider):
    # Drop items with missing required fields before writing to MongoDB.
    if item['pub_time'] == 0:
        raise DropItem("Duplicate Item found: %s" % item)
    if item['method'] == 0:
        raise DropItem("Duplicate Item found: %s" % item)
    if item['community'] == 0:
        raise DropItem("Duplicate Item found: %s" % item)
    if item['money'] == 0:
        raise DropItem("Duplicate Item found: %s" % item)
    if item['area'] == 0:
        raise DropItem("Duplicate Item found: %s" % item)
    if item['city'] == 0:
        raise DropItem("Duplicate Item found: %s" % item)
    zufang_detail = {
        'title': item.get('title'),
        'money': item.get('money'),
        'method': item.get('method'),
        'area': item.get('area', ''),
        'community': item.get('community', ''),
        'targeturl': item.get('targeturl'),
        'pub_time': item.get('pub_time', ''),
        'city': item.get('city', '')
    }
    result = self.db['zufang_detail'].insert(zufang_detail)
    print('[success] the ' + item['targeturl'] + ' wrote to MongoDB database')
    return item
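The method above belongs to a pipeline class. For completeness, a minimal sketch of how such a class typically obtains the self.db handle from MongoDB and how it is registered in ITEM_PIPELINES is shown below; the class name, database name, and connection details are illustrative, not necessarily those used in the project.

import pymongo
from scrapy.exceptions import DropItem  # used by process_item above

class MongoDBPipeline(object):
    """Sketch of the pipeline class around process_item; connection details are assumptions."""

    def __init__(self):
        # Connection parameters are placeholders, not the project's real configuration.
        client = pymongo.MongoClient('localhost', 27017)
        self.db = client['zufang']

# Registered in settings.py so Scrapy routes every item through this pipeline:
# ITEM_PIPELINES = {'tczufang.pipelines.MongoDBPipeline': 300}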
(g) Data visualization design
Data visualization is essentially the conversion of the database contents into a form that users can observe easily. The system stores its data in MongoDB, and the visualization is built on Django + Semantic UI; the effect is shown in the following illustration:
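The view code itself is not included in this section; a minimal sketch of how a Django view might read the stored posts from MongoDB and hand them to a Semantic UI template could look like this (all names, connection details, and the template are hypothetical):

import pymongo
from django.shortcuts import render

def zufang_list(request):
    """Hypothetical view: pull stored posts from MongoDB and render them in a template."""
    # Connection details and template name are placeholders.
    db = pymongo.MongoClient('localhost', 27017)['zufang']
    posts = list(db['zufang_detail'].find().limit(100))  # collection written by the pipeline above
    return render(request, 'zufang_list.html', {'posts': posts})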