Quickly Build a Real-Time Crawling Cluster


Definition: First, we define targeted crawling. Targeted crawling is a specific crawling requirement in which the target sites and the target pages are known in advance. This article focuses on how to quickly build a real-time targeted crawling system; it does not cover common features such as link analysis and site discovery.
The example system in this article uses Linux + MySQL + Redis + Django + Scrapy + WebKit: Scrapy + WebKit serve as the capture end, Redis stores the link library (linkbase), MySQL stores the web page information (pagebase), and Django provides the crawler management interface. Together they form a quick prototype of a distributed crawling system.
Glossary:
1. Capture loop: the process in which a spider obtains a URL from storage, downloads the corresponding web page from the Internet, stores the page in the database, and then obtains the next URL from storage.
2. linkbase: the storage module of the link library, which holds the basic information of each link. It is the core of the crawling system and is stored in Redis.
3. XPath: a language for locating information in XML documents. XPath can traverse elements and attributes in an XML document and is a major component of the W3C XSLT standard. XPath and related tool libraries are used here for link extraction and information extraction (see the extraction sketch after this glossary).
4. xpathonclick: a Chrome plug-in that lets you click a page element and obtain its XPath, used when editing the extraction templates.
5. Redis: an open-source key-value in-memory database with rich data structures and high access performance, used here to store the linkbase information.
6. Django: the crawler management tool, used for template configuration and for system monitoring feedback. Django is mainly used to manage the database, relying on its built-in Admin functionality.
7. pagebase: the page library, which mainly stores the results of web page capture and page extraction and feeds the dump/export step; it is implemented with MySQL.
8. Scrapy: an open-source, single-machine Python crawler built on the Twisted framework. It already bundles the download and extraction toolkit that most web crawlers need.
9. List page: any page other than the product (detail) pages.
10. Detail page: the target page of the crawl, such as the product page in B2C commodity crawling, for example: http://item.tmall.com/item.htm?id=10321272374
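To make the XPath-based extraction mentioned in items 3, 4 and 10 concrete, here is a minimal sketch using the lxml library; the field names and XPath expressions are hypothetical template entries chosen for illustration, not values taken from this article.

```python
# Minimal sketch of template-driven XPath extraction with lxml.
# The field names and XPath expressions are hypothetical examples of what
# an xpathonclick-edited template might contain.
from lxml import html

DETAIL_TEMPLATE = {
    "title": "//h1/text()",                  # hypothetical: product title node
    "price": "//*[@class='price']/text()",   # hypothetical: price node
}

def extract_fields(page_content, template=DETAIL_TEMPLATE):
    """Apply each configured XPath to the page and keep the first match."""
    tree = html.fromstring(page_content)
    fields = {}
    for name, xpath in template.items():
        matches = tree.xpath(xpath)
        fields[name] = matches[0].strip() if matches else None
    return fields

def extract_links(page_content, base_url):
    """Collect absolute links for the link-extraction step of the capture loop."""
    tree = html.fromstring(page_content)
    tree.make_links_absolute(base_url)
    return [href for _, _, href, _ in tree.iterlinks() if href.startswith("http")]
```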

System Architecture
1. Storage: the Redis + MySQL link/page store is the core of the crawling system. For performance and efficiency, this article combines memory-based Redis with disk-based MySQL. Redis holds the linkbase, which stores the necessary link information such as URL and anchor text; MySQL stores the captured web pages so that they can be extracted and processed later.
A) pagebase: MySQL with database/table sharding stores the web pages (a sharding sketch follows item B below).
B) linkbase: a Redis cluster stores the linkbase information.
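A minimal sketch of what pagebase sharding by urlhash could look like; the shard count, the table layout and the use of the pymysql driver are assumptions made for illustration, not details specified in this article.

```python
# Sketch: routing a captured page to a sharded pagebase table by urlhash.
# Shard count, table schema and the pymysql driver are illustrative assumptions.
import hashlib
import pymysql

NUM_SHARDS = 16  # hypothetical number of page_XX tables

def urlhash(url):
    """Stable hash used as the primary key throughout linkbase/pagebase."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()

def store_page(conn, url, content):
    h = urlhash(url)
    shard = int(h, 16) % NUM_SHARDS   # pick the table for this URL
    table = "page_%02d" % shard       # e.g. page_00 .. page_15
    sql = (
        f"REPLACE INTO {table} (urlhash, url, content, crawl_time) "
        "VALUES (%s, %s, %s, NOW())"
    )
    with conn.cursor() as cur:
        cur.execute(sql, (h, url, content))
    conn.commit()
```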
Several basic data structures:
1. Capture queue (candidate list)
The queue is split into a queue of URLs to be crawled and a queue of URLs to be updated. The queues store urlhash values and use the Redis list data structure: newly extracted URLs are pushed onto the corresponding list, and the spider's capture module pops URLs off it. For each site there are two kinds of capture queues: a list-page queue and a detail-page queue.
2. Link library (linkbase)
The link library is the DB that stores the link information. The key is urlhash and the value is the link info, including URL, purl (parent URL), anchor, XPath, and so on. It is stored directly in Redis using the hash data structure, as a key-value link library that does not differentiate page types.
3. Crawled set (crawled_set)
The crawled set records the urlhash of every page that has been downloaded. It is implemented with a Redis sorted set whose member is the urlhash and whose score is the crawl timestamp. It records which pages have been crawled and when, and is used for later update scheduling and for statistics on captured pages. Like the capture queues, each site has two crawled sets: one for detail pages and one for list pages.
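The three structures above map directly onto Redis types. A minimal redis-py sketch follows; the key naming scheme (site:queue:*, link:*, site:crawled:*) is an assumption chosen for illustration, since the article does not fix an exact key layout.

```python
# Sketch of the three linkbase structures on top of redis-py (3.x or later).
# Key names are illustrative assumptions, not from the article.
import time
import redis

r = redis.Redis(host="localhost", port=6379, db=0)

# 1. Capture queue: a Redis LIST per site, queue type and page type
#    (get_detail, mod_detail, get_list, mod_list).
def push_new_url(site, page_type, urlhash):
    r.lpush(f"{site}:queue:get_{page_type}", urlhash)

def pop_url(site, queue_name):
    return r.rpop(f"{site}:queue:{queue_name}")

# 2. linkbase: a Redis HASH per urlhash holding url, purl, anchor, xpath, ...
def save_linkinfo(urlhash, linkinfo):
    r.hset(f"link:{urlhash}", mapping=linkinfo)

def load_linkinfo(urlhash):
    return r.hgetall(f"link:{urlhash}")

# 3. crawled_set: a sorted set per site and page type, member = urlhash,
#    score = crawl timestamp; used for dedup and update scheduling.
def mark_crawled(site, page_type, urlhash):
    r.zadd(f"{site}:crawled:{page_type}", {urlhash: time.time()})

def already_crawled(site, page_type, urlhash):
    return r.zscore(f"{site}:crawled:{page_type}", urlhash) is not None
```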

2. Scheduling module: the scheduling module is the key to the crawling system; its quality determines the crawler's efficiency. It is built on the Redis linkbase data structures described above, mainly the capture queues, the crawled sets, and the capture priorities. One capture loop looks like this: get a URL, submit it to the capture module's waiting queue, crawl the page, then extract the new links and push them into the queue of URLs waiting to be crawled. Basic configuration of the scheduling system (a minimal configuration sketch follows this list):
A) Crawl frequency (once every how many seconds)
B) Ratio of each capture queue: get_detail, mod_detail, get_list, mod_list
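Such a per-site configuration can be kept as a simple dictionary (or as a Django-managed table); the field names below are illustrative assumptions, not names used in the article.

```python
# Illustrative per-site scheduling configuration; field names are assumptions.
SITE_CONFIG = {
    "tmall.com": {
        "frequency_seconds": 5,     # a) how often the scheduler wakes up for this site
        "queue_ratio": {            # b) share of each crawl pass per queue
            "get_detail": 0.4,      # new detail-page URLs
            "mod_detail": 0.3,      # detail pages due for refresh
            "get_list": 0.2,        # new list-page URLs
            "mod_list": 0.1,        # list pages due for refresh
        },
    },
}
```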
Link extraction: extracts the links on a page and removes duplicates; new links are inserted into the queue of URLs to be crawled.
Content extraction: extracts page information according to the XPath configured in the template and writes it to pagebase.
Offline scheduling: according to the update ratio, regularly selects URLs from the crawled_set and moves them into the mod queues for refresh (a combined scheduling sketch follows).
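Putting these pieces together, one scheduler pass might look like the following sketch; it reuses the key names and the SITE_CONFIG shape assumed in the earlier sketches and is an illustration, not the article's actual code.

```python
# Sketch of one scheduler pass: feed the capture module according to the
# configured ratios, and periodically move stale pages back into the mod queues.
# Key names and SITE_CONFIG follow the assumptions of the earlier sketches.
import time
import redis

r = redis.Redis(host="localhost", port=6379, db=0)
BATCH_SIZE = 100  # hypothetical number of URLs handed to the spiders per pass

def schedule_site(site, config):
    for queue_name, ratio in config["queue_ratio"].items():
        for _ in range(int(BATCH_SIZE * ratio)):
            urlhash = r.rpop(f"{site}:queue:{queue_name}")
            if urlhash is None:
                break
            # hand the URL to the capture module's waiting queue
            r.lpush(f"{site}:queue:fetching", urlhash)

def reschedule_updates(site, page_type, max_age_seconds):
    """Offline scheduling: pages crawled longer ago than max_age go to a mod queue."""
    cutoff = time.time() - max_age_seconds
    for urlhash in r.zrangebyscore(f"{site}:crawled:{page_type}", 0, cutoff):
        r.lpush(f"{site}:queue:mod_{page_type}", urlhash)
```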
3. Capture module: the capture module performs the actual crawling. Its main concerns are coping with the many irregularities found on real web pages and spreading the load politely across the target site's IP addresses, which ties in closely with the scheduling system. For downloading, this article uses the download module from the Scrapy toolkit. The capture module obtains the URLs to crawl for the corresponding site from linkbase, downloads the pages, writes the page information back through the pipeline, performs link and content extraction, and calls the scheduling module to insert the results into linkbase and pagebase.
Download end design:
IP addresses: multiple physical public IP addresses are configured on each machine, and a random one is chosen as the outgoing address for each download.
Capture frequency: the downloader reads the configuration file and selects URLs according to the configured crawl frequency (see the spider sketch below).
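As a rough illustration of the capture end, a Scrapy spider could pop URLs from the fetching queue and hand downloaded pages to an item pipeline. The sketch uses only the stock Scrapy API; the Redis key names and the target site are assumptions carried over from the earlier sketches.

```python
# Sketch of the capture end as a Scrapy spider; key names reuse earlier assumptions.
import redis
import scrapy

class DetailSpider(scrapy.Spider):
    name = "detail_spider"
    site = "tmall.com"  # hypothetical target site

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.r = redis.Redis(host="localhost", port=6379, db=0)

    def start_requests(self):
        # pull URLs the scheduler has placed in the capture module's waiting queue
        while True:
            urlhash = self.r.rpop(f"{self.site}:queue:fetching")
            if urlhash is None:
                break
            linkinfo = self.r.hgetall(f"link:{urlhash.decode()}")
            url = linkinfo.get(b"url")
            if url:
                yield scrapy.Request(url.decode(), callback=self.parse)

    def parse(self, response):
        # yield the raw page; the item pipeline writes it to pagebase and runs
        # link/content extraction before calling the scheduling module
        yield {
            "url": response.url,
            "content": response.text,
            "links": response.xpath("//a/@href").getall(),
        }
```

For the multi-IP requirement, Scrapy's bindaddress key in Request.meta can be used to choose the outgoing IP address; the exact value format should be checked against the Scrapy version in use.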
4. Configuration interface: the configuration interface manages and configures the crawling system, including feeding in new sites, configuring the page extraction templates, and presenting report and monitoring feedback.
Like most general crawling architectures, the system described in this article is organized as follows:

A complete capture data flow:
1. The user provides the seed URLs.
2. The seed URLs enter the new-URL queue in linkbase.
3. The scheduling module selects URLs and puts them into the capture queue of the capture module.
4. The capture module reads the site configuration file and crawls according to the configured frequency.
5. The crawl result is returned through the pipeline interface, and link extraction is performed.
6. Newly discovered links are deduplicated against linkbase and pushed into the new-URL queue of linkbase (see the sketch below).
7. The scheduling module selects URLs and puts them into the waiting queue of the capture module; go to step 4.
8. End.
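Step 6 (dedup of newly discovered links) can be expressed with the structures defined earlier; the following minimal sketch reuses the hypothetical helpers and key names from the sketches above (urlhash, already_crawled, save_linkinfo, push_new_url and the shared Redis connection r).

```python
# Sketch of step 6: dedup a newly extracted link against linkbase and the
# crawled set before queueing it; helpers and key names are the assumptions
# introduced in the earlier sketches.
def submit_new_link(site, page_type, url, purl, anchor):
    h = urlhash(url)  # urlhash() as defined in the pagebase sketch
    if r.exists(f"link:{h}") or already_crawled(site, page_type, h):
        return False  # already in linkbase or already crawled: drop the link
    save_linkinfo(h, {"url": url, "purl": purl, "anchor": anchor})
    push_new_url(site, page_type, h)
    return True
```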
System expansion: the core of the crawling system described here is the scheduling and storage modules. Capture, storage, and scheduling interact only through data, so each module can be scaled horizontally: to grow the system you only need to expand the MySQL and Redis storage clusters and the capture cluster in parallel. Of course, naive expansion brings its own problems, such as the proliferation of junk list pages and the bloat of the link library; these issues will be discussed later.
