Python Crawler Tutorial - 34 - Introduction to Distributed Crawlers

Source: Internet
Author: User
Tags: redis, tutorial

  • Distributed crawlers see plenty of real-world use; this article gives a brief introduction to them

    What is a distributed crawler
  • A distributed crawler is a crawler installed on more than one computer, and the key point is joint collection: the machines cooperate on the same crawl. A single-machine crawler, by contrast, runs on just one computer.
  • Search engines are in fact crawlers, responsible for crawling content from websites all over the world and showing you the relevant results when you search for a keyword. But they are giant spiders: the amount of content they crawl is beyond imagination, far more than a single crawler can handle, so they work distributed. If one server is not enough, use 1,000. Many servers spread around the world cooperate to complete the crawling work together, and that is a distributed crawler.
  • Problems with a single-machine crawler:
    • The processing efficiency of one computer is limited
    • I/O throughput is limited by the machine's transfer rates
  • Problems with multiple crawlers:
    • Multiple crawlers must share data
      • For example, when one crawler has already crawled part of a site and downloaded some content, the other crawlers need to know, to avoid repeated crawling and similar problems; hence data sharing is required
    • The machines can be separated in space, i.e., distributed
  • Conditions for running multiple crawlers:
    • They need to share one queue
    • Deduplication, so that no crawler re-crawls what another crawler has already crawled
  • Understanding distributed crawlers:
    • Suppose tens of thousands of URLs need to be crawled, and there are more than 100 crawlers in different cities across the country.
    • The URLs are handed out to different crawlers, but the crawlers do not all run at the same speed. With a shared queue and shared data, the more efficient crawlers simply take on more tasks instead of everyone waiting for the least efficient one (see the sketch below).
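
Here is a minimal single-process sketch of the two conditions above: a shared task queue plus a shared "seen" set for deduplication. The URLs and the sleep-based fetch are hypothetical, purely for illustration; the same pattern also shows the load-balancing point, since the faster worker pulls more tasks from the common queue.

```python
# Minimal sketch: shared queue + shared dedup set (hypothetical URLs,
# sleep() stands in for the actual download and parse).
import queue
import threading
import time

task_queue = queue.Queue()   # the shared queue of URLs to crawl
seen = set()                 # the shared deduplication set
seen_lock = threading.Lock()

def enqueue(url):
    """Queue a URL only if no crawler has seen it yet (deduplication)."""
    with seen_lock:
        if url in seen:
            return
        seen.add(url)
    task_queue.put(url)

def worker(name, delay):
    """A simulated crawler; `delay` stands in for machine/network speed."""
    while True:
        try:
            url = task_queue.get(timeout=1)
        except queue.Empty:
            return
        time.sleep(delay)    # pretend to download and parse the page
        print(f"{name} crawled {url}")

for i in range(20):
    enqueue(f"http://example.com/page/{i}")
enqueue("http://example.com/page/0")  # a duplicate, silently dropped

# A fast and a slow crawler share one queue, so the fast one ends up
# doing most of the work instead of idling.
threads = [threading.Thread(target=worker, args=("fast", 0.01)),
           threading.Thread(target=worker, args=("slow", 0.1))]
for t in threads:
    t.start()
for t in threads:
    t.join()
```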
  • Redis
    • Redis is fully open source and free, complies with the BSD license, and is a high-performance key-value database
    • It is an in-memory database: data is kept in memory
    • It can also persist the saved data to disk
    • It supports deduplication (e.g., with sets)
    • Redis can be understood as a collection of dict, set, and list structures
    • Content saved in Redis can be given a life cycle (keys can expire); the sketch below demonstrates these features
    • Redis Tutorial: Redis Tutorial - Rookie Tutorial
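
As a concrete illustration (not from the original article), here is a small sketch using the redis-py client: a Redis list serves as the shared URL queue, a set handles deduplication, and a TTL gives a cached value a life cycle. It assumes a Redis server on localhost:6379 and `pip install redis`; the key names are made up for the demo.

```python
# Sketch of Redis as shared queue + dedup set + expiring cache.
# Assumes a local Redis server; key names are illustrative only.
import redis

r = redis.Redis(host="localhost", port=6379, db=0, decode_responses=True)

def push_url(url):
    """Queue a URL exactly once: SADD returns 1 only for new members."""
    if r.sadd("crawler:seen", url):
        r.lpush("crawler:queue", url)

push_url("http://example.com/a")
push_url("http://example.com/a")   # duplicate, not queued again

# BRPOP blocks until a URL is available, so many crawler processes on
# many machines can all share this one queue.
_, url = r.brpop("crawler:queue")
print("got task:", url)

# Keys can expire ("life cycle"): cache a page body for 60 seconds.
r.set("cache:" + url, "<html>...</html>", ex=60)
print("ttl:", r.ttl("cache:" + url))
```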
  • Databases for saving the crawled content (a small example follows this list):
    • MongoDB: works through memory, with the data saved on the hard disk
    • MySQL
    • And so on
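
As one hedged possibility, parsed items can be saved to MongoDB with the pymongo client; the sketch below assumes a local MongoDB on the default port, and the database/collection names are illustrative.

```python
# Sketch: store a parsed item in MongoDB (pip install pymongo).
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
items = client["crawler"]["items"]   # illustrative database/collection

item = {"url": "http://example.com/a", "title": "Example", "status": 200}
items.insert_one(item)
print(items.count_documents({"url": "http://example.com/a"}))
```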
Installing scrapy_redis
    • 1. Open cmd
    • 2. Activate the Anaconda environment you want to use
    • 3. Install with pip: pip install scrapy-redis
    • 4. Run it in a project (see the settings sketch below)
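
Once scrapy-redis is installed, a Scrapy project is switched to the shared Redis scheduler through a few lines in settings.py. The snippet below is a sketch based on the scrapy-redis documentation, not part of the original article; adjust REDIS_URL to point at your own server.

```python
# settings.py sketch: make a Scrapy project use scrapy-redis.

# Redis-backed scheduler and duplicate filter, so every crawler
# instance shares one request queue and one dedup set.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep the queue in Redis between runs (allows pausing/resuming).
SCHEDULER_PERSIST = True

# Optionally push scraped items into Redis as well.
ITEM_PIPELINES = {
    "scrapy_redis.pipelines.RedisPipeline": 300,
}

REDIS_URL = "redis://localhost:6379"
```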

      Structure of a distributed crawler
Master-Slave distributed crawler
    • In the so-called master-slave mode, one server acts as the master and several servers act as slaves. The master is responsible for managing all connected slaves, including managing slave connections, scheduling and distributing tasks, and collecting and aggregating results; each slave only needs to pick up tasks from the master, complete them on its own, and finally upload the results, with no need to communicate with the other slaves. This approach is simple and easy to manage, but the master obviously has to communicate with every slave, so the master's performance is the bottleneck that constrains the whole system, especially when many slaves are connected, which can easily degrade the performance of the entire crawler system
    • Master-Slave distributed crawler structure diagram:

      This is the classic master-slave distributed crawler structure diagram: the control node (ControlNode) is the master mentioned above, and the crawler node (SpiderNode) is the slave mentioned above. The next diagram shows the execution flow of the control node.
    • Control node execution flow diagram:
    • These two diagrams explain the whole crawler framework very clearly; let's walk through it here:
    • 1. The whole distributed crawler system consists of two parts: the master control node and the slave crawler nodes
    • 2. The master control node is responsible for: scheduling the slave nodes' tasks, URL management, and result processing
    • 3. Each slave crawler node is responsible for: scheduling its own crawler, HTML download management, and HTML content parsing management
    • 4. System workflow (a compressed sketch follows below):
      • The master distributes tasks (URLs that have not been crawled)
      • A slave picks up a task (a URL) from the master's URL manager and completes it on its own: it downloads the HTML content and parses it, and the parsed content contains the target data plus new URLs
      • When the work is done, the slave submits the result (target data + new URLs) to the master's data extraction process (which belongs to the master's result processing). That process completes two tasks: it hands the new URLs to the URL manager and hands the target data to the data storage process
      • The master's URL management process validates each received URL (has it been crawled?) and sorts it accordingly: URLs not yet crawled go into the un-crawled collection, and crawled URLs go into the crawled collection
      • The slaves then loop: get a task from the URL manager, perform the task, submit the result ...
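
To make the workflow concrete, here is a compressed, hypothetical sketch of step 4 with Redis as the channel between master and slaves. Downloading and parsing are faked, and the key names and helper functions are invented for illustration; in practice the loops would run in separate processes on separate machines.

```python
# Hypothetical master/slave workflow over Redis (see step 4 above).
import json
import redis

r = redis.Redis(decode_responses=True)

def master_dispatch(url):
    """Master: queue a URL only if it is neither crawled nor queued."""
    if not r.sismember("crawled", url) and r.sadd("queued", url):
        r.lpush("tasks", url)

def slave_step():
    """Slave: take a task, download + parse, submit the result."""
    _, url = r.brpop("tasks")
    html = "<html>...</html>"             # stand-in for a real download
    data = {"url": url, "title": "..."}   # stand-in for real parsing
    new_urls = [url + "/next"]            # links discovered on the page
    r.lpush("results", json.dumps({"data": data, "new_urls": new_urls}))

def master_collect():
    """Master: store target data, mark the URL crawled, queue new URLs."""
    _, raw = r.brpop("results")
    result = json.loads(raw)
    r.lpush("items", json.dumps(result["data"]))  # to the data store
    r.sadd("crawled", result["data"]["url"])
    for u in result["new_urls"]:
        master_dispatch(u)

master_dispatch("http://example.com")  # seed the crawl
slave_step()
master_collect()
```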
    • That's all for this article, bye
    • This note may not be reprinted by any person or organization
