Operating mechanism of the open-source general crawler framework YayCrawler


In this section I will introduce the operating mechanism of YayCrawler. First, an overall picture of the architecture:

The components are introduced in the suggested boot order: Master, then Worker, then Admin. In fact, starting them in a different order also works; this boot sequence is assumed only for convenience of explanation.

First, Master-side analysis

When Master starts, it connects to Redis to query the state of the task queues. Master maintains task queues in four states: a pending queue, an in-execution queue, a successful queue, and a failed queue. Master also has a task scheduler: when a worker's heartbeat packet arrives, the scheduler checks whether that worker still has room for more tasks (each worker can set the length of its own local task queue). If the worker can still accept n tasks, the task scheduler assigns it up to n tasks from the pending queue. Once the assignment succeeds, Master moves those n tasks from the pending queue to the in-execution queue.
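As a rough illustration of that scheduling step, here is a minimal in-memory sketch. The class and method names are assumptions made for this example only (in the real framework the queues live in Redis and the dispatch goes over HTTP), not YayCrawler's actual API.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Minimal sketch of the heartbeat-driven assignment described above.
// Names and in-memory queues are illustrative assumptions only.
public class TaskScheduler {

    private final Deque<String> pendingQueue = new ArrayDeque<>();
    private final Deque<String> executingQueue = new ArrayDeque<>();

    public void submit(String taskUrl) {
        pendingQueue.addLast(taskUrl);
    }

    // Called when a worker's heartbeat arrives: quota is the worker's local
    // queue length, running is how many tasks it is currently executing.
    public List<String> onHeartbeat(int quota, int running) {
        int capacity = quota - running;
        List<String> assigned = new ArrayList<>();
        // Assign at most 'capacity' tasks, moving each one from the
        // pending queue to the in-execution queue.
        while (capacity-- > 0 && !pendingQueue.isEmpty()) {
            String task = pendingQueue.pollFirst();
            executingQueue.addLast(task);
            assigned.add(task);
        }
        return assigned; // in the real framework these are pushed to the worker over HTTP
    }

    public static void main(String[] args) {
        TaskScheduler scheduler = new TaskScheduler();
        scheduler.submit("http://example.com/page/1");
        scheduler.submit("http://example.com/page/2");
        // A worker with quota 5 that is already running 4 tasks gets at most one more.
        System.out.println(scheduler.onHeartbeat(5, 4));
    }
}
```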

Master also periodically scans the registered workers. If a worker's last heartbeat is now older than twice that worker's own heartbeat interval, Master assumes the worker has been lost and can no longer be assigned tasks, so it removes the worker from the registered list.
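A sketch of that liveness check could look like the following; the registry structure and names are assumptions for illustration, not the framework's actual code.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical worker registry: drop workers whose heartbeats have gone stale.
public class WorkerRegistry {

    // worker address -> {lastHeartbeatMillis, heartbeatIntervalMillis}
    private final Map<String, long[]> workers = new ConcurrentHashMap<>();

    public void register(String address, long heartbeatIntervalMillis) {
        workers.put(address, new long[] {System.currentTimeMillis(), heartbeatIntervalMillis});
    }

    public void onHeartbeat(String address) {
        long[] entry = workers.get(address);
        if (entry != null) {
            entry[0] = System.currentTimeMillis();
        }
    }

    // Run on a schedule: remove any worker whose last heartbeat is older
    // than twice its own heartbeat interval.
    public void evictDeadWorkers() {
        long now = System.currentTimeMillis();
        Iterator<Map.Entry<String, long[]>> it = workers.entrySet().iterator();
        while (it.hasNext()) {
            long[] entry = it.next().getValue();
            if (now - entry[0] > 2 * entry[1]) {
                it.remove();
            }
        }
    }
}
```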

Master likewise periodically scans the in-execution queue. If it finds a task that has exceeded its preset execution time, it assumes the task has gone wrong and should be re-executed, so it moves the task back from the in-execution queue to the pending queue for redistribution.
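The timeout scan can be pictured in the same spirit. Again this is a minimal sketch under assumed names; the real in-execution queue is kept in Redis.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedDeque;

// Hypothetical monitor: tasks that overstay the allotted time are moved
// back to the pending queue for redistribution.
public class TaskTimeoutMonitor {

    private final ConcurrentLinkedDeque<String> pendingQueue = new ConcurrentLinkedDeque<>();
    // task -> the time (millis) at which it was assigned to a worker
    private final Map<String, Long> executing = new ConcurrentHashMap<>();
    private final long timeoutMillis;

    public TaskTimeoutMonitor(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    public void markExecuting(String task) {
        executing.put(task, System.currentTimeMillis());
    }

    // Run periodically: any task executing longer than the allotted time is
    // assumed to have gone wrong and is returned to the pending queue.
    public void requeueTimedOutTasks() {
        long now = System.currentTimeMillis();
        Iterator<Map.Entry<String, Long>> it = executing.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (now - entry.getValue() > timeoutMillis) {
                pendingQueue.addLast(entry.getKey());
                it.remove();
            }
        }
    }
}
```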

Second, Worker-side analysis

The worker's configuration file contains the Master's service communication address. When the worker starts, it registers itself with Master, sending its own communication address, heartbeat interval, task quota, and so on. Once registration succeeds, the worker sends heartbeats to Master periodically and reports the status of its tasks at the same time.
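The registration and heartbeat flow might be sketched as below. The endpoint paths and JSON fields are assumptions for illustration; only the kinds of information sent (address, heartbeat interval, task quota) come from the description above.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch of the worker side: register with Master at startup, then send a
// heartbeat on a fixed interval. Endpoints and payload fields are hypothetical.
public class WorkerClient {

    private final HttpClient http = HttpClient.newHttpClient();
    private final String masterUrl;     // Master's service communication address (from the worker's config file)
    private final String workerAddress; // this worker's own communication address
    private final int quota;            // local task queue length
    private final long heartbeatSeconds;

    public WorkerClient(String masterUrl, String workerAddress, int quota, long heartbeatSeconds) {
        this.masterUrl = masterUrl;
        this.workerAddress = workerAddress;
        this.quota = quota;
        this.heartbeatSeconds = heartbeatSeconds;
    }

    public void start() throws Exception {
        // Register once at startup: address, heartbeat interval and task quota.
        post("/worker/register", String.format(
                "{\"address\":\"%s\",\"heartbeatSeconds\":%d,\"quota\":%d}",
                workerAddress, heartbeatSeconds, quota));

        // Then heartbeat periodically, reporting current load alongside it.
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            try {
                post("/worker/heartbeat", String.format(
                        "{\"address\":\"%s\",\"running\":%d}", workerAddress, 0));
            } catch (Exception e) {
                e.printStackTrace();
            }
        }, heartbeatSeconds, heartbeatSeconds, TimeUnit.SECONDS);
    }

    private void post(String path, String json) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(masterUrl + path))
                .timeout(Duration.ofSeconds(5))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(json))
                .build();
        http.send(request, HttpResponse.BodyHandlers.ofString());
    }
}
```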

The process by which a worker performs a task is a routine crawler workflow, which will be explained in detail elsewhere. Here we only need to be clear that after performing a task the worker ends up in one of two states and produces two types of result data. If the worker executes the task successfully, the result consists of two parts: field data and child links. The field data is handed to the persistence component (by default, field data is saved to MongoDB and images are downloaded to the file server), while the child links are sent to Master and joined to the pending queue. Whether the task succeeds or fails, the worker notifies Master so that the task's status can be updated.
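The handling of a task result can be summarized with a small sketch. The interfaces and names below are illustrative assumptions, not YayCrawler's actual classes; only the split into field data, child links, and an unconditional status notification comes from the description above.

```java
import java.util.List;
import java.util.Map;

// Sketch of how a worker's task result is handled, following the two
// result types described above. All names are hypothetical.
public class TaskResultHandler {

    // Result of one crawl: extracted field data plus discovered child links.
    public record CrawlResult(boolean success, Map<String, Object> fields, List<String> childLinks) {}

    interface Persistence {            // default implementation would write to MongoDB
        void save(Map<String, Object> fields);
    }

    interface MasterNotifier {         // notifies Master over HTTP
        void submitChildLinks(List<String> links);           // these join the pending queue
        void updateTaskStatus(String taskId, boolean success);
    }

    private final Persistence persistence;
    private final MasterNotifier master;

    public TaskResultHandler(Persistence persistence, MasterNotifier master) {
        this.persistence = persistence;
        this.master = master;
    }

    public void handle(String taskId, CrawlResult result) {
        if (result.success()) {
            persistence.save(result.fields());            // field data is persisted
            master.submitChildLinks(result.childLinks()); // child links go back to Master
        }
        // The status update is sent whether the task succeeded or failed.
        master.updateTaskStatus(taskId, result.success());
    }
}
```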

Third, Admin-side analysis

The Admin side mainly provides users with a web interface; it is a web project, and its configuration file also records Master's service communication address. On the Admin side, a user can write extraction rules for a target web page and test those rules until they are ready to be saved to the database. The user can view task results on the interface, such as successful tasks, failed tasks, and task output, and can also publish ordinary tasks and timed tasks individually or in bulk. These tasks are executed on the workers, and a worker consults the parsing rules set by the user when parsing a page.
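To make the idea of an extraction rule concrete, here is a purely illustrative sketch of what one rule might contain: a URL pattern, field selectors, and a child-link selector. The actual rule model Admin stores in the database is defined by YayCrawler itself and may differ.

```java
import java.util.Map;

// Hypothetical shape of one page-extraction rule; not the framework's real model.
public class PageParseRule {

    private final String urlPattern;                  // regex matching the target page URLs
    private final Map<String, String> fieldSelectors; // field name -> XPath/CSS selector
    private final String childLinkSelector;           // selector for links to enqueue as child tasks

    public PageParseRule(String urlPattern,
                         Map<String, String> fieldSelectors,
                         String childLinkSelector) {
        this.urlPattern = urlPattern;
        this.fieldSelectors = fieldSelectors;
        this.childLinkSelector = childLinkSelector;
    }

    // Example rule for an imaginary article site.
    public static PageParseRule example() {
        return new PageParseRule(
                "https://example\\.com/article/\\d+",
                Map.of("title", "//h1/text()",
                       "body", "//div[@class='content']"),
                "//a[@class='next']/@href");
    }
}
```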

Fourth, Other

Communication between Master, Worker, and Admin is based on the HTTP protocol. For security, the communication uses a token, a timestamp, and a nonce to sign and verify the message body; only requests with a correct signature are accepted.
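The article does not specify the exact signing algorithm, so the sketch below only shows the general pattern: hash the token, timestamp, nonce, and body together on the sender, and have the receiver recompute and compare. SHA-256 over the concatenated values is an assumption, not necessarily what YayCrawler uses.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.UUID;

// Minimal sketch of token/timestamp/nonce signing for an HTTP message body.
public class RequestSigner {

    public static String sign(String token, long timestamp, String nonce, String body) throws Exception {
        String material = token + "|" + timestamp + "|" + nonce + "|" + body;
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        byte[] hash = digest.digest(material.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : hash) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    // The receiver recomputes the signature from the same fields and only
    // accepts the request if the two values match (and the timestamp is fresh).
    public static boolean verify(String token, long timestamp, String nonce,
                                 String body, String signature) throws Exception {
        return sign(token, timestamp, nonce, body).equals(signature);
    }

    public static void main(String[] args) throws Exception {
        long ts = System.currentTimeMillis();
        String nonce = UUID.randomUUID().toString();
        String sig = sign("shared-token", ts, nonce, "{\"task\":\"crawl\"}");
        System.out.println(verify("shared-token", ts, nonce, "{\"task\":\"crawl\"}", sig)); // true
    }
}
```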

The queue and persistence layers in the framework are both programmed against interfaces, so you can easily replace the default implementations with your own.
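Programming against interfaces here means something like the sketch below: the abstractions are illustrative, not the framework's exact interfaces, but replacing the Redis-backed queue or the MongoDB-backed store with your own implementation works the same way.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Illustrative queue abstraction; the default implementation would be Redis-backed.
public interface TaskQueue {
    void push(String task);
    Optional<String> poll();
}

// Illustrative persistence abstraction; the default described in the article
// saves field data to MongoDB. A custom implementation of the same interface
// can replace it without touching the rest of the framework.
interface ResultStore {
    void save(String taskId, Map<String, Object> fields);
    List<Map<String, Object>> findByTask(String taskId);
}
```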
