Open-Source Generic Crawler Framework YayCrawler: Introduction

Hello, everyone! Starting today, I will use a series of posts to introduce my open-source project, YayCrawler. Its repository is on GitHub at https://github.com/liushuishang/YayCrawler; attention and feedback are welcome.

YayCrawler is a distributed, generic crawler framework built on WebMagic and written in Java. There are many crawler frameworks out there, simple and complex, lightweight and heavyweight, so you may ask: what is the advantage of this one? That is a very important question! In this opening post I will briefly introduce the framework's main features; later chapters will describe each of them in detail and explain how it is implemented. Here is an overview:

1. Distributed: YayCrawler uses a one-Master, many-Worker architecture (one big brother with many younger brothers, which is how the universe is organized); the Master also has a small secretary, the Admin component, which handles contact with the outside world.

2. Generic: We need to crawl data from many different websites, whose structure and content vary widely; most people end up writing new code for each site, with nothing reusable. YayCrawler tries to change that by abstracting away the parts that differ and letting rules direct the crawler. Through the admin interface, a user can configure the extraction rules for a page; when the crawler later fetches that page, it applies the pre-configured rules to parse the data and then persists the result.
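
To make the idea concrete, here is a minimal, hypothetical sketch (not YayCrawler's actual code) of how configured field-to-XPath rules could drive extraction in a WebMagic PageProcessor; the rule map and its contents are illustrative assumptions:

    import us.codecraft.webmagic.Page;
    import us.codecraft.webmagic.Site;
    import us.codecraft.webmagic.processor.PageProcessor;

    import java.util.Map;

    // Hypothetical sketch: extraction is driven by externally configured
    // field -> XPath rules instead of hard-coded parsing logic per site.
    public class RuleDrivenProcessor implements PageProcessor {

        private final Map<String, String> fieldRules; // e.g. "title" -> "//h1/text()"

        public RuleDrivenProcessor(Map<String, String> fieldRules) {
            this.fieldRules = fieldRules;
        }

        @Override
        public void process(Page page) {
            // Apply every configured rule to the fetched page and store the result.
            for (Map.Entry<String, String> rule : fieldRules.entrySet()) {
                page.putField(rule.getKey(), page.getHtml().xpath(rule.getValue()).toString());
            }
        }

        @Override
        public Site getSite() {
            return Site.me().setRetryTimes(3).setSleepTime(1000);
        }
    }

The same processor can then serve any site for which a rule set has been configured, instead of one processor per site.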

3. Extensible task queue: The task queue is implemented on Redis, with four queues corresponding to task states: initial, executing, success, and failure. You can also plug in different task scheduling algorithms; the default is fair scheduling.
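
As a rough illustration of that four-queue lifecycle, here is a sketch; the Redis key names and the Jedis-based wrapper are my own assumptions, not the framework's actual implementation:

    import redis.clients.jedis.Jedis;

    // Minimal sketch: four Redis lists model the task lifecycle described above.
    public class TaskQueues {

        private static final String INIT      = "crawler:queue:init";
        private static final String EXECUTING = "crawler:queue:executing";
        private static final String SUCCESS   = "crawler:queue:success";
        private static final String FAILURE   = "crawler:queue:failure";

        private final Jedis jedis = new Jedis("localhost", 6379);

        // Submit a new task into the initial queue.
        public void submit(String task) {
            jedis.lpush(INIT, task);
        }

        // A worker takes the next task: pop it from the initial queue and
        // push it onto the executing queue in a single atomic step.
        public String take() {
            return jedis.rpoplpush(INIT, EXECUTING);
        }

        // Report the outcome: remove the task from the executing queue and
        // record it as either a success or a failure.
        public void complete(String task, boolean ok) {
            jedis.lrem(EXECUTING, 1, task);
            jedis.lpush(ok ? SUCCESS : FAILURE, task);
        }
    }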

4. Configurable persistence: By default, extracted attribute data is persisted to MongoDB and images are downloaded to a file server; of course, you can add more storage backends.
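
For example, such a persistence step could be plugged in as a WebMagic Pipeline that writes to MongoDB; the database and collection names below are illustrative, and this is only a sketch of the idea rather than the framework's own persistence code:

    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import org.bson.Document;
    import us.codecraft.webmagic.ResultItems;
    import us.codecraft.webmagic.Task;
    import us.codecraft.webmagic.pipeline.Pipeline;

    // Sketch: persist the extracted fields of each page as one MongoDB document.
    public class MongoPipeline implements Pipeline {

        private final MongoCollection<Document> collection =
                MongoClients.create("mongodb://localhost:27017")
                            .getDatabase("crawler")
                            .getCollection("results");

        @Override
        public void process(ResultItems resultItems, Task task) {
            // Every crawled page becomes one document of field -> value pairs.
            collection.insertOne(new Document(resultItems.getAll()));
        }
    }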

5. Stability and fault tolerance: Any failed crawl task is retried and logged; a task is moved to the success queue only when it genuinely succeeds, and a failed task carries a description of the reason for its failure.
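
The retry behaviour could look roughly like the sketch below (again an assumption for illustration, not the actual implementation): a task counts as successful only when a crawl attempt completes without error, and once the retries are exhausted the last error message is kept as the failure reason.

    // Sketch of the retry policy described above (assumes maxRetries >= 1).
    public class RetryPolicy {

        public interface CrawlAttempt {
            void crawl() throws Exception;
        }

        // Returns "SUCCESS" only if one attempt really succeeds; otherwise
        // returns a failure description built from the last error.
        public static String execute(CrawlAttempt attempt, int maxRetries) {
            Exception lastError = null;
            for (int i = 0; i < maxRetries; i++) {
                try {
                    attempt.crawl();
                    return "SUCCESS";                   // task moves to the success queue
                } catch (Exception e) {
                    lastError = e;                      // log the error and retry
                }
            }
            return "FAILED: " + lastError.getMessage(); // task moves to the failure queue
        }
    }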

6. Anti-monitoring component: Websites go to great lengths to detect and block crawlers, devising a whole series of monitoring measures. On the other side, we naturally need counter-measures to protect our crawl tasks. The factors currently considered are cookie invalidation (login required), CAPTCHA challenges, and IP blocking (automatic proxies).
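
Two of these counter-measures can already be expressed with WebMagic's own configuration hooks, as in the hedged sketch below; the cookie value, user agent, and proxy addresses are placeholders:

    import us.codecraft.webmagic.Site;
    import us.codecraft.webmagic.downloader.HttpClientDownloader;
    import us.codecraft.webmagic.proxy.Proxy;
    import us.codecraft.webmagic.proxy.SimpleProxyProvider;

    // Sketch of two anti-monitoring measures built on WebMagic's configuration hooks.
    public class AntiMonitoringSetup {

        // Reuse a logged-in session by attaching its cookie to every request,
        // so pages behind a login wall stay reachable until the cookie expires.
        public static Site siteWithLoginCookie() {
            return Site.me()
                       .addCookie("SESSIONID", "value-obtained-after-login")
                       .setUserAgent("Mozilla/5.0 (compatible; crawler)");
        }

        // Rotate requests across a pool of proxy IPs so a single address is
        // not blocked for sending too many requests.
        public static HttpClientDownloader downloaderWithProxies() {
            HttpClientDownloader downloader = new HttpClientDownloader();
            downloader.setProxyProvider(SimpleProxyProvider.from(
                    new Proxy("10.0.0.1", 8888),
                    new Proxy("10.0.0.2", 8888)));
            return downloader;
        }
    }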

7. Scheduled task refresh: You can schedule tasks to run periodically, for example refreshing a site's data every other day.
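
A scheduled refresh can be as simple as periodically re-enqueuing a site's entry URL; the sketch below uses a plain ScheduledExecutorService, and the Redis key and URL are placeholder assumptions consistent with the queue sketch above:

    import redis.clients.jedis.Jedis;

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    // Sketch: every two days, push a site's entry URL back into the initial
    // queue so the workers crawl it again and refresh its data.
    public class ScheduledRefresh {

        public static void main(String[] args) {
            ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
            Jedis jedis = new Jedis("localhost", 6379);

            scheduler.scheduleAtFixedRate(
                    () -> jedis.lpush("crawler:queue:init", "http://example.com/list"),
                    0, 2, TimeUnit.DAYS);
        }
    }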

......

The point of listing all these advantages is simple: I hope you will be interested enough to keep reading, haha.

To get back to the point: this article is only an overview. Here is the planned structure of the following articles:

    1. Open-Source Generic Crawler Framework YayCrawler: Operating Mechanism of the Framework
    2. Open-Source Generic Crawler Framework YayCrawler: Page Extraction Rule Definition
    3. Open-Source Generic Crawler Framework YayCrawler: Task Queue
    4. Open-Source Generic Crawler Framework YayCrawler: Page Downloader in Detail
    5. Open-Source Generic Crawler Framework YayCrawler: Rule Parser
    6. Open-Source Generic Crawler Framework YayCrawler: Data Persistence
    7. Open-Source Generic Crawler Framework YayCrawler: Anti-Monitoring Component
    8. Open-Source Generic Crawler Framework YayCrawler: Case Demo
    9. Open-Source Generic Crawler Framework YayCrawler: Features to Be Improved
