An Introduction to WebMagic, an Open-Source Vertical Crawler

The WebMagic codebase is divided into two parts: core and extension. The core module (webmagic-core) is a streamlined, modular crawler implementation, while the extension module adds convenient, practical features. WebMagic's architecture draws on Scrapy; the goal is to be as modular as possible and to reflect the functional characteristics of a crawler.
The core provides a very simple, flexible API for writing a crawler without fundamentally changing the way you develop.
The extension module (webmagic-extension) provides convenient features such as writing crawlers in annotation mode, and builds in a number of commonly used components to make crawler development easier.

1. A framework, a domain
A good framework is a condensation of knowledge in its domain. WebMagic's design references Scrapy, the best-known crawler framework in the industry, while its implementation builds on HttpClient, Jsoup, and other of the Java world's most mature tools; the goal is a textbook-style implementation of a web crawler in Java.
If you are an experienced crawler developer, WebMagic will be very easy to use: it keeps an almost native Java development style, but adds some modular constraints, encapsulates some tedious operations, and provides some convenient features.
If you are a novice crawler developer, using and understanding WebMagic will teach you the common patterns of crawler development, the toolchain, and how to handle typical problems. Once you can use it fluently, developing a crawler from scratch will not be difficult either.
Because of this goal, the core of WebMagic is kept very simple; where the two conflict, functionality gives way to simplicity.
2. Micro-kernel and high extensibility
WebMagic consists of four components (Downloader, PageProcessor, Scheduler, Pipeline); the core code is very small and mainly combines these components and runs them as a multi-threaded task. This means that in WebMagic you can essentially customize every part of the crawler's behavior.
The core of WebMagic lives in the webmagic-core package; the other packages can be understood as extensions to webmagic-core, no different from the extensions you would write yourself as a user.
3. Focus on practicality
While the core needs to stay simple enough, WebMagic also implements, in the extension module, a number of convenient features that help development, such as annotation-based crawler development and Xsoup, an extension of XPath syntax. These features are optional in WebMagic, and their goal is to keep the user's crawler code as simple as possible and easy to maintain.
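For a concrete feel of the annotation mode, the following is a minimal sketch. The target URL pattern, XPath expressions, and field names refer to a hypothetical blog site and are illustrative only.

    import us.codecraft.webmagic.Site;
    import us.codecraft.webmagic.model.OOSpider;
    import us.codecraft.webmagic.model.annotation.ExtractBy;
    import us.codecraft.webmagic.model.annotation.TargetUrl;
    import us.codecraft.webmagic.pipeline.ConsolePageModelPipeline;

    // Annotation mode: the annotations declare which URLs to crawl and which
    // fields to extract, so no explicit PageProcessor has to be written.
    @TargetUrl("https://blog.example.com/post/\\d+")
    public class BlogPost {

        @ExtractBy("//h1[@class='title']/text()")
        private String title;

        @ExtractBy("//div[@class='content']/tidyText()")
        private String content;

        public static void main(String[] args) {
            OOSpider.create(Site.me().setSleepTime(1000),
                            new ConsolePageModelPipeline(), BlogPost.class)
                    .addUrl("https://blog.example.com/")
                    .thread(3)
                    .run();
        }
    }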

Overall architecture
The structure of WebMagic is divided into four components, Downloader, PageProcessor, Scheduler, and Pipeline, which are organized together by the Spider. These four components correspond to the downloading, processing, URL management, and persistence functions in the crawler's life cycle. WebMagic's design references Scrapy, but the implementation is more Java-flavored.
The Spider organizes these components so that they can interact with one another and run as a processing flow; you can think of the Spider as a large container. It is also the core of WebMagic's logic.
The overall architecture of WebMagic is shown in the following diagram:

[Figure: WebMagic overall architecture diagram]

Four components of WebMagic
1. Downloader
The Downloader is responsible for downloading pages from the Internet for subsequent processing. WebMagic uses Apache HttpClient as the download tool by default.
2. PageProcessor
The PageProcessor is responsible for parsing pages, extracting useful information, and discovering new links. WebMagic uses Jsoup as its HTML parsing tool and, on top of it, has developed Xsoup, a tool for extracting content with XPath.
Of these four components, the PageProcessor differs for every page of every site, and it is the part that users write themselves.
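A minimal custom PageProcessor might look like the following sketch; the site, XPath expressions, and field names are hypothetical.

    import us.codecraft.webmagic.Page;
    import us.codecraft.webmagic.Site;
    import us.codecraft.webmagic.processor.PageProcessor;

    // A hypothetical PageProcessor for a blog site: it extracts a title and body
    // from article pages and discovers further article links to crawl.
    public class BlogPageProcessor implements PageProcessor {

        // Site holds crawl settings such as retry count and politeness delay.
        private final Site site = Site.me().setRetryTimes(3).setSleepTime(1000);

        @Override
        public void process(Page page) {
            // Extracted fields go into ResultItems and are handed to the Pipeline.
            page.putField("title",
                    page.getHtml().xpath("//h1[@class='title']/text()").toString());
            page.putField("content",
                    page.getHtml().xpath("//div[@class='content']/tidyText()").toString());

            // Discover new links matching the article URL pattern and queue them.
            page.addTargetRequests(
                    page.getHtml().links().regex("https://blog\\.example\\.com/post/\\d+").all());
        }

        @Override
        public Site getSite() {
            return site;
        }
    }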
3. Scheduler
The Scheduler is responsible for managing the URLs to be crawled, as well as deduplication. By default, WebMagic provides a JDK in-memory queue to manage the URLs and uses a set for deduplication; distributed URL management based on Redis is also supported.
Unless your project has special distributed requirements, you normally do not need to customize the Scheduler yourself.
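Swapping in a different Scheduler is a one-line setting on the Spider. The sketch below assumes the RedisScheduler from webmagic-extension, a Redis instance on localhost, and the hypothetical BlogPageProcessor from above.

    import us.codecraft.webmagic.Spider;
    import us.codecraft.webmagic.scheduler.RedisScheduler;

    public class RedisSchedulerExample {
        public static void main(String[] args) {
            Spider.create(new BlogPageProcessor())
                    // Keep the URL queue and the deduplication set in Redis so that
                    // several crawler instances can share the same crawl frontier.
                    .setScheduler(new RedisScheduler("127.0.0.1"))
                    .addUrl("https://blog.example.com/")
                    .thread(5)
                    .run();
        }
    }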
4. Pipeline
The Pipeline is responsible for processing the extraction results, including computation and persistence to files, databases, and so on. By default, WebMagic provides two result-handling options: output to the console and saving to files.
The Pipeline defines how results are saved. If you want to save them to a particular database, you need to write a corresponding Pipeline; for each class of requirement, you only need to write one Pipeline.
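A custom Pipeline only has to implement a single method. The following minimal sketch prints the extracted fields, standing in for a real database write; the field names match the hypothetical processor above. It would be registered on the Spider with addPipeline(new PrintPipeline()).

    import java.util.Map;

    import us.codecraft.webmagic.ResultItems;
    import us.codecraft.webmagic.Task;
    import us.codecraft.webmagic.pipeline.Pipeline;

    // A toy Pipeline: in a real project the body would write to a database.
    public class PrintPipeline implements Pipeline {

        @Override
        public void process(ResultItems resultItems, Task task) {
            // Iterate over everything the PageProcessor stored with putField().
            for (Map.Entry<String, Object> entry : resultItems.getAll().entrySet()) {
                System.out.println(entry.getKey() + ":\t" + entry.getValue());
            }
        }
    }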
Objects for Data flow
1. Request
A Request is a layer of encapsulation around a URL address; one Request corresponds to one URL. It is the carrier of interaction between the PageProcessor and the Downloader, and the only way for the PageProcessor to control the Downloader.
Besides the URL itself, a Request contains a key-value field named extra. You can store special attributes in extra and read them elsewhere to implement different functions, for example attaching information collected on the previous page.
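A sketch of passing context through extra; the URL patterns and key names are hypothetical.

    import us.codecraft.webmagic.Page;
    import us.codecraft.webmagic.Request;
    import us.codecraft.webmagic.Site;
    import us.codecraft.webmagic.processor.PageProcessor;

    // Pass the list page's title to the detail page via Request extras.
    public class ExtraDemoProcessor implements PageProcessor {

        private final Site site = Site.me().setSleepTime(1000);

        @Override
        public void process(Page page) {
            if (page.getUrl().regex("https://blog\\.example\\.com/$").match()) {
                // List page: queue a detail page and attach extra context to it.
                Request detail = new Request("https://blog.example.com/post/42")
                        .putExtra("fromList", page.getHtml().xpath("//h1/text()").toString());
                page.addTargetRequest(detail);
            } else {
                // Detail page: read the attached value back from the current Request.
                page.putField("fromList", page.getRequest().getExtra("fromList"));
            }
        }

        @Override
        public Site getSite() {
            return site;
        }
    }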
2. Page
A Page represents one page downloaded by the Downloader; it may be HTML, JSON, or content in some other text format.
Page is the core object of WebMagic's extraction process. It provides methods for extracting content, saving results, and so on. Its use is described in detail in the examples in chapter four.
3. ResultItems
ResultItems is essentially a Map that holds the results produced by the PageProcessor, for the Pipeline to use. Its API is very similar to Map's. One point worth noting is its skip field: if skip is set to true, the results will not be processed by the Pipeline.
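In practice you rarely manipulate ResultItems directly; the skip flag is usually set from the PageProcessor. A short sketch (the XPath and field name are hypothetical):

    // Inside a PageProcessor.process(Page page) method:
    // pages without a title are skipped, so the Pipeline never sees them.
    if (!page.getHtml().xpath("//h1[@class='title']/text()").match()) {
        page.setSkip(true);   // equivalent to page.getResultItems().setSkip(true)
        return;
    }
    page.putField("title", page.getHtml().xpath("//h1[@class='title']/text()").toString());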
The engine that controls the crawler's operation: Spider
The Spider is the core of WebMagic's internal flow. Downloader, PageProcessor, Scheduler, and Pipeline are all properties of the Spider; they can be set freely, and different features are enabled by swapping these properties. The Spider is also the entry point of a WebMagic run: it encapsulates creating, starting, and stopping the crawler, multithreading, and other functions.
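Putting it all together, a typical run wires the components onto a Spider through its fluent API. The sketch below reuses the hypothetical BlogPageProcessor and PrintPipeline from above together with the built-in FilePipeline.

    import us.codecraft.webmagic.Spider;
    import us.codecraft.webmagic.pipeline.FilePipeline;

    public class CrawlerMain {
        public static void main(String[] args) {
            Spider.create(new BlogPageProcessor())        // the user-written part
                    .addUrl("https://blog.example.com/")  // seed URL
                    .addPipeline(new PrintPipeline())     // custom result handling
                    .addPipeline(new FilePipeline("/tmp/webmagic/")) // built-in: save to files
                    .thread(5)                            // number of crawler threads
                    .run();                               // start and block until finished
        }
    }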

Reference: http://webmagic.io/docs/zh
