This article is reprinted from http://www.tuicool.com/articles/VZBj2e
Original: http://itindex.net/detail/52388-Frame
WebMagic is a crawler framework that requires no configuration and is easy to extend through secondary development. It provides a simple, flexible API, so a crawler can be implemented with only a small amount of code.
Official website: http://webmagic.io/
WebMagic is an open source Java framework for vertical crawlers. Its goal is to simplify crawler development so that developers can focus on their own application logic. The core of WebMagic is very simple, yet it covers the whole crawling process, which also makes it good learning material for crawler development. The author spent a year developing vertical crawlers at a previous company, and WebMagic is the framework that grew out of solving the repetitive work in that development.
Web crawling is a technique, and WebMagic aims to reduce the cost of implementing it. Out of respect for resource providers, however, WebMagic does not do anything to circumvent blocking, such as CAPTCHA cracking, proxy switching, or automatic login.
Key features of WebMagic:
- Fully modular design with strong extensibility.
- The core is simple but covers the whole crawling process; it is flexible and powerful, and also good material for beginners to learn from.
- Provides a rich API for extracting page content.
- No configuration is needed; a crawler can be implemented with POJOs plus annotations.
- Supports multithreading.
- Supports distributed crawling.
- Supports crawling pages that are rendered dynamically with JavaScript.
- Does not depend on any framework, so it can be embedded flexibly into a project.
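To make the idea of a modular core that covers the whole crawling process more concrete, here is a minimal sketch in plain Java. It is not WebMagic's API: the class `MiniCrawler`, the interfaces `Downloader`, `PageProcessor` and `Pipeline`, and the in-memory "site" are all assumptions made up for this illustration. It only shows how download, extraction, link scheduling and result output can be separated into replaceable parts.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of a modular crawl loop; NOT the WebMagic API.
public class MiniCrawler {
    /** Fetches raw content for a URL; returns null if the page is unavailable. */
    public interface Downloader { String download(String url); }

    /** Extracts fields and follow-up links from one downloaded page. */
    public interface PageProcessor {
        void process(String url, String html, Map<String, String> fields, List<String> links);
    }

    /** Receives the extracted fields (print, file, database, ...). */
    public interface Pipeline { void save(String url, Map<String, String> fields); }

    /** Runs a breadth-first crawl and returns the number of pages processed. */
    public static int crawl(String startUrl, Downloader downloader,
                            PageProcessor processor, Pipeline pipeline) {
        Queue<String> queue = new ArrayDeque<>(); // URLs waiting to be fetched
        Set<String> seen = new HashSet<>();       // de-duplicates scheduled URLs
        queue.add(startUrl);
        seen.add(startUrl);
        int processed = 0;
        while (!queue.isEmpty()) {
            String url = queue.poll();
            String html = downloader.download(url);
            if (html == null) {
                continue; // skip pages that could not be downloaded
            }
            Map<String, String> fields = new LinkedHashMap<>();
            List<String> links = new ArrayList<>();
            processor.process(url, html, fields, links);
            pipeline.save(url, fields);
            processed++;
            for (String link : links) {
                if (seen.add(link)) { // true only for URLs not scheduled before
                    queue.add(link);
                }
            }
        }
        return processed;
    }

    public static void main(String[] args) {
        // In-memory stand-in for a website, so the sketch needs no network access.
        Map<String, String> site = new LinkedHashMap<>();
        site.put("/a", "<title>A</title><a href=\"/b\">b</a><a href=\"/c\">c</a>");
        site.put("/b", "<title>B</title><a href=\"/a\">a</a>");

        Pattern title = Pattern.compile("<title>(.*?)</title>");
        Pattern href = Pattern.compile("href=\"(.*?)\"");

        int n = crawl("/a", site::get, (url, html, fields, links) -> {
            Matcher t = title.matcher(html);
            if (t.find()) {
                fields.put("title", t.group(1));
            }
            Matcher h = href.matcher(html);
            while (h.find()) {
                links.add(h.group(1));
            }
        }, (url, fields) -> System.out.println(url + " -> " + fields));

        System.out.println("pages processed: " + n);
    }
}
```

Because each stage is an interface, any one of them can be swapped without touching the loop, for example replacing the in-memory downloader with an HTTP client or the printing pipeline with one that writes to a database. A real framework would add politeness delays, retries and multithreading on top of this.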
http://git.oschina.net/flashsword20/webmagic#readme
An easy-to-use crawler framework