WebMagic is a crawler framework that requires no configuration and is easy to extend through secondary development. It provides a simple, flexible API, so a crawler can be implemented in just a few lines of code.
Here is a piece of code that crawls OSChina blogs:
    Spider.create(new SimplePageProcessor("http://my.oschina.net/", "http://my.oschina.net/*/blog/*")).thread(5).run();
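The second argument in the one-liner above is a URL pattern that uses `*` as a wildcard. As a rough illustration of how such a pattern can be matched against candidate links, here is a minimal sketch using only the Java standard library. The class `UrlPatternSketch` and its methods are hypothetical and are not part of WebMagic, and the rule that `*` matches any run of characters is an assumption made for this example. The URLs in the usage note use a made-up user name.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of WebMagic.
class UrlPatternSketch {

    // Convert a wildcard pattern into a regex.
    // Assumption: "*" matches any run of characters; everything else is literal.
    static Pattern compile(String wildcard) {
        String[] parts = wildcard.split("\\*", -1);
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < parts.length; i++) {
            if (i > 0) {
                regex.append(".*"); // one wildcard between two literal parts
            }
            regex.append(Pattern.quote(parts[i])); // quote dots, slashes, etc.
        }
        return Pattern.compile(regex.toString());
    }

    // True when the whole URL matches the wildcard pattern.
    static boolean matches(String wildcard, String url) {
        return compile(wildcard).matcher(url).matches();
    }

    public static void main(String[] args) {
        String pattern = "http://my.oschina.net/*/blog/*";
        System.out.println(matches(pattern, "http://my.oschina.net/someuser/blog/1"));
        System.out.println(matches(pattern, "http://my.oschina.net/someuser"));
    }
}
```

With this sketch, `http://my.oschina.net/someuser/blog/1` matches the pattern, while `http://my.oschina.net/someuser` does not, because it lacks the `/blog/` segment.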
WebMagic has a fully modular design that covers the entire crawler lifecycle (link extraction, page download, content extraction, and persistence). It supports multi-threaded crawling, distributed crawling, automatic retries, custom User-Agent and cookies, and more.
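To make the four lifecycle stages concrete, the following is a minimal, single-threaded sketch of a crawler in which each stage sits behind its own interface. Everything here is hypothetical: the interface names, the regex-based extractors, and the in-memory "site" are assumptions chosen for illustration and do not reproduce WebMagic's actual API or its multi-threaded scheduling.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only; these interfaces are hypothetical, not WebMagic's API.
class ModularCrawlerSketch {

    interface Downloader { String download(String url); }            // page download
    interface LinkExtractor { List<String> links(String html); }     // link extraction
    interface ContentExtractor { String extract(String html); }      // content extraction
    interface Pipeline { void persist(String url, String content); } // persistence

    // Breadth-first crawl that wires the four stages together.
    static void crawl(String startUrl, Downloader downloader, LinkExtractor linkExtractor,
                      ContentExtractor contentExtractor, Pipeline pipeline) {
        Queue<String> queue = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        queue.add(startUrl);
        seen.add(startUrl);
        while (!queue.isEmpty()) {
            String url = queue.poll();
            String html = downloader.download(url);
            if (html == null) {
                continue; // skip pages that could not be fetched
            }
            pipeline.persist(url, contentExtractor.extract(html));
            for (String link : linkExtractor.links(html)) {
                if (seen.add(link)) { // enqueue each URL at most once
                    queue.add(link);
                }
            }
        }
    }

    // Regex-based stand-in for link extraction: collects href attribute values.
    static List<String> hrefs(String html) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile("href=\"([^\"]+)\"").matcher(html);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    // Regex-based stand-in for content extraction: returns the page title.
    static String title(String html) {
        Matcher m = Pattern.compile("<title>(.*?)</title>").matcher(html);
        return m.find() ? m.group(1) : "";
    }

    // Crawls a tiny in-memory "site" and returns the persisted records in visit order.
    static List<String> demo() {
        Map<String, String> site = Map.of(
                "/a", "<title>A</title><a href=\"/b\">b</a><a href=\"/c\">c</a>",
                "/b", "<title>B</title><a href=\"/a\">back</a>",
                "/c", "<title>C</title>");
        List<String> stored = new ArrayList<>();
        crawl("/a", site::get, ModularCrawlerSketch::hrefs, ModularCrawlerSketch::title,
                (url, content) -> stored.add(url + " -> " + content));
        return stored;
    }

    public static void main(String[] args) {
        demo().forEach(System.out::println);
    }
}
```

Because each stage is an interface, any one of them can be swapped (for example, a pipeline that writes to a database instead of a list) without touching the crawl loop, which is the main benefit of a modular design.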
WebMagic includes powerful page extraction: developers can use CSS selectors, XPath, and regular expressions to extract links and content, and can chain multiple selectors in a single call. For example:
    String extractResult = Html.create(html).$("div.body").xpath("//a/@href").regex(".*blog.*").toString();
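The idea behind chained selectors is that each step takes the results of the previous step and narrows or transforms them. The sketch below imitates that with the Java standard library only. The class `SelectionChainSketch` is hypothetical and is not WebMagic's `Html` class; it assumes well-formed XML input, has no CSS selector step, and its `xpath` step returns the text content of matched nodes, so it is meant to be followed by `regex` rather than by another `xpath`.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical fluent wrapper for illustration; not WebMagic's Html class.
class SelectionChainSketch {
    private final List<String> values;

    private SelectionChainSketch(List<String> values) {
        this.values = values;
    }

    // Start a chain from a single well-formed XML/XHTML string.
    static SelectionChainSketch of(String xml) {
        return new SelectionChainSketch(List.of(xml));
    }

    // Evaluate an XPath expression against each current value and continue
    // the chain with the text content of every matched node.
    SelectionChainSketch xpath(String expression) {
        List<String> out = new ArrayList<>();
        try {
            for (String value : values) {
                Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                        .parse(new InputSource(new StringReader(value)));
                NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                        .evaluate(expression, doc, XPathConstants.NODESET);
                for (int i = 0; i < nodes.getLength(); i++) {
                    out.add(nodes.item(i).getTextContent());
                }
            }
        } catch (Exception e) {
            throw new IllegalArgumentException("Cannot evaluate " + expression, e);
        }
        return new SelectionChainSketch(out);
    }

    // Keep only the values that match the regular expression in full.
    SelectionChainSketch regex(String pattern) {
        Pattern compiled = Pattern.compile(pattern);
        List<String> out = new ArrayList<>();
        for (String value : values) {
            if (compiled.matcher(value).matches()) {
                out.add(value);
            }
        }
        return new SelectionChainSketch(out);
    }

    // All values remaining at the end of the chain.
    List<String> all() {
        return values;
    }

    // Selects hrefs inside div.body, then keeps those containing "blog".
    static List<String> demo() {
        String page = "<html><body><div class=\"body\">"
                + "<a href=\"/someuser/blog/1\">post</a>"
                + "<a href=\"/about\">about</a>"
                + "</div><a href=\"/other/blog/2\">outside</a></body></html>";
        return of(page).xpath("//div[@class='body']//a/@href").regex(".*blog.*").all();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In the demo, only the link inside the `div` whose value contains "blog" survives both steps: the `/about` link is dropped by the regex, and the link outside the `div` is never selected by the XPath.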
WebMagic can also easily be embedded as a module in a Java project. For a reference application built with WebMagic, see the OSChina OpenAPI application: Blog Move
WebMagic user documentation: http://webmagic.io/docs/
WebMagic design document: Design Mechanism and Principles of WebMagic: How to Develop a Java Crawler
Huang Hua/webmagic Star 458 | Fork 259
Issues:
- #12 Apache HttpClient "Cookie rejected" handling (zhangzuoqiang, 2 months ago)
- #11 HttpClientDownloader.java line 206 NullPointerException (zhangzuoqiang, 8 months ago)
- #10 Thread safety of the process method in the processor (zhangzuoqiang, 8 months ago)
- #9 Bug when the domain is set to an IP address (zhangzuoqiang, 8 months ago)
- #8 Bug when the domain is set to an IP address (zhangzuoqiang, 8 months ago)
Recent commits:
- 4efd47184 Remove duplicate jar (Yihua.huang, a year ago)
- 435922f00 Merge branch 'stable' of github.com:code4craft/webmagic (Yihua.huang, a year ago)
- eb89d6656 Fix test (Yihua.huang, a year ago)
Master branch code last updated: 2014-06-04
Latest WebMagic news:
- WebMagic 0.5.2 release, Java crawler framework (1 year ago, 16 reviews, 2496 reads)
- WebMagic 0.5.1 release, Java crawler framework (1 year ago, 15 reviews, 2419 reads)
- WebMagic 0.5.0 release, Java crawler framework (1 year ago, 23 reviews, 2331 reads)
- WebMagic 0.4.3 release, Java crawler framework (1 year ago, 13 reviews, 3120 reads)
- WebMagic 0.4.2 release, Java crawler framework (2 years ago, 8 reviews, 1116 reads)