WebMagic: a crawler framework that needs no configuration and is easy to extend

Source: Internet
Author: User
Tags: xpath

WebMagic is a Java crawler framework that requires no configuration and is easy to extend and customize. It provides a simple, flexible API, so a crawler can be implemented in just a few lines of code.

Here is a piece of code to crawl the Oschina blog:

    Spider.create(new SimplePageProcessor("http://my.oschina.net/", "http://my.oschina.net/*/blog/*")).thread(5).run();
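The article does not spell out how the second argument selects pages. One plausible reading, assumed here, is that it is a wildcard pattern in which `*` stands for any run of characters. The sketch below uses only the JDK; `UrlPatternSketch`, `toRegex`, `matches`, and the sample URLs are hypothetical names for illustration and are not WebMagic's own code.

```java
import java.util.regex.Pattern;

class UrlPatternSketch {

    // Hypothetical helper (not WebMagic code): turn a wildcard URL pattern into a regex.
    // Assumption: "*" stands for any run of characters; everything else is literal.
    static Pattern toRegex(String wildcard) {
        StringBuilder regex = new StringBuilder();
        String[] literals = wildcard.split("\\*", -1); // -1 keeps a trailing empty piece
        for (int i = 0; i < literals.length; i++) {
            if (i > 0) {
                regex.append(".*"); // one ".*" for each "*" between literal pieces
            }
            regex.append(Pattern.quote(literals[i])); // escape ".", "/", and so on
        }
        return Pattern.compile(regex.toString());
    }

    // True when the whole URL matches the wildcard pattern.
    static boolean matches(String wildcard, String url) {
        return toRegex(wildcard).matcher(url).matches();
    }

    public static void main(String[] args) {
        String pattern = "http://my.oschina.net/*/blog/*";
        // Sample URLs are made up for the demonstration.
        System.out.println(matches(pattern, "http://my.oschina.net/someuser/blog/12345")); // true
        System.out.println(matches(pattern, "http://my.oschina.net/someuser/home"));       // false
    }
}
```

Under this assumption, only URLs with a `/blog/` segment below a user path would be treated as blog pages.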

WebMagic has a fully modular design that covers the entire crawler lifecycle: link extraction, page download, content extraction, and persistence. It supports multi-threaded and distributed crawling, automatic retries, custom User-Agent and cookie settings, and more.
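To make the four lifecycle stages concrete, here is a minimal single-threaded sketch using only the JDK. It is not WebMagic's implementation: the class `CrawlLifecycleSketch`, its methods, the regex-based extraction, and the in-memory `example.test` site are all assumptions chosen so the example runs without network access.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CrawlLifecycleSketch {
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]+)\"");
    private static final Pattern TITLE = Pattern.compile("<title>([^<]*)</title>");

    // Stage 1, link extraction: collect every href value found in the page.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    // Stage 3, content extraction: pull out the page title (null if absent).
    static String extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    // Runs the loop: download, extract links, extract content, persist.
    // The downloader is passed in so a real HTTP client could replace the fake one.
    static Map<String, String> crawl(String startUrl, Function<String, String> downloader) {
        Queue<String> pending = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        Map<String, String> store = new LinkedHashMap<>(); // Stage 4, persistence (in memory)
        pending.add(startUrl);
        seen.add(startUrl);
        while (!pending.isEmpty()) {
            String url = pending.poll();
            String html = downloader.apply(url); // Stage 2, page download
            if (html == null) {
                continue; // unknown page: skip it
            }
            for (String link : extractLinks(html)) {
                if (seen.add(link)) { // schedule each URL only once
                    pending.add(link);
                }
            }
            store.put(url, extractTitle(html));
        }
        return store;
    }

    // Hypothetical two-page site used instead of real HTTP requests.
    static Map<String, String> demoSite() {
        Map<String, String> site = new HashMap<>();
        site.put("http://example.test/",
                "<title>Home</title><a href=\"http://example.test/blog/1\">post</a>");
        site.put("http://example.test/blog/1",
                "<title>Post 1</title><a href=\"http://example.test/\">home</a>");
        return site;
    }

    public static void main(String[] args) {
        System.out.println(crawl("http://example.test/", demoSite()::get));
    }
}
```

Keeping each stage behind its own method or function argument is what lets one stage be swapped without touching the others, which is the point of the modular design described above.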

WebMagic includes a powerful page extraction feature: developers can use CSS selectors, XPath, and regular expressions to extract links and content, and can chain several selectors in a single call. For example:

    String extractResult = Html.create(html).$("div.body").xpath("//a/@href").regex(".*blog.*").toString();
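The chain narrows the result step by step: first to the `div.body` element, then to the `href` attributes of links inside it, then to the values that match the regular expression. The sketch below reproduces that narrowing with JDK classes only, as an illustration rather than WebMagic's implementation. It assumes well-formed XHTML input and approximates the CSS selector `div.body` with the XPath predicate `@class='body'`; `SelectorChainSketch`, `blogLinks`, and the sample markup are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class SelectorChainSketch {

    // Returns the href values inside <div class="body"> that match ".*blog.*".
    static List<String> blogLinks(String xhtml) {
        try {
            // Parse the markup; the JDK parser requires well-formed XML.
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            // Steps 1 and 2: restrict to div.body, then select link href attributes.
            NodeList hrefs = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate("//div[@class='body']//a/@href", doc, XPathConstants.NODESET);
            // Step 3: keep only values matching the regular expression.
            Pattern blog = Pattern.compile(".*blog.*");
            List<String> result = new ArrayList<>();
            for (int i = 0; i < hrefs.getLength(); i++) {
                String href = hrefs.item(i).getNodeValue();
                if (blog.matcher(href).matches()) {
                    result.add(href);
                }
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse or query the document", e);
        }
    }

    // Made-up sample page: two links in div.body, one link in a sidebar.
    static String samplePage() {
        return "<html><body>"
                + "<div class=\"body\">"
                + "<a href=\"http://my.oschina.net/u/blog/1\">post</a>"
                + "<a href=\"http://my.oschina.net/u/home\">home</a>"
                + "</div>"
                + "<div class=\"side\"><a href=\"http://my.oschina.net/x/blog/2\">other</a></div>"
                + "</body></html>";
    }

    public static void main(String[] args) {
        System.out.println(blogLinks(samplePage()));
    }
}
```

The sidebar link is dropped by the first step and the `home` link by the last, so only the blog link inside `div.body` remains.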

WebMagic can also be embedded as a module in an existing Java project. For a reference application built with WebMagic, see the Oschina OpenAPI application "Blog move".

WebMagic user documentation: http://webmagic.io/docs/

WebMagic design document: "Design mechanism and principle of WebMagic: how to develop a Java crawler"

Repository: Huang Hua/webmagic (Star 458 | Fork 259)

Issues:
    • #12 Apache HttpClient "Cookie rejected" handling (zhangzuoqiang, 2 months ago)
    • #11 HttpClientDownloader.java line 206 NullPointerException (zhangzuoqiang, 8 months ago)
    • #10 Thread safety of the process method in the processor (zhangzuoqiang, 8 months ago)
    • #9 Bug when setting the domain to an IP address (zhangzuoqiang, 8 months ago)
    • #8 Bug when setting the domain to an IP address (zhangzuoqiang, 8 months ago)
Recent commits:
    • 4efd47184 Remove duplicate jar (Yihua.huang, a year ago)
    • 435922f00 Merge branch 'stable' of github.com:code4craft/webmagic (Yihua.huang, a year ago)
    • eb89d6656 Fix test (Yihua.huang, a year ago)
Master branch last updated: 2014-06-04

Latest WebMagic news:
    • WebMagic 0.5.2 released, Java crawler framework (1 year ago, 16 reviews, 2496 reads)
    • WebMagic 0.5.1 released, Java crawler framework (1 year ago, 15 reviews, 2419 reads)
    • WebMagic 0.5.0 released, Java crawler framework (1 year ago, 23 reviews, 2331 reads)
    • WebMagic 0.4.3 released, Java crawler framework (1 year ago, 13 reviews, 3120 reads)
    • WebMagic 0.4.2 released, Java crawler framework (2 years ago, 8 reviews, 1116 reads)
