Java Web Crawler Frameworks

Source: Internet
Author: User

Java Web crawler framework:

Apache Nutch, Heritrix, and others. The main reference is a list of 40 open source crawler projects compiled by the open source community.


Article background:

I recently needed to write a crawler to capture Sina Weibo data and then store and analyze it with Hadoop, so I searched the Internet for relevant information.

Python is the usual recommendation, but I am more comfortable with Java and learning Python would take some time, so I decided to stay with Java. At first I wanted to write the crawler from scratch and looked into Apache HttpClient, but then I considered building on a mature open source framework instead. At the moment Apache Nutch and Heritrix look like good choices, but I have not started experimenting yet, so this article is to be continued ...
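To give a rough idea of what the "write it from scratch" route would involve, here is a minimal sketch of the two basic steps of a crawler: fetching a page and pulling the outgoing links from it. It is only an illustration under assumptions. It uses the JDK's built-in java.net.http.HttpClient (Java 11+) rather than Apache HttpClient, the class name CrawlerSketch and its methods are hypothetical, and the regex-based link extraction is deliberately naive. It does not handle login, robots.txt, rate limiting, or anything specific to Sina Weibo, which is a large part of why a mature framework such as Nutch or Heritrix is attractive.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration only; not part of Nutch,
// Heritrix, or Apache HttpClient.
final class CrawlerSketch {

    // Naive pattern for href="..." or href='...' values.
    // A real crawler should use a proper HTML parser instead.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

    // Download one page as text using the JDK's built-in HTTP client (Java 11+).
    static String fetch(URI url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        HttpRequest request = HttpRequest.newBuilder(url).GET().build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    // Collect the http(s) links found in the HTML, resolved against the page URL.
    static List<URI> extractLinks(String html, URI base) {
        List<URI> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            String href = m.group(1).trim();
            // Drop the fragment part so "/page#top" and "/page" count as one link.
            int hash = href.indexOf('#');
            if (hash >= 0) {
                href = href.substring(0, hash);
            }
            if (href.isEmpty()) {
                continue;
            }
            try {
                URI resolved = base.resolve(href);
                String scheme = resolved.getScheme();
                // Keep only web links; skip mailto:, javascript:, and similar.
                if ("http".equals(scheme) || "https".equals(scheme)) {
                    links.add(resolved);
                }
            } catch (IllegalArgumentException e) {
                // Skip href values that are not valid URIs.
            }
        }
        return links;
    }
}
```

A crawl loop would call fetch on a start URL, pass the result to extractLinks, and push unseen links onto a queue. Deduplication, politeness delays, and storage (for example into Hadoop) are exactly the parts a framework would otherwise provide.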


Related links:

Apache Nutch Wiki

Nutch Official website:

Heritrix Wiki

GitHub for Nutch

API for Nutch



Other Blogs:

Nutch Introduction and use
Nutch Quick Start (Nutch 2.2.1)

Nutch-2.2.1 Learning

