Java web crawler frameworks:
Apache Nutch, Heritrix, and others. This list mainly draws on the 40 open source projects provided by the open source community.
Article background:
I recently needed to write a crawler to capture Sina Weibo data and then store and analyze it with Hadoop, so I searched the Internet for relevant information.
Python is the usual recommendation, but I am more comfortable with Java and learning Python would take some time, so I chose Java. At first I wanted to write the crawler from scratch and looked into Apache HttpClient. Then I considered building on a mature open source framework instead. For now, Apache Nutch and Heritrix look like good choices, but I have not started experimenting yet, so this article is to be continued ...
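To give a rough idea of what the from-scratch route involves, here is a minimal sketch of a breadth-first crawl loop using only the JDK. It is not how Nutch or Heritrix work internally. The class name MiniCrawler, its methods, and the example.com style URLs are all hypothetical, and the fetcher is a plain function backed by an in-memory map so that the sketch runs without network access. In a real crawler that function would presumably wrap an HTTP client such as Apache HttpClient.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class for illustration; not part of Nutch, Heritrix, or HttpClient.
class MiniCrawler {
    // Naive pattern: absolute http(s) links in double-quoted href attributes only.
    private static final Pattern HREF =
            Pattern.compile("href=\"(https?://[^\"]+)\"", Pattern.CASE_INSENSITIVE);

    /** Extracts absolute links from an HTML string, in document order. */
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    /**
     * Breadth-first crawl starting at seed. The fetcher maps a URL to its HTML
     * and returns null when the page cannot be retrieved (an assumption of this sketch).
     */
    static List<String> crawl(String seed, Function<String, String> fetcher, int maxPages) {
        Deque<String> frontier = new ArrayDeque<>(); // URLs waiting to be fetched
        Set<String> seen = new HashSet<>();          // URLs already queued, to avoid loops
        List<String> visited = new ArrayList<>();    // URLs fetched successfully, in order
        frontier.add(seed);
        seen.add(seed);
        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.poll();
            String html = fetcher.apply(url);
            if (html == null) {
                continue; // skip pages that could not be fetched
            }
            visited.add(url);
            for (String link : extractLinks(html)) {
                if (seen.add(link)) { // add() returns false for URLs seen before
                    frontier.add(link);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        // In-memory stand-in for the web; a real fetcher would issue HTTP requests.
        Map<String, String> pages = Map.of(
                "http://a.example/",
                "<a href=\"http://b.example/\">b</a> <a href=\"http://c.example/\">c</a>",
                "http://b.example/", "<a href=\"http://a.example/\">back</a>",
                "http://c.example/", "no links here");
        System.out.println(crawl("http://a.example/", pages::get, 10));
    }
}
```

Even this small loop leaves out politeness delays, robots.txt handling, relative URL resolution, retries, and persistent storage, which is a large part of why a mature framework is attractive.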
Related links:
Apache Nutch Wiki
Nutch official website
Heritrix Wiki
GitHub for Nutch
API for Nutch
Other blogs:
Nutch Introduction and use
Nutch Quick Start (Nutch 2.2.1)
Nutch-2.2.1 Learning
Java Web crawler Framework