Java open-source Web Crawler

Source: Internet
Author: User

Heritrix

Heritrix is an open-source, scalable web crawler project. Heritrix is designed to strictly follow the exclusion directives in robots.txt files and META robots tags.
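To illustrate what "following exclusion directives" means in practice, here is a minimal sketch of robots.txt Disallow handling in plain Java. This is an illustration of the rule logic only, not Heritrix's actual API; the class and method names are hypothetical.

```java
// Illustrative robots.txt exclusion check (hypothetical names; not Heritrix's API).
import java.util.ArrayList;
import java.util.List;

public class RobotsCheck {
    // Collect Disallow paths that apply to the wildcard "*" user-agent.
    static List<String> parseDisallows(String robotsTxt) {
        List<String> rules = new ArrayList<>();
        boolean applies = false;
        for (String line : robotsTxt.split("\n")) {
            String l = line.trim();
            if (l.toLowerCase().startsWith("user-agent:")) {
                applies = l.substring(11).trim().equals("*");
            } else if (applies && l.toLowerCase().startsWith("disallow:")) {
                String path = l.substring(9).trim();
                if (!path.isEmpty()) rules.add(path);
            }
        }
        return rules;
    }

    // A URL path is allowed unless it falls under a disallowed prefix.
    static boolean isAllowed(List<String> disallows, String path) {
        for (String rule : disallows) {
            if (path.startsWith(rule)) return false;
        }
        return true;
    }
}
```

A polite crawler runs such a check before every fetch and skips any URL the rules exclude.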

WebSPHINX

WebSPHINX is an interactive development environment for web crawlers written in Java. Web crawlers (also known as robots or spiders) are programs that automatically browse and process web pages. WebSPHINX consists of two parts: the crawler workbench and the WebSPHINX class library.

WebLech

WebLech is a powerful tool for downloading and mirroring websites. It supports downloading a site according to functional requirements and can mimic the behavior of a standard web browser as closely as possible. WebLech has a functional console and uses multithreading.

Arale

Arale is designed mainly for personal use and, unlike other crawlers, does not focus on page indexing. Arale can download an entire website or selected resources from a site, and can also map dynamic pages to static pages.

JSpider

JSpider is a fully configurable and customizable web spider engine. You can use it to check a website for errors (internal server errors and so on), check internal and external links, analyze the site's structure (for example, to create a site map), and download the entire site. You can also write JSpider plug-ins to extend it with the functionality you need.
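One of the checks described above, distinguishing internal from external links, can be sketched with the standard `java.net.URI` class. The class and method names here are illustrative, not JSpider's actual API.

```java
// Hypothetical helper showing internal vs. external link classification,
// the kind of check JSpider performs; not JSpider's actual API.
import java.net.URI;

public class LinkClassifier {
    // A link counts as internal when it resolves to the same host as the base URL.
    static boolean isInternal(String baseUrl, String link) {
        URI base = URI.create(baseUrl);
        URI resolved = base.resolve(link);
        return base.getHost() != null
                && base.getHost().equalsIgnoreCase(resolved.getHost());
    }
}
```

Relative links resolve against the base URL and are therefore internal; absolute links to another host are external.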

Spindle

Spindle is a web indexing/search tool built on the Lucene toolkit. It includes an HTTP spider that builds indexes and a search class for querying them. The Spindle project provides a set of JSP tag libraries so that JSP-based sites can add search functionality without writing any Java classes.
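The split between an indexing step and a search step can be sketched with a tiny in-memory inverted index. This is plain Java for illustration only, not the Lucene API that Spindle actually uses.

```java
// Toy inverted index illustrating Spindle's spider-plus-search split;
// plain Java, not the Lucene API Spindle is built on.
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class TinyIndex {
    private final Map<String, Set<String>> postings = new HashMap<>();

    // "Spider" side: tokenize a page and record which document each term came from.
    void index(String docId, String text) {
        for (String term : text.toLowerCase().split("\\W+")) {
            if (!term.isEmpty())
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    // Search side: look up the documents containing a term.
    Set<String> search(String term) {
        return postings.getOrDefault(term.toLowerCase(), Collections.emptySet());
    }
}
```

A real engine adds ranking, phrase queries, and on-disk storage, but the index/search division of labor is the same.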

Arachnid

Arachnid is a Java-based web spider framework. It contains a simple HTML parser that can analyze an input stream containing HTML content. By implementing a subclass of Arachnid, you can develop a simple web spider, adding a few lines of code that are invoked after each page on a website is parsed. The Arachnid download package contains two example spider applications that demonstrate how to use the framework.
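The subclass-plus-callback pattern described above can be sketched as follows. The framework drives the crawl and invokes a hook after each page is parsed; the class and method names here are illustrative, not Arachnid's real ones.

```java
// Sketch of the subclass-with-callback pattern Arachnid uses.
// Names are hypothetical; real fetching and parsing are stubbed out.
import java.util.List;

public abstract class SimpleSpider {
    // Hook the subclass implements: called once per parsed page.
    protected abstract void handlePage(String url, String parsedHtml);

    // Framework driver: a real spider would fetch and parse each URL here.
    public void crawl(List<String> urls) {
        for (String url : urls) {
            String parsed = "<parsed content of " + url + ">"; // stand-in for the HTML parser
            handlePage(url, parsed);
        }
    }
}
```

A spider is then just a subclass that overrides `handlePage` with whatever per-page processing it needs.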

LARM

LARM provides a pure-Java search solution for users of the Jakarta Lucene search-engine framework. It contains methods for indexing files, database tables, and websites.

JoBo

JoBo is a simple tool for downloading an entire website. It is essentially a web spider. Its main advantage over other download tools is the ability to automatically fill in forms (for example, for automatic login) and use cookies to handle sessions. JoBo also has flexible download rules (based on a page's URL, size, and MIME type, for example) to restrict what is downloaded.
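A download rule combining the three criteria mentioned (URL pattern, size, MIME type) might look like the following sketch. The class and field names are hypothetical, not JoBo's actual API.

```java
// Illustrative download rule combining URL pattern, size, and MIME-type
// restrictions, the kinds JoBo supports; hypothetical names, not JoBo's API.
public class DownloadRule {
    private final String urlPattern;  // regex the URL must match
    private final long maxBytes;      // maximum allowed document size
    private final String mimePrefix;  // required MIME-type prefix, e.g. "text/"

    DownloadRule(String urlPattern, long maxBytes, String mimePrefix) {
        this.urlPattern = urlPattern;
        this.maxBytes = maxBytes;
        this.mimePrefix = mimePrefix;
    }

    // A resource is downloaded only if it passes all three checks.
    boolean accepts(String url, long sizeBytes, String mimeType) {
        return url.matches(urlPattern)
                && sizeBytes <= maxBytes
                && mimeType.startsWith(mimePrefix);
    }
}
```

The spider consults such a rule before fetching each resource, so large binaries or off-pattern URLs are skipped without being downloaded.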

Snoics-reptile

Snoics-reptile is a pure-Java tool for mirroring websites. Starting from a URL entry given in a configuration file, it captures to the local machine all resources on the website that a browser could obtain via GET, including web pages and files of various types, such as images, Flash, MP3, ZIP, RAR, and EXE files. The entire website can be stored completely on the hard disk with its original structure kept intact. You only need to put the captured site on a web server (such as Apache) to obtain a complete mirror of the website.
Downloads:
Snoics-reptile2.0.part1.rar
Snoics-reptile2.0.part2.rar
Snoics-reptile2.0-doc.rar

Web-Harvest

Web-Harvest is an open-source Java web data extraction tool. It can collect specified web pages and extract useful data from them. Web-Harvest mainly uses technologies such as XSLT, XQuery, and regular expressions to operate on text and XML.
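As a flavor of regex-based extraction, the standard `java.util.regex` package can pull structured values out of a page. This is a plain-Java illustration of the technique, not Web-Harvest's XQuery/XSLT pipeline or its API.

```java
// Regex extraction in the spirit of Web-Harvest's text processing;
// plain java.util.regex, not Web-Harvest's actual pipeline.
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Extractor {
    // Pull every double-quoted href value out of an HTML fragment.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = Pattern.compile("href=\"([^\"]+)\"").matcher(html);
        while (m.find()) links.add(m.group(1));
        return links;
    }
}
```

For well-formed markup an XPath/XQuery processor is more robust than regexes, which is why Web-Harvest supports both approaches.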
