Java open-source Web Crawler

Source: Internet
Author: User

Heritrix

Heritrix is an open-source, scalable web crawler project. Heritrix is designed to strictly follow the exclusion directives in robots.txt files and META robots tags.
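To illustrate what "following exclusion directives" means in practice, here is a minimal sketch of robots.txt Disallow handling in plain Java. This is an illustration of the rule logic only, not Heritrix's actual API; the class and method names are hypothetical.

```java
// Illustrative robots.txt exclusion check (hypothetical names; not Heritrix's API).
import java.util.ArrayList;
import java.util.List;

public class RobotsCheck {
    // Collect Disallow paths that apply to the wildcard "*" user-agent.
    static List<String> parseDisallows(String robotsTxt) {
        List<String> rules = new ArrayList<>();
        boolean applies = false;
        for (String line : robotsTxt.split("\n")) {
            String l = line.trim();
            if (l.toLowerCase().startsWith("user-agent:")) {
                applies = l.substring(11).trim().equals("*");
            } else if (applies && l.toLowerCase().startsWith("disallow:")) {
                String path = l.substring(9).trim();
                if (!path.isEmpty()) rules.add(path);
            }
        }
        return rules;
    }

    // A URL path is allowed unless it falls under a disallowed prefix.
    static boolean isAllowed(List<String> disallows, String path) {
        for (String rule : disallows) {
            if (path.startsWith(rule)) return false;
        }
        return true;
    }
}
```

A polite crawler runs such a check before every fetch and skips any URL the rules exclude.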

WebSPHINX

WebSPHINX is an interactive development environment for web crawlers written in Java. Web crawlers (also known as robots or spiders) are programs that automatically browse and process web pages. WebSPHINX consists of two parts: the crawler workbench and the WebSPHINX class library.

WebLech

WebLech is a powerful tool for downloading and mirroring websites. It supports downloading a site according to functional requirements and can mimic the behavior of a standard web browser as closely as possible. WebLech has a functional console and uses multithreading.

Arale

Arale is designed mainly for personal use and, unlike other crawlers, does not focus on page indexing. Arale can download an entire website or selected resources from a site, and can also map dynamic pages to static pages.

JSpider

JSpider is a fully configurable and customizable web spider engine. You can use it to check a website for errors (internal server errors and so on), check internal and external links, analyze the site's structure (for example, to create a site map), and download the entire site. You can also write JSpider plug-ins to extend it with the functionality you need.
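One of the checks described above, distinguishing internal from external links, can be sketched with the standard `java.net.URI` class. The class and method names here are illustrative, not JSpider's actual API.

```java
// Hypothetical helper showing internal vs. external link classification,
// the kind of check JSpider performs; not JSpider's actual API.
import java.net.URI;

public class LinkClassifier {
    // A link counts as internal when it resolves to the same host as the base URL.
    static boolean isInternal(String baseUrl, String link) {
        URI base = URI.create(baseUrl);
        URI resolved = base.resolve(link);
        return base.getHost() != null
                && base.getHost().equalsIgnoreCase(resolved.getHost());
    }
}
```

Relative links resolve against the base URL and are therefore internal; absolute links to another host are external.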

Spindle

Spindle is a web indexing/search tool built on the Lucene toolkit. It includes an HTTP spider that builds indexes and a search class for querying them. The Spindle project provides a set of JSP tag libraries so that JSP-based sites can add search functionality without writing any Java classes.
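The split between an indexing step and a search step can be sketched with a tiny in-memory inverted index. This is plain Java for illustration only, not the Lucene API that Spindle actually uses.

```java
// Toy inverted index illustrating Spindle's spider-plus-search split;
// plain Java, not the Lucene API Spindle is built on.
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class TinyIndex {
    private final Map<String, Set<String>> postings = new HashMap<>();

    // "Spider" side: tokenize a page and record which document each term came from.
    void index(String docId, String text) {
        for (String term : text.toLowerCase().split("\\W+")) {
            if (!term.isEmpty())
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    // Search side: look up the documents containing a term.
    Set<String> search(String term) {
        return postings.getOrDefault(term.toLowerCase(), Collections.emptySet());
    }
}
```

A real engine adds ranking, phrase queries, and on-disk storage, but the index/search division of labor is the same.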

Arachnid

Arachnid is a Java-based web spider framework. It contains a simple HTML parser that can analyze an input stream containing HTML content. By implementing a subclass of Arachnid, you can develop a simple web spider, adding a few lines of code that are invoked after each page on a website is parsed. The Arachnid download package contains two example spider applications that demonstrate how to use the framework.
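The subclass-plus-callback pattern described above can be sketched as follows. The framework drives the crawl and invokes a hook after each page is parsed; the class and method names here are illustrative, not Arachnid's real ones.

```java
// Sketch of the subclass-with-callback pattern Arachnid uses.
// Names are hypothetical; real fetching and parsing are stubbed out.
import java.util.List;

public abstract class SimpleSpider {
    // Hook the subclass implements: called once per parsed page.
    protected abstract void handlePage(String url, String parsedHtml);

    // Framework driver: a real spider would fetch and parse each URL here.
    public void crawl(List<String> urls) {
        for (String url : urls) {
            String parsed = "<parsed content of " + url + ">"; // stand-in for the HTML parser
            handlePage(url, parsed);
        }
    }
}
```

A spider is then just a subclass that overrides `handlePage` with whatever per-page processing it needs.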

LARM

LARM provides a pure-Java search solution for users of the Jakarta Lucene search-engine framework. It contains methods for indexing files, database tables, and websites.

JoBo

JoBo is a simple tool for downloading an entire website. It is essentially a web spider. Its main advantage over other download tools is the ability to automatically fill in forms (for example, for automatic login) and use cookies to handle sessions. JoBo also has flexible download rules (based on a page's URL, size, and MIME type, for example) to restrict what is downloaded.
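A download rule combining the three criteria mentioned (URL pattern, size, MIME type) might look like the following sketch. The class and field names are hypothetical, not JoBo's actual API.

```java
// Illustrative download rule combining URL pattern, size, and MIME-type
// restrictions, the kinds JoBo supports; hypothetical names, not JoBo's API.
public class DownloadRule {
    private final String urlPattern;  // regex the URL must match
    private final long maxBytes;      // maximum allowed document size
    private final String mimePrefix;  // required MIME-type prefix, e.g. "text/"

    DownloadRule(String urlPattern, long maxBytes, String mimePrefix) {
        this.urlPattern = urlPattern;
        this.maxBytes = maxBytes;
        this.mimePrefix = mimePrefix;
    }

    // A resource is downloaded only if it passes all three checks.
    boolean accepts(String url, long sizeBytes, String mimeType) {
        return url.matches(urlPattern)
                && sizeBytes <= maxBytes
                && mimeType.startsWith(mimePrefix);
    }
}
```

The spider consults such a rule before fetching each resource, so large binaries or off-pattern URLs are skipped without being downloaded.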

Snoics-reptile

Snoics-reptile is a pure-Java tool for mirroring websites. Starting from a URL entry given in a configuration file, it captures to the local machine all resources on the website that a browser could obtain via GET, including web pages and files of various types, such as images, Flash, MP3, ZIP, RAR, and EXE files. The entire website can be stored completely on the hard disk with its original structure kept intact. You only need to put the captured site on a web server (such as Apache) to obtain a complete mirror of the website.
Downloads:
Snoics-reptile2.0.part1.rar
Snoics-reptile2.0.part2.rar
Snoics-reptile2.0-doc.rar

Web-Harvest

Web-Harvest is an open-source Java web data extraction tool. It can collect specified web pages and extract useful data from them. Web-Harvest mainly uses technologies such as XSLT, XQuery, and regular expressions to operate on text and XML.
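As a flavor of regex-based extraction, the standard `java.util.regex` package can pull structured values out of a page. This is a plain-Java illustration of the technique, not Web-Harvest's XQuery/XSLT pipeline or its API.

```java
// Regex extraction in the spirit of Web-Harvest's text processing;
// plain java.util.regex, not Web-Harvest's actual pipeline.
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Extractor {
    // Pull every double-quoted href value out of an HTML fragment.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = Pattern.compile("href=\"([^\"]+)\"").matcher(html);
        while (m.find()) links.add(m.group(1));
        return links;
    }
}
```

For well-formed markup an XPath/XQuery processor is more robust than regexes, which is why Web-Harvest supports both approaches.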
