(Repost) A few open source Java web crawlers

Source: Internet
Author: User

Reposted from: the web

Original source: unknown

Heritrix

Heritrix is an open source, scalable web crawler project. It is designed to strictly follow the directives in robots.txt files and META robots tags.
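To make the robots.txt behaviour concrete, here is a minimal sketch of a crawler deciding whether it may fetch a path. It is not Heritrix code: the class `RobotsRules` and its methods are hypothetical names, and the parser is deliberately simplified (it reads only `Disallow` prefixes in the `User-agent: *` group, and ignores `Allow` lines, wildcards, and groups that list several user agents).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Heritrix: a simplified robots.txt check.
class RobotsRules {
    private final List<String> disallowed = new ArrayList<>();

    // Collects the Disallow prefixes that belong to the "User-agent: *" group.
    static RobotsRules parse(String robotsTxt) {
        RobotsRules rules = new RobotsRules();
        boolean inWildcardGroup = false;
        for (String raw : robotsTxt.split("\\r?\\n")) {
            String line = raw.trim();
            int hash = line.indexOf('#');          // strip trailing comments
            if (hash >= 0) {
                line = line.substring(0, hash).trim();
            }
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue;                           // blank or malformed line
            }
            String field = line.substring(0, colon).trim().toLowerCase();
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                inWildcardGroup = value.equals("*");
            } else if (field.equals("disallow") && inWildcardGroup && !value.isEmpty()) {
                rules.disallowed.add(value);
            }
        }
        return rules;
    }

    // A path is allowed unless it starts with one of the collected prefixes.
    boolean isAllowed(String path) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }
}
```

A crawler would call `RobotsRules.parse(...)` once per host and consult `isAllowed(path)` before queueing each URL.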

Websphinx

Websphinx is a Java class library and an interactive development environment for web crawlers. A web crawler (also known as a robot or spider) is a program that automatically browses and processes Web pages. Websphinx consists of two parts: the crawler workbench and the Websphinx class library.

Weblech

Weblech is a powerful tool for downloading and mirroring Web sites. It downloads sites according to functional requirements and emulates the behavior of a standard Web browser as closely as possible. Weblech has a functional console and is multithreaded.

Arale

Arale is designed primarily for personal use and, unlike other crawlers, does not focus on page indexing. Arale can download an entire Web site or only some resources from it. Arale can also map dynamic pages to static pages.

J-spider

J-spider is a fully configurable and customizable Web spider engine. You can use it to check a site for errors (internal server errors, etc.), check its external links, analyze its structure (it can create a site map), and download the entire Web site. You can also write a J-spider plugin to add the functionality you need.

Spindle

Spindle is a Web indexing/search tool built on the Lucene toolkit. It includes an HTTP spider for creating indexes and a search class for searching them. The Spindle project provides a set of JSP tag libraries that let JSP-based sites add search functionality without developing any Java classes.

Arachnid

Arachnid is a Java-based web spider framework. It contains a simple HTML parser that can parse an input stream containing HTML content. By implementing a subclass of Arachnid, you can develop a simple web spider and add a few lines of code that are called after each page on the Web site is parsed. The Arachnid download package contains two example spider applications that demonstrate how to use the framework.
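The subclass-and-callback pattern described above can be sketched as follows. This is only an illustration of the idea: `SimpleSpider`, `handlePage`, and `LinkCountingSpider` are hypothetical names and are not Arachnid's actual API, and a regular expression stands in for a real HTML parser. Fetching is left out so the sketch runs offline.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical base class; the names are illustrative, not Arachnid's API.
abstract class SimpleSpider {
    // Rough href matcher for <a> tags; a real parser handles many more cases.
    private static final Pattern HREF = Pattern.compile(
        "<a\\s[^>]*?href\\s*=\\s*[\"']([^\"']*)[\"']",
        Pattern.CASE_INSENSITIVE);

    // Hook that subclasses implement; called once per parsed page.
    protected abstract void handlePage(String url, List<String> links);

    // Extracts links from already-fetched HTML, then invokes the hook.
    public void process(String url, String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        handlePage(url, links);
    }
}

// Example subclass: counts every link seen across all processed pages.
class LinkCountingSpider extends SimpleSpider {
    int totalLinks = 0;

    @Override
    protected void handlePage(String url, List<String> links) {
        totalLinks += links.size();
    }
}
```

The subclass contains only the few lines that matter to the application; traversal and parsing stay in the base class.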

Larm

Larm provides a pure Java search solution for users of the Jakarta Lucene search engine framework. It contains methods for indexing files and database tables, and a crawler for indexing Web sites.

Jobo

Jobo is a simple tool for downloading an entire Web site. It is essentially a web spider. Its main advantage over other download tools is that it can fill in forms automatically (e.g. for automatic login) and use cookies to handle sessions. Jobo also has flexible download rules (based on URL, size, MIME type, etc.) to limit what is downloaded.
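As a rough illustration of rule-based download limits, the sketch below combines the three criteria mentioned above (URL, size, MIME type) into one check. `DownloadRule` is a hypothetical class written for this article; it does not reflect Jobo's real rule format or configuration.

```java
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical rule object; Jobo's real rule syntax is not shown here.
class DownloadRule {
    private final Pattern urlPattern;
    private final long maxBytes;
    private final Set<String> allowedMimeTypes;

    DownloadRule(String urlRegex, long maxBytes, Set<String> allowedMimeTypes) {
        this.urlPattern = Pattern.compile(urlRegex);
        this.maxBytes = maxBytes;
        this.allowedMimeTypes = allowedMimeTypes;
    }

    // A resource is downloaded only if all three conditions hold.
    boolean accepts(String url, long sizeBytes, String mimeType) {
        return urlPattern.matcher(url).matches()
            && sizeBytes <= maxBytes
            && allowedMimeTypes.contains(mimeType.toLowerCase(Locale.ROOT));
    }
}
```

For example, a rule built with the regex `https?://example\.com/.*`, a 1 MB limit, and the MIME types `text/html` and `image/png` would accept a small HTML page on that host and reject a large zip archive or any page on another host.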


Snoics-reptile

Snoics-reptile is a Web site mirroring and crawling tool written in pure Java. Starting from the entry URL given in its configuration file, it fetches every resource of the site that a browser can reach and saves it locally, including Web pages and various other file types such as pictures, Flash, MP3, zip, rar, and EXE files. It can download an entire site to the hard drive while keeping the original site structure exactly unchanged. Just put the crawled site on a Web server (such as Apache) and you have a complete mirror of the site.
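Keeping the site structure unchanged mostly comes down to mapping each URL to a matching local path. The following is a minimal sketch of one way to do that, not Snoics-reptile's own code: `MirrorPaths` and `localPathFor` are hypothetical names, query strings are ignored, and directory URLs are assumed to map to `index.html`.

```java
import java.net.URI;
import java.nio.file.Path;

// Hypothetical helper showing one possible URL-to-file mapping for a mirror.
class MirrorPaths {
    // Maps http://host/a/b.png to <mirrorRoot>/host/a/b.png.
    static Path localPathFor(Path mirrorRoot, String url) {
        URI uri = URI.create(url);
        String host = uri.getHost();
        if (host == null) {
            throw new IllegalArgumentException("URL has no host: " + url);
        }
        String path = uri.getPath();
        if (path == null || path.isEmpty()) {
            path = "/";
        }
        if (path.endsWith("/")) {
            path += "index.html";   // assumption: directories map to index.html
        }
        Path root = mirrorRoot.normalize();
        // Drop the leading '/' so the URL path resolves beneath the host folder.
        Path result = root.resolve(host).resolve(path.substring(1)).normalize();
        if (!result.startsWith(root)) {
            // Guards against "../" segments escaping the mirror directory.
            throw new IllegalArgumentException("Path escapes mirror root: " + url);
        }
        return result;
    }
}
```

Because the local layout mirrors the URL layout, pointing a Web server at the host folder serves the copied site with its original relative links intact.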

Web-harvest

Web-harvest is an open source Web data extraction tool written in Java. It can collect specified Web pages and extract useful data from them. Web-harvest mainly uses techniques such as XSLT, XQuery, and regular expressions to operate on text and XML.
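To show what XSLT-based extraction looks like in plain Java, the sketch below pulls item titles out of an XML document using the JDK's built-in `javax.xml.transform` API. It does not use Web-harvest itself or its configuration format; `TitleExtractor`, the stylesheet, and the `item`/`title` element names are assumptions made for the example.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical example class; uses only the JDK's XSLT support.
class TitleExtractor {
    // Stylesheet that emits every //item/title as comma-separated text.
    private static final String XSLT =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='text'/>"
      + "<xsl:template match='/'>"
      + "<xsl:for-each select='//item/title'>"
      + "<xsl:value-of select='.'/>"
      + "<xsl:if test='position() != last()'><xsl:text>, </xsl:text></xsl:if>"
      + "</xsl:for-each>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    static String extractTitles(String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            // Wrapped so callers do not have to handle a checked exception.
            throw new IllegalStateException("XSLT extraction failed", e);
        }
    }
}
```

The input must be well-formed XML; real Web pages usually need to be cleaned up into XML before a stylesheet like this can be applied.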
