Reposted from the web; original source unknown.
Heritrix
Heritrix is an open-source, extensible web crawler project. Heritrix is designed to strictly respect the exclusion directives in robots.txt files and META robots tags.
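For illustration, robots.txt compliance can be reduced to fetching the file and testing each candidate path against its Disallow rules. The following is a minimal sketch of that idea in Java, not Heritrix's actual implementation:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

// Minimal robots.txt check: fetches /robots.txt and tests a path against
// the Disallow rules in the "User-agent: *" section. A hypothetical sketch,
// not Heritrix's code.
public class RobotsCheck {
    public static boolean isAllowed(String host, String path) throws Exception {
        List<String> disallowed = new ArrayList<>();
        boolean inStarSection = false;
        URL robots = new URL("http://" + host + "/robots.txt");
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(robots.openStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (line.toLowerCase().startsWith("user-agent:")) {
                    inStarSection = line.substring(11).trim().equals("*");
                } else if (inStarSection && line.toLowerCase().startsWith("disallow:")) {
                    String rule = line.substring(9).trim();
                    if (!rule.isEmpty()) disallowed.add(rule);
                }
            }
        }
        for (String rule : disallowed) {
            if (path.startsWith(rule)) return false;   // path falls under a Disallow rule
        }
        return true;
    }
}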
WebSPHINX
WebSPHINX is an interactive development environment for web crawlers in the form of a Java class library. A web crawler (also known as a robot or spider) is a program that automatically browses and processes web pages. WebSPHINX consists of two parts: the crawler workbench and the WebSPHINX class library.
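To make the definition concrete, the heart of any crawler is a loop over a frontier of URLs: fetch a page, extract its links, and enqueue the ones not yet seen. A simplified sketch follows; the regex link extraction is only for brevity, as real crawlers such as WebSPHINX use a proper HTML parser:

import java.net.URL;
import java.util.*;
import java.util.regex.*;

// Skeleton of a breadth-first crawler: fetch, extract links, enqueue unseen URLs.
public class SimpleCrawler {
    public static void crawl(String seed, int maxPages) throws Exception {
        Queue<String> frontier = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        frontier.add(seed);
        seen.add(seed);
        Pattern href = Pattern.compile("href=\"(http[^\"]+)\"");
        while (!frontier.isEmpty() && seen.size() <= maxPages) {
            String url = frontier.poll();
            String html;
            try (Scanner s = new Scanner(new URL(url).openStream())) {
                html = s.useDelimiter("\\A").hasNext() ? s.next() : "";
            }
            System.out.println("Processed " + url);
            Matcher m = href.matcher(html);
            while (m.find()) {
                // add() returns true only for URLs we have not seen before
                if (seen.add(m.group(1))) frontier.add(m.group(1));
            }
        }
    }
}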
WebLech
WebLech is a powerful tool for downloading and mirroring web sites. It supports downloading a site according to functional requirements and emulates the behavior of a standard web browser as closely as possible. WebLech has a functional console and uses multiple threads.
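The multithreaded download behavior described above can be sketched with a fixed thread pool from the Java standard library; this illustrates the technique and is not WebLech's own code:

import java.io.InputStream;
import java.net.URL;
import java.nio.file.*;
import java.util.List;
import java.util.concurrent.*;

// Downloads a list of URLs concurrently with a fixed pool of worker threads.
public class ParallelDownloader {
    public static void download(List<String> urls, Path outDir) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (String url : urls) {
            pool.submit(() -> {
                try (InputStream in = new URL(url).openStream()) {
                    String name = url.substring(url.lastIndexOf('/') + 1);
                    Files.copy(in, outDir.resolve(name), StandardCopyOption.REPLACE_EXISTING);
                } catch (Exception e) {
                    System.err.println("Failed: " + url + " (" + e.getMessage() + ")");
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.MINUTES);
    }
}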
Arale
Arale is designed primarily for personal use and, unlike other crawlers, does not focus on page indexing. Arale can download an entire web site or selected resources from one. Arale can also map dynamic pages to static pages.
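Mapping a dynamic page to a static one essentially means turning a URL with a query string into a legal file name. A hypothetical illustration of one such scheme, not necessarily the one Arale uses:

// One possible mapping from a dynamic URL to a static file name:
// characters that are illegal in file names are replaced, and an
// .html suffix is appended. Hypothetical; not Arale's actual scheme.
public class StaticMapper {
    public static String toStaticName(String url) {
        String name = url.replace('?', '_').replace('&', '_').replace('/', '_');
        return name.endsWith(".html") ? name : name + ".html";
    }

    public static void main(String[] args) {
        // prints: item.php_id=42_lang=en.html
        System.out.println(toStaticName("item.php?id=42&lang=en"));
    }
}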
JSpider
JSpider is a fully configurable and customizable web spider engine. You can use it to check a site for errors (internal server errors and the like), verify external links, analyze site structure (it can create a site map), and download an entire web site; you can also write a JSpider plugin to add whatever functionality you need.
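Site error checking of this kind comes down to issuing a request per URL and inspecting the HTTP status code. A minimal sketch of the technique using the JDK's HttpURLConnection, not JSpider's actual code:

import java.net.HttpURLConnection;
import java.net.URL;

// Reports URLs that answer with an error status (4xx client, 5xx server).
public class LinkChecker {
    public static void check(String url) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("HEAD");       // status only, skip the body
        int status = conn.getResponseCode();
        if (status >= 400) {
            System.out.println(url + " -> " + status
                    + (status >= 500 ? " (server error)" : " (broken link)"));
        }
        conn.disconnect();
    }
}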
Spindle
Spindle is a web indexing/search tool built on the Lucene toolkit. It includes an HTTP spider for creating indexes and a search class for querying them. The Spindle project also provides a set of JSP tag libraries that let JSP-based sites add search functionality without writing any Java classes.
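Because Spindle is built on Lucene, its two halves follow the standard Lucene pattern of indexing with an IndexWriter and querying with an IndexSearcher. The sketch below is written against a recent Lucene release; exact APIs vary across Lucene versions:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.*;
import org.apache.lucene.store.FSDirectory;
import java.nio.file.Paths;

// The standard Lucene pattern a tool like Spindle is built around:
// index documents with IndexWriter, then query them with IndexSearcher.
public class IndexAndSearch {
    public static void main(String[] args) throws Exception {
        FSDirectory dir = FSDirectory.open(Paths.get("index"));
        StandardAnalyzer analyzer = new StandardAnalyzer();

        // Index one crawled page.
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
            Document doc = new Document();
            doc.add(new StringField("url", "http://example.com/", Field.Store.YES));
            doc.add(new TextField("content", "page text goes here", Field.Store.NO));
            writer.addDocument(doc);
        }

        // Search the index.
        try (IndexReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            Query q = new QueryParser("content", analyzer).parse("text");
            for (ScoreDoc hit : searcher.search(q, 10).scoreDocs) {
                System.out.println(searcher.doc(hit.doc).get("url"));
            }
        }
    }
}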
Arachnid
Arachnid is a Java-based web spider framework. It contains a simple HTML parser that processes an input stream of HTML content. By subclassing Arachnid, you can develop a simple web spider, adding a few lines of code that are called after each page on the site is parsed. The Arachnid download package contains two example spider applications that demonstrate how to use the framework.
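The subclass-and-override pattern described above can be shown with a self-contained miniature; the class and method names here are invented for illustration and are not Arachnid's real API:

// A self-contained miniature of the pattern: the framework drives the
// crawl and calls a hook after each page is parsed; users subclass and
// override the hook. Names are invented, not Arachnid's actual API.
abstract class MiniSpider {
    public final void run(java.util.List<String> pages) {
        for (String page : pages) {
            // ... fetch and parse the page here ...
            handlePage(page);   // framework calls back into the subclass
        }
    }
    protected abstract void handlePage(String url);
}

public class TitleSpider extends MiniSpider {
    @Override
    protected void handlePage(String url) {
        System.out.println("Parsed: " + url);   // a few lines of custom logic
    }

    public static void main(String[] args) {
        new TitleSpider().run(java.util.List.of("http://example.com/"));
    }
}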
LARM
LARM provides a pure-Java search solution for users of the Jakarta Lucene search-engine framework. It contains methods for indexing files and database tables, and crawlers for indexing web sites.
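Indexing a database table for Lucene amounts to turning each row into a document. A hypothetical sketch in which the table and column names are invented; this is not LARM's actual code:

import org.apache.lucene.document.*;
import org.apache.lucene.index.IndexWriter;
import java.sql.*;

// Turns each row of a (hypothetical) "articles" table into a Lucene document.
public class TableIndexer {
    public static void indexTable(Connection db, IndexWriter writer) throws Exception {
        try (Statement st = db.createStatement();
             ResultSet rs = st.executeQuery("SELECT id, body FROM articles")) {
            while (rs.next()) {
                Document doc = new Document();
                doc.add(new StringField("id", rs.getString("id"), Field.Store.YES));
                doc.add(new TextField("body", rs.getString("body"), Field.Store.NO));
                writer.addDocument(doc);
            }
        }
    }
}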
JoBo
JoBo is a simple tool for downloading entire web sites. It is essentially a web spider. Its main advantage over other download tools is its ability to fill in forms automatically (for automatic login, for example) and to use cookies to handle sessions. JoBo also has flexible download rules (based on URL, size, MIME type, and so on) to restrict what is downloaded.
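Automatic login and cookie-based sessions can be reproduced with the JDK's own networking classes: install a CookieManager, POST the login form once, and later requests in the same process carry the session cookie automatically. A sketch with invented form field names, not JoBo's implementation:

import java.net.*;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// POSTs a login form, then relies on the installed CookieManager to
// attach the session cookie to every later request in the same JVM.
public class FormLogin {
    public static void login(String loginUrl, String user, String pass) throws Exception {
        CookieHandler.setDefault(new CookieManager());   // enables session cookies
        HttpURLConnection conn = (HttpURLConnection) new URL(loginUrl).openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        // "username"/"password" are hypothetical field names; real forms vary.
        String form = "username=" + URLEncoder.encode(user, "UTF-8")
                    + "&password=" + URLEncoder.encode(pass, "UTF-8");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(form.getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("Login status: " + conn.getResponseCode());
    }
}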
Snoics-reptile
Snoics-reptile is a web-site mirroring and crawling tool developed in pure Java. Starting from the seed URLs given in a configuration file, it fetches every resource of a site that a browser could retrieve, including web pages and files of all kinds (images, Flash, MP3, ZIP, RAR, EXE, and so on), and saves them to local disk while keeping the original site structure exactly intact. Simply place the crawled site on a web server (such as Apache) to obtain a complete mirror of the site.
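Preserving the original site structure just means translating each URL's path into the same path under a local root directory. A simplified sketch of that mapping:

import java.net.URL;
import java.nio.file.*;

// Maps http://host/a/b/page.html to <root>/host/a/b/page.html so the
// mirrored copy keeps the site's directory layout. Simplified sketch.
public class MirrorPath {
    public static Path localPathFor(String url, Path root) throws Exception {
        URL u = new URL(url);
        String path = u.getPath().isEmpty() || u.getPath().endsWith("/")
                ? u.getPath() + "index.html"   // directory URLs get a default name
                : u.getPath();
        return root.resolve(u.getHost() + path);
    }

    public static void main(String[] args) throws Exception {
        // prints: mirror/example.com/img/logo.png
        System.out.println(localPathFor("http://example.com/img/logo.png",
                Paths.get("mirror")));
    }
}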
Web-Harvest
Web-Harvest is an open-source Java web data extraction tool. It can collect specified web pages and extract useful data from them. Web-Harvest relies mainly on techniques such as XSLT, XQuery, and regular expressions to operate on text and XML.
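The XPath side of such extraction can be illustrated with the JDK's built-in javax.xml.xpath package. Note that this works on well-formed XML/XHTML; raw HTML must first be cleaned into XML, a step that tools like Web-Harvest handle in their pipelines and that this sketch skips:

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.*;
import org.w3c.dom.*;
import org.xml.sax.InputSource;
import java.io.StringReader;

// Extracts all link targets from a well-formed XHTML snippet with XPath.
public class XPathExtract {
    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body><a href='http://example.com/a'>A</a>"
                     + "<a href='http://example.com/b'>B</a></body></html>";
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList links = (NodeList) xpath.evaluate("//a/@href", doc, XPathConstants.NODESET);
        for (int i = 0; i < links.getLength(); i++) {
            System.out.println(links.item(i).getNodeValue());
        }
    }
}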