(Repost) A few open source Java crawlers

Last Update:2018-07-20 Source: Internet

Author: User

Reposted from: the Internet

Original source: unknown

Heritrix

Heritrix is an open source, scalable web crawler project. It is designed to strictly respect the exclusion directives in robots.txt files and META robots tags.
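To illustrate what respecting robots.txt involves, here is a minimal sketch of the check a polite crawler might run before fetching a path. It is not Heritrix code: the class `RobotsRules` and its methods are hypothetical names, and the sketch only understands `User-agent: *` groups with plain `Disallow` prefixes (no `Allow` lines, wildcards, or crawl delays).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Heritrix. It collects the Disallow
// prefixes that apply to all user agents ("User-agent: *").
class RobotsRules {
    private final List<String> disallowed = new ArrayList<>();

    static RobotsRules parse(String robotsTxt) {
        RobotsRules rules = new RobotsRules();
        boolean appliesToUs = false;
        for (String rawLine : robotsTxt.split("\\R")) {
            // Drop comments and surrounding whitespace.
            int hash = rawLine.indexOf('#');
            String line = (hash >= 0 ? rawLine.substring(0, hash) : rawLine).trim();
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // blank line or something we do not understand
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                // Simplification: only the most recent User-agent line counts.
                appliesToUs = value.equals("*");
            } else if (field.equals("disallow") && appliesToUs && !value.isEmpty()) {
                rules.disallowed.add(value);
            }
        }
        return rules;
    }

    // A path is allowed unless it starts with one of the collected prefixes.
    boolean isAllowed(String path) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }
}
```

A crawler would call `RobotsRules.parse(...)` once per host and consult `isAllowed(path)` before queuing each URL.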

Websphinx

Websphinx is a Java class library and an interactive development environment for web crawlers. A web crawler (also known as a robot or spider) is a program that automatically browses and processes web pages. Websphinx consists of two parts: the crawler workbench and the Websphinx class library.
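The core of any such crawler is a loop over a queue of unvisited URLs. The sketch below shows that loop in breadth-first form. It does not use the Websphinx API: `CrawlLoop`, `crawl`, and the `linksOf` function (which stands in for "download this page and return its links") are assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Hypothetical sketch of a breadth-first crawl loop.
class CrawlLoop {
    // linksOf is a stand-in for fetching a page and extracting its links.
    static List<String> crawl(String seed,
                              Function<String, List<String>> linksOf,
                              int maxPages) {
        List<String> visitedInOrder = new ArrayList<>();
        Set<String> seen = new HashSet<>();         // URLs already queued
        Deque<String> frontier = new ArrayDeque<>(); // URLs waiting to be visited
        frontier.add(seed);
        seen.add(seed);
        while (!frontier.isEmpty() && visitedInOrder.size() < maxPages) {
            String url = frontier.poll();
            visitedInOrder.add(url);
            for (String link : linksOf.apply(url)) {
                // Set.add returns false for URLs we have already queued.
                if (seen.add(link)) {
                    frontier.add(link);
                }
            }
        }
        return visitedInOrder;
    }
}
```

The `seen` set is what keeps the crawler from looping forever on sites whose pages link back to each other.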

Weblech

Weblech is a powerful tool for downloading and mirroring web sites. It can download web sites according to functional requirements and emulates the behavior of a standard web browser as closely as possible. Weblech has a functional console and is multithreaded.

Arale

Arale is designed primarily for personal use and, unlike other crawlers, does not focus on page indexing. Arale can download an entire web site or selected resources from a web site. It can also map dynamic pages to static pages.

J-spider

J-spider is a fully configurable and customizable web spider engine. You can use it to check a site for errors (such as internal server errors), check its external links, analyze the structure of the site (for example, to create a site map), and download an entire web site. You can also write a J-spider plugin to add the functionality you need.

Spindle

Spindle is a web indexing and search tool built on the Lucene toolkit. It includes an HTTP spider for creating indexes and a search class for searching those indexes. The Spindle project also provides a set of JSP tag libraries that let JSP-based sites add search functionality without developing any Java classes.

Arachnid

Arachnid is a Java-based web spider framework. It contains a simple HTML parser that can parse an input stream containing HTML content. By subclassing Arachnid, you can develop a simple web spider and add a few lines of code that are called after each page of a web site is parsed. The Arachnid download package contains two example spider applications that demonstrate how to use the framework.
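As a rough illustration of pulling links out of an HTML input stream, the sketch below uses a regular expression rather than a real parser, so it will miss unquoted attributes and other edge cases. It is not Arachnid's parser or API: `LinkExtractor` and `extractLinks` are hypothetical names, and the code assumes UTF-8 input and Java 9 or later (for `InputStream.readAllBytes`).

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: finds quoted href values in <a> tags.
class LinkExtractor {
    // Group 1 is the link target without any "#fragment" part.
    private static final Pattern HREF = Pattern.compile(
            "<a\\s[^>]*?href\\s*=\\s*[\"']([^\"'#]+)[^\"']*[\"']",
            Pattern.CASE_INSENSITIVE);

    static List<String> extractLinks(InputStream html, URI base) {
        String text;
        try {
            text = new String(html.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(text);
        while (m.find()) {
            try {
                // Resolve relative links against the page's own URL.
                links.add(base.resolve(m.group(1).trim()).toString());
            } catch (IllegalArgumentException e) {
                // Skip targets that are not valid URIs (for example, with spaces).
            }
        }
        return links;
    }
}
```

Links that consist only of a fragment, such as `#top`, are skipped because they point back into the same page.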

Larm

Larm provides a pure Java search solution for users of the Jakarta Lucene search engine framework. It contains methods for indexing files and database tables, as well as crawlers for indexing web sites.

Jobo

Jobo is a simple tool for downloading an entire web site. It is essentially a web spider. Its main advantage over other download tools is its ability to fill in forms automatically (for example, to log in automatically) and to use cookies to handle sessions. Jobo also has flexible download rules (based on URL, size, MIME type, and so on) to limit what is downloaded.
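Download rules of this kind can be thought of as a predicate over a resource's URL, size, and MIME type. The sketch below shows one way to express such a rule. It does not reflect Jobo's actual configuration format or classes: `DownloadRule` and its parameters are hypothetical, and it assumes Java 10 or later (for `Set.copyOf`).

```java
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical rule: a resource is downloaded only if all three checks pass.
class DownloadRule {
    private final Pattern urlPattern;
    private final long maxBytes;
    private final Set<String> allowedMimeTypes;

    DownloadRule(String urlRegex, long maxBytes, Set<String> allowedMimeTypes) {
        this.urlPattern = Pattern.compile(urlRegex);
        this.maxBytes = maxBytes;
        this.allowedMimeTypes = Set.copyOf(allowedMimeTypes);
    }

    boolean accepts(String url, long contentLength, String mimeType) {
        if (!urlPattern.matcher(url).matches()) {
            return false; // URL is outside the allowed pattern
        }
        if (contentLength > maxBytes) {
            return false; // resource is too large
        }
        // Ignore parameters such as "; charset=UTF-8" when comparing types.
        String bareType = mimeType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return allowedMimeTypes.contains(bareType);
    }
}
```

For example, a rule built with the regex `https?://example\.com/.*`, a limit of 1,000,000 bytes, and the types `text/html` and `image/png` would keep a small HTML page on that host and reject a ZIP archive or any page from another host.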

Snoics-reptile

Snoics-reptile is a web site mirroring and crawling tool written in pure Java. Starting from the entry URL given in its configuration file, it fetches every resource of the site that a browser can retrieve and saves it locally, including web pages and files of all kinds, such as images, Flash, MP3, ZIP, RAR, and EXE files. It can download an entire site to the hard drive while keeping the original site structure exactly unchanged. Just place the crawled site on a web server (such as Apache) to get a complete mirror of the site.
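Keeping the original site structure comes down to mapping each URL to a matching path under a local mirror directory. The sketch below shows one possible mapping. It is not taken from Snoics-reptile: `MirrorPaths`, `localPathFor`, and the choice of `index.html` for directory URLs are assumptions, and query strings are ignored.

```java
import java.net.URI;
import java.nio.file.Path;

// Hypothetical mapping from a URL to a file under a local mirror root.
class MirrorPaths {
    static Path localPathFor(Path mirrorRoot, String url) {
        URI uri = URI.create(url);
        String host = uri.getHost();
        if (host == null) {
            throw new IllegalArgumentException("URL has no host: " + url);
        }
        String path = uri.getPath();
        if (path == null || path.isEmpty()) {
            path = "/";
        }
        if (path.endsWith("/")) {
            path += "index.html"; // assumed file name for directory URLs
        }
        Path siteRoot = mirrorRoot.resolve(host).normalize();
        Path target = siteRoot.resolve(path.substring(1)).normalize();
        // Refuse paths such as "/../../etc/passwd" that escape the mirror.
        if (!target.startsWith(siteRoot)) {
            throw new IllegalArgumentException("Path escapes mirror root: " + url);
        }
        return target;
    }
}
```

With a mirror root of `mirror`, the URL `http://example.com/img/logo.png` maps to `mirror/example.com/img/logo.png`, so serving the `mirror/example.com` directory from a web server reproduces the original layout.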

Web-harvest

Web-harvest is an open source web data extraction tool written in Java. It can collect specified web pages and extract useful data from them. Web-harvest mainly uses techniques such as XSLT, XQuery, and regular expressions to operate on text and XML.
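To give a feel for XSLT-based extraction, the sketch below applies a small stylesheet with the JDK's built-in `javax.xml.transform` API. It does not use Web-harvest or its configuration format: `XsltExtract`, `extractTitles`, and the `item`/`title` element names are made up for illustration, and the input is assumed to be well-formed XML.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical example: extract the text of every item/title element.
class XsltExtract {
    // Stylesheet that writes one title per line as plain text.
    private static final String TITLES_XSLT =
            "<xsl:stylesheet version=\"1.0\""
          + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
          + "<xsl:output method=\"text\"/>"
          + "<xsl:template match=\"/\">"
          + "<xsl:for-each select=\"//item/title\">"
          + "<xsl:value-of select=\".\"/><xsl:text>&#10;</xsl:text>"
          + "</xsl:for-each>"
          + "</xsl:template>"
          + "</xsl:stylesheet>";

    static List<String> extractTitles(String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(TITLES_XSLT)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)),
                                  new StreamResult(out));
            // Split on any line break and drop empty lines.
            List<String> titles = new ArrayList<>();
            for (String line : out.toString().split("\\R")) {
                if (!line.isEmpty()) {
                    titles.add(line);
                }
            }
            return titles;
        } catch (TransformerException e) {
            throw new IllegalStateException("XSLT transformation failed", e);
        }
    }
}
```

Real web pages are rarely well-formed XML, so an extraction pipeline would normally clean the HTML into XML before a step like this one.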

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
