Web Crawler Summary

From: http://phengchen.blogspot.com/2008/04/blog-post.html
Heritrix

Heritrix is an open-source, scalable web crawler project. Heritrix is designed to strictly honor robots.txt exclusion directives and META robots tags.
http://crawler.archive.org/
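
To make that politeness rule concrete, here is a minimal robots.txt check in plain Java: fetch /robots.txt and refuse any path matched by a Disallow rule in the User-agent: * group. This is a simplified, generic sketch of the behavior, not Heritrix's actual code.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

// Simplified robots.txt check: only the "User-agent: *" group and plain
// Disallow prefix rules are honored. A generic sketch, not Heritrix code.
public class RobotsCheck {
    public static boolean isAllowed(String host, String path) throws Exception {
        List<String> disallowed = new ArrayList<>();
        boolean inStarGroup = false;
        URL robots = new URL("http://" + host + "/robots.txt");
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(robots.openStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                String lower = line.trim().toLowerCase();
                if (lower.startsWith("user-agent:")) {
                    inStarGroup = lower.substring(11).trim().equals("*");
                } else if (inStarGroup && lower.startsWith("disallow:")) {
                    String rule = line.trim().substring(9).trim();
                    if (!rule.isEmpty()) disallowed.add(rule);
                }
            }
        }
        for (String rule : disallowed) {
            if (path.startsWith(rule)) return false;  // blocked by a rule
        }
        return true;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(isAllowed("crawler.archive.org", "/some/page.html"));
    }
}
```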

WebSPHINX

WebSPHINX is a Java class library and interactive development environment for web crawlers. A web crawler (also known as a robot or spider) is a program that can automatically browse and process web pages. WebSPHINX consists of two parts: the crawler workbench and the WebSPHINX class library.
http://www.cs.cmu.edu/~rcm/websphinx/
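
As a sketch of how the class library is meant to be used: subclass the crawler and override its page and link callbacks. The names below (Crawler, Page, Link, visit, shouldVisit, setRoot) follow the WebSPHINX API as I recall it; treat the exact signatures as assumptions and check the project's javadoc.

```java
import websphinx.Crawler;
import websphinx.Link;
import websphinx.Page;

// Sketch of a WebSPHINX crawler subclass. Method names follow the library's
// documented callback style (visit/shouldVisit); verify against the javadoc.
public class TitlePrinter extends Crawler {
    // Called once for each page the crawler downloads.
    public void visit(Page page) {
        System.out.println(page.getTitle() + " -> " + page.getURL());
    }

    // Decide whether a discovered link should be followed
    // (here: a hypothetical single-host filter).
    public boolean shouldVisit(Link link) {
        return link.getHost().endsWith("cs.cmu.edu");
    }

    public static void main(String[] args) throws Exception {
        TitlePrinter crawler = new TitlePrinter();
        crawler.setRoot(new Link("http://www.cs.cmu.edu/~rcm/websphinx/"));
        crawler.run();
    }
}
```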

WebLech

WebLech is a powerful tool for downloading and mirroring web sites. It supports downloading a site according to functional requirements and can imitate the behavior of a standard web browser as closely as possible. WebLech has a functional console and uses multithreaded operation.
http://weblech.sourceforge.net/
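
The two traits just mentioned, browser imitation and multithreading, come down to sending a browser-like User-Agent header from a pool of worker threads. The sketch below shows the idea in plain Java; the URL list and file-naming scheme are made up for illustration, and this is not WebLech's own code.

```java
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Generic sketch of the two WebLech traits above: a pool of download
// threads, each sending a browser-like User-Agent header.
public class BrowserLikeDownloader {
    static void download(String url) {
        try {
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            // Imitate a standard web browser in the request headers.
            conn.setRequestProperty("User-Agent", "Mozilla/5.0 (compatible)");
            String file = url.replaceAll("[^A-Za-z0-9.]", "_");  // crude local name
            try (InputStream in = conn.getInputStream()) {
                Files.copy(in, Path.of(file));
            }
        } catch (Exception e) {
            System.err.println("failed: " + url + " (" + e.getMessage() + ")");
        }
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> urls = List.of(
                "http://weblech.sourceforge.net/",
                "http://weblech.sourceforge.net/index.html");
        ExecutorService pool = Executors.newFixedThreadPool(4);  // multithreaded fetch
        urls.forEach(u -> pool.submit(() -> download(u)));
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }
}
```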

Arale

Arale is designed mainly for personal use and, unlike other crawlers, does not focus on page indexing. Arale can download an entire web site or selected resources from it, and can also map dynamic pages to static pages.
http://web.tiscali.it/_flat/arale.jsp.html
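
Mapping a dynamic page to a static one essentially means rewriting a parameterized URL into a flat file name that the saved copy can be stored under and served from. The mapping scheme below is my own illustration, not Arale's actual rule.

```java
// Sketch of the "dynamic page to static page" idea: rewrite a URL with
// query parameters into a flat .html file name so the saved copy can be
// served statically. The mapping scheme is illustrative only.
public class StaticMapper {
    static String toStaticName(String url) {
        String name = url
                .replaceFirst("^https?://", "")  // drop the scheme
                .replace('?', '_')               // query string joins the name
                .replace('&', '_')
                .replace('=', '-')
                .replace('/', '_');
        return name.endsWith(".html") ? name : name + ".html";
    }

    public static void main(String[] args) {
        // e.g. prints: site.example_news.jsp_id-42_lang-en.html
        System.out.println(toStaticName("http://site.example/news.jsp?id=42&lang=en"));
    }
}
```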

JSpider

JSpider is a fully configurable and customizable web spider engine. You can use it to check a web site for errors (internal server errors and the like), check its internal and external links, analyze its structure (for example, to create a site map), and download the entire site; you can also write JSpider plug-ins to extend it with the functions you need.
http://j-spider.sourceforge.net/
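
The error and link checking described above reduces to requesting each URL and classifying the HTTP status code. Below is a generic sketch of such a check in plain Java; it is not JSpider's API.

```java
import java.net.HttpURLConnection;
import java.net.URL;

// Sketch of the kind of link check a spider engine performs: issue a
// request and classify the HTTP status code.
public class LinkChecker {
    static void check(String url) {
        try {
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            conn.setRequestMethod("HEAD");   // status only, skip the body
            conn.setConnectTimeout(5000);
            int code = conn.getResponseCode();
            if (code >= 500)      System.out.println(url + " -> server error " + code);
            else if (code >= 400) System.out.println(url + " -> broken link " + code);
            else                  System.out.println(url + " -> ok " + code);
        } catch (Exception e) {
            System.out.println(url + " -> unreachable (" + e.getMessage() + ")");
        }
    }

    public static void main(String[] args) {
        check("http://j-spider.sourceforge.net/");
        check("http://j-spider.sourceforge.net/no-such-page");
    }
}
```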

Spindle

Spindle is a web indexing/search tool built on the Lucene toolkit. It includes an HTTP spider for building indexes and a search class for querying them. The Spindle project provides a set of JSP tag libraries so that JSP-based sites can add search functionality without developing any Java classes.
http://www.bitmechanic.com/projects/spindle/
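
To give a sense of the indexing half, the sketch below adds one fetched page to a Lucene index. It uses the current Lucene API rather than Spindle's own (much older) classes, and the field names are illustrative, so read it as a sketch of the approach, not of Spindle itself.

```java
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

// What a Lucene-backed spider does per fetched page: turn the page into a
// Document and add it to the index. Field names ("url", "contents") are
// illustrative, not Spindle's schema.
public class PageIndexer {
    public static void main(String[] args) throws Exception {
        try (IndexWriter writer = new IndexWriter(
                FSDirectory.open(Paths.get("crawl-index")),
                new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            doc.add(new StringField("url", "http://example.com/", Field.Store.YES));
            doc.add(new TextField("contents",
                    "page text extracted by the spider", Field.Store.YES));
            writer.addDocument(doc);
        }
    }
}
```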

Arachnid

Arachnid is a Java-based web spider framework. It contains a simple HTML parser that analyzes input streams containing HTML content. By implementing a subclass of Arachnid, you can develop a simple web spider, adding just a few lines of code that run after each page on a web site is parsed. The Arachnid download package contains two example spider applications that demonstrate how to use the framework.
http://arachnid.sourceforge.net/
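
The subclass-and-callback pattern the framework is built around can be sketched as follows; the class and method names here (SpiderBase, handleParsedPage) are illustrative stand-ins, not Arachnid's actual API.

```java
import java.io.InputStream;
import java.net.URL;

// Stand-in for the framework side: fetch a page, then invoke the subclass
// hook that runs after each page is parsed. Names are illustrative only.
abstract class SpiderBase {
    protected void crawl(URL start) throws Exception {
        String html;
        try (InputStream in = start.openStream()) {
            html = new String(in.readAllBytes());
        }
        // (a real framework would parse the HTML and queue discovered links here)
        handleParsedPage(start, html);
    }

    // The "few lines of code" a subclass adds per parsed page.
    protected abstract void handleParsedPage(URL url, String html);
}

public class MySpider extends SpiderBase {
    protected void handleParsedPage(URL url, String html) {
        System.out.println("parsed " + url + " (" + html.length() + " chars)");
    }

    public static void main(String[] args) throws Exception {
        new MySpider().crawl(new URL("http://arachnid.sourceforge.net/"));
    }
}
```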

LARM

LARM provides a pure-Java search solution for users of the Jakarta Lucene search engine framework. It contains methods for indexing files, database tables, and web sites.
http://larm.sourceforge.net/

JoBo

JoBo is a simple tool for downloading entire web sites. It is essentially a web spider. Compared with other download tools, its main advantages are the ability to fill in forms automatically (for example, automated login) and to use cookies to handle sessions. JoBo also has flexible download rules (based on, for example, a page's URL, size, or MIME type) to restrict what is downloaded.
http://www.matuschek.net/software/jobo/index.html
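
Form filling, cookie sessions, and MIME-type download rules can all be sketched with the JDK's own HTTP and cookie classes. The endpoint, form fields, and rule below are made up for illustration; this is not JoBo's code.

```java
import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Sketch of JoBo-style behavior with JDK classes: log in by posting a form,
// let a CookieManager carry the session, and apply a MIME-type download
// rule. The site, endpoint, and field names are hypothetical.
public class FormLoginFetch {
    public static void main(String[] args) throws Exception {
        // Session cookies set by the server are stored and replayed automatically.
        CookieHandler.setDefault(new CookieManager());

        // "Automatic logon": submit the login form programmatically.
        HttpURLConnection login = (HttpURLConnection)
                new URL("http://site.example/login").openConnection();
        login.setRequestMethod("POST");
        login.setDoOutput(true);
        login.getOutputStream()
             .write("user=alice&pass=secret".getBytes(StandardCharsets.UTF_8));
        System.out.println("login status: " + login.getResponseCode());

        // Later requests reuse the session cookie captured above.
        HttpURLConnection page = (HttpURLConnection)
                new URL("http://site.example/members/index.html").openConnection();
        // A download rule on MIME type: only fetch HTML pages.
        String type = page.getContentType();
        if (type != null && type.startsWith("text/html")) {
            System.out.println("would download " + page.getURL());
        }
    }
}
```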

Snoics-Reptile

Snoics-Reptile is a pure-Java tool for mirroring web sites. Starting from a URL entry point given in a configuration file, it crawls to local disk every resource of the site that a browser could fetch, including web pages and files of various types: images, Flash, MP3, ZIP, RAR, and EXE files, and so on. The whole site is stored on the hard disk with its original structure kept intact; you only need to place the captured site on a web server (such as Apache) to have a complete site mirror.
http://www.blogjava.net/snoics
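
The mirroring step, saving each fetched resource under a local directory while keeping the site's original path structure, can be sketched like this. It is a generic illustration, not Snoics-Reptile's code.

```java
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;

// Save a fetched resource under a local mirror root while preserving the
// site's path structure, so a web server can serve the result as-is.
public class MirrorSaver {
    static void save(String url, Path mirrorRoot) throws Exception {
        URL u = new URL(url);
        String path = u.getPath().isEmpty() || u.getPath().endsWith("/")
                ? u.getPath() + "index.html"   // directory URLs become index.html
                : u.getPath();
        Path target = mirrorRoot.resolve(u.getHost() + path);
        Files.createDirectories(target.getParent());
        try (InputStream in = u.openStream()) {
            Files.copy(in, target);
        }
        System.out.println(url + " -> " + target);
    }

    public static void main(String[] args) throws Exception {
        save("http://www.blogjava.net/snoics", Path.of("mirror"));
    }
}
```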

Web-Harvest

Web-Harvest is an open-source Java tool for web data extraction. It can collect specified web pages and extract useful data from them. Web-Harvest mainly uses technologies such as XSLT, XQuery, and regular expressions to operate on text and XML.
http://web-harvest.sourceforge.net
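
Of the techniques listed, regular-expression extraction is the easiest to show in a few lines. The sketch below uses plain java.util.regex rather than Web-Harvest's XML-based configuration language, so it only illustrates the underlying idea.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// One of the text-processing techniques Web-Harvest builds on: pulling
// data out of fetched markup with a regular expression.
public class RegexExtract {
    public static void main(String[] args) {
        String html = "<a href=\"http://web-harvest.sourceforge.net\">Web-Harvest</a>";
        Matcher m = Pattern.compile("href=\"([^\"]+)\"").matcher(html);
        while (m.find()) {
            System.out.println("extracted link: " + m.group(1));
        }
    }
}
```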

Spiderpy

Spiderpy is an open-source, Python-based web crawler that lets users download files and search web sites, and it offers a configurable interface.
http://pyspider.sourceforge.net/

The Spider Web Network Xoops Mod Team

Spider Web Network Xoops Mod is a module for XOOPS, implemented entirely in PHP.
http://www.tswn.com/

Fetchgals

Fetchgals is a multi-threaded, Perl-based web crawler that uses tags to search for adult images.
https://sourceforge.net/projects/fetchgals

Larbin

Larbin is a C++-based web crawler with an easy-to-use interface, but it runs only on Linux. On an ordinary PC, Larbin can crawl five million pages a day (provided, of course, that the network is good enough).
http://larbin.sourceforge.net/index-eng.html


Posted by Fengyan crazy at 4/01/2008 08:47:00 AM