Crawler Tools Summary

Heritrix
Heritrix is an open-source, extensible web crawler project. It is designed to strictly respect robots.txt exclusion directives and META robots tags.
http://crawler.archive.org/

Websphinx
Websphinx is a Java class library and interactive development environment for web crawlers. A web crawler (also called a robot or spider) is a program that automatically browses and processes Web pages. Websphinx consists of two parts: the crawler workbench and the Websphinx class library.
http://www.cs.cmu.edu/~rcm/websphinx/

Weblech
Weblech is a powerful tool for downloading and mirroring Web sites. It can download a site according to functional requirements and mimics the behavior of a standard Web browser as closely as possible. Weblech has a functional console and is multi-threaded.
http://weblech.sourceforge.net/

Arale
Arale is designed primarily for personal use and, unlike other crawlers, does not focus on page indexing. Arale can download an entire Web site or individual resources from a site, and it can also map dynamic pages to static pages.
http://web.tiscali.it/_flat/arale.jsp.html

J-spider
J-Spider is a fully configurable and customizable Web spider engine. You can use it to check a site for errors (internal server errors and so on), check for external links, analyze the structure of a site (and create a site map), or download an entire Web site, and you can also write a J-Spider plug-in to add whatever functionality you need.
http://j-spider.sourceforge.net/

Spindle
Spindle is a Web index/search tool built on the Lucene toolkit. It includes an HTTP spider for creating an index and search classes for querying those indexes. The Spindle project provides a set of JSP tag libraries that let JSP-based sites add search functionality without having to develop any Java classes.
http://www.bitmechanic.com/projects/spindle/

Arachnid
Arachnid is a Java-based web spider framework. It includes a simple HTML parser capable of parsing an input stream containing HTML content. By subclassing Arachnid you can develop a simple web spider, adding just a few lines of code that are called after each page on a site is parsed. The Arachnid download package contains two example spider applications that demonstrate how to use the framework.
http://arachnid.sourceforge.net/

Larm
Larm provides a pure Java search solution for users of the Jakarta Lucene search engine framework. It contains methods for indexing files and database tables, and crawlers for indexing Web sites.
http://larm.sourceforge.net/

Jobo
Jobo is a simple tool for downloading an entire Web site; it is essentially a web spider. Compared with other download tools, its main advantage is the ability to automatically fill in forms (for example, automatic login) and use cookies to handle sessions. Jobo also has flexible download rules (by URL, size, MIME type and so on) to restrict what is downloaded.
http://www.matuschek.net/software/jobo/index.html

Snoics-reptile
Snoics-reptile is a Web site mirroring tool written in pure Java. Starting from a URL entry point given in a configuration file, it fetches everything on a site that a browser could retrieve, including Web pages and files of all types, such as pictures, Flash, MP3, ZIP, RAR and EXE files. The entire site can be downloaded to the local hard drive with its original structure preserved exactly; simply place the crawled site on a Web server (for example Apache) to obtain a complete mirror of the site.
http://www.blogjava.net/snoics

Web-harvest
Web-Harvest is an open-source Java tool for Web data extraction. It collects specified Web pages and extracts useful data from them, relying mainly on techniques such as XSLT, XQuery and regular expressions to process text and XML.
http://web-harvest.sourceforge.net
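Web-Harvest itself is driven by XML configuration files rather than Java code, but the extraction idea it relies on can be illustrated with the standard javax.xml.xpath API. The sketch below is not Web-Harvest's own API; it simply shows how an XPath expression pulls link targets out of a fetched, well-formed page:

    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.xpath.XPath;
    import javax.xml.xpath.XPathConstants;
    import javax.xml.xpath.XPathFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.NodeList;
    import java.io.ByteArrayInputStream;

    public class XPathExtractDemo {
        public static void main(String[] args) throws Exception {
            // A tiny well-formed XHTML fragment standing in for a fetched page.
            String page = "<html><body><a href='a.html'>First</a><a href='b.html'>Second</a></body></html>";
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(page.getBytes("UTF-8")));
            XPath xpath = XPathFactory.newInstance().newXPath();
            // Extract every link target, the kind of rule a Web-Harvest config expresses in XPath.
            NodeList links = (NodeList) xpath.evaluate("//a/@href", doc, XPathConstants.NODESET);
            for (int i = 0; i < links.getLength(); i++) {
                System.out.println(links.item(i).getNodeValue());
            }
        }
    }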

Spiderpy
Spiderpy is an open-source web crawler written in Python that lets users collect files and search Web sites through a configurable interface.
http://pyspider.sourceforge.net/

The Spider Web Network Xoops Mod Team
The Spider Web Network Xoops mod is a Xoops module implemented entirely in PHP.
http://www.tswn.com/

Fetchgals
Fetchgals is a multi-threaded Perl web crawler that uses tags to search for pornographic images.
https://sourceforge.net/projects/fetchgals

Larbin
Larbin is a C++ web crawler with an easy-to-use interface, but it runs only under Linux; on a single PC it can crawl up to 5 million pages per day (given, of course, a good network connection).
http://larbin.sourceforge.net/index-eng.html

PHP open source web crawlers


1. Phpdig is a vertical search engine product that is very popular abroad (it is more a tool for building one than a traditional search engine). It is written in PHP, runs efficiently, and greatly improves search response speed. It can search the Internet the way search engines such as Google or Baidu do, and besides ordinary Web pages it can also search content in TXT, DOC, XLS, PDF and other files, with powerful content-search and file-parsing capabilities.

2. Sphider is a lightweight web spider and search engine written in PHP, using MySQL as its back-end database. It is a great tool for adding search functionality to your Web site or building a custom search engine. Sphider is small, easy to set up and modify, and is used on thousands of websites across the world.

Sphider supports all standard search options, but also includes a plethora of advanced features such as word autocompletion, spelling suggestions and so on. The sophisticated administration interface makes administering the system easy. The full list of Sphider features can be seen in the About section; also be sure to check out the demo and take a look at the showcase, which displays some sites running Sphider. If you run into problems, you can probably get an answer to your question in the forum.

3. ISearch

The ISearch PHP search engine allows you to build a searchable database for your Web site. Visitors can search for keywords, and a list of all matching pages is returned to them.
Introduction

ISearch is a tool that allows visitors to a website to search the contents of the site. Unlike other such tools, the spidering engine is written in PHP, so it does not require binaries to be run on the server to generate the search index for HTML pages.

Java Open source web crawler list
http://www.ideagrace.com/sf/web-crawler/

http://www.cs.cmu.edu/~rcm/websphinx/

C# open source samples
http://www.codeproject.com/useritems/ZetaWebSpider.asp

http://www.codeproject.com/aspnet/Spideroo.asp

http://www.codeproject.com/cs/internet/Crawler.asp

Open source search engines provide an excellent way, and excellent material, for people to learn, research and master search technology; they promote the popularization and development of search technology and let more and more people understand and use it. Using an open-source search engine can dramatically shorten the cycle of building a search application, allow personalized search applications to be built according to application needs, and even allow search engine systems that meet specific needs to be built. The open sourcing of search engines is a boon for both technical staff and ordinary users.

A search engine's workflow consists of three main steps: crawl pages from the Internet → build an index of the crawled pages → search the index.

First, a web crawler program is needed that automatically traverses the Internet by following the links between URLs, crawling and collecting Web pages. Once pages have been collected, an index-analysis program analyzes the page content and performs a large amount of computation based on a relevance algorithm (such as a hyperlink algorithm) to build an inverted index. Once the index has been built, users submit keywords through the search interface and results are returned according to a particular ranking algorithm. A search engine therefore does not search the Internet directly; it searches the crawled index. This is also why results come back so quickly: the index plays the most important role, and the efficiency of the indexing algorithm directly affects the efficiency of the search engine and is a key factor in evaluating it.
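As a rough illustration of the inverted index idea described above (a toy sketch, not code from any project listed here), the small Java class below maps each term to the set of document IDs that contain it, so a query only has to look up a term rather than scan every page:

    import java.util.*;

    // A toy inverted index: maps each term to the set of document ids containing it.
    public class TinyInvertedIndex {
        private final Map<String, Set<Integer>> postings = new HashMap<>();
        private final List<String> docs = new ArrayList<>();

        public void addDocument(String text) {
            int docId = docs.size();
            docs.add(text);
            for (String term : text.toLowerCase().split("\\W+")) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }

        public Set<Integer> search(String term) {
            return postings.getOrDefault(term.toLowerCase(), Collections.emptySet());
        }

        public static void main(String[] args) {
            TinyInvertedIndex index = new TinyInvertedIndex();
            index.addDocument("Heritrix is an extensible web crawler");
            index.addDocument("Lucene is a full-text search toolkit");
            System.out.println(index.search("crawler"));  // prints [0]
        }
    }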

The web crawler, the indexer and the query module together constitute the main components of a search engine. For particular languages, such as Chinese or Korean, a word segmenter is also needed; in general, the word segmenter works with the indexer to build the index for that language. How these parts work together is shown in Figure 1.

Open source search engines give users great transparency: open source code, open ranking algorithms, and arbitrary customization. Compared with commercial search engines, this is exactly what many users need. There are now quite a few open source search engine projects, mainly in the areas of search engine development toolkits and architectures, Web search engines, and file search engines. This article outlines several of the more popular and relatively mature projects.

Open Source Search Engine Toolkit

1. Lucene

Lucene is currently the most popular open source full-text search engine toolkit. It is part of the Apache Foundation and was started by the veteran full-text indexing/search expert Doug Cutting, who named the project after his wife's middle name. Lucene is not a full-featured search application but a toolkit focused on text indexing and searching, able to add indexing and search capabilities to an application. Because of Lucene's excellent indexing and search performance, and although the Java-based Lucene is inherently cross-platform, it has also been ported to many other languages: Perl, Python, C++, .NET and so on.

Like other open source projects, Lucene has a very good architecture that makes it easy to do research and development on top of it, adding new features or building new systems. Lucene itself supports indexing only text files and a small number of languages, and it has no crawler; this is precisely Lucene's charm. Through the rich interfaces Lucene provides, we can add word segmenters for specific languages or text parsers for specific document types according to our own needs. These specific functions can be implemented with existing related open source projects, or even commercial software, which keeps Lucene focused on indexing and searching. A number of newer open source projects, such as Lius and Nutch, have been formed by adding crawlers and text parsers on top of Lucene, and Lucene's index data structure has become a de facto standard used by many search engines.
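As a minimal sketch of adding indexing and search to an application with Lucene, the example below uses the classic Lucene 2.x-era API (later versions renamed several of these classes and methods; for instance Hits and Field.Index.TOKENIZED are gone in Lucene 3 and later):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;

    public class LuceneDemo {
        public static void main(String[] args) throws Exception {
            // Build an index in the local directory "index".
            IndexWriter writer = new IndexWriter("index", new StandardAnalyzer(), true);
            Document doc = new Document();
            doc.add(new Field("title", "Open source crawlers", Field.Store.YES, Field.Index.TOKENIZED));
            doc.add(new Field("body", "Heritrix, Nutch and Websphinx are Java crawlers.",
                    Field.Store.YES, Field.Index.TOKENIZED));
            writer.addDocument(doc);
            writer.optimize();
            writer.close();

            // Search the index for documents whose body mentions "crawlers".
            IndexSearcher searcher = new IndexSearcher("index");
            QueryParser parser = new QueryParser("body", new StandardAnalyzer());
            Hits hits = searcher.search(parser.parse("crawlers"));
            for (int i = 0; i < hits.length(); i++) {
                System.out.println(hits.doc(i).get("title"));
            }
            searcher.close();
        }
    }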

2. Lius

Lius is short for Lucene Index Update and Search. It is a text indexing framework developed on top of Lucene and, like Lucene, can also be regarded as a search engine development toolkit. It builds on Lucene and adds some new functionality. With the help of many other open source projects, Lius can directly parse and index documents of various formats and types, including MS Word, MS Excel, MS PowerPoint, RTF, PDF, XML, HTML, TXT, OpenOffice documents and JavaBeans; support for JavaBeans is particularly useful for database indexing and becomes more precise when database connections are programmed with object-relational mapping technologies such as Hibernate, JDO, TopLink and Torque. Lius also adds index update functionality on top of Lucene, further improving index maintenance, and supports mixed indexes, letting you combine all content that meets certain conditions in the same directory, which is useful when documents in many different formats must be indexed at the same time.

3. Egothor

Egothor is an open-source, high-performance full-text search engine suitable for search applications based on full-text search, with core algorithms similar to Lucene's. It has been around for many years and has an active group of developers and users. The project's initiator, Leo Galambos, is a senior assistant professor at the Faculty of Mathematics and Physics of Charles University in Prague, Czech Republic, who started the project during his doctoral studies.

More often, we treat Egothor as a Java library for full-text search that can add full-text search capability to a specific application. It provides an extended Boolean module, allowing it to be used either as a Boolean module or as a vector module, and Egothor has features that some other search engines lack: it uses new dynamic algorithms to effectively improve index update speed, and it supports parallel queries, which can effectively improve query efficiency. The Egothor release also includes many ease-of-use enhancements such as a crawler, text parsers, and more efficient compression methods such as Golomb and Elias-Gamma; it supports parsing text in a variety of common document formats, such as HTML, PDF, PS, Microsoft Office documents and XLS, and provides a GUI indexing interface and applet-based or Web-based query methods. In addition, Egothor can easily be configured as a standalone search engine, a metadata searcher, a point-to-point hub, and many other kinds of application system.

4. Xapian

Xapian is a search engine development library released under the GPL. It is written in C++ and provides bindings that make it easy to use from Perl, Python, PHP, Java, Tcl, C#, Ruby and other languages.

Xapian is also a highly adaptable toolkit that lets developers easily add advanced indexing and search capabilities to their applications. It supports the probabilistic model of information retrieval as well as a rich set of Boolean query operations. A Xapian release is typically made up of two parts: xapian-core, the core library itself, and xapian-bindings, the bindings for other languages.

Xapian provides a rich API and documentation for developers, along with a number of programming examples and a Xapian-based application, Omega. Omega consists of an indexer and a CGI-based search front end; it can index documents in a variety of formats such as HTML, PHP, PDF, PostScript, OpenOffice/StarOffice and RTF, and it can even index relational databases such as MySQL, PostgreSQL, SQLite, Sybase, MS SQL, LDAP and ODBC sources. Results can be exported from the front end in CSV or XML format, and developers can extend it from there.

5. Compass

Compass is an open source search engine architecture implemented on top of Lucene. It provides a cleaner search engine API than Lucene and adds support for index transactions, making it easier to integrate with transaction processing applications such as databases. Updates are easier and more efficient because the original document does not have to be deleted first, and the mapping mechanism between resources and the search engine makes it easy to migrate applications that already use Lucene, or that work with objects and XML, to Compass.

Compass also integrates with Hibernate, Spring and other frameworks, so Compass is an excellent choice if you want to add search engine functionality to a Hibernate or Spring project.
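A rough sketch of what Compass usage looks like, loosely following its documented annotation style; the class and method names below are recalled from Compass's documentation and may differ between releases, so treat this as an assumption-laden illustration rather than a definitive example:

    import org.compass.annotations.Searchable;
    import org.compass.annotations.SearchableId;
    import org.compass.annotations.SearchableProperty;
    import org.compass.core.Compass;
    import org.compass.core.CompassHits;
    import org.compass.core.CompassSession;
    import org.compass.core.CompassTransaction;
    import org.compass.core.config.CompassConfiguration;

    @Searchable
    public class Article {
        @SearchableId
        private Long id;
        @SearchableProperty
        private String title;

        public Article() { }
        public Article(Long id, String title) { this.id = id; this.title = title; }

        public static void main(String[] args) {
            // Build a Compass instance over a local Lucene index directory (names are illustrative).
            Compass compass = new CompassConfiguration()
                    .setConnection("target/index")   // where the underlying Lucene index is stored
                    .addClass(Article.class)         // register the annotated, searchable class
                    .buildCompass();

            CompassSession session = compass.openSession();
            CompassTransaction tx = session.beginTransaction();
            session.save(new Article(1L, "Open source crawlers and search toolkits"));
            tx.commit();

            CompassHits hits = session.find("crawlers"); // Lucene-style query string
            System.out.println(hits.length() + " hit(s)");
            session.close();
            compass.close();
        }
    }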

Open source web search engine system

1. Nutch

Nutch is another open source project initiated by Lucene's author Doug Cutting. It is a complete web search engine built on Lucene and, although it has not been around for long, it has been widely welcomed for its excellent pedigree and its simple, convenient use. Nutch can be used to build a complete search engine system similar to Google for LAN or Internet search.
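For reference, older Nutch releases shipped a one-step crawl command that drove the whole fetch/index cycle from a directory of seed URLs (the exact options vary by version, and the all-in-one crawl command was removed in later releases):

    bin/nutch crawl urls -dir crawl -depth 3 -topN 50

Here urls/ is a directory containing text files of seed URLs, -depth limits how many link levels are followed, and -topN caps the number of pages fetched per level.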

2. YaCy

YaCy is a distributed, open source web search engine based on a peer-to-peer (P2P) network and written in Java. Its core is a program called a YaCy-peer, distributed across hundreds of computers; these peers form the YaCy network over a P2P overlay. The whole network has a decentralized architecture in which all YaCy-peers are equal and there is no central server. Each YaCy-peer can independently crawl the Internet, analyze pages and build an index, sharing it with other YaCy-peers over the P2P network. Each YaCy-peer is also a separate proxy server that can index the Web pages visited by the local user, taking multiple measures to protect the user's privacy, and the user queries and receives results through a Web server running on the local machine.

The YaCy search engine consists of five main parts. In addition to the crawler, indexer and inverted index store found in common search engines, it also includes a very rich search and administration interface and a peer-to-peer network for data sharing.

Open source Desktop Search engine system

1. Regain

Regain is a desktop search engine similar to a Web search engine, except that Regain does not search Internet content but rather your own documents and files; with Regain you can easily search large amounts of data (many gigabytes) in seconds. Regain adopts Lucene's search syntax, so it supports multiple query styles, multi-index search and advanced search by file type; it can also rewrite URLs and bridge files to HTTP, and it provides good support for Chinese.

Regain is available in two versions: desktop search and server search. The desktop version provides quick searching of documents and Web pages in a LAN environment for an ordinary desktop computer. The server version is installed mainly on a Web server and provides search over a Web site or the file servers in a LAN environment.

Regain is written in Java, so it installs cross-platform on Windows, Linux, Mac OS and Solaris. The server version requires a JSP environment and tag library, so a Tomcat container must be installed, while the desktop version comes with a small built-in Web server and is very simple to install.

2. Zilverline

Zilverline is a Lucene-based desktop search engine built with the Spring framework. It is mainly used to search content on personal local disks and LANs, supports multiple languages, and even has its own Chinese name (which translates roughly as "money search engine"). Zilverline offers rich indexing support for document formats such as Microsoft Office documents, RTF, Java and CHM, and can even index archive files such as ZIP and RAR; during indexing, Zilverline extracts files from ZIP, RAR, CHM and other archives and indexes them. Zilverline supports incremental indexing, indexing only new files, as well as scheduled automatic indexing. Its index library can be stored anywhere Zilverline can access, even on a DVD. Zilverline also supports mapping file paths to URLs, allowing users to search local files remotely.

Zilverline is offered for personal, research and commercial use in the form of a simple WAR package that can be downloaded from its official website (http://www.zilverline.org/). Zilverline requires a Java environment and a servlet container, typically Tomcat. After making sure the JDK and the Tomcat container are properly installed, simply copy the Zilverline WAR package (zilverline-1.5.0.war) into Tomcat's webapps directory and restart the Tomcat container to start using the Zilverline search engine.
