"Search Engine Basics 3" search engine related open source projects and websites

Source: Internet
Author: User

Part of the content is adapted from: http://blog.csdn.net/hguisu/article/details/8024799

First, open source projects

1. Lucene full-text retrieval system
http://lucene.apache.org and http://www.lucene.com.cn/
Lucene is a subproject of the Apache Software Foundation's Jakarta project group. It is an open source full-text search toolkit: not a complete full-text search engine, but a full-text search architecture that provides a complete query engine and index engine, plus part of a text analysis engine (for two Western languages, English and German). Lucene's goal is to give software developers a simple, easy-to-use toolkit for adding full-text retrieval to a target system, or for building a complete full-text search engine on top of it. Lucene's original author is Doug Cutting, a veteran full-text indexing and search expert. He was a principal developer of the V-Twin search engine, later worked at Excite as a senior system architect, and now researches underlying Internet architecture. Lucene was first published by the author himself; in contributing it, his goal was to add full-text search to a wide variety of small and medium-sized applications.
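To make the terms "index engine", "query engine" and "text analysis" more concrete, here is a minimal sketch of an inverted index in plain Python. It is not Lucene's API or implementation; the names `analyze`, `TinyIndex`, `add` and `search` are hypothetical and chosen only for illustration.

```python
import re
from collections import defaultdict


def analyze(text):
    """Text analysis step: lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyIndex:
    """Toy inverted index: maps each term to the set of document ids containing it."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.docs = {}

    def add(self, doc_id, text):
        """Index engine: record which terms occur in the document."""
        self.docs[doc_id] = text
        for term in analyze(text):
            self.postings[term].add(doc_id)

    def search(self, query):
        """Query engine: return ids of documents containing every query term."""
        terms = analyze(query)
        if not terms:
            return []
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return sorted(result)


index = TinyIndex()
index.add(1, "Lucene is a full-text search toolkit")
index.add(2, "Nutch is a web search application")
index.add(3, "Larbin is a web crawler")
print(index.search("search"))      # documents mentioning "search"
print(index.search("web search"))  # documents containing both terms
```

A real toolkit such as Lucene adds much more on top of this idea (ranking, language-specific analyzers, on-disk index files), but the split between analysis, indexing and querying is the same.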

2. Nutch System
http://www.nutch.org and http://www.nutchchina.com
Nutch is a complete open source search engine and a complete application. Internally it is built on Lucene.
With Nutch, a few simple settings are enough to build a search engine for your own intranet, and you can also build a search engine for the whole Internet.

3. Heritrix Project

Compass is an open source search engine architecture built on Lucene that provides a cleaner search engine API. It adds transaction support for indexing, which makes it easier to integrate with transactional applications such as databases. Updates are simpler and more efficient because the original document no longer has to be deleted first. Compass supports mapping between resources and the search engine, and it can also integrate with frameworks such as Hibernate and Spring.

4. Larbin system
Larbin is an open source web crawler (spider), developed by the young Frenchman Sébastien Ailleret and implemented in C++. Its purpose is to follow the URLs found on pages in order to extend the crawl, and ultimately to provide a broad data source for search engines. Larbin is only a crawler: it fetches web pages and nothing more. Parsing the pages is left to the user, and Larbin does not provide storage in a database or indexing either.
Larbin's initial design followed the principle of being simple but highly configurable. As a result, a single simple Larbin crawler can fetch 5 million pages per day, which is very efficient.
With Larbin, we can easily collect all the links of a single site, or even mirror a website. It can also be used to build URL lists, for example by retrieving the URLs of all pages in order to collect links to XML or MP3 files. A customized Larbin can serve as a source of information for a search engine.
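The idea of following the URLs on each page to extend the crawl can be sketched in a few lines of Python. This is not Larbin's code (Larbin is written in C++); it is a minimal breadth-first crawler under stated assumptions: the `fetch` callable, the `LinkExtractor` class and the in-memory `PAGES` site are all hypothetical stand-ins, and no real network requests are made.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, limit=100):
    """Breadth-first crawl: follow links from start_url, visiting each URL once."""
    seen = {start_url}
    frontier = deque([start_url])
    visited = []
    while frontier and len(visited) < limit:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:  # page could not be fetched; skip it
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return visited


# Hypothetical in-memory "site" standing in for real HTTP requests.
PAGES = {
    "http://example.com/": '<a href="/a.html">A</a> <a href="b.html">B</a>',
    "http://example.com/a.html": '<a href="/">home</a> <a href="/c.mp3">song</a>',
    "http://example.com/b.html": "<p>no links here</p>",
}

urls = crawl("http://example.com/", PAGES.get)
print(urls)
```

As with Larbin, the crawler only collects pages and links; parsing the content further, storing it and indexing it are left to whatever consumes the result.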

5. Yioop! PHP Search Engine
Yioop! is a search engine written in PHP. It can be used for general-purpose web search, or to provide search over a set of URLs and indexed search of various document types, including HTML, PDF, DOC, PPT, RTF, RSS, XML, SVG, PNG, JPG, BMP, GIF, and sitemaps.

Second, research websites

1. Google Blackboard: http://www.google.com.hk/ggblog/googlechinablog/
2. searchenginewatch.com

Third, the difference between Nutch and Lucene
I want to build a search engine. While browsing many communities recently, I found that Lucene and Nutch are both widely used, and I always had trouble telling the two apart, so I looked up some information. Here is an excerpt from an interview with Doug Cutting, the founder of Lucene and Nutch:
Lucene is actually a library that provides full-text search; it is not an application. It offers a set of APIs that you can use in all kinds of practical applications. It is now an Apache project and is already used by a number of systems.
Nutch is an implementation of web search built on the Lucene core, and it is a real application. In other words, you can download it and use it right away. It adds a web crawler and other web-related components on top of Lucene. The goal is to scale from simple in-site indexing and search up to global web search, like Google and Yahoo. Of course, competing with those giants takes some ingenuity. We have tested it on 100 million web pages, and it is designed to scale to more than 1 billion pages without problems. It also runs well on a single machine searching just a few servers.

In general, I think Lucene is used for search within a website on a local server, while Nutch extends this to retrieval across the whole network, that is, the Internet. Lucene plus a crawler and related components becomes Nutch; that understanding should be roughly correct.

This article is from a CSDN blog. If you reproduce it, please indicate the source: http://blog.csdn.net/rokii/archive/2008/03/01/2137450.aspx

To put it simply:
- Lucene is not a complete application, but a library for full-text retrieval.
- Nutch is a complete search engine application built on Lucene.
Lucene provides the text indexing and search API for Nutch. A common question is: should I use Lucene or Nutch? The simplest answer is: if you do not need to crawl data, use Lucene. A common scenario is that you already have a data source and need to provide a search page for it. In this case, the best approach is to fetch the data directly from the database and build the index with the Lucene API.
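The "fetch the data directly from the database and index it" scenario can be sketched as follows. This is a stdlib-only illustration, not the Lucene API: an in-memory SQLite table stands in for the data source, a plain dictionary stands in for the index, and the table name `articles` and the functions `build_index` and `search` are hypothetical.

```python
import re
import sqlite3
from collections import defaultdict

# Hypothetical data source: an in-memory table of articles.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO articles (id, body) VALUES (?, ?)",
    [
        (1, "Lucene provides indexing and search"),
        (2, "Nutch adds a web crawler on top of Lucene"),
        (3, "Databases store the original rows"),
    ],
)


def build_index(connection):
    """Read every row directly from the database and index its words."""
    index = defaultdict(set)
    for row_id, body in connection.execute("SELECT id, body FROM articles"):
        for term in re.findall(r"[a-z0-9]+", body.lower()):
            index[term].add(row_id)
    return index


def search(index, connection, term):
    """Look up a term, then fetch the matching rows for the results page."""
    ids = sorted(index.get(term.lower(), set()))
    rows = []
    for row_id in ids:
        cur = connection.execute(
            "SELECT id, body FROM articles WHERE id = ?", (row_id,)
        )
        rows.append(cur.fetchone())
    return rows


index = build_index(conn)
results = search(index, conn, "Lucene")
print(results)
```

No crawler is involved at any point, which is why a library like Lucene alone is enough for this case; with Lucene the dictionary would be replaced by a real index built through its API.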
