Open-source search engines give people excellent material for learning, studying, and mastering search technology, and they promote its popularization and development, so more and more people are beginning to understand and use it. Using an open-source search engine can greatly shorten the cycle for building a search application, make it possible to create personalized search applications tailored to specific requirements, and even support building a complete search engine system that meets special needs. Open-source search engines are good news for both technical staff and ordinary users.
The workflow of a search engine consists of three steps: crawling web pages from the Internet, building an index database from the crawled pages, and searching the index database.
First, a web crawler automatically traverses the Internet by following the links between URLs, fetching and collecting web pages. After the pages are collected, an index analysis program analyzes the page information and, using a relevance algorithm (such as a hyperlink-based algorithm), performs a large amount of computation to build an inverted index database. Once the index database is built, users can submit keywords through the provided search interface, and results are returned according to a particular ranking algorithm. A search engine therefore does not search the Internet directly; it searches the index database built from the crawled pages. This is also why search results can be returned so quickly, and it is the index that plays the most important role. The efficiency of the indexing algorithm directly affects the efficiency of the search engine and is a key factor in evaluating it.
The web crawler, the indexer, and the query handler are the main components of a search engine. For particular languages, such as Chinese and Korean, a tokenizer (word segmenter) is also required; it generally works together with the indexer to build the index database for that language. See Figure 1.
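To make the indexing step concrete, the sketch below builds a toy inverted index in Java: it tokenizes a few documents on whitespace (a stand-in for a real tokenizer) and maps each term to the IDs of the documents containing it, which is the structure the query handler consults. The class and method names are purely illustrative and are not taken from any of the projects discussed in this article.

```java
import java.util.*;

// Toy inverted index: term -> set of document IDs that contain the term.
// Illustrative only; real engines add ranking, positions, compression, etc.
public class TinyInvertedIndex {
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // "Tokenizer": lower-case the text and split on whitespace.
    public void addDocument(int docId, String text) {
        for (String term : text.toLowerCase().split("\\s+")) {
            postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
        }
    }

    // Query handling: look the term up in the index instead of scanning documents.
    public Set<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(), Collections.emptySet());
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.addDocument(1, "open source search engines");
        index.addDocument(2, "search engines index web pages");
        System.out.println(index.search("search"));   // prints [1, 2]
    }
}
```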
Open-source search engines offer users great transparency: open source code, open ranking algorithms, and customizability, advantages that commercial search engines cannot match. There are currently a number of open-source search engine projects, falling mainly into search engine development kits and frameworks, Web search engine systems, and file (desktop) search engines. This article briefly introduces several popular and mature projects.
Open-Source Search Engine Toolkits
1. Lucene
Lucene is currently the most popular open-source full-text search engine toolkit. Hosted by the Apache Foundation, it was started by Doug Cutting, a veteran full-text indexing and retrieval expert, who named the project after his wife. Lucene is not a complete search application but a toolkit focused on text indexing and search, designed to add indexing and search capabilities to applications. Because of Lucene's outstanding indexing and search performance, and even though the Java implementation is naturally cross-platform, it has been ported to or bound from many other languages, including Perl, Python, C++, and .NET.
Like other good open-source projects, Lucene has a well-designed architecture that makes it easy to do research and development on top of it, add new functions, or build new systems. Lucene only indexes text, supports a limited number of languages out of the box, and does not include a crawler; yet this focus is exactly what makes it attractive. Through the rich interfaces Lucene provides, we can plug in a tokenizer for a specific language or a text parser for a specific document format as needed, and these components can come from existing open-source projects or even commercial software. This keeps Lucene concentrated on indexing and searching. Some newer open-source projects, such as LIUS and Nutch, have grown out of Lucene by adding crawlers and text parsers on top of it. In addition, Lucene's index data structure has become a de facto standard used by many search engines.
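As a rough illustration of what "adding indexing and search to an application" looks like, the snippet below indexes one document and then queries it with Lucene. It is a minimal sketch assuming a modern Lucene release (the 8.x/9.x style API with IndexWriterConfig and ByteBuffersDirectory); older versions differ in detail.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class LuceneSketch {
    public static void main(String[] args) throws Exception {
        Directory dir = new ByteBuffersDirectory();     // in-memory index, enough for a sketch
        StandardAnalyzer analyzer = new StandardAnalyzer();

        // Indexing: add one document with a single analyzed text field.
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
            Document doc = new Document();
            doc.add(new TextField("content",
                    "Lucene is a toolkit focused on text indexing and search",
                    Field.Store.YES));
            writer.addDocument(doc);
        }

        // Searching: parse a keyword query and print the stored field of each hit.
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            ScoreDoc[] hits = searcher
                    .search(new QueryParser("content", analyzer).parse("toolkit"), 10)
                    .scoreDocs;
            for (ScoreDoc hit : hits) {
                System.out.println(searcher.doc(hit.doc).get("content"));
            }
        }
    }
}
```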
2. LIUS
LIUS, short for Lucene Index Update and Search, is a text indexing framework built on Lucene; like Lucene itself, it can be regarded as a search engine development kit. It extends Lucene with some further work and a number of new functions. With the help of many other open-source projects, LIUS can directly parse and index documents of different formats and types, including MS Word, MS Excel, MS PowerPoint, RTF, PDF, XML, HTML, TXT, OpenOffice, and JavaBeans. The JavaBeans support is very useful for database indexing and gives more accurate results when applications use object-relational mapping frameworks (such as Hibernate, JDO, TopLink, and Torque) for database access. LIUS also adds index update functionality on top of Lucene, further improving index maintenance, and supports mixed indexing, gathering all content related to one condition in the same directory, which is useful when documents of several formats must be indexed together.
3. Egothor
Egothor is an open-source, high-performance full-text search engine suitable for applications built on full-text search, and its core algorithms are similar to Lucene's. The project has existed for many years and has an active group of developers and users. The project founder, Leo Galambos, is a senior assistant professor at the Faculty of Mathematics and Physics of Charles University in Prague, Czech Republic; he started the project during his PhD studies.
More often, Egothor is used as a Java library for full-text search, adding full-text search functions to a specific application. It provides an extended Boolean module that can work either as a Boolean model or as a vector model, and Egothor has some special features that other engines lack: it uses new dynamic algorithms to speed up index updates, and it supports parallel queries, which effectively improves query efficiency. The released versions of Egothor add many convenient applications, such as a crawler and text parsers, integrate several efficient compression codes such as Golomb and Elias-Gamma, and support parsing of common document formats, including HTML, PDF, PS, Microsoft Office documents, and XLS. It provides a GUI for indexing and applet- or Web-based querying. In addition, Egothor can easily be configured as a standalone search engine, a metadata searcher, a peer-to-peer hub, and other kinds of systems.
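As background on the compression codes mentioned above, the sketch below shows Elias-Gamma coding applied to the gaps of a postings list, which is the typical way such codes shrink an inverted index. It is an illustrative example written for this article, not code taken from Egothor.

```java
import java.util.List;

// Elias-Gamma coding of document-ID gaps in a postings list (illustrative sketch).
public class EliasGamma {

    // Encode one positive integer n: floor(log2 n) zero bits, then n in binary.
    static String encode(int n) {
        String binary = Integer.toBinaryString(n);
        return "0".repeat(binary.length() - 1) + binary;
    }

    // Encode a sorted postings list as gaps; gaps are small and compress well.
    static String encodePostings(List<Integer> docIds) {
        StringBuilder bits = new StringBuilder();
        int previous = 0;
        for (int id : docIds) {
            bits.append(encode(id - previous));   // gap between consecutive doc IDs
            previous = id;
        }
        return bits.toString();
    }

    public static void main(String[] args) {
        // Postings 3, 7, 8, 20 become gaps 3, 4, 1, 12 before coding.
        System.out.println(encodePostings(List.of(3, 7, 8, 20)));
    }
}
```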
4. Xapian
Xapian is a search engine development library written in C++ and released under the GPL; with the bindings it provides, it can easily be used from Perl, Python, PHP, Java, Tcl, C#, Ruby, and other languages.
Xapian is also a highly adaptable toolset that lets developers easily add advanced indexing and search functions to their applications. It supports the probabilistic model of information retrieval as well as a rich set of Boolean query operations. A Xapian release usually consists of two parts: xapian-core, the core library, and xapian-bindings, the bindings for other languages.
Xapian provides rich APIs and documentation for developers, along with many programming examples and a Xapian-based application, Omega. Omega consists of an indexer and a CGI-based search front end, and it can index documents in HTML, PHP, PDF, PostScript, OpenOffice/StarOffice, RTF, and other formats. Using the Perl DBI module, it can even index databases such as MySQL, PostgreSQL, SQLite, Sybase, and MS SQL, as well as LDAP and ODBC data sources, and the front end can export search results in CSV or XML format, which developers can extend further.
5. Compass
Compass is an open-source search engine framework implemented on top of Lucene. Compared with Lucene, Compass provides a more concise search engine API. It adds support for index transactions, so it integrates more easily with transactional systems such as databases. When a document is updated, the original document does not need to be deleted first, which is simpler and more efficient. A mapping mechanism between resources and the search engine makes it easy for applications that already use Lucene, or that do not handle objects or XML, to migrate to Compass for development.
Compass can also be integrated with frameworks such as Hibernate and Spring. If you want to add search engine functionality to a Hibernate or Spring project, Compass is an excellent choice.
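To give a feel for the mapping-based API, the fragment below sketches how a plain Java object might be made searchable with Compass 2.x-style annotations and then saved and queried in a session. This is a rough sketch under the assumption of the Compass 2.x API; the configuration values (such as the index path) are invented for illustration and may not match a particular Compass release exactly.

```java
import org.compass.annotations.Searchable;
import org.compass.annotations.SearchableId;
import org.compass.annotations.SearchableProperty;
import org.compass.core.Compass;
import org.compass.core.CompassHits;
import org.compass.core.CompassSession;
import org.compass.core.CompassTransaction;
import org.compass.core.config.CompassConfiguration;

public class CompassSketch {

    // A plain object mapped to the index via annotations (object/search engine mapping).
    @Searchable
    public static class Article {
        @SearchableId
        private Long id;
        @SearchableProperty
        private String title;

        public Article() {}
        public Article(Long id, String title) { this.id = id; this.title = title; }
    }

    public static void main(String[] args) {
        // Build a Compass instance over a local index directory (path is an assumption).
        Compass compass = new CompassConfiguration()
                .setConnection("target/index")
                .addClass(Article.class)
                .buildCompass();

        CompassSession session = compass.openSession();
        CompassTransaction tx = session.beginTransaction();

        // Saving an object indexes it; re-saving updates it without a manual delete.
        session.save(new Article(1L, "Adding search to a Spring project with Compass"));

        // Queries use Lucene query syntax under the hood.
        CompassHits hits = session.find("spring");
        System.out.println(hits.length() + " hit(s)");

        tx.commit();
        session.close();
        compass.close();
    }
}
```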
Open-Source Web Search Engine Systems
1. Nutch
Nutch is another open-source project started by Doug Cutting, the author of Lucene. It is a complete Web search engine system built on top of Lucene. Although still relatively young, it has been widely welcomed thanks to its fine pedigree and its simple, convenient use. With Nutch we can build a complete, Google-like search engine system for intranet or Internet search.
2. YaCy
YaCy is a distributed, open-source Web search engine system based on peer-to-peer (P2P) technology and written in Java. The core YaCy peer program runs on hundreds of computers, which together form a YaCy network on top of a P2P overlay. The whole network is a distributed architecture in which all YaCy peers are equal; there is no central server. Each YaCy peer can independently crawl the Internet, analyze pages, build an index database, and share it with other peers over the P2P network. Each peer is also an independent proxy server that indexes the web pages visited by local users, with several mechanisms in place to protect user privacy. Users query and receive results through a Web server running on their own machine.
The YaCy search engine consists of five parts: in addition to the crawler, indexer, and inverted index library found in ordinary search engines, it also includes a rich search and administration interface and a P2P network for data sharing.
Open-Source Desktop Search Engine Systems
1. Regain
Regain is a desktop search engine system that works much like a Web search engine. The difference is that Regain does not search Internet content but your own documents and files; with Regain you can search large amounts of data (several gigabytes) within seconds. Regain adopts Lucene's search syntax, so it supports various query forms, multi-index search, and file-based advanced search. It also supports URL rewriting and file-to-HTTP bridging, and it offers good support for Chinese.
Regain comes in two editions: desktop search and server search. The desktop edition provides fast search of documents on an ordinary desktop computer and of web pages in a LAN environment. The server edition is mainly installed on Web servers and searches file servers on websites and in LAN environments.
Regain is written in Java, so it can be installed on Windows, Linux, Mac OS, and Solaris. The server edition requires a JSP environment and tag libraries, so a Tomcat container must be installed; the desktop edition ships with a small embedded Web server and is easy to install.
2. Zilverline
Zilverline is a Lucene-based desktop search engine built on the Spring framework and intended mainly for searching personal local disks and LAN content. It supports multiple languages and even has a Chinese name of its own. Zilverline can index a wide range of document formats, such as Microsoft Office documents, RTF, Java source, and CHM, and it can even index archive files such as ZIP and RAR for search: during indexing, Zilverline extracts the files inside ZIP, RAR, and CHM archives and indexes them. Zilverline supports incremental indexing, indexing only new files, as well as scheduled automatic indexing. Its index library can be stored anywhere Zilverline can access, even on a DVD. Zilverline also supports mapping file paths to URLs, which allows users to search local files remotely.
Zilverline offers two licensing options, one for personal and research use and one for commercial use. It is released as a simple WAR package that can be downloaded from its official website (http://www.zilverline.org/). Zilverline requires a Java runtime and a servlet container, typically Tomcat. Once the JDK and Tomcat are installed correctly, simply copy the Zilverline WAR package (zilverline-1.5.0.war) into Tomcat's webapps directory and restart Tomcat to start using the Zilverline search engine.