From: http://paranimage.com/20-open-source-search-engines/
This is a list of open-source search engine systems, including open-source web search engines and open-source desktop search engines.
Sphider
Sphider is a lightweight web spider and search engine written in PHP that stores its data in MySQL. You can use it to add search to your website. Sphider is small and easy to install and modify, and it has been used by thousands of websites.
Risearch PHP
Risearch PHP is an efficient and powerful search engine script, aimed especially at small and medium websites. It is very fast and can search pages in less than one second. Risearch is an index-based search engine: it first indexes your website and builds a database of the keywords on every page so that queries can be answered quickly. As a full-text search script, it puts every keyword into the document index except those excluded in the configuration file. Risearch uses the classic inverted index algorithm (the same approach large search engines use), which is the reason for its speed.
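The index-then-search approach described above can be sketched in a few lines. This is a minimal illustration of an inverted index in Python, not Risearch's actual code: the page contents, the `STOP_WORDS` exclusion list, and the function names are all assumptions made for the example.

```python
import re
from collections import defaultdict

# Hypothetical exclusion list, standing in for the words a
# configuration file would keep out of the index.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "is"}


def tokenize(text):
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map every non-excluded word to the set of pages containing it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            if word not in STOP_WORDS:
                index[word].add(url)
    return index


def search(index, query):
    """Return the pages that contain every non-excluded query word."""
    words = [w for w in tokenize(query) if w not in STOP_WORDS]
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Made-up site content for the demonstration.
pages = {
    "/home": "Welcome to the search engine demo",
    "/docs": "The engine indexes every page of the site",
    "/about": "About the demo site",
}
index = build_index(pages)
print(search(index, "demo site"))
```

The cost of a query depends on the number of query words and the size of their posting sets, not on the total amount of text on the site, which is why an index lookup beats scanning every page.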
PHPDig
PHPDig is a web crawler and search engine written in PHP. It builds a vocabulary by indexing both dynamic and static pages. When you run a query, it displays a results page listing the pages that contain the keywords, ordered by its ranking rules. PHPDig includes a template system and can index PDF, Word, Excel, and PowerPoint documents. It suits more specialized, heavily customized search engines and is a good choice for building a vertical search engine for a particular field.
Openwebspider
Openwebspider is an open-source, multi-threaded web spider (also called a robot or crawler) and search engine with many interesting features.
Egothor
Egothor is an open-source, efficient full-text search engine written in Java. Because Java is cross-platform, Egothor can be used in applications in almost any environment. It can be configured as a standalone search engine and used for full-text search.
Nutch
Nutch is an open-source search engine implemented in Java. It provides all the tools needed to run your own search engine, including full-text search and a web crawler.
Lucene
Apache Lucene is a full-text search engine library written in Java that makes it easy to add full-text search to Java software. Lucene's most important job is to index every word in a file; searching an index is far more efficient than traditional word-by-word comparison. Lucene provides an API for parsing, filtering, and analyzing files and for building and using indexes. Beyond being efficient and simple, its most important quality is that users can customize its behavior at any time.
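The idea of a customizable chain of analysis and filtering steps that runs before indexing can be illustrated as follows. This is a Python sketch of the concept only, not Lucene's Java API: the function names, the `STOP_WORDS` set, and the way filters are passed in are assumptions made for the example.

```python
import re

# Hypothetical stop-word list used by one of the filters below.
STOP_WORDS = frozenset({"a", "an", "the", "is", "of"})


def tokenize(text):
    """Break raw text into word tokens."""
    return re.findall(r"\w+", text)


def lowercase(tokens):
    """Filter: normalize every token to lowercase."""
    return [t.lower() for t in tokens]


def remove_stop_words(tokens):
    """Filter: drop very common words that add little to a search."""
    return [t for t in tokens if t not in STOP_WORDS]


def analyze(text, filters=(lowercase, remove_stop_words)):
    """Tokenize the text, then apply each filter in order."""
    tokens = tokenize(text)
    for apply_filter in filters:
        tokens = apply_filter(tokens)
    return tokens


print(analyze("The Index of a File"))
# A caller can swap the chain, for example keeping stop words:
print(analyze("The Index of a File", filters=(lowercase,)))
```

Because the filter chain is an argument rather than fixed behavior, the same text can be indexed differently for different applications, which is the kind of customization the description refers to.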
Oxygen
Oxygen is a web search engine written entirely in Java.
Bddbot
Bddbot is a simple search engine that is easy to understand and use. Its crawler fetches the URLs listed in a file (urls.txt) and stores the results in a database. It also includes a simple web server that accepts queries from a browser and returns the results. It can be easily integrated into your website.
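The crawl-and-store flow described above might look roughly like this. It is a sketch under stated assumptions, not Bddbot's code: the SQLite table layout, the function names, and the `FAKE_WEB` dictionary that stands in for real HTTP fetching are all hypothetical, chosen so the example runs without network access.

```python
import sqlite3


def read_urls(lines):
    """Return the non-blank lines of a urls.txt-style file, stripped."""
    return [line.strip() for line in lines if line.strip()]


def crawl(urls, fetch, db):
    """Fetch each URL and store its body in the pages table."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, body TEXT)"
    )
    for url in urls:
        db.execute(
            "INSERT OR REPLACE INTO pages (url, body) VALUES (?, ?)",
            (url, fetch(url)),
        )
    db.commit()


def search(db, keyword):
    """Return the URLs whose stored body contains the keyword."""
    rows = db.execute(
        "SELECT url FROM pages WHERE body LIKE ? ORDER BY url",
        ("%" + keyword + "%",),
    )
    return [row[0] for row in rows]


# Stand-in for the network: a real crawler would download each page.
FAKE_WEB = {
    "http://example.com/a": "open source search engine",
    "http://example.com/b": "web crawler notes",
}

db = sqlite3.connect(":memory:")
urls = read_urls(["http://example.com/a\n", "\n", "http://example.com/b\n"])
crawl(urls, FAKE_WEB.get, db)
print(search(db, "crawler"))
```

Passing `fetch` in as a parameter keeps the storage logic separate from the download logic, so the in-memory stand-in can later be replaced by a function that performs real HTTP requests.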
Zilverline
Zilverline is a web-based search engine for content on a local hard disk or intranet. It can extract content from PDF, Word, Excel, PowerPoint, RTF, txt, Java, CHM, zip, rar, and other files to build summaries and indexes. Results can be retrieved from the local hard disk or the intranet. Zilverline supports multiple languages, including Chinese.
Xqengine
Xqengine is a full-text search engine for XML documents that uses XQuery as its front-end query language. It lets you query a collection of XML documents using logical combinations of keywords, much as Google and other search engines do for HTML documents. Xqengine is a compact, embeddable component written in Java.
Mg4j
Mg4j lets you build a compressed full-text index for a large collection of documents using interpolative coding.
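To give a feel for interpolative coding, here is a simplified sketch of binary interpolative coding for a sorted list of document numbers. It is not Mg4j's implementation: the function names are invented for the example, and each value is written with a plain fixed-width binary code, whereas production coders typically use tighter minimal binary codes. The idea is to encode the middle value first, then recurse on each half, where the already-known neighbours narrow the range each remaining value can fall in.

```python
def bits_needed(range_size):
    """Bits required to distinguish range_size possible values."""
    return (range_size - 1).bit_length()


def encode(values, lo, hi):
    """Encode a strictly increasing list within [lo, hi] as a bit string."""
    if not values:
        return ""
    mid = len(values) // 2
    v = values[mid]
    # mid values must fit below v and the rest above it,
    # so v can only lie in the range [low, high].
    low = lo + mid
    high = hi - (len(values) - 1 - mid)
    width = bits_needed(high - low + 1)
    code = format(v - low, "b").zfill(width) if width else ""
    return code + encode(values[:mid], lo, v - 1) + encode(values[mid + 1:], v + 1, hi)


def decode(bits, count, lo, hi):
    """Rebuild the list of count values from the bit string."""
    pos = 0

    def rec(n, lo, hi):
        nonlocal pos
        if n == 0:
            return []
        mid = n // 2
        low = lo + mid
        high = hi - (n - 1 - mid)
        width = bits_needed(high - low + 1)
        v = low
        if width:
            v = low + int(bits[pos:pos + width], 2)
            pos += width
        left = rec(mid, lo, v - 1)
        right = rec(n - 1 - mid, v + 1, hi)
        return left + [v] + right

    return rec(count, lo, hi)


postings = [3, 8, 9, 11, 12, 13, 17]  # made-up document numbers in 1..20
bits = encode(postings, 1, 20)
assert decode(bits, len(postings), 1, 20) == postings
print(len(bits), "bits for", len(postings), "postings")
```

Dense runs compress especially well: when the range left for a value contains exactly one candidate, the coder emits no bits at all for it, so a list such as 1, 2, 3, 4 within the range 1 to 4 encodes to an empty bit string.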
JXTA search
JXTA search is a distributed search system designed for peer-to-peer networks and websites.
Yacy
Yacy is a distributed P2P web search engine that also acts as an HTTP caching proxy server. The project takes a new approach to building a P2P web index network. It can search your own index or the global index, crawl your own web pages, or start a distributed crawl.
Red-Piranha
Red-Piranha is an open-source search system that truly "learns" what you are looking for. It can be used as a personal search engine for your desktop (Windows, Linux, and Mac), as an enterprise intranet search engine, as the search function for your website, or as a P2P search engine. It can also be combined with a wiki as a knowledge and document management solution, search the RSS feeds you aggregate, search your company's systems (including SAP, Oracle, or any other database or data source), or manage PDF, Word, and other documents. Finally, it can run as a web service that provides search information, or serve as the search backend for your applications (Web, Swing, SWT, Flash, Mozilla XUL, PHP, Perl, or C#/.NET).
Lius
Lius is an indexing framework based on the Jakarta Lucene project. It adds Lucene indexing support for many file formats, including MS Word, MS Excel, MS PowerPoint, RTF, PDF, XML, HTML, txt, the OpenOffice suite, and JavaBeans. JavaBean indexing is particularly useful when you want to index databases or when you develop with persistence-layer ORM technologies such as Hibernate, JDO, Torque, and TopLink.
Apache Solr
Solr is a high-performance full-text search server written in Java 5 and based on Lucene. Documents are added to a search collection as XML over HTTP. The collection is queried over HTTP as well, and the response comes back as XML or JSON. Its main features include efficient and flexible caching, vertical search, search result highlighting, improved availability through index replication, a powerful data schema for defining fields, types, and text analysis, and a web-based administration interface.
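The XML-over-HTTP exchange can be sketched without a running server. The Python snippet below only builds the two pieces a client would send: an `<add>` document in Solr's XML update format and a query URL that asks for a JSON response. The base URL, the field names, and the helper function names are assumptions made for the example; newer Solr versions also include a core or collection name in the path.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Hypothetical server address; adjust to match a real installation.
SOLR_BASE = "http://localhost:8983/solr"


def build_add_xml(doc):
    """Build the XML body that would be POSTed to the update handler."""
    add = ET.Element("add")
    doc_el = ET.SubElement(add, "doc")
    for name, value in doc.items():
        field = ET.SubElement(doc_el, "field", name=name)
        field.text = str(value)
    return ET.tostring(add, encoding="unicode")


def build_select_url(query, rows=10):
    """Build a query URL that requests the response as JSON."""
    params = urlencode({"q": query, "wt": "json", "rows": rows})
    return SOLR_BASE + "/select?" + params


print(build_add_xml({"id": "1", "title": "Hello"}))
print(build_select_url("title:hello"))
```

A client would POST the XML body to the server's update handler with an XML content type and then issue a commit before the document becomes searchable; the sketch stops short of sending anything so that it stays runnable offline.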
Paoding
Paoding Chinese Word Segmentation is a Chinese word segmentation component for search engines. It is written in Java and can be integrated with Lucene in Internet and enterprise intranet applications. Paoding fills a gap in open-source Chinese word segmentation components in China and aims to become the preferred open-source segmenter for Internet websites. It pursues efficient segmentation and a good user experience.
Carrot2
Carrot2 is an open-source engine for clustering search results: it automatically organizes results into topical categories. Its architecture can fetch search results from a variety of sources (Yahoo API, Google API, MSN Search API, ls Meta Search, Alexa Web Search, PubMed, OpenSearch, Lucene indexes, and Solr).
Regain
Regain is a desktop search engine that works much like a web search engine. The difference is that Regain searches your own documents and files rather than Internet content. With Regain you can search large amounts of data (multiple GB) in a few seconds. Regain adopts Lucene's search syntax, so it supports multiple query types, multi-index search, and advanced file-based search. It also supports URL rewriting and file-to-HTTP bridging, and it provides better support for Chinese characters.
Regain comes in two editions: desktop search and server search. The desktop edition provides quick search over documents on an ordinary desktop computer and over web pages on the LAN. The server edition is installed mainly on web servers to search websites and file servers in LAN environments.
Source: Open-open