Introduction to Lucene and Solr
Because the portal community places many functional requirements on its site search in order to improve the user experience, there are currently several options for implementing the search engine:
- Implement intra-site search by encapsulating Lucene yourself. The workload is large and the scalability is limited, so this option was not adopted.
- Call the APIs of Google or Baidu to implement intra-site search. This binds the site too rigidly to a third-party search engine to meet future business expansion needs.
- Implement intra-site search based on Compass + Lucene. This is well suited to indexing database-driven application data, especially as a replacement for the traditional like '%expression%' queries on varchar and clob fields, and is a worthwhile way to implement intra-site search. However, distributed processing and the interfaces still need a certain amount of additional encapsulation.
- Implement intra-site search based on Solr. This solution is better encapsulated and more scalable, and provides a complete solution. It was therefore adopted for the portal community, with the Compass solution added later.
Lucene Introduction
Lucene is a Java-based full-text information retrieval toolkit. It is not a complete search application; rather, it provides indexing and search functions for your application. Lucene is an open-source project in the Apache Jakarta family and is currently the most popular Java-based open-source full-text retrieval toolkit.
Many applications base their search functions on Lucene, such as the search function of the Eclipse help system. Lucene indexes text data, so it can index and search your documents as long as you can convert the data you want to index into text. For example, to index HTML or PDF documents, you must first convert them to plain text, hand the converted content to Lucene for indexing, save the created index file to disk or memory, and then query the index based on the query conditions entered by the user. Because Lucene does not dictate the format of the documents to be indexed, it can be applied to almost any search application.
Figure 1 shows the relationship between the search application and Lucene, and also reflects the process of building a search application using Lucene:
Figure 1. The relationship between a search application and Lucene
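To make this flow concrete, here is a minimal indexing sketch, assuming a recent Lucene release (5.x or later; the API differs across versions), a hypothetical on-disk index path, and document content that has already been converted to plain text:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

import java.nio.file.Paths;

public class IndexExample {
    public static void main(String[] args) throws Exception {
        // Open (or create) an index directory on disk; the path is a placeholder
        Directory dir = FSDirectory.open(Paths.get("/tmp/lucene-index"));
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        try (IndexWriter writer = new IndexWriter(dir, config)) {
            // The content is assumed to have been extracted to plain text already
            Document doc = new Document();
            doc.add(new StringField("id", "doc-1", Field.Store.YES));
            doc.add(new TextField("content",
                    "plain text extracted from an HTML or PDF file", Field.Store.YES));
            writer.addDocument(doc);
        }
    }
}
```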
Index and search
Indexing is the core of a modern search engine. The indexing process turns the source data into an index file that is very efficient to query. Why is indexing so important? Imagine searching a large collection of documents for those containing a given keyword. Without an index, you would have to read each document into memory in turn and check whether it contains the keyword, which takes a great deal of time. Search engines, by contrast, return results within milliseconds, precisely because an index has been created. You can think of an index as a data structure that allows fast, random access to the keywords stored in it, so that you can quickly find the documents associated with a keyword. Lucene uses an inverted index: it maintains a table of words/phrases, and for each entry in the table a postings list records which documents contain that word/phrase. This makes it possible to return results quickly for a given query. The indexing mechanism of Lucene is introduced in the second part of this series. Because Lucene provides a simple, easy-to-use API, you can index your documents easily even if you are not yet familiar with the full-text indexing mechanism.
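As an illustration of the inverted-index idea described above (not of Lucene's actual internal data structures), here is a toy inverted index that maps each term to the list of document ids containing it:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// A toy inverted index: each term maps to the ids of the documents that contain it.
public class TinyInvertedIndex {
    private final Map<String, List<Integer>> postings = new HashMap<>();

    public void add(int docId, String text) {
        // Naive tokenization: lowercase and split on non-word characters
        for (String term : text.toLowerCase().split("\\W+")) {
            if (term.isEmpty()) continue;
            postings.computeIfAbsent(term, t -> new ArrayList<>()).add(docId);
        }
    }

    public List<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(), Collections.emptyList());
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.add(1, "Lucene is a full-text retrieval toolkit");
        index.add(2, "Solr is built on top of Lucene");
        System.out.println(index.search("lucene")); // prints [1, 2]
    }
}
```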
Once you have created an index for your documents, you can search it. The search engine first parses the search keywords, then searches the created index, and finally returns the documents associated with the keywords entered by the user.
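Continuing the earlier sketch, here is a minimal search example, again assuming a recent Lucene release and the same hypothetical index directory and field names:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

import java.nio.file.Paths;

public class SearchExample {
    public static void main(String[] args) throws Exception {
        try (DirectoryReader reader =
                     DirectoryReader.open(FSDirectory.open(Paths.get("/tmp/lucene-index")))) {
            IndexSearcher searcher = new IndexSearcher(reader);
            // Parse the user's keywords into a query against the "content" field
            Query query = new QueryParser("content", new StandardAnalyzer()).parse("lucene");
            TopDocs hits = searcher.search(query, 10);
            for (ScoreDoc sd : hits.scoreDocs) {
                Document doc = searcher.doc(sd.doc);
                System.out.println(doc.get("id") + " -> " + doc.get("content"));
            }
        }
    }
}
```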
Want more? Here is more Lucene-related content, between the split lines below:
-------------------------------------- Split line --------------------------------------
Indexing and searching based on Lucene multi-index
Lucene (2nd edition) Chinese edition with supporting source code
Lucene (2nd edition) PDF
Use Lucene-Spatial to implement full-text retrieval of integrated geographical locations
Lucene + Hadoop distributed search runtime framework Nut 1.0a9
Lucene + Hadoop distributed search runtime framework Nut 1.0a8
Lucene + Hadoop distributed search runtime framework Nut 1.0a7
Project 2-1: Configure Lucene and create a WEB query system [Ubuntu 10.10]
-------------------------------------- Split line --------------------------------------
Solr Introduction
Solr is a Lucene-based search engine server written in Java. Solr provides faceted search, hit highlighting, and multiple output formats (including XML/XSLT and JSON). It is easy to install and configure, and comes with an HTTP-based administration interface. Solr has been used on many large websites and is relatively mature and stable. Solr encapsulates and extends Lucene, so Solr largely follows Lucene's terminology. More importantly, the indexes created by Solr are fully compatible with the Lucene search engine library. With appropriate configuration, and in some cases a little coding, Solr can read and use indexes built by other Lucene applications. In addition, many Lucene tools (such as Nutch and Luke) can also use indexes created by Solr.
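As a small sketch of what querying Solr from Java looks like, the following uses the SolrJ client, assuming a SolrJ 6.x+ dependency, a local Solr instance, and a hypothetical collection name and field:

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class SolrQueryExample {
    public static void main(String[] args) throws Exception {
        // The URL and collection name are placeholders for a local Solr setup
        try (HttpSolrClient client =
                     new HttpSolrClient.Builder("http://localhost:8983/solr/mycollection").build()) {
            SolrQuery query = new SolrQuery("content:lucene");
            query.setRows(10);
            query.setHighlight(true); // hit highlighting, one of the features mentioned above
            QueryResponse response = client.query(query);
            for (SolrDocument doc : response.getResults()) {
                System.out.println(doc.getFieldValue("id"));
            }
        }
    }
}
```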
Setting up a Solr 3.6.1 environment in Tomcat 6
Tomcat-based Solr 3.5 cluster deployment
Load Balancing for Solr clusters using Nginx on Linux
Install and use Solr in Linux
Deploy Solr 4 on Ubuntu 12.04 LTS through Tomcat
Implementing low-level query parsing (QParser) in Solr
Building a search server based on Solr 3.5
Solr 3.5 development and application tutorial PDF
Solr 4.0 deployment instance tutorial