Having introduced Lucene, we now look at its modules one by one. This part focuses on Lucene's index: the concepts behind it and the process of building it.
Lucene and relational databases
We compare the two from two perspectives: indexing and fuzzy queries. The second is, in effect, a comparison between full-text search and a database's fuzzy query.
1. Index comparison
| Item | Full-text retrieval library (Lucene) | Relational database |
| --- | --- | --- |
| Core functions | Mainly text retrieval; inserting, deleting, and modifying data is cumbersome. Well suited to querying large blocks of text. | Inserting, deleting, and modifying data is very convenient, with dedicated SQL commands, but searching large blocks of text is inefficient. |
| Library | Similar to a database: multiple libraries can be created, and each can be stored in a different location. | Multiple databases can be created. Each database generally has control files and data files, which makes it more complex. |
| Table | No strict table concept. A Lucene "table" is a loosely defined set of fields determined when data is stored. | Strict table structures, with primary keys and field types. |
| Record | Because there is no strict table concept, a record is represented as an object; the corresponding class is Document. | A record corresponds to the table structure. |
| Field | Field types are limited to text and date. Fields generally do not support computation or functions. The corresponding class is Field. | Rich field types and powerful functions. |
| Query result set | The query result set class is Hits, for example Hits(doc1, doc2, doc3 ......) | ResultSet in JDBC |
2. Comparison of fuzzy search
| Item | Lucene full-text search | Database fuzzy query |
| --- | --- | --- |
| Index | Builds an inverted index over the data in the data source, which makes queries faster. | The database index cannot be used; every record must be traversed for the fuzzy match, so queries are several orders of magnitude slower. |
| Matching effect | Matches on terms (tokens); the language analysis interfaces make it possible to support Chinese. | The match is a plain fuzzy match and therefore imprecise: irrelevant records may be returned and relevant ones missed. |
| Matching degree | Has a matching-degree (relevance) algorithm; results with a higher degree of matching rank first. | No matching-degree algorithm: a record is treated the same no matter how many times the keyword appears in it. |
| Result output | A special algorithm outputs the 100 best-matching results first. The result set is read in small buffered batches, so system overhead is low. | Returns the entire result set. When there are many matches, a large amount of memory is needed to hold these temporary result sets, causing high system overhead. |
| Customization | Sorting rules can be customized through the API to meet search and sorting requirements. | Not customizable. |
| Applicability | High-load fuzzy search applications with large volumes of indexed data, strict speed requirements, and high matching requirements. | Low usage, simple fuzzy-matching rules, or a small amount of data to query. |
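The "Index" and "Matching degree" rows can be made concrete with a small, self-contained sketch. It is a toy model written for this comparison, not Lucene's API or its actual scoring formula: the class name `ScanVersusInvertedIndex`, its methods, and the use of raw term frequency as the "matching degree" are all assumptions made for illustration. `scan` mimics a database fuzzy query by checking every document, while `search` only touches the postings of the queried term and returns the top N hits ranked by frequency.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy comparison of a full scan with an inverted-index lookup.
// Illustration only: not Lucene's API and not its scoring formula.
class ScanVersusInvertedIndex {
    private final List<String> docs = new ArrayList<>();
    // term -> (document id -> number of occurrences in that document)
    private final Map<String, Map<Integer, Integer>> postings = new HashMap<>();

    // Store the document and record each of its terms in the inverted index.
    void add(String text) {
        int docId = docs.size();
        docs.add(text);
        for (String term : text.toLowerCase().split("\\W+")) {
            if (term.isEmpty()) continue;
            postings.computeIfAbsent(term, t -> new HashMap<>())
                    .merge(docId, 1, Integer::sum);
        }
    }

    // Full scan: every document is examined and hits come back unranked.
    List<Integer> scan(String keyword) {
        List<Integer> hits = new ArrayList<>();
        for (int id = 0; id < docs.size(); id++) {
            if (docs.get(id).toLowerCase().contains(keyword.toLowerCase())) {
                hits.add(id);
            }
        }
        return hits;
    }

    // Index lookup: only the postings of the term are read. Hits are sorted by
    // term frequency (highest first, ties broken by document id), then cut to topN.
    List<Integer> search(String term, int topN) {
        Map<Integer, Integer> hits = postings.getOrDefault(term.toLowerCase(), Map.of());
        List<Integer> ids = new ArrayList<>(hits.keySet());
        ids.sort((a, b) -> {
            int byFrequency = Integer.compare(hits.get(b), hits.get(a));
            return byFrequency != 0 ? byFrequency : Integer.compare(a, b);
        });
        return new ArrayList<>(ids.subList(0, Math.min(topN, ids.size())));
    }

    public static void main(String[] args) {
        ScanVersusInvertedIndex idx = new ScanVersusInvertedIndex();
        idx.add("Lucene builds an inverted index");
        idx.add("A database scans every row for a fuzzy match");
        idx.add("Lucene ranks results; Lucene returns the best first");
        System.out.println(idx.scan("lucene"));       // prints [0, 2]
        System.out.println(idx.search("lucene", 10)); // prints [2, 0]
    }
}
```

In this sketch the scan returns documents in storage order, whereas the index lookup puts document 2 first because the term occurs twice in it. Note also that `scan` matches substrings while `search` matches whole terms, which is one reason a fuzzy scan can return irrelevant records.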
Index creation process
Creating an index can be divided into three steps: converting the original document into text, analyzing the text, and saving the analyzed text to the index.
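The three steps can be sketched roughly as follows. This is a toy pipeline under stated assumptions, not Lucene code: the class `ToyIndexPipeline` and its methods `toText`, `analyze`, `addDocument`, and `lookup` are hypothetical names, the "conversion to text" is reduced to stripping HTML-like tags, and the "analysis" is reduced to lower-casing and splitting on non-word characters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy version of the three indexing steps. Illustration only, not Lucene's API.
class ToyIndexPipeline {
    // term -> ids of the documents that contain it
    private final Map<String, Set<Integer>> index = new TreeMap<>();
    private int nextDocId = 0;

    // Step 1: convert the original document to plain text
    // (assumed here to mean replacing HTML-like tags with spaces).
    static String toText(String raw) {
        return raw.replaceAll("<[^>]*>", " ").trim();
    }

    // Step 2: analyze the text by lower-casing it and splitting it into terms.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String term : text.toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) terms.add(term);
        }
        return terms;
    }

    // Step 3: save the analyzed terms to the index under a new document id.
    int addDocument(String raw) {
        int docId = nextDocId++;
        for (String term : analyze(toText(raw))) {
            index.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
        return docId;
    }

    // Return the ids of the documents containing the term (empty if none).
    Set<Integer> lookup(String term) {
        return index.getOrDefault(term.toLowerCase(), Set.of());
    }

    public static void main(String[] args) {
        ToyIndexPipeline pipeline = new ToyIndexPipeline();
        pipeline.addDocument("<p>Hello <b>Lucene</b></p>");
        pipeline.addDocument("Lucene index");
        System.out.println(pipeline.lookup("lucene")); // prints [0, 1]
    }
}
```

Keeping the steps as separate methods mirrors the division described above: the conversion and the analysis can each be swapped out (for another file format or another language) without touching the code that writes to the index.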