Lucene is an efficient Java-based full-text retrieval library.
So what is full-text search, and why do we need it?
Data in our daily lives falls broadly into two categories: structured data and unstructured data. Structured data has a fixed format or a limited length, such as database records and metadata. Unstructured data has variable length and no fixed format, such as emails, Word documents, and other free text. There is also a smaller third category, semi-structured data, such as XML and HTML, which can to some extent be processed as structured data, or have its plain text extracted and be processed as unstructured data.
Unstructured data is also called full-text data. There are two main ways to search it:
- Sequential scanning (serial scanning): as the name implies, to find the documents containing a given string, you go through the documents one by one, scanning each from beginning to end, until every document has been scanned. This is similar to searching file contents in Windows.
- Indexing: extract information from the unstructured data and reorganize it so that it gains some structure, then search this reorganized data, which greatly improves retrieval speed. For example, in a phone book we can look up a contact by the first letter (pinyin initial) of the name, which serves as an index to that contact.
This process of first building an index and then searching the index is called full-text search (full-text retrieval). It is the general process of full-text retrieval, and it is also the process Lucene follows.
In general terms, Lucene is:
- An efficient, extensible full-text retrieval library.
- Implemented entirely in Java, requiring no configuration.
- Able to index and search plain text only.
- Not responsible for extracting plain text from other file formats or for fetching files from the network.
In Lucene in Action, the architecture and workflow of Lucene are described as two processes, indexing and searching, covering three key points: index creation, the index itself, and search.
Let's look at the various components of Lucene in more detail:
- A Document object represents the document to be indexed.
- IndexWriter adds a Document to the index through its addDocument method, implementing the process of index creation.
- The Lucene index uses an inverted index (a minimal illustration of this structure follows this list).
- When a user issues a request, a Query object represents the user's query.
- IndexSearcher searches the Lucene index through its search method.
- IndexSearcher computes term weights and scores, and returns the results to the user.
- The collection of documents returned to the user is represented by a TopDocsCollector.
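As a conceptual illustration of the inverted index mentioned above (this is not Lucene's actual on-disk format; the documents and terms are made up for the example), each term maps to the list of documents that contain it:

```java
import java.util.*;

// Toy inverted index: maps each term to the IDs of the documents containing it.
// It only illustrates the idea; Lucene's real index also stores frequencies,
// positions, and other statistics in a compressed on-disk format.
public class TinyInvertedIndex {
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    public void addDocument(int docId, String text) {
        for (String term : text.toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    public Set<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(), Collections.emptySet());
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.addDocument(1, "Lucene is a full-text search library");
        index.addDocument(2, "Full-text search builds an inverted index");
        System.out.println(index.search("search")); // [1, 2]
        System.out.println(index.search("lucene")); // [1]
    }
}
```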
So how do you apply these components?
Let's walk through the calls to the Lucene API in the indexing and search processes in more detail.
The indexing process is as follows:
- Create an IndexWriter to write the index files. It takes several parameters: INDEX_DIR is where the index files are stored, and the Analyzer is used for lexical analysis and language processing of the documents.
- Create a Document representing the document we want to index.
- Add different Fields to the Document. A document carries several kinds of information, such as title, author, modification time, content, and so on; each kind of information is represented by its own Field. In this example two kinds of information are indexed: the file path and the file content, where a FileReader reads SRC_FILE, the source file to be indexed.
- IndexWriter calls addDocument to write the index into the index folder.
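A minimal sketch of these indexing steps, written against the Lucene 3.x-style API that this description follows (exact class names and constructors differ in newer Lucene versions; INDEX_DIR and SRC_FILE are placeholder paths):

```java
import java.io.File;
import java.io.FileReader;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class IndexFiles {
    public static void main(String[] args) throws Exception {
        File INDEX_DIR = new File("index");      // where the index files are stored
        File SRC_FILE  = new File("docs/a.txt"); // the source file to be indexed

        // 1. IndexWriter with the index directory and an Analyzer
        IndexWriter writer = new IndexWriter(
                FSDirectory.open(INDEX_DIR),
                new StandardAnalyzer(Version.LUCENE_30),
                true,                                // create a new index
                IndexWriter.MaxFieldLength.LIMITED);

        // 2. A Document representing the file to index
        Document doc = new Document();

        // 3. One Field for the path, one Field for the content
        doc.add(new Field("path", SRC_FILE.getPath(),
                Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("contents", new FileReader(SRC_FILE)));

        // 4. Write the document into the index folder
        writer.addDocument(doc);
        writer.optimize();
        writer.close();
    }
}
```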
The search process is as follows:
- IndexReader reads the index information from disk into memory; INDEX_DIR is where the index files are stored.
- Create an IndexSearcher, ready for searching.
- Create an Analyzer, used for lexical analysis and language processing of the query string.
- Create a QueryParser, used to parse the query string.
- QueryParser calls its parse method to analyze the syntax of the query string and build a query syntax tree, which is placed in a Query object.
- IndexSearcher calls search with the Query (the query syntax tree) to search the index, and the results are collected in a TopScoreDocCollector.
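A matching sketch of the search steps, again against the Lucene 3.x-style API (the field name "contents", the query string, and the number of hits are illustrative choices, not values taken from the article):

```java
import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopScoreDocCollector;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class SearchFiles {
    public static void main(String[] args) throws Exception {
        File INDEX_DIR = new File("index");

        // 1. Read the on-disk index into an IndexReader
        IndexReader reader = IndexReader.open(FSDirectory.open(INDEX_DIR), true);

        // 2. IndexSearcher ready to search
        IndexSearcher searcher = new IndexSearcher(reader);

        // 3 + 4. Analyzer and QueryParser for the query string
        QueryParser parser = new QueryParser(Version.LUCENE_30, "contents",
                new StandardAnalyzer(Version.LUCENE_30));

        // 5. parse() builds the query syntax tree and returns a Query
        Query query = parser.parse("lucene AND search");

        // 6. search() runs the Query; results land in a TopScoreDocCollector
        TopScoreDocCollector collector = TopScoreDocCollector.create(10, true);
        searcher.search(query, collector);

        for (ScoreDoc hit : collector.topDocs().scoreDocs) {
            Document d = searcher.doc(hit.doc);
            System.out.println(d.get("path") + "  score=" + hit.score);
        }
        searcher.close();
        reader.close();
    }
}
```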
These are the basic calls to the Lucene API.
However, when you dive into the Lucene source code, you find that it contains many packages with complex relationships between them.
Still, it is not hard to see that each of Lucene's source modules implements part of the ordinary indexing and search process described above.
This figure shows the package structure of Lucene's implementation of the full-text retrieval process described in the previous section. (See the article "Open Source Full-Text Search Engine Lucene" at http://www.lucene.com.cn/about.htm.)
- Lucene's analysis module is mainly responsible for lexical analysis and language processing, forming Terms.
- Lucene's index module is responsible for index creation and contains IndexWriter.
- Lucene's store module is mainly responsible for reading and writing the index.
- Lucene's queryParser module is mainly responsible for syntactic analysis of query strings.
- Lucene's search module is mainly responsible for searching the index.
- Lucene's similarity module is responsible for relevance scoring.
Having understood the overall structure of Lucene, we can begin our journey into its source code.
The full version of "Lucene Principles and Code Analysis" is available as a free PDF download.