This series is an introduction to Lucene.
The data we deal with falls into three types:
Structured data: data with a fixed format or limited length, such as the data in a database.
Unstructured data: data with no fixed format and no fixed length, such as text content on the web.
Semi-structured data: data such as JSON or XML.
So how do we deal with these different types of data?
For structured data in a database, we query with SQL statements.
For unstructured data, we can either scan sequentially or use full-text search.
Sequential scanning reads the data from the first item to the last on every query. On a large data set this wastes time and performance.
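To make the cost of sequential scanning concrete, here is a minimal Python sketch. The function name `sequential_scan` and the sample documents are made up for illustration and have nothing to do with Lucene's own API.

```python
def sequential_scan(documents, keyword):
    """Return the 1-based ids of documents that contain `keyword`."""
    needle = keyword.lower()
    hits = []
    for doc_id, text in enumerate(documents, start=1):
        # Every document is read in full on every query, so the cost
        # grows with the total amount of text, not with the result size.
        if needle in text.lower():
            hits.append(doc_id)
    return hits


# Hypothetical sample data, only for this sketch.
docs = [
    "Lucene and Solr basics",
    "Notes on databases",
    "Lucene, Solr and Hadoop together",
]
print(sequential_scan(docs, "lucene"))  # [1, 3]
```

Each new query repeats the whole loop, which is the waste that an index is meant to avoid.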
So what is full-text search?
That is what Lucene provides. Let's look at a diagram that describes its role in the whole system:
In the diagram, the applications sit above Lucene. On one side is indexing: structured, semi-structured, and unstructured data is collected and indexed by Lucene. On the other side is retrieval: the user enters search keywords, the index library is searched, and the results are returned to the user.
So what is an index?
It works like the pinyin index and the radical index in the Xinhua dictionary, which are used to look up characters.
Similarly, the index in Lucene records the documents in which each word appears. For example:
In this example, the keyword "Lucene" appears in documents 1 and 3. The keyword "Solr" appears in documents 1, 3, and 5. The keyword "Hadoop" appears in documents 3, 5, 7, 8, and 9.
We call this whole structure an "inverted index" (sometimes translated as "reverse index"). The list of documents linked to each keyword is what we call the inverted list.
What is an inverted index?
Inverted index: a mapping from each string to the files that contain it. It is the reverse of the usual mapping from a file to the strings it contains.
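The reversal can be sketched in a few lines of Python. The `forward` mapping below is an assumption chosen to match the Lucene / Solr / Hadoop example above (documents that contain none of the three keywords are left out), and `invert` is a hypothetical helper, not a Lucene function.

```python
from collections import defaultdict

# Forward mapping: document id -> keywords found in that document.
# Assumed contents, consistent with the example above.
forward = {
    1: ["lucene", "solr"],
    3: ["lucene", "solr", "hadoop"],
    5: ["solr", "hadoop"],
    7: ["hadoop"],
    8: ["hadoop"],
    9: ["hadoop"],
}


def invert(forward_mapping):
    """Turn a document -> terms mapping into a term -> documents mapping."""
    inverted = defaultdict(list)
    for doc_id in sorted(forward_mapping):
        for term in forward_mapping[doc_id]:
            if doc_id not in inverted[term]:
                inverted[term].append(doc_id)
    return dict(inverted)


inverted = invert(forward)
print(inverted["lucene"])  # [1, 3]
print(inverted["solr"])    # [1, 3, 5]
print(inverted["hadoop"])  # [3, 5, 7, 8, 9]
```

Each value in `inverted` plays the role of an inverted list for its keyword.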
Create an index
So what is the procedure for creating a full-text index?
Creating one involves three things:
Data that needs to be retrieved (Document)
Word segmentation technology (Analyzer)
Index Creation (Indexer)
Let's give an example.
The first step: the document data.
My blog Space
Happybks's Lucene article
Happybks's Hadoop article
The second step: word segmentation. (We use the standard analyzer here.)
I | of | Bo | Guest | Empty | Room
happybks | of | lucene | Text | Chapter
happybks | of | hadoop | Text | Chapter
Note that the standard analyzer splits Chinese text into single characters and converts English uppercase letters to lowercase. The example documents were originally written in Chinese, so terms such as "Bo", "Guest", "Empty", and "Room" each stand for one Chinese character.
The third step: index creation. Each term is recorded together with the Id of the document it came from.
Term | Id | Term | Id | Term | Id
I | 1 | happybks | 2 | happybks | 3
of | 1 | of | 2 | of | 3
Bo | 1 | lucene | 2 | hadoop | 3
Guest | 1 | Text | 2 | Text | 3
Empty | 1 | Chapter | 2 | Chapter | 3
Room | 1 | | | |
Next, we merge the entries that share the same term.
Term | Id | Term | Id | Term | Id
I | 1 | happybks | 2,3 | |
of | 1,2,3 | | | |
Bo | 1 | lucene | 2 | hadoop | 3
Guest | 1 | Text | 2,3 | |
Empty | 1 | Chapter | 2,3 | |
Room | 1 | | | |
This table is what we call an index.
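The two tables above (one entry per term occurrence, then one merged entry per term) can be reproduced with a short Python sketch. The token lists are copied from the example, with the English words lowercased; `term_doc_pairs` and `merge_postings` are hypothetical helper names, and this is not how Lucene stores its index internally.

```python
# Tokens per document Id, as produced by the segmentation step above.
tokens_by_doc = {
    1: ["I", "of", "Bo", "Guest", "Empty", "Room"],
    2: ["happybks", "of", "lucene", "Text", "Chapter"],
    3: ["happybks", "of", "hadoop", "Text", "Chapter"],
}


def term_doc_pairs(tokens):
    """First table: one (term, doc_id) pair per token occurrence."""
    return [
        (term, doc_id)
        for doc_id, doc_tokens in sorted(tokens.items())
        for term in doc_tokens
    ]


def merge_postings(pairs):
    """Second table: merge pairs with the same term into one Id list."""
    index = {}
    for term, doc_id in pairs:
        postings = index.setdefault(term, [])
        if doc_id not in postings:
            postings.append(doc_id)
    for postings in index.values():
        postings.sort()
    return index


pairs = term_doc_pairs(tokens_by_doc)
index = merge_postings(pairs)
print(index["of"])        # [1, 2, 3]
print(index["happybks"])  # [2, 3]
print(index["lucene"])    # [2]
```

The dictionary `index` holds the same information as the merged table: each term points to the list of document Ids in which it appears.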
Now let's look at how the index is used for retrieval.
Index retrieval
Retrieval is divided into four steps:
Search keywords (Keywords)
Word segmentation technology (Analyzer)
Search the index (Search)
Return results
Let's walk through the steps with an example.
The first step: get the user's search keywords.
Lucene Articles
The second step: apply word segmentation.
lucene | Text | Chapter
The third step: search the index.
From the index table above, the only document that appears in the inverted lists of all the query terms is document 2.
The fourth step: return the result, document 2.
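The third step amounts to intersecting the inverted lists of the query terms. Below is a minimal Python sketch of that idea. The `index` literal restates the merged table from the indexing example, and `search` is a hypothetical function that follows this article's "contains every term" rule; how a real Lucene query combines terms depends on the query type.

```python
# Merged index from the indexing example: term -> document Ids.
index = {
    "I": [1], "of": [1, 2, 3], "Bo": [1], "Guest": [1],
    "Empty": [1], "Room": [1],
    "happybks": [2, 3], "lucene": [2], "hadoop": [3],
    "Text": [2, 3], "Chapter": [2, 3],
}


def search(inverted_index, query_terms):
    """Return the sorted Ids of documents that contain every query term."""
    if not query_terms:
        return []
    result = None
    for term in query_terms:
        postings = set(inverted_index.get(term, []))
        result = postings if result is None else result & postings
        if not result:
            break  # No document can match any more.
    return sorted(result)


print(search(index, ["lucene", "Text", "Chapter"]))  # [2]
```

Only the inverted lists of the three query terms are read, so the documents themselves never have to be scanned.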
This article explains the general principle and process of full-text retrieval. The mathematical model Lucene uses and the way it implements full-text indexing will be covered in later articles in this series.
Lucene Notes Series (1): The theoretical basis of full-text retrieval in Lucene