Lucene Note Series (1): The Theoretical Basis of Full-Text Retrieval in Lucene

Source: Internet
Author: User

This series is a set of notes on learning Lucene.

The data we deal with falls into three types:

Structured data: data with a fixed format or a limited length, such as the data in a database.

Unstructured data: data with no fixed format and no fixed length, such as the text content of web pages.

Semi-structured data: such as JSON or XML data.

So how do we query these different types of data?

For structured data in a database, we use SQL statements.

For unstructured data, there are two options: sequential scanning and full-text search.

Sequential scanning checks every piece of data in turn, from the first to the last. Obviously, this wastes time and performs poorly as the amount of data grows.
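To make the cost of sequential scanning concrete, here is a minimal sketch in plain Python. The function name `sequential_scan` and the three sample documents are assumptions made for illustration (the documents are borrowed from the example used later in this article). Every query has to read every document, so the work grows with the total amount of text.

```python
def sequential_scan(documents, keyword):
    """Return the IDs of documents containing keyword, checking each document in turn."""
    needle = keyword.lower()
    hits = []
    for doc_id, text in documents.items():
        # Every document is read in full for every query.
        if needle in text.lower():
            hits.append(doc_id)
    return hits


# Hypothetical sample data, keyed by document ID.
documents = {
    1: "My blog space",
    2: "Happybks's Lucene article",
    3: "Happybks's Hadoop article",
}

print(sequential_scan(documents, "lucene"))  # prints [2]
```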

So what is full-text search?

Full-text search builds an index over the data first and then searches that index instead of scanning the raw data. That is what Lucene does. Its role in the whole system is as follows.

Lucene sits beneath the application layer and does two jobs. The first is indexing: structured, semi-structured, and unstructured data is collected and indexed by Lucene. The second is retrieval: the user enters keywords as search criteria, the index library is searched, and the results are returned to the user.

So what is an index?

It is like the pinyin index and the radical index that the Xinhua Dictionary provides for looking up characters.

Similarly, in Lucene, the full-text index records which documents each word appears in. For example:

The keyword "Lucene" appears in documents 1 and 3. The keyword "Solr" appears in documents 1, 3, and 5. The keyword "Hadoop" appears in documents 3, 5, 7, 8, and 9.

We call this whole structure an "inverted index". The list of documents linked to each keyword is what we call the inverted list.

What is an inverted index?

Inverted index: a mapping from strings to files, which is the reverse of the natural mapping from files to the strings they contain. In essence, it simply describes a mapping relationship.
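The reversal described above can be sketched in a few lines of plain Python. The forward mapping below (document ID to keywords) is an assumption reconstructed from the Lucene, Solr, and Hadoop example, and the function name `invert` is hypothetical. This is an illustration of the idea, not how Lucene stores its index internally.

```python
from collections import defaultdict


def invert(forward):
    """Turn a document -> keywords mapping into a keyword -> documents mapping."""
    inverted = defaultdict(list)
    for doc_id in sorted(forward):
        for term in forward[doc_id]:
            # Record each document at most once per keyword.
            if doc_id not in inverted[term]:
                inverted[term].append(doc_id)
    return dict(inverted)


# Forward mapping: which keywords each document contains (assumed from the example).
forward = {
    1: ["lucene", "solr"],
    3: ["lucene", "solr", "hadoop"],
    5: ["solr", "hadoop"],
    7: ["hadoop"],
    8: ["hadoop"],
    9: ["hadoop"],
}

inverted = invert(forward)
print(inverted["lucene"])  # prints [1, 3]
print(inverted["hadoop"])  # prints [3, 5, 7, 8, 9]
```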


Creating an index

All right. So what is the procedure for creating a full-text index?

Creating a full-text index involves three steps, or three components:

The data that needs to be retrieved (Document)

Word segmentation (Analyzer)

Index creation (Indexer)


Let's give an example.

The first step, the document data instance

My blog Space

Happybks's Lucene article

Happbks's Hadoop article

Step two: word segmentation. (We use the standard analyzer here.)

我 (I) | 的 (of) | 博 (bo) | 客 (guest) | 空 (empty) | 间 (room)

happybks | 的 (of) | lucene | 文 (text) | 章 (chapter)

happybks | 的 (of) | hadoop | 文 (text) | 章 (chapter)

Note that after standard analysis, the Chinese text is split character by character, and uppercase English letters are converted to lowercase.

Step three: index creation. Each term is recorded together with the ID of the document it came from.

Term      ID        Term      ID        Term      ID
I         1         happybks  2         happybks  3
of        1         of        2         of        3
bo        1         lucene    2         hadoop    3
guest     1         text      2         text      3
empty     1         chapter   2         chapter   3
room      1

Next, we merge the entries for identical terms, so that each term appears once with the list of all documents that contain it.

Term      ID        Term      ID        Term      ID
I         1         happybks  2,3
of        1,2,3
bo        1         lucene    2         hadoop    3
guest     1         text      2,3
empty     1         chapter   2,3
room      1
This table is what we call an index.
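The record-then-merge procedure above can be sketched in plain Python. The input is the token list of each document from step two, written with the same English glosses the tables use. The function name `build_index` is hypothetical, and this is a simplified illustration rather than Lucene's actual indexing code.

```python
def build_index(tokenized_docs):
    """Build a term -> sorted list of document IDs mapping (the merged index)."""
    index = {}
    for doc_id in sorted(tokenized_docs):
        for term in tokenized_docs[doc_id]:
            postings = index.setdefault(term, [])
            # Documents are visited in ascending ID order, so checking the
            # last entry is enough to avoid recording a document twice.
            if not postings or postings[-1] != doc_id:
                postings.append(doc_id)
    return index


# Tokens produced by step two, keyed by document ID.
tokenized_docs = {
    1: ["I", "of", "bo", "guest", "empty", "room"],
    2: ["happybks", "of", "lucene", "text", "chapter"],
    3: ["happybks", "of", "hadoop", "text", "chapter"],
}

index = build_index(tokenized_docs)
print(index["of"])        # prints [1, 2, 3]
print(index["happybks"])  # prints [2, 3]
print(index["lucene"])    # prints [2]
```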

Now, let's look at how the index is used for retrieval.


Index retrieval

Retrieval is divided into four steps:

Get the search keywords (Keywords)

Word segmentation (Analyzer)

Search the index (Search)

Return the results


Let's put it in an example to sort through the steps.

The first step, get the user search keywords

Lucene Articles

The second step, the use of Word segmentation technology

Lucene|-Wen | zhang


Step three: search the index.

Looking up each query term in the index table above, "lucene" appears in document 2, while "text" and "chapter" each appear in documents 2 and 3. The only document that appears in the inverted lists of all the query terms is document 2.


Step four: return the result, which is document 2.
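The lookup in steps three and four can be sketched in plain Python as an intersection of inverted lists. The sketch assumes that every query term must match, as in the example, and it starts from the already-analyzed query terms. The function name `search` is hypothetical, and the index literal repeats the merged table using its English glosses.

```python
def search(index, query_terms):
    """Return the sorted IDs of documents that contain every query term."""
    result = None
    for term in query_terms:
        postings = set(index.get(term, []))
        # Keep only the documents that also matched all earlier terms.
        result = postings if result is None else result & postings
    return sorted(result) if result else []


# The merged index from the indexing example.
index = {
    "I": [1], "of": [1, 2, 3], "bo": [1], "guest": [1], "empty": [1], "room": [1],
    "happybks": [2, 3], "lucene": [2], "text": [2, 3], "chapter": [2, 3],
    "hadoop": [3],
}

# The query after word segmentation.
print(search(index, ["lucene", "text", "chapter"]))  # prints [2]
```

A term that is missing from the index has an empty inverted list, so a query containing it returns no documents under this all-terms-must-match assumption.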

This article has outlined the general principle and process of full-text retrieval. The mathematical model Lucene uses and how it implements full-text indexing will be covered in later articles in this series.















