Lucene Notes Series (1): The Theoretical Basis of Full-Text Retrieval in Lucene

Last Update:2015-08-18 Source: Internet

Author: User

This series is a set of notes on learning Lucene.

The data we deal with falls into three types:

Structured data: data with a fixed format or limited length, such as the rows in a database

Unstructured data: data with no fixed format and no fixed length, such as text content on the web

Semi-structured data: data such as JSON or XML.

So how do we search these different types of data?

For structured data in a database, we query with SQL statements.

For unstructured data, there are two options: sequential scanning and full-text search.

Sequential scanning reads the data from the first item to the last and checks each one against the query. Obviously, this wastes time and performs poorly as the amount of data grows.
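To make the cost of sequential scanning concrete, here is a minimal sketch in Python. The function name `sequential_scan` and the sample documents are assumptions made up for illustration; they are not from Lucene or from the original article.

```python
def sequential_scan(docs, keyword):
    """Return the IDs of all documents whose text contains the keyword.

    Every document is read in full on every query, so the cost grows
    with the total amount of text.
    """
    keyword = keyword.lower()
    hits = []
    for doc_id, text in docs.items():
        if keyword in text.lower():
            hits.append(doc_id)
    return hits


# Hypothetical sample data, for illustration only.
docs = {
    1: "Lucene and Solr notes",
    2: "Cooking recipes",
    3: "Lucene, Solr and Hadoop",
}

print(sequential_scan(docs, "lucene"))  # [1, 3]
```

Each query repeats the whole scan, which is the work that an index is meant to avoid.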

So what is full-text search?

That is what Lucene does. Its role in the overall system can be described as follows:

For an application built on top of Lucene, there are two sides. One is indexing: structured, semi-structured, and unstructured data is collected and indexed by Lucene. The other is retrieval: the user enters search keywords, the index library is searched, and the results are returned to the user.

So what is an index?

It is like the pinyin index and the radical index in the Xinhua Dictionary, which are used to look up characters.

Likewise, a Lucene full-text index records which documents each word appears in. For example:

In this example, the keyword "Lucene" appears in documents 1 and 3. The keyword "Solr" appears in documents 1, 3, and 5. The keyword "Hadoop" appears in documents 3, 5, 7, 8, and 9.

We call this whole structure an "inverted index" (sometimes translated as "reverse index"). The list of documents linked to each keyword is what we call the inverted list.

What is an inverted index?

Inverted index: a mapping from strings to files, which is the reverse of the natural file-to-string mapping. In essence, it simply describes a mapping relationship.
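The reversal can be sketched in a few lines of Python. The forward mapping below is an assumption chosen so that it reproduces the Lucene/Solr/Hadoop example above; the helper name `invert` is hypothetical and has nothing to do with Lucene's API.

```python
from collections import defaultdict

# Forward mapping (document ID -> keywords), chosen to match the example above.
forward = {
    1: ["lucene", "solr"],
    3: ["lucene", "solr", "hadoop"],
    5: ["solr", "hadoop"],
    7: ["hadoop"],
    8: ["hadoop"],
    9: ["hadoop"],
}


def invert(forward_map):
    """Turn a document -> keywords mapping into keyword -> document IDs."""
    inverted = defaultdict(list)
    for doc_id in sorted(forward_map):
        for keyword in forward_map[doc_id]:
            inverted[keyword].append(doc_id)
    return dict(inverted)


inverted = invert(forward)
print(inverted["lucene"])  # [1, 3]
print(inverted["solr"])    # [1, 3, 5]
print(inverted["hadoop"])  # [3, 5, 7, 8, 9]
```

Each value in `inverted` plays the role of an inverted list: the documents in which that keyword appears.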

Create an index

So what is the procedure for creating a full-text index?

Creating a full-text index involves three steps, or three components:

The data to be retrieved (Document)

Word segmentation (Analyzer)

Index creation (Indexer)

Let's give an example.

The first step, the document data instance

My blog Space

Happybks's Lucene article

Happbks's Hadoop article

The second step, word segmentation technology. (We use standard participle here.) ）

Happybks| 's |lucene| |

Happbks| 's |hadoop| |

Note that after the standard participle, the Chinese is sliced by word, and the English uppercase characters are converted to lowercase.

Step 3: index creation.

Term	Id	Term	Id	Term	Id
我 (I)	1	happybks	2	happybks	3
的 (of)	1	的 (of)	2	的 (of)	3
博 (bo)	1	lucene	2	hadoop	3
客 (guest)	1	文 (text)	2	文 (text)	3
空 (empty)	1	章 (chapter)	2	章 (chapter)	3
间 (room)	1

Next, we merge the entries for identical terms.

Term	Id	Term	Id	Term	Id
我 (I)	1	happybks	2,3
的 (of)	1,2,3
博 (bo)	1	lucene	2	hadoop	3
客 (guest)	1	文 (text)	2,3
空 (empty)	1	章 (chapter)	2,3
间 (room)	1

This table is what we call an index.

Now let's look at how the index is used for retrieval.

Index retrieval

Retrieval is divided into four steps:

Search keywords (Keywords)

Word segmentation (Analyzer)

Search the index (Search)

Return the results

Let's walk through the steps with an example.

The first step, get the user search keywords

Lucene Articles

The second step, the use of Word segmentation technology

Lucene|-Wen | zhang

The third step is to retrieve the index.

As we can see from the above diagram, the document that contains all the word breaker elements in the inverted list is document 2.

Fourth step, return the result-the 2nd document.

This article has covered the general principles and process of full-text retrieval. As for which mathematical model Lucene uses and how it implements full-text indexing, I'll describe those in later articles in this series.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
