Lucene-Based Case Development: A First Look at Lucene

Last Update:2015-01-17 Source: Internet

Author: User

When reprinting, please cite the source: http://blog.csdn.net/xiaojimanman/article/details/42804713

Data classification:

The data we meet in daily life can be broadly divided into three categories: structured data, unstructured data, and semi-structured data:

Structured data: data with a fixed format or finite length, such as rows in a database, where a two-dimensional table structure logically expresses the data.

Unstructured data: data of variable length or without a fixed format, such as email, Word documents, audio, and other media files.

Semi-structured data: data that can be processed as structured data, or from which plain text can be extracted and processed as unstructured data, such as XML.

Different kinds of data call for different search methods. Structured data is the familiar case: it can be queried with SQL, for example "select * from student where Stuno like '2014%'", a simple statement that finds all students whose student number starts with 2014. But what about unstructured data? Can we use LIKE in SQL? Obviously not. The two common retrieval methods for unstructured data are sequential scanning and indexing. Even without testing, it is clear that sequential scanning is inefficient, because every query has to read through all of the data. The rest of this article introduces index-based retrieval of unstructured data.
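To make the cost of sequential scanning concrete, here is a minimal sketch in plain Python. It is not Lucene code; the function name sequential_scan is hypothetical, and the documents are the sample sentences used in the indexing example later in this article.

```python
# Sequential scan: check every document for the search string, one by one.
docs = [
    "How are you doing! I'm Xiao Li.",
    "Where is China?",
    "Who are you?",
    "Lucene Basic Knowledge Learning course.",
    "Are you studying at PetroChina?",
    "Where is PetroChina?",
]

def sequential_scan(documents, keyword):
    """Return the 1-based numbers of all documents that contain keyword."""
    needle = keyword.lower()
    hits = []
    for doc_no, text in enumerate(documents, start=1):
        # Every query re-reads every document, so the cost of a single
        # query grows with the total amount of text.
        if needle in text.lower():
            hits.append(doc_no)
    return hits

print(sequential_scan(docs, "PetroChina"))
```

Nothing is precomputed here, so the whole collection is read again for each new query. An index avoids this by doing the expensive work once, at indexing time.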

Index steps:

Index-based retrieval of unstructured data is also called full-text search. I roughly divide the process into two major steps:

Index creation (indexing): extracting information from structured or unstructured data and building an index from it, as shown in the left half of the figure.

Index search (search): looking up the created index according to the user's query and returning the results, as shown in the right half of the figure.

The left half is the index creation process: the source can be file system data, database data, web data, and so on, and creating the index produces the final index files. The right half is the retrieval process: take the user's query, look it up in the index, and return the search results.

Looking at this process, three questions naturally come up:

What is an index?

How do I create an index?

How do I retrieve an index?

Here are answers to these three questions.

What is an index

If you have never worked with indexes before, the many technical terms in this part can be hard to follow, so here is a simple example to illustrate what an index is:

The three small pictures should look familiar. Recall how we look up a Chinese character in the Xinhua Dictionary: use the syllable index or the radical index to find the page number of the character, then turn to that page and read its definition. Now imagine the dictionary without the syllable index or the radical index: the lookup would take far longer. This should give you a general idea of what an index is.

The middle part of the figure corresponds to the index discussed here. A mapping from strings to files is the reverse of the natural mapping from files to strings, so an index that stores this information is called an inverted index (reverse index).

The strings in the left half are generally called the dictionary. Each string points to a linked list of the documents that contain it, which is called the inverted list. The Xinhua Dictionary lookup described above is one instance of such an inverted index.
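The relationship between the file-to-string mapping and the string-to-file mapping can be sketched in a few lines of Python. This is only an illustration of the data structure, not how Lucene stores its index; the names forward and invert and the tiny three-document data set are assumptions made for the example.

```python
from collections import defaultdict

# Forward mapping: document id -> the terms it contains (file-to-string).
forward = {
    1: ["where", "is", "china"],
    2: ["who", "are", "you"],
    3: ["where", "is", "petrochina"],
}

def invert(forward_map):
    """Reverse the mapping: term -> list of ids of documents containing it."""
    inverted = defaultdict(list)
    for doc_id in sorted(forward_map):
        for term in forward_map[doc_id]:
            # Record each document at most once per term.
            if doc_id not in inverted[term]:
                inverted[term].append(doc_id)
    return dict(inverted)

index = invert(forward)
dictionary = sorted(index)   # the "dictionary": all distinct terms, sorted
print(dictionary)
print(index["where"])        # the inverted list for the term "where"
```

The sorted list of keys plays the role of the dictionary, and each value is the inverted list for one term.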

How to create an index

I summarize index creation in three steps: the data to be retrieved (Document), the tokenizer (Analyzer), and the index builder (Indexer), as shown in the figure:

Let's take a simple example to introduce the process.

Step 1: Sample document data:

How are you doing! I'm Xiao Li.

Where is China?

Who are you?

Lucene Basic Knowledge Learning course.

Are you studying at PetroChina?

Where is PetroChina?

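Step 2: Tokenization, here using StandardAnalyzer (the standard analyzer). The original sample sentences are Chinese, and StandardAnalyzer splits Chinese text into single characters, so each token below is the English gloss of one character: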

You | good | me | yes | little | lee |

China | in | where | |

You | are | who |

You | in | | | stone | oil | on | learning |
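The per-character splitting shown above can be imitated with a small regular expression. This is a rough stand-in for illustration, not Lucene's StandardAnalyzer: the pattern and the function name tokenize are assumptions, and the string 中石油 is my guess at the original Chinese keyword behind "PetroChina".

```python
import re

# Runs of Latin letters or digits form one token each;
# every CJK ideograph becomes a token of its own.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+|[\u4e00-\u9fff]")

def tokenize(text):
    """Split text into lowercase tokens, dropping punctuation and spaces."""
    return [token.lower() for token in TOKEN_PATTERN.findall(text)]

print(tokenize("Who are you?"))
print(tokenize("中石油"))
```

English text comes out as whole words, while the Chinese string comes out as three one-character tokens, which is the shape of the token lists in this step.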

Step 3: Index creation. First, build the dictionary from the tokens:

Then merge identical terms and link each one to its inverted list:

At this point, the index has been created.
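The three steps can be put together in a short Python sketch. It works on the English versions of the sample sentences with a simple word tokenizer, so the terms are whole words rather than single characters; the function names tokenize and build_index are hypothetical and this is not Lucene's indexing code.

```python
import re
from collections import defaultdict

# Step 1: the documents to index (numbered from 1).
docs = [
    "How are you doing! I'm Xiao Li.",
    "Where is China?",
    "Who are you?",
    "Lucene Basic Knowledge Learning course.",
    "Are you studying at PetroChina?",
    "Where is PetroChina?",
]

def tokenize(text):
    """Step 2: lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(documents):
    """Step 3: build the dictionary and merge terms into inverted lists."""
    postings = defaultdict(list)
    for doc_id, text in enumerate(documents, start=1):
        for term in tokenize(text):
            # Merge repeated terms: a document is listed once per term.
            if not postings[term] or postings[term][-1] != doc_id:
                postings[term].append(doc_id)
    # Sorting by term gives the dictionary order.
    return dict(sorted(postings.items()))

index = build_index(docs)
for term in ("are", "where", "petrochina"):
    print(term, "->", index[term])
```

Because documents are processed in order, each inverted list comes out sorted by document number, which makes lists easy to merge at search time.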

How to search

I summarize index search in four steps: get the search terms (Keywords), tokenize them (Analyzer), search the index (Search), and return the list of results, as shown in the figure:

We continue with the example above to walk through the search process.

Step 1: Sample keyword data

PetroChina

Step 2: Tokenization (because the index was created with the standard analyzer, the search must use the same analyzer; the tokens below are per-character English glosses of the Chinese keyword)

Medium | stone | oil |

Step 3: Search the index for matching records:

Step 4: Return the results list

Where is PetroChina?

Are you studying at PetroChina?
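The four search steps can be sketched in Python as follows. As before, this is an illustration under stated assumptions rather than Lucene's search API: it uses the English sample sentences, a simple word tokenizer, and hypothetical function names. It requires every query term to be present and does no relevance ranking, so results come back in document order.

```python
import re

docs = [
    "How are you doing! I'm Xiao Li.",
    "Where is China?",
    "Who are you?",
    "Lucene Basic Knowledge Learning course.",
    "Are you studying at PetroChina?",
    "Where is PetroChina?",
]

def tokenize(text):
    """The same tokenizer must be used for indexing and for searching."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(documents):
    """Map each term to the set of positions of documents containing it."""
    index = {}
    for doc_pos, text in enumerate(documents):
        for term in tokenize(text):
            index.setdefault(term, set()).add(doc_pos)
    return index

def search(index, documents, query):
    terms = tokenize(query)              # steps 1 and 2: keywords -> tokens
    if not terms:
        return []
    # Step 3: look up each term and keep documents containing all of them.
    matching = set.intersection(*(index.get(t, set()) for t in terms))
    # Step 4: return the matching documents.
    return [documents[pos] for pos in sorted(matching)]

index = build_index(docs)
print(search(index, docs, "PetroChina"))
```

The query is answered from the inverted lists alone; the document texts are read only to produce the final result list.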

The three questions above have now been answered with concrete examples. I have only been working with Lucene for two short years, and many of its principles are still not entirely clear to me, so this blog series will not go into much theory. If you want a deeper understanding, I recommend buying a reference book and studying the subject systematically.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
