Lucene-based case development: a first look at Lucene

Source: Internet
Author: User

When reprinting, please cite the source: http://blog.csdn.net/xiaojimanman/article/details/42804713


Data classification:

The data we meet in daily life can be broadly divided into three categories: structured data, unstructured data, and semi-structured data:

Structured data: data with a fixed format or limited length, such as rows in a database. In a database, such data can be expressed logically as a two-dimensional table.

Unstructured data: data of variable length or without a fixed format, such as emails, Word documents, audio, and other media files.

Semi-structured data: data that can either be processed as structured data or have its plain text extracted and processed as unstructured data, such as XML.

Different kinds of data call for different search methods. Structured data is probably familiar: it can be retrieved with ordinary SQL, for example "select * from student where stuno like '2014%'", a simple statement that finds all students whose student number starts with 2014. What about unstructured data? Use LIKE in SQL? Obviously not. Two retrieval methods are commonly used for unstructured data: sequential scanning and indexing. Even without testing, it is clear that sequential scanning is inefficient, because every document has to be read in full for every query. The rest of this article introduces index-based retrieval of unstructured data.
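To make the cost of sequential scanning concrete, here is a minimal Python sketch. The function name sequential_scan and the three-document corpus are made up for illustration; they are not part of Lucene.

```python
def sequential_scan(documents, keyword):
    """Return the IDs of all documents whose text contains `keyword`."""
    needle = keyword.lower()
    hits = []
    for doc_id, text in documents.items():
        # Every document is read in full for every query,
        # so the cost grows with the total amount of text.
        if needle in text.lower():
            hits.append(doc_id)
    return hits


# Hypothetical mini-corpus, for illustration only.
documents = {
    1: "Where is China?",
    2: "Who are you?",
    3: "Where is PetroChina?",
}

print(sequential_scan(documents, "petrochina"))  # [3]
```

With three documents this is instant, but the loop touches all of the text on every query, which is what an index is meant to avoid.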


Index steps:

Index-based retrieval of unstructured data is also called full-text search. I roughly divide the process into two major steps:

Index creation (indexing): extracting information from structured or unstructured data and building an index from it, shown in the left half of the diagram.

Index search (search): querying the created index according to the user's query criteria and returning the results, shown in the right half of the diagram.



The left half is the index creation process: data from the file system, databases, the Web, and so on is indexed, and the index files are produced. The right half is the user's retrieval process: take the user's query criteria, search the index, and return the results.

Looking at this closely, three questions come to mind:

What is an index?

How do I create an index?

How do I retrieve an index?

The answers to these three questions follow.


What is an index

If you have never worked with an index before, the terminology in this part can be hard to follow, so here is a simple example of what an index is:

The three small pictures should look familiar. Recall how we look up a Chinese character in the Xinhua Dictionary: find the page number of the character through the syllable index or the radical index, turn to that page, and read the explanation of the character. Now imagine there were no syllable index or radical index: how much longer would the lookup take? This should give you a rough idea of what an index is.

In fact, the middle part of the picture is the equivalent of the index discussed here. A mapping from strings to files is the reverse of the mapping from files to strings, so we call this structure an inverted index.


The left half is generally called the dictionary. Each string on the left points to a linked list of documents on the right, which is called the inverted list (also known as a posting list). The Xinhua Dictionary example above is an instance of such an inverted index.
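The idea of reversing a file-to-string mapping into a string-to-file mapping can be sketched in a few lines of Python. The document IDs and terms below are invented for illustration, and a plain dict of lists stands in for Lucene's on-disk structures, which are far more compact.

```python
from collections import defaultdict

# Forward mapping (file -> strings): what each document contains.
forward = {
    "doc1": ["lucene", "index"],
    "doc2": ["index", "search"],
    "doc3": ["lucene", "search"],
}


def invert(forward_map):
    """Turn a file -> strings mapping into a string -> files mapping."""
    inverted = defaultdict(list)
    for doc_id, terms in forward_map.items():
        for term in terms:
            if doc_id not in inverted[term]:
                inverted[term].append(doc_id)
    return dict(inverted)


inverted = invert(forward)
dictionary = sorted(inverted)   # the "dictionary": all distinct strings
print(dictionary)               # ['index', 'lucene', 'search']
print(inverted["lucene"])       # the inverted list for one string: ['doc1', 'doc3']
```

Looking up a string is now a single dictionary access instead of a pass over every document.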


How to create an index

I summarize index creation as three steps: the data to be retrieved (Document), the tokenizer (Analyzer), and index building (Indexer), as shown in the diagram:


Let's take a simple example to introduce the process.


Step 1: sample document data:

How are you doing! I'm Xiao Li.

Where is China?

Who are you?

Lucene Basic Knowledge Learning course.

Are you studying at PetroChina?

Where is PetroChina?


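Step 2: tokenization, here using StandardAnalyzer (the standard analyzer). The sample documents were originally written in Chinese, and StandardAnalyzer splits Chinese text into single characters, so each token below is the literal English gloss of one character (for example, "Medium | stone | oil" stands for the three characters of the Chinese name of PetroChina); empty slots are characters that were lost in translation: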

You | good | me | yes | little | lee |

China | in | where | |

You | are | who |

lucene| | knowledge | learning | education | learning | course |

You | in | | | stone | oil | on | learning |

Medium | stone | oil | in | where | |
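The per-character tokens above can be imitated with a small Python sketch. This is not how StandardAnalyzer is implemented (it follows Unicode text segmentation rules); it is a simplified stand-in that keeps runs of ASCII letters and digits together and emits each Chinese character as its own token. The sample string 中石油 is an assumption: it is the Chinese short name of PetroChina, inferred from the glosses "Medium | stone | oil".

```python
import re

# One token per run of ASCII letters/digits, or per single CJK ideograph.
# Punctuation and whitespace match neither alternative and are dropped.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+|[\u4e00-\u9fff]")


def tokenize(text):
    """Simplified stand-in for an analyzer: split and lowercase."""
    return [token.lower() for token in TOKEN_PATTERN.findall(text)]


print(tokenize("Lucene 中石油"))  # ['lucene', '中', '石', '油']
```

Splitting Chinese text into single characters is crude but needs no word list, which is why it is a common default for text without spaces between words.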


Step 3: index creation. First, build the dictionary from the tokens:


Step 3, continued: merge identical terms and build the inverted list:


By now, the index file has been created.
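The three steps above (documents, tokenization, index building) can be put together in a short Python sketch. To stay readable it works on the English versions of the six sample documents and tokenizes by word, which is an assumption made for illustration; the original walk-through tokenizes Chinese text per character, and real Lucene index files are far more elaborate than a dict.

```python
import re
from collections import defaultdict

# Step 1: the sample documents, numbered in the order they appear.
documents = {
    1: "How are you doing! I'm Xiao Li.",
    2: "Where is China?",
    3: "Who are you?",
    4: "Lucene Basic Knowledge Learning course.",
    5: "Are you studying at PetroChina?",
    6: "Where is PetroChina?",
}


def tokenize(text):
    """Step 2: lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Step 3: merge identical terms and record which documents contain them."""
    postings = defaultdict(list)
    for doc_id in sorted(docs):
        for term in tokenize(docs[doc_id]):
            # Documents are visited in ascending order, so checking the last
            # entry is enough to avoid listing a document twice.
            if not postings[term] or postings[term][-1] != doc_id:
                postings[term].append(doc_id)
    # Sorting the terms gives the dictionary its alphabetical order.
    return dict(sorted(postings.items()))


index = build_index(documents)
for term, doc_ids in index.items():
    print(term, "->", doc_ids)
```

Each printed line is one dictionary entry followed by its inverted list, for example the term petrochina pointing to documents 5 and 6.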


How to search

I summarize index retrieval as four steps: get the search terms (Keywords), tokenize them (Analyzer), search the index (Search), and return the result list, as shown in the diagram:


Let's continue with the example above to walk through the search process.

Step 1: sample keyword:

PetroChina

Step 2: tokenization (because StandardAnalyzer was used when the index was created, the same analyzer should be used at search time):

Medium | stone | oil |

Step 3: search the index for matching records:


Step 4: return the result list:

Where is PetroChina?

Are you studying at PetroChina?
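The four search steps can be sketched in Python as follows. As before, this works on the English versions of the sample documents with word-level tokens, which is an assumption for illustration. The sketch requires every query token to be present (an AND query) and does no relevance ranking, so results come back in document order rather than in the order shown above.

```python
import re

documents = {
    1: "How are you doing! I'm Xiao Li.",
    2: "Where is China?",
    3: "Who are you?",
    4: "Lucene Basic Knowledge Learning course.",
    5: "Are you studying at PetroChina?",
    6: "Where is PetroChina?",
}


def tokenize(text):
    """The same tokenizer must be used for indexing and for searching."""
    return re.findall(r"[a-z0-9]+", text.lower())


# Index creation: term -> set of IDs of the documents that contain it.
index = {}
for doc_id, text in documents.items():
    for term in tokenize(text):
        index.setdefault(term, set()).add(doc_id)


def search(query):
    terms = tokenize(query)                        # step 2: tokenize the keywords
    if not terms:
        return []
    posting_sets = [index.get(term, set()) for term in terms]
    matching = set.intersection(*posting_sets)     # step 3: look up and combine
    return [documents[doc_id] for doc_id in sorted(matching)]  # step 4: results


print(search("Petrochina"))  # step 1: the user's keyword
# ['Are you studying at PetroChina?', 'Where is PetroChina?']
```

Because the query goes through the same tokenizer as the documents, "Petrochina" and "PetroChina" both become the term petrochina and match the same inverted list.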


The three questions above have now been answered with concrete examples. I have only been working with Lucene for about two years and many of its principles are still not entirely clear to me, so this blog series will not go deep into the underlying theory. If you want a deeper understanding, I recommend buying a reference book and studying it systematically.
