Lucene-based case development: a first look at Lucene

If you reprint this article, please credit the source: http://blog.csdn.net/xiaojimanman/article/details/42804713


Data categories:

Data in daily life can be roughly divided into three categories: structured data, unstructured data, and semi-structured data.

Structured data: data with a fixed format or a limited length, such as rows stored in a database, which can be expressed logically as a two-dimensional table.

Unstructured data: data with an indefinite length or no fixed format, such as emails, Word documents, audio, and video.

Semi-structured data: data such as XML, which can either be processed as structured data or have its plain text extracted and be processed as unstructured data.

Different kinds of data call for different retrieval methods. Structured data is the familiar case, because it can be queried with SQL. For example, the statement "select * from student where stuno like '2014%'" finds all students whose student IDs start with 2014. But what about unstructured data? Can we simply use LIKE in SQL? Obviously not. The two common retrieval methods for unstructured data are sequential scan and indexing. Even without testing, it is clear that sequential scan is inefficient, because every query has to read every document. The rest of this article focuses on indexing and retrieving unstructured data.
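To make the cost of a sequential scan concrete, here is a minimal Python sketch. The function name `sequential_scan` is my own, and the sentences are borrowed from the sample documents used later in this article. Every query has to visit every document, so the work grows with the size of the collection.

```python
def sequential_scan(documents, keyword):
    """Return every document containing keyword, checking them one by one."""
    keyword = keyword.lower()
    hits = []
    for doc in documents:  # each query touches every document
        if keyword in doc.lower():
            hits.append(doc)
    return hits


docs = [
    "Hello! I'm Xiao Li.",
    "Where is China?",
    "Where is CNPC?",
]

print(sequential_scan(docs, "cnpc"))  # ['Where is CNPC?']
```

With three documents this is instant; with millions of long documents, repeating the full scan for each query is what an index is meant to avoid.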


Indexing procedure:

For unstructured data, index-based retrieval is also called full-text search. The process is roughly divided into two major steps:

Index creation (Indexing): extracting information from structured or unstructured data to create an index, as shown in the left half of the figure:

Search: searching the created index according to the user's query conditions and returning the query results, as shown in the right half of the figure:



The left half is the index creation process: file system data, database data, and web data are indexed to produce the final index files. The right half is the retrieval process: take the user's query conditions, search the index, and return the search results.

Looking more closely, three questions naturally come to mind:

What is an index?

How is an index created?

How is an index searched?

We will answer these three questions one by one.


What is an index?

If you have never worked with indexes before, this part can be hard to grasp, so here is a simple example of what an index is:

You are probably familiar with these three pictures. Recall how we look up a Chinese character in the Xinhua Dictionary: find the page number of the character through the syllable (pinyin) index or the radical index, then turn to that page to read the explanation of the character. Without the syllable index or the radical index, wouldn't the lookup take much longer? This should give you a rough idea of what an index is.

In fact, the middle part is equivalent to the index discussed here. The mapping from strings to files is the reverse of the mapping from files to strings, so an index that stores this information is called an inverted index (reverse index).


The left half stores the dictionary. Each string on the left points to a linked list of documents on the right, and this linked list is called an inverted table (posting list). Think about how the Xinhua Dictionary example corresponds to an inverted index.
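The relationship between the two mappings can be sketched in a few lines of Python. The file names and terms below are made up for illustration, and `invert` is a hypothetical helper, not part of Lucene. It takes a file-to-strings mapping and produces the string-to-files mapping described above.

```python
from collections import defaultdict

# Forward mapping: file -> strings it contains (hypothetical data).
forward = {
    "doc1.txt": ["lucene", "index"],
    "doc2.txt": ["lucene", "search"],
    "doc3.txt": ["index", "search"],
}


def invert(forward_map):
    """Turn file -> strings into string -> list of files (the inverted table)."""
    inverted = defaultdict(list)
    for filename in sorted(forward_map):
        for term in forward_map[filename]:
            if filename not in inverted[term]:
                inverted[term].append(filename)
    return dict(inverted)


inverted = invert(forward)
print(inverted["lucene"])  # ['doc1.txt', 'doc2.txt']
print(inverted["search"])  # ['doc2.txt', 'doc3.txt']
```

The keys of `inverted` play the role of the dictionary, and each value plays the role of the inverted table for that string.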


How to create an index

I summarize index creation in three steps: data (Document), word segmentation (Analyzer), and index creation (Indexer). See the figure for reference:


We will introduce this process through simple examples.


Step 1: data (Document). Sample documents:

Hello! I'm Xiao Li.

Where is China?

Who are you?

Basic Learning Courses of lucene.

Are you going to school in CNPC?

Where is CNPC?


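Step 2: word segmentation (Analyzer). StandardAnalyzer (standard word segmentation) is used here. The sample documents were originally written in Chinese, and StandardAnalyzer emits one token per Chinese character, so the tokens below are character-by-character glosses of the original sentences: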

You | good | Me | Yes | small | Li |

Medium | country | where | medium |

You | Yes | who |

Lucene | base | knowledge | learning | course |

You | in | medium | rock | oil | Upper | learning |

Medium | rock | oil | where | medium |
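The segmentation step can be imitated with a small regular expression. This is only a rough sketch of the behavior described above (lowercased words for Latin text, one token per Chinese character) and not Lucene's actual implementation. The name `tokenize` is my own, and the Chinese sentence is my guess at the original text behind "Where is CNPC?", included purely for illustration.

```python
import re

# One token per run of ASCII letters/digits, or per single CJK ideograph.
# Punctuation and whitespace match neither branch and are dropped.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+|[\u4e00-\u9fff]")


def tokenize(text):
    """Split text into lowercase word tokens and single-character CJK tokens."""
    return [match.group(0).lower() for match in TOKEN_PATTERN.finditer(text)]


print(tokenize("Basic Learning Courses of lucene."))
# ['basic', 'learning', 'courses', 'of', 'lucene']

print(tokenize("中石油在哪里？"))  # assumed original of "Where is CNPC?"
# ['中', '石', '油', '在', '哪', '里']
```

This is why the query "CNPC" later becomes three tokens: in the assumed Chinese original it is a three-character word.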


Step 3: index creation (Indexer). First, build the index dictionary:


Then merge identical terms into inverted tables:


Now, the index file has been created.
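The three steps can be put together in a short Python sketch. It uses the English versions of the six sample documents and a simple word-level tokenizer, so its terms are whole words rather than the single Chinese characters shown above. `tokenize` and `build_index` are hypothetical names, and the in-memory dictionary only stands in for Lucene's on-disk index files.

```python
import re
from collections import defaultdict

# Step 1: data. Document IDs start at 1, in the order listed in the article.
DOCUMENTS = [
    "Hello! I'm Xiao Li.",
    "Where is China?",
    "Who are you?",
    "Basic Learning Courses of lucene.",
    "Are you going to school in CNPC?",
    "Where is CNPC?",
]


def tokenize(text):
    """Step 2: word segmentation (lowercase, letters and digits only)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Step 3: build the dictionary and merge identical terms into inverted tables."""
    postings = defaultdict(list)
    for doc_id, text in enumerate(documents, start=1):
        for term in tokenize(text):
            # Merge repeated terms: each document ID appears once per term.
            if not postings[term] or postings[term][-1] != doc_id:
                postings[term].append(doc_id)
    # Sort the dictionary alphabetically, as a term dictionary usually is.
    return dict(sorted(postings.items()))


index = build_index(DOCUMENTS)
print(index["cnpc"])   # [5, 6]
print(index["where"])  # [2, 6]
```

Because documents are processed in ID order, every inverted table ends up sorted by document ID, which makes the tables easy to merge or intersect at search time.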


How to search

I summarize index retrieval in four steps: obtain the keyword (KeyWord), word segmentation (Analyzer), search the index (Search), and return the result list. See the figure for reference:


We will continue to talk about the search process based on the above examples.

Step 1: keyword (KeyWord). Sample query:

CNPC

Step 2: word segmentation (Analyzer). Because StandardAnalyzer was used when the index was created, the same analyzer must also be used at search time:

Medium | rock | oil |

Step 3: search the index and collect the matching records:


Step 4: return result list

Where is CNPC?

Are you going to school in CNPC?
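The four search steps can be sketched the same way. The block below repeats a minimal index builder so that it runs on its own. `search` is a hypothetical function that requires every query term to match, which is an assumption of this sketch rather than a statement about Lucene's default behavior.

```python
import re
from collections import defaultdict

DOCUMENTS = [
    "Hello! I'm Xiao Li.",
    "Where is China?",
    "Who are you?",
    "Basic Learning Courses of lucene.",
    "Are you going to school in CNPC?",
    "Where is CNPC?",
]


def tokenize(text):
    """The same segmentation must be used for indexing and for searching."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each term to the sorted list of document IDs that contain it."""
    postings = defaultdict(list)
    for doc_id, text in enumerate(documents, start=1):
        for term in set(tokenize(text)):
            postings[term].append(doc_id)
    return dict(postings)


def search(index, documents, query):
    """Segment the query, intersect the inverted tables, return the documents."""
    terms = tokenize(query)                      # step 2: word segmentation
    if not terms:
        return []
    matches = set(index.get(terms[0], []))       # step 3: search the index
    for term in terms[1:]:
        matches &= set(index.get(term, []))
    return [documents[doc_id - 1] for doc_id in sorted(matches)]  # step 4


index = build_index(DOCUMENTS)
print(search(index, DOCUMENTS, "CNPC"))
# ['Are you going to school in CNPC?', 'Where is CNPC?']
```

This sketch returns matches in document order and does no scoring. The result list in the article shows the two documents in the opposite order, presumably because Lucene ranks results by relevance.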


The three questions above have now been answered through concrete examples. Since I have been working with Lucene for only two years, many of its underlying principles are still not entirely clear to me, so this blog series will not go deep into them. If you want an in-depth understanding, I suggest buying a reference book and studying Lucene systematically.
