Lucene-based case development: the mathematical model of the index

Last Update:2015-01-18 Source: Internet

Author: User

When reprinting, please cite the source: http://blog.csdn.net/xiaojimanman/article/details/42818185

After the previous posts you should have a general picture of how a search works as a process. This post briefly introduces the mathematical model behind Lucene's implementation.

As mentioned earlier, the index that Lucene builds is an inverted index, made up of a dictionary and postings lists (the actual structure is far more complex than this). So what is the mathematical model of the index? Before getting into it, you should be familiar with the following terms.

Document, field, term: In the index creation examples from the previous posts, each sentence can be regarded as a document, and each such document has only one field. Using standard word segmentation (tokenization), we split the value of that field into many terms. Document, field, and term are the three notions we need to understand. (Pause here and think: what are the documents, fields, and terms in our novel case study?)
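To make these three notions concrete, here is a minimal Python sketch of documents with a single field being split into terms and collected into a toy inverted index. The sample sentences, the field name "content", and the helper names tokenize and build_inverted_index are made up for illustration. The regular-expression tokenizer is only a crude stand-in for a real analyzer, and Lucene's actual index structure is far more elaborate.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Split a field value into lowercase terms (a crude stand-in for an analyzer)."""
    return re.findall(r"[a-z0-9]+", text.lower())

# Each document has exactly one field, "content" (hypothetical sample data).
documents = {
    0: {"content": "Lucene builds an inverted index"},
    1: {"content": "An inverted index maps terms to documents"},
}

def build_inverted_index(docs, field="content"):
    """Map each term to the sorted list of ids of the documents that contain it."""
    postings = defaultdict(set)
    for doc_id, doc in docs.items():
        for term in tokenize(doc[field]):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

index = build_inverted_index(documents)
print(index["inverted"])  # [0, 1]
print(index["lucene"])    # [0]
```

The keys of index play the role of the dictionary, and each value plays the role of a postings list.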

Calculating the term weight

As described above, a document can be split into multiple terms (different segmentation techniques yield different terms), and different terms differ in how important they are to the document. Two main factors affect the importance of a single term in a document:

1. Term Frequency (TF): the number of times the term appears in this document. The larger the TF, the more important the term.

2. Document Frequency (DF): the number of documents that contain the term. The larger the DF, the less important the term.

The effect of these two factors on the weight is easy to understand. It is like our own skills: a skill you have mastered in depth matters more to your work than one you know only superficially, and if only you or a few people have a certain skill, you are very competitive in that area (this seems to have drifted off topic). Let's take a look at the formula:
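A commonly cited textbook form of this weighting is given below. Treating it as the formula meant here is an assumption, and the symbols are chosen for illustration: w_{t,d} is the weight of term t in document d, tf_{t,d} is the number of times t appears in d, n is the total number of documents, and df_t is the number of documents that contain t.

```latex
w_{t,d} = tf_{t,d} \times \log\left(\frac{n}{df_t}\right)
```

A term that appears in every document has df_t = n, so its weight is 0 no matter how often it occurs.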

Of course, this is only a simple, typical formula; Lucene's actual implementation differs slightly.
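To see the two factors at work, here is a minimal Python sketch that assumes the textbook weighting w = tf * log(n / df), where n is the total number of documents. It is not Lucene's scoring code, and the function name term_weights and the sample documents are made up for illustration.

```python
import math
from collections import Counter

def term_weights(docs):
    """Return one {term: weight} dict per document, using w = tf * log(n / df)."""
    n = len(docs)
    # df: number of documents that contain each term at least once
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # tf: occurrences of each term in this document
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

# Hypothetical documents, already split into terms.
docs = [
    ["lucene", "index", "index"],
    ["lucene", "search"],
    ["lucene", "index", "model"],
]
w = term_weights(docs)
for doc_weights in w:
    print(doc_weights)
```

In this sample, "lucene" appears in all three documents, so its weight is 0 everywhere, while "search" appears in only one document and gets the largest weight for a single occurrence.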

Vector Space Model (VSM)

We can think of a document as a collection of terms, each with a weight, as follows:

Document = {term1, term2, term3 ... termN}

DocumentVector = {weight1, weight2, weight3 ... weightN}

We place the documents in an n-dimensional vector space: every document becomes an n-dimensional vector, and the coordinate of document D on axis m is the weight of term m in document D (0 if the term does not appear in D).

This way, to measure how related two documents are, we can look at the angle between their vectors: the smaller the angle, the higher the relevance; the larger the angle, the lower the relevance. The same n-dimensional vector representation can also be used for the user's query.
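The mapping described above can be sketched in a few lines of Python. The weights below are made-up numbers rather than values computed by Lucene, and the helper names to_vector and angle_degrees are hypothetical.

```python
import math

def to_vector(weights, vocabulary):
    """Place a {term: weight} dict on the shared axes; absent terms get 0."""
    return [weights.get(term, 0.0) for term in vocabulary]

def angle_degrees(u, v):
    """Angle between two non-zero vectors, in degrees."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cos))

# Hypothetical term weights for three documents.
doc_a = {"lucene": 1.0, "index": 2.0}
doc_b = {"lucene": 1.0, "index": 1.0, "model": 1.0}
doc_c = {"cooking": 3.0}

# One axis per distinct term across all documents.
vocabulary = sorted(set(doc_a) | set(doc_b) | set(doc_c))
vec_a, vec_b, vec_c = (to_vector(d, vocabulary) for d in (doc_a, doc_b, doc_c))

print(vocabulary)                   # ['cooking', 'index', 'lucene', 'model']
print(angle_degrees(vec_a, vec_b))  # shares terms with doc_a: smaller angle
print(angle_degrees(vec_a, vec_c))  # no shared terms: vectors are orthogonal
```

Documents that share weighted terms end up at a small angle, while documents with no terms in common are at a right angle to each other.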

Searching the index

In this mathematical model, retrieving documents is converted into measuring the angle between two vectors. For that, we also need to build a vector in the same space from the user's search keywords.

From this, the smaller the angle between the two vectors, the greater the relevance, and the larger the angle, the smaller the relevance. Because the cosine grows as the angle shrinks, we use the cosine of the angle as the relevance score. So we can get the relevance between the query string and each record in the index by computing the cosine of the angle between the query vector and each document vector, and then sort by that score to get the final result.
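The scoring and sorting step can be sketched in Python as follows. The cosine is the dot product of the two vectors divided by the product of their lengths. The vectors are stored as sparse {term: weight} dicts with made-up weights, the names cosine and search are hypothetical, and Lucene's real scoring differs in its details.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector is treated as unrelated to everything
    return dot / (norm_u * norm_v)

def search(query, doc_vectors):
    """Return (doc_id, score) pairs sorted by descending cosine score."""
    scored = [(doc_id, cosine(query, vec)) for doc_id, vec in doc_vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical document vectors and a one-term query vector.
doc_vectors = {
    "doc1": {"lucene": 1.0, "index": 2.0},
    "doc2": {"lucene": 1.0, "search": 1.0},
    "doc3": {"cooking": 3.0},
}
query = {"index": 1.0}

results = search(query, doc_vectors)
for doc_id, score in results:
    print(doc_id, round(score, 3))
```

Only doc1 contains the query term, so it is ranked first; the other two documents score 0 and keep their original order because Python's sort is stable.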

After the upcoming introduction to the Lucene file structure, this series will stop covering Lucene's basic principles; the later posts will briefly introduce the APIs that the case study may use. So if you want to know more about the principles of Lucene, please refer to other learning materials on the web or to books on the subject.

PS: It is getting late, so I will continue with the file structure part tomorrow.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
