Lucene-based case development: the mathematical model of the index

Source: Internet
Author: User

When reprinting, please cite the source: http://blog.csdn.net/xiaojimanman/article/details/42818185

The previous posts should have given you a general picture of how a search is processed. This post briefly introduces the mathematical model behind Lucene's implementation.

As mentioned earlier, the index that Lucene builds is an inverted index, made up of a dictionary and posting lists (the actual structure is much more complex than this). So what is the mathematical model behind the index? Before getting into that, you should be familiar with the following terms.

Document: in the index-creation examples from the previous post, each sentence can be regarded as a document. Those documents have only one field. Using standard word segmentation, we split the value of that field into many terms. Document, field, and term are the three concepts we need to understand. (Think about it: what are the documents, fields, and terms in our novel case study?)
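To make the relationship between documents, terms, and the inverted index concrete, here is a minimal Python sketch. It is only an illustration: the `tokenize` and `build_inverted_index` helpers are hypothetical names, the whitespace tokenizer is a stand-in for a real analyzer, and Lucene's actual index structure is far more elaborate.

```python
from collections import defaultdict


def tokenize(text):
    """Naive tokenizer (an assumption): lowercase, then split on whitespace."""
    return text.lower().split()


def build_inverted_index(documents):
    """Map each term to the sorted list of ids of the documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in enumerate(documents):
        for term in tokenize(text):
            postings[term].add(doc_id)
    # The dict keys play the role of the dictionary, the values of the posting lists.
    return {term: sorted(ids) for term, ids in postings.items()}


# Each sentence is one document with a single field.
documents = [
    "lucene is a search library",
    "lucene builds an inverted index",
    "a search needs an index",
]
index = build_inverted_index(documents)
print(index["lucene"])  # ids of the documents that contain "lucene"
```

Looking up a term in `index` returns the documents that contain it, which is the basic operation a search performs before any scoring takes place.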


The term weight calculation process

As described above, a document is split into multiple terms (different word segmentation techniques produce different terms), and different terms differ in how important they are to the document. Two main factors affect the importance of a term in a document:

1. Term frequency (TF): the number of times the term appears in this document. The larger the TF, the more important the term.

2. Document frequency (DF): the number of documents that contain the term. The larger the DF, the less important the term.

The effect of these two factors on the weight is easy to understand with an analogy to our own skills. A skill you have mastered deeply matters more to your work than one you know only superficially (this corresponds to TF), and if only you or a few people have a certain skill, you are very competitive in that area (this corresponds to DF). A typical formula is:

w(t, d) = tf(t, d) × log(n / df(t))

where w(t, d) is the weight of term t in document d, tf(t, d) is the number of times t appears in d, n is the total number of documents, and df(t) is the number of documents that contain t.

Of course, this is only a simple, typical formula; Lucene's actual implementation differs slightly.
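The weight calculation can be sketched in a few lines of Python. This sketch assumes the common textbook form weight = tf × log(n / df), where n is the total number of documents; it is not Lucene's actual scoring code, and the `term_weights` helper and the sample documents are made up for illustration.

```python
import math
from collections import Counter


def term_weights(documents):
    """Return one {term: weight} dict per document, using tf * log(n / df)."""
    n = len(documents)
    tokenized = [doc.lower().split() for doc in documents]

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for terms in tokenized:
        df.update(set(terms))

    weights = []
    for terms in tokenized:
        tf = Counter(terms)  # term frequency within this document
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights


documents = ["lucene lucene search", "search index", "search engine"]
weights = term_weights(documents)
for doc_id, doc_weights in enumerate(weights):
    print(doc_id, doc_weights)
```

In this sample, "search" appears in every document, so log(n / df) is zero and the term carries no weight, while "lucene" appears twice in a single document and gets the largest weight. This matches the intuition behind TF and DF described above.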


Vector space model (VSM)

We can think of a document as a collection of terms, each with a weight, as follows:

Document = {term1, term2, term3, ..., termN}

DocumentVector = {weight1, weight2, weight3, ..., weightN}

We place every document in the same n-dimensional vector space, with one dimension per distinct term. The coordinate of document D on the m-th axis is the weight of the m-th term in document D (zero if D does not contain that term).


To measure how related two documents are, we can then look at the angle between their vectors: the smaller the angle, the more related the documents, and the larger the angle, the less related. The user's query can be turned into an n-dimensional vector in the same way.
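The angle comparison can be illustrated with a short Python sketch that computes the cosine of the angle between two weight vectors. The vectors here are invented sample values over a hypothetical three-term vocabulary, and the `cosine` helper is an illustrative name, not part of any Lucene API.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an all-zero vector has no direction; treat it as unrelated
    return dot / (norm_u * norm_v)


# Hypothetical weights over the same three terms.
doc_a = [1.0, 2.0, 0.0]
doc_b = [2.0, 4.0, 0.0]  # same direction as doc_a, only longer
doc_c = [0.0, 0.0, 3.0]  # shares no terms with doc_a

print(cosine(doc_a, doc_b))  # close to 1: very small angle
print(cosine(doc_a, doc_c))  # 0: the vectors are orthogonal
```

Because the cosine depends only on direction, a longer document that uses the same terms in the same proportions is treated as equally related.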


Searching the index

In this mathematical model, retrieving documents is reduced to computing the angle between two vectors. To do that, we also need to build a vector from the user's search keywords, in the same n-dimensional space as the documents.


The smaller the angle between two vectors, the higher the relevance, and the larger the angle, the lower the relevance. We therefore use the cosine of the angle as the relevance score. By computing the cosine between the query vector and the vector of each document in the index, we obtain the relevance of every document to the query string, and sorting by that score gives the final result.
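The whole retrieval step can be sketched as follows. This is a simplified illustration under stated assumptions, not Lucene's scoring: weights use the textbook tf × log(n / df) form, terms are split on whitespace, query terms that appear in no document are ignored, and `build_vectors`, `cosine`, and `search` are hypothetical helper names.

```python
import math
from collections import Counter


def build_vectors(documents, query):
    """Return sparse {term: weight} vectors for every document and for the query."""
    tokenized = [doc.lower().split() for doc in documents]
    n = len(tokenized)
    df = Counter()
    for terms in tokenized:
        df.update(set(terms))

    def vectorize(terms):
        # Terms unknown to the index have no dimension, so they are skipped.
        tf = Counter(t for t in terms if t in df)
        return {t: count * math.log(n / df[t]) for t, count in tf.items()}

    return [vectorize(terms) for terms in tokenized], vectorize(query.lower().split())


def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def search(documents, query):
    """Return (doc_id, score) pairs sorted by descending cosine score."""
    doc_vectors, query_vector = build_vectors(documents, query)
    scored = [(cosine(query_vector, vec), doc_id) for doc_id, vec in enumerate(doc_vectors)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(doc_id, score) for score, doc_id in scored]


documents = [
    "lucene index structure",
    "cooking pasta at home",
    "lucene query syntax and index tuning",
]
results = search(documents, "lucene index")
for doc_id, score in results:
    print(doc_id, round(score, 4))
```

Both the first and the third document contain the two query terms, but the first ranks higher because the query terms make up a larger share of its vector, so the angle to the query is smaller. The second document shares no terms with the query and scores zero.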


The next post introduces Lucene's file structure. After that, this series will no longer cover Lucene's basic principles and will only briefly introduce the APIs that the case study may use. If you want to learn more about how Lucene works, please refer to other material online or to books on the subject.


PS: It is getting late; I will continue with the file structure part tomorrow.
