Full-text index ranking calculation

Source: Internet
Author: User
Tags: split words
Ranking Calculation Problems

The process of computing rank depends on a number of factors. Word breakers for different languages tokenize text differently. For example, one word breaker might split the string "dog-house" into "dog" and "house", while another keeps it as the single word "dog-house". This means that matching and ranking vary with the language specified, because not only the words differ but also the document length. The difference in document length can affect the ranking for all queries.
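The effect of the word breaker on document length can be illustrated with a minimal sketch. The two functions below are hypothetical stand-ins written for this example, not the word breakers that the full-text engine actually uses; they only show how the same text can yield different tokens and a different length.

```python
import re

def break_on_hyphen(text):
    # Hypothetical word breaker A: treats whitespace and hyphens as separators.
    return [w for w in re.split(r"[\s\-]+", text) if w]

def keep_hyphen(text):
    # Hypothetical word breaker B: splits on whitespace only,
    # so hyphenated compounds stay whole.
    return text.split()

doc = "the dog-house is red"
tokens_a = break_on_hyphen(doc)  # "dog" and "house" are separate words
tokens_b = keep_hyphen(doc)      # "dog-house" is a single word

print(len(tokens_a), tokens_a)
print(len(tokens_b), tokens_b)
```

Under breaker A a query for "dog" matches the document and the document is one word longer; under breaker B only "dog-house" matches.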

Statistics such as IndexRowCount can vary widely. For example, if a catalog has 2 billion rows in its master index and a new document is then indexed into an in-memory intermediate index, the rank of that document, which is based on the number of documents in the in-memory index, may be skewed compared with the ranks of documents in the master index. For this reason, we recommend that you use the ALTER FULLTEXT CATALOG ... REORGANIZE Transact-SQL statement to merge these indexes into a master index. The Full-Text Engine also merges indexes automatically, based on parameters such as the number and size of the intermediate indexes.

MaxOccurrence values are normalized into one of 32 ranges. This means, for example, that a document 50 words long is treated the same as a document 100 words long. The following table is used for normalization. Because both document lengths fall between the adjacent table values 32 and 128, they are treated as having the same effective length, 128 (32 < docLength <= 128).

{ 16, 32, 128, 256, 512, 725, 1024, 1450, 2048, 2896, 4096, 5792, 8192, 11585, 16384, 23170, 28000, 32768, 39554, 46340, 55938, 65536, 92681, 131072, 185363, 262144, 370727, 524288, 741455, 1048576, 2097152, 4194304 };
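The lookup against this table can be sketched as follows. This is an illustration only: the function name `normalize_length` is made up for the example, and clamping lengths above the largest table value to that value is an assumption, since the text does not say what happens there.

```python
from bisect import bisect_left

# The 32 normalization boundaries from the table above.
RANGES = [
    16, 32, 128, 256, 512, 725, 1024, 1450, 2048, 2896, 4096, 5792,
    8192, 11585, 16384, 23170, 28000, 32768, 39554, 46340, 55938,
    65536, 92681, 131072, 185363, 262144, 370727, 524288, 741455,
    1048576, 2097152, 4194304,
]

def normalize_length(doc_length):
    # Return the smallest table value that is >= doc_length,
    # matching the rule lower < docLength <= upper.
    i = bisect_left(RANGES, doc_length)
    # Assumption: lengths beyond the last boundary use the last boundary.
    return RANGES[min(i, len(RANGES) - 1)]

print(normalize_length(50), normalize_length(100))
```

Both 50 and 100 fall in the range 32 < docLength <= 128, so both calls return 128.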

CONTAINSTABLE ranking

The following algorithm is used for ranking:

StatisticalWeight = Log2( ( 2 + IndexedRowCount ) / KeyRowCount )
Rank = min( MaxQueryRank, HitCount * 16 * StatisticalWeight / MaxOccurrence )
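As a worked illustration, the two formulas translate directly into code. The value of `MAX_QUERY_RANK` is an assumption (1000 is used here only as a placeholder, since the text does not state it), and whether the engine rounds intermediate results is not covered by the source.

```python
import math

MAX_QUERY_RANK = 1000  # assumption: the source does not give this value

def statistical_weight(indexed_row_count, key_row_count):
    # StatisticalWeight = Log2( ( 2 + IndexedRowCount ) / KeyRowCount )
    return math.log2((2 + indexed_row_count) / key_row_count)

def contains_rank(hit_count, indexed_row_count, key_row_count, max_occurrence):
    # Rank = min( MaxQueryRank, HitCount * 16 * StatisticalWeight / MaxOccurrence )
    weight = statistical_weight(indexed_row_count, key_row_count)
    return min(MAX_QUERY_RANK, hit_count * 16 * weight / max_occurrence)

# 1022 indexed rows, the key appears in 4 of them: weight = log2(1024 / 4) = 8
print(statistical_weight(1022, 4))
# 3 hits in a document whose normalized MaxOccurrence is 128
print(contains_rank(3, 1022, 4, 128))
```

A rarer key (smaller KeyRowCount) gets a larger statistical weight, and a larger normalized MaxOccurrence lowers the rank for the same number of hits.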

Phrase matches are ranked in the same way as individual keys, except that KeyRowCount (the number of rows containing the phrase) is estimated, and the estimate can be higher than the actual number.


ISABOUT ranking

CONTAINSTABLE supports querying for weighted terms by using the ISABOUT option. In traditional information retrieval terminology, ISABOUT is a vector-space query. The default ranking algorithm is Jaccard, a widely known formula. The rank is computed for each term in the query and then combined as described below.

ContainsRank = same formula used for CONTAINSTABLE ranking of a single term (above).
Weight = the weight specified in the query for each term. Default weight is 1.
WeightedSum = Σ[key=1 to n] ContainsRankKey * WeightKey
Rank = ( MaxQueryRank * WeightedSum ) / ( ( Σ[key=1 to n] ContainsRankKey^2 ) + ( Σ[key=1 to n] WeightKey^2 ) - ( WeightedSum ) )
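The combination step can be sketched as below. The function name `isabout_rank`, the default of 1000 for `max_query_rank`, and the handling of a zero denominator are all assumptions made for this example; the per-term ContainsRank values are passed in as plain numbers rather than computed.

```python
def isabout_rank(contains_ranks, weights, max_query_rank=1000):
    # WeightedSum = sum over keys of ContainsRankKey * WeightKey
    weighted_sum = sum(r * w for r, w in zip(contains_ranks, weights))
    # Denominator: sum of squared ranks + sum of squared weights - WeightedSum
    denominator = (
        sum(r * r for r in contains_ranks)
        + sum(w * w for w in weights)
        - weighted_sum
    )
    if denominator == 0:
        # Assumption: treat an empty or all-zero query as rank 0.
        return 0.0
    return max_query_rank * weighted_sum / denominator

# Rank vector identical to the weight vector: the ratio is 1.
print(isabout_rank([1.0, 1.0], [1.0, 1.0]))
# Only the first term matches: the ratio drops.
print(isabout_rank([2.0, 0.0], [1.0, 1.0]))
```

The expression has the Jaccard (Tanimoto) form a·b / (|a|² + |b|² − a·b), so it reaches MaxQueryRank only when the vector of per-term ranks equals the vector of weights.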

FREETEXTTABLE ranking

FREETEXTTABLE ranking is based on the Okapi BM25 ranking formula. FREETEXTTABLE queries add words to the query through inflectional generation (inflected forms of the original query words). These words are treated as separate words with no special relationship to the words from which they were generated. Synonyms generated by the thesaurus feature are treated as separate, equally weighted terms. Each word in the query contributes to the rank.

Rank = Σ[Terms in Query] w * ( ( ( k1 + 1 ) * tf ) / ( K + tf ) ) * ( ( ( k3 + 1 ) * qtf ) / ( k3 + qtf ) )

Where:
w is the Robertson-Sparck Jones weight. In simplified form, w is defined as:
w = log10( ( ( r + 0.5 ) * ( N - R + r + 0.5 ) ) / ( ( R - r + 0.5 ) * ( n - r + 0.5 ) ) )
N is the number of indexed rows for the property being queried.
n is the number of rows containing the word.
K is ( k1 * ( ( 1 - b ) + ( b * dl / avdl ) ) ).
dl is the property length, in word occurrences.
avdl is the average length of the property being queried, in word occurrences.
k1, b, and k3 are the constants 1.2, 0.75, and 8.0, respectively.
tf is the frequency of the word in the queried property in a specific row.
qtf is the frequency of the term in the query.
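The formula above can be written out as a short sketch. The text does not define r and R; in the standard Robertson-Sparck Jones weight they are relevance-feedback counts, and this example assumes both are 0 (no relevance information). The function names and the dictionary layout for query terms are hypothetical.

```python
import math

K1, B, K3 = 1.2, 0.75, 8.0  # constants given in the text

def rsj_weight(N, n, R=0, r=0):
    # Robertson-Sparck Jones weight; R = r = 0 is an assumption.
    return math.log10(
        ((r + 0.5) * (N - R + r + 0.5)) / ((R - r + 0.5) * (n - r + 0.5))
    )

def bm25_term(tf, qtf, dl, avdl, N, n):
    # K scales term frequency by document length relative to the average.
    K = K1 * ((1 - B) + B * dl / avdl)
    w = rsj_weight(N, n)
    return w * ((K1 + 1) * tf / (K + tf)) * ((K3 + 1) * qtf / (K3 + qtf))

def freetext_rank(terms, dl, avdl, N):
    # Sum the contribution of every term in the query.
    return sum(bm25_term(t["tf"], t["qtf"], dl, avdl, N, t["n"]) for t in terms)

# Hypothetical query with two terms against a row of average length.
query_terms = [
    {"tf": 2, "qtf": 1, "n": 10},   # rare word, appears twice in the row
    {"tf": 1, "qtf": 1, "n": 500},  # common word, appears once
]
print(freetext_rank(query_terms, dl=100, avdl=100, N=1000))
```

With R = r = 0 the weight reduces to log10( ( N + 0.5 ) / ( n + 0.5 ) ), so rare words contribute more than common ones, and rows longer than average are penalized through K.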
