Full-Text Index Ranking Calculation

Last Update:2018-12-06 Source: Internet

Author: User



Ranking Calculation Problems

Computing a rank depends on several factors. Word breakers for different languages tokenize text differently. For example, one word breaker may split the string "dog-house" into "dog" and "house", while another keeps it as the single token "dog-house". Matching and ranking therefore vary with the language specified, because not only the words differ but also the document length. A difference in document length can affect the ranking of every query.
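The effect of word breaking on document length can be illustrated with a small sketch. The two functions below are hypothetical toy word breakers written for this example; they are not the word breakers that ship with SQL Server.

```python
import re

def break_on_hyphen(text):
    """Hypothetical word breaker: splits on whitespace and hyphens."""
    return [token for token in re.split(r"[\s\-]+", text) if token]

def keep_hyphen(text):
    """Hypothetical word breaker: splits on whitespace only."""
    return text.split()

doc = "the dog-house is red"
tokens_a = break_on_hyphen(doc)  # "dog-house" becomes two tokens
tokens_b = keep_hyphen(doc)      # "dog-house" stays one token

# The same text yields different tokens and a different document length.
print(tokens_a, len(tokens_a))
print(tokens_b, len(tokens_b))
```

A query for "house" would match the first tokenization but not the second, and any length-dependent ranking term would see two different lengths for the same text.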

Statistics such as IndexedRowCount can vary widely. For example, if the master index of a catalog contains 2 billion rows and a new document is indexed into an in-memory intermediate index, the rank of that document, which is based on the number of documents in the in-memory index, may be skewed compared with the ranks of documents in the master index. For this reason, we recommend merging the indexes into a master index with the ALTER FULLTEXT CATALOG ... REORGANIZE Transact-SQL statement. The Full-Text Engine also merges indexes automatically based on parameters such as the number and size of the intermediate indexes.
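To get a feel for the size of this skew, the sketch below evaluates the StatisticalWeight formula from the CONTAINSTABLE section of this article against two index sizes. The key row counts (1,000 rows in the master index, 1 row in the intermediate index) are made-up values chosen only for illustration.

```python
import math

def statistical_weight(indexed_row_count, key_row_count):
    """StatisticalWeight = Log2((2 + IndexedRowCount) / KeyRowCount)."""
    return math.log2((2 + indexed_row_count) / key_row_count)

# Hypothetical term found in 1,000 rows of a 2-billion-row master index.
master = statistical_weight(2_000_000_000, 1_000)

# The same term in a freshly built in-memory index holding a single row.
intermediate = statistical_weight(1, 1)

print(round(master, 2), round(intermediate, 2))
```

Under these assumed counts the same term is weighted very differently depending on which index the statistics come from, which is why merging the indexes makes ranks comparable again.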

MaxOccurrence values are normalized into one of 32 ranges. This means, for example, that a document 50 words long is treated the same as a document 100 words long. The table below is used for normalization. Because both document lengths fall between the adjacent table values 32 and 128, they are treated as having the same effective length, 128 (32 < docLength <= 128).

{ 16, 32, 128, 256, 512, 725, 1024, 1450, 2048, 2896, 4096, 5792, 8192, 11585, 16384, 23170, 28000, 32768, 39554, 46340, 55938, 65536, 92681, 131072, 185363, 262144, 370727, 524288, 741455, 1048576, 2097152, 4194304 };
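The lookup described above can be sketched as a binary search over the table. The function name and the clamping of lengths beyond the last table entry are assumptions made for this example; the article does not say how longer documents are handled.

```python
from bisect import bisect_left

# The 32 normalization values from the table above.
LENGTH_BUCKETS = [
    16, 32, 128, 256, 512, 725, 1024, 1450, 2048, 2896, 4096, 5792,
    8192, 11585, 16384, 23170, 28000, 32768, 39554, 46340, 55938,
    65536, 92681, 131072, 185363, 262144, 370727, 524288, 741455,
    1048576, 2097152, 4194304,
]

def normalized_length(doc_length):
    """Return the smallest table value >= doc_length.

    Assumption: lengths above the last entry are clamped to it.
    """
    i = bisect_left(LENGTH_BUCKETS, doc_length)
    return LENGTH_BUCKETS[min(i, len(LENGTH_BUCKETS) - 1)]

print(normalized_length(50), normalized_length(100))
```

Both calls fall into the range 32 < docLength <= 128 and so return the same effective length.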


CONTAINSTABLE ranking

CONTAINSTABLE ranking uses the following algorithm:

StatisticalWeight = Log2( ( 2 + IndexedRowCount ) / KeyRowCount )
Rank = min( MaxQueryRank, HitCount * 16 * StatisticalWeight / MaxOccurrence )

Phrase matches are ranked in the same way as individual keys, except that KeyRowCount (the number of rows containing the phrase) is estimated, and the estimate may be higher than the actual number.
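The two formulas translate directly into a short function. This is a minimal sketch: the default of 1000 for MaxQueryRank is an assumption (the article does not give its value), and the input numbers are invented for the example.

```python
import math

def contains_rank(hit_count, indexed_row_count, key_row_count,
                  max_occurrence, max_query_rank=1000):
    """Rank of a single term, following the formulas above.

    max_query_rank=1000 is an assumed default, not taken from the article.
    """
    statistical_weight = math.log2((2 + indexed_row_count) / key_row_count)
    raw = hit_count * 16 * statistical_weight / max_occurrence
    return min(max_query_rank, raw)

# Hypothetical inputs: 3 hits, 1022 indexed rows, term present in 1 row,
# normalized MaxOccurrence of 128.
print(contains_rank(3, 1022, 1, 128))
```

With these inputs StatisticalWeight is Log2(1024) = 10, so the rank is 3 * 16 * 10 / 128. Very frequent hits in a short document are capped at the maximum rank by the min().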

ISABOUT ranking

CONTAINSTABLE supports querying for weighted terms through the ISABOUT option. In traditional information retrieval terms, ISABOUT is a vector-space query. The default ranking algorithm is the well-known Jaccard formula. A rank is computed for each term in the query, and the per-term ranks are then combined as described below.

ContainsRank = same formula used for CONTAINSTABLE ranking of a single term (above).
Weight = the weight specified in the query for each term. Default weight is 1.
WeightedSum = Σ[key=1 to n] ContainsRankKey * WeightKey
Rank = ( MaxQueryRank * WeightedSum ) / ( ( Σ[key=1 to n] ContainsRankKey^2 ) + ( Σ[key=1 to n] WeightKey^2 ) - ( WeightedSum ) )
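The combination step can be sketched as follows. The function name, the default MaxQueryRank of 1000, and the zero-denominator guard are assumptions added for this example, and the sample ranks and weights are made up.

```python
def isabout_rank(contains_ranks, weights, max_query_rank=1000):
    """Combine per-term ranks with the Jaccard-style formula above.

    contains_ranks[i] and weights[i] belong to the same query term.
    max_query_rank=1000 is an assumed default.
    """
    weighted_sum = sum(r * w for r, w in zip(contains_ranks, weights))
    denominator = (sum(r * r for r in contains_ranks)
                   + sum(w * w for w in weights)
                   - weighted_sum)
    if denominator == 0:  # only when every rank and weight is zero
        return 0.0
    return max_query_rank * weighted_sum / denominator

# Hypothetical two-term query with weights 1.0 and 0.5.
print(isabout_rank([2.0, 4.0], [1.0, 0.5]))
```

One property worth noting: when the vector of per-term ranks equals the vector of weights, the denominator equals WeightedSum and the result is exactly MaxQueryRank.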


FREETEXTTABLE ranking

FREETEXTTABLE ranking is based on the Okapi BM25 ranking formula. A FREETEXTTABLE query adds words to the query through inflectional generation (inflected forms of the original query words). These words are treated as separate words that have no special relationship to the words they were generated from. Synonyms generated by the thesaurus feature are treated as separate terms with equal weight. Each word in the query contributes to the rank.

Rank = Σ[Terms in Query] w * ( ( ( k1 + 1 ) * tf ) / ( K + tf ) ) * ( ( ( k3 + 1 ) * qtf ) / ( k3 + qtf ) )
Where:
w is the Robertson-Sparck Jones weight.
In simplified form, w is defined as:
w = log10( ( ( r + 0.5 ) * ( N - R + r + 0.5 ) ) / ( ( R - r + 0.5 ) * ( n - r + 0.5 ) ) )
N is the number of indexed rows for the property being queried.
n is the number of rows containing the word.
K is ( k1 * ( ( 1 - b ) + ( b * dl / avdl ) ) ).
dl is the property length, in word occurrences.
avdl is the average length of the property being queried, in word occurrences.
k1, b, and k3 are the constants 1.2, 0.75, and 8.0, respectively.
tf is the frequency of the word in the queried property in a specific row.
qtf is the frequency of the term in the query.
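The formula can be evaluated with a few lines of code. This is a sketch under stated assumptions: the article does not define r and R, so the example sets both to 0 (the usual case when no relevance information is available), and the function names and sample statistics are invented for illustration.

```python
import math

K1, B, K3 = 1.2, 0.75, 8.0  # constants given in the article

def rsj_weight(N, n, R=0.0, r=0.0):
    """Robertson-Sparck Jones weight as written above.

    Assumption: R and r default to 0; the article leaves them undefined.
    """
    return math.log10(((r + 0.5) * (N - R + r + 0.5))
                      / ((R - r + 0.5) * (n - r + 0.5)))

def term_score(tf, qtf, dl, avdl, N, n):
    """Contribution of one query term to the rank of one row."""
    K = K1 * ((1 - B) + B * dl / avdl)
    tf_part = ((K1 + 1) * tf) / (K + tf)
    qtf_part = ((K3 + 1) * qtf) / (K3 + qtf)
    return rsj_weight(N, n) * tf_part * qtf_part

def freetext_rank(terms, dl, avdl, N):
    """Sum term scores; terms is a list of (tf, qtf, n) tuples."""
    return sum(term_score(tf, qtf, dl, avdl, N, n) for tf, qtf, n in terms)

# Hypothetical row of average length in a 1,000-row index, two query terms.
print(freetext_rank([(1, 1, 10), (3, 1, 200)], dl=100, avdl=100, N=1000))
```

For a row of average length (dl equal to avdl), K reduces to k1, so a term with tf = 1 and qtf = 1 contributes exactly its weight w. Rare terms (small n) get a larger w, and rows shorter than average get a smaller K, which raises the tf factor.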


This article is an English version of an article originally published in Chinese on aliyun.com.
