Summary of chapter 3 of Introduction to Information Retrieval

Source: Internet
Author: User
I. Hash table and Search Tree

A dictionary is usually implemented with either a hash table or a search tree (binary tree, B-tree, AVL tree).

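Advantages of implementing the dictionary as a hash table:

(1) Lookup takes O(1) time on average.

Disadvantages:

(1) Hash collisions must be handled.

(2) Prefix and wildcard (fuzzy) queries are not supported, because similar terms hash to unrelated slots.

(3) As the vocabulary keeps growing, the hash function may have to be redesigned and every term rehashed.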

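Advantages of implementing the dictionary as a search tree:

(1) Prefix and wildcard (fuzzy) queries are supported, because terms are kept in sorted order.

Disadvantages:

(1) Lookup is slower: O(log M) comparisons for a balanced tree over M terms.

(2) The tree must be kept balanced, so it has to be rebalanced as terms are inserted and deleted.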


II. Single wildcard queries

1. Trailing wildcard query

A query such as abc*, where the wildcard appears at the end of the term, is a trailing wildcard query. It can be answered with the search tree by descending through the tree on the characters a, b, and c in turn.

For example, to answer the query ab*:

(1) Compare with the root node: because a falls in the range a-m, go left.

(2) Because ab falls in the range a-hu, go left.

(3) Continue descending in the same way. The terms under the node finally reached that begin with ab are the matches; traverse them and fetch their postings.
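
The descent above can be imitated on a sorted list, which stands in for the B-tree here. This is a minimal sketch under that assumption, not the notes' own implementation; the name `prefix_matches` and the sample vocabulary are made up for illustration.

```python
from bisect import bisect_left

def prefix_matches(sorted_terms, prefix):
    """Return every term in sorted_terms that starts with prefix.

    Hypothetical helper: a sorted list plays the role of the B-tree.
    """
    # Binary search plays the role of the root-to-leaf descent.
    i = bisect_left(sorted_terms, prefix)
    matches = []
    # In sorted order, all terms sharing a prefix are contiguous.
    while i < len(sorted_terms) and sorted_terms[i].startswith(prefix):
        matches.append(sorted_terms[i])
        i += 1
    return matches

TERMS = sorted(["aardvark", "ab", "abacus", "abbey", "about",
                "huygens", "sickle", "zygote"])
print(prefix_matches(TERMS, "ab"))  # ['ab', 'abacus', 'abbey', 'about']
```

The binary search costs O(log M) comparisons, and the scan then touches only the matching terms.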

2. Leading wildcard query

A leading wildcard query needs a reverse B-tree, that is, a B-tree in which every term is stored written backwards. For example, to answer *cba, walk the reverse B-tree on the characters a, b, and c in turn.

To answer the query *ba, which becomes the prefix ab when read backwards:

(1) Compare with the root node: because a falls in the range a-m, go left.

(2) Because ab falls in the range aa-uh, go left.

The terms below that node whose reversed form begins with ab, and which therefore end in ba, are the matches.
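
The reverse B-tree idea can be sketched by sorting the reversed terms and running a prefix search on the reversed suffix. The sorted list again stands in for the tree, and `suffix_matches` is a hypothetical name.

```python
from bisect import bisect_left

def suffix_matches(terms, suffix):
    """Return every term that ends with suffix, via a reversed sorted list.

    Hypothetical helper: a real system would build the reversed
    structure once, not on every query.
    """
    reversed_terms = sorted(t[::-1] for t in terms)
    key = suffix[::-1]                      # "*ba" becomes the prefix "ab"
    i = bisect_left(reversed_terms, key)
    matches = []
    while i < len(reversed_terms) and reversed_terms[i].startswith(key):
        matches.append(reversed_terms[i][::-1])  # restore the original term
        i += 1
    return sorted(matches)

print(suffix_matches(["abba", "cuba", "ba", "bat", "tab"], "ba"))
# ['abba', 'ba', 'cuba']
```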

3. General wildcard query

A query such as abc*cba is split into abc* and *cba. Each part is answered with the methods of 1 and 2 above, and the two result sets are intersected. The intersection must then be filtered against the original query abc*cba. For example, abcba matches both abc* and *cba, but it does not match abc*cba.
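
The split, intersect, and filter steps might look like the sketch below. To keep it short, plain scans with `startswith` and `endswith` stand in for the B-tree and reverse B-tree lookups, and the function name is hypothetical. The filter is a length check: with a single `*`, a term matches only if its prefix and suffix do not overlap.

```python
def single_wildcard_matches(terms, query):
    """Answer a query containing exactly one '*' (hypothetical helper)."""
    prefix, suffix = query.split("*")
    # Stand-ins for the B-tree (prefix) and reverse B-tree (suffix) lookups.
    with_prefix = {t for t in terms if t.startswith(prefix)}
    with_suffix = {t for t in terms if t.endswith(suffix)}
    candidates = with_prefix & with_suffix
    # Post-filter: prefix and suffix must not overlap inside the term.
    return sorted(t for t in candidates
                  if len(t) >= len(prefix) + len(suffix))

VOCAB = ["abcba", "abccba", "abcxcba", "abcd", "xcba"]
print(single_wildcard_matches(VOCAB, "abc*cba"))  # ['abccba', 'abcxcba']
```

Here abcba survives the intersection but is dropped by the filter, which is the case described above.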

 

III. Index structures dedicated to wildcard queries

1. Permuterm index

 

Method: the symbol $ marks the end of a term (as in regular expressions). The term ab is written as ab$, and every rotation of it, namely ab$, b$a, and $ab, is inserted into the dictionary with a pointer back to ab.

Single wildcard query: to answer *b, first append $ to get *b$, then rotate until the * is at the end, which gives b$*. Look up the prefix b$ in the search tree. Since the rotation b$a matches, the term ab is a result.

Multiple wildcard query: to answer a*b*, first append $ to get a*b*$, then rotate to $a*b*. Look up $a* first, then filter the results against the full pattern a*b*.

Disadvantage: a term of length n contributes n+1 rotations, so the dictionary becomes very large.

2. K-gram index

Definition: a k-gram is a sequence of k consecutive characters. (The k-shingle described in Chapter 19 is the analogous notion for k consecutive words.)

For example, the 3-grams of hello are hel, ell, and llo.

 

Index Structure:

The dictionary of the k-gram index is the set of all k-grams that occur in any term.

The postings list of a k-gram is the list of terms that contain it.

Index rules:

Before indexing, add $ to the beginning and end of each term, then extract its k-grams.
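
The structure and indexing rule can be sketched as follows. The names `kgrams` and `build_kgram_index` are hypothetical, and postings are kept as Python sets for simplicity.

```python
from collections import defaultdict

def kgrams(term, k):
    """k-grams of a term after adding '$' at both ends."""
    padded = "$" + term + "$"
    return [padded[i:i + k] for i in range(len(padded) - k + 1)]

def build_kgram_index(terms, k):
    """Map each k-gram to the set of terms containing it (hypothetical helper)."""
    index = defaultdict(set)
    for term in terms:
        for gram in kgrams(term, k):
            index[gram].add(term)
    return index

print(kgrams("hello", 3))  # ['$he', 'hel', 'ell', 'llo', 'lo$']
index = build_kgram_index(["hello", "help", "yellow"], 3)
print(sorted(index["ell"]))  # ['hello', 'yellow']
```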

Query Method:

Example: answer the query com* using a 3-gram index.

(0) Add $ to get $com*$ and extract the k-grams of the parts around the wildcard. With 3-grams these are $co and com; with 2-grams they would be $c, co, and om.

(1) Look up each 3-gram in the 3-gram index and intersect the postings lists to get the candidate terms.

(2) The candidates are not exact. For example, coordcom contains both $co and com but does not match com*. The candidates must therefore be filtered against the original query com*.
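
Steps (0) to (2) can be put together in a short sketch. All names are hypothetical, and the standard-library `fnmatch.fnmatchcase` is used here as a convenient post-filter; it is an assumption of this sketch, not something the notes prescribe.

```python
from collections import defaultdict
from fnmatch import fnmatchcase

def kgrams(text, k):
    return [text[i:i + k] for i in range(len(text) - k + 1)]

def build_kgram_index(terms, k):
    index = defaultdict(set)
    for term in terms:
        for gram in kgrams("$" + term + "$", k):
            index[gram].add(term)
    return index

def wildcard_query(index, query, k):
    """Answer a wildcard query with a k-gram index plus post-filtering."""
    # Step 0: pad with '$', split at '*', take k-grams of each piece.
    pieces = ("$" + query + "$").split("*")
    grams = [g for piece in pieces for g in kgrams(piece, k)]
    if grams:
        # Step 1: intersect the postings of all query k-grams.
        candidates = set.intersection(*(index.get(g, set()) for g in grams))
    else:
        # Query too short to yield any k-gram: fall back to every term.
        candidates = set().union(*index.values())
    # Step 2: post-filter the candidates against the original query.
    return sorted(t for t in candidates if fnmatchcase(t, query))

VOCAB = ["com", "comet", "common", "coordcom", "income", "cod"]
INDEX = build_kgram_index(VOCAB, 3)
print(wildcard_query(INDEX, "com*", 3))  # ['com', 'comet', 'common']
```

In this run coordcom is a candidate after step (1) and is removed only by the post-filter.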

 

Conclusion: the k-gram index is comparatively slow. A k-gram lookup returns terms rather than documents (the single-level index becomes a two-level one), the terms must then be post-filtered, and only after that can the normal inverted index be consulted to find the docIDs.

The permuterm index needs no post-filtering for a single wildcard query, but it consumes a lot of space.


IV. Spelling correction

Principle: prefer the candidate that is closest to the misspelled query. When several candidates are equally close, prefer the more common word.

1. Edit distance

The edit distance between two words is computed by dynamic programming over a two-dimensional matrix with one row per character of the first word and one column per character of the second.

For example, consider the words paris and alice.
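
A minimal sketch of the dynamic program, assuming the usual Levenshtein costs (insertion, deletion, and substitution each cost 1). The function name is hypothetical.

```python
def edit_distance(s, t):
    """Levenshtein distance between s and t by dynamic programming."""
    m, n = len(s), len(t)
    # d[i][j] = edit distance between s[:i] and t[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                      # delete all i characters of s[:i]
    for j in range(n + 1):
        d[0][j] = j                      # insert all j characters of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[m][n]

print(edit_distance("paris", "alice"))  # 4
```

Filling the matrix takes O(m * n) time for words of length m and n.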

 

Note: edit distance has a drawback. Computing it between the query and every term is too slow, because the dictionary of an inverted index can contain tens of millions of terms.

2. Jaccard coefficient over k-grams

Jaccard coefficient: given sets A and B, J(A, B) = |A ∩ B| / |A ∪ B| = |A ∩ B| / (|A| + |B| - |A ∩ B|).

Suppose a and b are two words of length m and n. They then have at most m-k+1 and n-k+1 distinct k-grams respectively (m-1 and n-1 for bigrams).

Given a query q, compute its k-grams, then walk the postings of those k-grams in the k-gram index and compute the Jaccard coefficient between q and each term encountered. |A ∩ B| is the number of k-grams the two words share, and |A| + |B| - |A ∩ B| is the number of k-grams in their union, so it can be obtained from the two k-gram counts and the overlap alone. Keep the terms whose Jaccard coefficient exceeds a threshold.
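
The computation can be sketched directly from the formula. This sketch uses bigrams without $ padding, so that a word of length m has at most m-1 of them as stated above; the function names are hypothetical.

```python
def kgram_set(term, k=2):
    """Set of k-grams of term, without '$' padding (assumption of this sketch)."""
    return {term[i:i + k] for i in range(len(term) - k + 1)}

def jaccard(a, b, k=2):
    """Jaccard coefficient between the k-gram sets of two words."""
    grams_a, grams_b = kgram_set(a, k), kgram_set(b, k)
    overlap = len(grams_a & grams_b)
    # Size of the union from the two counts and the overlap only.
    union = len(grams_a) + len(grams_b) - overlap
    return overlap / union if union else 0.0

# bord has 3 bigrams, boardroom has 8, and they share 2 (bo, rd).
print(jaccard("bord", "boardroom"))  # 2 / (3 + 8 - 2), about 0.222
```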

 

Conclusion: use the k-gram index to narrow down the candidate terms first, then compute edit distance only on those candidates.

Of course, all of the above applies only to spelling errors in isolated words. A query such as "I are happy" raises no error, because each individual word is spelled correctly.

Check method: when a query returns few results, treat the query phrase as possibly misspelled.

3. Phonetic correction

Words that sound alike are mapped to the same group by a phonetic hash function. The best-known algorithm of this kind is Soundex.
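
The notes do not spell out the steps, so the sketch below follows one common textbook formulation of Soundex: keep the first letter, map the remaining letters to digits, collapse runs of identical digits, drop zeros, and pad to four characters. Other Soundex variants differ in details, and the function name is an assumption.

```python
from itertools import groupby

# Letter groups of one common Soundex formulation (assumption of this sketch).
CODES = {}
for letters, digit in [("AEIOUHWY", "0"), ("BFPV", "1"), ("CGJKQSXZ", "2"),
                       ("DT", "3"), ("L", "4"), ("MN", "5"), ("R", "6")]:
    for ch in letters:
        CODES[ch] = digit

def soundex(term):
    """Four-character phonetic hash: one letter followed by three digits."""
    word = [c for c in term.upper() if c in CODES]  # keep A-Z only
    if not word:
        return ""
    first, rest = word[0], word[1:]
    digits = [CODES[c] for c in rest]
    collapsed = [d for d, _ in groupby(digits)]   # merge consecutive repeats
    kept = [d for d in collapsed if d != "0"]     # drop vowel-like letters
    return (first + "".join(kept) + "000")[:4]    # pad and truncate

print(soundex("Hermann"), soundex("Herman"))  # H655 H655
```

Terms with the same code land in the same group, so a misspelled query can be matched against terms that sound alike.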

 

 
