Introduction to Information Retrieval Study Notes, Chapter 2: The Term Vocabulary and Postings Lists

Source: Internet
Author: User

2.1.1 Document parsing and encoding conversion:

The first step in document processing is to convert the byte sequence stored in a file or on a web server into a character sequence.

In practice, you must first determine the document's encoding (by machine-learning classification, heuristics, or other methods) and its type (Word? Zip?), and then convert the byte sequence into a character sequence.
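The heuristic route mentioned above can be sketched roughly as follows. This is only an illustration: the function name bytes_to_text, the byte-order-mark table, and the fallback order are assumptions made for the example, and a real system would more likely use a trained classifier or a dedicated detection library.

```python
import codecs

# Byte-order marks to check first (illustrative, not exhaustive).
_BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def bytes_to_text(raw, fallbacks=("utf-8", "latin-1")):
    """Return (text, encoding_used) for a raw byte sequence.

    Heuristic: trust a byte-order mark if present, otherwise try the
    fallback encodings in order and keep the first one that decodes.
    """
    for bom, encoding in _BOMS:
        if raw.startswith(bom):
            return raw.decode(encoding), encoding
    for encoding in fallbacks:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Reached only if a custom fallback list rejected every candidate.
    return raw.decode("latin-1", errors="replace"), "latin-1"


print(bytes_to_text("héllo".encode("utf-8")))  # ('héllo', 'utf-8')
print(bytes_to_text(b"h\xe9llo"))              # ('héllo', 'latin-1')
```

Because latin-1 accepts any byte sequence, it works as a last resort but can silently mis-decode text, which is why a guess of this kind is usually combined with metadata or a statistical detector.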

2.1.2 Choosing the document unit:

Generally, every file in a directory is treated as one document.

However, this is not always the case. For example, we may want to treat an email and each of its attachments as separate documents.

For a long document there is also the question of "indexing granularity". For example, taking a whole book as the index unit may be too coarse; a chapter or a paragraph can be chosen as the unit instead. If the granularity is too large, precision is low and recall is high; if it is too small, precision is high and recall is low.

2.2 Determining the vocabulary of terms

2.2.1 Tokenization:

Example:

Sometimes we need to distinguish between tokens, types, and terms:

  • Token: an instance of a character sequence in a particular document.
  • Type: the class of all tokens consisting of the same character sequence.
  • Term: a (possibly normalized) type that is included in the dictionary of the information retrieval system.
  • The set of terms can be entirely different from the tokens. For example, the category labels of a classification system can be used as terms.
  • Example: "to sleep perchance to dream" contains five tokens, four types (the two occurrences of "to" count as one type), and three terms if "to" is dropped as a stop word.
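The counts in the example above can be checked with a few lines of Python. The whitespace tokenizer and the one-word stop list are simplifying assumptions made only for this illustration.

```python
def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


STOP_WORDS = {"to"}  # assumed stop list, for this example only

tokens = tokenize("to sleep perchance to dream")
types = set(tokens)                               # distinct character sequences
terms = {t for t in types if t not in STOP_WORDS}  # what ends up in the dictionary

print(len(tokens), len(types), len(terms))  # 5 4 3
```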

Whether the query is a Boolean query or a free-text query, we always want to apply the same tokenization to documents and queries, so that the same character sequence is processed identically in both.

Tokenization is often language-specific.

For text in certain specialized domains, some unusual strings should be recognized as terms, such as C++, C#, B-52, and fly100. One approach is not to index tokens such as monetary amounts, numbers, and URLs, because indexing them would significantly enlarge the vocabulary.

2.2.2 Stop words:

A common way to build a stop list is to sort the terms by collection frequency (the total number of times each term occurs in the document collection), from most to least frequent, and then hand-pick the stop words from the most frequent terms. Removing stop words can greatly reduce the number of postings the system has to store.
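The frequency-sorting step can be sketched as below. The function name stop_word_candidates and the three toy documents are made up for the example; the manual filtering step is left to the reader.

```python
from collections import Counter


def stop_word_candidates(docs, k):
    """Return the k terms with the highest collection frequency.

    Collection frequency counts every occurrence of a term across all
    documents, not the number of documents that contain it.
    """
    cf = Counter()
    for doc in docs:
        cf.update(doc.lower().split())
    return [term for term, _ in cf.most_common(k)]


docs = [
    "the cat sat on the mat",
    "the dog and the cat",
    "a dog on the mat",
]
print(stop_word_candidates(docs, 1))  # ['the']
```

The returned list is only a set of candidates; a person would still review it and keep frequent words that carry meaning in the collection.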

2.2.3 Term normalization:

In many cases we want two tokens to match even though they are not exactly the same character sequence. For example, when querying for USA, we also want to return documents containing U.S.A.

Token normalization is the process of grouping tokens that differ only superficially into one equivalence class, so that they can match each other.

The most common practice is to create equivalence classes implicitly. For example, anti-discriminatory and antidiscriminatory are both mapped to antidiscriminatory. A search for either form then returns documents containing either one.
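A minimal sketch of implicit equivalence classing follows. The normalization rule used here (lowercase, then delete hyphens and periods) is an assumption chosen to cover the two examples in the text, not a recommended rule set.

```python
from collections import defaultdict


def normalize(token):
    """Map a token to the representative of its equivalence class."""
    return token.lower().replace("-", "").replace(".", "")


def build_index(docs):
    """Build term -> set of docIDs, indexing tokens under their normalized form."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[normalize(token)].add(doc_id)
    return index


def search(index, query_term):
    """Normalize the query term the same way before looking it up."""
    return sorted(index.get(normalize(query_term), set()))


docs = {
    1: "anti-discriminatory policy",
    2: "antidiscriminatory law",
    3: "U.S.A. policy",
}
index = build_index(docs)
print(search(index, "Anti-Discriminatory"))  # [1, 2]
print(search(index, "USA"))                  # [3]
```

The key point is that the same normalize function is applied at indexing time and at query time, so no explicit table of equivalent forms is ever stored.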

An alternative to creating equivalence classes is to maintain relations between unnormalized tokens. A common method is to index the unnormalized tokens and to maintain, for each query term, a query expansion list of several vocabulary entries. When a query term is entered, it is expanded according to this list and the postings lists of the expanded terms are merged. The other method is to perform the expansion during index construction. For example, a document that contains automobile is also indexed under car (and, likewise, a document containing car is also indexed under automobile).

Both methods are less efficient than implicit equivalence classing, because more postings have to be stored and merged. The first method adds a query expansion dictionary and therefore needs more time at query processing; the second needs more storage space for postings.

On the other hand, because the expansion lists of two related terms may overlap without being identical, these methods are more flexible than implicit equivalence classing. This also means that the expansion of related terms can be asymmetric. Figure 2-6 provides an example. In this example, if the user enters windows, we want to also return documents about the Windows operating system. However, if the user enters window, a match with the Windows operating system is implausible, although a match with lowercase windows is reasonable.
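The query-time expansion approach, including its asymmetry, might look like the sketch below. The EXPANSIONS table is a hypothetical one written to mirror the windows/window example; it is not taken from any real system.

```python
# Hypothetical expansion lists: note that they overlap but are not identical.
EXPANSIONS = {
    "Windows": ["Windows"],
    "windows": ["Windows", "windows", "window"],
    "window": ["window", "windows"],
}


def build_index(docs):
    """Build token -> set of docIDs; tokens are indexed unnormalized."""
    index = {}
    for doc_id, text in docs.items():
        for token in text.split():
            index.setdefault(token, set()).add(doc_id)
    return index


def expanded_search(index, query_term):
    """Union the postings of every entry in the query term's expansion list."""
    result = set()
    for term in EXPANSIONS.get(query_term, [query_term]):
        result |= index.get(term, set())
    return sorted(result)


docs = {1: "install Windows today", 2: "clean the windows", 3: "open a window"}
index = build_index(docs)
print(expanded_search(index, "windows"))  # [1, 2, 3]
print(expanded_search(index, "window"))   # [2, 3]
print(expanded_search(index, "Windows"))  # [1]
```

The extra unions at query time are the processing cost mentioned in the text; expanding at index time would move that cost into additional stored postings instead.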

2.2.4 Stemming and lemmatization

2.3 Faster postings list intersection with skip pointers

One method is to use a skip list: skip pointers are added to each postings list when the index is built, so that the merge can jump over postings that cannot appear in the result.
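The following sketch shows how skip pointers can shorten an intersection. It assumes each postings list is a sorted Python list of distinct docIDs and places a skip pointer at every position that is a multiple of the square root of the list length (a common heuristic); the pointers are implicit in the index arithmetic rather than stored.

```python
import math


def intersect_with_skips(p1, p2):
    """Intersect two sorted docID lists, following skip pointers when safe."""
    s1 = max(1, int(math.sqrt(len(p1))))  # skip distance in p1
    s2 = max(1, int(math.sqrt(len(p2))))  # skip distance in p2
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # Follow the skip only if its target does not overshoot p2[j].
            if i % s1 == 0 and i + s1 < len(p1) and p1[i + s1] <= p2[j]:
                i += s1
            else:
                i += 1
        else:
            if j % s2 == 0 and j + s2 < len(p2) and p2[j + s2] <= p1[i]:
                j += s2
            else:
                j += 1
    return answer


p1 = [2, 4, 8, 16, 19, 23, 28, 43]
p2 = [1, 2, 3, 5, 8, 41, 51, 60, 71]
print(intersect_with_skips(p1, p2))  # [2, 8]
```

A skip is taken only when the target is still less than or equal to the current docID of the other list, so no matching posting can be jumped over.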

2.4 Positional postings and phrase queries

2.4.1 Biword indexes

Among all possible queries, nouns and noun phrases have a special status in expressing the concepts that users search for.

Storing longer phrases may greatly increase the size of the vocabulary. Maintaining an exhaustive index of all phrases longer than two words is a daunting prospect, and even an exhaustive biword index greatly increases the vocabulary size.
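A biword index can be sketched as follows. The function names and the toy documents are invented for the example. Each pair of consecutive tokens becomes one vocabulary entry, and a longer phrase is answered by intersecting the postings of its biwords, which can produce false positives.

```python
from collections import defaultdict


def build_biword_index(docs):
    """Build (word1, word2) -> set of docIDs for every pair of adjacent tokens."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        tokens = text.lower().split()
        for first, second in zip(tokens, tokens[1:]):
            index[(first, second)].add(doc_id)
    return index


def phrase_candidates(index, phrase):
    """Docs containing every biword of the phrase (not necessarily contiguously)."""
    tokens = phrase.lower().split()
    biwords = list(zip(tokens, tokens[1:]))
    if not biwords:
        return []
    result = set(index.get(biwords[0], set()))
    for biword in biwords[1:]:
        result &= index.get(biword, set())
    return sorted(result)


docs = {
    1: "stanford university palo alto campus",
    2: "palo alto and stanford university palo",
    3: "stanford university in palo alto",
}
index = build_biword_index(docs)
print(phrase_candidates(index, "stanford university palo alto"))  # [1, 2]
```

Document 2 is a false positive: it contains all three biwords but not the four-word phrase, which is one reason biword indexes are usually combined with other mechanisms.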

2.4.2 Positional indexes

In practice, a more common method is to use a so-called positional index, in which each posting has the form:

docID: (position 1, position 2, ...)

Example:

Positional indexes can also answer k-word proximity searches. However, they increase the asymptotic complexity of the postings merge (AND) operation, because the amount of work is now bounded by the total number of tokens in the collection rather than by the number of documents.
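The sketch below builds a small positional index and uses it for a phrase query and a k-word proximity query. The function names are illustrative, and the position checks use simple nested scans for readability rather than the linear merge a production system would use.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Build term -> {docID: [position, ...]} with positions in increasing order."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, token in enumerate(text.lower().split()):
            index[token].setdefault(doc_id, []).append(pos)
    return index


def phrase_query(index, phrase):
    """Docs in which the words of the phrase occur at consecutive positions."""
    terms = phrase.lower().split()
    if not terms:
        return []
    postings = [index.get(t, {}) for t in terms]
    common = set(postings[0])
    for p in postings[1:]:
        common.intersection_update(p)  # keep docs that contain every term
    hits = []
    for doc_id in sorted(common):
        for start in postings[0][doc_id]:
            if all(start + i in postings[i][doc_id] for i in range(1, len(terms))):
                hits.append(doc_id)
                break
    return hits


def within_k(index, term1, term2, k):
    """Docs in which term1 and term2 occur at most k positions apart."""
    p1, p2 = index.get(term1, {}), index.get(term2, {})
    return [
        doc_id
        for doc_id in sorted(p1.keys() & p2.keys())
        if any(abs(a - b) <= k for a in p1[doc_id] for b in p2[doc_id])
    ]


docs = {1: "to be or not to be", 2: "be quick to act", 3: "not to be missed"}
index = build_positional_index(docs)
print(index["be"])                     # {1: [1, 5], 2: [0], 3: [2]}
print(phrase_query(index, "to be"))    # [1, 3]
print(within_k(index, "be", "to", 2))  # [1, 2, 3]
```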

2.4.3 Hybrid indexing schemes

Biword indexes and positional indexes can be combined effectively. If users frequently query particular phrases, such as Michael Jackson, it is very inefficient to keep merging positional postings lists for them. A hybrid strategy is to answer some queries from a phrase index, or just a biword index, and to answer the remaining phrase queries from the positional index. Good candidates for the phrase index can be identified from logs of recent user query behavior; that is, they are the frequently used queries.

Frequency is not the only criterion, however. For the positional index, the most expensive phrase queries are those in which each word is very common but the words rarely occur together as the phrase. Adding the query Britney Spears to the phrase index may speed that query up only by a factor of about three, because most documents that mention either word are relevant documents. Adding The Who to the phrase index would speed that query up by a much larger factor. An implementation therefore prefers to add the latter to the phrase index, even though it is queried less often than the former.
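One way to picture the hybrid strategy is a small dispatcher that answers selected phrases from a precomputed phrase index and everything else from the positional index. The class name HybridIndex, the hot_phrases parameter, and the toy documents are assumptions made for this sketch; how the hot phrases are chosen (query logs, cost estimates) is left outside the example.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Build term -> {docID: [position, ...]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, token in enumerate(text.lower().split()):
            index[token].setdefault(doc_id, []).append(pos)
    return index


def positional_phrase(index, terms):
    """Docs where the terms occur at consecutive positions (the slow path)."""
    if not terms:
        return []
    postings = [index.get(t, {}) for t in terms]
    common = set(postings[0])
    for p in postings[1:]:
        common.intersection_update(p)
    return [
        d for d in sorted(common)
        if any(all(s + i in postings[i][d] for i in range(1, len(terms)))
               for s in postings[0][d])
    ]


class HybridIndex:
    def __init__(self, docs, hot_phrases):
        self.positional = build_positional_index(docs)
        # Precompute postings only for the phrases chosen for the phrase index.
        self.phrase_index = {}
        for phrase in hot_phrases:
            terms = phrase.lower().split()
            self.phrase_index[" ".join(terms)] = positional_phrase(self.positional, terms)

    def phrase_query(self, phrase):
        """Return (docIDs, name of the index that answered the query)."""
        terms = phrase.lower().split()
        key = " ".join(terms)
        if key in self.phrase_index:
            return self.phrase_index[key], "phrase index"
        return positional_phrase(self.positional, terms), "positional index"


docs = {1: "the who played live", 2: "who is the singer", 3: "britney spears sang"}
hybrid = HybridIndex(docs, hot_phrases=["The Who"])
print(hybrid.phrase_query("the who"))         # ([1], 'phrase index')
print(hybrid.phrase_query("britney spears"))  # ([3], 'positional index')
```

Both paths return the same kind of result; the phrase index simply avoids the positional merge for the queries where that merge is most expensive.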
