Modern Information Retrieval - Vector Space Model

Source: Internet
Author: User
Tags: idf

Professor Wang's Modern Information Retrieval course is wonderful, but with three class periods scheduled back to back I have never been able to stay focused the whole time. I am recording the material here, which also serves as a review.

With an index that only supports Boolean queries, a given query may match a large number of documents. Scoring the matching documents, or weighting them by relevance, is therefore particularly important.

1. Parametric indexes and zone indexes

  

Documents generally carry additional structure (title, author, body, and so on), which is also called metadata. A search system can build parametric indexes over this structure to support parametric search, for example a query such as "find documents written by William Shakespeare in 1601 that contain the phrase Alas poor". This kind of query is typical of specialized, field-aware search engines such as Baidu Academic.

For example, consider a document collection in which each document has three fields (zones): author, title, and body. For the query "Shakespeare", each field is given a 1 if the term appears in it and a 0 if it does not. The three fields have three weighting coefficients, g1, g2, and g3.

If g1 = 0.2, g2 = 0.3, and g3 = 0.5, the body is treated as the most important of the three fields. If the query term appears in the title and the body of a document but not in the author field, the score of this document is 0.3 + 0.5 = 0.8.
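The scoring rule above can be sketched in a few lines of Python. This is only an illustration: the function name `weighted_zone_score`, the whitespace tokenization, and the sample document are assumptions made for the example, not part of the course material.

```python
# Field weights from the example above: author, title, body.
ZONE_WEIGHTS = {"author": 0.2, "title": 0.3, "body": 0.5}


def weighted_zone_score(doc, term, weights=ZONE_WEIGHTS):
    """Sum the weights of every field whose text contains the term."""
    score = 0.0
    for zone, weight in weights.items():
        # Assumption: simple lowercase whitespace tokenization.
        tokens = doc.get(zone, "").lower().split()
        if term.lower() in tokens:
            score += weight
    return score


# Hypothetical document: the term appears in the title and body only.
doc = {
    "author": "anonymous",
    "title": "notes on shakespeare",
    "body": "shakespeare wrote hamlet",
}
print(round(weighted_zone_score(doc, "Shakespeare"), 2))  # 0.8
```

A document that matched in all three fields would score 1.0 under these weights, and one that matched nowhere would score 0.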

The example makes the idea easier to grasp, but it leaves one problem: how to determine the weight of each field. Generally, the weights are learned from a manually labeled training set, so that the program can determine accurate field weights automatically.

A training sample looks like this:

Suppose each document has only a title and a body. s_T(d, q) indicates whether the query matches the title field of the document, and s_B(d, q) indicates the same for the body field.

Sample 1: document ID: 37, query: Linux, s_T: 1, s_B: 1, relevance judgment: relevant

Given a set of such samples, training picks the weight g that keeps the error on the training set as small as possible, which gives the field weights. The calculation is not covered here.
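As a rough sketch of what such training could look like, assume just two fields, a score of g × s_T + (1 − g) × s_B, and a squared-error loss against 0/1 relevance labels. These are assumptions for illustration, not something the notes above spell out. Under them, setting the derivative of the total error to zero gives a closed form for g that only depends on samples where exactly one field matches. Only the first sample below comes from the text; the others are hypothetical.

```python
def learn_title_weight(samples):
    """Return the g minimizing the squared error of g*s_T + (1-g)*s_B.

    Each sample is (s_T, s_B, relevant) with 0/1 values. Samples where
    s_T == s_B get the same score for every g, so they are skipped.
    """
    wants_high_g = 0  # title-only and relevant, or body-only and not relevant
    wants_low_g = 0   # title-only and not relevant, or body-only and relevant
    for s_t, s_b, relevant in samples:
        if s_t == s_b:
            continue
        if (s_t == 1) == (relevant == 1):
            wants_high_g += 1
        else:
            wants_low_g += 1
    total = wants_high_g + wants_low_g
    if total == 0:
        return 0.5  # arbitrary fallback: the data does not constrain g
    return wants_high_g / total


# (s_T, s_B, relevant). Only the first row is from the text above;
# the remaining rows are made up for illustration.
samples = [
    (1, 1, 1),
    (0, 1, 0),
    (1, 0, 1),
    (0, 0, 0),
    (1, 0, 0),
    (0, 1, 1),
    (1, 0, 1),
]
g = learn_title_weight(samples)
print(g)  # 0.6
```

With this hypothetical data the title weight comes out as 0.6 and the body weight as 1 − 0.6 = 0.4.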

2. Term frequency and weighting

This involves a very important concept: the vector space model.

The formulas in the book are not very easy to follow. I found a good worked example in the library, so I will go over it here.

The example uses a fixed query and document collection, consisting of a query Q and three documents:

Q: "gold silver truck"

D1: "shipment of gold damaged in a fire"

D2: "delivery of silver arrived in a silver truck"

D3: "shipment of gold arrived in a truck"

In this document set there are three documents, so D = 3. If a term appears in only one of the three documents, its IDF is lg(D/df_i) = lg(3/1) = 0.477. Similarly, if a term appears in two of the three documents, its IDF is lg(D/df_i) = lg(3/2) = 0.176. If a term appears in all three documents, its IDF is lg(D/df_i) = lg(3/3) = 0.

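The IDF value of each term in the three documents is as follows:

a = 0, arrived = 0.176, damaged = 0.477, delivery = 0.477, fire = 0.477, gold = 0.176, in = 0, of = 0, shipment = 0.176, silver = 0.477, truck = 0.176

IDF (inverse document frequency) captures the idea that terms which occur in many documents are not very important for telling documents apart. Stop words are a typical example.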
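The IDF values can be checked with a short Python sketch. The helper name `inverse_document_frequencies` and the whitespace tokenization are assumptions made for this example; `lg` is taken to be the base-10 logarithm, which matches the 0.477 and 0.176 figures.

```python
import math

docs = {
    "D1": "shipment of gold damaged in a fire",
    "D2": "delivery of silver arrived in a silver truck",
    "D3": "shipment of gold arrived in a truck",
}


def inverse_document_frequencies(docs):
    """Map each term to lg(D / df), where df counts documents containing it."""
    n_docs = len(docs)
    df = {}
    for text in docs.values():
        # set() so that a term repeated inside one document counts once.
        for term in set(text.split()):
            df[term] = df.get(term, 0) + 1
    return {term: math.log10(n_docs / count) for term, count in sorted(df.items())}


idf = inverse_document_frequencies(docs)
for term, value in idf.items():
    print(f"{term:10s} {value:.3f}")
```

Note that "silver" occurs twice in D2 but still has a document frequency of 1, so its IDF is lg(3/1).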

Now we can construct the document vectors. Because there are 11 distinct terms in the document set, each vector has 11 dimensions. We use the terms in the alphabetical order given above, so t1 corresponds to the first term "a", t2 corresponds to "arrived", and so on. The weight of term i in vector j is idf_i × tf_ij. The document vectors are shown in Table 2-1.

Table 2-1: tf × idf weights of the documents and the query

Docid  a  arrived  damaged  delivery  fire   gold   in  of  shipment  silver  truck
D1     0  0        0.477    0         0.477  0.176  0   0   0.176     0       0
D2     0  0.176    0        0.477     0      0      0   0   0         0.954   0.176
D3     0  0.176    0        0         0      0.176  0   0   0.176     0       0.176
Q      0  0        0        0         0      0.176  0   0   0         0.477   0.176
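The rows of the table can be reproduced with the following Python sketch. It assumes whitespace tokenization, raw term counts as tf, and a base-10 logarithm for idf; the name `tfidf_vector` is made up for the example.

```python
import math
from collections import Counter

docs = {
    "D1": "shipment of gold damaged in a fire",
    "D2": "delivery of silver arrived in a silver truck",
    "D3": "shipment of gold arrived in a truck",
}
query = "gold silver truck"

# The 11 distinct terms in alphabetical order, one per vector dimension.
vocab = sorted({term for text in docs.values() for term in text.split()})
# Document frequency: number of documents that contain the term.
df = {t: sum(t in text.split() for text in docs.values()) for t in vocab}
idf = {t: math.log10(len(docs) / df[t]) for t in vocab}


def tfidf_vector(text):
    """Weight of each vocabulary term: raw term count times idf."""
    tf = Counter(text.split())
    return [tf[t] * idf[t] for t in vocab]


vectors = {name: tfidf_vector(text) for name, text in docs.items()}
vectors["Q"] = tfidf_vector(query)

for name, vec in vectors.items():
    print(name, [round(w, 3) for w in vec])
```

The 0.954 entry for "silver" in D2 is 2 × 0.477, because the term occurs twice in that document.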

Then calculate the score of each document:

It is the inner product of the query vector Q and the document vector, and it serves as the measure of similarity.

SIM(Q, D1) = 0×0 + 0×0 + 0×0.477 + 0×0 + 0×0.477 + 0.176×0.176 + 0×0 + 0×0 + 0×0.176 + 0.477×0 + 0.176×0 = 0.031

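Calculating the other two in the same way gives SIM(Q, D2) = 0.477×0.954 + 0.176×0.176 = 0.486 and SIM(Q, D3) = 0.176×0.176 + 0.176×0.176 = 0.062, so the documents are ranked D2, D3, D1.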
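The whole ranking step can be sketched in Python using the rounded weights from Table 2-1. The helper name `inner_product` is an assumption for this example, and no length normalization is applied, in keeping with the plain inner product used here.

```python
# Rounded tf-idf weights copied from Table 2-1, terms in alphabetical order:
# a, arrived, damaged, delivery, fire, gold, in, of, shipment, silver, truck
vectors = {
    "D1": [0, 0, 0.477, 0, 0.477, 0.176, 0, 0, 0.176, 0, 0],
    "D2": [0, 0.176, 0, 0.477, 0, 0, 0, 0, 0, 0.954, 0.176],
    "D3": [0, 0.176, 0, 0, 0, 0.176, 0, 0, 0.176, 0, 0.176],
}
q = [0, 0, 0, 0, 0, 0.176, 0, 0, 0, 0.477, 0.176]


def inner_product(u, v):
    """Sum of the element-wise products of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


scores = {name: inner_product(q, vec) for name, vec in vectors.items()}
# Highest similarity first.
ranking = sorted(scores, key=scores.get, reverse=True)
for name in ranking:
    print(name, round(scores[name], 3))
# D2 0.486
# D3 0.062
# D1 0.031
```

D2 ranks first mainly because "silver" is both rare in the collection and repeated in that document.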

This is a relatively primitive model. To be continued.
