Modern Information Retrieval - Vector Space Model

Last Update:2014-10-18 Source: Internet

Author: User

Tags idf


Teacher Wang's Modern Information Retrieval course is wonderful, but three class periods scheduled back to back make it hard for me to stay focused. I am writing the material down here, which also serves as review.

With an index that supports Boolean queries, a single query may match many documents. It is therefore particularly important to score the matching documents, that is, to compute a relevance weight for each of them.

1. Parametric indexes and zone (field) indexes

Documents usually have additional structure (title, author, content, and so on), which is also called metadata. For such collections, a parametric index can be built to support parametric search, for example a query like "find documents written by William Shakespeare in 1601 that contain the phrase Alas poor". This kind of search is typically offered by specialized search services, such as Baidu Academic.

For example, consider a document set in which each document has three fields: author, title, and body. For the query "Shakespeare", each field gets a value of 1 if the query term appears in it and 0 if it does not. The three fields are assigned three weighting coefficients, G1, G2, and G3.

If G1 = 0.2, G2 = 0.3, and G3 = 0.5, a match in the body counts the most of the three fields. If the query term appears in both the title and the body of a document, the score of that document is 0.3 + 0.5 = 0.8.
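The scoring rule above can be sketched in a few lines of Python. This is only a minimal illustration: the function name `weighted_field_score`, the sample document, and the whitespace tokenization are assumptions made for this example, not something given in the original.

```python
# Field weights from the example above (G1, G2, G3); they sum to 1.
WEIGHTS = {"author": 0.2, "title": 0.3, "body": 0.5}

def weighted_field_score(doc, term, weights=WEIGHTS):
    """Sum the weights of every field whose text contains the query term."""
    score = 0.0
    for field, weight in weights.items():
        # Naive tokenization (assumption): lowercase and split on whitespace.
        words = doc.get(field, "").lower().split()
        if term.lower() in words:
            score += weight
    return score

# Hypothetical document: the term appears in the title and the body only.
doc = {
    "author": "anonymous",
    "title": "notes on shakespeare",
    "body": "shakespeare wrote many plays",
}
score = weighted_field_score(doc, "Shakespeare")
print(round(score, 2))  # 0.8
```

A document that matches in all three fields would score 1.0 under these weights, and one that matches in no field would score 0.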

The example makes the idea concrete, but one question remains: how to determine the weight of each field. Generally, the weights are learned from a manually labeled training set, so that a program can determine suitable field weights automatically.

A training sample looks like this:

Suppose each document has only a title and a body. ST(d, q) indicates whether the query q matches the title field of document d, and SB(d, q) indicates the same for the body field.

Sample 1: document ID: 37, query: linux, ST: 1, SB: 1, relevance judgment: relevant

Given a set of such samples, training produces a value of G whose error on the samples is acceptably small, and that value gives the field weights. The calculation itself is not described here.
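Since the calculation is not given here, the following is only one plausible sketch of such training, under several assumptions that are not in the original: the score is g * ST + (1 - g) * SB with a single weight g, relevance is encoded as 1 (relevant) or 0 (not relevant), the error is the sum of squared differences, and g is found by a simple grid search. Only the first sample row mirrors the sample above; all other rows are made up for illustration.

```python
# Each sample is (s_title, s_body, relevant), with 1 = match / relevant.
# Row 1 mirrors the sample in the text; the remaining rows are hypothetical.
samples = [
    (1, 1, 1),
    (0, 1, 0),
    (1, 0, 1),
    (0, 1, 1),
    (1, 0, 0),
    (0, 0, 0),
    (1, 0, 1),
]

def total_squared_error(g, samples):
    """Squared error of the score g*s_title + (1-g)*s_body against the labels."""
    return sum((r - (g * st + (1 - g) * sb)) ** 2 for st, sb, r in samples)

# Grid search over g in [0, 1] with a step of 0.01 (an arbitrary choice).
candidates = [i / 100 for i in range(101)]
best_g = min(candidates, key=lambda g: total_squared_error(g, samples))

print(best_g)  # 0.6 for these made-up samples
```

With these samples the title weight comes out at 0.6 and the body weight at 0.4. Samples where both fields match or neither matches do not influence the result, because their score does not depend on g.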

2. Term frequency and weight calculation

This involves a very important concept: the vector space model.

The formulas in the book are not easy to follow. I found a good article in the library, so I will walk through its example here.

Consider a fixed query and document set, consisting of a query Q and three documents:

Q: "gold silver truck"

D1: "shipment of gold damaged in a fire"

D2: "delivery of silver arrived in a silver truck"

D3: "shipment of gold arrived in a truck"

This document set contains three documents, so D = 3. If a term appears in only one of the three documents, its IDF is lg(D/df_i) = lg(3/1) = 0.477, where df_i is the number of documents that contain term i and lg is the base-10 logarithm. Similarly, if a term appears in two of the three documents, its IDF is lg(3/2) = 0.176. If a term appears in all three documents, its IDF is lg(3/3) = 0.

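The IDF value of each term in the three documents is as follows:

Term	df	IDF
a	3	0
arrived	2	0.176
damaged	1	0.477
delivery	1	0.477
fire	1	0.477
gold	2	0.176
in	3	0
of	3	0
shipment	2	0.176
silver	1	0.477
truck	2	0.176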

IDF (inverse document frequency) is introduced because terms that occur in many documents are not very informative. Stop words are a typical example.
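The IDF values for this document set can be reproduced with a short sketch. The function name `inverse_document_frequency` and the whitespace tokenization are assumptions made for this example. The logarithm is base 10, matching lg above.

```python
import math

docs = {
    "D1": "shipment of gold damaged in a fire",
    "D2": "delivery of silver arrived in a silver truck",
    "D3": "shipment of gold arrived in a truck",
}

def inverse_document_frequency(docs):
    """Return {term: lg(D / df)} for every term in the document set."""
    n_docs = len(docs)
    # One set of distinct words per document, so repeats count once for df.
    tokenized = [set(text.split()) for text in docs.values()]
    vocabulary = sorted(set().union(*tokenized))
    idf = {}
    for term in vocabulary:
        df = sum(1 for words in tokenized if term in words)
        idf[term] = math.log10(n_docs / df)
    return idf

idf = inverse_document_frequency(docs)
for term, value in idf.items():
    print(f"{term:10s}{value:.3f}")
```

Note that "silver" occurs twice in D2 but still has df = 1, because document frequency counts documents, not occurrences.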

Now we can construct the document vectors. Because the document set contains 11 distinct terms, each vector has 11 dimensions. Using the terms in the alphabetical order given above, t1 corresponds to the first term "a", t2 corresponds to "arrived", and so on. The weight of term i in vector j is idf_i × tf_ij, where tf_ij is the number of times term i occurs in document j. The document vectors are shown in Table 2-1.

Doc ID	a	arrived	damaged	delivery	fire	gold	in	of	shipment	silver	truck
D1	0	0	0.477	0	0.477	0.176	0	0	0.176	0	0
D2	0	0.176	0	0.477	0	0	0	0	0	0.954	0.176
D3	0	0.176	0	0	0	0.176	0	0	0.176	0	0.176
Q	0	0	0	0	0	0.176	0	0	0	0.477	0.176
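As a cross-check, the vectors in the table can be rebuilt from the raw text. This is a minimal sketch: the helper name `tfidf_vector` is hypothetical, tokenization is a plain whitespace split, and the query is weighted with the same idf × tf rule as the documents.

```python
import math
from collections import Counter

docs = {
    "D1": "shipment of gold damaged in a fire",
    "D2": "delivery of silver arrived in a silver truck",
    "D3": "shipment of gold arrived in a truck",
}
query = "gold silver truck"

tokenized = {name: text.split() for name, text in docs.items()}
# Alphabetical vocabulary: a, arrived, damaged, ..., truck (11 terms).
vocabulary = sorted({term for words in tokenized.values() for term in words})
idf = {
    term: math.log10(len(docs) / sum(term in words for words in tokenized.values()))
    for term in vocabulary
}

def tfidf_vector(text):
    """Weight of each vocabulary term: raw term count times IDF, rounded to 3 places."""
    counts = Counter(text.split())
    return [round(counts[term] * idf[term], 3) for term in vocabulary]

vectors = {name: tfidf_vector(text) for name, text in docs.items()}
vectors["Q"] = tfidf_vector(query)

for name, vector in vectors.items():
    print(name, vector)
```

The value 0.954 for "silver" in D2 comes from a term frequency of 2 multiplied by an IDF of 0.477.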

Next, calculate the score of each document. The score is the inner product of the query vector Q and the document vector, and it serves as the measure of similarity.

SIM(Q, D1) = 0×0 + 0×0 + 0×0.477 + 0×0 + 0×0.477 + 0.176×0.176 + 0×0 + 0×0 + 0×0.176 + 0.477×0 + 0.176×0 = 0.031

Calculating the other two values the same way gives SIM(Q, D2) = 0.486 and SIM(Q, D3) = 0.062, so the ranking is D2, D3, D1.
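The inner-product scoring and the resulting ranking can be sketched as follows, using the vectors from Table 2-1 directly. The names `inner_product`, `scores`, and `ranking` are chosen for this example only.

```python
# Vectors copied from Table 2-1 (terms in alphabetical order: a ... truck).
vectors = {
    "D1": [0, 0, 0.477, 0, 0.477, 0.176, 0, 0, 0.176, 0, 0],
    "D2": [0, 0.176, 0, 0.477, 0, 0, 0, 0, 0, 0.954, 0.176],
    "D3": [0, 0.176, 0, 0, 0, 0.176, 0, 0, 0.176, 0, 0.176],
}
query = [0, 0, 0, 0, 0, 0.176, 0, 0, 0, 0.477, 0.176]

def inner_product(u, v):
    """Sum of element-wise products of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

scores = {name: inner_product(query, vec) for name, vec in vectors.items()}
# Highest score first.
ranking = sorted(scores, key=scores.get, reverse=True)

for name in ranking:
    print(name, round(scores[name], 3))
# D2 0.486
# D3 0.062
# D1 0.031
```

D2 ranks first mainly because "silver" is rare in the collection and occurs twice in that document.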

This is a relatively primitive model. To be continued.
