Improved vector space model (VSM)

We adopt a more formal definition and use a slightly larger example to demonstrate how to use document-set-based frequency weights. The weight of a given term is calculated using its inverse document frequency (IDF).

To create a corresponding vector for each document, consider the following definitions (d denotes the total number of documents in the set):

t = the number of distinct terms in the document set
tfij = the number of times term tj occurs in document Di (term frequency)
dfj = the number of documents that contain term tj (document frequency)
idfj = lg(d / dfj) (inverse document frequency)

Each document vector has t components, one for each distinct term in the entire document set. Each component is the weight of the corresponding term, computed over the whole document set. A term's weight in a document is assigned automatically from two frequencies: how often the term occurs in that particular document, and how often it occurs across the entire document set. The more frequently a term occurs in a document, the greater its weight; the more frequently it occurs across all documents, the smaller its weight.

A term's weight in a document vector is non-zero only if the term occurs in that document. For a large document set made up of many small documents, each document vector may therefore contain a large number of zero components. For example, if a document set contains 10 000 distinct terms, each document is represented by a 10 000-dimensional vector, and a document containing only 100 distinct terms has 9 900 zero components.
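Because of this sparsity, such vectors are commonly stored keeping only the non-zero weights. A minimal sketch (the helper name is ours, not from the text), using a dictionary as the sparse representation:

```python
def to_sparse(dense_vector, vocabulary):
    """Map each term with a non-zero weight to that weight,
    dropping the zero components entirely."""
    return {term: w for term, w in zip(vocabulary, dense_vector) if w != 0}

vocabulary = ["a", "arrived", "damaged", "delivery", "fire", "gold"]
dense = [0, 0, 0.477, 0, 0.477, 0.176]

print(to_sparse(dense, vocabulary))
# {'damaged': 0.477, 'fire': 0.477, 'gold': 0.176}
```

A document with 100 distinct terms then needs 100 entries instead of 10 000.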


Term weights in documents are based mainly on two factors: term frequency (TF) and inverse document frequency (IDF). That is, the value of entry j in the vector corresponding to document i is calculated with the following formula:

wij = tfij × idfj = tfij × lg(d / dfj)

Next, consider a document set containing documents D1 and D2. In D1 the word "green" appears ten times; in D2 it appears only five times. For the single-word query "green", document D1 is then listed before document D2 in the result.
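This TF-IDF weighting can be sketched as follows (the function name is ours; lg denotes the base-10 logarithm):

```python
import math

def tf_idf(tf, df, d):
    """Weight of a term: its frequency in the document (tf) times
    lg(d / df), where d is the total number of documents and df is
    the number of documents containing the term."""
    if df == 0:
        return 0.0  # term occurs in no document: no weight
    return tf * math.log10(d / df)

# A term that occurs once in a document and appears in 2 of 3 documents:
print(round(tf_idf(1, 2, 3), 3))  # 0.176
```

Note that a term occurring in every document gets weight 0, no matter how often it appears in any single document.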

When a query is issued against a retrieval system built over t distinct terms, the system computes for each document Di a vector Di = (di1, di2, ..., dit), whose components are filled with the term weights described above. Similarly, the vector constructed from the terms of the query is Q = (wq1, wq2, ..., wqt).

The similarity SC(Q, Di) between query Q and document Di can be defined simply as the inner product of the two vectors. Because a query vector has the same form and comparable length as a document vector, the same measure is often used to compute the similarity between two documents. In Section 3.2, we will discuss applying SC to document clustering.
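The inner product itself is just the sum of the products of corresponding components. A minimal sketch (the function name is ours):

```python
def inner_product(q, d):
    """Similarity of a query vector q and a document vector d:
    the sum of the products of corresponding components."""
    return sum(wq * wd for wq, wd in zip(q, d))

# Two small 3-dimensional vectors as an illustration:
print(round(inner_product([0.5, 0, 0.2], [0.1, 0.3, 0.4]), 2))  # 0.13
```

Only terms with non-zero weight in both vectors contribute to the score, so a document sharing no terms with the query scores 0.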

 

Similarity Calculation Example

Consider a fixed query and document set, consisting of a query Q and three documents:

Q: "gold silver truck"

D1: "shipment of gold damaged in a fire"

D2: "delivery of silver arrived in a silver truck"

D3: "shipment of gold arrived in a truck"

This document set contains three documents, so d = 3. If a term appears in only one of the three documents, its IDF is lg(d/dfj) = lg(3/1) = 0.477. Similarly, if a term appears in two of the three documents, its IDF is lg(d/dfj) = lg(3/2) = 0.176. If a term appears in all three documents, its IDF is lg(d/dfj) = lg(3/3) = 0.
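These three IDF values can be checked directly (lg is the base-10 logarithm):

```python
import math

d = 3  # number of documents in the set
for df in (1, 2, 3):
    # IDF of a term appearing in df of the d documents
    print(f"df = {df}: idf = {math.log10(d / df):.3f}")
# df = 1: idf = 0.477
# df = 2: idf = 0.176
# df = 3: idf = 0.000
```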

The IDF value of each term in the three documents is as follows:

a = 0, arrived = 0.176, damaged = 0.477, delivery = 0.477, fire = 0.477, gold = 0.176, in = 0, of = 0, shipment = 0.176, silver = 0.477, truck = 0.176

Now we can construct the document vectors. Because there are 11 distinct terms in the document set, each document vector has 11 dimensions. Taking the terms in alphabetical order, t1 corresponds to the first term "a", t2 to "arrived", and so on. The weight of term i in document j is computed as wij = tfij × idfi. The document vectors are shown in Table 2-1.

Table 2-1 Document Vectors

Docid  a  arrived  damaged  delivery  fire   gold   in  of  shipment  silver  truck
D1     0  0        0.477    0         0.477  0.176  0   0   0.176     0       0
D2     0  0.176    0        0.477     0      0      0   0   0         0.954   0.176
D3     0  0.176    0        0         0      0.176  0   0   0.176     0       0.176
Q      0  0        0        0         0      0.176  0   0   0         0.477   0.176
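The whole example can be reproduced end to end. The sketch below (all names are ours) builds the 11-dimensional TF-IDF vectors from the raw texts and scores the three documents against the query with the inner-product similarity; the weights it produces match Table 2-1:

```python
import math

docs = {
    "D1": "shipment of gold damaged in a fire",
    "D2": "delivery of silver arrived in a silver truck",
    "D3": "shipment of gold arrived in a truck",
}
query = "gold silver truck"

# Vocabulary: every distinct term in the document set, in alphabetical order.
vocab = sorted({t for text in docs.values() for t in text.split()})

# Document frequency and IDF of each term.
d = len(docs)
df = {t: sum(t in text.split() for text in docs.values()) for t in vocab}
idf = {t: math.log10(d / df[t]) for t in vocab}

def vectorize(text):
    """TF-IDF vector of a text over the shared vocabulary."""
    words = text.split()
    return [words.count(t) * idf[t] for t in vocab]

def sc(q, dv):
    """Similarity coefficient: inner product of the two vectors."""
    return sum(a * b for a, b in zip(q, dv))

q_vec = vectorize(query)
for name, text in docs.items():
    print(name, round(sc(q_vec, vectorize(text)), 3))
# D1 0.031
# D2 0.486
# D3 0.062
```

As expected, D2 ranks first: it is the only document containing "silver", the most heavily weighted query term.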

