Improved vector space model (VSM)

We adopt a more formal definition and use a slightly larger example to demonstrate how to use document-set-based frequency weights. The weight of a given term is calculated using its inverse document frequency (IDF).

To create a corresponding vector for each document, consider the following definitions (d denotes the total number of documents in the set):

t = the number of distinct terms in the document set
tfij = the number of times term tj occurs in document Di (term frequency)
dfj = the number of documents that contain term tj (document frequency)
idfj = lg(d / dfj) (inverse document frequency)

Each document vector has t components, one for each distinct term in the entire document set. Each component is the weight of the corresponding term, computed over the whole document set. A term's weight in a document is assigned automatically from two frequencies: how often the term occurs in that particular document, and how often it occurs across the entire document set. The more frequently a term occurs in a document, the greater its weight; the more frequently it occurs across all documents, the smaller its weight.

A term's weight in a document vector is non-zero only if the term occurs in that document. For a large document set made up of many small documents, each document vector may therefore contain a large number of zero components. For example, if a document set contains 10 000 distinct terms, each document is represented by a 10 000-dimensional vector, and a document containing only 100 distinct terms has 9 900 zero components.
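Because of this sparsity, such vectors are commonly stored keeping only the non-zero weights. A minimal sketch (the helper name is ours, not from the text), using a dictionary as the sparse representation:

```python
def to_sparse(dense_vector, vocabulary):
    """Map each term with a non-zero weight to that weight,
    dropping the zero components entirely."""
    return {term: w for term, w in zip(vocabulary, dense_vector) if w != 0}

vocabulary = ["a", "arrived", "damaged", "delivery", "fire", "gold"]
dense = [0, 0, 0.477, 0, 0.477, 0.176]

print(to_sparse(dense, vocabulary))
# {'damaged': 0.477, 'fire': 0.477, 'gold': 0.176}
```

A document with 100 distinct terms then needs 100 entries instead of 10 000.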


Term weights in documents are based mainly on two factors: term frequency (TF) and inverse document frequency (IDF). That is, the value of entry j in the vector corresponding to document i is calculated with the following formula:

wij = tfij × idfj = tfij × lg(d / dfj)

Next, consider a document set containing documents D1 and D2. In D1 the word "green" appears ten times; in D2 it appears only five times. For the single-word query "green", document D1 is then listed before document D2 in the result.
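This TF-IDF weighting can be sketched as follows (the function name is ours; lg denotes the base-10 logarithm):

```python
import math

def tf_idf(tf, df, d):
    """Weight of a term: its frequency in the document (tf) times
    lg(d / df), where d is the total number of documents and df is
    the number of documents containing the term."""
    if df == 0:
        return 0.0  # term occurs in no document: no weight
    return tf * math.log10(d / df)

# A term that occurs once in a document and appears in 2 of 3 documents:
print(round(tf_idf(1, 2, 3), 3))  # 0.176
```

Note that a term occurring in every document gets weight 0, no matter how often it appears in any single document.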

When a query is issued against a retrieval system built over t distinct terms, the system computes for each document Di a vector Di = (di1, di2, ..., dit), whose components are filled with the term weights described above. Similarly, the vector constructed from the terms of the query is Q = (wq1, wq2, ..., wqt).

The similarity SC(Q, Di) between query Q and document Di can be defined simply as the inner product of the two vectors. Because a query vector has the same form and comparable length as a document vector, the same measure is often used to compute the similarity between two documents. In Section 3.2, we will discuss applying SC to document clustering.
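The inner product itself is just the sum of the products of corresponding components. A minimal sketch (the function name is ours):

```python
def inner_product(q, d):
    """Similarity of a query vector q and a document vector d:
    the sum of the products of corresponding components."""
    return sum(wq * wd for wq, wd in zip(q, d))

# Two small 3-dimensional vectors as an illustration:
print(round(inner_product([0.5, 0, 0.2], [0.1, 0.3, 0.4]), 2))  # 0.13
```

Only terms with non-zero weight in both vectors contribute to the score, so a document sharing no terms with the query scores 0.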

 

Similarity Calculation Example

Consider a fixed query and document set, consisting of a query Q and three documents:

Q: "gold silver truck"

D1: "shipment of gold damaged in a fire"

D2: "delivery of silver arrived in a silver truck"

D3: "shipment of gold arrived in a truck"

This document set contains three documents, so d = 3. If a term appears in only one of the three documents, its IDF is lg(d/dfj) = lg(3/1) = 0.477. Similarly, if a term appears in two of the three documents, its IDF is lg(d/dfj) = lg(3/2) = 0.176. If a term appears in all three documents, its IDF is lg(d/dfj) = lg(3/3) = 0.
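These three IDF values can be checked directly (lg is the base-10 logarithm):

```python
import math

d = 3  # number of documents in the set
for df in (1, 2, 3):
    # IDF of a term appearing in df of the d documents
    print(f"df = {df}: idf = {math.log10(d / df):.3f}")
# df = 1: idf = 0.477
# df = 2: idf = 0.176
# df = 3: idf = 0.000
```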

The IDF value of each term in the three documents is as follows:

a = 0, arrived = 0.176, damaged = 0.477, delivery = 0.477, fire = 0.477, gold = 0.176, in = 0, of = 0, shipment = 0.176, silver = 0.477, truck = 0.176

Now we can construct the document vectors. Because there are 11 distinct terms in the document set, each document vector has 11 dimensions. Taking the terms in alphabetical order, t1 corresponds to the first term "a", t2 to "arrived", and so on. The weight of term i in document j is computed as wij = tfij × idfi. The document vectors are shown in Table 2-1.

Table 2-1 Document Vectors

Docid  a  arrived  damaged  delivery  fire   gold   in  of  shipment  silver  truck
D1     0  0        0.477    0         0.477  0.176  0   0   0.176     0       0
D2     0  0.176    0        0.477     0      0      0   0   0         0.954   0.176
D3     0  0.176    0        0         0      0.176  0   0   0.176     0       0.176
Q      0  0        0        0         0      0.176  0   0   0         0.477   0.176
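The whole example can be reproduced end to end. The sketch below (all names are ours) builds the 11-dimensional TF-IDF vectors from the raw texts and scores the three documents against the query with the inner-product similarity; the weights it produces match Table 2-1:

```python
import math

docs = {
    "D1": "shipment of gold damaged in a fire",
    "D2": "delivery of silver arrived in a silver truck",
    "D3": "shipment of gold arrived in a truck",
}
query = "gold silver truck"

# Vocabulary: every distinct term in the document set, in alphabetical order.
vocab = sorted({t for text in docs.values() for t in text.split()})

# Document frequency and IDF of each term.
d = len(docs)
df = {t: sum(t in text.split() for text in docs.values()) for t in vocab}
idf = {t: math.log10(d / df[t]) for t in vocab}

def vectorize(text):
    """TF-IDF vector of a text over the shared vocabulary."""
    words = text.split()
    return [words.count(t) * idf[t] for t in vocab]

def sc(q, dv):
    """Similarity coefficient: inner product of the two vectors."""
    return sum(a * b for a, b in zip(q, dv))

q_vec = vectorize(query)
for name, text in docs.items():
    print(name, round(sc(q_vec, vectorize(text)), 3))
# D1 0.031
# D2 0.486
# D3 0.062
```

As expected, D2 ranks first: it is the only document containing "silver", the most heavily weighted query term.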

