[IR course note] vector space model

Source: Internet
Author: User

The idea behind VSM

A document is represented as a vector in R^|V| (V being the set of dimensions, typically the vocabulary), so that the similarity between two documents can be computed from their Euclidean distance or from the cosine of the angle between them.

So how do you represent a document as a vector?

First, we need to choose the basis vectors (the dimensions). The basis vectors must be linearly independent (for example, mutually orthogonal).

In an IR system, there are two ways of determining the basis vectors:

1. Core concepts: classify words by how strongly they lean toward each concept category, and use that degree of "tilt" as the value of the vector. In practice, it is difficult to determine the basis vectors this way.

2. Terms: treat each term (word) that appears as a basis vector, assuming that all basis vectors are mutually orthogonal and independent. This is the approach used here.

The vector representation of a document is the sum of the vectors of all the terms that appear in the document.
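The sum of term vectors described above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: documents are split on whitespace, and the helper names build_vocabulary and document_vector are hypothetical, not taken from the course note.

```python
def build_vocabulary(documents):
    """Assign each distinct term one dimension (sorted for a stable order)."""
    terms = sorted({term for doc in documents for term in doc.split()})
    return {term: index for index, term in enumerate(terms)}


def document_vector(document, vocabulary):
    """Add one unit basis vector per term occurrence in the document."""
    vector = [0] * len(vocabulary)
    for term in document.split():
        if term in vocabulary:
            vector[vocabulary[term]] += 1
    return vector


# Toy collection, assumed for illustration only.
docs = ["the cat sat", "the cat saw the dog"]
vocab = build_vocabulary(docs)
vectors = [document_vector(doc, vocab) for doc in docs]
```

Because each occurrence adds the unit vector of its term, the resulting components are raw term counts; the weighting schemes discussed next replace those counts with other values.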

How are the weights decided?

1. Binary: a term that occurs in the document is recorded as 1, and a term that does not occur as 0.

2. tf method (term frequency): record the number of times the term appears in the document.

3. tf-idf method (term frequency combined with inverse document frequency): raw term frequency has a serious problem: when relevance to a query is computed, all terms are treated as equally important. In fact, some terms have little or no power to discriminate between relevant and non-relevant documents. A straightforward remedy is to give lower weights to terms that occur in many documents of the collection.

df_t is the document frequency of term t: the number of documents in the collection that contain t.

idf_t = log(N / df_t), where N is the total number of documents.

tf-idf_{t,d} = tf_{t,d} × idf_t
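The formulas above can be tried out with a small Python sketch. It assumes documents are already tokenized into lists of terms and uses the natural logarithm, since the note does not fix a base; the function names document_frequencies and tf_idf_weights are hypothetical.

```python
import math
from collections import Counter


def document_frequencies(tokenized_docs):
    """df_t: number of documents that contain term t at least once."""
    return Counter(term for doc in tokenized_docs for term in set(doc))


def tf_idf_weights(tokenized_docs):
    """Return one {term: tf * idf} dictionary per document."""
    n_docs = len(tokenized_docs)
    df = document_frequencies(tokenized_docs)
    # Natural log is an assumption; any base only rescales the weights.
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    return [
        {term: tf * idf[term] for term, tf in Counter(doc).items()}
        for doc in tokenized_docs
    ]


# Toy collection, assumed for illustration only.
docs = [
    ["the", "cat", "sat"],
    ["the", "cat", "saw", "the", "dog"],
    ["the", "dog", "barked"],
]
weights = tf_idf_weights(docs)
```

In this toy collection "the" occurs in every document, so its idf is log(3/3) = 0 and its tf-idf weight is 0 no matter how often it appears, while a term found in a single document, such as "sat", gets the largest idf, log(3).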

How is similarity calculated?

1. Euclidean distance

2. Cosine of the angle between the vectors
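Both measures can be written directly from their definitions. The sketch below assumes dense vectors of equal length given as Python lists; returning 0.0 for a zero vector in cosine_similarity is a convention chosen here, not something stated in the note.

```python
import math


def euclidean_distance(u, v):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_similarity(u, v):
    """Cosine of the angle between u and v: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumed convention for an all-zero vector
    return dot / (norm_u * norm_v)


# doc_b has the same term proportions as doc_a but is twice as long.
doc_a = [1, 0, 1, 0, 1]
doc_b = [2, 0, 2, 0, 2]
distance = euclidean_distance(doc_a, doc_b)
similarity = cosine_similarity(doc_a, doc_b)
```

Here the two vectors point in the same direction, so the cosine is 1 (up to floating-point rounding) even though the Euclidean distance is sqrt(3). This shows the practical difference between the two: cosine ignores document length, Euclidean distance does not.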

