The idea of the VSM (vector space model)
Each document is represented as a vector in R^|V|, so that the similarity between two documents (or between a document and a query) can be computed, using either the Euclidean distance or the cosine of the angle between the vectors.
So how do you represent a document as a vector?
First, we need to choose the basis vectors (the dimensions); the basis vectors must be linearly independent, or orthogonal.
In an IR system, there are two ways to determine the basis vectors:
1. Core concepts as basis vectors: each component of the document vector records how strongly the document "tilts" toward a given concept category. However, it is difficult to determine such basis vectors in practice.
2. Terms (words) as basis vectors: each term is treated as one basis vector, under the assumption that all basis vectors are orthogonal and independent of each other. This is the approach used here.
A document's vector representation is the sum of the vectors of all the terms that appear in the document.
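As a concrete sketch of this (the toy vocabulary and document below are assumed purely for illustration), summing the one-hot term vectors is the same as counting how often each vocabulary term occurs:

    from collections import Counter

    vocabulary = ["information", "retrieval", "vector", "model"]  # assumed toy vocabulary

    def doc_to_vector(tokens, vocabulary):
        # One component per vocabulary term: the number of times it occurs in the document.
        counts = Counter(tokens)
        return [counts[term] for term in vocabulary]

    doc = "vector model for information retrieval vector".split()
    print(doc_to_vector(doc, vocabulary))  # [1, 1, 2, 1]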
How do we decide the term weights?
1. Record a term that appears in the document as 1, and a term that does not appear as 0.
2. tf method (term frequency): record how many times the term appears in the document.
3. tf-idf method (term frequency combined with inverse document frequency): raw term frequency has a serious problem: in the relevance calculation against the query, all terms are treated as equally important, even though some terms have little or no power to discriminate relevance. A straightforward remedy is to give lower weights to terms that occur in many documents.
Here df_t denotes the document frequency of term t, i.e., the number of documents in which t appears.
idf_t = log(N / df_t), where N is the total number of documents.
tf-idf_{t,d} = tf_{t,d} × idf_t
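Putting the formulas together, a minimal sketch (the three-document corpus below is assumed purely for illustration):

    import math
    from collections import Counter

    docs = [  # assumed toy corpus of N = 3 documents
        "the cat sat on the mat".split(),
        "the dog chased the cat".split(),
        "the dogs and cats are pets".split(),
    ]
    N = len(docs)

    def df(term):
        # Document frequency: the number of documents that contain the term.
        return sum(1 for d in docs if term in d)

    def tf_idf(term, doc):
        # tf-idf_{t,d} = tf_{t,d} * log(N / df_t)
        tf = Counter(doc)[term]
        return tf * math.log(N / df(term)) if df(term) else 0.0

    print(tf_idf("cat", docs[0]))  # 1 * log(3/2) ~= 0.41: "cat" occurs in 2 of the 3 documents
    print(tf_idf("the", docs[0]))  # 2 * log(3/3) = 0.0: "the" occurs in every document, so it gets no weight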
How do we compute the similarity between two vectors?
1. Euclidean distance
2. Cosine of the angle between the vectors (cosine similarity)
....
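For the two measures above, a minimal sketch (the two weight vectors are assumed for illustration; they point in the same direction but have different lengths, to show how the measures differ):

    import math

    def euclidean_distance(u, v):
        # Straight-line distance between the two weight vectors.
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

    def cosine_similarity(u, v):
        # Cosine of the angle between the two weight vectors (1.0 = same direction).
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

    d1 = [1.0, 1.0, 2.0, 0.0]
    d2 = [2.0, 2.0, 4.0, 0.0]  # same direction as d1, but twice as long
    print(euclidean_distance(d1, d2))  # ~2.45: Euclidean distance is sensitive to vector length
    print(cosine_similarity(d1, d2))   # 1.0: the cosine measure ignores length and compares only direction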