Last time we introduced an information retrieval technique, Boolean retrieval. The Boolean model solves a very important problem: finding the documents that match a user's query (this also requires a good deal of processing, such as word segmentation, normalization, and stop-word removal; here we only cover the main framework). However, the result set may contain thousands or tens of thousands of documents, which is far more than the user wants; no user will pick through tens of thousands of results. Therefore, we need to sort the results and put the documents that best meet the user's needs at the top, just as Google and Baidu do. Careful readers will notice that information retrieval is essentially a step-by-step pruning and filtering process, and what is left at the end is what the user wants.
Therefore, we need a scoring mechanism: score each document, sort by score, and return the top N documents to the user. How do we determine this score? Naturally, it is the similarity between the user's query and the returned document. There are many ways to compute similarity:
Method 1 Jaccard coefficient
This method is easy to understand: the number of words the query and the document have in common, divided by the total number of distinct words in either (the size of the intersection over the size of the union). Of course, it also has several problems:
1. It ignores how many times a word appears in the document (the tf factor is not taken into account).
2. It ignores document frequency (the idf factor is not taken into account).
3. It ignores document length, so a long document and a short document can get very different similarity scores for the same query.
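The Jaccard coefficient described above can be sketched in a few lines. This is a minimal illustration, not production retrieval code; the function name and the tokenized inputs are my own choices for the example.

```python
def jaccard(query_terms, doc_terms):
    """Jaccard coefficient: |intersection| / |union| of the two term sets."""
    q, d = set(query_terms), set(doc_terms)
    if not (q | d):  # both empty: define similarity as 0
        return 0.0
    return len(q & d) / len(q | d)

# Example: two shared terms out of three distinct terms overall.
print(jaccard(["information", "retrieval"],
              ["information", "retrieval", "model"]))  # 0.666...
```

Note that the sets discard term counts entirely, which is exactly the tf problem listed in point 1 above.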
Let's take a look at a very famous model: the vector space model.
Method 2 vector space model (VSM)
First, we will introduce two concepts: tf and idf.
Tf is term frequency: the number of times a term t appears in document d. This is a very important quantity: more occurrences suggest greater importance. However, relevance does not grow in direct proportion to the raw count, so tf is usually dampened as follows:
W1 = log10 (tf + 1)
In this way, we weaken the influence of raw counts on relevance.
Df is document frequency: the number of documents in the collection in which a term appears. Contrary to tf, a term's importance is inversely related to how often it appears across the corpus. For example, words such as "and" and "or" appear in almost all documents, so they carry very little meaning, while specialized terms that appear in only a few documents are clearly more informative. Idf (inverse document frequency) is based on the reciprocal of df, which is used for convenience.
Similarly, to weaken the frequency effect, we apply a logarithm here as well:
W2 = log10 (N/df), where N is the total number of documents and df is the number of documents in the collection that contain the term.
It must be noted that there are many variants of the tf and idf formulas, and you do not have to use the ones above. In many cases you should choose the formula based on the characteristics and size of the document collection.
With the above tf-idf as the weight, we can compute the weight of every term, represent each document as an n-dimensional vector, and represent the query as an n-dimensional vector as well; if a term does not appear in the query, its weight in the query vector is 0.
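Putting the two formulas together, here is a small sketch of building tf-idf vectors for a toy corpus, using exactly the W1 = log10(tf + 1) and W2 = log10(N/df) weightings above. The function name and data layout are illustrative assumptions.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns (vocab, one tf-idf vector per doc)."""
    N = len(docs)
    vocab = sorted({t for d in docs for t in d})
    # df: number of documents containing each term (not total occurrences)
    df = Counter(t for d in docs for t in set(d))
    idf = {t: math.log10(N / df[t]) for t in vocab}  # W2 = log10(N/df)
    vectors = []
    for d in docs:
        tf = Counter(d)
        # W1 = log10(tf + 1), multiplied by idf for each vocabulary term
        vectors.append([math.log10(tf[t] + 1) * idf[t] for t in vocab])
    return vocab, vectors

vocab, vecs = tfidf_vectors([["information", "retrieval"],
                             ["information", "model"]])
```

Notice that a term appearing in every document (here "information") gets idf = log10(1) = 0, so it contributes nothing to any vector, which matches the intuition about words like "and" and "or" above.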
From linear algebra we know that, in the same space, the smaller the angle between two vectors, the more similar (and the less independent) the two vectors are. Therefore, using the cosine of the angle between them, we can easily obtain the similarity between vectors.
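The cosine measure itself is just the dot product divided by the two vector lengths. A minimal sketch (the zero-vector guard is my own defensive choice for the example):

```python
import math

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:  # an all-zero vector has no direction
        return 0.0
    return dot / (norm_u * norm_v)

print(cosine([1.0, 0.0], [1.0, 0.0]))  # 1.0: identical direction
print(cosine([1.0, 0.0], [0.0, 1.0]))  # 0.0: orthogonal, no shared terms
```

Dividing by the lengths also normalizes away document length, which addresses problem 3 raised against the Jaccard coefficient earlier.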
The vector space model is one of the most common and important models in information retrieval. It is easy to understand, intuitive, and effective. I hope this helps.