Information Retrieval for Newbies 4: Vector Space Model and Similarity Calculation

Last Update:2018-12-06 Source: Internet

Author: User

Tags idf

Similarity is, literally, the degree to which two things resemble each other. In information retrieval, it refers to how closely two documents match each other, or how closely a document matches a query.

First, let's review the retrieval process:

1: The user enters the query terms.

2: The search engine searches for documents based on the query terms.

3: The search engine presents the results to the user in some order.

Whether a document meets the user's needs can therefore be measured by the similarity between the document and the query. The similarity always comes out as a real number, so documents can be sorted by it: documents that are highly similar to the query are listed at the top, and documents with low similarity are listed at the bottom. There are many ways to calculate similarity. For example, the TF-IDF values described in the previous article can be summed to represent the similarity between a document and a query. That method has little theoretical depth, however, so we will not discuss it further.

For any given phenomenon, researchers usually try to explain it with mathematical theory, model it, and put it on a rational footing. Mathematics is broad and deep, so the explanations differ: some researchers reach for one theory, others for another. Some succeed and some fail. Once a first-rate researcher finds a new explanation and builds a model, other researchers begin refining that model. Now let's look at one retrieval model proposed by top researchers: the vector space model. It was initially used for document classification, where the similarity between a document and the features of each category is calculated in order to assign the document to the correct category, but it can also be used in information retrieval.

The vector space model treats queries and documents as vectors in an N-dimensional space, where N is the size of the dictionary. Each dimension corresponds to one term in the dictionary, and the coordinate of a vector along each dimension can be calculated.

Let the query vector be:

Q = [q1, q2, ..., qN]

and the document vector be:

D = [d1, d2, ..., dN]

In this way, both the query Q and the document D are expressed as vectors.
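As a rough illustration of this representation, the sketch below maps a query and a document onto vectors over a small made-up dictionary. Raw term counts are used as coordinates purely to keep the example short, and the names `dictionary` and `to_vector` are hypothetical, not taken from the original article.

```python
from collections import Counter

def to_vector(text, dictionary):
    """Map text to an N-dimensional vector, one coordinate per dictionary term."""
    counts = Counter(text.lower().split())
    # Terms absent from the text get a coordinate of 0.
    return [counts[term] for term in dictionary]

# A toy dictionary of N = 5 terms (an assumption for illustration).
dictionary = ["cosine", "model", "retrieval", "space", "vector"]

query_vec = to_vector("vector space model", dictionary)
doc_vec = to_vector("the vector space model is a retrieval model", dictionary)

print(query_vec)
print(doc_vec)
```

Words that are not in the dictionary, such as "the" here, simply have no dimension and are ignored.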

So how do we calculate the similarity? Cosine similarity is commonly used here:

sim(Q, D) = cos(θ) = (Q · D) / (|Q| × |D|)

where Q · D is the dot product of the two vectors, |Q| and |D| are their lengths, and θ is the angle between them.

Cosine similarity in this model has a vivid interpretation: think of each document as a point in the N-dimensional space, and imagine the query as a ray of light shining from the origin through that space. Points close to the ray (that is, at a small angle to it) are highly similar to the query, and points far from the ray are less similar.
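A minimal sketch of cosine similarity and the resulting ranking might look like the following. The three-dimensional vectors and the document names are invented for illustration, and returning 0.0 for a zero-length vector is a convention chosen here rather than something the article specifies.

```python
import math

def cosine_similarity(q, d):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(qi * di for qi, di in zip(q, d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    norm_d = math.sqrt(sum(di * di for di in d))
    if norm_q == 0 or norm_d == 0:
        return 0.0  # assumed convention: an empty vector matches nothing
    return dot / (norm_q * norm_d)

query = [1.0, 1.0, 0.0]
documents = {
    "doc_a": [2.0, 2.0, 0.0],  # same direction as the query
    "doc_b": [1.0, 0.0, 1.0],  # shares one term with the query
    "doc_c": [0.0, 0.0, 3.0],  # shares no terms with the query
}

# Sort document names by similarity to the query, highest first.
ranked = sorted(
    documents,
    key=lambda name: cosine_similarity(query, documents[name]),
    reverse=True,
)
print(ranked)
```

Note that `doc_a` is twice as long as the query but points in the same direction, so it lies exactly on the "ray" and receives the highest score.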

How do we calculate the coordinates of the query vector and the document vector in each dimension? The TF * IDF value of the corresponding term can be used. Because a query contains very few terms, most coordinates of the query vector are 0. When the query vector is multiplied by the document vector, the document coordinates that line up with those zeros contribute nothing and can be skipped, which speeds up the computation. The length of each document vector can be precomputed and stored in the inverted index, so the whole calculation can be carried out quickly from the inverted index. Once the similarities are obtained, the documents are sorted by them and the resulting list is returned to the user.
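The process described above can be sketched roughly as follows. This is only one possible arrangement: the toy corpus, the `log(N / df)` form of IDF, raw counts as TF, and the names `index`, `doc_norm`, and `search` are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict

# Toy corpus (invented for illustration).
docs = {
    "d1": "vector space model for retrieval",
    "d2": "cosine similarity between vector pairs",
    "d3": "inverted index for fast retrieval",
}

tokenized = {doc_id: text.split() for doc_id, text in docs.items()}

# Document frequency per term, then IDF as log(N / df), one common variant.
df = Counter(term for tokens in tokenized.values() for term in set(tokens))
idf = {term: math.log(len(docs) / n) for term, n in df.items()}

# Inverted index: term -> [(doc_id, tf * idf weight)], plus precomputed vector lengths.
index = defaultdict(list)
doc_norm = {}
for doc_id, tokens in tokenized.items():
    weights = {term: tf * idf[term] for term, tf in Counter(tokens).items()}
    for term, weight in weights.items():
        index[term].append((doc_id, weight))
    doc_norm[doc_id] = math.sqrt(sum(w * w for w in weights.values()))

def search(query):
    """Rank documents by cosine similarity, touching only the query terms' postings."""
    q_weights = {
        term: tf * idf[term]
        for term, tf in Counter(query.split()).items()
        if term in idf and idf[term] > 0  # skip unknown or zero-weight terms
    }
    q_norm = math.sqrt(sum(w * w for w in q_weights.values()))
    dots = defaultdict(float)
    for term, q_weight in q_weights.items():
        for doc_id, d_weight in index[term]:
            dots[doc_id] += q_weight * d_weight
    ranked = [(doc_id, dot / (q_norm * doc_norm[doc_id])) for doc_id, dot in dots.items()]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)

results = search("vector retrieval")
for doc_id, score in results:
    print(doc_id, round(score, 4))
```

Only the posting lists of the query terms are read, so dimensions where the query coordinate is 0 never enter the computation, and the stored `doc_norm` values avoid recomputing each document's length at query time.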

That is the vector space model and cosine similarity calculation. The model has been so successful that every book I have read covers it. We will not go into further detail here about how to calculate the vector coordinates.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
