Newbie Information Retrieval 4: Vector Space Model and Similarity Calculation

Source: Internet
Author: User
Tags idf

Similarity is, literally, the degree to which two things resemble each other. In information retrieval, it refers to the similarity between two documents, or between a query and a document.

First, let's look back at the retrieval process:

1: The user enters the query terms.

2: The search engine searches for documents based on the query terms.

3: The search engine displays the results to the user in a certain order.

Therefore, whether a document meets the user's needs can be measured by the similarity between the document and the query. The similarity always comes out as a real number, so documents can be sorted by it: documents with high similarity to the query are listed at the top, and documents with low similarity at the bottom. There are many ways to calculate similarity. For example, the TF-IDF values from the previous article can be summed over the query terms to represent the similarity between a document and the query. This method has little theoretical depth, however, so we will not discuss it here.

Researchers often try to use mathematical theory to explain a phenomenon, model it, and put it on a rational footing. Mathematics is broad and deep, so the explanations differ: some researchers reach for one theory, others for another. Some succeed and some fail. When a first-rate researcher finds a new explanation and establishes a model, third-rate researchers begin patching it. Now let's talk about a retrieval model proposed by top researchers: the vector space model. It was initially used for document classification, where similarity is calculated between documents and category features to assign each document to the correct category, but it can also be used in information retrieval.

The vector space model treats queries and documents as vectors in an N-dimensional space, where N is the size of the dictionary. Each dimension corresponds to one word in the dictionary, and the coordinate of a vector along each dimension can be calculated.

Let the query vector be:

Q = [q1, q2, ..., qn];

and the document vector be:

D = [d1, d2, ..., dn];

In this way, both the query Q and the document D are expressed as vectors in the same space.
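As a rough illustration of this representation, the sketch below maps a tiny made-up corpus onto vectors. The helper names `build_vocabulary` and `to_vector` are hypothetical, the tokenizer is a plain whitespace split, and the coordinates are raw term counts rather than the weights a real system would use.

```python
def build_vocabulary(texts):
    """Assign each distinct word a dimension index (sorted for determinism)."""
    words = sorted({w for text in texts for w in text.lower().split()})
    return {w: i for i, w in enumerate(words)}


def to_vector(text, vocabulary):
    """Return an n-dimensional list of raw term counts for the text."""
    vec = [0] * len(vocabulary)
    for w in text.lower().split():
        if w in vocabulary:  # words outside the dictionary are ignored
            vec[vocabulary[w]] += 1
    return vec


# Made-up example corpus; the dictionary is built from the documents.
docs = ["the cat sat on the mat", "the dog chased the cat"]
vocab = build_vocabulary(docs)

D = to_vector(docs[0], vocab)   # document vector
Q = to_vector("cat mat", vocab)  # query vector in the same space
print(D)
print(Q)
```

Note that the query vector has the same length as the document vector even though the query contains only two words; every other coordinate is simply 0.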

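So how do we calculate the similarity? Cosine similarity is commonly used here:

sim(Q, D) = cos(θ) = (Q · D) / (|Q| × |D|)

Here Q · D is the dot product of the two vectors (the sum of the products of their corresponding coordinates), |Q| and |D| are the lengths of the vectors, and θ is the angle between them.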

Cosine similarity in this model has a very vivid interpretation: think of each document as a point in an n-dimensional space, and imagine the query as a beam of light shining from the origin through that space. Points that lie at a small angle to the beam are highly similar to the query, and points that lie at a large angle to the beam are less similar.
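The beam picture can be checked numerically. The following is a minimal sketch of cosine similarity for dense vectors; the function name and the sample coordinates are made up for illustration, and returning 0.0 for a zero-length vector is an assumed convention, not something the model prescribes.

```python
import math


def cosine_similarity(q, d):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(qi * di for qi, di in zip(q, d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    norm_d = math.sqrt(sum(di * di for di in d))
    if norm_q == 0 or norm_d == 0:
        return 0.0  # assumed convention for an all-zero vector
    return dot / (norm_q * norm_d)


query = [1.0, 1.0]   # the "beam" from the origin
near = [2.0, 2.1]    # almost the same direction as the beam
far = [0.1, 3.0]     # points well away from the beam

print(cosine_similarity(query, near))
print(cosine_similarity(query, far))
```

Only the direction matters: a point such as [2, 4] lies exactly on the beam through [1, 2], so the two have similarity 1 regardless of how far the point is from the origin.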

How do we calculate the coordinate of each dimension of the query vector and the document vector? The TF * IDF value of the corresponding word can be used. Because a query contains very few words, most coordinates of the query vector are 0. When the query vector is multiplied by a document vector, the document coordinates for words that are not in the query are multiplied by 0 and drop out of the dot product, which can be used to speed up the computation. The length of each document vector can be precomputed and stored in the inverted index, so everything needed can be obtained quickly from the inverted index. Once the similarities are computed, the documents are sorted by similarity and the resulting list is returned to the user.
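The sketch below puts these pieces together under several assumptions: the helpers `build_index` and `search` are hypothetical names, tokenization is a whitespace split, and IDF is taken as log(N / df), which is only one of several common variants. The index stores TF * IDF weights per term plus a precomputed length per document, and the search touches only the postings of the query terms.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """Build TF*IDF postings and precomputed document vector lengths."""
    n_docs = len(docs)
    term_counts = [Counter(text.lower().split()) for text in docs]
    doc_freq = Counter(term for counts in term_counts for term in counts)
    # Assumed IDF variant: log(N / df).
    idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}
    postings = defaultdict(list)  # term -> [(doc_id, weight), ...]
    squared = [0.0] * n_docs
    for doc_id, counts in enumerate(term_counts):
        for term, tf in counts.items():
            weight = tf * idf[term]
            if weight > 0:  # zero weights contribute nothing
                postings[term].append((doc_id, weight))
                squared[doc_id] += weight * weight
    norms = [math.sqrt(x) for x in squared]
    return postings, idf, norms


def search(query, postings, idf, norms):
    """Rank documents by cosine similarity, visiting query terms only."""
    q_counts = Counter(query.lower().split())
    q_weights = {t: tf * idf[t] for t, tf in q_counts.items() if t in idf}
    q_norm = math.sqrt(sum(w * w for w in q_weights.values()))
    dots = defaultdict(float)
    for term, q_weight in q_weights.items():
        for doc_id, d_weight in postings.get(term, []):
            dots[doc_id] += q_weight * d_weight  # sparse dot product
    ranked = [(doc_id, dot / (q_norm * norms[doc_id]))
              for doc_id, dot in dots.items()]
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked


# Made-up example corpus.
docs = ["apple banana apple", "banana cherry", "cherry durian durian"]
postings, idf, norms = build_index(docs)
results = search("apple banana", postings, idf, norms)
for doc_id, score in results:
    print(doc_id, round(score, 3))
```

Documents that share no term with the query never appear in any visited postings list, so they are skipped entirely rather than scored as 0.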

This is the vector space model and cosine similarity calculation. The model is so successful that every book I have read discusses it. We will not go into further detail here about how to calculate the coordinates of the vectors.
