Machine Learning Foundations 5 -- Document Similarity Retrieval and Measurement Algorithms


Use case: recommending similar articles while the user is reading one.

This use case is simple and crude, but when I'm reading novels and run out of things to read, I really wish I had a feature like this. (PS: I now work at a fiction company.)

So, how do you measure the similarity between articles?

Before we begin, a quick word about Elasticsearch.

Elasticsearch uses a structure called an inverted index: each document is split into individual words, and for each word we record which documents it appears in. See Wikipedia for a detailed explanation.
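As a rough sketch of the idea (a toy index over two tiny documents, not Elasticsearch's actual implementation):

```python
# Build a toy inverted index: word -> set of document ids containing it.
docs = {
    0: "Carlos calls the sport Futbol",
    1: "Emily calls the sport soccer",
}

inverted_index = {}
for doc_id, text in docs.items():
    for word in text.lower().split():
        inverted_index.setdefault(word, set()).add(doc_id)

print(inverted_index["calls"])   # {0, 1}
print(inverted_index["futbol"])  # {0}
```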

Here we will use an approach similar to the inverted index: the bag-of-words model.

We have the following sentence.

"Carlos calls the sport Futbol. Emily calls the sport soccer. "

Word:   Carlos  the  tree  calls  sport  cat  Futbol  dog  soccer  Emily  ...
Count:  1       2    0     2      2      0    1       0    1       1      ...

We ignore word order and simply record each document as a vector of counts over the corpus vocabulary.
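A minimal sketch of building such a bag-of-words count in Python (the tokenization here, lowercasing and stripping periods, is a simplifying assumption):

```python
from collections import Counter

def bag_of_words(text):
    """Lowercase, strip periods, and count word occurrences; word order is discarded."""
    words = text.lower().replace(".", "").split()
    return Counter(words)

counts = bag_of_words("Carlos calls the sport Futbol. Emily calls the sport soccer.")
print(counts["calls"])   # 2
print(counts["futbol"])  # 1
print(counts["tree"])    # 0 -- Counter returns 0 for words that never appear
```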

Suppose we have two articles whose word counts come out as follows:

1 0 0 0 5 3 0 0 1 0 0 0 0

3 0 0 0 2 0 0 1 0 1 0 0 0

So, how do we judge the similarity of these two articles?

We compute a similarity value using the vector dot product.

1*3 + 0*0 + 0*0 ... + 5*2 + ... = 13

We calculated a similarity of 13.

Now let's compute the same thing for another pair of articles:

1 0 0 0 5 3 0 0 1 0 0 0 0

0 0 1 0 0 0 9 0 0 6 0 4 0

0 + 0 + 0 + ... = 0

The similarity comes out to 0: the two articles share no words at all.
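A quick sketch of this dot-product similarity, using the example vectors above:

```python
def dot(a, b):
    """Dot product of two equal-length count vectors."""
    return sum(x * y for x, y in zip(a, b))

doc1 = [1, 0, 0, 0, 5, 3, 0, 0, 1, 0, 0, 0, 0]
doc2 = [3, 0, 0, 0, 2, 0, 0, 1, 0, 1, 0, 0, 0]
doc3 = [0, 0, 1, 0, 0, 0, 9, 0, 0, 6, 0, 4, 0]

print(dot(doc1, doc2))  # 13 -- shared vocabulary
print(dot(doc1, doc3))  # 0  -- no words in common
```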

Problem:

Now let's double each article (repeat its content twice) and see what happens.

Originally:

1 0 0 0 5 3 0 0 1 0 0 0 0

3 0 0 0 2 0 0 1 0 1 0 0 0

Similarity = 13

After doubling:

2 0 0 0 10 6 0 0 2 0 0 0 0

6 0 0 0 4 0 0 2 0 2 0 0 0

Similarity = 52

We only doubled the length of each article, yet the similarity jumped from 13 to 52. The longer the articles, the more pronounced this effect becomes.

So, what should be done to solve this problem?

Normalization of vectors

By normalizing the vectors, articles of different lengths are put on an equal footing, and the problem above goes away.

Calculating the vector norm:

Take the sum of the squares of the elements, then take the square root:

||v|| = sqrt(v1^2 + v2^2 + ... + vn^2)

To normalize a vector, divide each of its elements by this norm.
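A minimal sketch of normalized similarity (this is cosine similarity: divide the dot product by the two norms). Doubling both documents no longer changes the result:

```python
import math

def norm(v):
    """Euclidean norm: square root of the sum of the squared elements."""
    return math.sqrt(sum(x * x for x in v))

def normalized_dot(a, b):
    """Dot product of the two vectors after dividing each by its norm."""
    return sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b))

doc1 = [1, 0, 0, 0, 5, 3, 0, 0, 1, 0, 0, 0, 0]
doc2 = [3, 0, 0, 0, 2, 0, 0, 1, 0, 1, 0, 0, 0]

print(normalized_dot(doc1, doc2))  # ~0.5594
# Doubling both documents leaves the similarity unchanged:
print(normalized_dot([2 * x for x in doc1], [2 * x for x in doc2]))  # ~0.5594
```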


Distinguishing common words from uncommon words, and increasing the importance of the uncommon ones:

Common words such as: "The", "Player", "field", "goal"

Uncommon words such as: "Futbol", "Messi"

Why should we increase the importance of uncommon words?

It's easy to understand: uncommon words are often exactly what describes the uniqueness of a text.

So what should we do:

Words that appear in only a few documents of the corpus are the uncommon ones. We increase the weights of these words, which is equivalent to emphasizing words that appear in only part of the corpus.

At the same time, each word's weight is reduced according to the number of documents in the corpus in which it appears.

We call this "locally common but globally rare": what we are looking for is a balance between a word's local frequency and its global rarity.

TF-IDF (Term Frequency--Inverse Document Frequency):

TF (term frequency) counts the number of occurrences of a word in the document.

IDF (inverse document frequency) scales that count down according to how many documents contain the word.

IDF is calculated as follows:

IDF(word) = log( #documents / (1 + #documents containing the word) )

Why use this formula?

As the formula shows:

When the number of documents containing a word grows toward the size of the corpus, the ratio approaches 1, so the IDF approaches log 1 = 0.

When only a few documents contain the word, the ratio stays large, so the IDF approaches log(#documents), a large value.

Example: in a corpus of 64 documents, the word "the" appears 1000 times across 63 documents, and "Messi" appears 5 times across 3 documents. The logarithm base is 2.

the: log2(64 / (1 + 63)) = log2(1) = 0

Messi: log2(64 / (1 + 3)) = log2(16) = 4

Then TF * IDF:

the: 1000 * 0 = 0

Messi: 5 * 4 = 20
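A small sketch that reproduces these numbers (base-2 logarithm, corpus of 64 documents; the counts come from the example above, not real data):

```python
import math

def idf(total_docs, docs_with_word):
    """IDF = log2(total_docs / (1 + number of documents containing the word))."""
    return math.log2(total_docs / (1 + docs_with_word))

# "the": 1000 occurrences, present in 63 of 64 documents
print(1000 * idf(64, 63))  # 1000 * log2(64/64) = 0.0

# "Messi": 5 occurrences, present in 3 of 64 documents
print(5 * idf(64, 3))      # 5 * log2(64/4) = 5 * 4 = 20.0
```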

Finally, we need a distance function that measures similarity. With it:

1. We can compute the similarity between this article and every other article, and return the single best match.

2. We can compute the similarity between this article and every other article, and return the K most relevant results (K-nearest-neighbor search), as sketched below.
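A brute-force K-nearest-neighbor sketch over count (or TF-IDF) vectors, using cosine similarity; the function names here are illustrative, not from the course:

```python
import math

def cosine_similarity(a, b):
    """Normalized dot product: 1.0 = same direction, 0.0 = no shared words."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def k_nearest(query_vec, corpus_vecs, k):
    """Score every document against the query and return the top k, best first."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(corpus_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

corpus = [
    [3, 0, 0, 0, 2, 0, 0, 1, 0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 9, 0, 0, 6, 0, 4, 0],
]
query = [1, 0, 0, 0, 5, 3, 0, 0, 1, 0, 0, 0, 0]
print(k_nearest(query, corpus, k=2))  # [(0, 0.559...), (1, 0.0)]
```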

End

Lesson: Machine Learning Foundations: A Case Study Approach (University of Washington)

Video link: https://www.coursera.org/learn/ml-foundations/lecture/EPR3S/clustering-documents-task-overview

Week 4: Algorithms for retrieval and measuring similarity of documents
