Use case: recommending similar articles while someone is reading an article.
This use case is simple and rough. Especially when I'm reading novels and run out of things to read, I really wish I had such a feature. (PS: I now work for a fiction company.)
So, how do we measure the similarity between articles?
Before we start, a word about Elasticsearch.
The index Elasticsearch uses is called an inverted index: each document is split into individual words, and we record which documents every word appears in. See Wikipedia for a detailed explanation.
Here, we will use an approach similar to an inverted index: the bag-of-words model.
We have the following sentence.
"Carlos calls the sport Futbol. Emily calls the sport soccer. "
| Carlos | The | Tree | Calls | Sport | Cat | Futbol | Dog | Soccer | Emily | ... |
|--------|-----|------|-------|-------|-----|--------|-----|--------|-------|-----|
| 1      | 2   | 0    | 2     | 2     | 0   | 1      | 0   | 1      | 1     | ... |
We ignore the word order and simply record how many times each word from the corpus vocabulary appears.
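As a minimal sketch (my own illustration, not code from the course), such a word count can be built in a few lines of Python:

```python
from collections import Counter
import re

# Split the sentence into lowercase words and count how often each one appears.
sentence = "Carlos calls the sport Futbol. Emily calls the sport soccer."
words = re.findall(r"[a-z]+", sentence.lower())
counts = Counter(words)

print(counts)
# Counter({'calls': 2, 'the': 2, 'sport': 2, 'carlos': 1, 'futbol': 1, 'emily': 1, 'soccer': 1})
```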
Suppose I have 2 articles whose word counts come out as follows:
1 0 0 0 5 3 0 0 1 0 0 0 0
3 0 0 0 2 0 0 1 0 1 0 0 0
So, how do we judge how similar these 2 articles are?
We calculate a similarity score using the vector dot product:
1*3 + 0*0 + 0*0 + ... + 5*2 + ... = 13
We get a similarity of 13.
Now let's compare against a different article:
1 0 0 0 5 3 0 0 1 0 0 0 0
0 0 1 0 0 0 9 0 0 6 0 4 0
0 + 0 + 0 + ... = 0
This time the similarity comes out as 0.
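Both comparisons above can be reproduced with a plain dot product; here is a sketch (the `dot_similarity` helper is my own name for it):

```python
def dot_similarity(a, b):
    """Sum of the element-wise products of two equal-length count vectors."""
    return sum(x * y for x, y in zip(a, b))

article_1 = [1, 0, 0, 0, 5, 3, 0, 0, 1, 0, 0, 0, 0]
article_2 = [3, 0, 0, 0, 2, 0, 0, 1, 0, 1, 0, 0, 0]
article_3 = [0, 0, 1, 0, 0, 0, 9, 0, 0, 6, 0, 4, 0]

print(dot_similarity(article_1, article_2))  # 13
print(dot_similarity(article_1, article_3))  # 0
```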
Problem:
If we double the length of both articles, see what happens.
Originally:
1 0 0 0 5 3 0 0 1 0 0 0 0
3 0 0 0 2 0 0 1 0 1 0 0 0
Similarity = 13
Doubled:
2 0 0 0 10 6 0 0 2 0 0 0 0
6 0 0 0 4 0 0 2 0 2 0 0 0
Similarity = 52
We only doubled the articles, yet the similarity jumped from 13 to 52 (4 times larger). The longer the articles, the more pronounced this effect becomes.
So, what should be done to solve this problem?
Normalization of vectors
With vector normalization, articles of different lengths are put on an equal footing, and the problem above disappears.
Calculating the vector norm:
Square each element, sum the squares, and take the square root of the total (the Euclidean norm).
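A small sketch of this idea, assuming the Euclidean norm and plain Python lists (the helper names are mine): after normalization, the doubled articles give exactly the same similarity as the originals.

```python
import math

def norm(v):
    # Square each element, sum the squares, then take the square root.
    return math.sqrt(sum(x * x for x in v))

def normalize(v):
    n = norm(v)
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

a = [1, 0, 0, 0, 5, 3, 0, 0, 1, 0, 0, 0, 0]
b = [3, 0, 0, 0, 2, 0, 0, 1, 0, 1, 0, 0, 0]

# The similarity no longer changes when the articles are doubled.
print(round(dot(normalize(a), normalize(b)), 4))
print(round(dot(normalize([2 * x for x in a]), normalize([2 * x for x in b])), 4))
# Both print the same value (about 0.5594).
```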
Distinguish common words from uncommon words, and increase the importance of the uncommon words:
Common words such as: "The", "Player", "field", "goal"
Uncommon words such as: "Futbol", "Messi"
Why should we increase the importance of uncommon words?
It's easy to understand: uncommon words often capture what makes a text unique.
So what should we do?
Increase the weights of words that are rare across the corpus; this is equivalent to emphasizing words that appear in only part of the document collection.
At the same time, reduce each word's weight according to how many documents in the corpus it appears in.
We call this "locally common but globally rare": what we are looking for is a balance between local frequency and global rarity.
TF-IDF (Term Frequency - Inverse Document Frequency):
TF counts the number of times a word appears in the article.
IDF is used to scale that count down according to how many documents the word appears in.
IDF is calculated as follows:
IDF = log( #documents in the corpus / (1 + #documents containing the word) )
Why use this formula?
As the formula shows:
When the word appears in almost every document, the ratio approaches 1, and log 1 = 0.
When the word appears in only a few documents, the ratio is large, and so is its logarithm.
For example: in a corpus of 64 documents, the word "the" appears 1000 times in this article and occurs in 63 of the documents, while "Messi" appears 5 times in this article and occurs in 3 of the documents. The log base is 2.
the: log2(64 / (1 + 63)) = 0
Messi: log2(64 / (1 + 3)) = 4
Then TF * IDF:
the: 1000 * 0 = 0
Messi: 5 * 4 = 20
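The same arithmetic as a small Python sketch with base-2 logarithms (the variable and function names are my own):

```python
import math

total_docs = 64

def idf(docs_containing_word):
    # log base 2, as in the example above
    return math.log2(total_docs / (1 + docs_containing_word))

# "the": term frequency 1000, appears in 63 of the 64 documents
print(1000 * idf(63))  # 1000 * log2(64 / 64) = 0.0

# "Messi": term frequency 5, appears in 3 of the 64 documents
print(5 * idf(3))      # 5 * log2(64 / 4) = 5 * 4 = 20.0
```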
We need a distance function to measure similarity. Then:
1. We can calculate the similarity between this article and every other article and return the single best result.
2. We can calculate the similarity between this article and every other article and return the K most relevant results (K-nearest-neighbour search); a sketch follows below.
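A minimal sketch of such a K-nearest-neighbour lookup over count vectors (using the dot product as the similarity score and a `nearest` helper of my own; not necessarily what the course implements):

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def nearest(query, corpus, k=1):
    """Return the indices of the k corpus vectors most similar to the query."""
    ranked = sorted(range(len(corpus)), key=lambda i: dot(query, corpus[i]), reverse=True)
    return ranked[:k]

corpus = [
    [3, 0, 0, 0, 2, 0, 0, 1, 0, 1, 0, 0, 0],   # article 2
    [0, 0, 1, 0, 0, 0, 9, 0, 0, 6, 0, 4, 0],   # article 3
]
query = [1, 0, 0, 0, 5, 3, 0, 0, 1, 0, 0, 0, 0]  # article 1

print(nearest(query, corpus, k=1))  # [0] -> article 2 is the best match
print(nearest(query, corpus, k=2))  # [0, 1] -> the 2 most relevant results
```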
End
Course: Machine Learning Foundations: A Case Study Approach (University of Washington)
Video link: https://www.coursera.org/learn/ml-foundations/lecture/EPR3S/clustering-documents-task-overview
Week 4: Algorithms for retrieval and measuring similarity of documents
Machine Learning Foundations 5: document similarity retrieval and measurement algorithms