The sample code for calculating similarity in PHP is as follows:
<?php
// Cosine similarity of two word frequency vectors:
// dot product divided by the product of the vector lengths.
function similarity(array $vec1, array $vec2) {
    return dotProduct($vec1, $vec2) / (absVector($vec1) * absVector($vec2));
}

// Dot product over the keys the two vectors share.
function dotProduct(array $vec1, array $vec2) {
    $result = 0;
    foreach (array_keys($vec1) as $key1) {
        foreach (array_keys($vec2) as $key2) {
            if ($key1 === $key2) {
                $result += $vec1[$key1] * $vec2[$key2];
            }
        }
    }
    return $result;
}

// Euclidean length (magnitude) of a vector.
function absVector(array $vec) {
    $result = 0;
    foreach (array_values($vec) as $value) {
        $result += $value * $value;
    }
    return sqrt($result);
}

// Word frequency vectors
$v1 = array('ours' => 5, 'design' => 2, 'one' => 1, 'algorithmic' => 0, 'any' => 0, 'similarity' => 1);
$v2 = array('let' => 5, 'design' => 0, 'one' => 3, 'algorithmic' => 0, 'any' => 0, 'similarity' => 1);

// Calculate similarity; a greater value indicates a higher degree of similarity.
$result1 = similarity($v1, $v2);
var_dump($result1);
The code above is only an example; it does not yet explain the algorithm behind it. The following two articles cover that background.
Suppose there is a long article and we want a computer to extract its keywords (automatic keyphrase extraction) without any manual intervention. How can this be done?
This problem touches on several cutting-edge areas of computer science, such as data mining, text processing, and information retrieval. Surprisingly, though, there is a very simple classical algorithm that gives satisfactory results. It needs no advanced mathematics, and an ordinary person can understand it in ten minutes. This is the TF-IDF algorithm, which I want to introduce today.
Let's start with an example. Suppose there is a long article, "Bee Farming in China", and we want to use a computer to extract its keywords.
An obvious first idea is to find the words that appear most often: if a word is important, it should appear many times in the article. So we count each word's Term Frequency (TF).
As you might guess, the most frequent words turn out to be "the", "is", "in", and the like. These are called "stop words": words that are of no help in finding results and must be filtered out.
Suppose we filter them out and consider only the remaining, meaningful words. We then run into another problem: we may find that "China", "bee", and "farming" all appear the same number of times. Does that mean that, as keywords, they are equally important?
Obviously not. "China" is a very common word, while "bee" and "farming" are relatively uncommon. If the three words appear equally often in an article, it is reasonable to conclude that "bee" and "farming" are more important than "China"; that is, in the keyword ranking, "bee" and "farming" should come before "China".
Therefore, we need an importance adjustment coefficient that measures whether a word is a common word.
If a word is rare but appears many times in this article, it probably reflects what the article is about, which is exactly what we are looking for.
Expressed in statistical terms: on top of the term frequency, each word is assigned an "importance" weight. The most common words ("the", "is", "in") get the smallest weight; fairly common words ("China") get a smaller weight; relatively rare words ("bee", "farming") get a larger weight. This weight is called the Inverse Document Frequency (IDF), and its size is inversely related to how common a word is.
Once we know the term frequency (TF) and the inverse document frequency (IDF), we multiply the two to get a word's TF-IDF value. The more important a word is to an article, the larger its TF-IDF value. The top few words by TF-IDF are therefore the keywords of the article.
The following is the details of this algorithm.
Step 1: calculate the word frequency.
Because articles differ in length, the term frequency should be normalized so that different articles can be compared:

Term Frequency (TF) = (occurrences of the word in the article) / (total number of words in the article)

or

Term Frequency (TF) = (occurrences of the word in the article) / (occurrences of the article's most frequent word)
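As a minimal sketch (my addition, not part of the original article), the normalized term frequency can be computed in PHP as follows; it assumes the article has already been segmented into an array of words:

<?php
// Relative term frequency: occurrences of each word divided by the
// total number of words in the article (a sketch; names are illustrative).
function termFrequency(array $words) {
    $counts = array_count_values($words); // word => number of occurrences
    $total = count($words);
    $tf = array();
    foreach ($counts as $word => $count) {
        $tf[$word] = $count / $total;
    }
    return $tf;
}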
Step 2: calculate the inverse document frequency.
To compute it, we need a corpus to model the language environment:

Inverse Document Frequency (IDF) = log( (total number of documents in the corpus) / (number of documents containing the word + 1) )

The more common a word is, the larger this denominator, and the smaller the IDF, the closer it is to 0. The denominator has 1 added to it to avoid division by zero (the case where no document contains the word). log means taking the logarithm of the quotient.
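A matching sketch for the IDF formula above; the function name and the choice of a base-10 logarithm are assumptions of this example, not requirements of the algorithm:

<?php
// Inverse document frequency. $totalDocs: number of documents in the corpus;
// $docsWithWord: documents containing the word (+1 avoids division by zero).
function inverseDocumentFrequency($totalDocs, $docsWithWord) {
    return log($totalDocs / ($docsWithWord + 1), 10);
}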
Step 3: calculate the TF-IDF.

TF-IDF = Term Frequency (TF) × Inverse Document Frequency (IDF)

As you can see, TF-IDF is proportional to how often a word occurs in the document and inversely related to how often it occurs across the whole corpus. So the algorithm for automatic keyword extraction is clear: compute the TF-IDF value of every word in the document, sort in descending order, and take the top few words.
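Putting the two together, here is one possible keyword extractor, a sketch building on the two helper functions above; $documentFrequency, a map from each word to the number of corpus documents containing it, is an assumed input:

<?php
// TF-IDF score for every word in a document, highest first.
function tfIdf(array $tf, array $documentFrequency, $totalDocs) {
    $scores = array();
    foreach ($tf as $word => $frequency) {
        $df = isset($documentFrequency[$word]) ? $documentFrequency[$word] : 0;
        $scores[$word] = $frequency * inverseDocumentFrequency($totalDocs, $df);
    }
    arsort($scores); // sort descending: the top entries are the keyword candidates
    return $scores;
}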
Take "Bee Farming in China" as an example. Suppose the article is 1,000 words long and "China", "bee", and "farming" each appear 20 times, so the term frequency (TF) of all three words is 0.02. Now suppose a Google search finds 25 billion web pages containing the word "the"; take that as the total number of pages in the corpus. Suppose further that 6.23 billion pages contain "China", 48.4 million contain "bee", and 97.3 million contain "farming". Their inverse document frequencies (IDF, base-10 logarithm) and TF-IDF values are then:

Word      Pages containing it   IDF     TF-IDF
China     6.23 billion          0.603   0.0121
bee       48.4 million          2.713   0.0543
farming   97.3 million          2.410   0.0482
As the table above shows, "bee" has the highest TF-IDF value, "farming" comes second, and "China" comes last. (If you also calculated the TF-IDF of the word "the", it would be a value extremely close to 0.) So if we had to pick a single word, "bee" would be the keyword of this article.
Besides automatic keyword extraction, the TF-IDF algorithm is useful in many other places. For example, in information retrieval, for each document you can compute the TF-IDF of every search term ("China", "bee", "farming") and add them up to get the document's TF-IDF score for the query. The document with the highest score is the one most relevant to the search terms.
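A sketch of the scoring step just described (my illustration, not a fixed API): sum each query word's TF-IDF value in a document, then rank the documents by that sum:

<?php
// Relevance of one document to a query: the sum of the query words'
// TF-IDF values in that document ($tfIdfScores as returned by tfIdf above).
function queryScore(array $queryWords, array $tfIdfScores) {
    $score = 0;
    foreach ($queryWords as $word) {
        if (isset($tfIdfScores[$word])) {
            $score += $tfIdfScores[$word];
        }
    }
    return $score;
}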
The advantages of the TF-IDF algorithm are that it is simple and fast, and its results match reality fairly well. The drawback is that it measures a word's importance purely by word frequency, which is not comprehensive: sometimes important words appear only a few times. Moreover, the algorithm cannot reflect a word's position; words at the front of the document and words at the back are treated as equally important, which is wrong. (One possible fix is to give greater weight to the first paragraph of the text and to the first sentence of each paragraph.)
Now for the second question: how do we measure the similarity of two pieces of text? For the sake of simplicity, let's start with sentences.
Sentence A: I like watching TV, and do not like watching movies.
Sentence B: I do not like watching TV, and also do not like watching movies.
How can we calculate the similarity of the two sentences above?
The basic idea is: the more similar the wording of two sentences, the more similar the sentences themselves. Therefore, we can start from word frequency and calculate their similarity.
Step 1: Word segmentation.
Sentence A: I / like / watch / TV, not / like / watch / movie.
Sentence B: I / not / like / watch / TV, also / not / like / watch / movie.
Step 2: list all words.
I, like, watch, TV, movie, not, also.
Step 3: calculate the word frequency.
Sentence A: I 1, like 2, watch 2, TV 1, movie 1, not 1, also 0.
Sentence B: I 1, like 2, watch 2, TV 1, movie 1, not 2, also 1.
Step 4: write out the word frequency vector.
Sentence A: [1, 2, 2, 1, 1, 1, 0]
Sentence B: [1, 2, 2, 1, 1, 2, 1]
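These vectors can be produced mechanically. A sketch in PHP (my addition; $vocabulary is the merged word list from step 2 and $words is one segmented sentence):

<?php
// Build a word frequency vector over a fixed vocabulary.
function frequencyVector(array $vocabulary, array $words) {
    $counts = array_count_values($words); // word => number of occurrences
    $vector = array();
    foreach ($vocabulary as $word) {
        $vector[$word] = isset($counts[$word]) ? $counts[$word] : 0;
    }
    return $vector;
}

$vocabulary = array('I', 'like', 'watch', 'TV', 'movie', 'not', 'also');
$sentenceA  = array('I', 'like', 'watch', 'TV', 'not', 'like', 'watch', 'movie');
var_dump(array_values(frequencyVector($vocabulary, $sentenceA))); // [1, 2, 2, 1, 1, 1, 0]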
Here, the question is how to calculate the similarity between the two vectors.
We can think of them as two line segments in space, both starting from the origin ([0, 0, ...]) and pointing in different directions. The two segments form an angle. If the angle is 0 degrees, they point the same way and the segments coincide; if it is 90 degrees, they form a right angle and the directions are completely dissimilar; if it is 180 degrees, the directions are exactly opposite.
Therefore, we can judge how similar two vectors are by their angle: the smaller the angle, the more similar the vectors.
Take two-dimensional space as an example. Suppose a and b are two vectors and we want to find their angle θ. The law of cosines tells us that it can be computed with the following formula, where a and b now denote the lengths of the two vectors and c the length of the segment joining their endpoints:

cos θ = (a² + b² - c²) / (2ab)
If vector a is [x1, y1] and vector b is [x2, y2], the law of cosines can be rewritten in the following form:

cos θ = (x1·x2 + y1·y2) / ( √(x1² + y1²) × √(x2² + y2²) )
Mathematicians have proved that this cosine formula also holds for n-dimensional vectors. Suppose A and B are two n-dimensional vectors, A = [A1, A2, ..., An] and B = [B1, B2, ..., Bn]. Then the cosine of the angle θ between A and B equals:

cos θ = (A1·B1 + A2·B2 + ... + An·Bn) / ( √(A1² + A2² + ... + An²) × √(B1² + B2² + ... + Bn²) )
Using this formula, we can obtain the cosine of the angle between sentence A and sentence B.
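Plugging the two word frequency vectors from above into the n-dimensional formula (a worked calculation added here for clarity):

cos θ = (1×1 + 2×2 + 2×2 + 1×1 + 1×1 + 1×2 + 0×1) / ( √(1²+2²+2²+1²+1²+1²+0²) × √(1²+2²+2²+1²+1²+2²+1²) )
      = 13 / ( √12 × √16 )
      ≈ 0.938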
The closer the cosine value is to 1, the closer the angle is to 0 degrees, and the more similar the two vectors; this is called "cosine similarity". So sentence A and sentence B above are very similar: in fact, their angle is only about 20.3 degrees.
From this we get an algorithm for "finding similar articles":
(1) Use the TF-IDF algorithm to find the keywords of each of the two articles.
(2) Take a number of keywords from each article (say, 20), merge them into a single set, and compute each article's word frequency over that set (to offset differences in article length, relative term frequency can be used).
(3) Generate the word frequency vectors of the two articles.
(4) Calculate the cosine similarity of the two vectors; the larger the value, the more similar the articles. (A PHP sketch of this whole pipeline follows below.)
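As promised, a minimal sketch of the whole pipeline. It reuses the similarity() function from the PHP example at the top of this article and the frequencyVector() sketch above; the inputs (each article's keyword list and token list) are assumptions of this example:

<?php
// Similarity of two articles from their keywords and their full token lists.
// $keywords1/$keywords2: top keywords of each article (e.g. from a TF-IDF
// extractor such as the tfIdf() sketch earlier); $tokens1/$tokens2: the
// articles as arrays of words.
function articleSimilarity(array $keywords1, array $tokens1,
                           array $keywords2, array $tokens2) {
    // (2) merge the two keyword sets into one vocabulary
    $vocabulary = array_unique(array_merge($keywords1, $keywords2));

    // (3) build each article's word frequency vector over that vocabulary
    $vec1 = frequencyVector($vocabulary, $tokens1);
    $vec2 = frequencyVector($vocabulary, $tokens2);

    // (4) cosine similarity: the larger the value, the more similar the articles
    return similarity($vec1, $vec2);
}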
Cosine similarity is a very useful algorithm. It can be used to calculate the similarity between two vectors.
Next time, I want to talk about how to automatically generate the abstract of an article based on word frequency statistics.