http://www.ruanyifeng.com/blog/2013/03/tf-idf.html
The Application of TF-IDF and Cosine Similarity (Part 1): Automatic Keyword Extraction
http://www.ruanyifeng.com/blog/2013/03/cosine_similarity.html
The Application of TF-IDF and Cosine Similarity (Part 2): Finding Similar Articles
http://www.ruanyifeng.com/blog/2013/03/automatic_summarization.html
The Application of TF-IDF and Cosine Similarity (Part 3): Automatic Summarization
Ruan Yifeng
Date: March 15, 2013
The title looks complicated, but what I want to discuss is a very simple problem.
Suppose there is a very long article, and I want the computer to extract its keywords (automatic keyphrase extraction), completely without human intervention. How can this be done correctly?
This problem touches on data mining, text processing, information retrieval and many other frontier areas of computer science. Yet, surprisingly, there is a very simple classical algorithm that gives quite satisfactory results. It is simple enough not to need advanced mathematics; an average person can understand it in ten minutes. This is the TF-IDF algorithm I want to introduce today.
Let's start with an example. Suppose we have a long article, "Chinese Bee Farming", and we want the computer to extract its keywords.
An idea that comes to mind easily is to find the words that occur most often. If a word is important, it should appear many times in the article. So we start with "term frequency" (TF) statistics.
As you must have guessed, the most frequent words turn out to be the commonest function words: "的" (of), "是" (is), "在" (in). They are called "stop words": words that do not help us find the result and must be filtered out.
Suppose we filter them all out and consider only the remaining, meaningful words. This raises another problem: we may find that "China", "bee" and "farming" appear exactly the same number of times. Does that mean that, as keywords, they are equally important?
Obviously not. "China" is a very common word, while "bee" and "farming" are relatively uncommon. If the three words appear the same number of times in an article, there is good reason to believe that "bee" and "farming" are more important than "China"; that is, in the keyword ranking, "bee" and "farming" should come before "China".
So we need an importance adjustment factor that measures whether a word is a common one. If a word is rare but appears many times in this article, it very likely reflects what the article is about, and is exactly the keyword we need.
Expressed in statistical language: on top of the term frequency, we assign each word an "importance" weight. The most common words ("的", "是", "在") get the smallest weight, fairly common words ("China") get a smaller weight, and relatively rare words ("bee", "farming") get a larger weight. This weight is called the "inverse document frequency" (IDF), and its size is inversely related to how common the word is.
Once we know the term frequency (TF) and the inverse document frequency (IDF), multiplying the two gives the TF-IDF value of a word. The more important a word is to an article, the larger its TF-IDF value. So the words with the highest TF-IDF values are the keywords of the article.
Here are the details of the algorithm.
The first step is to calculate the term frequency:

    TF = number of times the term appears in the document

Since articles vary in length, the term frequency is normalized so that different articles can be compared:

    TF = (number of times the term appears in the document) / (total number of terms in the document)

Or:

    TF = (number of times the term appears in the document) / (number of occurrences of the most frequent term in the document)
The second step is to calculate the inverse document frequency.
At this point we need a corpus to model the environment in which the language is used:

    IDF = log( (total number of documents in the corpus) / (number of documents containing the term + 1) )

The more common a word is, the larger the denominator, and the closer its inverse document frequency gets to 0. Adding 1 to the denominator avoids division by zero (the case where no document contains the word). "log" means taking the logarithm of the result.
The third step is to calculate TF-IDF:

    TF-IDF = TF × IDF

As you can see, TF-IDF is proportional to how often a word appears in the document and inversely related to how often it appears across the whole language. The algorithm for automatic keyword extraction is therefore very clear: compute the TF-IDF value of every word in the document, sort in descending order, and take the top few words.
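To make the three steps concrete, here is a minimal Python sketch. It is only an illustration under simplifying assumptions: the toy corpus, the whitespace tokenization and the function names are mine, not the article's.

    import math

    def tf(word, document):
        # Term frequency: occurrences of the word divided by the document's length.
        words = document.split()
        return words.count(word) / len(words)

    def idf(word, corpus):
        # Inverse document frequency: log(total documents / (documents containing the word + 1)).
        containing = sum(1 for doc in corpus if word in doc.split())
        return math.log(len(corpus) / (containing + 1))

    def extract_keywords(document, corpus, top_n=5):
        # Score every distinct word in the document by TF-IDF and return the top few.
        scores = {w: tf(w, document) * idf(w, corpus) for w in set(document.split())}
        return sorted(scores, key=scores.get, reverse=True)[:top_n]

    # Toy usage: "corpus" stands in for a large collection of documents.
    corpus = ["china bee farming in china", "china economy news", "movie reviews and news"]
    print(extract_keywords("china bee farming bee keeping", corpus, top_n=3))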
Take "Chinese Bee Farming" as an example. Suppose the article is 1,000 words long and "China", "bee" and "farming" each appear 20 times; the term frequency (TF) of these three words is then 0.02. Next, a Google search finds 25 billion pages containing the word "的" (the); assume this is the total number of Chinese pages. 6.23 billion pages contain "China", 48.4 million pages contain "bee", and 97.3 million pages contain "farming". Their inverse document frequency (IDF) and TF-IDF are as follows (logarithms to base 10):

                Pages containing the word      IDF       TF-IDF
    China       6.23 billion                   0.603     0.0121
    Bee         48.4 million                   2.713     0.0543
    Farming     97.3 million                   2.410     0.0482

As the table shows, "bee" has the highest TF-IDF value, "farming" comes second, and "China" comes last. (If you also calculated the TF-IDF of the word "的", it would be a value extremely close to 0.) So, if we could choose only one word, "bee" would be the keyword of this article.
Besides automatic keyword extraction, the TF-IDF algorithm can be used in many other places. For example, in information retrieval: for each document, compute the TF-IDF of every search term ("China", "bee", "farming") and add them up to get the document's TF-IDF score for the query. The document with the highest score is the one most relevant to the search terms.
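As a sketch of that retrieval idea, reusing the hypothetical tf and idf helpers and the toy corpus from the sketch above, a document's relevance to a query is just the sum of the query terms' TF-IDF values in that document:

    def relevance(query_terms, document, corpus):
        # Sum the TF-IDF of each search term in this document.
        return sum(tf(t, document) * idf(t, corpus) for t in query_terms)

    # Rank every document in the corpus against the query.
    query = ["china", "bee", "farming"]
    ranked = sorted(corpus, key=lambda doc: relevance(query, doc, corpus), reverse=True)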
The advantages of the TF-IDF algorithm are that it is simple and fast, and its results match intuition fairly well. The disadvantage is that measuring a word's importance simply by "term frequency" is not comprehensive enough; sometimes important words do not appear very often. Moreover, the algorithm cannot reflect a word's position: words appearing early in the document and words appearing late are treated as equally important, which is not true. (One solution is to give greater weight to the first paragraph of the article and to the first sentence of each paragraph.)
Next time, I will combine TF-IDF with cosine similarity to measure the similarity between documents.
(End)
The Application of TF-IDF and Cosine Similarity (Part 2): Finding Similar Articles
Ruan Yifeng
Date: March 21, 2013
Last time, I used the TF-IDF algorithm to automatically extract keywords.
Today, let's look at a related problem. Sometimes, besides finding keywords, we also want to find other articles that are similar to a given one. For example, under each top story, "Google News" also lists a number of similar news items.
To find similar articles, we need "cosine similarity". Let me explain it with an example.
For the sake of simplicity, let's start with sentences.
Sentence A: I like watching TV and don't like watching movies.
Sentence B: I don't like watching TV, and I don't like watching movies.
How can I calculate the similarity of the above two sentences?
The basic idea is: the more similar the words used by two sentences, the more similar their content should be. So we can start from term frequency and compute their similarity.
The first step is word segmentation.
Sentence A: I / like / watch / TV, not / like / watch / movie.
Sentence B: I / not / like / watch / TV, also / not / like / watch / movie.
The second step is to list all the words.
I, like, watch, TV, movie, not, also.
The third step is to calculate the word frequency.
Sentence A: I 1, like 2, watch 2, TV 1, movie 1, not 1, also 0.
Sentence B: I 1, like 2, watch 2, TV 1, movie 1, not 2, also 1.
The fourth step is to write out the word frequency vectors.
Sentence A: [1, 2, 2, 1, 1, 1, 0]
Sentence B: [1, 2, 2, 1, 1, 2, 1]
Here, the question becomes how to calculate the similarity between the two vectors.
We can regard them as two line segments in space, both starting from the origin ([0, 0, ...]) and pointing in different directions. The two segments form an angle. If the angle is 0 degrees, they point in the same direction and the segments coincide; if the angle is 90 degrees, they are at a right angle and point in completely different directions; if the angle is 180 degrees, they point in opposite directions. We can therefore judge how similar two vectors are by the size of this angle: the smaller the angle, the more similar they are.
Take two-dimensional space as an example: a and b are two vectors, and we want to calculate the angle θ between them. The law of cosines tells us it can be obtained with the following formula (where a, b and c are the lengths of the three sides of the triangle formed by the two vectors):

    cos θ = (a² + b² − c²) / (2ab)

Assuming vector a is [x1, y1] and vector b is [x2, y2], the law of cosines can be rewritten in the following form:

    cos θ = (x1·x2 + y1·y2) / ( √(x1² + y1²) × √(x2² + y2²) )

Mathematicians have shown that this way of computing the cosine also holds for n-dimensional vectors. Suppose A and B are two n-dimensional vectors, A = [A1, A2, ..., An] and B = [B1, B2, ..., Bn]; then the cosine of the angle θ between A and B equals:

    cos θ = (A1·B1 + A2·B2 + ... + An·Bn) / ( √(A1² + ... + An²) × √(B1² + ... + Bn²) )

Using this formula, we can compute the cosine of the angle between sentence A's and sentence B's word frequency vectors:

    cos θ = 13 / ( √12 × √16 ) ≈ 0.938

The closer the cosine is to 1, the closer the angle is to 0 degrees, that is, the more similar the two vectors; this is what "cosine similarity" means. So sentence A and sentence B above are very similar; in fact, the angle between them is only about 20.3 degrees.
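As a quick check, here is a small Python sketch that computes the cosine for the two word frequency vectors above; it reproduces the 0.938 cosine and the roughly 20-degree angle just mentioned.

    import math

    def cosine_similarity(a, b):
        # Cosine of the angle between two equal-length vectors.
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(x * x for x in b))
        return dot / (norm_a * norm_b)

    sentence_a = [1, 2, 2, 1, 1, 1, 0]
    sentence_b = [1, 2, 2, 1, 1, 2, 1]
    cos = cosine_similarity(sentence_a, sentence_b)
    print(cos)                            # about 0.938
    print(math.degrees(math.acos(cos)))   # about 20.3 degrees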
From this, we get an algorithm for "finding similar articles" (a small sketch follows the steps):
(1) Use the TF-IDF algorithm to find the keywords of the two articles;
(2) Take a number of keywords from each article (say, 20), merge them into one set, and compute each article's word frequency over that set (to avoid bias from differing article lengths, relative frequencies can be used);
(3) Generate each article's word frequency vector;
(4) Compute the cosine similarity of the two vectors; the larger the value, the more similar the articles.
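A minimal sketch of these four steps, assuming the extract_keywords helper from part one and the cosine_similarity helper above; the whitespace tokenization is a naive stand-in for real word segmentation.

    def article_similarity(article_a, article_b, corpus, top_n=20):
        # Steps (1)-(2): take each article's top TF-IDF keywords and merge them into one word set.
        vocabulary = sorted(set(extract_keywords(article_a, corpus, top_n) +
                                extract_keywords(article_b, corpus, top_n)))

        # Steps (2)-(3): build each article's word frequency vector over that set,
        # using relative frequencies to compensate for differing article lengths.
        def freq_vector(article):
            words = article.split()
            return [words.count(w) / len(words) for w in vocabulary]

        # Step (4): the cosine of the two vectors is the similarity score.
        return cosine_similarity(freq_vector(article_a), freq_vector(article_b))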
"Cosine similarity" is a very useful algorithm that can be used as long as it calculates the similarity of two vectors.
Next time, I want to talk about how to automatically generate a summary of an article on the basis of word frequency statistics.
(End)
The Application of TF-IDF and Cosine Similarity (Part 3): Automatic Summarization
Ruan Yifeng
Date: March 26, 2013
Sometimes a very simple mathematical method can accomplish a very complicated task.
The first two parts of this series are good examples: just by counting word frequencies, we can find keywords and similar articles. Although these methods are certainly not the best performing, they are definitely the simplest.
Today the theme continues: how to use word frequency to generate an automatic summary of an article (automatic summarization).
If we can extract a 150-word summary from a 3,000-word article, we can save the reader a lot of reading time. A summary produced by a person is called a "manual summary"; one produced by a machine is called an "automatic summary". Many sites need this: paper repositories, news sites, search engines, and so on. In 2007, the survey paper "A Survey on Automatic Text Summarization" (Dipanjan Das, Andre F.T. Martins, 2007) summarized the automatic summarization algorithms then in use. One very important family among them is based on word frequency statistics.
This approach was first proposed in 1958 by the IBM scientist H.P. Luhn in his paper "The Automatic Creation of Literature Abstracts".
Luhn's view is that the information of an article is contained in its sentences: some sentences contain more information, some contain less. Automatic summarization is a matter of finding the sentences that contain the most information.
The amount of information in a sentence is measured by its "keywords". The more keywords a sentence contains, the more important it is. Luhn proposed using "clusters" to represent the aggregation of keywords: a "cluster" is a sentence fragment that contains several keywords.
The figure in Luhn's original paper illustrates this: the framed portion is one "cluster". As long as the distance between keywords is below a "threshold", they are considered to belong to the same cluster. The threshold Luhn suggests is 4 or 5; that is, if there are more than 5 other words between two keywords, the two keywords can be split into two clusters.
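A small sketch of that partitioning rule. Here keyword_positions is assumed to be the sorted word positions of the keywords within a sentence; the name and the threshold default are illustrative, not from Luhn's paper.

    def group_into_clusters(keyword_positions, threshold=5):
        # Start a new cluster whenever more than `threshold` other words separate
        # two consecutive keywords. Assumes at least one keyword position is given.
        clusters, current = [], [keyword_positions[0]]
        for pos in keyword_positions[1:]:
            if pos - current[-1] - 1 <= threshold:
                current.append(pos)
            else:
                clusters.append(current)
                current = [pos]
        clusters.append(current)
        return clusters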
Next, an importance score is calculated for each cluster:

    cluster importance = (number of keywords it contains)² / (length of the cluster)

In the cluster shown in Luhn's figure, there are 7 words, 4 of which are keywords, so its importance score is (4 × 4) / 7 ≈ 2.3.
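In code, that scoring rule looks roughly like this; cluster_words and keywords are hypothetical inputs (the words of one cluster and the set of keywords), not names from Luhn's paper.

    def cluster_score(cluster_words, keywords):
        # Luhn's importance score: (number of keywords in the cluster) squared,
        # divided by the total number of words in the cluster.
        significant = sum(1 for w in cluster_words if w in keywords)
        return significant ** 2 / len(cluster_words)

    # The cluster from Luhn's figure: 7 words, 4 of them keywords -> (4 * 4) / 7 ≈ 2.3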
Then, find the sentences containing the highest-scoring clusters (say, 5 sentences) and put them together; this forms the article's automatic summary. For a concrete implementation, see chapter 8 of "Mining the Social Web: Analyzing Data from Facebook, Twitter, LinkedIn, and Other Social Media Sites" (O'Reilly, 2011); the Python code is available on GitHub.
Luhn's algorithm was later simplified: the notion of "cluster" was dropped, and only the keywords contained in each sentence are considered. The following example (in pseudo-code) selects only the first sentence in which each keyword appears.
Summarizer(originalText, maxSummarySize):
    # Calculate the word frequencies of the original text,
    # producing an array such as [(10, 'the'), (3, 'language'), (8, 'code') ...]
    wordFrequences = getWordCounts(originalText)

    # Filter out the stop words; the array becomes [(3, 'language'), (8, 'code') ...]
    contentWordFrequences = filtStopWords(wordFrequences)

    # Sort by word frequency and drop the counts; the array becomes ['code', 'language' ...]
    contentWordsSortbyFreq = sortByFreqThenDropFreq(contentWordFrequences)

    # Split the article into sentences
    sentences = getSentences(originalText)

    # For each keyword, select the first sentence in which it appears
    setSummarySentences = {}
    foreach word in contentWordsSortbyFreq:
        firstMatchingSentence = search(sentences, word)
        setSummarySentences.add(firstMatchingSentence)
        if setSummarySentences.size() == maxSummarySize:
            break

    # Assemble the selected sentences, in their original order, into the summary
    summary = ""
    foreach sentence in sentences:
        if sentence in setSummarySentences:
            summary = summary + " " + sentence
    return summary
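For completeness, here is a compact runnable Python version of the same simplified algorithm; the stop-word list and the regex-based sentence splitter are naive placeholders of my own, not part of the original pseudo-code.

    import re
    from collections import Counter

    STOP_WORDS = {"the", "is", "in", "a", "an", "and", "of", "to", "it"}  # placeholder list

    def summarize(original_text, max_summary_size):
        # Word frequencies, minus stop words, ordered from most to least frequent.
        words = re.findall(r"[a-z']+", original_text.lower())
        frequencies = Counter(w for w in words if w not in STOP_WORDS)
        keywords = [w for w, _ in frequencies.most_common()]

        # Naive sentence splitting on ., ! and ?.
        sentences = re.split(r"(?<=[.!?])\s+", original_text.strip())

        # For each keyword (most frequent first), keep the first sentence containing it.
        selected = set()
        for word in keywords:
            for sentence in sentences:
                if word in sentence.lower():
                    selected.add(sentence)
                    break
            if len(selected) >= max_summary_size:
                break

        # Reassemble the selected sentences in their original order.
        return " ".join(s for s in sentences if s in selected)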
Similar algorithms have already been packaged as tools, for example the SimpleSummariser module of the Java library Classifier4J, the C library OTS, and the C# and Python implementations based on Classifier4J.
(End)