The title looks complicated, but what I want to discuss is actually a very simple problem.
Given a very long article, I want a computer to extract its keywords (automatic keyphrase extraction) without any human intervention. How can this be done correctly?
This problem touches on data mining, text processing, information retrieval, and several other frontier areas of computer science, yet surprisingly there is a very simple classical algorithm that gives quite satisfactory results. It is simple enough to require no advanced mathematics; an ordinary person can understand it in ten minutes. That is the TF-IDF algorithm I want to introduce today.
Let's start with an example. Suppose we have a long article called "China's Bee Farming", and we want a computer to extract its keywords.
An idea that comes to mind easily is to find the words that occur most often. If a word is important, it should appear many times in the article. So we start with "term frequency" (TF) statistics.
The result, as you have probably guessed, is that the most frequent words are "the", "is", "in" and other words of that kind, i.e. the most commonly used words. These are called "stop words", meaning words that do not help in finding the result and must be filtered out.
Suppose we filter them all out and consider only the remaining, genuinely meaningful words. We then run into another problem: we may find that the three words "China", "bee", and "farming" appear the same number of times. Does this mean that, as keywords, they are equally important?
Obviously not. "China" is a very common word, while "bee" and "farming" are comparatively uncommon. If the three words appear the same number of times in an article, there is reason to believe that "bee" and "farming" are more important than "China"; that is, in the keyword ranking, "bee" and "farming" should come before "China".
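To make the problem concrete, here is a minimal Python sketch with a made-up token list and a tiny stop-word list (both purely illustrative assumptions): after the stop words are removed, the three content words come out tied.

```python
from collections import Counter

# Made-up tokens standing in for the article "China's Bee Farming";
# a real pipeline would get these from a tokenizer.
tokens = ["the", "bee", "farming", "in", "china", "is", "the",
          "china", "bee", "farming", "china", "bee", "farming"]

STOP_WORDS = {"the", "is", "in", "of", "a"}  # tiny illustrative stop list

# Count only the words that are not stop words.
counts = Counter(t for t in tokens if t not in STOP_WORDS)
print(counts.most_common())
# [('bee', 3), ('farming', 3), ('china', 3)] -- all tied, so raw
# frequency alone cannot rank them.
```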
So we need an importance adjustment factor that measures whether a word is a common word. If a word is rare but appears many times in this article, it very likely reflects what the article is about and is exactly the keyword we are looking for.
Expressed in statistical terms: on top of the term frequency, each word is assigned an "importance" weight. The most common words ("the", "is", "in") get the smallest weights, fairly common words ("China") get smaller weights, and rarer words ("bee", "farming") get larger weights. This weight is called the "inverse document frequency" (abbreviated IDF), and its size is inversely related to how common the word is.
Once the term frequency (TF) and the inverse document frequency (IDF) are known, multiplying the two gives a word's TF-IDF value. The more important a word is to an article, the larger its TF-IDF value. So the top few words by TF-IDF are the keywords of the article.
Here are the details of the algorithm.
The first step is to compute the term frequency.
    Term frequency (TF) = number of times a word appears in the article
Considering that articles differ in length, to make comparisons between different articles easier, the term frequency is normalized:
    Term frequency (TF) = (number of times the word appears in the article) / (total number of words in the article)
Or:
    Term frequency (TF) = (number of times the word appears in the article) / (number of occurrences of the article's most frequent word)
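As a minimal sketch (assuming the article has already been tokenized and a stop-word list is supplied by the caller), the normalized term frequency could be computed like this:

```python
from collections import Counter

def term_frequency(tokens, stop_words=frozenset()):
    """Normalized TF: occurrences of each word / total number of words.

    `tokens` is an already-tokenized article; stop words are dropped
    before counting.
    """
    words = [t for t in tokens if t not in stop_words]
    total = len(words)
    return {word: count / total for word, count in Counter(words).items()}
```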
The second step is to calculate the inverse document frequency.
At this point, a corpus is needed to simulate the environment in which the language is used.
    Inverse document frequency (IDF) = log( total number of documents in the corpus / (number of documents containing the word + 1) )
The more common a word is, the larger the denominator, and the smaller the inverse document frequency becomes, approaching 0. The denominator is increased by 1 to avoid a denominator of 0 (that is, the case where no document contains the word). log means taking the logarithm of the result.
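A direct translation of this formula into Python might look like the following (assuming the corpus is given as a list of token lists; the base of the logarithm, base 10 here, is a free choice as long as it is used consistently):

```python
import math

def inverse_document_frequency(word, documents):
    """IDF = log(total number of documents / (documents containing the word + 1)).

    `documents` is the corpus, given as a list of token lists.
    """
    containing = sum(1 for doc in documents if word in doc)
    return math.log10(len(documents) / (containing + 1))
```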
The third step is to compute TF-IDF.
    TF-IDF = term frequency (TF) × inverse document frequency (IDF)
As you can see, TF-IDF is proportional to the number of times a word appears in the document and inversely proportional to the number of times the word appears in the language as a whole. So the algorithm for automatic keyword extraction is very clear: compute the TF-IDF value of every word in the document, sort them in descending order, and take the top few words.
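Putting the three steps together, a self-contained keyword extractor might look like the sketch below (the function name and parameters are made up for illustration; it is not a production implementation):

```python
import math
from collections import Counter

def extract_keywords(tokens, documents, stop_words=frozenset(), top_n=5):
    """Rank the words of one tokenized article by TF-IDF against a
    corpus of token lists, and return the top few as keywords."""
    words = [t for t in tokens if t not in stop_words]
    counts = Counter(words)
    doc_sets = [set(doc) for doc in documents]  # for fast membership tests

    scores = {}
    for word, count in counts.items():
        tf = count / len(words)                               # normalized term frequency
        containing = sum(1 for ds in doc_sets if word in ds)  # documents containing the word
        idf = math.log10(len(documents) / (containing + 1))   # inverse document frequency
        scores[word] = tf * idf

    # Descending TF-IDF; the top few words are the keywords.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
```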
Take "China's Bee Farming" as an example. Assume the article is 1,000 words long and that "China", "bee", and "farming" each appear 20 times, so the term frequency (TF) of all three words is 0.02. A Google search then shows that roughly 25 billion web pages contain the word "the"; take this as the total number of web pages. Pages containing "China" number 6.23 billion, pages containing "bee" number 48.4 million, and pages containing "farming" number 97.3 million. Their inverse document frequencies (IDF) and TF-IDF values (using base-10 logarithms) are as follows:
    "China":   IDF = log(25 billion / 6.23 billion) ≈ 0.603,  TF-IDF = 0.603 × 0.02 ≈ 0.0121
    "Bee":     IDF = log(25 billion / 48.4 million) ≈ 2.713,  TF-IDF = 2.713 × 0.02 ≈ 0.0543
    "Farming": IDF = log(25 billion / 97.3 million) ≈ 2.410,  TF-IDF = 2.410 × 0.02 ≈ 0.0482
As the table above shows, "bee" has the highest TF-IDF value, "farming" comes second, and "China" comes last. (If you also computed the TF-IDF of "the", it would be a value extremely close to 0.) So if only one word is to be selected, "bee" is the keyword of this article.
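The arithmetic in the table can be reproduced in a few lines (base-10 logarithms; the page counts are the rough figures assumed in the example):

```python
import math

TOTAL_PAGES = 25e9   # assumed total number of web pages
PAGES_WITH = {"China": 6.23e9, "bee": 48.4e6, "farming": 97.3e6}
TF = 0.02            # each word appears 20 times in a 1,000-word article

for word, pages in PAGES_WITH.items():
    idf = math.log10(TOTAL_PAGES / pages)  # the +1 in the denominator is negligible here
    print(f"{word:8s} IDF = {idf:.3f}   TF-IDF = {TF * idf:.4f}")
# China    IDF = 0.603   TF-IDF = 0.0121
# bee      IDF = 2.713   TF-IDF = 0.0543
# farming  IDF = 2.410   TF-IDF = 0.0482
```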
The TF-IDF algorithm can be used in many other places as well. For example, in information retrieval, for each document you can compute the TF-IDF of every word in a set of search terms ("China", "bee", "farming") and add them together to get a TF-IDF value for the whole document. The document with the highest value is the one most relevant to the search terms.
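A minimal sketch of that retrieval use (documents and query terms are assumed to be tokenized already; the names are illustrative):

```python
import math
from collections import Counter

def rank_documents(query_terms, documents):
    """Score each document (a list of tokens) by the sum of the query
    terms' TF-IDF values; the highest-scoring document is the most relevant."""
    doc_sets = [set(doc) for doc in documents]
    idf = {
        term: math.log10(len(documents) / (sum(term in ds for ds in doc_sets) + 1))
        for term in query_terms
    }
    scores = []
    for doc in documents:
        counts = Counter(doc)
        scores.append(sum((counts[term] / len(doc)) * idf[term] for term in query_terms))
    # Return (score, document index) pairs, best match first.
    return sorted(zip(scores, range(len(documents))), reverse=True)
```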
The advantages of the TF-IDF algorithm are that it is simple, fast, and that its results match reality reasonably well. The disadvantage is that measuring a word's importance purely by "term frequency" is not comprehensive enough: sometimes an important word may not appear very many times. Moreover, the algorithm cannot reflect a word's position; words that appear early in the document and words that appear late are treated as equally important, which is not realistic. (One possible fix is to give a larger weight to the first sentence of the first paragraph and to the first sentence of each subsequent paragraph.)
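One crude way to realize the positional weighting suggested in that parenthetical, sketched under the assumption that the article arrives already split into paragraphs and sentences (the boost factor is an arbitrary illustrative choice):

```python
from collections import Counter

def positional_term_frequency(paragraphs, boost=2.0, stop_words=frozenset()):
    """Like normalized TF, but tokens in the first sentence of each
    paragraph count `boost` times instead of once.

    `paragraphs` is a list of paragraphs, each a list of sentences,
    each a list of tokens.
    """
    counts = Counter()
    total = 0.0
    for paragraph in paragraphs:
        for i, sentence in enumerate(paragraph):
            weight = boost if i == 0 else 1.0
            for token in sentence:
                if token not in stop_words:
                    counts[token] += weight
                    total += weight
    return {word: c / total for word, c in counts.items()}
```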
Next time, I will discuss how to use TF-IDF together with cosine similarity to measure the similarity between documents.