From: http://hi.baidu.com/jrckkyy/blog/item/fa3d2e8257b7fdb86d8119be.html
TF/IDF (Term Frequency / Inverse Document Frequency) is recognized as one of the most important inventions in information retrieval.
1. TF/IDF describes the correlation between a single term and a specific document
Term Frequency (TF) indicates how relevant a term is to a document.
Formula: the number of times the term appears in the document, divided by the total number of term occurrences in the document.
IDF (Inverse Document Frequency) indicates how well a term identifies a document's topic. It compares the number of documents containing the term with the total number of documents in the collection: the more documents the term occurs in, the smaller its weight.
The formula is log(D/Dt), where D is the total number of documents in the collection and Dt is the number of documents containing the term.
In this way, the relevance of a document to the keywords K1, K2, and K3 becomes TF1*IDF1 + TF2*IDF2 + TF3*IDF3. For example, suppose Document1 contains 1,000 terms in total, and K1, K2, and K3 appear in it 100, 200, and 50 times respectively. The numbers of documents containing K1, K2, and K3 are 1,000, 10,000, and 5,000, and the total number of documents in the collection is 10,000.
TF1 = 100/1000 = 0.1
TF2 = 200/1000 = 0.2
TF3 = 50/1000 = 0.05
IDF1 = log(10000/1000) = log(10) ≈ 2.3
IDF2 = log(10000/10000) = log(1) = 0
IDF3 = log(10000/5000) = log(2) ≈ 0.69
(Natural logarithms are used here.)
In this way, the relevance of Document1 to K1, K2, and K3 is 0.1*2.3 + 0.2*0 + 0.05*0.69 = 0.2645.
K1's contribution in Document1 is greater than K3's, while K2 contributes 0.
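The arithmetic above can be checked with a short script. This is a minimal sketch of the worked example, assuming natural logarithms (which the rounded values 2.3 and 0.69 imply):

```python
import math

# Term counts of K1, K2, K3 in Document1, which has 1,000 terms total.
counts = [100, 200, 50]
total_terms = 1000

# Number of documents containing each term, out of 10,000 documents.
doc_freqs = [1000, 10000, 5000]
total_docs = 10000

tf = [c / total_terms for c in counts]                  # [0.1, 0.2, 0.05]
idf = [math.log(total_docs / df) for df in doc_freqs]   # [~2.30, 0.0, ~0.69]

# Relevance of Document1 to the query {K1, K2, K3}.
relevance = sum(t * i for t, i in zip(tf, idf))
print(round(relevance, 4))
```

With unrounded logarithms the sum comes out to about 0.2649; the article's 0.2645 uses the rounded values 2.3 and 0.69.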
Conceptually, TF/IDF is related to the cross entropy of the probability distribution of keywords under a specific condition (cf. the Kullback-Leibler divergence).
2. Using TF/IDF to measure document similarity
Assume the TF/IDF values of the terms of Document1 and Document2 are T11, T12, T13, ..., T1n and T21, T22, T23, ..., T2n. The similarity between them can be expressed as the cosine of the angle between the two vectors:
Cos(D1, D2) = inner product of D1 and D2 / (length of D1 * length of D2) = (T11*T21 + T12*T22 + T13*T23 + ... + T1n*T2n) / (|D1| * |D2|),
where |D1| = sqrt(T11*T11 + T12*T12 + T13*T13 + ... + T1n*T1n), and similarly for |D2|.
The larger the cosine value, the greater the similarity; if the value is 1, D1 and D2 point in the same direction.
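The cosine formula above can be sketched directly. The two vectors here are hypothetical TF/IDF values over a shared four-term vocabulary, made up purely for illustration:

```python
import math

def cosine_similarity(d1, d2):
    """Cosine of the angle between two TF/IDF vectors of equal length."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    return dot / (norm1 * norm2)

# Hypothetical TF/IDF vectors for two documents over the same vocabulary.
doc1 = [0.1, 0.0, 0.3, 0.2]
doc2 = [0.2, 0.1, 0.3, 0.1]

print(cosine_similarity(doc1, doc1))  # a vector with itself gives 1.0
print(cosine_similarity(doc1, doc2))
```

Since TF/IDF values are non-negative, the cosine here always falls between 0 and 1, with 1 meaning the two documents have identical term-weight proportions.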
Today we can absorb a great deal of information from the Internet; sometimes there are more articles than we could ever finish reading. If we want to absorb the information but do not have enough time, one approach is to use the computer to filter the information for us, or to have it produce a summary. If I have an article in hand today and want the computer to find the most important keywords in it, what should I do? In the field of information retrieval (IR), there is a basic, must-know method for getting started: TF and IDF (TF: Term Frequency; IDF: Inverse Document Frequency). With these two measures, a computer can estimate which keywords are important, saving us time.
Next, let's look at what TF and IDF are. TF stands for term frequency, that is, the number of times a keyword appears. For example, if the word "computer" or the phrase "user needs" appears many times in an article, the frequency of those terms will be high. It seems reasonable that words appearing many times in an article must be important; for instance, "artificial intelligence" might appear frequently in an article about AI. But beyond TF (term frequency), why do we need IDF (inverse document frequency)? Let's first think about the difficulty of judging an article's most important keywords by word frequency alone. We immediately run into function words, which occur very often. They appear as frequently as the important words, making it impossible for the computer to tell which words are merely common and which are important. For English, there is a rule observed by a linguist, called Zipf's law.
Wikipedia's introduction to it is roughly as follows:
Basically, Zipf's law states that the frequency of a word in a natural-language corpus is inversely proportional to its rank in the frequency table. Therefore, the most frequent word appears about twice as often as the second most frequent word, which in turn appears about twice as often as the fourth most frequent word. The law is cited as a reference for anything related to power-law probability distributions. It was published by the Harvard linguist George Kingsley Zipf (IPA [zɪf]).
For example, in the Brown Corpus, "the" is the most common word, accounting for about 7% of the corpus (69,971 occurrences out of slightly over one million words). As Zipf's law predicts, the second-ranked word "of" accounts for about 3.5% of the corpus (36,411 occurrences), followed by "and" (28,852 occurrences). Only 135 words are needed to account for half of the Brown Corpus.
So now we know where the problem lies: if we only use word frequency to determine the most important keywords in an article, we will likely find common words rather than the most important ones. Words such as "the", "a", and "it" in English are common words, but the most important words in an article are not these, even though the truly important words also appear frequently.
What should we do then? This is where IDF helps. Before IDF, consider DF, document frequency. Suppose we have N articles in hand; the document frequency (DF) of a keyword is the number of those N articles in which the keyword appears. Inverse document frequency (IDF) is simply the reciprocal of DF, so multiplying a number by IDF amounts to dividing it by DF. With TF and IDF, we can compute TF multiplied by IDF to give each keyword a score; the score represents the importance of that keyword in the article. Why does this find important words instead of common words? TF alone ranks the most frequent word in an article first, the second most frequent second, and so on. But multiplying by IDF divides by DF, and words that occur everywhere, such as "the", "a", and "it" in English, appear in every article, so their DF is large. A large DF means a small IDF after taking the reciprocal, and a small IDF drags down TF*IDF. So although "the", "a", and "it" appear frequently in any one article, their TF*IDF importance is low, and we (or rather our programs) will not mistake these common words for important ones!
What kind of score does a genuinely important word get? If an article is about AI, "AI" appears many times, so "AI" has a high TF in this article. But not every article in our database is about AI, so "AI" may appear in only three of the N articles; its DF is then only 3, and its IDF is about 0.33. Suppose N = 100, that is, there are 100 articles in the database. A ubiquitous word like "the" appears in every article, so its DF is 100 and its IDF is 0.01. Therefore the IDF of "AI" is much higher than the IDF of "the". Even if "AI" and "the" appear exactly the same number of times in this article, "AI" gets a higher score, and the computer will judge that "AI" is an important keyword in this article while "the" is not.
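The comparison above can be sketched in a few lines, using the author's simplified definition of IDF as the reciprocal of DF (the log-based form log(N/DF) from the first section would behave similarly here). The shared term frequency of 0.05 is an assumed value for illustration:

```python
# Author's simplified IDF: the reciprocal of document frequency (DF).
N = 100                      # total articles in the hypothetical database

df_ai, df_the = 3, 100       # "AI" appears in 3 articles, "the" in all 100
idf_ai = 1 / df_ai           # ~0.33
idf_the = 1 / df_the         # 0.01

# Suppose both words happen to have the same term frequency in this article.
tf = 0.05
score_ai = tf * idf_ai
score_the = tf * idf_the
print(score_ai > score_the)  # "AI" is judged more important than "the"
```

The point is that IDF, not TF, is what separates the two words: their term frequencies are identical, yet "AI" scores far higher because it is rare across the collection.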
Therefore, using TF*IDF, we can compute the importance of a keyword in an article. In one direction, we can extract the key words of an article and use them to summarize it. In the opposite direction, we can take a keyword, compute its TF*IDF in every article, and compare to find the article in which this keyword matters most; this is a way to find the articles most relevant to a keyword. Whether for finding key words in an article or finding relevant articles by keyword, TF*IDF is a basic but good method. Readers who can program may want to try it themselves, though they will have to prepare their own corpus, or use a web crawler to save a few interesting web pages, strip the HTML tags, and then test their skills on the plain text that remains!
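For readers who want to try it, here is a minimal end-to-end sketch over a tiny made-up corpus (three toy "articles"), using the log-based IDF from the first section. The texts and vocabulary are invented for illustration:

```python
import math
from collections import Counter

# A toy corpus of three made-up plain-text "articles", already cleaned
# of HTML tags and lowercased.
corpus = [
    "ai systems learn from data and the data drives the ai model",
    "the weather today is sunny and the park is crowded",
    "the stock market fell and the market data worried investors",
]

docs = [doc.split() for doc in corpus]
N = len(docs)

def df(term):
    """Document frequency: how many documents contain the term."""
    return sum(1 for d in docs if term in d)

def tfidf_scores(doc):
    """Score each term of one document by TF * log(N/DF)."""
    counts = Counter(doc)
    total = len(doc)
    return {t: (c / total) * math.log(N / df(t)) for t, c in counts.items()}

scores = tfidf_scores(docs[0])
top = sorted(scores, key=scores.get, reverse=True)
print(top[:3])  # content words such as "ai" rank above "the"
```

Note how "the" and "and", which occur in every document, get an IDF of log(3/3) = 0 and fall to the bottom, while "ai", which only the first article contains, rises to the top.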
We can also compare the differences between humans and computers here. Computers are very good at numerical computation and at executing fixed steps, and they are fast. Humans can understand the meaning of a word: after reading an article we grasp its meaning, and then, starting from that meaning, we recall or summarize what the important keywords are. For a computer to follow this route, it would first have to understand the meanings of words, then the meaning of the article, and only then conclude which keywords are important, and that is hard. To understand word meanings, a computer first needs a semantic network or an ontology that classifies words by meaning, so that it can relate a word to other words, much like taxonomic classification in biology. To then understand a passage, it must understand sentences, which involves NLP (natural language processing): finding the subject, verb, object, and complements of a sentence, distinguishing clauses from the main sentence, resolving synonyms and references in the surrounding context, and producing the right parses. Only after all that can it understand the whole article.
For computers, by contrast, TF*IDF is fast to compute and scales to large collections, so it can be computed even without a large machine. The relationship between strong AI and weak AI can also be mentioned here. From an engineering perspective, TF*IDF is a good method: it works! It saves us time, and it is a small step toward solving big problems. However, proponents of strong AI will raise the "Chinese Room" argument here: even if a computer can identify important keywords, does that mean the computer really "knows" (understands) what the keywords mean?
The Chinese Room, simply put, imagines a person in a room with only two windows: paper comes in through one and goes out through the other. In the room there is a manual full of lookup tables, telling the operator what Chinese text to output for whatever English text he sees, along with instructions for manipulating the symbols; for example, one instruction might say to write two Chinese characters together before sending them out. We then pass an English sentence in through one window, and a Chinese translation of the sentence comes out the other. The argument is that although the room appears to translate English into Chinese, the operator inside does not understand Chinese at all; he is only mechanically following the tables in the manual. Yet from the outside, it looks as though the room understands Chinese.
My view here is that as long as our questions get answered, we can solve problems with the method, whether or not the computer really understands the meaning of the words. However, in the long run, if we really want a computer with human-like intelligence that truly understands, rather than merely behaving as if it understands, then we need to examine the Chinese Room argument carefully. Perhaps biological approaches, such as the methods of computational neuroscience, are one direction.
We might ask in turn: how can neurons, which only fire action potentials or stay at rest, understand meaning? A single neuron probably cannot understand anything, but the interaction of all the neurons in the brain might give rise to meaning! Is that the secret? It is one of the questions computational neuroscience hopes to answer. Interested readers could start from the human brain to attack the strong AI problem. Or a mathematician might one day produce a theory that solves the problem of meaning understanding very beautifully; perhaps something like a manifold, which looks different depending on the chart you view it through and has both global and local structure, is a good candidate. Attacking strong AI from this direction is another possibility. In short, keep studying hard!