"cow" appears 3 times, then the term frequency (TF) of "cow" in the document is 3/100 = 0.03. One way to calculate the inverse document frequency (IDF) is to count how many documents contain the word "cow", divide the total number of documents in the collection by that count, and take the logarithm. Therefore, if the word "cow" appears in 1,000 documents and the total number of documents is 10,000,000, the inverse document frequency is log(10,000,000/1,000) = 4 (using a base-10 logarithm). The final TF-IDF score is 0.03 * 4 = 0.12.
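The arithmetic in the "cow" example can be checked with a few lines of Python. This is only a minimal sketch: the function names `term_frequency` and `inverse_document_frequency` are illustrative, and the base-10 logarithm is assumed because it is the base that makes log(10,000,000/1,000) equal 4.

```python
import math


def term_frequency(term_count: int, total_terms: int) -> float:
    """Share of the document's words that are the given term."""
    return term_count / total_terms


def inverse_document_frequency(total_docs: int, docs_with_term: int) -> float:
    """Base-10 log of (total documents / documents containing the term)."""
    return math.log10(total_docs / docs_with_term)


# Numbers taken from the "cow" example in the text.
tf = term_frequency(3, 100)
idf = inverse_document_frequency(10_000_000, 1_000)
score = tf * idf

print(f"tf={tf:.2f} idf={idf:.1f} tf-idf={score:.2f}")
# prints: tf=0.03 idf=4.0 tf-idf=0.12
```

A different logarithm base only rescales every IDF by the same constant, so it changes the scores but not the ranking of documents.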
Second: The relevance of the search re
file. If the total number of words in a document is 100, and the word "cow" appears 3 times, then the term frequency (TF) of "cow" in the document is 3/100 = 0.03. One way to calculate the inverse document frequency (IDF) is to count how many documents contain the word "cow", divide the total number of documents in the collection by that count, and take the logarithm. Therefore, if the word "cow" appears in 1,000 documents and the total number of documents is 10,000,000, the inverse document frequency is log(10,000,000/1,000) = 4 (using a base-10 logarithm). The final
IDF = log(2) = 1 (with a base-2 logarithm). With IDF, the relevance formula changes from a simple sum of term frequencies to a weighted sum, namely:
TF1*IDF1 + TF2*IDF2 + TF3*IDF3 + ...
Weights computed this way give an objective and fairly accurate estimate of the relevance between keywords and web pages.
Reference book: The Beauty of Mathematics
Originally published at: http://www.ido321.com/1338.html
weighted sum, i.e. TF1*IDF1 + TF2*IDF2 + ... + TFN*IDFN. In the example above, the web page and the query "application of atomic energy" have a relevance of 0.0069, of which "atomic energy" contributes 0.0054 while "application" contributes only 0.0015. This ratio is quite consistent with our intuition.
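The weighted sum can be sketched in Python as follows. The function name `relevance` and the sample TF and IDF values passed to it are hypothetical; only the two per-term contributions (0.0054 and 0.0015) come from the text, and the inputs that produced them are not given here.

```python
import math


def relevance(tf_values, idf_values):
    """Weighted sum TF1*IDF1 + TF2*IDF2 + ... + TFN*IDFN."""
    if len(tf_values) != len(idf_values):
        raise ValueError("need exactly one IDF weight per TF value")
    return sum(tf * idf for tf, idf in zip(tf_values, idf_values))


# Hypothetical inputs, chosen only to show the mechanics of the sum.
example_score = relevance([0.5, 0.25], [2.0, 4.0])

# Per-term contributions (already TF * IDF) quoted in the text.
contributions = {"atomic energy": 0.0054, "application": 0.0015}
total = sum(contributions.values())
share = contributions["atomic energy"] / total

print(f"total relevance: {total:.4f}")
print(f"share from 'atomic energy': {share:.0%}")
```

With the quoted contributions, the rarer and more specific term accounts for roughly three quarters of the total, which is the intuition the passage appeals to.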
The concept of TF-IDF is recognized as the most important invention in information retrieval. In search, literature classification,
I. Finding elements
document.getElementById("id"): finds an element by its ID; at most one element is found.
var a = document.getElementById("id"): stores the found element in variable a.
document.getElementsByName("name"): finds elements by name; returns an array-like collection.
document.getElementsByTagName("name"): finds elements by tag name; returns an array-like collection.
document.getElementsByClassName("name"): finds elements by class name; returns an array-like collection.
II. Operating on content
1. Non-form elements
1) alert(a.innerHTML): Get t
TF-IDF is simply TF * IDF, where TF is the term frequency and IDF is the inverse document frequency. TF is the frequency with which term t appears in document d. The main idea of IDF is that the fewer documents contain term t (that is, the smaller n is), the larger the IDF and the better term t distinguishes between classes. The main idea of TF-IDF is that if a word or phrase appears in an article with a high TF and is seldom seen in
From: http://hi.baidu.com/jrckkyy/blog/item/fa3d2e8257b7fdb86d8119be.html
TF/IDF (Term Frequency/Inverse Document Frequency) is recognized as the most important invention in information retrieval.
1. TF/IDF describes the correlation between a single term and a specific document
Term Frequency: indicates the correlation between a term and a document.
Formula: the number of times the term appears in the document divided by the total number of occurrences of all terms in the document.
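The term frequency formula above can be sketched in Python. The helper name `term_frequencies` and the sample sentence are illustrative, and splitting on whitespace is an assumption; real text (Chinese in particular) would need a proper tokenizer.

```python
from collections import Counter


def term_frequencies(tokens):
    """Map each term to (its count) / (total number of tokens)."""
    total = len(tokens)
    if total == 0:
        return {}
    counts = Counter(tokens)
    return {term: count / total for term, count in counts.items()}


# Hypothetical document, tokenized naively on whitespace.
tokens = "the cow saw the other cow".split()
tf = term_frequencies(tokens)

for term, value in sorted(tf.items()):
    print(f"{term}: {value:.3f}")
```

Because every count is divided by the same total, the values for one document always sum to 1, which makes long and short documents comparable.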
IDF
In summary, if a keyword W appears in DW web pages, then the larger DW is, the smaller the weight of W, and vice versa. In information retrieval, the most commonly used weight is the "inverse document frequency" (IDF), whose formula is log(D/DW), where D is the total number of web pages. For example, assume the number of Chinese web pages is D = 1 billion and the stop word "of" appears on all of them, that is, DW = 1 billion. Then its IDF =
TF-IDF
1. Concept
2. Principle
3. Java code implementation ideas
Data set:
Three MapReduce jobs
First MapReduce: (using the IK word segmenter, each post, which is the content of one record, is split into words)
Final result of the first MapReduce run:
1. Get the total number of microblog posts in the data set;
2. Get the TF value for each word in the current Weibo post
Mapper side: key: LongWritable (offset) value: 3823890314914825 The weather was fine today, and the sist
need to study ~
package com.lean;
import java.util.ArrayList;
import java.util.Arrays;
/*
 * 1. How to measure the relevance of web pages and queries (information retrieval field)
 * TF-IDF (term frequency-inverse document frequency) algorithm:
 * TF = (number of occurrences of the word / total number of words in the text)
 * IDF = log(D/DW) = log(total number of pages / number of pages containing the specific word) ----> Why log()? The explanation in The Beauty of Mathematics is "cross-entropy of the probability distri