Analysis of TF-IDF and Its Application in Computational Advertising

Last Update:2018-12-03 Source: Internet

Author: User

Analysis of TF-IDF:

TF-IDF is a commonly used weighting technique. It is a statistical method for assessing how important a term is to one document in a collection or corpus. The importance of a term increases in proportion to the number of times it appears in the document, but decreases as the term becomes more frequent across the corpus. Search engines often use variants of TF-IDF weighting as a measure of the relevance between a file and a user query. In addition to TF-IDF, Internet search engines also use ranking methods based on link analysis to determine the order in which files appear in the search results.

The main idea of TF-IDF is: if a word or phrase appears frequently in one article but rarely in other articles, the word or phrase is considered to have good classification ability and is suitable for classification. TF-IDF is simply TF * IDF, where TF is the term frequency and IDF is the inverse document frequency. TF indicates how frequently term t appears in document d. The main idea of IDF is: the fewer documents contain term t, that is, the smaller the value of n, the larger the IDF, which indicates that term t has good classification ability.

Suppose the number of documents in a class C that contain term t is m, and the number of documents in other classes that contain t is k. The total number of documents containing t is then n = m + k. When m is large, n is also large, and the IDF value obtained from the IDF formula is small, which suggests that term t has weak class-discrimination ability. In fact, however, if a term appears frequently in the documents of one class, it represents the characteristics of that class well. Such terms should be given a higher weight and selected as feature words that distinguish the class from other documents. This is the shortcoming of IDF.
(Note: it follows that TF-IDF is best suited to distinguishing individual documents within a single, coarsely grouped document set.)
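The IDF shortcoming described above can be made concrete with a small numeric sketch. The corpus size and the values of m and k below are invented for illustration, and the plain log(N / n) form of IDF is assumed.

```python
import math


def idf(total_docs: int, docs_with_term: int) -> float:
    """Plain IDF, log(N / n). Assumes docs_with_term > 0."""
    return math.log(total_docs / docs_with_term)


# Hypothetical corpus of 100 documents, 20 of which belong to class C.
TOTAL_DOCS = 100

# Term A is concentrated in class C: m = 18 documents inside C, k = 2 outside.
m_a, k_a = 18, 2
# Term B is thinly scattered: m = 1 document inside C, k = 4 outside.
m_b, k_b = 1, 4

idf_a = idf(TOTAL_DOCS, m_a + k_a)  # n = m + k = 20
idf_b = idf(TOTAL_DOCS, m_b + k_b)  # n = m + k = 5

print(f"term A (typical of class C): n={m_a + k_a}, idf={idf_a:.3f}")
print(f"term B (scattered):          n={m_b + k_b}, idf={idf_b:.3f}")
```

Term A characterizes class C far better than term B, yet it receives the lower IDF because IDF only looks at n and ignores how the documents containing the term are distributed across classes.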
In a given file, term frequency (TF) refers to the number of times a given word appears in the file. This count is usually normalized to prevent a bias towards long files. (The same word is likely to have a higher raw count in a long file than in a short one, regardless of whether the word is important.) Inverse document frequency (IDF) is a measure of the general importance of a word. The IDF of a specific word can be obtained by dividing the total number of files by the number of files containing the word and taking the logarithm of the quotient. A high term frequency within a specific file, combined with a low document frequency of the word across the entire file set, produces a high TF-IDF weight. TF-IDF therefore tends to filter out common words and retain the more distinctive words in a document.

Theoretical basis: the TF-IDF algorithm is built on the assumption that the words most useful for telling documents apart are those that appear frequently in one document and infrequently in the other documents of the set. Using TF as the measure along each coordinate of the feature space therefore reflects the characteristics of similar texts. In addition, to account for a word's ability to distinguish between classes, the TF-IDF method holds that the lower a word's document frequency, the greater its ability to distinguish different classes of text. This is why the concept of inverse document frequency (IDF) is introduced. The product of TF and IDF is used as the value along each coordinate of the feature space, which adjusts the TF weight; the purpose of the adjustment is to highlight important words and suppress secondary ones. In essence, however, IDF is a weighting method that tries to suppress noise.
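As a minimal sketch of the calculation described above, the following code computes a length-normalized TF and a logarithmic IDF for a toy corpus. The function name tf_idf and the sample sentences are assumptions made for illustration; practical systems often use smoothed variants of both factors.

```python
import math
from collections import Counter


def tf_idf(corpus: list[list[str]]) -> list[dict[str, float]]:
    """Return one {term: weight} mapping per tokenized document."""
    n_docs = len(corpus)

    # Document frequency: the number of documents that contain each term.
    df = Counter()
    for doc in corpus:
        df.update(set(doc))

    weights = []
    for doc in corpus:
        counts = Counter(doc)
        length = len(doc)
        weights.append({
            # TF is normalized by document length; IDF is log(N / df).
            term: (count / length) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Toy corpus, already tokenized (illustrative data only).
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the stock market fell".split(),
]
weights = tf_idf(docs)

for term, weight in sorted(weights[0].items(), key=lambda kv: -kv[1]):
    print(f"{term:>4}  {weight:.4f}")
```

In this toy corpus "the" occurs in every document, so its IDF is log(1) = 0 and its weight vanishes, while "mat", which occurs in only one document, outranks "cat", which occurs in two.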
IDF simply assumes that words with a low document frequency are more important and words with a high document frequency are useless, which is obviously not entirely correct. The simple structure of IDF cannot effectively reflect the importance of words or the distribution of feature words, so it cannot adjust the weights well, and the accuracy of the TF-IDF method is therefore not very high. In addition, the TF-IDF algorithm does not reflect the position of words. For Web documents, the weight calculation should reflect the structure of the HTML. Feature words inside different tags reflect the content of the article to different degrees, so their weights should be calculated differently. Therefore, we should assign different coefficients to feature words in different positions on the web page and multiply them by the term frequency of those words to improve the text representation.
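One way to sketch the position-dependent coefficients suggested above is shown below. The coefficient values and the field names ("title", "h1", "body") are hypothetical, and the page is assumed to have been split into tokenized fields already; the article does not prescribe specific values.

```python
from collections import Counter

# Hypothetical coefficients per HTML position; real values would need tuning.
POSITION_WEIGHTS = {"title": 3.0, "h1": 2.0, "body": 1.0}


def weighted_term_counts(fields: dict[str, list[str]]) -> dict[str, float]:
    """Sum each term's occurrences, scaled by the coefficient of its position."""
    totals = Counter()
    for field, tokens in fields.items():
        # Positions without a configured coefficient count as plain body text.
        coefficient = POSITION_WEIGHTS.get(field, 1.0)
        for token in tokens:
            totals[token] += coefficient
    return dict(totals)


# Illustrative page, already parsed and tokenized by position.
page = {
    "title": ["tf-idf", "tutorial"],
    "h1": ["tf-idf"],
    "body": ["tf-idf", "weights", "words", "tutorial"],
}
counts = weighted_term_counts(page)
print(counts)
```

The resulting position-weighted counts can stand in for the raw term counts when TF is computed, so that a word in the title contributes more than the same word in the body.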

Applications of TF-IDF:

1. Comparing the similarity between documents, and between a query and documents. TF-IDF supplies the weight of each term in a document.

2. On the web, users can be clustered by their browsing behavior. Concretely, extract the URLs browsed by each user from the log to obtain a user-URL matrix. Each user is represented as a vector whose components are the TF-IDF weights of the web pages (URLs) the user has browsed. The similarity between users can then be calculated: if the similarity between two users is high, their browsing behavior is similar.
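Both applications above can be sketched with the same two helpers: one that turns bags of items into sparse TF-IDF vectors and one that compares two vectors. Cosine similarity is assumed here as the similarity measure, since the article does not name one, and the documents, users, and URLs are invented sample data.

```python
import math
from collections import Counter


def tf_idf_vectors(bags: dict[str, list[str]]) -> dict[str, dict[str, float]]:
    """Map each name to a sparse TF-IDF vector over the items in its bag."""
    n = len(bags)
    df = Counter()
    for items in bags.values():
        df.update(set(items))
    vectors = {}
    for name, items in bags.items():
        counts = Counter(items)
        vectors[name] = {
            item: (count / len(items)) * math.log(n / df[item])
            for item, count in counts.items()
        }
    return vectors


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity of two sparse vectors; 0.0 if either has zero length."""
    dot = sum(w * b.get(key, 0.0) for key, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(
        sum(w * w for w in b.values())
    )
    return dot / norm if norm else 0.0


# Application 1: documents as bags of words (toy data).
docs = {
    "d1": "cheap flights to paris".split(),
    "d2": "paris hotel deals".split(),
    "d3": "python sorting tutorial".split(),
}
doc_vecs = tf_idf_vectors(docs)
sim_d1_d2 = cosine(doc_vecs["d1"], doc_vecs["d2"])
sim_d1_d3 = cosine(doc_vecs["d1"], doc_vecs["d3"])

# Application 2: users as bags of visited URLs (hypothetical log extract).
users = {
    "alice": ["/sports", "/sports", "/news"],
    "bob": ["/sports", "/news"],
    "carol": ["/cooking", "/cooking", "/news"],
}
user_vecs = tf_idf_vectors(users)
sim_alice_bob = cosine(user_vecs["alice"], user_vecs["bob"])
sim_alice_carol = cosine(user_vecs["alice"], user_vecs["carol"])

print(f"d1~d2={sim_d1_d2:.3f}  d1~d3={sim_d1_d3:.3f}")
print(f"alice~bob={sim_alice_bob:.3f}  alice~carol={sim_alice_carol:.3f}")
```

Because every user in the sample visits /news, that URL receives an IDF of zero and does not contribute to similarity; only the less common pages separate the users. The pairwise similarities could then be passed to whatever clustering method is preferred.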

In computational advertising, the first step is to divide users into clusters and then serve each cluster the advertisements that suit it ....

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
