TF-IDF and text similarity measurement

Because I recently developed a personalized document recommendation system, I have been considering how to do content-based recommendation for users; in short, it comes down to measuring the similarity between documents and users.

TF-IDF (Term Frequency-Inverse Document Frequency) is a common weighting technique used in information retrieval and text mining. It is a statistical method for assessing how important a word is to a document in a collection or corpus. The importance of a word increases proportionally with the number of times it appears in the document, but decreases proportionally with its frequency of appearance across the corpus. Various forms of TF-IDF weighting are often used by search engines as a measure or rating of the relevance between a document and a user query. In addition to TF-IDF, Internet search engines also use rating methods based on link analysis to determine the order in which documents appear in search results.

In a given document, term frequency (TF) refers to the number of times a given word appears in that document. This count is usually normalized to prevent a bias toward long documents (the same word may have a higher raw count in a long document than in a short one, regardless of whether the word is actually important). For a word t_i in a particular document d_j, its importance can be expressed as:

tf_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}

In the formula above, n_{i,j} is the number of occurrences of the word in document d_j, and the denominator is the total number of occurrences of all words in document d_j.
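As a minimal sketch in Python (the function name and the toy sentence are illustrative, not from the original article), normalized term frequency can be computed like this:

```python
from collections import Counter

def term_frequency(doc_tokens):
    """Normalized TF: each word's count divided by the total number of words."""
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    return {word: count / total for word, count in counts.items()}

print(term_frequency("the cat sat on the mat".split()))
# {'the': 0.333..., 'cat': 0.167..., 'sat': 0.167..., 'on': 0.167..., 'mat': 0.167...}
```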

Inverse document frequency (IDF) is a measure of the general importance of a word. The IDF of a particular word is obtained by dividing the total number of documents by the number of documents containing that word, and then taking the logarithm of the quotient:

idf_i = \log \frac{|D|}{|\{j : t_i \in d_j\}|}

where:

  • |D|: the total number of documents in the corpus
  • |\{j : t_i \in d_j\}|: the number of documents containing the word t_i (that is, the number of documents where n_{i,j} \neq 0)

Then:

tfidf_{i,j} = tf_{i,j} \times idf_i

A high term frequency within a particular document, combined with a low document frequency for that word across the whole collection, produces a high TF-IDF weight. TF-IDF therefore tends to filter out common words and retain important ones.
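To make the two formulas concrete, here is a minimal self-contained Python sketch (the toy corpus and function name are illustrative, not from the original article) that combines them:

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Compute the TF-IDF weight of every word in every document of a tokenized corpus."""
    n_docs = len(corpus)
    # Document frequency: the number of documents each word appears in.
    df = Counter(word for doc in corpus for word in set(doc))
    weights = []
    for doc in corpus:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            word: (count / total) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return weights

corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the fish swam in the bowl".split(),
]
for doc_weights in tf_idf(corpus):
    print(doc_weights)
```

Note that "the" appears in every document, so its IDF is log(1) = 0 and its TF-IDF weight vanishes despite its high term frequency, which is exactly the filtering behavior described above.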

====================== Text Similarity Measures ======================

Method 1: Vector Space Model

In the vector space model, a text is any machine-readable record. A feature item (term, denoted T) is the basic language unit that indicates the content of document D and can represent that content; it is mainly a word or phrase. A text can thus be expressed as D(T1, T2, ..., Tn), where Tk is a feature item and 1 <= k <= n. For example, if a document contains the four feature items a, b, c, and d, it can be represented as D(a, b, c, d).

For a text containing n feature items, each feature item is usually assigned a weight indicating its importance, that is, D = D(T1, W1; T2, W2; ...; Tn, Wn), abbreviated as D = D(W1, W2, ..., Wn), which we call the vector representation of text D. Here Wk is the weight of Tk, 1 <= k <= n. In the example above, if the weights of a, b, c, and d are 30, 20, 20, and 10 respectively, the vector representation of the text is D(30, 20, 20, 10).

In the vector space model, the content relevance SIM(D1, D2) between two texts D1 and D2 is expressed as the cosine of the angle between their vectors. The formula is:

SIM(D_1, D_2) = \cos\theta = \frac{\sum_{k=1}^{n} W_{1k} W_{2k}}{\sqrt{\sum_{k=1}^{n} W_{1k}^2} \times \sqrt{\sum_{k=1}^{n} W_{2k}^2}}

W_{1k} and W_{2k} denote the weights of the k-th feature item of texts D1 and D2 respectively, 1 <= k <= n.

In automatic categorization, a similar method can be used to calculate the relevance between a document to be classified and a particular category. For example, suppose the feature items of text D1 are a, b, c, and d with weights 30, 20, 20, and 10, and the feature items of category C1 are a, c, d, and e with weights 40, 30, 20, and 10. Over the combined feature set (a, b, c, d, e), the vector of D1 is D1(30, 20, 20, 10, 0) and the vector of C1 is C1(40, 0, 30, 20, 10); the relevance between text D1 and category C1 then works out to 0.86.
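A minimal sketch of this calculation in Python (the function name is my own; the vectors are taken from the example above):

```python
import math

def cosine_similarity(v1, v2):
    """Cosine of the angle between two weight vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)

# D1 and C1 over the combined feature set (a, b, c, d, e)
d1 = [30, 20, 20, 10, 0]
c1 = [40, 0, 30, 20, 10]
print(round(cosine_similarity(d1, c1), 2))  # 0.86
```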

Method 2: String Similarity

There are many algorithms for string similarity; commonly used ones include the longest common substring and the edit distance.

The edit distance is the minimum number of insertions, deletions, and substitutions required to transform a source string S into a target string T. It is widely used in NLP; for example, some evaluation metrics (WER, mWER, etc.) use it to count how many changes a hypothesis makes relative to the original text. Edit distance is also called Levenshtein distance, after the Russian scientist Vladimir Levenshtein, who first proposed it.
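A minimal dynamic-programming sketch of edit distance in Python (the function name and sample strings are illustrative):

```python
def edit_distance(s, t):
    """Minimum number of insertions, deletions, and substitutions turning s into t."""
    m, n = len(s), len(t)
    # dp[i][j] holds the edit distance between s[:i] and t[:j].
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # delete all i characters of s
    for j in range(n + 1):
        dp[0][j] = j  # insert all j characters of t
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # deletion
                dp[i][j - 1] + 1,         # insertion
                dp[i - 1][j - 1] + cost,  # substitution (or match)
            )
    return dp[m][n]

print(edit_distance("kitten", "sitting"))  # 3
```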

Of course, there are many other methods besides these.
