
Text Similarity algorithm

Source: http://www.cnblogs.com/liangxiaxu/archive/2012/05/05/2484972.html

1. TF-IDF (an important invention in information retrieval)

1.1 TF

Term frequency (TF) measures how often a keyword occurs in an article. For example, if an article of M words contains the keyword n times, then

tf = n / M (Formula 1.1-1)

is the term frequency of that keyword in the article.

1.2 IDF

Inverse document frequency (IDF) measures the weight a keyword carries as an index term. It is calculated by the formula

idf = log(D / Dw) (Formula 1.2-1)

where D is the total number of articles in the collection and Dw is the number of articles in which the keyword appears.
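The two formulas above can be illustrated with a short Python sketch (added here purely for illustration; the toy corpus, the whitespace tokenization and the function names are all invented, and the product tf * idf is the usual combined TF-IDF weight):

    import math
    from collections import Counter

    def tf(keyword, doc_tokens):
        """Term frequency: occurrences of the keyword divided by the document length (n / M)."""
        return Counter(doc_tokens)[keyword] / len(doc_tokens)

    def idf(keyword, corpus):
        """Inverse document frequency: log(D / Dw), where D is the number of documents
        and Dw is the number of documents containing the keyword."""
        D = len(corpus)
        Dw = sum(1 for doc in corpus if keyword in doc)
        return math.log(D / Dw) if Dw else 0.0

    # Toy corpus: each document is a list of tokens (real Chinese text would need a segmenter first).
    corpus = [
        "the cat sat on the mat".split(),
        "the dog chased the cat".split(),
        "dogs and cats are pets".split(),
    ]
    print(tf("cat", corpus[0]) * idf("cat", corpus))  # combined TF-IDF weight of "cat" in document 0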

2. The cosine algorithm based on space vectors

2.1 Algorithm steps

Preprocessing → text feature selection → weighting → generating the vector space model → computing the cosine.

2.2 Description of the steps

2.2.1 Preprocessing

Preprocessing consists mainly of Chinese word segmentation and stop-word removal. An open-source tool for word segmentation is ICTCLAS.

Then, based on a stop-word list, words, symbols, punctuation and garbled characters that carry little meaning but occur very frequently are removed from the corpus text. Words such as "this", "the", "and", "will", "for" appear in almost any Chinese text, yet they contribute little to its meaning. Removing stop words with such a list is simply a lookup process: for each token, check whether it is in the stop-word list, and if so, delete it from the token string.

Figure 2.2.1-1 Preprocessing process of Chinese text similarity algorithm
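The lookup process described above can be sketched in a few lines of Python (illustrative only; the stop-word list and the tokens are invented, and real Chinese text would first be segmented with a tool such as ICTCLAS):

    # Hypothetical stop-word list; a real one would be much larger.
    stop_words = {"的", "了", "和", "是", "将", "对于", "，", "。", "！"}

    def remove_stop_words(tokens):
        """Keep only the tokens that are not in the stop-word list."""
        return [t for t in tokens if t not in stop_words]

    # Tokens as they might come out of a word segmenter.
    print(remove_stop_words(["这", "是", "一个", "例子", "。"]))  # -> ['这', '一个', '例子']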

2.2.2 Text feature item selection and weighting

After filtering out common adverbs and auxiliary words, keywords are selected according to the frequency of the remaining words; the frequency is computed with the TF formula above.

Weighting assigns each keyword a value that reflects how strongly it characterizes the text; the weight is calculated with the IDF formula above.

2.2.3 Vector space model (VSM) and cosine computation

The basic idea of the vector space model is to simplify the document into an n-dimensional vector representation of the weights of the feature items (keywords).

The model assumes that words are independent of one another (this assumption prevents the model from making semantic judgments; the linear-independence assumption between keywords is the main weakness of the vector space model) and represents a text as a vector. This simplifies the complex relationships among the keywords of a text: each document is reduced to a very simple vector, which makes the model computationally tractable.

In a vector space model, text refers to a variety of machine-readable records.

A text is denoted D (document). A feature item (term, denoted T) is a basic linguistic unit of document D that can represent its content, typically a word or phrase. A document can therefore be represented by its set of feature items as D(T1, T2, ..., Tn), where Tk is a feature item and 1 <= k <= n.

The following is an explanation of the vector space model (specifically, the weight vector space).

Suppose a document has four feature items a, b, c, d; then it can be expressed as

D(a, b, c, d)

Any other text to be compared with it must list its feature items in the same order. For a text with n feature items, each feature item is usually assigned a weight to indicate its importance, i.e.

D = D(T1, W1; T2, W2; ...; Tn, Wn)

which is abbreviated as

D = D(W1, W2, ..., Wn)

We call this the weight-vector representation of text D, where Wk is the weight of Tk, 1 <= k <= n.

In the example above, if the weights of a, b, c and d are 30, 20, 20 and 10 respectively, then the text is represented by the vector

D(30, 20, 20, 10)

In the vector space model, the content correlation sim(D1, D2) between two texts D1 and D2 is expressed as the cosine of the angle between their vectors:

sim(D1, D2) = (w11*w21 + w12*w22 + ... + w1n*w2n) / (sqrt(w11^2 + ... + w1n^2) * sqrt(w21^2 + ... + w2n^2))

where w1k and w2k are the weights of the k-th feature item of D1 and D2 respectively, 1 <= k <= n.

The following is an example of a cosine calculation using this model.

In automatic text classification, a similar method can be used to calculate the relevance between a document to be classified and a target class.

Suppose the feature items of text D1 are a, b, c, d with weights 30, 20, 20, 10 respectively, and the feature items of class C1 are a, c, d, e with weights 40, 30, 20, 10 respectively. Over the combined feature set (a, b, c, d, e), the vector of D1 is

D1(30, 20, 20, 10, 0)

and the vector of C1 is expressed as

C1(40, 0, 30, 20, 10)

According to the formula above, the correlation between text D1 and class C1 is 0.86.

How exactly is the value 0.86 derived?

In mathematics, an n-dimensional vector

V = (v1, v2, v3, ..., vn)

has modulus

|V| = sqrt(v1*v1 + v2*v2 + ... + vn*vn)

The dot product of two vectors m and n is

m * n = m1*n1 + m2*n2 + ... + mn*nn

and the similarity is

sim = (m * n) / (|m| * |n|)

Its geometric meaning is the cosine of the angle between the two vectors.

Substituting the example values into the formula:

d1 * c1 = 30*40 + 20*0 + 20*30 + 10*20 + 0*10 = 2000

|d1| = sqrt(30*30 + 20*20 + 20*20 + 10*10 + 0*0) = sqrt(1800)

|c1| = sqrt(40*40 + 0*0 + 30*30 + 20*20 + 10*10) = sqrt(3000)

sim = d1*c1 / (|d1| * |c1|) = 2000 / sqrt(1800*3000) = 0.86066

Complete.
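This arithmetic can be checked with a short Python sketch (added here purely for illustration; the vectors are the ones from the example above):

    import math

    def cosine(v1, v2):
        """Cosine of the angle between two equal-length weight vectors."""
        dot = sum(a * b for a, b in zip(v1, v2))
        norm1 = math.sqrt(sum(a * a for a in v1))
        norm2 = math.sqrt(sum(b * b for b in v2))
        return dot / (norm1 * norm2)

    d1 = [30, 20, 20, 10, 0]   # weights of D1 over the features (a, b, c, d, e)
    c1 = [40, 0, 30, 20, 10]   # weights of C1 over the same features
    print(round(cosine(d1, c1), 5))  # -> 0.86066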

2.3 Algorithm Implementation

Open-source code: Text-Similarity-0.08

Description: a Perl script; the stop-word list can be customized; no semantic recognition; not suitable for Chinese.

Limitations: English only, and no semantic similarity judgment.

Compile and install:

(1) Enter the bin/ directory under the code root directory and edit text_similarity.pl, changing its first line to #!/usr/bin/perl.

(2) Return to the code root directory and execute, in order:

perl Makefile.PL

make

make test

make install

(3) Enter bin/ again to run the test.

Figure 2.3-1 Code effects

From the output you can see that sentences differing only in punctuation (for example ".... This is one" versus "... () () This is one") have a matching degree of 1, while sentences that share only part of their words (such as ".... This is one" versus "???? This is the") have a matching degree of 0.66.

This shows that the matching algorithm's stop-word removal function works.

2.4 Defects

Algorithms of this kind do not handle two natural-language phenomena in text data well, namely synonymy and polysemy, and this significantly affects retrieval accuracy.

2.5 Algorithm variants

Figure 2.5-1 Algorithm variant (red)

3. Improved algorithms

3.1 Latent semantic indexing

Latent semantic indexing (LSI) uses the singular value decomposition (SVD) of matrix theory to transform the term-frequency matrix. First, a term-document matrix is built from the whole document collection; each entry is an integer giving the number of times a particular word appears in a particular document. The matrix is then decomposed by SVD, and the smaller singular values are discarded. The resulting singular vectors and singular-value matrix are used to map document vectors and query vectors into a subspace in which the semantic relationships of the original matrix are preserved. Finally, the cosine similarity between vectors can be computed as a normalized inner product, and texts are compared according to the result.

The only change LSI introduces is the removal of the small singular values: the features associated with small singular values are essentially irrelevant when computing similarity, and keeping them would reduce the accuracy of relevance judgments. The features that are kept are those that have a significant effect on the position of the document vectors in the m-dimensional space. Removing small singular values transforms the document feature space into a document concept space. The cosine similarity computed by inner products in this space is more reliable than that computed on the original term vectors, which is the main reason for using LSI. The drawback of LSI is that its effectiveness depends on contextual information; a sparse corpus does not expose the latent semantics well.
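As a rough illustration of the SVD step (not from the original article; the tiny term-document matrix, the rank k and the use of numpy are all assumptions), documents can be mapped into a low-rank concept space like this:

    import numpy as np

    # Toy term-document matrix: rows are terms, columns are documents,
    # entries are raw occurrence counts.
    A = np.array([
        [2, 0, 1, 0],
        [1, 1, 0, 0],
        [0, 2, 0, 1],
        [0, 0, 1, 2],
    ], dtype=float)

    # Singular value decomposition, keeping only the k largest singular values.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = 2
    Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]

    # Document vectors in the k-dimensional concept space (columns of diag(sk) @ Vtk).
    docs_k = np.diag(sk) @ Vtk

    def cos(u, v):
        """Cosine similarity between two concept-space vectors."""
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    print(cos(docs_k[:, 0], docs_k[:, 2]))  # similarity of documents 0 and 2 in the reduced space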

3.2 Text similarity algorithm based on semantic similarity

Representing text with the vector space model (VSM) is widely accepted in this field because of its advantages as a knowledge representation. In this model, the content of a text is formalized as a point in a multidimensional space, given by a vector; processing text content is thus reduced to vector computation in that space, which greatly lowers the complexity of the problem. However, the model considers only the statistical characteristics of words in context, assumes that the keywords are linearly independent, and ignores the semantic information of the words themselves, so it has certain limitations.

The algorithm flow combined with semantic similarity calculation is as follows:

Figure 3.2-1 Flowchart of semantic similarity algorithm based on vector space

There are two main approaches to the semantic relatedness computation that produces the similarity matrix: one based on the HowNet knowledge base and one based on WordNet.

4. Other similarity measurement methods

4.1 Chinese fuzzy search algorithm based on phonetic similarity

Unlike traditional keyword-based matching, this approach measures the similarity between Chinese character strings by an edit distance based on phonetic (pinyin) similarity.

The paper proposes three edit distances: one based on Chinese characters, one based on pinyin, and an improved pinyin-based edit distance.

4.2 Longest common subsequence

(1) Build a matrix with the characters of one string as the rows and the characters of the other as the columns.

(2) For each cell, check whether the row character and the column character are the same; if they are, set the cell to 1.

(3) The longest common substring is found by locating the longest diagonal run of 1s.

To improve the algorithm, when the characters of a cell match we can add to it the value of the cell to its upper left (d[i-1, j-1]), so that each cell holds the length of the common substring ending there. The maximum value and its row index are then enough to cut out the longest common substring, as in the sketch below.
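A minimal Python sketch of this matrix construction (added for illustration; the function name and the test strings are invented):

    def longest_common_substring(s1, s2):
        """d[i][j] holds the length of the common substring ending at s1[i-1] and s2[j-1]."""
        d = [[0] * (len(s2) + 1) for _ in range(len(s1) + 1)]
        best_len, best_end = 0, 0              # best run length and its end index in s1
        for i in range(1, len(s1) + 1):
            for j in range(1, len(s2) + 1):
                if s1[i - 1] == s2[j - 1]:
                    d[i][j] = d[i - 1][j - 1] + 1      # extend the diagonal run
                    if d[i][j] > best_len:
                        best_len, best_end = d[i][j], i
        return s1[best_end - best_len:best_end]        # cut the substring from the row index and the maximum value

    print(longest_common_substring("text similarity", "similar texts"))  # -> "similar"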

4.3 Minimum edit distance algorithm

(1) Edit distance in the narrow sense

Let A and B be two strings. The edit distance in the narrow sense is the minimum number of delete (remove a character from A), insert (insert a character into A) and replace (replace a character of A with another character) operations needed to turn A into B, written ed(A, B). Intuitively, the more steps it takes to turn one string into the other, the more different the two strings are. A minimal sketch follows.
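A minimal Python sketch of ed(A, B) using the standard dynamic-programming recurrence (added for illustration, not taken from the article):

    def edit_distance(a, b):
        """Minimum number of insertions, deletions and substitutions turning a into b."""
        m, n = len(a), len(b)
        dp = [[0] * (n + 1) for _ in range(m + 1)]   # dp[i][j] = distance between a[:i] and b[:j]
        for i in range(m + 1):
            dp[i][0] = i                             # delete all of a[:i]
        for j in range(n + 1):
            dp[0][j] = j                             # insert all of b[:j]
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = 0 if a[i - 1] == b[j - 1] else 1
                dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                               dp[i][j - 1] + 1,         # insertion
                               dp[i - 1][j - 1] + cost)  # substitution (or match)
        return dp[m][n]

    print(edit_distance("kitten", "sitting"))  # -> 3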

(2) Steps

1. Preprocess the two texts, replacing every non-text character with the segment marker "#".

2. Take the longer text as the reference text and traverse the segments of the shorter text; whenever the longer text contains a segment of the shorter one, remove that content from it, and for segments that are not found, accumulate their length as unmatched.

3. Compare the accumulated unmatched length with the combined length of the two texts; the ratio is the mismatch ratio. A rough sketch of these steps follows.
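The sketch below is my own reading of these steps and is only illustrative; the segmentation rule (splitting on the "#" marker), the regular expression and the function name are all assumptions:

    import re

    def mismatch_ratio(text_a, text_b):
        """Rough paragraph-matching sketch: '#' marks segment boundaries."""
        # Step 1: replace every non-word character with the segment marker '#'.
        a = re.sub(r"\W+", "#", text_a)
        b = re.sub(r"\W+", "#", text_b)
        long_text, short_text = (a, b) if len(a) >= len(b) else (b, a)
        unmatched = 0
        # Step 2: traverse the segments of the shorter text.
        for segment in filter(None, short_text.split("#")):
            if segment in long_text:
                long_text = long_text.replace(segment, "", 1)  # remove the matched content
            else:
                unmatched += len(segment)                      # accumulate unmatched length
        # Step 3: ratio of unmatched length to the combined length of both texts.
        return unmatched / (len(a) + len(b))

    print(mismatch_ratio("the cat sat on the mat", "the cat sat on a mat"))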

5. Summary

Several means of measuring text similarity:

(1) Longest common substring (based on the term space)

(2) Longest common subsequence (based on the weight space and the term space)

(3) Minimum edit distance (based on the term space)

(4) Hamming distance (based on the weight space)

(5) Cosine similarity (based on the weight space)
