Introduction to string similarity algorithms (collected notes)

Source: Internet
Author: User

I have recently been working on an application that needs string similarity, so I am posting the material I collected. Feel free to refer to it if you need it.
1. Edit distance (Levenshtein distance)
The edit distance is the minimum number of insertions, deletions, and substitutions required to convert the source string (s) into the target string (t).
It is widely used in NLP, for example in evaluation metrics such as WER and mWER. It is also commonly used to measure how much a text has been changed relative to its original version. Edit distance is also called Levenshtein distance, after the Russian scientist Levenshtein, who first proposed it.
The Levenshtein distance algorithm is a form of dynamic programming. The idea is to compare the two strings from left to right, record the distance between the prefixes that have already been compared, and use those values to obtain the distance at the next character position. Take gumbo and gambol as an example. To compute matrix cell D[3, 3], that is, to compare gum with gam, we take the smallest of three values derived from prefix pairs that have already been compared: gu-gam plus 1, gum-ga plus 1, and gu-ga plus the substitution cost (0 here, because both prefixes end in m). The matrix is therefore filled in from top left to bottom right.
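The matrix construction described above can be sketched as follows. This is only a minimal illustration, not the code from the original post: the function name `levenshtein_matrix` is made up for this example, and it compares plain bytes with no special handling of multi-byte characters.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: builds the full (|s|+1) x (|t|+1) distance matrix,
// where d[i][j] is the edit distance between the first i characters of s
// and the first j characters of t.
std::vector<std::vector<int>> levenshtein_matrix(const std::string& s,
                                                 const std::string& t) {
    const std::size_t n = s.size();
    const std::size_t m = t.size();
    std::vector<std::vector<int>> d(n + 1, std::vector<int>(m + 1, 0));

    // Turning a prefix into the empty string costs one edit per character.
    for (std::size_t i = 0; i <= n; ++i) d[i][0] = static_cast<int>(i);
    for (std::size_t j = 0; j <= m; ++j) d[0][j] = static_cast<int>(j);

    // Fill the matrix from top left to bottom right.
    for (std::size_t i = 1; i <= n; ++i) {
        for (std::size_t j = 1; j <= m; ++j) {
            const int cost = (s[i - 1] == t[j - 1]) ? 0 : 1;
            d[i][j] = std::min({d[i - 1][j] + 1,          // delete
                                d[i][j - 1] + 1,          // insert
                                d[i - 1][j - 1] + cost}); // replace
        }
    }
    return d;
}
```

For gumbo and gambol, cell `[3][3]` of the returned matrix (gum against gam) is 1, and the bottom-right cell `[5][6]` gives the final distance of 2.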
Pseudocode for the edit distance:
integer LevenshteinDistance(character str1[1..lenStr1], character str2[1..lenStr2])
    declare int d[0..lenStr1, 0..lenStr2]
    declare int i, j, cost

    for i from 0 to lenStr1
        d[i, 0] := i
    for j from 0 to lenStr2
        d[0, j] := j
    for i from 1 to lenStr1
        for j from 1 to lenStr2
            if str1[i] = str2[j] then cost := 0
            else cost := 1
            d[i, j] := minimum(
                d[i-1, j] + 1,        // delete
                d[i, j-1] + 1,        // insert
                d[i-1, j-1] + cost    // replace
            )
    return d[lenStr1, lenStr2]

#include <cstdlib>   // malloc, free
#include <cstring>   // strlen

// Returns the smallest of three values.
int minimum(int a, int b, int c)
{
    int mi;

    mi = a;
    if (b < mi) {
        mi = b;
    }
    if (c < mi) {
        mi = c;
    }
    return mi;
}

// Returns a pointer to cell (col, row) of a matrix stored row by row.
// Each row holds ncols + 1 cells.
int *getcellpointer(int *porigin, int col, int row, int ncols)
{
    return porigin + col + (row * (ncols + 1));
}

// Reads the value stored in cell (col, row).
int getat(int *porigin, int col, int row, int ncols)
{
    int *pcell;

    pcell = getcellpointer(porigin, col, row, ncols);
    return *pcell;
}

// Stores x in cell (col, row).
void putat(int *porigin, int col, int row, int ncols, int x)
{
    int *pcell;
    pcell = getcellpointer(porigin, col, row, ncols);
    *pcell = x;
}

// Sets cost cell (col, row) to value if it is still unassigned (marked 2).
// Cells outside the (n + 1) x (m + 1) matrix are skipped.
// Returns true if this call assigned the cell.
static bool setifunset(int *cost, int col, int row, int n, int m, int value)
{
    if (col > n || row > m) {
        return false;
    }
    if (getat(cost, col, row, n) != 2) {
        return false;
    }
    putat(cost, col, row, n, value);
    return true;
}

// Returns true if c is a punctuation mark, digit, or other non-Chinese
// character (' ' to '~'), false if it is a byte of a Chinese character.
static bool issymbol(char c)
{
    return (c >= ' ' && c <= '@') || (c >= 'A' && c <= '~');
}

// Edit distance for strings that mix single-byte symbols with
// double-byte Chinese characters
int LD(const char *s, const char *t)
{
    int *d;      // pointer to matrix
    int n;       // length of s
    int m;       // length of t
    int i;       // iterates through s
    int j;       // iterates through t
    char s_i1;   // ith byte of s
    char s_i2;   // byte following the ith byte of s
    char t_j1;   // jth byte of t
    char t_j2;   // byte following the jth byte of t
    int *cost;   // cost matrix
    int result;  // result
    int cell;    // contents of target cell
    int above;   // contents of cell immediately above
    int left;    // contents of cell immediately to left
    int diag;    // contents of cell immediately above and to left
    int sz;      // size of matrix in bytes

    // Step 1: handle empty strings and allocate both matrices

    n = (int) strlen(s);
    m = (int) strlen(t);
    if (n == 0) {
        return m;
    }
    if (m == 0) {
        return n;
    }
    sz = (n + 1) * (m + 1) * sizeof(int);
    d = (int *) malloc(sz);
    cost = (int *) malloc(sz);

    // Step 2: initialize the first row and column of the distance matrix

    for (i = 0; i <= n; i++) {
        putat(d, i, 0, n, i);
    }
    for (j = 0; j <= m; j++) {
        putat(d, 0, j, n, j);
    }
    // Initialize every cell of the cost matrix to the same value (2), so that
    // later we can tell from this value whether a cell has been assigned.
    for (int g = 0; g <= m; g++) {
        for (int h = 0; h <= n; h++) {
            putat(cost, h, g, n, 2);
        }
    }

    // Step 3: walk through s

    for (i = 1; i <= n; i++) {
        s_i1 = s[i - 1];
        s_i2 = s[i];
        bool sbd = issymbol(s_i1);  // true: s is a symbol, false: a Chinese character

        // Step 4: walk through t

        for (j = 1; j <= m; j++) {
            t_j1 = t[j - 1];
            t_j2 = t[j];

            // Step 5: determine the cost of this cell

            bool tbd = issymbol(t_j1);  // true: t is a symbol, false: a Chinese character
            if (!sbd && !tbd) {
                // Both are Chinese characters: compare both bytes, then give the
                // three adjacent unassigned cells the same cost, because a
                // Chinese character occupies two bytes in s and in t.
                int c = (s_i1 == t_j1 && s_i2 == t_j2) ? 0 : 1;
                if (setifunset(cost, i, j, n, m, c)) {
                    setifunset(cost, i + 1, j, n, m, c);
                    setifunset(cost, i, j + 1, n, m, c);
                    setifunset(cost, i + 1, j + 1, n, m, c);
                }
            } else if (!sbd) {
                // s is a Chinese character, t is a symbol
                if (setifunset(cost, i, j, n, m, 1)) {
                    setifunset(cost, i + 1, j, n, m, 1);
                }
            } else if (!tbd) {
                // s is a symbol, t is a Chinese character
                if (setifunset(cost, i, j, n, m, 1)) {
                    setifunset(cost, i, j + 1, n, m, 1);
                }
            } else {
                // Both are symbols: compare the single bytes
                setifunset(cost, i, j, n, m, (s_i1 == t_j1) ? 0 : 1);
            }

            // Step 6: take the cheapest of delete, insert, and replace

            above = getat(d, i - 1, j, n);
            left = getat(d, i, j - 1, n);
            diag = getat(d, i - 1, j - 1, n);
            int curcost = getat(cost, i, j, n);
            cell = minimum(above + 1, left + 1, diag + curcost);
            putat(d, i, j, n, cell);
        }
    }

    // Step 7: the bottom-right cell holds the distance

    result = getat(d, n, m, n);
    free(d);
    free(cost);
    return result;
}

2. Longest common substring (LCS)
The LCS problem is to find the longest common substring of two strings. One solution uses a matrix to record, for every pair of positions in the two strings, whether the two characters match: 1 if they match, 0 otherwise. The longest diagonal run of 1s then gives the position of the longest common substring.
The following is the matching matrix between string 21232523311324 and string 312123223445. The former runs in the X direction and the latter in the Y direction. The longest diagonal run of 1s (marked in red in the original post) is the longest common substring, 21232.
0 0 0 1 0 0 1 1 0 0 1 0 0 0
0 1 0 0 0 0 0 0 1 1 0 0 0 0
1 0 1 0 1 0 1 0 0 0 0 1 0 0
0 1 0 0 0 0 0 0 1 1 0 0 0 0
1 0 1 0 1 0 1 0 0 0 0 1 0 0
0 0 0 1 0 0 1 1 0 0 1 0 0 0
1 0 1 0 1 0 1 0 0 0 0 1 0 0
1 0 1 0 1 0 1 0 0 0 0 1 0 0
0 0 0 1 0 0 1 1 0 0 1 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 1 0
0 0 0 0 0 0 0 0 0 0 0 0 1 0
0 0 0 0 0 1 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
However, finding the longest diagonal run of 1s in this matrix of 0s and 1s takes additional time. By changing how the matrix is generated and keeping marker variables, this extra pass can be avoided. The new matrix is generated as follows:
0 0 0 1 0 0 1 1 0 0 1 0 0 0
0 1 0 0 0 0 0 0 0 2 1 0 0 0
1 0 2 0 1 0 1 0 0 0 0 1 0 0
0 2 0 0 0 0 0 0 1 1 0 0 0 0
1 0 3 0 1 0 1 0 0 0 0 1 0 0
0 0 0 4 0 0 0 2 1 0 1 0 0 0
1 0 1 0 5 0 1 0 0 0 0 2 0 0
1 0 1 0 1 0 1 0 0 0 0 1 0 0
0 0 0 2 0 0 2 1 0 0 1 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 1 0
0 0 0 0 0 0 0 0 0 0 0 0 1 0
0 0 0 0 0 1 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
When two characters match, we do not simply set the corresponding element to 1; we set it to the value of the element to its upper left plus one. Two marker variables record the position and the value of the largest element in the matrix. While the matrix is being generated, we check whether the element just produced is larger than the current maximum and, if so, update the marker variables. By the time the matrix is complete, the position and length of the longest common substring are already known.
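The counting matrix described above can be sketched with a full two-dimensional table, which is easier to follow than a space-optimized version. This is an illustrative sketch only: the function name `longest_common_substring` and the variables `best_len` and `best_end` (standing in for the marker variables) are assumptions made for this example.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: returns the longest common substring of x and y.
// If several substrings share the maximum length, the first one found wins.
std::string longest_common_substring(const std::string& x, const std::string& y) {
    // m[i][j] is the length of the common run that ends at y[i-1] and x[j-1].
    // Row 0 and column 0 stay at 0 so the upper-left lookup is always valid.
    std::vector<std::vector<int>> m(y.size() + 1,
                                    std::vector<int>(x.size() + 1, 0));
    int best_len = 0;          // marker: largest value seen so far
    std::size_t best_end = 0;  // marker: position in x just past that run

    for (std::size_t i = 1; i <= y.size(); ++i) {
        for (std::size_t j = 1; j <= x.size(); ++j) {
            if (y[i - 1] == x[j - 1]) {
                m[i][j] = m[i - 1][j - 1] + 1;  // upper-left element plus one
                if (m[i][j] > best_len) {
                    best_len = m[i][j];
                    best_end = j;
                }
            }
        }
    }
    return x.substr(best_end - static_cast<std::size_t>(best_len),
                    static_cast<std::size_t>(best_len));
}
```

Called with 21232523311324 and 312123223445, this returns 21232, matching the example in the text. The full table costs memory proportional to the product of the two lengths; keeping a single row is enough if only the result is needed.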

#include <cstring>   // strlen

// Longest common substring
// The caller owns the returned string and must release it with delete[].
char *LCS(const char *left, const char *right) {
    int lenleft, lenright;
    lenleft = (int) strlen(left);
    lenright = (int) strlen(right);
    int *c = new int[lenright];   // c[j]: length of the common run ending at right[j]
    int start, end, len;
    end = len = 0;
    for (int i = 0; i < lenleft; i++) {
        // Walk j downwards so that c[j - 1] still holds the previous row's value.
        for (int j = lenright - 1; j >= 0; j--) {
            if (left[i] == right[j]) {
                if (i == 0 || j == 0)
                    c[j] = 1;
                else
                    c[j] = c[j - 1] + 1;
            }
            else
                c[j] = 0;
            if (c[j] > len) {
                len = c[j];
                end = j;
            }
        }
    }
    char *p = new char[len + 1];
    start = end - len + 1;
    for (int i = start; i <= end; i++)
        p[i - start] = right[i];
    p[len] = '\0';
    delete[] c;
    return p;
}
3. Cosine similarity (vector space algorithm)
The cosine theorem is an old and widely applicable mathematical concept that is used across many disciplines and in practice. Here we briefly introduce how it is applied to measuring the similarity between two strings.
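For two vectors V1 and V2 separated by an angle θ, the basic formula is:

$$\cos\theta = \frac{V_1 \cdot V_2}{\lVert V_1 \rVert \, \lVert V_2 \rVert}$$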
Suppose we want the similarity Sim(S1, S2) of two strings S1 and S2. Assume that S1 and S2 together contain n distinct characters c1, c2, ..., cn. Determining the similarity between the strings can then be converted into determining the angle between the two vectors V1 and V2 that correspond to the strings. The larger the cosine, the smaller the angle between the vectors and the greater the similarity between S1 and S2.
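The text does not say how a string is turned into a vector. A simple assumption, used in the sketch below, is to count how often each character occurs, so that component k of the vector is the number of occurrences of ck. The function name `char_cosine` is hypothetical, and the code works on bytes, so a multi-byte character contributes one component per byte.

```cpp
#include <cmath>
#include <map>
#include <string>

// Hypothetical helper: cosine of the angle between the character-count
// vectors of s1 and s2. Assumption: a string's vector holds, for each
// distinct character, the number of times it occurs in the string.
double char_cosine(const std::string& s1, const std::string& s2) {
    std::map<char, int> v1, v2;
    for (char c : s1) ++v1[c];
    for (char c : s2) ++v2[c];

    double dot = 0.0, norm1 = 0.0, norm2 = 0.0;
    for (const auto& entry : v1) {
        norm1 += static_cast<double>(entry.second) * entry.second;
        const auto match = v2.find(entry.first);
        if (match != v2.end()) {
            dot += static_cast<double>(entry.second) * match->second;
        }
    }
    for (const auto& entry : v2) {
        norm2 += static_cast<double>(entry.second) * entry.second;
    }

    // Assumption: an empty string has no direction, so report 0 similarity.
    if (norm1 == 0.0 || norm2 == 0.0) return 0.0;
    return dot / (std::sqrt(norm1) * std::sqrt(norm2));
}
```

Because only counts are compared, character order is ignored: "ab" and "ba" get a cosine of 1 under this assumption, which may or may not be what an application wants.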
Introduction to the vector space model:
In the vector space model (VSM), a text is any machine-readable record. A feature item (term, denoted t) is a basic language unit that appears in document D and can represent the content of the document; feature items are mainly words or phrases. A text can be expressed as D(T1, T2, ..., Tn), where Tk is a feature item and 1 <= k <= n. For example, if a document contains four feature items a, b, c, and d, it can be represented as D(a, b, c, d). For a text containing n feature items, each feature item is usually given a weight that indicates its importance, that is, D = D(T1, W1; T2, W2; ...; Tn, Wn), abbreviated as D = D(W1, W2, ..., Wn). We call this the vector representation of text D, where Wk is the weight of Tk and 1 <= k <= n. In the example above, if the weights of a, b, c, and d are 30, 20, 20, and 10 respectively, the vector representation of the text is D(30, 20, 20, 10). In the vector space model, the content relevance Sim(D1, D2) between two texts D1 and D2 is commonly expressed as the cosine of the angle between their vectors. The formula is:

$$\mathrm{Sim}(D_1, D_2) = \cos\theta = \frac{\sum_{k=1}^{n} w_{1k}\, w_{2k}}{\sqrt{\sum_{k=1}^{n} w_{1k}^{2}}\;\sqrt{\sum_{k=1}^{n} w_{2k}^{2}}}$$

Here w1k and w2k are the weights of the k-th feature item of texts D1 and D2 respectively, with 1 <= k <= n. A similar method can be used to calculate the relevance between two strings.
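The weighted form of the similarity can be sketched as follows. This is not the author's implementation, which the post does not include. It assumes each text is already available as a map from feature item to weight, and that a feature item missing from a text has weight 0. The names `WeightedText` and `vsm_similarity` are hypothetical.

```cpp
#include <cmath>
#include <map>
#include <string>

// Assumed representation of D(T1, W1; T2, W2; ...): feature item -> weight.
using WeightedText = std::map<std::string, double>;

// Hypothetical helper: cosine of the angle between two weighted texts.
// Feature items absent from a text are treated as having weight 0.
double vsm_similarity(const WeightedText& d1, const WeightedText& d2) {
    double dot = 0.0, norm1 = 0.0, norm2 = 0.0;
    for (const auto& item : d1) {
        norm1 += item.second * item.second;
        const auto match = d2.find(item.first);
        if (match != d2.end()) {
            dot += item.second * match->second;  // w1k * w2k for shared items
        }
    }
    for (const auto& item : d2) {
        norm2 += item.second * item.second;
    }

    // Assumption: a text with no weighted items is similar to nothing.
    if (norm1 == 0.0 || norm2 == 0.0) return 0.0;
    return dot / (std::sqrt(norm1) * std::sqrt(norm2));
}
```

For example, a text with weights a = 3 and b = 4 compared with a text containing only a = 1 gives 3 / (5 * 1) = 0.6. How the weights are chosen (raw counts, TF-IDF, or something else) is outside the scope of this sketch.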
I could not find an implementation of this algorithm on the Internet. I have written one, but it is not general-purpose, so I will not post it. It is very simple; if you are interested, you can write one yourself.
