Similarity Calculation function of PHP: Introduction to levenshtein

Source: Internet
Author: User

Instructions for use
Let's take a look at the levenshtein () function instructions in the manual:

The levenshtein () function returns the Levenshtein distance between two strings.

Levenshtein distance, also known as the editing distance, refers to the minimum number of edits required to convert a string from one to another. Licensed editing operations include replacing one character with another, inserting one character, and deleting one character.

For example, convert kitten to sitting:

Sitten (k → s)
Sittin (e → I)
The sitting (→ g) levenshtein () function gives each operation the same weight (replacement, insertion, and deletion. However, you can set optional insert, replace, and delete parameters to define the cost of each operation.

Syntax:

Levenshtein (string1, string2, insert, replace, delete)

Parameter description

String1 is required. The first string to be compared.
String2 is required. The second string to be compared.
Insert is optional. The cost of inserting a character. The default value is 1.
Replace is optional. The cost of replacing a character. The default value is 1.
Delete is optional. The cost of deleting a character. The default value is 1.
Tips and comments

If one of the strings exceeds 255 characters, the levenshtein () function returns-1.
The levenshtein () function is case insensitive.
The levenshtein () function is faster than the similar_text () function. However, the similar_text () function provides more accurate results that require less modification.
Example
Copy codeThe Code is as follows:
<? Php
Echo levenshtein ("Hello World", "ello World ");
Echo "<br/> ";
Echo levenshtein ("Hello World", "ello World", 10, 20, 30 );
?>


Output: 1 30

Source code analysis
Levenshtein () is a standard function. There is a file specifically implemented for this function under the/ext/standard/directory: levenshtein. c.

Levenshtein () selects the implementation method based on the number of parameters. When the parameters are 2 and 5, the reference_levdist () function is called to calculate the distance. The difference is that when the last three parameters are set to 2, the default value 1 is used.

In the implementation source code, we found a situation not described in the document: the levenshtein () function can also pass three parameters, and it will eventually call the custom_levdist () function. It uses the third parameter as the implementation of the custom function. The call example is as follows:
Copy codeThe Code is as follows:
Echo levenshtein ("Hello World", "ello World", 'strsub ');


The following message is displayed: The general Levenshtein support is not there yet. This is because this method has not yet been implemented. It is just a pitfall.

The implementation algorithm of the reference_levdist () function is a classic DP problem.

Given two strings x and y, calculate the minimum number of modifications to convert x to y. The modified rule can only be one of the following three types: delete, insert, and change.
A [I] [j] indicates the minimum number of operations required to change the first I character of x to the first j character of y. The state transition equation is as follows:
Copy codeThe Code is as follows:
When x [I] = y [j]: a [I] [j] = min (a [I-1] [J-1], a [I-1] [j] + 1, a [I] [J-1] + 1 );
When x [I]! = Y [j]: a [I] [j] = min (a [I-1] [J-1], a [I-1] [j], a [I] [J-1]) + 1;


Before using the state transition equation, we need to initialize the matrix d of (n + 1) (m + 1) and let the value of the first row and column grow from 0. Scan the two strings (nm-level), compare the characters, and use the state transition equation. The final result is $ a [$ l1] [$ l2.

The simple implementation process is as follows:
Copy codeThe Code is as follows:
<? PHP
$ S1 = "abcdd ";
$ L1 = strlen ($ s1 );
$ S2 = "aabbd ";
$ L2 = strlen ($ s2 );

 
For ($ I = 0; $ I <$ l1; $ I ++ ){
$ A [0] [$ I + 1] = $ I + 1;
}
For ($ I = 0; $ I <$ l2; $ I ++ ){
$ A [$ I + 1] [0] = $ I + 1;
}

For ($ I = 0; $ I <$ l2; $ I ++ ){
For ($ j = 0; $ j <$ l1; $ j ++ ){
If ($ s2 [$ I] ==$ s1 [$ j]) {
$ A [$ I + 1] [$ j + 1] = min ($ a [$ I] [$ j], $ a [$ I] [$ j + 1] + 1, $ a [$ I + 1] [$ j] + 1 );
} Else {
$ A [$ I + 1] [$ j + 1] = min ($ a [$ I] [$ j], $ a [$ I] [$ j + 1], $ a [$ I + 1] [$ j]) + 1;
}
}
}

Echo $ a [$ l1] [$ l2];
Echo "n ";
Echo levenshtein ($ s1, $ s2 );


In the implementation of PHP, the implementer clearly states in the comment that this function only optimizes memory usage without considering the speed. From the perspective of its implementation algorithm, the time complexity is O (m × n ). The optimization is to convert the two-dimensional array in the above state transition equation into two arrays. The simple implementation is as follows:
Copy codeThe Code is as follows:
<? PHP
$ S1 = "abcjfdkslfdd ";
$ L1 = strlen ($ s1 );
$ S2 = "aab84093840932bd ";
$ L2 = strlen ($ s2 );

$ Dis = 0;
For ($ I = 0; $ I <= $ l2; $ I ++ ){
$ P1 [$ I] = $ I;
}

For ($ I = 0; $ I <$ l1; $ I ++ ){
$ P2 [0] = $ p1 [0] + 1;

For ($ j = 0; $ j <$ l2; $ j ++ ){
If ($ s1 [$ I] ==$ s2 [$ j]) {
$ Dis = min ($ p1 [$ j], $ p1 [$ j + 1] + 1, $ p2 [$ j] + 1 );
} Else {
$ Dis = min ($ p1 [$ j] + 1, $ p1 [$ j + 1] + 1, $ p2 [$ j] + 1 ); // note that the last parameter here is $ p2
}
$ P2 [$ j + 1] = $ dis;
}
$ Tmp = $ p1;
$ P1 = $ p2;
$ P2 = $ tmp;
}

Echo "n ";
Echo $ p1 [$ l2];
Echo "n ";
Echo levenshtein ($ s1, $ s2 );

The above is the optimization of the previous classic DP by PHP kernel developers. The optimization point is that two one-dimensional arrays are repeatedly reused, one record the previous result and one record the result. If you assign different values to the three operations according to the PHP parameters, you can change the corresponding value of 1 to the corresponding value of the operation in the above algorithm. The first parameter of the min function corresponds to modification, the second parameter corresponds to deleting the source code sky, and the third parameter corresponds to adding.

Levenshtein distance description
Levenshtein distance was first invented by Russian scientist Vladimir Levenshtein in 1965 and named after him. It can be called edit distance without spelling ). Levenshtein distance can be used:
Spell checking)
Speech recognition (statement recognition)
DNA analysis)
Plagiarism detection (Plagiarism detection) LD uses the matrix of mn to store the distance value.

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.