About PHP Similarity calculation function: Introduction to the use of Levenshtein _php tutorial

Source: Internet
Author: User
Instructions for use
First look at the description of the Levenshtein () function on the manual:

The Levenshtein () function returns the Levenshtein distance between two strings.

Levenshtein distance, also known as the editing distance, refers to the minimum number of edit operations required between two strings, converted from one to another. Permission edits include replacing one character with another character, inserting a character, and deleting a character.

For example, convert kitten to sitting:

Sitten (K→s)
Sittin (E→i)
The sitting (→g) Levenshtein () function gives the same weight for each operation (replace, insert, and delete). However, you can define the cost of each operation by setting the optional Insert, replace, and delete parameters.

Grammar:

Levenshtein (String1,string2,insert,replace,delete)

Parameter description

String1 required. The first string to compare.
string2 required. The second string to compare.
Insert is optional. The cost of inserting a character. The default is 1.
Replace is optional. The cost of replacing a character. The default is 1.
Delete is optional. The cost of deleting a character. The default is 1.
Hints and Notes

If one of the strings exceeds 255 characters, the Levenshtein () function returns-1.
The Levenshtein () function is not case sensitive.
The Levenshtein () function is faster than the Similar_text () function. However, the Similar_text () function provides more precise results that require less modification.
Example
Copy the Code code as follows:
Echo Levenshtein ("Hello World", "Ello World");
echo "
";
Echo Levenshtein ("Hello World", "Ello World", 10,20,30);
?>


Output: 1 30

SOURCE Analysis
Levenshtein () is a standard function, and there are files specifically implemented for this function in the/ext/standard/directory: LEVENSHTEIN.C.

Levenshtein () selects the implementation based on the number of parameters, and calls the Reference_levdist () function to calculate the distance for a parameter of 2 and a parameter of 5. The difference is that for the latter three parameters, the parameter is 2 o'clock, and the default value of 1 is used.

And in the implementation of the source code we found a situation that is not documented in the document: the Levenshtein () function can also pass three parameters, which will eventually call the Custom_levdist () function. It takes the third parameter as an implementation of a custom function, and its invocation example is as follows:
Copy the Code code as follows:
Echo Levenshtein ("Hello World", "Ello World", ' strsub ');


Execution will be reported to Warning:the General Levenshtein-support is not there yet. This is because the method has not been implemented yet, just put a hole in it.

The implementation algorithm of the Reference_levdist () function is a classic DP problem.

Given two strings x and Y, the minimum number of modifications will change X to Y. The modified rule can only be one of three of the following: delete, insert, change.
Using A[i][j] Indicates the minimum number of operations required to change the first I character of X to the first J character of Y, then the state transition equation is:
Copy the Code code as follows:
When X[i]==y[j]: a[i][j] = min (a[i-1][j-1], a[i-1][j]+1, a[i][j-1]+1);
When X[i]!=y[j]: a[i][j] = min (a[i-1][j-1], a[i-1][j], a[i][j-1]) +1;


Before using the state transfer equation, we need to initialize (n+1) (m+1) matrix D and let the first row and column values grow from 0 onwards. Scan two strings (nm level), compare characters, use state transition equations, end $a[$l 1][$l 2] for their results.

The simple implementation process is as follows:
Copy the Code code as follows:
$s 1 = "ABCDD";
$l 1 = strlen ($s 1);
$s 2 = "AABBD";
$l 2 = strlen ($s 2);


for ($i = 0; $i < $l 1; $i + +) {
$a [0][$i + 1] = $i + 1;
}
for ($i = 0; $i < $l 2; $i + +) {
$a [$i + 1][0] = $i + 1;
}

for ($i = 0; $i < $l 2; $i + +) {
for ($j = 0; $j < $l 1; $j + +) {
if ($s 2[$i] = = $s 1[$j]) {
$a [$i + 1][$j + 1] = min ($a [$i] [$j], $a [$i] [$j + 1] + 1, $a [$i + 1][$j] + 1);
}else{
$a [$i + 1][$j + 1] = min ($a [$i] [$j], $a [$i] [$j + 1], $a [$i + 1][$j]) + 1;
}
}
}

echo $a [$l 1][$l 2];
echo "n";
Echo Levenshtein ($s 1, $s 2);


In the implementation of PHP, the implementation is clearly stated in the comments: This function only optimizes memory usage, without regard to speed, from its implementation algorithm, the time complexity of O (MXN). The optimization point is to change the two-dimensional array in the state transition equation above into two arrays. The simple implementation is as follows:
Copy CodeThe code is as follows:
$s 1 = "ABCJFDKSLFDD";
$l 1 = strlen ($s 1);
$s 2 = "aab84093840932bd";
$l 2 = strlen ($s 2);

$dis = 0;
for ($i = 0; $i <= $l 2; $i + +) {
$p 1[$i] = $i;
}

for ($i = 0; $i < $l 1; $i + +) {
$p 2[0] = $p 1[0] + 1;

for ($j = 0; $j < $l 2; $j + +) {
if ($s 1[$i] = = $s 2[$j]) {
$dis = min ($p 1[$j], $p 1[$j + 1] + 1, $p 2[$j] + 1);
}else{
$dis = min ($p 1[$j] + 1, $p 1[$j + 1] + 1, $p 2[$j] + 1); Note that the last parameter here is $P2
}
$p 2[$j + 1] = $dis;
}
$tmp = $p 1;
$p 1 = $p 2;
$p 2 = $tmp;
}

echo "n";
echo $p 1[$l 2];
echo "n";
Echo Levenshtein ($s 1, $s 2);

As the PHP kernel developers on the front of the classic DP optimization, its optimization point is the continuous reuse of two one-dimensional array, a record of the previous results, a record this time. If you follow the parameters of PHP to assign different values to three operations, the corresponding 1 will be changed to the corresponding value in the algorithm above. The first parameter of the Min function corresponds to the modification, the second parameter corresponds to the deletion of the source sky, the third parameter corresponds to the addition.

Levenshtein Distance Description
Levenshtein distance was first invented by Russian scientist Vladimir Levenshtein in 1965, named after his name. You can't spell it, you can call it edit distance (editing distance). Levenshtein distance can be used to:
Spell checking (spell check)
Speech recognition (statement recognition)
DNA analysis (DNA profiling)
Plagiarism detection (plagiarism detection) LD stores the distance value with a matrix of MN.

http://www.bkjia.com/PHPjc/326849.html www.bkjia.com true http://www.bkjia.com/PHPjc/326849.html techarticle The instructions for using the Levenshtein () function in the manual are as follows: the Levenshtein () function returns the Levenshtein distance between two strings. Levenshtein distance, also known as editing distance, refers to the ...

  • Contact Us

    The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

    If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

    A Free Trial That Lets You Build Big!

    Start building with 50+ products and up to 12 months usage for Elastic Compute Service

    • Sales Support

      1 on 1 presale consultation

    • After-Sales Support

      24/7 Technical Support 6 Free Tickets per Quarter Faster Response

    • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.