See the title; the more detailed the answer, the better. Thanks!
levenshtein("Hello World","ello World");
The two strings differ only by the leading 'H', so a single step is enough and it returns 1, as expected.
That much is simple, but:
levenshtein("Hello World","ello World",10,20,30);
3rd parameter: the cost of inserting a character. The default is 1.
4th parameter: the cost of replacing a character. The default is 1.
5th parameter: the cost of deleting a character. The default is 1.
What do these mean? This example passes 10, 20, 30 for them respectively, and the call returns 30, which I don't understand.
What does 'cost' mean here, and what do the values 10, 20, 30 stand for?
levenshtein('aaa','aab',0,1,0);
In this example a single replacement should be enough, so why is the returned number of steps 0?
Reply content:
@Yi Red Childe has already answered very well. I would like to add, from the point of view of the underlying implementation, why the asker's example levenshtein('aaa','aab',0,1,0); returns 0.
PHP's underlying implementation uses the classic matrix algorithm (slightly modified). The characters of s1 and s2 index the rows (i in [0, m]) and columns (j in [0, n]) of a matrix, and the characters at each position are compared pairwise: if they are equal, cost = 0 (no operation is needed); otherwise cost = 1 (this 1 is the default cost of an operation when the last 3 parameters are not passed). However, the cell M[i,j] is not simply the sum of these costs, because earlier operations carry over to later positions (if you inserted 1 character earlier, every following character shifts back by one; if you deleted 2 characters earlier, the following ones shift forward). Instead, M[i,j] is the smallest of three values:
M[i,j] = min(M[i-1,j] + 1, M[i,j-1] + 1, M[i-1,j-1] + cost)
With i running over s1 and j over s2, the three candidates correspond to a deletion, an insertion, and a replacement (or a match), and taking the minimum picks the cheapest way to reach that cell. This continues until M[m,n] has been computed, which is the "edit distance" we are after. (At the end I have posted a PHP implementation that mirrors PHP's underlying C code.)
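The recurrence above can be sketched in PHP as follows. This is only an illustration: `levenshtein_matrix` is a hypothetical helper, not a PHP built-in, and it keeps the whole matrix in memory so the cells can be inspected. It also assumes the three cost parameters simply take the place of the `+1` terms and of `cost = 1`.

```php
<?php
// Hypothetical helper (not part of PHP): builds the full (m+1) x (n+1) matrix.
// Rows follow $s1, columns follow $s2; $M[$m][$n] is the edit distance.
function levenshtein_matrix(string $s1, string $s2, int $ins = 1, int $rep = 1, int $del = 1): array
{
    $m = strlen($s1);
    $n = strlen($s2);
    $M = [];
    // Border cells: a prefix of $s1 turned into "" (deletions only),
    // or "" turned into a prefix of $s2 (insertions only).
    for ($i = 0; $i <= $m; $i++) {
        $M[$i][0] = $i * $del;
    }
    for ($j = 0; $j <= $n; $j++) {
        $M[0][$j] = $j * $ins;
    }
    for ($i = 1; $i <= $m; $i++) {
        for ($j = 1; $j <= $n; $j++) {
            // 0 when the characters match, otherwise the replacement cost.
            $cost = ($s1[$i - 1] === $s2[$j - 1]) ? 0 : $rep;
            $M[$i][$j] = min(
                $M[$i - 1][$j] + $del,      // delete a character of $s1
                $M[$i][$j - 1] + $ins,      // insert a character of $s2
                $M[$i - 1][$j - 1] + $cost  // replace, or keep a match
            );
        }
    }
    return $M;
}

// Print the matrix for the example from the question, one row per line.
foreach (levenshtein_matrix('aaa', 'aab') as $row) {
    echo implode(' ', $row), "\n";
}
```

The bottom-right cell is the value that the built-in levenshtein() returns for the same arguments; the built-in only keeps two rows at a time instead of the whole matrix.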
levenshtein('aaa','aab',1,1,1);
Worked example of the calculation; cell [3,3] is the final result:
|   |   | a | a | a |
|   | 0 | 1 | 2 | 3 |
| a | 1 | 0 | 1 | 2 |
| a | 2 | 1 | 0 | 1 |
| b | 3 | 2 | 1 | 1 |
levenshtein('aaa','aab',0,1,0);
The calculation is shown below; cell [3,3] is again the final result. Because insertion and deletion both cost 0, the border cells are all 0, so every M[i,j-1] + 0 and M[i-1,j] + 0 candidate is 0, and since 0 is the minimum the function finally returns 0:
|   |   | a | a | a |
|   | 0 | 0 | 0 | 0 |
| a | 0 | 0 | 0 | 0 |
| a | 0 | 0 | 0 | 0 |
| b | 0 | 0 | 0 | 0 |
As this shows, the 3 extra parameters passed to levenshtein are the costs of insertion, replacement, and deletion, in that order; they take the place of the +1 (and of cost = 1) in the formula for M[i,j] above. The algorithm assumes that every operation costs at least one unit, so if you want a meaningful return value you should not pass 0; doing so can produce a result with no practical reference value. This follows directly from the algorithm used. Intuitively the best edit here is a single replacement (swapping the b in aab for an a), but PHP takes the minimum of the three operations at every cell and carries it forward. With insertion and deletion free, a delete followed by an insert does the same job at cost 0, so the final result is 0.
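A quick way to check this reasoning is to call the built-in with different weights. The snippet below is just a sketch; the variable names are made up for illustration, and the argument order (insertion, replacement, deletion) is the one given in the question.

```php
<?php
// levenshtein($string1, $string2, $insertion_cost, $replacement_cost, $deletion_cost)
$unit      = levenshtein('aaa', 'aab', 1, 1, 1); // one replacement at cost 1
$freeEdits = levenshtein('aaa', 'aab', 0, 1, 0); // delete 'a' + insert 'b', both free
$dearSwap  = levenshtein('aaa', 'aab', 1, 5, 1); // delete + insert (1 + 1) beats replacement (5)

echo "unit costs:            $unit\n";      // 1
echo "free insert/delete:    $freeEdits\n"; // 0
echo "expensive replacement: $dearSwap\n";  // 2
```

In other words, a replacement is only ever chosen while it is no more expensive than a deletion plus an insertion.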
# PHP implementation corresponding to PHP's underlying C code
function levenshtein_php($s1, $s2, $cost_ins = 1, $cost_rep = 1, $cost_del = 1)
{
    $l1 = strlen($s1);
    $l2 = strlen($s2);
    // If one string is empty, the answer is pure insertions or pure deletions.
    if ($l1 == 0) {
        return $l2 * $cost_ins;
    }
    if ($l2 == 0) {
        return $l1 * $cost_del;
    }
    // Only two rows of the matrix are kept: $p1 (previous) and $p2 (current).
    $p1 = array();
    $p2 = array();
    for ($i2 = 0; $i2 <= $l2; $i2++) {
        $p1[$i2] = $i2 * $cost_ins;
    }
    for ($i1 = 0; $i1 < $l1; $i1++) {
        $p2[0] = $p1[0] + $cost_del;
        for ($i2 = 0; $i2 < $l2; $i2++) {
            // Replacement (or a free match), then deletion, then insertion.
            $c0 = $p1[$i2] + (($s1[$i1] === $s2[$i2]) ? 0 : $cost_rep);
            $c1 = $p1[$i2 + 1] + $cost_del;
            if ($c1 < $c0) {
                $c0 = $c1;
            }
            $c2 = $p2[$i2] + $cost_ins;
            if ($c2 < $c0) {
                $c0 = $c2;
            }
            $p2[$i2 + 1] = $c0;
        }
        // Swap the rows for the next iteration.
        $tmp = $p1;
        $p1 = $p2;
        $p2 = $tmp;
    }
    $c0 = $p1[$l2];
    return $c0;
}
echo levenshtein_php('aaa', 'aab', 1, 1, 1);
PS: There are actually better algorithms than the full matrix approach, but I won't go into them here. My algorithm background is limited, so this is only my attempt at an analysis; corrections are welcome!
In layman's terms, it measures how similar two strings are: the fewer steps it takes to turn one string into the other, the more similar they are.
$a = "levenshtein";
$b = "levenjdslkfjslkdjfklsjdfljsdlfjsldfjlsdjflsdjltein";
$c = "leveshetin";
$r = levenshtein($a, $b); // int(40)
$s = levenshtein($a, $c); // int(3)
Turning $a into $b takes 40 edits, nearly all of them characters inserted in the middle. Turning $a into $c takes 3: delete the 'n', then replace 2 characters to turn 'te' into 'et'.
The so-called cost is the weight of a particular operation. For example, if you set the deletion cost to 30, performing 1 deletion returns 1*30. Setting these parameters lets you steer the algorithm toward some operations and away from an expensive one. As for the last example, my personal understanding is that a replacement is really a "delete" plus an "add". If you set both add and delete to 0, those two operations become free, so the replacement (cost 1) is never used and the result is 0. If you give add and delete any non-zero values instead, it returns 1. Of course, this is just my own thinking; corrections are welcome.
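To make the weights concrete, here is a small sketch using the 10, 20, 30 values from the question. The variable names are only for illustration; the point is that the result is the sum of the weights of the operations actually used, and that the direction of the comparison matters.

```php
<?php
$ins = 10; // cost of inserting a character
$rep = 20; // cost of replacing a character
$del = 30; // cost of deleting a character

// "Hello World" -> "ello World": the 'H' is deleted once, so 1 * 30.
$deleteH = levenshtein("Hello World", "ello World", $ins, $rep, $del);

// "ello World" -> "Hello World": the 'H' is inserted once, so 1 * 10.
$insertH = levenshtein("ello World", "Hello World", $ins, $rep, $del);

// "aaa" -> "aab": one replacement (20) is cheaper than delete + insert (30 + 10).
$replaceB = levenshtein("aaa", "aab", $ins, $rep, $del);

echo $deleteH, " ", $insertH, " ", $replaceB, "\n"; // 30 10 20
```

This is why the call in the question returns 30: the only operation needed is one deletion, and its weight was set to 30.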