About PHP Similarity computing function: Introduction to the use of Levenshtein _php example

Source: Internet
Author: User
Tags first string memory usage strlen first row

Instructions for use
Look at the description of the Levenshtein () function in the manual:

The Levenshtein () function returns the Levenshtein distance between two strings.

Levenshtein distance, also called edit distance, refers to the minimum number of edits required between two strings and one converted to another. A licensed editing operation involves replacing one character with another, inserting a character, and deleting a character.

For example, convert kitten to sitting:

Sitten (K→s)
Sittin (E→i)
The sitting (→g) Levenshtein () function gives the same weight to each operation (replace, insert, and delete). However, you can define the cost of each operation by setting an optional insert, replace, delete parameter.


Levenshtein (String1,string2,insert,replace,delete)

Parameter description

String1 required. The first string to compare.
string2 required. The second string to compare.
Insert Optional. The cost of inserting a character. The default is 1.
Replace is optional. The cost of replacing one character. The default is 1.
Delete is optional. The cost of deleting a character. The default is 1.
Tips and comments

If one of the strings exceeds 255 characters, the Levenshtein () function returns-1.
The Levenshtein () function is insensitive to capitalization.
The Levenshtein () function is faster than the Similar_text () function. However, the Similar_text () function provides more precise results that require less modification.

Copy Code code as follows:

Echo Levenshtein ("Hello World", "Ello World");
echo "<br/>";
Echo Levenshtein ("Hello World", "Ello World", 10,20,30);

Output: 1 30

SOURCE Analysis
Levenshtein () is a standard function, and in the/ext/standard/directory there are files specifically for this function: LEVENSHTEIN.C.

The Levenshtein () selects the implementation based on the number of parameters, and the Reference_levdist () function is invoked to calculate the distance for the parameter is 2 and the parameter is 5. The difference is in the following three parameters, the parameter is 2 o'clock, using the default value of 1.

And in the implementation of the source code we found a case not stated in the document: the Levenshtein () function can also pass three parameters, it will eventually call the Custom_levdist () function. It takes the third parameter as the implementation of the custom function, and the invocation example is as follows:

Copy Code code as follows:

Echo Levenshtein ("Hello World", "Ello World", ' strsub ');

The executive will report warning:the General Levenshtein support are not there yet. This is because this method has not yet been achieved, just put a hole in there.

The implementation of the Reference_levdist () function is a classical DP problem.

Given two strings x and Y, the minimum number of modifications turns x to Y. The modified rule can only be one of the following three types: Delete, insert, change.
With A[i][j] represents the minimum number of operations required to change the first I character of X to the first J character of Y, the state transition equation is:

Copy Code code as follows:

When X[i]==y[j]: a[i][j] = min (a[i-1][j-1), a[i-1][j]+1, a[i][j-1]+1);
When X[i]!=y[j]: a[i][j] = min (a[i-1][j-1], a[i-1][j], a[i][j-1]) +1;

Before using the state transition equation, we need to initialize the matrix D of (n+1) (m+1) and let the values of the first row and column grow from 0. Scans two strings (nm), contrasts the characters, uses the state transition equation, and eventually $a[$l 1][$l 2] as its result.

The simple implementation process is as follows:

Copy Code code as follows:

? Php
$s 1 = "ABCDD";
$l 1 = strlen ($s 1);
$s 2 = "AABBD";
$l 2 = strlen ($s 2);

for ($i = 0; $i < $l 1; $i + +) {
$a [0][$i + 1] = $i + 1;
for ($i = 0; $i < $l 2; $i + +) {
$a [$i + 1][0] = $i + 1;

for ($i = 0; $i < $l 2; $i + +) {
for ($j = 0; $j < $l 1; $j + +) {
if ($s 2[$i] = = $s 1[$j]) {
$a [$i + 1][$j + 1] = min ($a [$i] [$j], $a [$i] [$j + 1] + 1, $a [$i + 1][$j] + 1);
$a [$i + 1][$j + 1] = min ($a [$i] [$j], $a [$i] [$j + 1], $a [$i + 1][$j]) + 1;

echo $a [$l 1][$l 2];
echo "n";
Echo Levenshtein ($s 1, $s 2);

In the PHP implementation, the implementation is clearly indicated in the annotation: This function only optimizes the memory usage, but does not consider the speed, from its realization algorithm looks, the time complexity is O (MXN). The optimal point is to transform the two-dimensional array in the state transition equation above into two arrays. The simple implementation is as follows:
Copy Code code as follows:

? Php
$l 1 = strlen ($s 1);
$s 2 = "aab84093840932bd";
$l 2 = strlen ($s 2);

$dis = 0;
for ($i = 0; $i <= $l 2; $i + +) {
$p 1[$i] = $i;

for ($i = 0; $i < $l 1; $i + +) {
$p 2[0] = $p 1[0] + 1;

for ($j = 0; $j < $l 2; $j + +) {
if ($s 1[$i] = = $s 2[$j]) {
$dis = min ($p 1[$j], $p 1[$j + 1] + 1, $p 2[$j] + 1);
$dis = min ($p 1[$j] + 1, $p 1[$j + 1] + 1, $p 2[$j] + 1); Note that the last parameter here is $P2.
$p 2[$j + 1] = $dis;
$tmp = $p 1;
$p 1 = $p 2;
$p 2 = $tmp;

echo "n";
echo $p 1[$l 2];
echo "n";
Echo Levenshtein ($s 1, $s 2);

As the PHP kernel developer for the previous classic DP optimization, its optimization point is the continuous reuse of two one-dimensional array, a record of the last results, a record this time the results. If according to PHP parameters, three operations to assign different values, in the above algorithm will be the corresponding 1 to the operation of the corresponding value. The first parameter of the Min function corresponds to the modification, the second parameter corresponds to the deletion of the source sky, and the third parameter corresponds to the addition.

Levenshtein Distance Description
Levenshtein distance was first invented by Vladimir Levenshtein, a Russian scientist, and named after him in 1965. No spelling, you can call it edit distance (editing distance). Levenshtein distance can be used to:
Spell checking (spell check)
Speech recognition (statement recognition)
DNA analyses (DNA analysis)
Plagiarism detection (plagiarism detection) LD uses the MN matrix to store the distance value.

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.