Improved PHP string-similarity functions: similar_text() and levenshtein()

Source: Internet
Author: User



similar_text() for Chinese (multibyte) text

<?php
// Split a UTF-8 string into an array of characters
function split_str($str) {
    preg_match_all("/./u", $str, $arr);
    return $arr[0];
}

// Similarity: number of distinct characters of $str2 that also occur in $str1
function similar_text_cn($str1, $str2) {
    $arr_1 = array_unique(split_str($str1));
    $arr_2 = array_unique(split_str($str2));
    $similarity = count($arr_2) - count(array_diff($arr_2, $arr_1));

    return $similarity;
}
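
Because similar_text_cn() compares sets of distinct characters, it ignores both character order and repetition. The sketch below illustrates that behaviour and adds one possible normalisation to a 0..1 score. The names chars_of(), shared_char_count() and shared_char_ratio() are hypothetical and do not appear in the original code, and the ratio formula is an assumption. The string literals assume a UTF-8 source file.

```php
<?php
// Hypothetical helper: split a UTF-8 string into characters.
function chars_of(string $s): array {
    $parts = preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);
    return $parts === false ? [] : $parts; // invalid UTF-8 yields an empty list
}

// Same idea as similar_text_cn(): distinct characters of $b that also occur in $a.
function shared_char_count(string $a, string $b): int {
    $setA = array_unique(chars_of($a));
    $setB = array_unique(chars_of($b));
    return count(array_intersect($setB, $setA));
}

// Assumed normalisation: shared characters divided by the larger distinct set.
function shared_char_ratio(string $a, string $b): float {
    $larger = max(count(array_unique(chars_of($a))), count(array_unique(chars_of($b))));
    return $larger === 0 ? 0.0 : shared_char_count($a, $b) / $larger;
}

echo shared_char_count('中国人民', '中国银行'), "\n"; // 2 (中 and 国)
echo shared_char_ratio('中国人民', '中国银行'), "\n"; // 0.5
echo shared_char_count('abc', 'cba'), "\n";           // 3, order is ignored
```

A reversed string scores the same as an identical one, so this measure is best suited to rough filtering rather than ranking.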

levenshtein() for Chinese (multibyte) text

<?php
// Split a multibyte string into an array of characters
function mbStringToArray($string, $encoding = 'UTF-8') {
    $arrayResult = array();
    while ($iLen = mb_strlen($string, $encoding)) {
        array_push($arrayResult, mb_substr($string, 0, 1, $encoding));
        $string = mb_substr($string, 1, $iLen, $encoding);
    }
    return $arrayResult;
}

// Edit distance, computed per character rather than per byte
function levenshtein_cn($str1, $str2, $costReplace = 1, $encoding = 'UTF-8') {
    $count_same_letter = 0; // counts every matching (i1, i2) pair visited
    $d = array();
    $mb_len1 = mb_strlen($str1, $encoding);
    $mb_len2 = mb_strlen($str2, $encoding);
    $mb_str1 = mbStringToArray($str1, $encoding);
    $mb_str2 = mbStringToArray($str2, $encoding);
    // First column: delete the first $i1 characters of $str1
    for ($i1 = 0; $i1 <= $mb_len1; $i1++) {
        $d[$i1] = array();
        $d[$i1][0] = $i1;
    }
    // First row: insert the first $i2 characters of $str2
    for ($i2 = 0; $i2 <= $mb_len2; $i2++) {
        $d[0][$i2] = $i2;
    }
    for ($i1 = 1; $i1 <= $mb_len1; $i1++) {
        for ($i2 = 1; $i2 <= $mb_len2; $i2++) {
            // Compare whole characters, not bytes of the raw strings
            if ($mb_str1[$i1 - 1] === $mb_str2[$i2 - 1]) {
                $cost = 0;
                $count_same_letter++;
            } else {
                $cost = $costReplace; // replace
            }
            $d[$i1][$i2] = min($d[$i1 - 1][$i2] + 1,  // delete
                               $d[$i1][$i2 - 1] + 1,  // insert
                               $d[$i1 - 1][$i2 - 1] + $cost);
        }
    }
    return $d[$mb_len1][$mb_len2];
    // Alternative return value that also exposes the match counter:
    // return array('distance' => $d[$mb_len1][$mb_len2], 'count_same_letter' => $count_same_letter);
}
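
To see why a character-level version is needed, the sketch below compares the built-in levenshtein(), which works on bytes, with a compact character-level variant. This is an illustration under stated assumptions rather than the code above: mb_levenshtein_sketch() is a hypothetical name, it uses unit costs only, keeps just two rows of the table, and relies on mb_str_split(), which requires PHP 7.4 or later.

```php
<?php
// Sketch only: character-level edit distance with unit costs.
// Assumes PHP 7.4+ (mb_str_split) and UTF-8 input.
function mb_levenshtein_sketch(string $a, string $b): int {
    $s = mb_str_split($a, 1, 'UTF-8');
    $t = mb_str_split($b, 1, 'UTF-8');
    $n = count($t);
    $prev = range(0, $n); // row for the empty prefix of $a
    foreach ($s as $i => $ch) {
        $curr = [$i + 1]; // deleting the first $i + 1 characters of $a
        for ($j = 1; $j <= $n; $j++) {
            $cost = ($ch === $t[$j - 1]) ? 0 : 1;
            $curr[$j] = min(
                $prev[$j] + 1,         // delete $ch
                $curr[$j - 1] + 1,     // insert $t[$j - 1]
                $prev[$j - 1] + $cost  // replace, or keep when equal
            );
        }
        $prev = $curr;
    }
    return $prev[$n];
}

$a = "\u{4E2D}\u{56FD}"; // 中国
$b = "\u{4E2D}\u{6587}"; // 中文
echo levenshtein($a, $b), "\n";           // 3: all three UTF-8 bytes of the second character differ
echo mb_levenshtein_sketch($a, $b), "\n"; // 1: one character replaced
```

The two-row layout trades the full table for lower memory use; keep the full table if the edit path itself is needed.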


Longest common subsequence: LCS()


<?php
// Longest common subsequence, byte-wise (English / single-byte text)
function lcs_en($str_1, $str_2) {
    $len_1 = strlen($str_1);
    $len_2 = strlen($str_2);
    $len = $len_1 > $len_2 ? $len_1 : $len_2;
    $dp = array();
    // Initialise row 0 and column 0 to 0
    for ($i = 0; $i <= $len; $i++) {
        $dp[$i] = array();
        $dp[$i][0] = 0;
        $dp[0][$i] = 0;
    }
    for ($i = 1; $i <= $len_1; $i++) {
        for ($j = 1; $j <= $len_2; $j++) {
            if ($str_1[$i - 1] == $str_2[$j - 1]) {
                $dp[$i][$j] = $dp[$i - 1][$j - 1] + 1;
            } else {
                $dp[$i][$j] = $dp[$i - 1][$j] > $dp[$i][$j - 1] ? $dp[$i - 1][$j] : $dp[$i][$j - 1];
            }
        }
    }
    return $dp[$len_1][$len_2];
}

// Split a multibyte string into an array of characters
function mbStringToArray($string, $encoding = 'UTF-8') {
    $arrayResult = array();
    while ($iLen = mb_strlen($string, $encoding)) {
        array_push($arrayResult, mb_substr($string, 0, 1, $encoding));
        $string = mb_substr($string, 1, $iLen, $encoding);
    }
    return $arrayResult;
}

// Longest common subsequence, character-wise (Chinese / multibyte text)
function lcs_cn($str1, $str2, $encoding = 'UTF-8') {
    $mb_len1 = mb_strlen($str1, $encoding);
    $mb_len2 = mb_strlen($str2, $encoding);
    $mb_str1 = mbStringToArray($str1, $encoding);
    $mb_str2 = mbStringToArray($str2, $encoding);
    $len = $mb_len1 > $mb_len2 ? $mb_len1 : $mb_len2;
    $dp = array();
    // Initialise row 0 and column 0 to 0
    for ($i = 0; $i <= $len; $i++) {
        $dp[$i] = array();
        $dp[$i][0] = 0;
        $dp[0][$i] = 0;
    }
    for ($i = 1; $i <= $mb_len1; $i++) {
        for ($j = 1; $j <= $mb_len2; $j++) {
            if ($mb_str1[$i - 1] == $mb_str2[$j - 1]) {
                $dp[$i][$j] = $dp[$i - 1][$j - 1] + 1;
            } else {
                $dp[$i][$j] = $dp[$i - 1][$j] > $dp[$i][$j - 1] ? $dp[$i - 1][$j] : $dp[$i][$j - 1];
            }
        }
    }
    return $dp[$mb_len1][$mb_len2];
}
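
The LCS functions return only a length, so a caller usually has to normalise it before treating it as a similarity score. The sketch below shows one way to do that. It is an illustration, not part of the original code: lcs_length_sketch() and lcs_similarity_sketch() are hypothetical names, the 2 * LCS / (len1 + len2) formula is an assumed choice, and mb_str_split() requires PHP 7.4 or later.

```php
<?php
// Sketch only: LCS length over characters, keeping two rows of the table.
// Assumes PHP 7.4+ (mb_str_split) and UTF-8 input.
function lcs_length_sketch(string $a, string $b): int {
    $s = mb_str_split($a, 1, 'UTF-8');
    $t = mb_str_split($b, 1, 'UTF-8');
    $n = count($t);
    $prev = array_fill(0, $n + 1, 0); // row for the empty prefix of $a
    foreach ($s as $ch) {
        $curr = [0];
        for ($j = 1; $j <= $n; $j++) {
            $curr[$j] = ($ch === $t[$j - 1])
                ? $prev[$j - 1] + 1               // characters match: extend the LCS
                : max($prev[$j], $curr[$j - 1]);  // otherwise carry the better neighbour
        }
        $prev = $curr;
    }
    return $prev[$n];
}

// Assumed normalisation: 2 * LCS / (len1 + len2), which lies in [0, 1].
function lcs_similarity_sketch(string $a, string $b): float {
    $total = mb_strlen($a, 'UTF-8') + mb_strlen($b, 'UTF-8');
    return $total === 0 ? 1.0 : 2 * lcs_length_sketch($a, $b) / $total;
}

echo lcs_length_sketch('ABCBDAB', 'BDCABA'), "\n";            // 4
echo lcs_length_sketch('我爱北京天安门', '我爱天安门'), "\n"; // 5
echo round(lcs_similarity_sketch('我爱北京天安门', '我爱天安门'), 2), "\n"; // 0.83
```

Unlike the distinct-character count, this score respects character order, because a subsequence must keep the original ordering.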


(100 points) [PHP] List some string-handling functions you are familiar with

addcslashes
addslashes
bin2hex
chop
chr
chunk_split
convert_cyr_string
convert_uudecode
convert_uuencode
count_chars
crc32
crypt
echo
explode
fprintf
get_html_translation_table
hebrev
hebrevc
hex2bin - Decodes a hexadecimally encoded binary string
html_entity_decode - Convert all HTML entities to their applicable characters
htmlentities - Convert all applicable characters to HTML entities
htmlspecialchars_decode - Convert special HTML entities back to characters
htmlspecialchars - Convert special characters to HTML entities
implode - Join array elements with a string
join
lcfirst - Make a string's first character lowercase
levenshtein - Calculate Levenshtein distance between two strings
localeconv - Get numeric formatting information
ltrim - Strip whitespace (or other characters) from the beginning of a string
md5_file
metaphone - Calculate the metaphone key of a string
money_format - Formats a number as a currency string
nl_langinfo - Query language and locale information
nl2br
number_format - Format a number with grouped thousands
ord
parse_str
print
printf
quoted_printable_decode - Convert a quoted-printable string to an 8 bit string
quoted_printable_encode - Convert an 8 bit string to a quoted-printable string
quotemeta - Quote meta characters
rtrim
setlocale - Set locale information
sha1_file
sha1
soundex - Calculate the soundex key of a string
sprintf - Return a formatted string
sscanf - Parses input from a string according to a ...

Can someone explain PHP's levenshtein() function in plain terms? I can't understand the manual.

W3School's explanation:
The levenshtein() function returns the Levenshtein distance between two strings.
Levenshtein distance, also known as edit distance, is the minimum number of edit operations needed to convert one string into another. The permitted operations are replacing one character with another, inserting a character, and deleting a character.
For example, converting kitten to sitting takes three edits:
sitten (k→s)
sittin (e→i)
sitting (→g)
By default, levenshtein() gives every operation (replace, insert, and delete) the same weight. You can, however, define the cost of each operation with the optional insert, replace, and delete parameters.
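
The kitten example can be checked directly with the built-in function. The snippet below is a small illustration using default (equal) weights; the $steps array is only a hypothetical way of walking through the three intermediate strings.

```php
<?php
// With default weights, three edits turn "kitten" into "sitting".
echo levenshtein('kitten', 'sitting'), "\n"; // 3

// Each intermediate step is exactly one edit away from the previous one.
$steps = ['kitten', 'sitten', 'sittin', 'sitting'];
for ($i = 1; $i < count($steps); $i++) {
    echo $steps[$i - 1], ' -> ', $steps[$i], ': ',
         levenshtein($steps[$i - 1], $steps[$i]), "\n"; // prints 1 for every step
}
```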

Note: the "cost" is the weight of an operation. In the original poster's example, converting "Hello World" to "ello World" requires deleting the "H". The deletion cost is the fifth parameter, which was set to 30, so the function returns 30.
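
A short sketch of the weighted call follows. Only the deletion cost of 30 comes from the text above; the insertion cost of 10 and the replacement cost of 20 are assumed values chosen for illustration. The parameter order after the two strings is insertion, replacement, deletion.

```php
<?php
// Assumed costs: only $delete = 30 is taken from the example above.
$insert  = 10;
$replace = 20;
$delete  = 30;

// Default weights: removing "H" is a single edit.
echo levenshtein('Hello World', 'ello World'), "\n"; // 1

// Weighted: the one required deletion costs 30.
echo levenshtein('Hello World', 'ello World', $insert, $replace, $delete), "\n"; // 30

// Reversed direction: the "H" must be inserted instead, which costs 10.
echo levenshtein('ello World', 'Hello World', $insert, $replace, $delete), "\n"; // 10
```

With unequal weights the distance is no longer symmetric, so the order of the two arguments matters.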
