The native similar_text () function and levenshtein () function of PHP have poor support for Chinese characters. I wrote one by myself and the test is normal. I recommend it to you. If you have any questions, please leave a message.
The native similar_text () function and levenshtein () function of PHP have poor support for Chinese characters. I wrote one by myself and the test is normal. I recommend it to you. If you have any questions, please leave a message.
Similar_text () Chinese character edition
The Code is as follows:
<? Php
// Split the string
Function split_str ($ str ){
Preg_match_all ("/./u", $ str, $ arr );
Return $ arr [0];
}
// Similarity Detection
Function similar_text_cn ($ str1, $ str2 ){
$ Arr_1 = array_unique (split_str ($ str1 ));
$ Arr_2 = array_unique (split_str ($ str2 ));
$ Similarity = count ($ arr_2)-count (array_diff ($ arr_2, $ arr_1 ));
Return $ similarity;
}
Levenshtein () Chinese character edition
The Code is as follows:
<? Php
// Split the string
Function mbStringToArray ($ string, $ encoding = 'utf-8 '){
$ ArrayResult = array ();
While ($ iLen = mb_strlen ($ string, $ encoding )){
Array_push ($ arrayResult, mb_substr ($ string, 0, 1, $ encoding ));
$ String = mb_substr ($ string, 1, $ iLen, $ encoding );
}
Return $ arrayResult;
}
// Editing distance
Function levenshtein_cn ($ str1, $ str2, $ costReplace = 1, $ encoding = 'utf-8 '){
$ Count_same_letter = 0;
$ D = array ();
$ Mb_len1 = mb_strlen ($ str1, $ encoding );
$ Mb_len2 = mb_strlen ($ str2, $ encoding );
$ Mb_str1 = mbStringToArray ($ str1, $ encoding );
$ Mb_str2 = mbStringToArray ($ str2, $ encoding );
For ($ i1 = 0; $ i1 <= $ mb_len1; $ i1 ++ ){
$ D [$ i1] = array ();
$ D [$ i1] [0] = $ i1;
}
For ($ i2 = 0; $ i2 <= $ mb_len2; $ i2 ++ ){
$ D [0] [$ i2] = $ i2;
}
For ($ i1 = 1; $ i1 <= $ mb_len1; $ i1 ++ ){
For ($ i2 = 1; $ i2 <= $ mb_len2; $ i2 ++ ){
// $ Cost = ($ str1 [$ i1-1] = $ str2 [$ i2-1])? 0: 1;
If ($ mb_str1 [$ i1-1] ===$ mb_str2 [$ i2-1]) {
$ Cost = 0;
$ Count_same_letter ++;
} Else {
$ Cost = $ costReplace; // replace
}
$ D [$ i1] [$ i2] = min ($ d [$ i1-1] [$ i2] + 1, // insert
$ D [$ i1] [$ i2-1] + 1, // Delete
$ D [$ i1-1] [$ i2-1] + $ cost );
}
}
Return $ d [$ mb_len1] [$ mb_len2];
// Return array ('distance '=> $ d [$ mb_len1] [$ mb_len2], 'count _ same_letter' => $ count_same_letter );
}
Longest Common subsequence LCS ()
The Code is as follows: