PHP's native similar_text() and levenshtein() functions handle Chinese characters poorly, because they compare strings byte by byte rather than character by character. I wrote my own versions, and they behave correctly in my tests. I recommend them to you; if you have any questions, please leave a message.
similar_text() for Chinese characters
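To make the problem concrete, here is a minimal sketch of what the native functions report for two short UTF-8 strings. It assumes the script file is saved as UTF-8 and that the mbstring extension is loaded; the sample strings are only illustrative.

```php
<?php
// Each of these Chinese characters occupies 3 bytes in UTF-8.
$a = "中国";
$b = "中文";

echo strlen($a), "\n";              // 6: length in bytes
echo mb_strlen($a, 'UTF-8'), "\n";  // 2: length in characters
echo levenshtein($a, $b), "\n";     // 3: all three bytes of the last character differ
echo similar_text($a, $b), "\n";    // 3: the three matching bytes of "中"
```

A reader would expect a distance of 1 and one character in common, since the two strings differ only in their second character.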
The code is as follows:
<?php
// Split a UTF-8 string into an array of characters
function split_str($str) {
    preg_match_all("/./u", $str, $arr);
    return $arr[0];
}
// Similarity detection: returns the number of distinct characters
// of $str2 that also appear in $str1
function similar_text_cn($str1, $str2) {
    $arr_1 = array_unique(split_str($str1));
    $arr_2 = array_unique(split_str($str2));
    $similarity = count($arr_2) - count(array_diff($arr_2, $arr_1));
    return $similarity;
}
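The function above counts shared distinct characters, so repeated characters and character order do not affect the result. The following sketch illustrates that behavior with an equivalent formulation based on array_intersect(). The name shared_char_count is hypothetical and not part of the original code, and preg_split() with the u modifier is used here as an assumed alternative way to split the string.

```php
<?php
// Hypothetical helper (not from the original post): counts the distinct
// characters of $b that also occur in $a.
function shared_char_count(string $a, string $b): int
{
    // preg_split() returns false on invalid UTF-8; treat that as "no characters".
    $charsA = array_unique(preg_split('//u', $a, -1, PREG_SPLIT_NO_EMPTY) ?: []);
    $charsB = array_unique(preg_split('//u', $b, -1, PREG_SPLIT_NO_EMPTY) ?: []);
    return count(array_intersect($charsB, $charsA));
}

echo shared_char_count("我爱北京天安门", "我爱北京"), "\n"; // 4
echo shared_char_count("你好你好", "你好"), "\n";           // 2: repeats are ignored
echo shared_char_count("abc", "cba"), "\n";                 // 3: order is ignored
```

Because order is ignored, this measure is closer to a set overlap than to what the native similar_text() computes, which is worth keeping in mind when comparing scores.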
levenshtein() for Chinese characters
The code is as follows:
<?php
// Split the string into an array of characters
function mbStringToArray($string, $encoding = 'utf-8') {
    $arrayResult = array();
    while ($iLen = mb_strlen($string, $encoding)) {
        array_push($arrayResult, mb_substr($string, 0, 1, $encoding));
        $string = mb_substr($string, 1, $iLen, $encoding);
    }
    return $arrayResult;
}
// Edit distance
function levenshtein_cn($str1, $str2, $costReplace = 1, $encoding = 'utf-8') {
    $count_same_letter = 0;
    $d = array();
    $mb_len1 = mb_strlen($str1, $encoding);
    $mb_len2 = mb_strlen($str2, $encoding);
    $mb_str1 = mbStringToArray($str1, $encoding);
    $mb_str2 = mbStringToArray($str2, $encoding);
    for ($i1 = 0; $i1 <= $mb_len1; $i1++) {
        $d[$i1] = array();
        $d[$i1][0] = $i1;
    }
    for ($i2 = 0; $i2 <= $mb_len2; $i2++) {
        $d[0][$i2] = $i2;
    }
    for ($i1 = 1; $i1 <= $mb_len1; $i1++) {
        for ($i2 = 1; $i2 <= $mb_len2; $i2++) {
            // $cost = ($str1[$i1 - 1] == $str2[$i2 - 1]) ? 0 : 1;
            if ($mb_str1[$i1 - 1] === $mb_str2[$i2 - 1]) {
                $cost = 0;
                $count_same_letter++;
            } else {
                $cost = $costReplace; // replace
            }
            $d[$i1][$i2] = min($d[$i1 - 1][$i2] + 1, // insert
                $d[$i1][$i2 - 1] + 1, // delete
                $d[$i1 - 1][$i2 - 1] + $cost);
        }
    }
    return $d[$mb_len1][$mb_len2];
    // return array('distance' => $d[$mb_len1][$mb_len2], 'count_same_letter' => $count_same_letter);
}
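When only the distance is needed, the full matrix does not have to be kept, because each row depends only on the row before it. The sketch below is one possible variant under that assumption, not the author's code: the names utf8_chars and levenshtein_two_rows are hypothetical, the replacement cost is fixed at 1, and preg_split() with the u modifier stands in for the mb_substr() loop.

```php
<?php
// Hypothetical helper: split a UTF-8 string into an array of characters.
// preg_split() returns false on invalid UTF-8; that case is treated as empty.
function utf8_chars(string $s): array
{
    return preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY) ?: [];
}

// Character-level edit distance that keeps only two rows of the table.
function levenshtein_two_rows(string $a, string $b): int
{
    $x = utf8_chars($a);
    $y = utf8_chars($b);
    $n = count($y);
    // Row 0: turning an empty prefix of $a into the first j characters of $b.
    $prev = range(0, $n);
    foreach ($x as $i => $cx) {
        // Column 0: removing the first $i + 1 characters of $a.
        $curr = [$i + 1];
        for ($j = 1; $j <= $n; $j++) {
            $cost = ($cx === $y[$j - 1]) ? 0 : 1;
            $curr[$j] = min(
                $prev[$j] + 1,        // drop a character of $a
                $curr[$j - 1] + 1,    // add a character of $b
                $prev[$j - 1] + $cost // keep or replace
            );
        }
        $prev = $curr;
    }
    return $prev[$n];
}

echo levenshtein_two_rows("中国", "中文"), "\n";             // 1
echo levenshtein_two_rows("我爱北京", "我爱北京天安门"), "\n"; // 3
echo levenshtein_two_rows("kitten", "sitting"), "\n";        // 3
```

Memory use drops from one cell per character pair to two rows of length count($y) + 1, while the running time stays proportional to the product of the two lengths.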
Longest common subsequence: LCS()
The code is as follows:
<?php
// Longest common subsequence for single-byte (English) strings
function LCS_en($str_1, $str_2) {
    $len_1 = strlen($str_1);
    $len_2 = strlen($str_2);
    $len = $len_1 > $len_2 ? $len_1 : $len_2;
    $dp = array();
    // Initialize the first row and first column to 0
    for ($i = 0; $i <= $len; $i++) {
        $dp[$i] = array();
        $dp[$i][0] = 0;
        $dp[0][$i] = 0;
    }
    for ($i = 1; $i <= $len_1; $i++) {
        for ($j = 1; $j <= $len_2; $j++) {
            if ($str_1[$i - 1] === $str_2[$j - 1]) {
                $dp[$i][$j] = $dp[$i - 1][$j - 1] + 1;
            } else {
                $dp[$i][$j] = $dp[$i - 1][$j] > $dp[$i][$j - 1] ? $dp[$i - 1][$j] : $dp[$i][$j - 1];
            }
        }
    }
    return $dp[$len_1][$len_2];
}
// Split the string into an array of characters
function mbStringToArray($string, $encoding = 'utf-8') {
    $arrayResult = array();
    while ($iLen = mb_strlen($string, $encoding)) {
        array_push($arrayResult, mb_substr($string, 0, 1, $encoding));
        $string = mb_substr($string, 1, $iLen, $encoding);
    }
    return $arrayResult;
}
// Longest common subsequence for multibyte (Chinese) strings
function LCS_cn($str1, $str2, $encoding = 'utf-8') {
    $mb_len1 = mb_strlen($str1, $encoding);
    $mb_len2 = mb_strlen($str2, $encoding);
    $mb_str1 = mbStringToArray($str1, $encoding);
    $mb_str2 = mbStringToArray($str2, $encoding);
    $len = $mb_len1 > $mb_len2 ? $mb_len1 : $mb_len2;
    $dp = array();
    // Initialize the first row and first column to 0
    for ($i = 0; $i <= $len; $i++) {
        $dp[$i] = array();
        $dp[$i][0] = 0;
        $dp[0][$i] = 0;
    }
    for ($i = 1; $i <= $mb_len1; $i++) {
        for ($j = 1; $j <= $mb_len2; $j++) {
            if ($mb_str1[$i - 1] === $mb_str2[$j - 1]) {
                $dp[$i][$j] = $dp[$i - 1][$j - 1] + 1;
            } else {
                $dp[$i][$j] = $dp[$i - 1][$j] > $dp[$i][$j - 1] ? $dp[$i - 1][$j] : $dp[$i][$j - 1];
            }
        }
    }
    return $dp[$mb_len1][$mb_len2];
}
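The functions above return only the length of the longest common subsequence. If the subsequence itself is wanted, the same table can be walked backwards from the bottom-right corner. The following is a rough sketch of that idea, not part of the original code: the name lcs_string_sketch is hypothetical, and preg_split() with the u modifier is assumed for splitting.

```php
<?php
// Hypothetical function: returns one longest common subsequence of two
// UTF-8 strings (when several exist, which one is returned depends on
// the tie-breaking in the backward walk).
function lcs_string_sketch(string $a, string $b): string
{
    // preg_split() returns false on invalid UTF-8; treat that as empty.
    $x = preg_split('//u', $a, -1, PREG_SPLIT_NO_EMPTY) ?: [];
    $y = preg_split('//u', $b, -1, PREG_SPLIT_NO_EMPTY) ?: [];
    $m = count($x);
    $n = count($y);

    // $dp[$i][$j] = LCS length of the first $i chars of $x and first $j of $y.
    $dp = array_fill(0, $m + 1, array_fill(0, $n + 1, 0));
    for ($i = 1; $i <= $m; $i++) {
        for ($j = 1; $j <= $n; $j++) {
            $dp[$i][$j] = ($x[$i - 1] === $y[$j - 1])
                ? $dp[$i - 1][$j - 1] + 1
                : max($dp[$i - 1][$j], $dp[$i][$j - 1]);
        }
    }

    // Walk back from the bottom-right cell, collecting matched characters.
    $out = [];
    $i = $m;
    $j = $n;
    while ($i > 0 && $j > 0) {
        if ($x[$i - 1] === $y[$j - 1]) {
            array_unshift($out, $x[$i - 1]);
            $i--;
            $j--;
        } elseif ($dp[$i - 1][$j] >= $dp[$i][$j - 1]) {
            $i--;
        } else {
            $j--;
        }
    }
    return implode('', $out);
}

echo lcs_string_sketch("我爱北京天安门", "我在北京看天安门"), "\n"; // 我北京天安门
```

The length of the returned string, measured with mb_strlen(), should match the value the length-only functions report for the same inputs.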