Improved versions of PHP's similar_text() and levenshtein() string-similarity functions for Chinese text, plus a longest common subsequence function.
similar_text() for Chinese characters
The code is as follows:
<?php
// Split a UTF-8 string into an array of characters
function split_str($str) {
    preg_match_all("/./u", $str, $arr);
    return $arr[0];
}
// Similarity detection: count the distinct characters of $str2 that also occur in $str1
function similar_text_cn($str1, $str2) {
    $arr_1 = array_unique(split_str($str1));
    $arr_2 = array_unique(split_str($str2));
    $similarity = count($arr_2) - count(array_diff($arr_2, $arr_1));
    return $similarity;
}
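To make clear what this function measures, here is a minimal, self-contained sketch. The helper names demo_chars and demo_shared_chars are hypothetical and only mirror the logic above; array_intersect is used as an equivalent way to express the same count. Note that the result is a set overlap, so character order is ignored.

```php
<?php
// Hypothetical helper: split a UTF-8 string into characters (same regex as above).
function demo_chars(string $s): array {
    preg_match_all('/./u', $s, $m);
    return $m[0];
}

// Hypothetical helper: count the distinct characters of $b that also occur in $a.
function demo_shared_chars(string $a, string $b): int {
    $setA = array_unique(demo_chars($a));
    $setB = array_unique(demo_chars($b));
    return count(array_intersect($setB, $setA));
}

echo strlen('你好'), "\n";                                // 6 (bytes in UTF-8)
echo count(demo_chars('你好')), "\n";                     // 2 (characters)
echo demo_shared_chars('你好世界', '世界你好吗'), "\n";   // 4
echo demo_shared_chars('你好世界', '界世好你'), "\n";     // 4 (order is ignored)
```

The first two lines show why a byte-oriented function is a poor fit for Chinese text: each of these characters occupies three bytes in UTF-8.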
levenshtein() for Chinese characters
The code is as follows:
<?php
// Split a multibyte string into an array of characters
function mbStringToArray($string, $encoding = 'utf-8') {
    $arrayResult = array();
    while ($iLen = mb_strlen($string, $encoding)) {
        array_push($arrayResult, mb_substr($string, 0, 1, $encoding));
        $string = mb_substr($string, 1, $iLen, $encoding);
    }
    return $arrayResult;
}
// Edit distance
function levenshtein_cn($str1, $str2, $costReplace = 1, $encoding = 'utf-8') {
    $count_same_letter = 0;
    $d = array();
    $mb_len1 = mb_strlen($str1, $encoding);
    $mb_len2 = mb_strlen($str2, $encoding);
    $mb_str1 = mbStringToArray($str1, $encoding);
    $mb_str2 = mbStringToArray($str2, $encoding);
    for ($i1 = 0; $i1 <= $mb_len1; $i1++) {
        $d[$i1] = array();
        $d[$i1][0] = $i1;
    }
    for ($i2 = 0; $i2 <= $mb_len2; $i2++) {
        $d[0][$i2] = $i2;
    }
    for ($i1 = 1; $i1 <= $mb_len1; $i1++) {
        for ($i2 = 1; $i2 <= $mb_len2; $i2++) {
            // $cost = ($str1[$i1 - 1] == $str2[$i2 - 1]) ? 0 : 1;
            if ($mb_str1[$i1 - 1] === $mb_str2[$i2 - 1]) {
                $cost = 0;
                $count_same_letter++;
            } else {
                $cost = $costReplace; // replace
            }
            $d[$i1][$i2] = min($d[$i1 - 1][$i2] + 1, // delete
                $d[$i1][$i2 - 1] + 1, // insert
                $d[$i1 - 1][$i2 - 1] + $cost); // match or replace
        }
    }
    return $d[$mb_len1][$mb_len2];
    // return array('distance' => $d[$mb_len1][$mb_len2], 'count_same_letter' => $count_same_letter);
}
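The difference from the built-in levenshtein(), which compares bytes, can be seen in a short sketch. The function name demo_levenshtein_mb is hypothetical, and the sketch assumes PHP 7.4 or later with the mbstring extension (for mb_str_split). It is a compact variant that keeps only two rows of the table, not the code above.

```php
<?php
// Hypothetical compact variant: character-level edit distance using two rows.
// Assumes PHP 7.4+ and the mbstring extension for mb_str_split().
function demo_levenshtein_mb(string $a, string $b, int $costReplace = 1): int {
    $s = mb_str_split($a, 1, 'UTF-8');
    $t = mb_str_split($b, 1, 'UTF-8');
    $n = count($t);
    // $prev[$j]: distance between the characters of $a processed so far
    // and the first $j characters of $b.
    $prev = range(0, $n);
    foreach ($s as $i => $ch) {
        $curr = [$i + 1];
        for ($j = 1; $j <= $n; $j++) {
            $cost = ($ch === $t[$j - 1]) ? 0 : $costReplace;
            $curr[$j] = min(
                $prev[$j] + 1,         // delete $ch from $a
                $curr[$j - 1] + 1,     // insert $t[$j - 1]
                $prev[$j - 1] + $cost  // match or replace
            );
        }
        $prev = $curr;
    }
    return $prev[$n];
}

echo levenshtein('你好', '你们'), "\n";          // 3 (three differing bytes)
echo demo_levenshtein_mb('你好', '你们'), "\n";  // 1 (one differing character)
echo demo_levenshtein_mb('kitten', 'sitting'), "\n"; // 3
```

For ASCII input the two functions agree; they diverge only when a character spans several bytes.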
Longest common subsequence: LCS()
The code is as follows:
<?php
// Longest common subsequence, English (single-byte) version
function LCS_en($str_1, $str_2) {
    $len_1 = strlen($str_1);
    $len_2 = strlen($str_2);
    $len = max($len_1, $len_2);
    $dp = array();
    // Initialize the first row and first column to 0
    for ($i = 0; $i <= $len; $i++) {
        $dp[$i] = array();
        $dp[$i][0] = 0;
        $dp[0][$i] = 0;
    }
    for ($i = 1; $i <= $len_1; $i++) {
        for ($j = 1; $j <= $len_2; $j++) {
            if ($str_1[$i - 1] == $str_2[$j - 1]) {
                $dp[$i][$j] = $dp[$i - 1][$j - 1] + 1;
            } else {
                $dp[$i][$j] = max($dp[$i - 1][$j], $dp[$i][$j - 1]);
            }
        }
    }
    return $dp[$len_1][$len_2];
}
// Split a multibyte string into an array of characters
function mbStringToArray($string, $encoding = 'utf-8') {
    $arrayResult = array();
    while ($iLen = mb_strlen($string, $encoding)) {
        array_push($arrayResult, mb_substr($string, 0, 1, $encoding));
        $string = mb_substr($string, 1, $iLen, $encoding);
    }
    return $arrayResult;
}
// Longest common subsequence, Chinese (multibyte) version
function LCS_cn($str1, $str2, $encoding = 'utf-8') {
    $mb_len1 = mb_strlen($str1, $encoding);
    $mb_len2 = mb_strlen($str2, $encoding);
    $mb_str1 = mbStringToArray($str1, $encoding);
    $mb_str2 = mbStringToArray($str2, $encoding);
    $len = max($mb_len1, $mb_len2);
    $dp = array();
    // Initialize the first row and first column to 0
    for ($i = 0; $i <= $len; $i++) {
        $dp[$i] = array();
        $dp[$i][0] = 0;
        $dp[0][$i] = 0;
    }
    for ($i = 1; $i <= $mb_len1; $i++) {
        for ($j = 1; $j <= $mb_len2; $j++) {
            if ($mb_str1[$i - 1] == $mb_str2[$j - 1]) {
                $dp[$i][$j] = $dp[$i - 1][$j - 1] + 1;
            } else {
                $dp[$i][$j] = max($dp[$i - 1][$j], $dp[$i][$j - 1]);
            }
        }
    }
    return $dp[$mb_len1][$mb_len2];
}
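As a quick check of what the LCS length means, the following sketch computes it for one Chinese pair and one classic English pair. The name demo_lcs_mb is hypothetical; the sketch assumes PHP 7.4 or later with mbstring, and it uses a rolling two-row table rather than the full table above.

```php
<?php
// Hypothetical compact variant: character-level LCS length using two rows.
// Assumes PHP 7.4+ and the mbstring extension for mb_str_split().
function demo_lcs_mb(string $a, string $b): int {
    $s = mb_str_split($a, 1, 'UTF-8');
    $t = mb_str_split($b, 1, 'UTF-8');
    $n = count($t);
    $prev = array_fill(0, $n + 1, 0); // LCS lengths for the previous row
    foreach ($s as $ch) {
        $curr = [0];
        for ($j = 1; $j <= $n; $j++) {
            $curr[$j] = ($ch === $t[$j - 1])
                ? $prev[$j - 1] + 1                 // characters match: extend
                : max($prev[$j], $curr[$j - 1]);    // otherwise keep the best so far
        }
        $prev = $curr;
    }
    return $prev[$n];
}

// Common subsequence: 我北京天安门
echo demo_lcs_mb('我爱北京天安门', '我在北京看天安门'), "\n"; // 6
// One common subsequence of length 4 is BCBA
echo demo_lcs_mb('ABCBDAB', 'BDCABA'), "\n";                 // 4
```

Unlike the distinct-character count used by similar_text_cn, the LCS respects character order, so reordering one of the strings generally lowers the result.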
[PHP] Name several string-processing functions you are familiar with
addcslashes, addslashes, bin2hex, chop, chr, chunk_split, convert_cyr_string,
convert_uudecode, convert_uuencode, count_chars, crc32, crypt, echo, explode,
fprintf, get_html_translation_table, hebrev, hebrevc
hex2bin - Decodes a hexadecimally encoded binary string
html_entity_decode - Convert all HTML entities to their applicable characters
htmlentities - Convert all applicable characters to HTML entities
htmlspecialchars_decode - Convert special HTML entities back to characters
htmlspecialchars - Convert special characters to HTML entities
implode - Join array elements with a string
join
lcfirst - Make a string's first character lowercase
levenshtein - Calculate Levenshtein distance between two strings
localeconv - Get numeric formatting information
ltrim - Strip whitespace (or other characters) from the beginning of a string
md5_file
metaphone - Calculate the metaphone key of a string
money_format - Formats a number as a currency string
nl_langinfo - Query language and locale information
nl2br
number_format - Format a number with grouped thousands
ord
parse_str
print
printf
quoted_printable_decode - Convert a quoted-printable string to an 8-bit string
quoted_printable_encode - Convert an 8-bit string to a quoted-printable string
quotemeta - Quote meta characters
rtrim
setlocale - Set locale information
sha1_file
sha1
soundex - Calculate the soundex key of a string
sprintf - Return a formatted string
sscanf - Parses input from a string according to a ......
Can someone explain PHP's levenshtein() function in simple terms?
W3School's explanation:
The levenshtein() function returns the Levenshtein distance between two strings.
The Levenshtein distance, also known as the edit distance, is the minimum number of edit operations required to transform one string into another. The permitted edit operations are replacing one character with another, inserting a character, and deleting a character.
For example, converting kitten to sitting takes three edits:
sitten (k → s)
sittin (e → i)
sitting (→ g)
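The three steps above can be checked with the built-in function. This is a small sketch using only levenshtein() with its default costs; the $steps array is just an illustration of the sequence from the example.

```php
<?php
// Each intermediate word from the example differs from the previous one by one edit.
$steps = ['kitten', 'sitten', 'sittin', 'sitting'];
for ($i = 1; $i < count($steps); $i++) {
    echo $steps[$i - 1], ' -> ', $steps[$i], ': ',
         levenshtein($steps[$i - 1], $steps[$i]), "\n"; // 1 for every step
}

// Distance for the whole conversion.
echo levenshtein('kitten', 'sitting'), "\n"; // 3
```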
By default, the levenshtein() function gives each operation (replacement, insertion, and deletion) the same weight. However, you can set the optional insert, replace, and delete parameters to define the cost of each operation.
Note: the "cost" is the weight. In the original poster's example, Hello World → ello World, the "H" has to be deleted, so the fifth parameter (the deletion cost) applies; its weight is 30, so 30 is returned.
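A short sketch of the weighted call follows. The quoted answer only states the deletion weight of 30; the insertion and replacement weights of 10 and 20 used here are assumptions for illustration.

```php
<?php
// Argument order after the two strings: insertion cost, replacement cost, deletion cost.
// 30 comes from the example above; 10 and 20 are assumed values.
$ins = 10;
$rep = 20;
$del = 30;

echo levenshtein('Hello World', 'ello World'), "\n";                   // 1 (default costs)
echo levenshtein('Hello World', 'ello World', $ins, $rep, $del), "\n"; // 30 (delete "H")
echo levenshtein('ello World', 'Hello World', $ins, $rep, $del), "\n"; // 10 (insert "H")
```

With unequal weights the result depends on the direction of the conversion, as the last two lines show.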