PHP improves the similar_text (), levenshtein (), and levenshtein functions for string similarity calculation.

Source: Internet
Author: User
Tags crc32

PHP improves the similar_text (), levenshtein (), and levenshtein functions for string similarity calculation.

Similar_text () Chinese character edition

Copy codeThe Code is as follows:
<? Php
// Split the string
Function split_str ($ str ){
Preg_match_all ("/./u", $ str, $ arr );
Return $ arr [0];
}

// Similarity Detection
Function similar_text_cn ($ str1, $ str2 ){
$ Arr_1 = array_unique (split_str ($ str1 ));
$ Arr_2 = array_unique (split_str ($ str2 ));
$ Similarity = count ($ arr_2)-count (array_diff ($ arr_2, $ arr_1 ));

Return $ similarity;
}

Levenshtein () Chinese character edition
 

Copy codeThe Code is as follows:
<? Php
// Split the string
Function mbStringToArray ($ string, $ encoding = 'utf-8 '){
$ ArrayResult = array ();
While ($ iLen = mb_strlen ($ string, $ encoding )){
Array_push ($ arrayResult, mb_substr ($ string, 0, 1, $ encoding ));
$ String = mb_substr ($ string, 1, $ iLen, $ encoding );
}
Return $ arrayResult;
}
// Editing distance
Function levenshtein_cn ($ str1, $ str2, $ costReplace = 1, $ encoding = 'utf-8 '){
$ Count_same_letter = 0;
$ D = array ();
$ Mb_len1 = mb_strlen ($ str1, $ encoding );
$ Mb_len2 = mb_strlen ($ str2, $ encoding );
$ Mb_str1 = mbStringToArray ($ str1, $ encoding );
$ Mb_str2 = mbStringToArray ($ str2, $ encoding );
For ($ i1 = 0; $ i1 <= $ mb_len1; $ i1 ++ ){
$ D [$ i1] = array ();
$ D [$ i1] [0] = $ i1;
}
For ($ i2 = 0; $ i2 <= $ mb_len2; $ i2 ++ ){
$ D [0] [$ i2] = $ i2;
}
For ($ i1 = 1; $ i1 <= $ mb_len1; $ i1 ++ ){
For ($ i2 = 1; $ i2 <= $ mb_len2; $ i2 ++ ){
// $ Cost = ($ str1 [$ i1-1] = $ str2 [$ i2-1])? 0: 1;
If ($ mb_str1 [$ i1-1] ===$ mb_str2 [$ i2-1]) {
$ Cost = 0;
$ Count_same_letter ++;
} Else {
$ Cost = $ costReplace; // replace
}
$ D [$ i1] [$ i2] = min ($ d [$ i1-1] [$ i2] + 1, // insert
$ D [$ i1] [$ i2-1] + 1, // Delete
$ D [$ i1-1] [$ i2-1] + $ cost );
}
}
Return $ d [$ mb_len1] [$ mb_len2];
// Return array ('distance '=> $ d [$ mb_len1] [$ mb_len2], 'count _ same_letter' => $ count_same_letter );
}

 
Longest Common subsequence LCS ()

 
Copy codeThe Code is as follows:
<? Php
// English version of the longest common subsequence
Function LCS_en ($ str_1, $ str_2 ){
$ Len_1 = strlen ($ str_1 );
$ Len_2 = strlen ($ str_2 );
$ Len = $ len_1> $ len_2? $ Len_1: $ len_2;
$ Dp = array ();
For ($ I = 0; $ I <= $ len; $ I ++ ){
$ Dp [$ I] = array ();
$ Dp [$ I] [0] = 0;
$ Dp [0] [$ I] = 0;
}
For ($ I = 1; $ I <= $ len_1; $ I ++ ){
For ($ j = 1; $ j <= $ len_2; $ j ++ ){
If ($ str_1 [$ I-1] ==$ str_2 [$ j-1]) {
$ Dp [$ I] [$ j] = $ dp [$ I-1] [$ j-1] + 1;
} Else {
$ Dp [$ I] [$ j] = $ dp [$ I-1] [$ j]> $ dp [$ I] [$ j-1]? $ Dp [$ I-1] [$ j]: $ dp [$ I] [$ j-1];
}
}
}
Return $ dp [$ len_1] [$ len_2];
}
// Split the string
Function mbStringToArray ($ string, $ encoding = 'utf-8 '){
$ ArrayResult = array ();
While ($ iLen = mb_strlen ($ string, $ encoding )){
Array_push ($ arrayResult, mb_substr ($ string, 0, 1, $ encoding ));
$ String = mb_substr ($ string, 1, $ iLen, $ encoding );
}
Return $ arrayResult;
}
// Chinese version of the longest common subsequence
Function LCS_cn ($ str1, $ str2, $ encoding = 'utf-8 '){
$ Mb_len1 = mb_strlen ($ str1, $ encoding );
$ Mb_len2 = mb_strlen ($ str2, $ encoding );
$ Mb_str1 = mbStringToArray ($ str1, $ encoding );
$ Mb_str2 = mbStringToArray ($ str2, $ encoding );
$ Len = $ mb_len1> $ mb_len2? $ Mb_len1: $ mb_len2;
$ Dp = array ();
For ($ I = 0; $ I <= $ len; $ I ++ ){
$ Dp [$ I] = array ();
$ Dp [$ I] [0] = 0;
$ Dp [0] [$ I] = 0;
}
For ($ I = 1; $ I <= $ mb_len1; $ I ++ ){
For ($ j = 1; $ j <= $ mb_len2; $ j ++ ){
If ($ mb_str1 [$ I-1] ==$ mb_str2 [$ j-1]) {
$ Dp [$ I] [$ j] = $ dp [$ I-1] [$ j-1] + 1;
} Else {
$ Dp [$ I] [$ j] = $ dp [$ I-1] [$ j]> $ dp [$ I] [$ j-1]? $ Dp [$ I-1] [$ j]: $ dp [$ I] [$ j-1];
}
}
}
Return $ dp [$ mb_len1] [$ mb_len2];
}


(100 points) [php] write several familiar string processing functions!

Addcslashes addslashes bin2hex chop chr chunk_split convert_cyr_string cyrillic
Convert_uudecode convert_uuencode count_chars crc32 crc32 crypt echo explode

Fprintf get_html_translation_table hebrev

Hebrevc
Hex2bin-Decodes a hexadecimally encoded binary string
Html_entity_decode-Convert all HTML entities to their applicable characters
Htmlentities-Convert all applicable characters to HTML entities
Htmlspecialchars_decode-Convert special HTML entities back to characters
Htmlspecialchars-Convert special characters to HTML entities
Implode-Join array elements with a string
Join

Lcfirst-Make a string's first character lowercase
Levenshtein-Calculate Levenshtein distance between two strings
Localeconv-Get numeric formatting information
Ltrim-Strip whitespace (or other characters) from the beginning of a string
Md5_file
Metaphone-Calculate the metaphone key of a string
Money_format-Formats a number as a currency string
Nl_langinfo-Query language and locale information
Nl2br

Number_format-Format a number with grouped thousands
Ord

Parse_str

Print

Printf

Quoted_printable_decode-Convert a quoted-printable string to an 8-bit string
Quoted_printable_encode-Convert a 8 bit string to a quoted-printable string
Quotemeta-Quote meta characters
Rtrim
Setlocale-Set locale information
Sha1_file

Sha1

Soundex-Calculate the soundex key of a string
Sprintf-Return a formatted string
Sscanf-Parses input from a string according to a ...... the remaining full text>

Can the levenshtein function of php be easily understood?

W3School explanation:
The levenshtein () function returns the Levenshtein distance between two strings.
Levenshtein distance, also known as the editing distance, refers to the minimum number of edits required to convert a string from one to another. Licensed editing operations include replacing one character with another, inserting one character, and deleting one character.
For example, convert kitten to sitting:
Sitten (k → s)
Sittin (e → I)
Sitting (→ g)
The levenshtein () function gives each operation the same weight (replacement, insertion, and deletion. However, you can set optional insert, replace, and delete parameters to define the cost of each operation.

Note: "price" is the weight. In the example of the landlord, Hello World → ello World, you need to "delete" "H", that is, the fifth parameter is used, and the corresponding weight is 30, so 30 is returned.

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.