PHP has a function, similar_text(), that calculates the similarity between two strings. Through its optional third argument it also returns a percentage indicating how similar the two strings are. For example:
similar_text('aaaa', 'aaaa', $percent);
var_dump($percent);
// float(100)

similar_text('aaaa', 'aaaabbbb', $percent);
var_dump($percent);
// float(66.666666666667)

similar_text('abcdef', 'aabcdefg', $percent);
var_dump($percent);
// float(85.714285714286)
This function is useful for fuzzy search and other features that need fuzzy matching. I recently used it for feature matching while studying verification code (CAPTCHA) recognition.
But what algorithm does this function use? I studied its underlying implementation and summarized it in three steps:
(1) Find the longest common substring of the two strings;
(2) Apply the same method to the segments to the left of that match and to the segments to the right of it, and so on recursively until no common part remains;
(3) similarity = sum of the lengths of all common parts * 2 / sum of the lengths of the two strings (multiplied by 100 to give the percentage).
The source code I studied is from PHP 5.4.6. The relevant code is in the file php-5.4.6/ext/standard/string.c, lines 2951 to 3031. Below is the source code with my comments added.
// Find the longest common substring of the two strings
static void php_similar_str(const char *txt1, int len1, const char *txt2, int len2, int *pos1, int *pos2, int *max)
{
    char *p, *q;
    char *end1 = (char *) txt1 + len1;
    char *end2 = (char *) txt2 + len2;
    int l;

    *max = 0;
    // Traverse the first string
    for (p = (char *) txt1; p < end1; p++) {
        // Traverse the second string
        for (q = (char *) txt2; q < end2; q++) {
            // While the characters match, keep going; l is the length of the common part
            for (l = 0; (p + l < end1) && (q + l < end2) && (p[l] == q[l]); l++);
            // Keep the longest l seen so far and remember where the common part starts
            if (l > *max) {
                *max = l;
                *pos1 = p - txt1;
                *pos2 = q - txt2;
            }
        }
    }
}

// Calculate the total length of the common parts of the two strings
static int php_similar_char(const char *txt1, int len1, const char *txt2, int len2)
{
    int sum;
    int pos1, pos2, max;

    // Find the longest common substring of the two strings
    php_similar_str(txt1, len1, txt2, len2, &pos1, &pos2, &max);
    // sum starts as max; if max is zero, the two strings share no characters
    // and the block below is skipped
    if ((sum = max)) {
        // Recurse into the segments before the match and accumulate
        if (pos1 && pos2) {
            sum += php_similar_char(txt1, pos1, txt2, pos2);
        }
        // Recurse into the segments after the match and accumulate
        if ((pos1 + max < len1) && (pos2 + max < len2)) {
            sum += php_similar_char(txt1 + pos1 + max, len1 - pos1 - max,
                                    txt2 + pos2 + max, len2 - pos2 - max);
        }
    }

    return sum;
}

// PHP function definition
PHP_FUNCTION(similar_text)
{
    char *t1, *t2;
    zval **percent = NULL;
    int ac = ZEND_NUM_ARGS();
    int sim;
    int t1_len, t2_len;

    // Check the parameters
    if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "ss|Z", &t1, &t1_len, &t2, &t2_len, &percent) == FAILURE) {
        return;
    }

    // If a third parameter was passed, convert it to a double
    if (ac > 2) {
        convert_to_double_ex(percent);
    }

    // If both strings have length 0, return 0
    if (t1_len + t2_len == 0) {
        if (ac > 2) {
            Z_DVAL_PP(percent) = 0;
        }

        RETURN_LONG(0);
    }

    // Call the function above to calculate the similarity of the two strings
    sim = php_similar_char(t1, t1_len, t2, t2_len);

    // This is the formula for the percentage
    if (ac > 2) {
        Z_DVAL_PP(percent) = sim * 200.0 / (t1_len + t2_len);
    }

    RETURN_LONG(sim);
}
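To experiment with the algorithm outside of PHP, the three steps can be sketched in plain C without the Zend macros. This is a minimal sketch, not the PHP source itself: the names similar_str, similar_char and similar_text_c are made up for illustration, and the inputs are assumed to be NUL-terminated C strings rather than PHP's length-counted strings.

```c
#include <string.h>

/* Step 1: find the first longest common substring of txt1 and txt2.
 * (Hypothetical helper; it mirrors the triple loop described in the article.) */
static void similar_str(const char *txt1, int len1, const char *txt2, int len2,
                        int *pos1, int *pos2, int *max)
{
    *max = 0;
    *pos1 = 0;
    *pos2 = 0;
    for (int i = 0; i < len1; i++) {
        for (int j = 0; j < len2; j++) {
            int l = 0;
            while (i + l < len1 && j + l < len2 && txt1[i + l] == txt2[j + l]) {
                l++;
            }
            /* Strict '>' means the first match of maximal length wins. */
            if (l > *max) {
                *max = l;
                *pos1 = i;
                *pos2 = j;
            }
        }
    }
}

/* Step 2: add the match length to the totals found to its left and right. */
static int similar_char(const char *txt1, int len1, const char *txt2, int len2)
{
    int pos1, pos2, max;
    similar_str(txt1, len1, txt2, len2, &pos1, &pos2, &max);

    int sum = max;
    if (max > 0) {
        if (pos1 > 0 && pos2 > 0) {
            sum += similar_char(txt1, pos1, txt2, pos2);
        }
        if (pos1 + max < len1 && pos2 + max < len2) {
            sum += similar_char(txt1 + pos1 + max, len1 - pos1 - max,
                                txt2 + pos2 + max, len2 - pos2 - max);
        }
    }
    return sum;
}

/* Step 3: return the number of matching characters and, if percent is not
 * NULL, store sim * 200 / (len1 + len2) in it. */
int similar_text_c(const char *s1, const char *s2, double *percent)
{
    int len1 = (int) strlen(s1);
    int len2 = (int) strlen(s2);

    if (len1 + len2 == 0) {
        if (percent) {
            *percent = 0.0;
        }
        return 0;
    }

    int sim = similar_char(s1, len1, s2, len2);
    if (percent) {
        *percent = sim * 200.0 / (len1 + len2);
    }
    return sim;
}
```

Calling similar_text_c("aaaa", "aaaabbbb", &p) in this sketch returns 4 and sets p to about 66.67, matching the PHP output shown earlier. Because step 1 keeps the first longest match, the order of the arguments can matter: ("bafoobar", "barfoomm") gives 5 matching characters, while ("barfoomm", "bafoobar") gives 3.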
In addition, PHP provides another function, levenshtein(), for measuring string similarity. It computes the edit distance between two strings and uses that distance to represent their similarity. This is also a common algorithm. levenshtein() performs better than similar_text(): the complexity of similar_text() is O(n^3), where n is the length of the longer string, while the complexity of levenshtein() is O(m * n), where m and n are the lengths of the two strings.
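To see where the O(m * n) cost comes from, here is a minimal sketch of the textbook dynamic-programming edit distance in C. It is not PHP's implementation of levenshtein(): the function name levenshtein_distance is made up for illustration, every insertion, deletion and substitution is assumed to cost 1, and the inputs are assumed to be NUL-terminated strings.

```c
#include <stdlib.h>
#include <string.h>

/* Textbook Levenshtein distance with unit costs.
 * Returns the distance, or -1 if memory allocation fails. */
int levenshtein_distance(const char *s, const char *t)
{
    size_t m = strlen(s);
    size_t n = strlen(t);

    /* Only two rows of the (m + 1) x (n + 1) table are kept at a time. */
    int *prev = malloc((n + 1) * sizeof *prev);
    int *curr = malloc((n + 1) * sizeof *curr);
    if (!prev || !curr) {
        free(prev);
        free(curr);
        return -1;
    }

    /* Turning the empty prefix of s into t[0..j) takes j insertions. */
    for (size_t j = 0; j <= n; j++) {
        prev[j] = (int) j;
    }

    /* Each of the m * n cells is filled once, in constant time. */
    for (size_t i = 1; i <= m; i++) {
        curr[0] = (int) i;  /* i deletions turn s[0..i) into the empty string */
        for (size_t j = 1; j <= n; j++) {
            int cost = (s[i - 1] == t[j - 1]) ? 0 : 1;
            int del = prev[j] + 1;
            int ins = curr[j - 1] + 1;
            int sub = prev[j - 1] + cost;
            int best = del < ins ? del : ins;
            curr[j] = best < sub ? best : sub;
        }
        int *tmp = prev;
        prev = curr;
        curr = tmp;
    }

    int result = prev[n];
    free(prev);
    free(curr);
    return result;
}
```

The two nested loops visit each pair of positions once, which gives the m * n bound. In this sketch, levenshtein_distance("aaaa", "aaaabbbb") returns 4, since four insertions are needed.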
Original article: Analysis of the Principle of the PHP Function similar_text().