[Code] Analysis of the PHP function similar_text()

Source: Internet
Author: User

PHP provides the similar_text() function, which calculates the similarity between two strings. Through its optional third argument it also returns a percentage that indicates how similar the two strings are. For example:


similar_text('aaaa', 'aaaa', $percent);
var_dump($percent);
// float(100)

similar_text('aaaa', 'aaaabbbb', $percent);
var_dump($percent);
// float(66.666666666667)

similar_text('abcdef', 'aabcdefg', $percent);
var_dump($percent);
// float(85.714285714286)


This function can be used for fuzzy search or any other feature that requires fuzzy matching. I recently used it for feature matching while studying CAPTCHA recognition.


But what algorithm does this function use? I studied its underlying implementation and summarized it in three steps:


(1) Find the longest segment that the two strings have in common;
(2) Apply the same method to the parts to the left of that segment in both strings, and to the parts to the right of it in both strings, and so on until no common part remains;
(3) Similarity = (sum of the lengths of all common segments) * 2 / (sum of the lengths of the two strings), expressed as a percentage.
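Step (3) is plain arithmetic, so it can be checked by hand against the examples above. The following is a minimal sketch in C; the helper name similarity_percent is hypothetical and is not part of PHP. For 'aaaa' and 'aaaabbbb' the common part is 4 characters long, so the result is 4 * 2 / (4 + 8) = 66.67%.

```c
#include <assert.h>
#include <stddef.h>

/* Step (3) as arithmetic: percent = matched * 2 / (len1 + len2) * 100.
 * similarity_percent is a hypothetical helper name, not a PHP API. */
static double similarity_percent(size_t matched, size_t len1, size_t len2)
{
    /* The common part can never be longer than either string. */
    assert(matched <= len1 && matched <= len2);

    if (len1 + len2 == 0) {
        return 0.0; /* two empty strings: avoid dividing by zero */
    }
    return matched * 200.0 / (double)(len1 + len2);
}
```

Calling similarity_percent(6, 6, 8) corresponds to the 'abcdef' / 'aabcdefg' example, where the common part is the 6 characters 'abcdef'.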


The source code version I studied is PHP 5.4.6. The relevant code is in the file php-5.4.6/ext/standard/string.c, lines 2951 to 3031. The following is that source code with my comments added.

// Find the longest segment that the two strings have in common
static void php_similar_str(const char *txt1, int len1, const char *txt2, int len2, int *pos1, int *pos2, int *max)
{
    char *p, *q;
    char *end1 = (char *) txt1 + len1;
    char *end2 = (char *) txt2 + len2;
    int l;

    *max = 0;
    // Traverse the first string
    for (p = (char *) txt1; p < end1; p++) {
        // Traverse the second string
        for (q = (char *) txt2; q < end2; q++) {
            // Starting from p and q, keep going while the characters match;
            // l is the length of the common run
            for (l = 0; (p + l < end1) && (q + l < end2) && (p[l] == q[l]); l++);
            // Keep the running maximum of l and remember where that run starts
            if (l > *max) {
                *max = l;
                *pos1 = p - txt1;
                *pos2 = q - txt2;
            }
        }
    }
}

// Calculate the total length of the common parts of the two strings
static int php_similar_char(const char *txt1, int len1, const char *txt2, int len2)
{
    int sum;
    int pos1, pos2, max;

    // Find the longest segment that the two strings have in common
    php_similar_str(txt1, len1, txt2, len2, &pos1, &pos2, &max);
    // This assignment initializes sum and also tests max:
    // if max is zero, the two strings share no characters and the if body is skipped
    if ((sum = max)) {
        // Recurse into the parts to the left of the match and add their common length
        if (pos1 && pos2) {
            sum += php_similar_char(txt1, pos1,
                                    txt2, pos2);
        }
        // Recurse into the parts to the right of the match and add their common length
        if ((pos1 + max < len1) && (pos2 + max < len2)) {
            sum += php_similar_char(txt1 + pos1 + max, len1 - pos1 - max,
                                    txt2 + pos2 + max, len2 - pos2 - max);
        }
    }

    return sum;
}

// PHP function definition
PHP_FUNCTION(similar_text)
{
    char *t1, *t2;
    zval **percent = NULL;
    int ac = ZEND_NUM_ARGS();
    int sim;
    int t1_len, t2_len;

    // Check the validity of the parameters
    if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "ss|Z", &t1, &t1_len, &t2, &t2_len, &percent) == FAILURE) {
        return;
    }

    // If a third parameter was passed
    if (ac > 2) {
        convert_to_double_ex(percent);
    }

    // If both strings have length 0, return 0
    if (t1_len + t2_len == 0) {
        if (ac > 2) {
            Z_DVAL_PP(percent) = 0;
        }

        RETURN_LONG(0);
    }

    // Call the function above to calculate the similarity of the two strings
    sim = php_similar_char(t1, t1_len, t2, t2_len);

    // This is the formula from step (3)
    if (ac > 2) {
        Z_DVAL_PP(percent) = sim * 200.0 / (t1_len + t2_len);
    }

    RETURN_LONG(sim);
}
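The PHP source above depends on the Zend macros, so it cannot be compiled on its own. To experiment with the algorithm outside PHP, the same logic can be restated in plain C. The following is a sketch under that assumption: the names longest_common, similar_chars and similar_text_c are hypothetical, and the code only mirrors the behavior described above rather than reproducing PHP's source.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Find the longest common run of a[0..alen) and b[0..blen).
 * Because the comparison is a strict '>', the first longest run found wins. */
static void longest_common(const char *a, int alen, const char *b, int blen,
                           int *pos_a, int *pos_b, int *max)
{
    *max = 0;
    for (int i = 0; i < alen; i++) {
        for (int j = 0; j < blen; j++) {
            int l = 0;
            while (i + l < alen && j + l < blen && a[i + l] == b[j + l]) {
                l++;
            }
            if (l > *max) {
                *max = l;
                *pos_a = i;
                *pos_b = j;
            }
        }
    }
}

/* Total length of the common parts: longest run, plus the same
 * computation on the left remainders and on the right remainders. */
static int similar_chars(const char *a, int alen, const char *b, int blen)
{
    int pos_a = 0, pos_b = 0, max = 0;

    longest_common(a, alen, b, blen, &pos_a, &pos_b, &max);
    int sum = max;
    if (sum > 0) {
        if (pos_a > 0 && pos_b > 0) {
            sum += similar_chars(a, pos_a, b, pos_b);
        }
        if (pos_a + max < alen && pos_b + max < blen) {
            sum += similar_chars(a + pos_a + max, alen - pos_a - max,
                                 b + pos_b + max, blen - pos_b - max);
        }
    }
    return sum;
}

/* Plain-C counterpart of similar_text(): returns the number of matching
 * characters and, if percent is not NULL, stores the percentage there. */
static int similar_text_c(const char *s1, const char *s2, double *percent)
{
    assert(s1 != NULL && s2 != NULL);
    int len1 = (int)strlen(s1);
    int len2 = (int)strlen(s2);

    if (len1 + len2 == 0) {
        if (percent != NULL) {
            *percent = 0.0;
        }
        return 0;
    }
    int sim = similar_chars(s1, len1, s2, len2);
    if (percent != NULL) {
        *percent = sim * 200.0 / (len1 + len2);
    }
    return sim;
}
```

With this sketch, similar_text_c("abcdef", "aabcdefg", &p) returns 6 and stores roughly 85.71 in p, which agrees with the PHP example at the top of the article.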


In addition, PHP provides another function, levenshtein(), for measuring string similarity. levenshtein() implements a common algorithm that computes the edit distance between two strings, and that distance indicates how similar they are. levenshtein() performs better than similar_text(): the complexity of similar_text() is O(n^3), where n is the length of the longer string, while the complexity of levenshtein() is O(m * n), where m and n are the lengths of the two strings.
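To show where the O(m * n) bound comes from, here is a textbook dynamic-programming sketch of edit distance in C. It is not PHP's implementation of levenshtein(); the function name levenshtein_distance and the unit cost for every insert, delete and replace are assumptions made for illustration. The two nested loops visit each of the m * n cell pairs exactly once.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

static int min3(int a, int b, int c)
{
    int m = a < b ? a : b;
    return m < c ? m : c;
}

/* Edit distance with unit costs, using two rows of the DP table.
 * Returns -1 if memory allocation fails. Hypothetical helper, not PHP's code. */
static int levenshtein_distance(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    size_t m = strlen(a);
    size_t n = strlen(b);

    int *prev = malloc((n + 1) * sizeof *prev);
    int *curr = malloc((n + 1) * sizeof *curr);
    if (prev == NULL || curr == NULL) {
        free(prev);
        free(curr);
        return -1;
    }

    /* Distance from the empty prefix of a to each prefix of b. */
    for (size_t j = 0; j <= n; j++) {
        prev[j] = (int)j;
    }

    for (size_t i = 1; i <= m; i++) {
        curr[0] = (int)i; /* delete all i characters of a's prefix */
        for (size_t j = 1; j <= n; j++) {
            int cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            curr[j] = min3(prev[j] + 1,         /* deletion */
                           curr[j - 1] + 1,     /* insertion */
                           prev[j - 1] + cost); /* match or replacement */
        }
        int *tmp = prev; /* the finished row becomes the previous row */
        prev = curr;
        curr = tmp;
    }

    int distance = prev[n];
    free(prev);
    free(curr);
    return distance;
}
```

Note that the result is a distance rather than a percentage: identical strings give 0, and 'aaaa' versus 'aaaabbbb' gives 4, because four insertions are needed.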


That is how the PHP function similar_text() works. I hope this article is helpful to PHP developers. Thank you for reading.
