How the PHP function similar_text() works

Source: Internet
Author: User

PHP provides a function called similar_text() that calculates the similarity between two strings. It returns the number of matching characters and, through an optional third argument passed by reference, the similarity as a percentage. For example:

similar_text('aaa', 'aaa', $percent);
var_dump($percent);
// float(100)

similar_text('aaaa', 'aaaabbbb', $percent);
var_dump($percent);
// float(66.666666666667)

similar_text('abcdef', 'aabcdefg', $percent);
var_dump($percent);
// float(85.714285714286)

This function is useful for fuzzy search and for anything else that needs fuzzy matching. I recently used it for feature matching while studying CAPTCHA recognition.

But what algorithm does this function use? I studied its underlying implementation and summarized it in three steps:

(1) Find the longest substring that the two strings have in common.

(2) Apply the same method to what remains: first the pieces to the left of that common substring in both strings, then the pieces to the right, and so on until no common part is left.

(3) Similarity = (sum of the lengths of all common parts) * 2 / (sum of the lengths of the two strings).
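The three steps can be tried outside PHP with a small standalone sketch. This is not the PHP source: the names longest_common_run, similar_chars and similarity_percent are hypothetical, and size_t is used for lengths as an assumption for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Step 1: find the longest run of bytes common to a[0..alen) and b[0..blen).
 * Stores the start offsets in *pa and *pb and returns the run length
 * (0 if the inputs share no byte). Hypothetical helper, not a PHP API. */
static size_t longest_common_run(const char *a, size_t alen,
                                 const char *b, size_t blen,
                                 size_t *pa, size_t *pb)
{
    size_t best = 0;
    for (size_t i = 0; i < alen; i++) {
        for (size_t j = 0; j < blen; j++) {
            size_t l = 0;
            while (i + l < alen && j + l < blen && a[i + l] == b[j + l]) {
                l++;
            }
            if (l > best) {   /* strict '>' keeps the first longest run found */
                best = l;
                *pa = i;
                *pb = j;
            }
        }
    }
    return best;
}

/* Step 2: add the run length to the result of repeating the search on the
 * pieces to the left of the run and on the pieces to the right of it. */
static size_t similar_chars(const char *a, size_t alen,
                            const char *b, size_t blen)
{
    size_t pa = 0, pb = 0;
    size_t run = longest_common_run(a, alen, b, blen, &pa, &pb);
    if (run == 0) {
        return 0;
    }
    size_t sum = run;
    if (pa > 0 && pb > 0) {
        sum += similar_chars(a, pa, b, pb);
    }
    if (pa + run < alen && pb + run < blen) {
        sum += similar_chars(a + pa + run, alen - pa - run,
                             b + pb + run, blen - pb - run);
    }
    return sum;
}

/* Step 3: percentage = matched * 2 / (len(a) + len(b)) * 100. */
static double similarity_percent(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    size_t alen = strlen(a), blen = strlen(b);
    if (alen + blen == 0) {
        return 0.0;   /* two empty strings: avoid dividing by zero */
    }
    return similar_chars(a, alen, b, blen) * 200.0 / (double)(alen + blen);
}
```

Under these assumptions, similarity_percent("aaa", "aaa") gives 100, and similarity_percent("World", "Word") gives 4 * 200 / 9, about 88.89: the run "Wor" contributes 3 and the remaining "d" contributes 1.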

The source I studied is PHP 5.4.6. The relevant code is in the file php-5.4.6/ext/standard/string.c, lines 2951 to 3031. Below is that source code with my comments added.

// Find the longest substring common to the two strings
static void php_similar_str(const char *txt1, int len1, const char *txt2, int len2, int *pos1, int *pos2, int *max)
{
    char *p, *q;
    char *end1 = (char *) txt1 + len1;
    char *end2 = (char *) txt2 + len2;
    int l;

    *max = 0;
    // Walk through every start position in the first string
    for (p = (char *) txt1; p < end1; p++) {
        // Walk through every start position in the second string
        for (q = (char *) txt2; q < end2; q++) {
            // While the characters match, keep extending; l is the length of the common part
            for (l = 0; (p + l < end1) && (q + l < end2) && (p[l] == q[l]); l++);
            // Keep the running maximum of l and remember where that common part starts
            if (l > *max) {
                *max = l;
                *pos1 = p - txt1;
                *pos2 = q - txt2;
            }
        }
    }
}

// Calculate the total length of the common parts of the two strings
static int php_similar_char(const char *txt1, int len1, const char *txt2, int len2)
{
    int sum;
    int pos1, pos2, max;

    // Find the longest substring common to the two strings
    php_similar_str(txt1, len1, txt2, len2, &pos1, &pos2, &max);
    // This assignment initializes sum and also tests max.
    // If max is 0, the two strings share no characters and the if body is skipped.
    if ((sum = max)) {
        // Recurse on the parts to the left of the match and accumulate the common length
        if (pos1 && pos2) {
            sum += php_similar_char(txt1, pos1,
                                    txt2, pos2);
        }
        // Recurse on the parts to the right of the match and accumulate the common length
        if ((pos1 + max < len1) && (pos2 + max < len2)) {
            sum += php_similar_char(txt1 + pos1 + max, len1 - pos1 - max,
                                    txt2 + pos2 + max, len2 - pos2 - max);
        }
    }

    return sum;
}

// PHP function definition
PHP_FUNCTION(similar_text)
{
    char *t1, *t2;
    zval **percent = NULL;
    int ac = ZEND_NUM_ARGS();
    int sim;
    int t1_len, t2_len;

    // Check that the parameters are valid
    if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "ss|Z", &t1, &t1_len, &t2, &t2_len, &percent) == FAILURE) {
        return;
    }

    // If a third parameter was passed
    if (ac > 2) {
        convert_to_double_ex(percent);
    }

    // If both strings have length 0, return 0
    if (t1_len + t2_len == 0) {
        if (ac > 2) {
            Z_DVAL_PP(percent) = 0;
        }

        RETURN_LONG(0);
    }

    // Call the function above to calculate the similarity of the two strings
    sim = php_similar_char(t1, t1_len, t2, t2_len);

    // This is the formula used for the third parameter, percent
    if (ac > 2) {
        Z_DVAL_PP(percent) = sim * 200.0 / (t1_len + t2_len);
    }

    RETURN_LONG(sim);
}

PHP also provides another function for comparing strings, levenshtein(). It expresses similarity as the edit distance between two strings, which is another common algorithm. levenshtein() performs better than similar_text(): the complexity of similar_text() is O(n^3), where n is the length of the longer string, while the complexity of levenshtein() is O(m * n), where m and n are the lengths of the two strings.
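To show where the O(m * n) bound comes from, here is a minimal sketch of the textbook dynamic-programming edit distance with unit costs. It is not PHP's levenshtein() source, and the names min3 and edit_distance are hypothetical. The two nested loops fill one table cell per pair of positions, which gives the m * n cost.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Smallest of three values. */
static size_t min3(size_t x, size_t y, size_t z)
{
    size_t m = x < y ? x : y;
    return m < z ? m : z;
}

/* Edit distance between a and b, where each insertion, deletion and
 * substitution costs 1. Only two rows of the table are kept in memory.
 * Returns (size_t)-1 if memory allocation fails. */
static size_t edit_distance(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    size_t m = strlen(a), n = strlen(b);
    size_t *prev = malloc((n + 1) * sizeof *prev);
    size_t *curr = malloc((n + 1) * sizeof *curr);
    if (prev == NULL || curr == NULL) {
        free(prev);
        free(curr);
        return (size_t)-1;
    }
    for (size_t j = 0; j <= n; j++) {
        prev[j] = j;              /* "" -> b[0..j) needs j insertions */
    }
    for (size_t i = 1; i <= m; i++) {
        curr[0] = i;              /* a[0..i) -> "" needs i deletions */
        for (size_t j = 1; j <= n; j++) {
            size_t cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            curr[j] = min3(prev[j] + 1,          /* delete a[i-1] */
                           curr[j - 1] + 1,      /* insert b[j-1] */
                           prev[j - 1] + cost);  /* keep or substitute */
        }
        size_t *tmp = prev;       /* the row just filled becomes prev */
        prev = curr;
        curr = tmp;
    }
    size_t result = prev[n];
    free(prev);
    free(curr);
    return result;
}
```

For example, edit_distance("kitten", "sitting") is 3: substitute k with s, substitute e with i, and insert g at the end.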
