Code used to calculate string similarity in PHP

Last Update:2018-03-13 Source: Internet

Author: User

Developer on Alibaba Coud: Build your first app with APIs, SDKs, and tutorials on the Alibaba Cloud. Read more ＞

In php, the string similarity similar_text function and the similarity levenshtein function are described in detail. next we will introduce in detail the string similarity introduction similar_text-calculate the similarity between two strings.
Int similar_text (string $ first, string $ second [, float & $ percent])
$ First is required. Specifies the first string to be compared.
$ Second is required. Specifies the second string to be compared.
$ Percent is optional. Specifies the name of the variable that stores the percentage similarity.

The similarity between the two strings is calculated based on the description of Oliver [1993. Note that this implementation does not use the stack in the Oliver virtual code, but is called recursively. This may cause the whole process to slow down or become faster. Note that the complexity of this algorithm is O (N ** 3), and N is the length of the longest string.

For example, we want to find the similarity between the string abcdefg and the string aeg:

The code is as follows:

$ First = "abcdefg ";
$ Second = "aeg ";
Echo similar_text ($ first, $ second); result output 3. if you want to display it as a percentage, you can use its third parameter, as shown below:
$ First = "abcdefg ";
$ Second = "aeg ";
Similar_text ($ first, $ second, $ percent );
Echo $ percent;

Use and implementation of the similar_text function. The similar_text () function is mainly used to calculate the number of matching characters of two strings. It can also calculate the similarity between two strings (expressed in percentages ). Compared with the similar_text () function, the levenshtein () function we will introduce today is faster. However, the similar_text () function provides more accurate results with fewer necessary modifications. The levenshtein () function can be used when the speed is less accurate and the string length is limited.

Instructions for Use

Let's take a look at the levenshtein () function instructions in the manual:

The levenshtein () function returns the Levenshtein distance between two strings.

Levenshtein distance, also known as the editing distance, refers to the minimum number of edits required to convert a string from one to another. Licensed editing operations include replacing one character with another, inserting one character, and deleting one character.

For example, convert kitten to sitting:

Sitten (k → s)
Sittin (e → I)
The sitting (→ g) levenshtein () function gives each operation the same weight (replacement, insertion, and deletion. However, you can set optional insert, replace, and delete parameters to define the cost of each operation.

Syntax:

Levenshtein (string1, string2, insert, replace, delete)

Parameter description

• String1 is required. The first string to be compared.
• String2 is required. The second string to be compared.
• Insert is optional. The cost of inserting a character. The default value is 1.
• Replace is optional. The cost of replacing a character. The default value is 1.
• Delete is optional. The cost of deleting a character. The default value is 1.
Tips and comments

• If one of the strings exceeds 255 characters, the levenshtein () function returns-1.
• The levenshtein () function is case insensitive.
• The levenshtein () function is faster than the similar_text () function. However, the similar_text () function provides more accurate results that require less modification.
Example

The code is as follows:

<? Php
Echo levenshtein ("Hello World", "ello World ");
Echo"
";
Echo levenshtein ("Hello World", "ello World", 10, 20, 30 );
?>

Output: 1 30

The following is a supplement:

By default, php has a function similar_text () used to calculate the similarity between strings. This function can also calculate the similarity between two strings (expressed in percentages ). However, this function is not accurate in Chinese computing, for example:

The code is as follows:

Echo similar_text ("112 people were killed in a fire in Jilin poultry company", "112 people were killed in a fire in Jilin Baoyuanfeng Poultry Company ");

The two news titles are actually the same. if similar_text () is used for similarity, the result is: 42, that is, it is only similar to 42%, so this feeling is very unreliable, today, I just collected a piece of PHP code that is used to compare the similarity between two strings and paste the code directly:

<? Php class LCS {var $ str1; var $ str2; var $ c = array ();/* returns the longest common subsequence of string 1 and string 2 */function getLCS ($ str1, $ str2, $ len1 = 0, $ len2 = 0) {$ this-> str1 = $ str1; $ this-> str2 = $ str2; if ($ len1 = 0) $ len1 = strlen ($ str1); if ($ len2 = 0) $ len2 = strlen ($ str2); $ this-> initC ($ len1, $ len2 ); return $ this-> printLCS ($ this-> c, $ len1-1, $ len2-1);}/* returns the similarity of two strings */function getSimilar ($ str1, $ str2) {$ len1 = strlen ($ str1); $ len2 = strlen ($ str2); $ len = strlen ($ this-> getLCS ($ str1, $ str2, $ len1, $ len2); return $ len * 2/($ len1 + $ len2);} function initC ($ len1, $ len2) {for ($ I = 0; $ I <$ len1; $ I ++) $ this-> c [$ I] [0] = 0; for ($ j = 0; $ j <$ len2; $ j ++) $ this-> c [0] [$ j] = 0; for ($ I = 1; $ I <$ len1; $ I ++) {for ($ j = 1; $ j <$ len2; $ j ++) {if ($ this-> str1 [$ I] ==$ this-> str2 [$ j]) {$ this-> c [$ I] [$ j] = $ this-> c [$ I-1] [$ j-1] + 1 ;} else if ($ this-> c [$ I-1] [$ j] >=$ this-> c [$ I] [$ j-1]) {$ this-> c [$ I] [$ j] = $ this-> c [$ I-1] [$ j];} else {$ this-> c [$ I] [$ j] = $ this-> c [$ I] [$ j-1] ;}}} function printLCS ($ c, $ I, $ j) {if ($ I = 0 | $ j = 0) {if ($ this-> str1 [$ I] ==$ this-> str2 [$ j]) return $ this-> str2 [$ j]; else return "";} if ($ this-> str1 [$ I] ==$ this-> str2 [$ j]) {return $ this-> printLCS ($ this-> c, $ I-1, $ j-1 ). $ this-> str2 [$ j];} else if ($ this-> c [$ I-1] [$ j] >=$ this-> c [$ I] [$ j-1]) {return $ this-> printLCS ($ this-> c, $ I-1, $ j);} else {return $ this-> printLCS ($ this-> c, $ I, $ j-1) ;}}$ lcs = new LCS (); // returns the longest common subsequence $ lcs-> getLCS ("hello word ", "hello china"); // returns similarity echo $ lcs-> getSimilar ("Jilin poultry company killed 112 people in a fire ", "112 people have been killed in a fire in Jilin Baoyuanfeng Poultry Company ");

The output result is 0.90322580645161, which is much more accurate.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

Get Started for Free

Sales Support

1 on 1 presale consultation

Chat Contact Sales
After-Sales Support

24/7 Technical Support 6 Free Tickets per Quarter Faster Response

Open a Ticket
Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.

Learn More

Code used to calculate string similarity in PHP

Contact Us

What's Trending

Top 10 Tags

Top 10 Keywords

A Free Trial That Lets You Build Big!

Sales Support

After-Sales Support

Code used to calculate string similarity in PHP

Contact Us

What's Trending

Top 10 Tags

Top 10 Keywords

Trending Topic

A Free Trial That Lets You Build Big!

Sales Support

After-Sales Support