Text comparison algorithm III -- calculating text similarity

Source: Internet
Author: User

"Text comparison algorithm I -- LD algorithm" describes how to calculate the edit distance.

"Text comparison algorithm II -- Needleman/Wunsch algorithm" introduces how to calculate the longest common substring.

Given a string A and a string B, let LD(A, B) denote their edit distance and LCS(A, B) denote the length of their longest common substring (strictly speaking a subsequence, since the matched characters need not be adjacent).
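As a reference for the two quantities just defined, here is a minimal Python sketch. It assumes unit-cost insertions, deletions and substitutions for LD, and the standard dynamic-programming recurrence for LCS; the function names `levenshtein` and `lcs_length` are chosen for illustration and do not come from the earlier articles.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance LD(a, b) with unit-cost insert, delete and substitute."""
    prev = list(range(len(b) + 1))          # distances from "" to b[:j]
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                # delete ca
                curr[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),   # match (0) or substitute (1)
            ))
        prev = curr
    return prev[-1]


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence LCS(a, b)."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr.append(prev[j - 1] + 1)            # extend a common run
            else:
                curr.append(max(prev[j], curr[j - 1]))  # skip one character
        prev = curr
    return prev[-1]


print(levenshtein("ggatcga", "gaattcagtta"))  # 5
print(lcs_length("ggatcga", "gaattcagtta"))   # 6
```

The two printed values match the LD and LCS figures the article quotes for this pair of strings further down.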

How can we measure the similarity between them?

Let S(A, B) denote the similarity between string A and string B. A reasonable similarity measure should satisfy the following properties.

Property 1: 0 ≤ S(A, B) ≤ 100%, where 0 means completely different and 100% means identical.

Property 2: S(A, B) = S(B, A)

None of the similarity formulas currently found on the Internet satisfies both properties.

Formula 1: S(A, B) = 1/(LD(A, B) + 1)

This formula satisfies Property 2.

When LD(A, B) = 0, S(A, B) = 100%. But whatever value LD(A, B) takes, S(A, B) > 0, so even two completely different strings never score 0. Property 1 is not satisfied.
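The behaviour of Formula 1 can be checked with a few lines of Python. This is only an illustrative sketch: `formula1` is a hypothetical helper that takes an already computed edit distance rather than the two strings.

```python
def formula1(ld: int) -> float:
    """Formula 1: S = 1 / (LD + 1), given a precomputed edit distance."""
    return 1.0 / (ld + 1)


# The score starts at 1.0 for identical strings and shrinks as LD grows,
# but it never reaches 0, however different the strings are.
for ld in (0, 1, 2, 10, 1000):
    print(ld, formula1(ld))
```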

 

Formula 2: S(A, B) = 1 - LD(A, B)/Len(A)

When Len(B) > Len(A), LD(A, B) can exceed Len(A), and then S(A, B) < 0. This violates Property 1.

Some people may say that the problem is solved by forcing S(A, B) = 0 whenever S(A, B) < 0.

The problem is that S(A, B) = 1 - LD(A, B)/Len(A), while S(B, A) = 1 - LD(B, A)/Len(B). When Len(A) < Len(B), S(A, B) < S(B, A). This violates Property 2.

Another example can be used to illustrate the problem.

A = "BC", B = "CD", C = "EF"

S(A, B) = 1 - LD(A, B)/Len(A) = 1 - 2/2 = 0

S(A, C) = 1 - LD(A, C)/Len(A) = 1 - 2/2 = 0

The similarity between A and B comes out the same as that between A and C. However, B is obviously closer to A than C is.
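Both failures of Formula 2 can be reproduced numerically. The sketch below assumes the distances are already known: the pair "a" / "bcd" is a made-up example with LD = 3, and LD = 5 with lengths 7 and 11 is the ggatcga / gaattcagtta pair the article uses later. The helper names `formula2` and `formula2_clamped` are hypothetical.

```python
def formula2(ld: int, len_a: int) -> float:
    """Formula 2: S(A, B) = 1 - LD(A, B) / Len(A)."""
    return 1.0 - ld / len_a


def formula2_clamped(ld: int, len_a: int) -> float:
    """Formula 2 with negative scores forced to 0."""
    return max(0.0, formula2(ld, len_a))


# Hypothetical pair A = "a", B = "bcd": LD = 3, Len(A) = 1.
print(formula2(3, 1))           # -2.0, outside the range required by Property 1

# Article pair: LD = 5, Len(A) = 7, Len(B) = 11.
# Clamping changes nothing here, and the two directions still disagree.
print(formula2_clamped(5, 7))   # S(A, B)
print(formula2_clamped(5, 11))  # S(B, A), a larger value
```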

 

Formula 3: S(A, B) = LCS(A, B)/Len(A)

This formula satisfies Property 1.

However, when Len(A) < Len(B) and LCS(A, B) > 0, S(A, B) = LCS(A, B)/Len(A) is greater than S(B, A) = LCS(A, B)/Len(B). This violates Property 2.

An example illustrates the problem:

A = "BC", B = "BCD", C = "BCEF"

S(A, B) = LCS(A, B)/Len(A) = 2/2 = 100%

S(A, C) = LCS(A, C)/Len(A) = 2/2 = 100%

The similarity between A and B comes out the same as that between A and C. However, B is obviously closer to A than C is.
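The same example can be worked through in Python to show both the tie and the asymmetry. This is a small sketch with a hypothetical helper `formula3` that takes a precomputed LCS length; the values used are LCS = 2 and lengths 2, 3 and 4 for "BC", "BCD" and "BCEF".

```python
def formula3(lcs: int, len_first: int) -> float:
    """Formula 3: S(X, Y) = LCS(X, Y) / Len(X), X being the first argument."""
    return lcs / len_first


# A = "BC", B = "BCD", C = "BCEF"; LCS(A, B) = LCS(A, C) = 2
s_ab, s_ba = formula3(2, 2), formula3(2, 3)
s_ac, s_ca = formula3(2, 2), formula3(2, 4)

print(s_ab, s_ac)  # both 1.0: the formula cannot tell B and C apart
print(s_ba, s_ca)  # both below 1.0: swapping the arguments changes the score
```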

 

All three formulas above can be found online. As the analysis shows, each has its own limitations.

 

Let's look at an example:

A = ggatcga, B = gaattcagtta, LD(A, B) = 5, LCS(A, B) = 6

Their alignment is:

A:GGA_TC_G__A

B:GAATTCAGTTA

In the alignment above, the columns where the two rows agree (shown in blue in the original) make up the LCS, and the remaining columns (shown in black) are the edits counted by LD.
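Since the colors do not survive in plain text, the two groups of columns can be recounted directly from the alignment rows. The sketch below assumes "_" marks a gap; `count_columns` is a hypothetical helper name.

```python
def count_columns(row_a: str, row_b: str) -> tuple:
    """Return (matching columns, other columns) of two equal-length alignment rows."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have the same length")
    # A gap "_" never equals a letter, so gaps count as non-matching columns.
    matches = sum(x == y for x, y in zip(row_a, row_b))
    return matches, len(row_a) - matches


matches, others = count_columns("GGA_TC_G__A", "GAATTCAGTTA")
print(matches, others, matches + others)  # 6 5 11
```

The 6 matching columns correspond to LCS(A, B) and the 5 remaining columns to LD(A, B), for 11 columns in total.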

Therefore, a new formula is proposed:

S(A, B) = LCS(A, B)/(LD(A, B) + LCS(A, B))

This formula overcomes the shortcomings of the three formulas above.

LD(A, B) + LCS(A, B) represents the combined length of strings A and B once they are aligned (11 columns in the example above), and this value is uniquely determined by A and B.

Note that LD(A, B) + LCS(A, B) and max(Len(A), Len(B)) are not always equal. For "BC" and "CD", for instance, LD + LCS = 2 + 1 = 3, while max(Len(A), Len(B)) = 2.
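The proposed formula can be tried out with the following Python sketch. It fills the edit-distance and LCS tables in one pass, assuming unit edit costs; the names `ld_and_lcs` and `similarity` are illustrative, and returning 1.0 for two empty strings is an assumption the article does not cover.

```python
def ld_and_lcs(a: str, b: str) -> tuple:
    """Return (LD(a, b), LCS(a, b)) using two dynamic-programming tables."""
    rows, cols = len(a) + 1, len(b) + 1
    ld = [[0] * cols for _ in range(rows)]
    lcs = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        ld[i][0] = i                      # delete all of a[:i]
    for j in range(cols):
        ld[0][j] = j                      # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            same = a[i - 1] == b[j - 1]
            ld[i][j] = min(ld[i - 1][j] + 1,
                           ld[i][j - 1] + 1,
                           ld[i - 1][j - 1] + (0 if same else 1))
            lcs[i][j] = (lcs[i - 1][j - 1] + 1 if same
                         else max(lcs[i - 1][j], lcs[i][j - 1]))
    return ld[-1][-1], lcs[-1][-1]


def similarity(a: str, b: str) -> float:
    """S(A, B) = LCS(A, B) / (LD(A, B) + LCS(A, B))."""
    ld, lcs = ld_and_lcs(a, b)
    if ld + lcs == 0:                     # both strings empty (assumption: identical)
        return 1.0
    return lcs / (ld + lcs)


print(similarity("ggatcga", "gaattcagtta"))  # 6 / 11, about 0.545
print(similarity("BC", "BCD"), similarity("BC", "BCEF"))
print(similarity("BC", "CD"), similarity("BC", "EF"))
```

On the earlier examples, "BCD" now scores higher against "BC" than "BCEF" does, "CD" scores higher than "EF", and swapping the arguments leaves the result unchanged because both LD and LCS are symmetric.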

 

 

 
