"Text Comparison Algorithm I -- LD Algorithm" described how to compute the edit distance.
"Text Comparison Algorithm II -- Needleman/Wunsch Algorithm" introduced how to compute the longest common subsequence.
Given strings A and B, LD(A, B) denotes their edit distance and LCS(A, B) denotes the length of their longest common subsequence (the common characters need not be contiguous).
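As a reminder of what the two quantities are, here is a minimal sketch of both in Python. It uses plain row-by-row dynamic programming and is not necessarily the implementation from the two earlier articles; the function names ld and lcs are chosen here for illustration only.

```python
def ld(a: str, b: str) -> int:
    """Edit distance: each insertion, deletion, or substitution costs 1."""
    prev = list(range(len(b) + 1))          # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def lcs(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr.append(prev[j - 1] + 1)            # extend a common subsequence
            else:
                curr.append(max(prev[j], curr[j - 1]))  # skip one character
        prev = curr
    return prev[-1]


print(ld("bc", "cd"), lcs("bc", "cd"))
```

Both functions are symmetric in their arguments, which matters for the similarity properties discussed next.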
How can we measure the similarity between them?
Let S(A, B) denote the similarity between string A and string B. A reasonable similarity measure should satisfy the following two properties.
Property 1: 0 ≤ S(A, B) ≤ 100%, where 0 means completely different and 100% means identical.
Property 2: S(A, B) = S(B, A).
None of the similarity formulas commonly found online satisfies both properties.
Formula 1: S(A, B) = 1 / (LD(A, B) + 1)
This formula satisfies Property 2, because LD(A, B) = LD(B, A).
When LD(A, B) = 0, S(A, B) = 100%. However, whatever value LD(A, B) takes, S(A, B) > 0, so completely different strings never score 0. Property 1 is not satisfied.
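The behaviour of Formula 1 can be checked numerically. The sketch below takes the edit distance as a plain number, so it does not depend on any particular LD implementation; the function name similarity_formula1 is made up for this illustration.

```python
def similarity_formula1(ld_value: int) -> float:
    """Formula 1: S = 1 / (LD + 1), with LD passed in as a number."""
    return 1 / (ld_value + 1)


# The score starts at 1.0 for LD = 0 and shrinks, but never reaches 0.
for distance in (0, 1, 2, 10, 1000):
    print(distance, similarity_formula1(distance))
```

For instance, two 2-character strings with no character in common have LD = 2 and still score 1/3.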
Formula 2: S(A, B) = 1 - LD(A, B) / Len(A)
When Len(B) > Len(A), LD(A, B) can exceed Len(A), which makes S(A, B) < 0. Property 1 is not satisfied.
Some may argue that this is solved by forcing S(A, B) = 0 whenever S(A, B) < 0.
The problem remains that S(A, B) = 1 - LD(A, B) / Len(A) while S(B, A) = 1 - LD(B, A) / Len(B). When Len(A) < Len(B) and LD(A, B) > 0, S(A, B) < S(B, A). Property 2 is not satisfied.
Another example illustrates the problem.
A = "bc", B = "cd", C = "ef"
S(A, B) = 1 - LD(A, B) / Len(A) = 1 - 2/2 = 0
S(A, C) = 1 - LD(A, C) / Len(A) = 1 - 2/2 = 0
The similarity between A and B equals the similarity between A and C. Yet B is clearly closer to A than C is, because B shares the character "c" with A while C shares nothing.
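The three problems with Formula 2 (negative values, asymmetry, and the tie in the example) can be reproduced with a short sketch. The edit distance here is an ordinary dynamic-programming version written for this illustration, the strings are taken in lowercase, and "efgh" is a made-up string that does not appear in the article.

```python
def ld(a: str, b: str) -> int:
    """Edit distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,
                            curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def similarity_formula2(a: str, b: str) -> float:
    """Formula 2: S(A, B) = 1 - LD(A, B) / Len(A); assumes a is non-empty."""
    return 1 - ld(a, b) / len(a)


print(similarity_formula2("bc", "efgh"))   # negative: LD exceeds Len(A)
print(similarity_formula2("bc", "bcd"),    # the two orders disagree
      similarity_formula2("bcd", "bc"))
print(similarity_formula2("bc", "cd"),     # tie from the example above
      similarity_formula2("bc", "ef"))
```

Clamping negative results to 0 would hide the first problem but leave the other two in place.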
Formula 3: S(A, B) = LCS(A, B) / Len(A)
This formula satisfies Property 1, because 0 ≤ LCS(A, B) ≤ Len(A).
However, when Len(A) < Len(B) and LCS(A, B) > 0, S(A, B) > S(B, A). Property 2 is not satisfied.
An example illustrates the problem.
A = "bc", B = "bcd", C = "bcef"
S(A, B) = LCS(A, B) / Len(A) = 2/2 = 100%
S(A, C) = LCS(A, C) / Len(A) = 2/2 = 100%
The similarity between A and B equals the similarity between A and C, and both are reported as 100% even though neither B nor C equals A. Yet B is clearly closer to A than C is.
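The same check can be run for Formula 3. The LCS routine below is a generic dynamic-programming sketch written for this illustration (it computes the longest common subsequence length), and the strings from the example are used in lowercase.

```python
def lcs(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity_formula3(a: str, b: str) -> float:
    """Formula 3: S(A, B) = LCS(A, B) / Len(A); assumes a is non-empty."""
    return lcs(a, b) / len(a)


print(similarity_formula3("bc", "bcd"), similarity_formula3("bc", "bcef"))  # tie
print(similarity_formula3("bc", "bcd"), similarity_formula3("bcd", "bc"))   # asymmetric
```

Because only Len(A) appears in the denominator, any extra characters in the second string go unnoticed.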
The above three formulas can be found online. From the above analysis, they all have their own limitations.
Let's look at an example:
A = GGATCGA, B = GAATTCAGTTA, LD(A, B) = 5, LCS(A, B) = 6
Their alignment is:
A:GGA_TC_G__A
B:GAATTCAGTTA
In this alignment, the six positions where A and B carry the same character make up the LCS part, and the remaining five positions (one mismatch and four gaps, written as _) make up the LD part.
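The counts can be read directly off the two alignment rows. The helper below, count_alignment, is a hypothetical name used only for this check; it assumes both rows have the same length and that no column holds a gap in both rows.

```python
def count_alignment(row_a: str, row_b: str) -> tuple[int, int]:
    """Return (matching columns, non-matching columns) of an alignment."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have equal length")
    matches = sum(1 for x, y in zip(row_a, row_b) if x == y)
    return matches, len(row_a) - matches


matches, edits = count_alignment("GGA_TC_G__A", "GAATTCAGTTA")
print(matches, edits)  # matching columns correspond to LCS, the rest to LD
```

Here the two counts add up to 11, the number of columns in the alignment.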
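This suggests a new formula:
S(A, B) = LCS(A, B) / (LD(A, B) + LCS(A, B))
This formula avoids the shortcomings of the three formulas above. It is symmetric because both LD and LCS are symmetric, it equals 100% when LD(A, B) = 0, and it equals 0 when LCS(A, B) = 0.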
LD(A, B) + LCS(A, B) can be read as the length of the two strings A and B once they are aligned, and its value is uniquely determined by A and B.
Note that LD(A, B) + LCS(A, B) and max(Len(A), Len(B)) are not always equal. For A = "bc" and B = "cd", LD + LCS = 2 + 1 = 3 while max(Len(A), Len(B)) = 2.
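Putting the pieces together, the new formula can be sketched as follows. The ld and lcs helpers are generic dynamic-programming versions written for this illustration, and returning 1.0 for two empty strings is an assumption made here, since the formula itself gives 0/0 in that case.

```python
def ld(a: str, b: str) -> int:
    """Edit distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def lcs(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if ca == cb
                        else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """S(A, B) = LCS(A, B) / (LD(A, B) + LCS(A, B))."""
    common, distance = lcs(a, b), ld(a, b)
    if common + distance == 0:
        return 1.0  # assumption: two empty strings count as identical
    return common / (distance + common)


print(similarity("ggatcga", "gaattcagtta"))               # 6 / (5 + 6)
print(similarity("bc", "cd"), similarity("bc", "ef"))      # no longer a tie
print(similarity("bc", "bcd"), similarity("bc", "bcef"))   # no longer a tie
print(ld("bc", "cd") + lcs("bc", "cd"), max(len("bc"), len("cd")))
```

On the earlier examples the formula now ranks "cd" above "ef" and "bcd" above "bcef" relative to "bc", and swapping the arguments leaves the score unchanged.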