A note on calculating edit distance

Source: Internet
Author: User

The question that has kept me confused is: what is the edit distance between ABC and CA?

I have asked many classmates and people online. The general view is: if the transposition of adjacent characters is counted as an atomic operation in the definition of edit distance, the answer is 2; otherwise, if adjacent transposition is not counted as an atomic operation, the answer is 3.

To make the problem clearer, two definitions of edit distance are given first.
1. Levenshtein distance (L distance below). This distance was defined by Levenshtein in 1965. It has three atomic operations: insertion, deletion, and substitution (for the source, see "Binary codes capable of correcting deletions, insertions, and reversals");

2. Damerau distance, after F. J. Damerau (D distance below). This distance was defined by Damerau in 1964. It has four atomic operations: insertion, deletion, substitution, and transposition of adjacent symbols (see "A technique for computer detection and correction of spelling errors");
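The first definition can be sketched in a few lines of Python. This is a minimal illustration of the standard dynamic-programming computation with the three atomic operations only; the function name `levenshtein` is chosen here for illustration and does not come from the papers.

```python
def levenshtein(source: str, target: str) -> int:
    """Edit distance using insertion, deletion and substitution only."""
    # previous[j] is the distance between the source prefix handled so far
    # and the first j characters of target.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # turning i source characters into "" takes i deletions
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # delete s_char
                current[j - 1] + 1,      # insert t_char
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


print(levenshtein("ABC", "CA"))  # 3
print(levenshtein("CA", "ABC"))  # 3
```

Without a transposition operation, ABC and CA are 3 apart in both directions.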

Differences between the two definitions:

1. The set of atomic operations of the L distance does not include the transposition of adjacent characters;

2. According to Wikipedia, the L distance can handle multiple edit errors, while the D distance can handle only a single edit error.

To sum up:

If the distance between ABC and CA is calculated as an L distance, the result is 3 (delete B, replace the A at the start of the original string with C, replace the C at the end with A), and nobody disputes this. If it is calculated as a D distance, the result should be 2 (delete B, then swap the A and C from the start and end of the original string, which are now adjacent).

Now the problem arises. Many books and papers, for example Kemal Oflazer's "Error-tolerant Finite-state Recognition with Applications to Morphological Analysis and Spelling Correction" and M. W. Du and S. C. Chang's "A model and a fast algorithm for multiple errors spelling correction", adopt the D definition of edit distance and then give the following formula:

Formula 1: the formula for calculating the edit distance given in the two papers above.

Applied to ABC and CA, this formula gives 3.
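Formula 1 itself is not reproduced in this copy of the text. Assuming it is the recurrence that the Wikipedia passage quoted later calls optimal string alignment (an assumption, not something confirmed against the two papers), it can be sketched as follows. The name `osa_distance` is made up for this example.

```python
def osa_distance(source: str, target: str) -> int:
    """Optimal string alignment: Levenshtein plus a restricted adjacent swap."""
    m, n = len(source), len(target)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete every character of the source prefix
    for j in range(n + 1):
        d[0][j] = j  # insert every character of the target prefix
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # A swap is allowed only when the last two characters of both
            # prefixes are each other's mirror image; nothing may be edited
            # between the two swapped characters.
            if (i > 1 and j > 1
                    and source[i - 1] == target[j - 2]
                    and source[i - 2] == target[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[m][n]


print(osa_distance("ABC", "CA"))  # 3
print(osa_distance("CA", "AC"))   # 1
```

The swap branch never fires for ABC and CA because A and C are not adjacent in ABC, so the recurrence falls back to the Levenshtein value of 3.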

At this point some people will say that the result 2 is obtained only by deleting the middle B first, which violates "sequential processing", so 2 is the wrong result. On this view, the string ABC must be processed in order: A first, then B, then C. The correct calculation would then be: delete A, change B to C, change C to A. However, an edit distance must be symmetric. That is, the edit distance from ABC to CA must equal the edit distance from CA to ABC. To change CA into ABC, the steps are: CA -> AC, then insert B to obtain ABC, which takes only 2 operations and does proceed from left to right. Therefore this argument does not hold. Besides, the definition of edit distance is only a mathematical abstraction of the real situation. It has little to do with program design issues such as "sequential streams".

This problem bothered me for a long time. Today I checked Wikipedia and finally worked out what happened. Roughly: the L and D definitions of edit distance are both sound, and each satisfies the three conditions that functional analysis requires of a distance. Later, people wanted to combine the two definitions into the Damerau-Levenshtein distance (D-L distance below). This would remove the limitation that the D definition can recognize only errors caused by a single edit operation, and also make up for the fact that the L definition has no adjacent transposition. Formula 1 above is intended to compute this D-L distance. However, the quantity that Formula 1 actually computes does not satisfy those three conditions: it violates the triangle inequality, so as a distance it is not mathematically rigorous. That is why the edit distance between ABC and CA is incorrectly calculated as 3. This error does not affect engineering applications, though, and the formula is convenient in practice, so it has been in use for a long time.

The relevant passage from Wikipedia is quoted below:

Let us calculate pair-wise distances between the strings TO, OT and OST using this algorithm. The distance between TO and OT is 1. The same for OT vs. OST. But the distance between TO and OST is 3, even though the strings can be made equal using one deletion and one transposition. Clearly, the algorithm does not compute precisely the value we want. Furthermore, the triangle inequality does not hold.

In reality this algorithm calculates the cost of the so-called optimal string alignment, which does not always equal the edit distance.
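The value the quoted passage says we want can be computed with the unrestricted form of the distance, which also allows edits between the two transposed characters. The sketch below follows the usual dynamic-programming scheme for that variant. The function and variable names are chosen for this example, and it is meant as an illustration rather than a reference implementation.

```python
def damerau_levenshtein(a: str, b: str) -> int:
    """Unrestricted distance with insert, delete, substitute, adjacent swap."""
    max_dist = len(a) + len(b)
    # The table is shifted by one in each axis: d[i + 1][j + 1] is the
    # distance between a[:i] and b[:j]; row 0 and column 0 are sentinels.
    d = [[max_dist] * (len(b) + 2) for _ in range(len(a) + 2)]
    for i in range(len(a) + 1):
        d[i + 1][1] = i
    for j in range(len(b) + 1):
        d[1][j + 1] = j
    last_row = {}  # last 1-based position in a where each character occurred
    for i in range(1, len(a) + 1):
        last_col = 0  # last 1-based position in b that matched a[i - 1]
        for j in range(1, len(b) + 1):
            row = last_row.get(b[j - 1], 0)
            col = last_col
            if a[i - 1] == b[j - 1]:
                cost = 0
                last_col = j
            else:
                cost = 1
            d[i + 1][j + 1] = min(
                d[i][j] + cost,   # substitution or match
                d[i + 1][j] + 1,  # insertion
                d[i][j + 1] + 1,  # deletion
                # swap, plus deleting/inserting whatever lies between
                d[row][col] + (i - row - 1) + 1 + (j - col - 1),
            )
        last_row[a[i - 1]] = i
    return d[len(a) + 1][len(b) + 1]


print(damerau_levenshtein("TO", "OT"))   # 1
print(damerau_levenshtein("OT", "OST"))  # 1
print(damerau_levenshtein("TO", "OST"))  # 2
print(damerau_levenshtein("ABC", "CA"))  # 2
```

With this variant the triangle inequality holds for the three example strings (2 <= 1 + 1), and ABC and CA are 2 apart in both directions.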

Reference: http://en.wikipedia.org/wiki/Damerau–Levenshtein_distance
