Analysis of text comparison algorithms (1): how to determine the maximum matching rate

Source: Internet
Author: User

Someone was looking for a text comparison algorithm. I happened to be on vacation recently, so I studied the problem and finally found a simple and effective algorithm, which I would like to share with you.

The algorithm itself is very simple, but explaining the ideas and principles behind it takes some effort. I plan to publish it in two parts (I go back to work tomorrow!), corresponding to the two main problems in text comparison:

1. How to determine the maximum matching rate;
2. How to determine the optimal matching path;

The algorithm is rooted in graph theory, but presenting the full theory would be too tedious, so I will not go through the entire line of thought. I will just explain the final result in detail. If you have any questions, please email me: opensw0001@gmail.com


1. First, assume there are two strings, left and right:
Left = "abcacadf"
Right = "bcxcadfesbabcaca"

To analyze the problem intuitively, the first step is to use a table to compare each element of left and right one by one:

In the figure, 1 indicates that an element of left matches an element of right, and 0 indicates that they do not match. The problem now is to find a path starting from the upper left corner of the table that meets the following requirements:
1. It passes through the maximum number of cells with a value of 1;
2. Each step moves exactly one cell to the right, down, or diagonally to the lower right;
3. If the current position is a cell with a value of 1, the only allowed move is one cell to the lower right;
4. The path ends when it reaches the right boundary or the lower boundary.
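
The comparison table described above can be reproduced with a short Python sketch. This is only an illustration, not the author's original figure: the function name match_table is made up here, and it assumes that rows follow left and columns follow right.

```python
def match_table(left, right):
    """Return a 0/1 table: one row per element of left, one column per element of right."""
    return [[1 if l == r else 0 for r in right] for l in left]


left = "abcacadf"
right = "bcxcadfesbabcaca"
table = match_table(left, right)

# Print the table with right as the header row and left as the first column.
print("  " + " ".join(right))
for ch, row in zip(left, table):
    print(ch + " " + " ".join(str(v) for v in row))
```

Each 1 in the printed grid is a matching point in the sense used by the four rules.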

2. This is really a constrained maximum-weight path search. However, since we are dealing with text matching, the problem is much simpler than the general graph-theory one. Because text is a linear stream, the matching relationships between two texts always form a very regular matrix, which is far easier to handle than the general graphs studied in graph theory.

What is your first thought? Iteration and recursion, right? Don't worry, it is not that complicated. Let's analyze the problem first and then decide how to proceed.

We first mark by hand, for each matching point, the maximum number of matching points on a path from that point to the boundary, as shown in the figure:

For a small table like this you can do it by hand; on a limited example, the human brain still beats the computer.

3. Now let's analyze it. The basic idea is mathematical induction.

Needless to say, a cell on the boundary can yield at most one matching point.

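Take any cell in the table and denote it by (L, R). According to the rules above, it has three adjacent regions: A to the lower right, entered at (L + 1, R + 1); B to the right, entered at (L, R + 1); and C below, entered at (L + 1, R).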

We use N(L, R) to represent "the maximum number of matching points that can be obtained when comparison starts from the Lth element of left and the Rth element of right". This statement is a bit hard to understand. In terms of the earlier path-finding formulation, N(L, R) can also be read as: "starting from the cell in row L, column R, the maximum number of cells with a value of 1 that a path satisfying all four conditions can pass through".

Because the next step from (L, R) must lead into one of regions A, B, and C, and if (L, R) is a matching point the path can only enter region A; conversely, if the path enters B or C, (L, R) is certainly not a matching point. Therefore, we get:
N(L, R) = max(V(L, R) + N(region A), N(region B), N(region C))
Here V(L, R) is the value of cell (L, R): 0 means the cell is not a matching point, and 1 means it is.

The maximum number of matching points in a region is the maximum number obtainable from the region's entry point, that is, N(region[(a, b), (c, d)]) = N(a, b).
Here region[(a, b), (c, d)] denotes the rectangular region whose opposite corners are (a, b) and (c, d), with (a, b) as the entry point.

Then the formula above becomes:

N(L, R) = max(V(L, R) + N(region A), N(region B), N(region C))
        = max(N(L + 1, R + 1) + V(L, R), N(L, R + 1), N(L + 1, R))
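
The recurrence can also be checked with a direct translation into Python. The sketch below is a minimal illustration under two assumptions: indices are 0-based, and any cell outside the table counts as 0. The name max_matches_table is hypothetical.

```python
def max_matches_table(left, right):
    """Fill the whole N table from the bottom-right corner toward the top-left."""
    m, n = len(left), len(right)
    # One extra row and column of zeros stand in for "outside the table".
    N = [[0] * (n + 1) for _ in range(m + 1)]
    for L in range(m - 1, -1, -1):        # bottom to top
        for R in range(n - 1, -1, -1):    # right to left
            V = 1 if left[L] == right[R] else 0
            N[L][R] = max(N[L + 1][R + 1] + V, N[L][R + 1], N[L + 1][R])
    return N


N = max_matches_table("abcacadf", "bcxcadfesbabcaca")
print(N[0][0])
```

N[0][0] is the maximum number of matching points for the whole table, and the rest of N can be compared cell by cell with a hand-marked table.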

We can verify this in Excel: set the formula of cell L4 to max(L5, M5 + B4, M4), then copy it to the other cells until the range matches the earlier matrix. The result is as follows:

You can compare it with the results of manual analysis (the rightmost part). You can see that the results are exactly the same.

Yes! We are close to success! Easy! Now for the coding problem.

To turn the method above into a program, one more issue remains: initialization. This is very simple, as the Excel calculation shows. In the cell formula above, the boundary cells reference blank cells, and Excel treats a blank cell as 0 in calculations, so everything outside the table can be initialized to 0.

From the analysis above, we know:
1. The loops should run from right to left and from bottom to top;
2. The value of each cell only needs to be calculated once;
3. Calculating N(L, R) requires three other values.
Therefore, the program uses an array to cache the values of N(L + 1, R) and N(L + 1, R + 1), and a temporary variable to cache the value of N(L, R + 1).

Assume that left has m elements and right has n elements. The time complexity of this program is then O(m * n), and the space complexity is O(max(m, n)).
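
One possible way to arrange this caching in Python is sketched below. It is not the author's original program: the function name max_matches and the 0-based indexing are assumptions made for illustration. A single list holds the row below the current one, and one temporary variable holds the value to the right.

```python
def max_matches(left, right):
    """Maximum number of matching points, keeping only one row of N in memory."""
    m, n = len(left), len(right)
    # row[R] holds N(L + 1, R) for the row below; row[n] is the 0 outside the table.
    row = [0] * (n + 1)
    for L in range(m - 1, -1, -1):        # bottom to top
        prev = 0                          # N(L, R + 1); 0 beyond the right boundary
        for R in range(n - 1, -1, -1):    # right to left
            v = 1 if left[L] == right[R] else 0
            # row[R + 1] is still N(L + 1, R + 1), and row[R] is N(L + 1, R).
            cur = max(row[R + 1] + v, prev, row[R])
            # N(L + 1, R + 1) is no longer needed, so store N(L, R + 1) in its place.
            row[R + 1] = prev
            prev = cur
        row[0] = prev                     # N(L, 0)
    return row[0]


print(max_matches("abcacadf", "bcxcadfesbabcaca"))
```

Because only one row is kept, the list length follows the length of right. Passing the shorter string as right would keep the list as small as possible.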

Well, that completes the calculation of the maximum number of matches between left and right. It's easy, isn't it?

If you want to build a file comparison tool, you also need to determine the optimal matching path. That will be covered in the next part. The algorithm is also relatively simple, but it is late now, and I have to get up early tomorrow!
