Objective
I have recently been working on search performance optimization at my company. Before I came across this algorithm, I did not think the retrieval system I am responsible for had any room left for improvement. But this algorithm is really impressive: it improved the performance of the whole service by 50%, so I have to share it with you. I actually wrote about this algorithm on the blog a while ago, but not in detail, so today it gets a post of its own. To be honest, my math is only average and algorithm proofs are not my strength, so the proofs in this article are just my own reasoning built on top of the authors' work, and they are not complete. Please bear with me! Readers who enjoy this kind of thinking are welcome to fill in the gaps. If this post gets the discussion started, I will be content.
Back to the point: our search service uses the minimum edit distance algorithm. The algorithm itself has quadratic time complexity, and few posts mention algorithms with a lower complexity. But I stumbled upon a more powerful one, the column partitioning algorithm, which doubled the performance of this already excellent algorithm. Let's get to it.
Column partitioning algorithm
This algorithm is rather hard to understand. It comes from the following paper: W. I. Chang and J. Lampe, "Theoretical and empirical comparisons of approximate string matching algorithms", in Proceedings of the 3rd Annual Symposium on Combinatorial Pattern Matching, number 664 in Lecture Notes in Computer Science, pages 175-184, Springer-Verlag, 1992. So we first need to establish some common ground.
Edit matrix: The minimum edit distance is computed by filling a matrix with a dynamic programming algorithm. If you know the algorithm you know the matrix, so I will not repeat it. But our edit matrix has one special feature: its first row is all 0. The benefit is that any substring of the text string T whose edit distance to the pattern string P is below a fixed threshold will be found.
For example, text string T = annealing, pattern string P = annual:
Note that the first row is 0, which is the main difference from the classic minimum edit distance; the rest of the recurrence is identical.
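To make the zero first row concrete, here is a minimal reference sketch of the plain dynamic programming version (not the column partitioning algorithm). The function name `min_substring_distance` is made up for this illustration, and the code assumes ordinary NUL-terminated strings. It keeps a single column in memory and returns the smallest value seen in the last row.

```c
#include <stdlib.h>
#include <string.h>

/* Smallest edit distance between pattern p and any substring of text t.
 * Same recurrence as the classic edit distance, except that row 0 is all
 * zeros, so a match may start at any position of t.
 * Returns -1 if memory allocation fails. */
int min_substring_distance(const char *p, const char *t) {
    size_t m = strlen(p), n = strlen(t);
    int *col = malloc((m + 1) * sizeof *col);   /* one column, rows 0..m */
    if (col == NULL) return -1;

    for (size_t j = 0; j <= m; ++j) col[j] = (int)j;  /* column 0: D[j][0] = j */
    int best = col[m];

    for (size_t i = 1; i <= n; ++i) {
        int diag = col[0];   /* D[j-1][i-1] for j = 1 */
        col[0] = 0;          /* D[0][i] = 0: the zero first row */
        for (size_t j = 1; j <= m; ++j) {
            int left = col[j];                          /* D[j][i-1] */
            int v = diag + (p[j - 1] != t[i - 1]);      /* match or substitute */
            if (left + 1 < v) v = left + 1;             /* skip t[i-1] */
            if (col[j - 1] + 1 < v) v = col[j - 1] + 1; /* skip p[j-1] */
            col[j] = v;
            diag = left;
        }
        if (col[m] < best) best = col[m];   /* last row of column i */
    }
    free(col);
    return best;
}
```

For the example above, "annual" differs from the substring "anneal" of "annealing" by a single substitution, so this function should report a distance of 1.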
Diagonal rule: Along any diagonal running to the lower right, the edit matrix is non-decreasing, and adjacent entries differ by at most 1.
Row and column rule: Two adjacent entries in the same row or in the same column differ by at most 1.
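Both rules are easy to check by brute force on small inputs. The sketch below is only a sanity check, not part of the algorithm. The names `fill_matrix`, `rules_hold` and the size limit `MAXLEN` are hypothetical and chosen for this example.

```c
#include <string.h>

#define MAXLEN 64   /* hypothetical size limit, for this sketch only */

/* Fills D[j][i] (j = row over p, i = column over t) with a zero first row. */
static void fill_matrix(const char *p, const char *t,
                        int D[MAXLEN + 1][MAXLEN + 1]) {
    size_t m = strlen(p), n = strlen(t);
    for (size_t i = 0; i <= n; ++i) D[0][i] = 0;
    for (size_t j = 1; j <= m; ++j) D[j][0] = (int)j;
    for (size_t j = 1; j <= m; ++j)
        for (size_t i = 1; i <= n; ++i) {
            int v = D[j - 1][i - 1] + (p[j - 1] != t[i - 1]);
            if (D[j - 1][i] + 1 < v) v = D[j - 1][i] + 1;
            if (D[j][i - 1] + 1 < v) v = D[j][i - 1] + 1;
            D[j][i] = v;
        }
}

/* Returns 1 if both rules hold everywhere, 0 if one is violated,
 * -1 if an input is longer than MAXLEN. */
int rules_hold(const char *p, const char *t) {
    static int D[MAXLEN + 1][MAXLEN + 1];
    size_t m = strlen(p), n = strlen(t);
    if (m > MAXLEN || n > MAXLEN) return -1;
    fill_matrix(p, t, D);
    for (size_t j = 0; j <= m; ++j)
        for (size_t i = 0; i <= n; ++i) {
            if (j < m && i < n) {   /* diagonal rule: step is 0 or 1 */
                int d = D[j + 1][i + 1] - D[j][i];
                if (d < 0 || d > 1) return 0;
            }
            if (j < m) {            /* neighbours in the same column */
                int d = D[j + 1][i] - D[j][i];
                if (d < -1 || d > 1) return 0;
            }
            if (i < n) {            /* neighbours in the same row */
                int d = D[j][i + 1] - D[j][i];
                if (d < -1 || d > 1) return 0;
            }
        }
    return 1;
}
```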
Observing the edit distance matrix, we find the following fact: each column is made up of several runs of consecutive numbers. So we divide each column of the edit matrix into sequences of consecutive values, as shown in the figure:
Each red box is one sequence, and the values inside a sequence are consecutive.
Definition of sequence-δ: For each element D[j][i] of the edit matrix (j is the row, i is the column), if j - D[j][i] = δ, we say that D[j][i] belongs to sequence-δ on column i. We also observe that j - D[j][i] is non-decreasing as j increases. As shown in the following figure:
End position of sequence-δ: Every sequence has a start position and an end position. The end position of sequence-δ is j if j is the smallest row index such that D[j+1][i] belongs to some sequence-ε with ε > δ (that is, j + 1 - D[j+1][i] > δ).
Sequences of length 0: With the definition above, the values of δ that occur on a column are not necessarily contiguous; some values may be missing. So we define sequences of length 0: when D[j+1][i] < D[j][i], we artificially insert a sequence-(δ+1) of length 0 between sequence-δ and sequence-(δ+2). As shown in the following figure:
With this definition, we can partition every column of the edit matrix into sequences, each of which is a run of consecutive numbers.
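As an illustration of these definitions, the sketch below extracts the partition of one already computed column as a list of end positions. The function name `column_partition` is hypothetical. It assumes `col[0] == 0`, which holds here because the first row of the matrix is all 0, and that the caller provides room for m + 1 entries in `ends`.

```c
/* Writes the end positions of sequence-0, sequence-1, ... of one column
 * into ends[] and returns how many sequences there are.
 * col[0..m] is one column of the edit matrix (col[0] must be 0).
 * A zero-length sequence shares its end position with the sequence
 * before it. */
int column_partition(const int *col, int m, int *ends) {
    int count = 0;
    int delta = 0;   /* row 0 has delta = 0 - col[0] = 0 */
    for (int j = 0; j < m; ++j) {
        int next = (j + 1) - col[j + 1];   /* delta of the row below */
        /* Every sequence from delta up to next - 1 ends at row j. When next
         * equals delta + 2, the second one is the zero-length
         * sequence-(delta+1). */
        for (; delta < next; ++delta) ends[count++] = j;
    }
    ends[count++] = m;   /* the last sequence ends at the bottom row */
    return count;
}
```

For example, a column holding 0, 0, 1, 2, 2, 1, 2 (rows 0 to 6) has δ values 0, 1, 1, 1, 2, 4, 4, so the end positions are 0, 3, 4, 4, 6: the repeated 4 is the zero-length sequence-3 inserted where the value drops from 2 to 1.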
Why go through all these definitions? If we can derive the partition of the next column directly from the partition of the previous column, we save a lot of computation. After all, the numbers inside each sequence are consecutive, which means any entry of the edit matrix can be obtained with a constant-time addition instead of running the dynamic programming recurrence for the minimum edit distance.
Now comes the key point: the derivation rules. Please pay full attention! We split the rules according to whether the length of sequence-δ is zero. Because one of the corollaries is too cumbersome to put into words, I drew a diagram:
Now get ready to burn some brain cells.
Corollary 1: If sequence-δ on column i has length 0 and end position j, then sequence-δ on column i+1 has end position j+1.
Proof: From the premise we know that δ = j - D[j][i] + 1 (recall the gap in the δ values described earlier: we inserted an intermediate value whose sequence has length 0).
Observing the edit matrix, we find the following two facts:
Fact 1: D[j+1][i+1] = D[j][i]. (Check it on the matrix for yourself. It also follows from the rules above: a zero-length sequence means D[j+1][i] = D[j][i] - 1, so D[j+1][i+1] <= D[j+1][i] + 1 = D[j][i], while the diagonal rule gives D[j+1][i+1] >= D[j][i].)
Fact 2: D[j+2][i+1] <= D[j][i].
By Fact 1, D[j+1][i+1] does belong to sequence-δ, because j + 1 - D[j+1][i+1] = j + 1 - D[j][i] = δ.
By Fact 2, j + 2 - D[j+2][i+1] >= j + 2 - D[j][i] = δ + 1 > δ, so sequence-δ on column i+1 has end position j+1.
This completes the proof of Corollary 1.
Corollary 2: The statement in words is omitted here; see the figure.
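For readers without the figure, the sketch below restates the case analysis that the code at the end of this post applies to a sequence of non-zero length. It is read off that code, not quoted from the paper, so treat it as my interpretation. The function name `next_end_nonzero` and its parameters are made up for this illustration.

```c
/* End position on column i+1 of a sequence that, on column i, has
 * length len >= 1 and end position end.
 *   pos[0..npos-1]: ascending 1-based positions in the pattern of the
 *                   text character that labels column i+1 (npos may be 0)
 *   next_len:       length of the following sequence on column i,
 *                   or 0 if it has length 0 or does not exist */
int next_end_nonzero(int end, int len, const int *pos, int npos, int next_len) {
    int b = end - len + 2;   /* first pattern row that can cut the sequence */
    int e = end + 1;         /* last such row */
    for (int k = 0; k < npos; ++k) {
        if (pos[k] >= b) {
            if (pos[k] <= e) return pos[k] - 1;  /* a match ends it one row above */
            break;                               /* first candidate is too far down */
        }
    }
    /* No match in range: the sequence grows by one row only if the
     * following sequence on column i has non-zero length. */
    return next_len > 0 ? end + 1 : end;
}
```

With P = annual, the column labelled 'a' has sequences ending at rows 0 and 6. For the next column, labelled 'n' (positions 2 and 3 in P), the first sequence moves its end to row 1, and the second one is cut by the match at row 2 and also ends at row 1, which makes it a zero-length sequence.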
Proof:
Let the length of this sequence be L. Except for the first sequence of each column, the first value of a sequence is less than or equal to the value just above it in the same column, so D[j-L+1][i] <= D[j-L][i].
By the diagonal rule (the edit matrix is non-decreasing along a diagonal), D[j-L+1][i+1] >= D[j-L][i].
Combining the two gives D[j-L+1][i+1] >= D[j-L+1][i].
We also know that the end position of the current sequence-δ is j, which means D[j+1][i] <= D[j][i]. Again by the diagonal rule, D[j+2][i+1] <= D[j+1][i] + 1 <= D[j][i] + 1.
Now the most exciting step. The values of the current sequence-δ on column i are consecutive: if the first edit distance is a, then the last one is a + L - 1.
From the derivation above, D[j-L+1][i+1] >= a and D[j+2][i+1] <= (a + L - 1) + 1 = a + L, and the span from row j-L+1 to row j+2 covers (j+2) - (j-L+1) + 1 = L + 2 rows. It follows that rows j-L+1 to j+2 on column i+1 cannot all lie in one sequence; otherwise D[j+2][i+1] >= a + (L + 2) - 1 = a + L + 1, which contradicts the bound just derived. Therefore some sequence on column i+1 must terminate between row j-L+1 and row j+2, which cancels one of the increments.
One question remains: must the end position of sequence-δ on column i+1 lie between j-L+1 and j+1? We have to prove it.
Proof:
Because δ = j - D[j][i] = (j-L+1) - D[j-L+1][i] >= (j-L+1) - D[j-L+1][i+1], the end position of sequence-δ on column i+1 must be j-L+1 or later.
Because j + 1 - D[j+1][i] > δ and, by the diagonal rule, D[j+2][i+1] <= D[j+1][i] + 1, we have j + 2 - D[j+2][i+1] >= j + 2 - (D[j+1][i] + 1) = j + 1 - D[j+1][i] > δ. So the end position of sequence-δ on column i+1 must be before j+2, that is, between j-L+1 and j+1.
As for the case analysis that follows in Corollary 2, I have not proved it. The authors of the paper dismiss it in one breezy sentence and leave the proof out, yet it used up all my brain cells. So if any reader can prove the rest of Corollary 2, please leave me a message; I would like to learn from it too.
What is the time complexity of this algorithm? The authors use a heuristic argument to show that it is about $O(mn/\sqrt{b})$, where m and n are the lengths of the two strings and b is the size of the character set.
Code implementation
Now for the implementation. Let me first summarize the steps, otherwise it is easy to fall into a pit.
- The first column of the edit matrix always consists of exactly one sequence.
- For each new column, traverse the sequences of the previous column and compute the partition of the next column according to Corollary 1 and Corollary 2.
- If the previous column has been fully traversed but some elements of the next column are still unassigned, that is fine: the remaining elements of the next column form one new sequence.
- Preprocess a table that records, for each character, the positions at which it occurs in P. You can index it directly with a hash (preferably the raw ASCII code); if a character occurs at several positions, chain them. When computing a column, walk the chain from the front until the first position that satisfies the condition; this turns out to be surprisingly fast. Use a map for the lookup as little as possible, or not at all; in my tests it was quite slow.
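The last step can be sketched on its own. This is a minimal illustration of a position table indexed directly by byte value, with each character's positions stored in ascending order. The names `PosTable`, `pos_table_build`, `pos_table_first_at_or_after` and the limits `ALPHABET` and `MAX_PAT` are assumptions for this example, not taken from the paper.

```c
#include <string.h>

#define ALPHABET 256   /* one slot per byte value */
#define MAX_PAT  200   /* assumed upper bound on the pattern length */

typedef struct {
    int pos[ALPHABET][MAX_PAT];  /* ascending 1-based positions in p */
    int cnt[ALPHABET];           /* number of positions stored per byte */
} PosTable;

/* Builds the table for pattern p.
 * Returns 0 on success, -1 if p is longer than MAX_PAT. */
int pos_table_build(PosTable *tb, const char *p) {
    size_t m = strlen(p);
    if (m > MAX_PAT) return -1;
    memset(tb->cnt, 0, sizeof tb->cnt);
    for (size_t k = 0; k < m; ++k) {
        unsigned char c = (unsigned char)p[k];
        tb->pos[c][tb->cnt[c]++] = (int)k + 1;
    }
    return 0;
}

/* First position >= from at which byte c occurs in p, or -1 if none.
 * Walks the chain from the front, as described in the list above. */
int pos_table_first_at_or_after(const PosTable *tb, unsigned char c, int from) {
    for (int k = 0; k < tb->cnt[c]; ++k)
        if (tb->pos[c][k] >= from) return tb->pos[c][k];
    return -1;
}
```

The table takes roughly 200 KB with these limits, so it is better kept in static or heap storage than on a small stack. Indexing by byte value avoids both the hash computation and the restriction to lowercase letters.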
Next comes the thing I least like doing: pasting the code. It is ugly.
    #include <string.h>

    // Limits of this implementation: p and t contain only 'a'..'z',
    // and p has fewer than 200 characters.

    // First position >= pos at which character ch occurs in p, or -1.
    // static inline, so that a C compiler never needs an external definition.
    static inline int loc(int find[][200], int *len, int ch, int pos) {
        for (int i = 0; i < len[ch]; ++i) {
            if (find[ch][i] >= pos) return find[ch][i];
        }
        return -1;
    }

    int new_column_partition(char *p, char *t) {
        int len_p = strlen(p);
        int len_t = strlen(t);
        int find[26][200];
        int len[26] = {0};
        int part[200];  // end position of each sequence
        // Build the loc table for quick queries (1-based positions in p)
        for (int i = 0; i < len_p; ++i) {
            find[p[i] - 'a'][len[p[i] - 'a']++] = i + 1;
        }
        int pre_cn = 0, next_cn = 1, min_v = len_p;
        part[0] = len_p;  // column 0 is one single sequence
        for (int i = 0; i < len_t; ++i) {
            // number of sequences in the previous column
            pre_cn = next_cn;
            next_cn = 0;
            int l = part[0] + 1;
            int b = 1;
            int e = l;
            int tmp;
            int tmp_value = 0;
            int pre_v = part[0];
            // sequence 0 of the previous column always has length >= 1
            if (len[t[i] - 'a'] > 0 && (tmp = loc(find, len, t[i] - 'a', b)) != -1 && tmp <= e) {
                part[next_cn++] = tmp - 1;
            } else if (pre_cn >= 2 && part[1] - part[0] != 0) {
                part[next_cn++] = part[0] + 1;
            } else {
                part[next_cn++] = part[0];
            }
            // value at the tail of the first sequence of this column
            tmp_value = part[0];
            // traverse the remaining sequences
            for (int j = 1; j < pre_cn && part[next_cn - 1] < len_p; ++j) {
                // ends of sequences j and j-1 in the previous column
                int x = part[j], y = pre_v;
                pre_v = part[j];
                l = x - y;
                if (l == 0) {
                    part[next_cn++] = x + 1;
                } else {
                    b = x - l + 2;
                    e = x + 1;
                    if (b <= len_p && len[t[i] - 'a'] > 0 && (tmp = loc(find, len, t[i] - 'a', b)) != -1 && tmp <= e) {
                        part[next_cn++] = tmp - 1;
                    } else if (j + 1 < pre_cn && part[j + 1] - x != 0) {
                        part[next_cn++] = x + 1;
                    } else {
                        part[next_cn++] = x;
                    }
                }
                l = part[j] - part[j - 1];
                if (l == 0) {
                    // The new sequence has length 0, so the next sequence
                    // starts 1 below the tail value of the previous one
                    tmp_value -= 1;
                } else {
                    tmp_value += l - 1;
                }
            }
            if (part[next_cn - 1] != len_p) {
                // the remaining rows form one new sequence
                part[next_cn++] = len_p;
                tmp_value += len_p - part[next_cn - 2] - 1;
                if (tmp_value < min_v) {
                    min_v = tmp_value;
                }
            } else {
                min_v = min_v < tmp_value ? min_v : tmp_value;
            }
        }
        return min_v;
    }
Conclusion
After this algorithm went into production the effect was very obvious; compare the following.
- CPU before optimization:
- CPU after optimization:
My ability is limited and the proofs are incomplete. Interested readers can go straight to the original paper. Feel free to get in touch so we can improve together.