String Matching Algorithms (3): The Magic of Bitwise Operations -- KR and SO

Last Update:2018-12-05 Source: Internet

Author: User

Tags rehash

Bitwise operations can often do remarkable things. For example, how do you swap two numbers without a temporary variable? Someone who has never met this kind of problem is unlikely to think of the answer. If we use Go as a metaphor, bitwise operations are the "tesuji" (skillful tactical moves) of programming. Bit-level storage gives the best possible space utilization, and because the data is packed, speed improves markedly as well, thanks to direct CPU hardware support. For example, shifting an ordinary array takes O(n) time, whereas shifting a bit vector that fits in one machine word takes a single instruction.

KR Algorithm
The KR (Karp-Rabin) algorithm is usually introduced as a hashing algorithm. In my opinion, the hash is just a guise: the basic steps are the same as in the brute-force method, except that hash values are compared before the characters are. If the hash cannot be computed efficiently, though, this "improvement" may be no improvement at all, because a hash has to be computed before every comparison. To turn a character-by-character comparison into a comparison of two integers, KR treats a string of length m as an integer written in base 2, with each character acting as one digit. Once this integer has been computed for the first window, each shift of the window only needs to remove the highest digit and append a new lowest digit to obtain the new hash value. But when m is large, the integer exceeds the largest value the machine can hold. What then? Don't worry: take it modulo the size of the machine's integer range, and the properties of modular arithmetic make everything work out. Better still, because the modulus is the machine's own integer range, the wraparound of integer arithmetic performs the reduction for free, so no explicit modulo step is needed. This is the code of the KR algorithm:

#include <string.h>

/* OUTPUT(j) is assumed to be defined elsewhere; it reports a match at j. */
#define REHASH(a, b, h) ((((h) - (a) * d) << 1) + (b))

void KR(char *x, int m, char *y, int n) {
   unsigned int d, hx, hy;   /* unsigned, so overflow wraps around */
   int i, j;

   /* Preprocessing */
   /* computes d = 2^(m-1) with the left-shift operator */
   for (d = i = 1; i < m; ++i)
      d = (d << 1);

   for (hy = hx = i = 0; i < m; ++i) {
      hx = ((hx << 1) + x[i]);
      hy = ((hy << 1) + y[i]);
   }

   /* Searching (reads y[n] on the last step, so y must be NUL-terminated) */
   j = 0;
   while (j <= n - m) {
      if (hx == hy && memcmp(x, y + j, m) == 0)
         OUTPUT(j);
      hy = REHASH(y[j], y[j + m], hy);
      ++j;
   }
}
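The listing above relies on an external OUTPUT macro. As a minimal self-contained sketch of the same rolling-hash idea, the variant below collects match positions into a caller-supplied array instead. The names kr_search, window_hash, out and max_out are assumptions made for this illustration and are not part of the original code; unsigned arithmetic is used so that overflow wraps around and serves as the implicit modulo.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hash of the m characters starting at s, read as a base-2 number.
   Unsigned overflow wraps around, which acts as the implicit modulo. */
static unsigned int window_hash(const char *s, size_t m) {
    unsigned int h = 0;
    for (size_t i = 0; i < m; ++i)
        h = (h << 1) + (unsigned char)s[i];
    return h;
}

/* Hypothetical helper: writes up to max_out match positions to out
   and returns the total number of matches of x[0..m) in y[0..n). */
size_t kr_search(const char *x, size_t m, const char *y, size_t n,
                 size_t *out, size_t max_out) {
    assert(m > 0);
    if (m > n)
        return 0;

    unsigned int d = 1;                 /* d = 2^(m-1), wrapped */
    for (size_t i = 1; i < m; ++i)
        d <<= 1;

    unsigned int hx = window_hash(x, m);
    unsigned int hy = window_hash(y, m);
    size_t count = 0;

    for (size_t j = 0; ; ++j) {
        /* Equal hashes are only a hint; memcmp confirms the match. */
        if (hx == hy && memcmp(x, y + j, m) == 0) {
            if (count < max_out)
                out[count] = j;
            ++count;
        }
        if (j + m >= n)
            break;                      /* that was the last window */
        /* Drop the leading character y[j], append y[j + m]. */
        hy = ((hy - (unsigned char)y[j] * d) << 1) + (unsigned char)y[j + m];
    }
    return count;
}
```

For example, searching for "aba" in "ababa" with this sketch reports positions 0 and 2. Unlike the original listing, it stops before reading past the last window, so the text does not need to be NUL-terminated.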

As we can see, KR's preprocessing is O(m). It has always seemed to me that this preprocessing captures nothing about the structure of the pattern itself, and as a result the search phase is still O(mn) in the worst case, although this rarely shows in practice. Search for "aaaaaaaaaaaaaaaaaaaaaaaaa" in a text made up entirely of a's and you will see how slow KR can be: every window has the same hash as the pattern, so every window is compared in full.
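To make the worst case concrete, here is a small sketch that counts character comparisons instead of reporting matches. It assumes each hash hit is verified one character at a time (standing in for memcmp); the function name kr_count_compares is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of character comparisons a KR-style search
   makes when every hash hit is verified one character at a time. */
size_t kr_count_compares(const char *x, size_t m, const char *y, size_t n) {
    assert(m > 0);
    if (m > n)
        return 0;

    unsigned int d = 1, hx = 0, hy = 0;
    for (size_t i = 1; i < m; ++i)
        d <<= 1;                        /* d = 2^(m-1), wrapped */
    for (size_t i = 0; i < m; ++i) {
        hx = (hx << 1) + (unsigned char)x[i];
        hy = (hy << 1) + (unsigned char)y[i];
    }

    size_t compares = 0;
    for (size_t j = 0; ; ++j) {
        if (hx == hy) {
            /* Verification loop, counting each character comparison. */
            for (size_t k = 0; k < m; ++k) {
                ++compares;
                if (x[k] != y[j + k])
                    break;
            }
        }
        if (j + m >= n)
            break;                      /* that was the last window */
        hy = ((hy - (unsigned char)y[j] * d) << 1) + (unsigned char)y[j + m];
    }
    return compares;
}
```

With the pattern "aaaa" and a text of ten a's, all 7 windows hash equal and each costs 4 comparisons, 28 in total, which is m(n - m + 1). A pattern such as "abcd" on the same text never passes the hash test, so no character comparison happens at all.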

In general, KR is somewhat better than the brute-force algorithm: the expected number of character comparisons is O(m + n).

Shift-Or Algorithm
The Shift-Or algorithm pushes bitwise operations to the limit. Its biggest drawback is that the pattern cannot be longer than the machine word: on a typical 32-bit machine the word length is 32, so it can only match patterns of at most 32 characters. The advantage is that matching takes O(n) time, as fast as an automaton, while preprocessing takes only O(m + σ) time and space, far less than building an automaton.

Let's see how it cleverly manages to look at each text character only once. Suppose we have a promotion system with m levels. In every round a new person enters at level 0, and everyone already in the system is tested: those who pass move up one level, and those who fail are eliminated. Anyone who reaches the top level has passed m tests in a row, and that is the person we are looking for. The idea of the Shift-Or algorithm is exactly this promotion rule. The test is whether the pattern character at your level equals the current text character; reaching the top level means that m consecutive pattern positions agreed with m consecutive text characters, which is a match.

With that understood, a question arises: doesn't checking which positions agree with the text character take m comparisons per text character, making the whole algorithm O(mn)? This is where bit operations come in. The idea is naive, but bit operations make it efficient. In advance, I compute where each character of the alphabet occurs in the pattern and record it as bits: a position where the character occurs is marked 0, and a position where it does not is marked 1. This takes σ integers in total. Likewise, one integer represents the promotion state: a level occupied by someone is marked 0, and an empty level is marked 1. Promoting the whole system is then a single shift, and testing every level at once is a single OR of the state with the integer for the current text character, so the whole algorithm is O(n). That is where the name Shift-Or comes from. It may seem odd that 0 and 1 are used the opposite way round from usual, since we normally like 1 for "present" and 0 for "absent". There is no choice here, though, because a left shift brings in a 0. With this in mind, the code is much easier to understand:

#define WORDSIZE ((int) (sizeof(int) * 8))
#define ASIZE 256

/* OUTPUT(j) and error(msg) are assumed to be defined elsewhere. */
unsigned int preso(const char *x, int m, unsigned int S[]) {
   unsigned int j, lim;
   int i;

   for (i = 0; i < ASIZE; ++i)
      S[i] = ~0u;
   for (lim = i = 0, j = 1; i < m; ++i, j <<= 1) {
      S[(unsigned char) x[i]] &= ~j;   /* clear bit i for character x[i] */
      lim |= j;
   }
   lim = ~(lim >> 1);
   return (lim);
}

void SO(const char *x, int m, const char *y, int n) {
   unsigned int lim, state;
   unsigned int S[ASIZE];
   int j;

   if (m > WORDSIZE)
      error("SO: use pattern size <= word size");

   /* Preprocessing */
   lim = preso(x, m, S);

   /* Searching */
   for (state = ~0u, j = 0; j < n; ++j) {
      state = (state << 1) | S[(unsigned char) y[j]];
      if (state < lim)
         OUTPUT(j - m + 1);
   }
}
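As a self-contained sketch of the same search, the variant below replaces the external OUTPUT and error helpers with a caller-supplied result array and an assertion. The names so_build_masks, so_search, out and max_out are assumptions for illustration, not part of the original listing; characters are cast to unsigned char before indexing so that bytes above 127 cannot produce a negative index.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define SO_ASIZE 256
#define SO_WORD_BITS (sizeof(unsigned int) * CHAR_BIT)

/* Builds the per-character masks: bit i of S[c] is 0 exactly when
   x[i] == c. Returns the "ruler" lim used to detect a match. */
unsigned int so_build_masks(const char *x, size_t m,
                            unsigned int S[SO_ASIZE]) {
    unsigned int lim = 0, bit = 1;
    for (size_t c = 0; c < SO_ASIZE; ++c)
        S[c] = ~0u;
    for (size_t i = 0; i < m; ++i, bit <<= 1) {
        S[(unsigned char)x[i]] &= ~bit;
        lim |= bit;
    }
    return ~(lim >> 1);
}

/* Hypothetical helper: writes up to max_out match positions to out
   and returns the total number of matches of x[0..m) in y[0..n). */
size_t so_search(const char *x, size_t m, const char *y, size_t n,
                 size_t *out, size_t max_out) {
    assert(m > 0 && m <= SO_WORD_BITS);

    unsigned int S[SO_ASIZE];
    unsigned int lim = so_build_masks(x, m, S);
    unsigned int state = ~0u;           /* all levels empty */
    size_t count = 0;

    for (size_t j = 0; j < n; ++j) {
        /* Shift promotes everyone; OR eliminates those who fail. */
        state = (state << 1) | S[(unsigned char)y[j]];
        if (state < lim) {              /* bit m-1 is 0: match ends at j */
            if (count < max_out)
                out[count] = j + 1 - m;
            ++count;
        }
    }
    return count;
}
```

For the pattern "aba", the low three bits of the mask for 'a' are 010 and those for 'b' are 101, and searching "ababa" reports positions 0 and 2.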

In the code, the lim variable acts as a ruler. For example, with m = 8 the low eight bits of lim are 10000000 and every higher bit is 1. Bits m and above of state are always 1, so state is smaller than lim exactly when bit m - 1 of state is 0, for instance when its low eight bits are 01111111. In other words, state < lim means that a 0 has reached the top level. The description of the Shift-Or algorithm in the original text is rather hard to follow, and reading the code alongside it left me lost in the fog. I find it easier to go straight from the promotion metaphor to the code.
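The role of lim can be checked in isolation. The sketch below, using the hypothetical helper names so_lim and so_top_level_reached, recomputes the ruler for a given m and tests bit m - 1 directly, so the two checks can be compared. It assumes, as holds during the search, that every bit of state at position m or higher is 1.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical helper: the ruler for pattern length m
   (1 <= m <= number of bits in unsigned int), computed as in preso. */
unsigned int so_lim(unsigned int m) {
    assert(m >= 1 && m <= sizeof(unsigned int) * CHAR_BIT);
    unsigned int lim = 0, bit = 1;
    for (unsigned int i = 0; i < m; ++i, bit <<= 1)
        lim |= bit;                 /* low m bits set */
    return ~(lim >> 1);             /* bit m-1 and above set, rest clear */
}

/* Direct test of the top level: returns 1 when bit m-1 of state is 0. */
int so_top_level_reached(unsigned int state, unsigned int m) {
    return ((state >> (m - 1)) & 1u) == 0;
}
```

For m = 8, a state ending in 01111111 compares below so_lim(8) and so_top_level_reached agrees, while a state ending in 11111111 or 10000000 does not compare below it.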

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
