Rolling Hash (Rabin-Karp Algorithm): Matching Strings and Anagrams

Last Update:2018-12-05 Source: Internet

Author: User

Tags: integer numbers


Common scenarios of this algorithm

Searching a string for a substring, and searching a string for a substring that is an anagram of a given string.

String search and matching

A string can be interpreted as an array of characters. Each character can be converted to an integer whose value depends on the encoding (ASCII or Unicode), so a string can be regarded as an array of integers. If we can find a way to turn an array of integers into a single number, we can hash a string into a value in an expected range.

Because a string is an array rather than a single element, comparing two strings is not as simple as comparing two values. To check whether A and B are equal, we have to go through all their elements and verify that A[i] = B[i] for every i. The cost of a string comparison therefore depends on the length of the strings: comparing two strings of length n takes O(n) time. Likewise, hashing a string requires visiting every element, so hashing a string of length n also takes O(n) time.
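The two O(n) operations described above can be sketched in Java as follows. This is only a minimal illustration: the class name `StringAsIntegers` and the constants `BASE` and `MOD` are assumptions chosen for the example, not values taken from the article.

```java
class StringAsIntegers {
    // Illustrative constants (assumptions, not from the article).
    static final long BASE = 256;
    static final long MOD = 1_000_000_007L;

    /** Compares two strings element by element: O(n). */
    static boolean sameString(String a, String b) {
        if (a.length() != b.length()) {
            return false;
        }
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i)) {
                return false;
            }
        }
        return true;
    }

    /** Folds the character codes of a string into a single number: O(n). */
    static long hash(String s) {
        long h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = (h * BASE + s.charAt(i)) % MOD;
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println(sameString("algor", "lgori"));
        System.out.println(hash("algor") == hash("algor"));
    }
}
```

Both methods visit every character once, which is where the O(n) cost comes from.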

Practice

Assume the pattern string P (the string to search for) has length L and the string S being searched has length n. One way to find P in S is:

1. Hash P to obtain H(P). Time complexity: O(L).

2. Starting at index 0 of S, enumerate the substrings of S of length L and hash each one to obtain its hash value. Time complexity: O(nL).

3. If a substring's hash value equals H(P), compare the substring with P character by character. If they match, stop; otherwise continue with step 2. Time complexity: O(L) per comparison.
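The three steps above can be sketched as follows. This is a minimal illustration of the O(nL) approach; the class name `NaiveHashSearch`, the polynomial hash, and the constants `BASE` and `MOD` are assumptions made for the example.

```java
class NaiveHashSearch {
    // Illustrative constants (assumptions, not from the article).
    static final long BASE = 256;
    static final long MOD = 1_000_000_007L;

    /** Hashes s[start .. start + len - 1] from scratch: O(len). */
    static long hash(String s, int start, int len) {
        long h = 0;
        for (int i = start; i < start + len; i++) {
            h = (h * BASE + s.charAt(i)) % MOD;
        }
        return h;
    }

    /** Returns the index of the first occurrence of p in s, or -1 if there is none. */
    static int indexOf(String s, String p) {
        int n = s.length();
        int l = p.length();
        long hp = hash(p, 0, l);                  // step 1: O(L)
        for (int i = 0; i + l <= n; i++) {
            long hs = hash(s, i, l);              // step 2: O(L) for every window
            // step 3: compare characters only when the hash values agree
            if (hs == hp && s.regionMatches(i, p, 0, l)) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(indexOf("algorithms", "gori"));
    }
}
```

Every window is rehashed from scratch here, which is exactly the repeated work that a rolling hash avoids.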

The total time complexity of this approach is O(nL). We can use a rolling hash to improve it. In step 2 we spend O(L) hashing each of O(n) substrings. (Picture a box of length L framing part of S. Each iteration moves the box forward by one position, so it moves about n times, and each time the substring inside the box has to be traversed to compute its hash, hence O(nL).) However, these substrings share most of their characters. For example, the first two substrings of length 5 in "algorithms" are "algor" and "lgori". If we could exploit the fact that they share the substring "lgor", we would save a great deal of work on each substring. This is what a rolling hash does.

A numeric example

Let's set strings aside for a moment. Assume that P and S have both been converted to integer arrays:

P = [9, 0, 2, 1, 0] (1)

S = [4, 8, 9, 0, 2, 1, 0, ...] (2)

The substrings of S of length 5 are listed below:

S0 = [4, 8, 9, 0, 2] (3)

S1 = [8, 9, 0, 2, 1] (4)

S2 = [9, 0, 2, 1, 0] (5)

.... (6)

If we want to know whether P matches a substring of S, we can follow the three steps under "Practice" above. Our hash function can be:

$$H(k) = (k[0] \cdot 10^4 + k[1] \cdot 10^3 + k[2] \cdot 10^2 + k[3] \cdot 10^1 + k[4] \cdot 10^0) \bmod m$$

In other words, we map the values of a length-5 integer array to the digits of a 5-digit number and then take that number modulo m. So H(P) = 90210 mod m, H(S0) = 48902 mod m, and H(S1) = 89021 mod m. Note that we can use H(S0) to help compute H(S1): remove the first digit from 48902 to get 8902, multiply by 10 to get 89020, and then add the next digit to get 89021. More generally:

$$H(S_{i+1}) = ((H(S_i) - S[i] \cdot 10^4) \cdot 10 + S[i+5]) \bmod m$$
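The digit arithmetic above can be checked with a small Java sketch. The class name `DigitRollingHash` and the modulus `M = 1009` are assumptions picked for illustration; the article does not fix a value for m.

```java
class DigitRollingHash {
    static final int M = 1009;          // illustrative modulus (assumption)
    static final int WINDOW = 5;        // window length used in the example

    /** Hashes s[start .. start + 4] as a 5-digit number modulo M: O(WINDOW). */
    static int directHash(int[] s, int start) {
        int h = 0;
        for (int i = start; i < start + WINDOW; i++) {
            h = (h * 10 + s[i]) % M;
        }
        return h;
    }

    /** Derives the next window's hash from the previous one: O(1). */
    static int roll(int h, int outgoing, int incoming) {
        int high = 10_000 % M;                                    // 10^4 mod M
        int withoutFirst = ((h - outgoing * high) % M + M) % M;   // drop the leading digit
        return (withoutFirst * 10 + incoming) % M;                // shift left, append new digit
    }

    public static void main(String[] args) {
        int[] s = {4, 8, 9, 0, 2, 1, 0};
        int h0 = directHash(s, 0);            // 48902 mod M
        int h1 = roll(h0, s[0], s[5]);        // 89021 mod M
        int h2 = roll(h1, s[1], s[6]);        // 90210 mod M
        System.out.println(h1 == directHash(s, 1));
        System.out.println(h2 == directHash(s, 2));
    }
}
```

Adding `M` before the final `% M` in `roll` keeps the intermediate value non-negative, since Java's `%` can return a negative result.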

We can picture this as a window sliding over the substrings of S. The hash value of the next substring depends on only two elements, the ones at the two ends of the window (one entering and one leaving). This is very different from before: apart from the first substring of length L, we no longer depend on all L elements of a substring but only on two of them, which makes computing the hash of each subsequent substring an O(1) operation.

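In this numeric example every element was a single digit and the base was set to 10, so each value was easy to pick out. In general, for a base b and a window of length L, we can use the following formula:

$$H(k) = (k[0] \cdot b^{L-1} + k[1] \cdot b^{L-2} + \cdots + k[L-1] \cdot b^0) \bmod m$$

The hash value of the next substring is calculated as follows:

$$H(S_{i+1}) = (b \cdot (H(S_i) - b^{L-1} \cdot S[i]) + S[i+L]) \bmod m$$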
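A general rolling hash with an arbitrary base and modulus might look like the sketch below. The class `PolyRollingHash` and its method names are hypothetical, introduced only to illustrate the idea; the base and modulus are supplied by the caller.

```java
class PolyRollingHash {
    final long base;
    final long mod;
    final int len;
    final long highPow;   // base^(len - 1) mod mod, the weight of the leading element

    PolyRollingHash(long base, long mod, int len) {
        this.base = base;
        this.mod = mod;
        this.len = len;
        long p = 1;
        for (int i = 1; i < len; i++) {
            p = (p * base) % mod;
        }
        this.highPow = p;
    }

    /** Hashes s[start .. start + len - 1] from scratch: O(len). */
    long hash(CharSequence s, int start) {
        long h = 0;
        for (int i = start; i < start + len; i++) {
            h = (h * base + s.charAt(i)) % mod;
        }
        return h;
    }

    /** Slides the window one position to the right: O(1). */
    long roll(long h, char outgoing, char incoming) {
        long withoutFirst = (h - outgoing * highPow % mod + mod) % mod;
        return (withoutFirst * base + incoming) % mod;
    }

    public static void main(String[] args) {
        String s = "algorithms";
        PolyRollingHash rh = new PolyRollingHash(256, 1_000_000_007L, 5);
        long h = rh.hash(s, 0);                         // hash of "algor"
        h = rh.roll(h, s.charAt(0), s.charAt(5));       // hash of "lgori"
        System.out.println(h == rh.hash(s, 1));
    }
}
```

With a modulus near 10^9 and a base of 256, every intermediate product stays well within the range of a Java `long`.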

Back to strings

Since strings can be converted to numbers, we can apply the same method to strings to improve the running time. The algorithm is as follows:

1. Hash P to obtain H(P). Time complexity: O(L).

2. Hash the first substring of S of length L. Time complexity: O(L).

3. Use the rolling hash to compute the hash values of all O(n) substrings of S of length L, comparing each with H(P). Time complexity: O(n).

4. If a substring's hash value equals H(P), compare the substring with P. If they match, stop; otherwise continue. Time complexity: O(L) per comparison.

This speeds up the whole algorithm: as long as the total time spent on comparisons is O(n), the whole algorithm runs in O(n). A problem arises if we expect O(n) hash collisions (different substrings that, because of the hash function, map to the same value) in our hash table, since step 4 then costs O(nL) in total. We therefore have to make sure that the hash table has size O(n), which depends on the design of the hash function, so that we expect only O(1) collisions in total and reach step 4 only O(1) times. Step 4 then costs O(L) in total, and the whole algorithm still runs in O(n).
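Putting the four steps together, a Rabin-Karp search could be sketched as follows. This is one possible implementation under stated assumptions, not the article's own code: the class name `RabinKarp` and the constants `BASE` and `MOD` are illustrative.

```java
class RabinKarp {
    // Illustrative constants (assumptions, not from the article).
    static final long BASE = 256;
    static final long MOD = 1_000_000_007L;

    /** Returns the index of the first occurrence of p in s, or -1 if there is none. */
    static int indexOf(String s, String p) {
        int n = s.length();
        int l = p.length();
        if (l == 0) {
            return 0;
        }
        if (l > n) {
            return -1;
        }
        long highPow = 1;                        // BASE^(l - 1) mod MOD
        for (int i = 1; i < l; i++) {
            highPow = highPow * BASE % MOD;
        }
        long hp = 0;
        long hs = 0;
        for (int i = 0; i < l; i++) {            // steps 1 and 2: O(L) each
            hp = (hp * BASE + p.charAt(i)) % MOD;
            hs = (hs * BASE + s.charAt(i)) % MOD;
        }
        for (int i = 0; ; i++) {                 // step 3: O(1) per window
            // step 4: verify with a real comparison only when the hashes agree
            if (hs == hp && s.regionMatches(i, p, 0, l)) {
                return i;
            }
            if (i + l >= n) {
                return -1;                       // no further window
            }
            hs = (hs - s.charAt(i) * highPow % MOD + MOD) % MOD;   // remove s[i]
            hs = (hs * BASE + s.charAt(i + l)) % MOD;              // append s[i + l]
        }
    }

    public static void main(String[] args) {
        System.out.println(indexOf("algorithms", "rithm"));
    }
}
```

The character comparison in step 4 is kept even when the hash values agree, because two different substrings can still collide.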

Common substrings

The algorithm above looks for a pattern string P in a string S. Now consider another problem: given two long strings S and T, each of length n, determine whether they have a common substring of length L. This looks harder, but we can still use a rolling hash to solve it in O(n). We adopt a similar strategy:

1. Hash the first substring of S of length L. Time complexity: O(L).

2. Use the rolling hash to compute the hash values of all O(n) substrings of S, and add each substring to a hash table. Time complexity: O(n).

3. Hash the first substring of T of length L. Time complexity: O(L).

4. Use the rolling hash to compute the hash values of all O(n) substrings of T, and look each one up in the hash table.

5. If a substring of T hits a substring of S in the hash table, compare the two substrings. If they are equal, stop; otherwise continue. Time complexity: O(L) per comparison.

However, to keep the running time at O(n), we again have to limit the number of hash collisions so that step 5 does not trigger too many string comparisons. This time, if the hash table has size O(n), we expect O(1) collisions for each substring of T. That results in O(n) string comparisons and a total cost of O(nL), which makes string comparison the bottleneck. We could enlarge the hash table and modify our hash function so that the table has O(n^2) slots (a slot is a unit of the hash table that actually stores data), which reduces the expected number of collisions for each substring of T to O(1/n). This solves the problem and brings the total running time back to O(n), but we may not want to spend the resources needed for such a large table.

Instead, we can take advantage of string signatures rather than consume more storage. We associate each substring with a second hash value, H'(k). Note that the hash function H'(k) maps strings to the range 0 to n^2 rather than the range 0 to n used above. Now, when a collision occurs in the hash table, we first compare the signatures of the two strings before doing the final, expensive string comparison. If the signatures do not match, we can skip the string comparison. For two substrings k1 and k2, we perform the string comparison only when H(k1) = H(k2) and H'(k1) = H'(k2). With a well-chosen H'(k), this greatly reduces the number of string comparisons, bringing the time spent on comparisons close to O(n) and keeping the running time of the common substring problem at O(n).
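The table-plus-signature idea can be sketched as below. This is an illustration under several assumptions: the class `CommonSubstring`, the table modulus `2n + 1` (an O(n) range), and the fixed `SIGNATURE_MOD` (standing in for the larger signature range) are all choices made for the example, and a `HashMap` of index lists plays the role of the hash table.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class CommonSubstring {
    static final long BASE = 256;                       // illustrative base (assumption)
    static final long SIGNATURE_MOD = 1_000_000_007L;   // illustrative signature range (assumption)

    /** Rolling hashes of every window of the given length in s, modulo mod. */
    static long[] windowHashes(String s, int len, long mod) {
        int count = s.length() - len + 1;
        long[] out = new long[count];
        long highPow = 1;                               // BASE^(len - 1) mod mod
        for (int i = 1; i < len; i++) {
            highPow = highPow * BASE % mod;
        }
        long h = 0;
        for (int i = 0; i < len; i++) {
            h = (h * BASE + s.charAt(i)) % mod;
        }
        out[0] = h;
        for (int i = 1; i < count; i++) {
            h = (h - s.charAt(i - 1) * highPow % mod + mod) % mod;  // remove s[i - 1]
            h = (h * BASE + s.charAt(i + len - 1)) % mod;           // append s[i + len - 1]
            out[i] = h;
        }
        return out;
    }

    /** Returns a common substring of the given length, or null if there is none. */
    static String find(String s, String t, int len) {
        if (len <= 0 || len > s.length() || len > t.length()) {
            return null;
        }
        long tableMod = 2L * s.length() + 1;            // O(n) slots
        long[] slotS = windowHashes(s, len, tableMod);
        long[] sigS = windowHashes(s, len, SIGNATURE_MOD);
        Map<Long, List<Integer>> table = new HashMap<>();
        for (int i = 0; i < slotS.length; i++) {
            table.computeIfAbsent(slotS[i], k -> new ArrayList<>()).add(i);
        }
        long[] slotT = windowHashes(t, len, tableMod);
        long[] sigT = windowHashes(t, len, SIGNATURE_MOD);
        for (int j = 0; j < slotT.length; j++) {
            List<Integer> candidates = table.get(slotT[j]);
            if (candidates == null) {
                continue;
            }
            for (int i : candidates) {
                // Compare the cheap signatures first; compare characters only if they agree.
                if (sigS[i] == sigT[j] && s.regionMatches(i, t, j, len)) {
                    return t.substring(j, j + len);
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(find("algorithms", "logarithm", 5));
    }
}
```

Most slot collisions are rejected by the signature check, so the O(L) character comparison runs only for windows that are very likely equal.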

A simple Java implementation of the rolling hash algorithm

Find a substring that is an anagram of a given string, for example:

getAnagram("abcdbcsdaqdbahs", "scdcb") => "cdbcs".

Because we are looking for an anagram rather than an identical string, more comparisons are needed than in the first example above. However, the same mechanisms can be used to keep hash collisions to a minimum, reducing the number of comparisons and thus the complexity.

The general implementation is as follows:

    package rollinghash;

    /**
     * User: yanghua
     * Date: 5/11/13
     * Copyright (c) 2013 yanghua. All rights reserved.
     *
     * Rolling hash (Rabin-Karp algorithm) exercise [Google interview question].
     */
    public class RollingHash {

        // The simple hash expression is: (A[0] + A[1] + A[2] + ... + A[n]) * FACTOR
        private static final int FACTOR = 41;

        private static long hashValueOfPattern;

        /**
         * Generate the pattern's hash.
         *
         * @param patternStr the pattern string
         */
        private static void generatePatternHash(String patternStr) {
            if (null == patternStr || patternStr.isEmpty()) {
                throw new IllegalArgumentException("the arg: patternStr can not be null or empty");
            }

            long sum = 0;
            for (int i = 0; i < patternStr.length(); i++) {
                sum += patternStr.charAt(i);
            }
            hashValueOfPattern = sum * FACTOR;
        }

        /**
         * Find the substrings that are anagrams of the pattern.
         *
         * @param searchingStr the string to search
         * @param patternStr   the pattern to search for
         * @return matched count
         */
        private static int findAnagramStr(String searchingStr, String patternStr) {
            if (null == searchingStr || searchingStr.isEmpty()) {
                throw new IllegalArgumentException("the arg: searchingStr can not be null or empty");
            }
            if (null == patternStr || patternStr.isEmpty()) {
                throw new IllegalArgumentException("the arg: patternStr can not be null or empty");
            }
            if (searchingStr.length() < patternStr.length()) {
                return 0;
            }

            int count = 0;

            // generate the pattern's hash value
            generatePatternHash(patternStr);

            long tmpHashValue = 0;
            int l = patternStr.length();
            int n = searchingStr.length();

            for (int i = 0; i < n; i++) {
                char c = searchingStr.charAt(i);
                if (i < l) {
                    // build the hash of the first substring (0 .. l - 1),
                    // whose length equals the pattern's
                    tmpHashValue += (long) c * FACTOR;
                } else {
                    // new tmpHashValue: add (A[in-index] - A[out-index]) * FACTOR
                    tmpHashValue += (long) (c - searchingStr.charAt(i - l)) * FACTOR;
                }

                // If the hash values match, compare each character. This is checked for
                // every complete window, including the first one.
                // Note: with a better hash function, more hash slots, or a string
                // signature to avoid excessive hash collisions, the number of isAnagram
                // calls can be greatly reduced, so that the complexity of the whole
                // problem approaches O(n).
                if (i >= l - 1 && hashValueOfPattern == tmpHashValue
                        && isAnagram(searchingStr, i - l + 1, i, patternStr)) {
                    count++;
                }
            }

            return count;
        }

        /**
         * Check whether a substring is an anagram of the pattern. Because hash
         * collisions can occur, this function verifies that the match is real.
         *
         * @param comparedStr compared string
         * @param startIndex  start index (inclusive)
         * @param endIndex    end index (inclusive)
         * @param pattern     pattern string
         * @return true/false
         */
        private static boolean isAnagram(String comparedStr, int startIndex, int endIndex, String pattern) {
            if (null == comparedStr || comparedStr.isEmpty()) {
                throw new IllegalArgumentException("the arg: comparedStr can not be null or empty");
            }
            if (null == pattern || pattern.isEmpty()) {
                throw new IllegalArgumentException("the arg: pattern can not be null or empty");
            }
            if (startIndex > endIndex || endIndex - startIndex != pattern.length() - 1) {
                throw new IllegalArgumentException("the arg: startIndex or endIndex is illegal");
            }

            // Not only digits and letters: also covers backspace and special symbols.
            // Assumes every character code is below 256; Java zero-initializes the array.
            int[] letterCountOfPattern = new int[256];

            for (int k = 0; k < pattern.length(); k++) {
                ++letterCountOfPattern[pattern.charAt(k)];
            }
            for (int i = startIndex; i <= endIndex; i++) {
                --letterCountOfPattern[comparedStr.charAt(i)];
            }
            for (int m = 0; m < 256; m++) {
                if (letterCountOfPattern[m] != 0) {
                    return false;
                }
            }
            return true;
        }

        public static void main(String[] args) {
            String searchingStr = "abcdbcsdaqdbahs";
            String patternStr = "bqcb";
            int count = findAnagramStr(searchingStr, patternStr);
            System.out.println(count);
        }
    }

The code is on GitHub.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
