Rolling Hash (Rabin-Karp algorithm) for string matching

Source: Internet
Author: User
Tags numeric value

This article is also available on my personal blog:

Http://acbingo.cn/2015/08/09/Rolling%20Hash (rabin-karp%e7%ae%97%e6%b3%95)%e5%8c%b9%e9%85%8d%e5%ad%97%e7%ac%a6% e4%b8%b2/

Common scenarios for this algorithm

Finding a substring in a string, or finding a substring of a string that is an anagram of a given pattern.

About string lookups and matching

A string can be viewed as an array of characters. Characters can be converted to integers, with the specific values depending on the encoding (ASCII/Unicode). This means we can treat a string as an array of integers. If we find a way to convert an array of integers into a single number, we can hash strings just as we would hash any other input value.
Because strings are arrays rather than single elements, comparing two strings is not as simple as comparing two values. To check whether A and B are equal, we have to go through all of their elements and confirm that A[i] = B[i] for every i. The cost of a string comparison therefore depends on the length of the strings: comparing two strings of length n takes O(n) time. Likewise, hashing a string requires visiting every element, so hashing a string of length n also takes O(n) time.
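
To make the two O(n) operations concrete, here is a minimal sketch. The names `sameString` and `hashString`, the base 256 and the modulus 1000003 are assumptions chosen for illustration, not values taken from the article.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Compare two strings element by element: O(n) in the string length.
bool sameString(const std::string& a, const std::string& b) {
    if (a.size() != b.size()) return false;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (a[i] != b[i]) return false;
    }
    return true;
}

// Read the characters as the digits of a base-`base` number and reduce it
// mod `mod` after every step so the value never overflows: also O(n).
// The default base and modulus are illustrative assumptions.
std::uint64_t hashString(const std::string& s,
                         std::uint64_t base = 256,
                         std::uint64_t mod = 1000003) {
    std::uint64_t h = 0;
    for (unsigned char c : s) {
        h = (h * base + c) % mod;
    }
    return h;
}
```

Equal strings always get equal hashes; the converse does not hold, which is why a hash match still has to be confirmed by a real comparison.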

Approach
Suppose we want to find a pattern P of length L in a string S of length n.
    1. Hash P to get h(P). Time complexity: O(L)
    2. Starting from index 0 of S, enumerate every substring of S of length L and hash each one to get h(S_i). Time complexity: O(nL)
    3. If the hash of a substring equals h(P), compare that substring with P character by character; stop if they match, otherwise continue with step 2. Time complexity: O(L) per comparison
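
The three steps above can be sketched as follows. This is only an illustration of the unoptimized O(nL) procedure; the names `windowHash` and `naiveHashSearch`, and the default base and modulus, are assumptions.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hash of the length-len substring of s that starts at pos: O(len).
std::uint64_t windowHash(const std::string& s, std::size_t pos, std::size_t len,
                         std::uint64_t base, std::uint64_t mod) {
    std::uint64_t h = 0;
    for (std::size_t i = 0; i < len; ++i) {
        h = (h * base + static_cast<unsigned char>(s[pos + i])) % mod;
    }
    return h;
}

// Returns the index of the first occurrence of p in s, or -1 if there is none.
long long naiveHashSearch(const std::string& p, const std::string& s,
                          std::uint64_t base = 256, std::uint64_t mod = 1000003) {
    const std::size_t L = p.size();
    if (L == 0 || s.size() < L) return -1;
    const std::uint64_t hp = windowHash(p, 0, L, base, mod);      // step 1
    for (std::size_t i = 0; i + L <= s.size(); ++i) {
        // Step 2: every window is hashed from scratch, which costs O(L).
        if (windowHash(s, i, L, base, mod) != hp) continue;
        // Step 3: the hashes agree, so confirm with a real comparison.
        if (s.compare(i, L, p) == 0) return static_cast<long long>(i);
    }
    return -1;
}
```

For instance, `naiveHashSearch("lgor", "algorithms")` returns 1. Every window is rehashed in full, and that repeated work is what the rolling hash removes.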

The time complexity of this procedure is O(nL). We can use a rolling hash to optimize it. In step 2 we spend O(L) hashing each of the O(n) substrings. (Picture a box of length L framing part of S and moving forward one position per iteration: it moves about n times, and each time we iterate over the whole substring inside the box to compute its hash, hence O(nL).) However, most of the characters in consecutive substrings are repeated. For example, take the substrings of length 5 of the string "algorithms": the first two are "algor" and "lgori". If we can exploit the fact that these two substrings share the substring "lgor", we save a lot of work on each substring. This is exactly what a rolling hash is for.

Numeric example

Let's set strings aside for a moment and suppose that P and S have both been converted to integer arrays:
p = [9,0,2,1,0]
s = [4,8,9,0,2,1,0,7]
The substrings of s of length 5 are listed below:
s0 = [4,8,9,0,2]
s1 = [8,9,0,2,1]
s2 = [9,0,2,1,0]
...
We want to know whether P matches any substring of s, using the three steps of the procedure above. Our hash function can be:

$$h(k) = (k[0]\cdot 10^4 + k[1]\cdot 10^3 + k[2]\cdot 10^2 + k[3]\cdot 10^1 + k[4]\cdot 10^0) \bmod m$$

In other words, we treat the five values of the array as the digits of a 5-digit number and then take that number mod m: h(P) = 90210 mod m, h(S0) = 48902 mod m, and h(S1) = 89021 mod m. Note that with this hash function we can use h(S0) to help compute h(S1). We start with 48902, remove the first digit to get 8902, multiply by 10 to get 89020, and then add the next digit to get 89021. The more general formula is:

$$h(S_{i+1}) = \big((h(S_i) - 10^4\cdot s[i])\cdot 10 + s[i+5]\big) \bmod m$$

We can think of this as a window sliding over all the substrings of S. The hash of the next substring depends on just two elements, the ones at the two ends of the sliding window (one comes in, one goes out). This is a big difference from before: apart from computing the hash of the first substring, we no longer depend on all L elements of a window but only on two, which turns the hash computation for each substring into an O(1) operation.
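In this numeric example we simply stored the integers digit by digit and set the base to 10, so that each digit can be isolated easily. For the general case, with base b and window length L, we can use the following formula:

$$h(k) = (k[0]\cdot b^{L-1} + k[1]\cdot b^{L-2} + \dots + k[L-1]\cdot b^0) \bmod m$$

and the hash value of the next substring is:

$$h(S_{i+1}) = \big((h(S_i) - b^{L-1}\cdot s[i])\cdot b + s[i+L]\big) \bmod m$$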

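I don't think that explanation is very clear, so here is my own reading, with L = 5 and b = 10. Let v(S_i) be the window's digits read as a single number, before taking mod m:
v(S_{i+1}) = (v(S_i) mod b^(L-1)) * b + s[i+L], and h(S_{i+1}) = v(S_{i+1}) mod m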
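
The numeric example can be checked with a short sketch that rolls the hash across the digit array. The function name `rollingHashes` is an assumption, and the code assumes non-negative digits smaller than the base.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns h(S_0), h(S_1), ... for every window of length L over s, reading
// each window as a base-b number reduced mod m.
// Assumes every element of s is a non-negative digit smaller than b.
std::vector<std::int64_t> rollingHashes(const std::vector<int>& s, std::size_t L,
                                        std::int64_t b, std::int64_t m) {
    std::vector<std::int64_t> out;
    if (L == 0 || s.size() < L) return out;

    std::int64_t top = 1;  // b^(L-1) mod m, the weight of the leading digit
    for (std::size_t i = 1; i < L; ++i) top = (top * b) % m;

    std::int64_t h = 0;    // the first window is hashed in full: O(L)
    for (std::size_t i = 0; i < L; ++i) h = (h * b + s[i]) % m;
    out.push_back(h);

    for (std::size_t i = 0; i + L < s.size(); ++i) {
        // O(1) update: remove the leading digit, shift by b, append the next digit.
        const std::int64_t lead = (s[i] % m) * top % m;
        h = ((h + m - lead) % m * b + s[i + L]) % m;
        out.push_back(h);
    }
    return out;
}
```

With s = [4,8,9,0,2,1,0,7], L = 5, b = 10 and a modulus larger than 99999 (for example 1000003), the returned values are the windows themselves: 48902, 89021, 90210 and 2107.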

Another expert described it this way:
The key idea of the Rabin-Karp algorithm is that the hash of a substring can be computed from the hash of the previous substring in constant time, so the time complexity of the comparison drops to O(n-k). Rabin-Karp hashes a string in the same way as described above (parse it as an integer in some base, then take the modulus). Suppose the original string is s, and let h(i) be the hash of the length-k substring starting at character i, that is, of s[i..i+k-1]. Leaving %m aside for now:

$$h(i) = s[i]\cdot b^{k-1} + s[i+1]\cdot b^{k-2} + \dots + s[i+k-1]\cdot b^0$$
$$h(i+1) = (h(i) - s[i]\cdot b^{k-1})\cdot b + s[i+k]$$

so each step takes constant time.
Also, by the properties of %:

$$(x + y) \bmod m = ((x \bmod m) + (y \bmod m)) \bmod m$$
$$(x \cdot y) \bmod m = ((x \bmod m)\cdot(y \bmod m)) \bmod m$$

That is, the hash of the (i+1)-th substring can be computed directly from the hash of the i-th substring; taking %m on intermediate results is mainly to prevent overflow.
m is generally chosen to be a very large number. As long as the number of substrings is comparatively small, the probability of a hash collision is about 1/m and can be ignored.
That author's implementation does no fallback check when the hashes agree. The bottleneck of Rabin-Karp is that every iteration of the inner loop performs a multiplication and a modulo, and the modulo is relatively expensive, whereas most other algorithms only need character comparisons.
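
One practical detail of reducing intermediate results is that in C++ the % operator can return a negative value when its left operand is negative, so the subtraction in the recurrence needs care. The helper below is a sketch of a single update step; the name `nextHash` and its parameter names are assumptions.

```cpp
#include <cstdint>

// One step of the recurrence h(i+1) = (h(i) - s[i] * b^(k-1)) * b + s[i+k],
// with every intermediate result reduced mod m.
//   h         : current hash, already in [0, m)
//   outgoing  : value of the element leaving the window, s[i]
//   incoming  : value of the element entering the window, s[i+k]
//   topWeight : b^(k-1) mod m, computed once in advance
std::int64_t nextHash(std::int64_t h, std::int64_t outgoing, std::int64_t incoming,
                      std::int64_t topWeight, std::int64_t b, std::int64_t m) {
    std::int64_t t = (h - (outgoing % m) * topWeight % m) % m;
    if (t < 0) t += m;  // % keeps the sign of the dividend, so fix up negatives
    return (t * b + incoming % m) % m;
}
```

With the digits from the numeric example and m = 101, b^4 mod 101 is 1 and h(S0) is 18, so `nextHash(18, 4, 1, 1, 10, 101)` gives 40, which is 89021 mod 101.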

Back to strings

Since strings can be converted to numbers, we can speed up operations on strings with the same method as in the numeric example. The algorithm works as follows:

    1. Hash P to get h(P). Time complexity: O(L)
    2. Hash the first substring of S of length L. Time complexity: O(L)
    3. Use the rolling hash to compute the hashes of the remaining O(n) substrings of S, comparing each hash with h(P) as it is computed. Time complexity: O(n)
    4. If the hash of a substring equals h(P), compare that substring with P; stop if they match, otherwise continue. Time complexity: O(L) per comparison

This speeds up the whole algorithm: as long as the total time spent on string comparisons is O(n), the whole algorithm runs in O(n). A problem arises if we expect O(n) hash collisions (different keys mapping to the same value because of the hash function), since the total cost of step 4 then becomes O(nL). So we have to make sure the size of our hash table is O(n) (that is, each substring should, as far as the design of the hash function allows, correspond to its own hash key), so that we can expect O(1) collisions in total and only need to enter step 4 O(1) times. Each visit to step 4 costs O(L), so the overall time complexity is still O(n).

Code implementation
#include <iostream>
#include <string>
using namespace std;

// Prints every index at which p occurs in s, using base b and modulus m.
void Rabin_karp(const string& p, const string& s, int b, int m) {
    const size_t L = p.size();
    if (L == 0 || s.size() < L) return;
    long long hash_p = 0;  // hash value of the target string
    long long hash_i = 0;  // hash value of the current window of s
    long long h = 1;
    for (size_t i = 0; i < L; i++) {  // h == pow(b, p.size()) mod m
        h = (h * b) % m;
    }
    for (size_t i = 0; i < L; i++) {
        hash_p = (b * hash_p + (unsigned char)p[i]) % m;
        hash_i = (b * hash_i + (unsigned char)s[i]) % m;
    }
    for (size_t i = 0; i + L <= s.size(); i++) {
        if (hash_i == hash_p) {
            // The hashes agree: compare the characters to rule out a collision.
            size_t j;
            for (j = 0; j < L; j++) {
                if (s[i + j] != p[j]) break;
            }
            if (j == L) cout << "Yes " << i << endl;
        }
        if (i + L < s.size()) {
            // Calculate the next hash value. m is added before subtracting so the
            // result cannot go negative; keep in mind that % can return a negative
            // number (or 0) whenever a subtraction is involved.
            hash_i = (hash_i * b + (unsigned char)s[i + L] + m
                      - ((unsigned char)s[i] * h) % m) % m;
        }
    }
}

int main() {
    string p = "Rabin";
    string s = "Rabin–karp string search Algorithm:rabin-karp";
    int m = 101;    // prime modulus
    int base = 26;  // base, take 26 here
    Rabin_karp(p, s, base, m);
    return 0;
}

Self-matching problem

Given a string S of length n, determine whether two of its substrings of length L are identical and, if so, output the number of occurrences and their positions.
Note that the substring length is fixed here. When the data is small, brute force is good enough.

    1. Hash the first substring of S of length L and put it into the map. Time complexity: O(L)
    2. Use the rolling hash to compute the hashes of all O(n) substrings of S; look each one up in the map as it is computed and update the map. Time complexity: O(n log n)
      Note that hash collisions may occur. Overall, the value of m determines the size of the map, and the size of the map determines the probability of a hash collision. When a collision does occur, I think an overflow buffer or re-hashing is relatively easy to implement.
Code implementation

The code only checks whether an identical substring exists ╮(╯-╰)╭ Can't be helped, the LPL match is about to start, so you'll have to finish the rest yourself~

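#include <iostream>
#include <map>
#include <string>
using namespace std;

struct node {
    int index = 0;  // start of the first substring seen with this hash
    int num = 0;    // how many substrings were recorded under this hash
};
map<int, node> mymap;

// Reports whether s contains two identical substrings of length l.
void Rabin_karp_self(const string& s, int l, int b, int m) {
    int n = (int)s.size();
    if (l <= 0 || l > n) return;
    long long h = 0;  // note the initialization
    long long t = 1;  // t = b^l mod m
    for (int i = 0; i < l; i++) t = (t * b) % m;
    for (int i = 0; i < l; i++) {  // calculate the hash value of the first window
        h = (b * h + (unsigned char)s[i]) % m;
    }
    mymap[(int)h].index = 0;
    mymap[(int)h].num++;
    for (int i = 1; i <= n - l; i++) {
        // Slide the window to calculate the next hash value. As before, m is
        // added before subtracting so the result cannot go negative.
        h = (h * b + (unsigned char)s[i - 1 + l] + m
             - ((unsigned char)s[i - 1] * t) % m) % m;
        int key = (int)h;
        if (mymap.count(key)) {
            // Same hash: compare the characters to rule out a collision.
            int j;
            for (j = 0; j < l; j++) {
                if (s[j + mymap[key].index] != s[i + j]) break;
            }
            if (j == l) cout << "Yes " << mymap[key].index << " " << i << endl;
        } else {
            mymap[key].index = i;
            mymap[key].num++;
        }
    }
}

int main() {
    string s = "abcabc";  // sample string taken from the original comment
    int n = 5;            // length of the substrings to compare
    int b = 10;
    int m = 10001;
    Rabin_karp_self(s, n, b, m);
    return 0;
}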


If you want to output the count and the positions as well, that is also simple: add an array to node and adjust the cout statement accordingly. Also pay attention to how hash collisions are handled.
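
One possible way to report counts and positions is sketched below. Instead of adding an array to node, it keeps a list of positions per distinct substring and uses the hash only to find candidates, so collisions are resolved by comparing the actual characters. The function name `repeatedSubstrings`, the base 256 and the modulus 1000000007 are assumptions.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Returns one list of start positions for every length-L substring of s that
// occurs at least twice. Lists are ordered by first occurrence.
std::vector<std::vector<std::size_t>> repeatedSubstrings(
        const std::string& s, std::size_t L,
        std::uint64_t b = 256, std::uint64_t m = 1000000007) {
    std::vector<std::vector<std::size_t>> groups;  // positions per distinct text
    if (L == 0 || s.size() < L) return groups;

    std::uint64_t top = 1;  // b^(L-1) mod m
    for (std::size_t i = 1; i < L; ++i) top = top * b % m;

    // hash value -> indices into `groups` that share this hash
    std::unordered_map<std::uint64_t, std::vector<std::size_t>> buckets;

    std::uint64_t h = 0;
    for (std::size_t i = 0; i < L; ++i) {
        h = (h * b + static_cast<unsigned char>(s[i])) % m;
    }
    for (std::size_t i = 0; i + L <= s.size(); ++i) {
        if (i > 0) {  // roll the window forward by one character
            const std::uint64_t lead = static_cast<unsigned char>(s[i - 1]) * top % m;
            h = ((h + m - lead) % m * b + static_cast<unsigned char>(s[i + L - 1])) % m;
        }
        bool placed = false;
        for (std::size_t g : buckets[h]) {
            // Equal hashes are only a hint; confirm with a real comparison.
            if (s.compare(groups[g][0], L, s, i, L) == 0) {
                groups[g].push_back(i);
                placed = true;
                break;
            }
        }
        if (!placed) {
            buckets[h].push_back(groups.size());
            groups.push_back({i});
        }
    }

    std::vector<std::vector<std::size_t>> result;
    for (const auto& g : groups) {
        if (g.size() >= 2) result.push_back(g);
    }
    return result;
}
```

For "abcabc" with L = 3 this returns a single list containing positions 0 and 3; the size of each list is the number of occurrences.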

Variable length of substring

TODO
Come back to this problem after thoroughly understanding all the string-matching algorithms.
Personally, I think this algorithm is not only cumbersome for this problem but also not the fastest.

Common substring problems

The algorithm was designed to look for matches of a pattern string P in a string S. Now, however, we want to handle another problem: given two long strings S and T, each of length n, determine whether they share a common substring of length L. This looks harder, but we can still use a rolling hash to bring its complexity down to O(n). We adopt a similar strategy:

    1. Hash the first substring of S of length L. Time complexity: O(L)
    2. Use the rolling hash to compute the hashes of all O(n) substrings of S and add each substring to a hash table. Time complexity: O(n)
    3. Hash the first substring of T of length L. Time complexity: O(L)
    4. Use the rolling hash to compute the hashes of all O(n) substrings of T, and look each one up in the hash table to see whether it hits. Time complexity: O(n)
    5. If a substring of T hits a substring of S, compare the two strings; stop if they are equal, otherwise continue. Time complexity: O(L) per comparison

However, to keep the running time at O(n), we again have to limit the number of hash collisions, so that we enter step 5 for unnecessary comparisons as rarely as possible. This time, if the size of our hash table is O(n), the expected number of hits for each substring of T is O(1) (in the worst case). That leads to O(n) string comparisons with a total cost of O(nL), which makes the string comparison the bottleneck. We can enlarge the hash table and modify our hash function so that the table has O(n^2) slots (slots are the cells of the hash table that actually store data), which lowers the expected collisions for each substring of T to O(1/n). This solves the problem and keeps the overall complexity at O(n), but we may not want to spend the resources on such a large hash table.
Instead, we can use string signatures in place of the extra storage: we assign each substring a second hash value, called h'(k). Note that this hash function h'(k) maps strings to the range 0 to n^2 rather than 0 to n as above. Now, when a collision occurs in the hash table, we first compare the signatures of the two strings before doing the final, expensive string comparison, and we skip the string comparison if the signatures do not match. For two substrings k1 and k2, we do the final string comparison only if h(k1) = h(k2) and h'(k1) = h'(k2). With a good hash function h'(k), this greatly reduces the number of string comparisons, bringing their total cost close to O(n) and keeping the common substring problem at O(n).
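
The signature idea can be sketched as follows: one rolling hash selects the table slot, a second and independent rolling hash acts as the signature h'(k), and the character comparison runs only when both agree. All names, bases and moduli here are assumptions chosen for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hashes of every length-L window of s under (base, mod), via the rolling update.
std::vector<std::uint64_t> windowHashes(const std::string& s, std::size_t L,
                                        std::uint64_t base, std::uint64_t mod) {
    std::vector<std::uint64_t> out;
    if (L == 0 || s.size() < L) return out;
    std::uint64_t top = 1;  // base^(L-1) mod mod
    for (std::size_t i = 1; i < L; ++i) top = top * base % mod;
    std::uint64_t h = 0;
    for (std::size_t i = 0; i < L; ++i) {
        h = (h * base + static_cast<unsigned char>(s[i])) % mod;
    }
    out.push_back(h);
    for (std::size_t i = 0; i + L < s.size(); ++i) {
        const std::uint64_t lead = static_cast<unsigned char>(s[i]) * top % mod;
        h = ((h + mod - lead) % mod * base + static_cast<unsigned char>(s[i + L])) % mod;
        out.push_back(h);
    }
    return out;
}

// Returns {position in s, position in t} of a common substring of length L,
// or {-1, -1} if the two strings share none.
std::pair<long long, long long> commonSubstring(const std::string& s,
                                                const std::string& t,
                                                std::size_t L) {
    const std::uint64_t tableMod = 1000003;        // h: selects the table slot
    const std::uint64_t signatureMod = 998244353;  // h': the signature
    const auto hs = windowHashes(s, L, 256, tableMod);
    const auto ht = windowHashes(t, L, 256, tableMod);
    const auto sigS = windowHashes(s, L, 131, signatureMod);
    const auto sigT = windowHashes(t, L, 131, signatureMod);

    // Steps 1-2: store every window of s under its table hash.
    std::unordered_map<std::uint64_t, std::vector<std::size_t>> table;
    for (std::size_t i = 0; i < hs.size(); ++i) table[hs[i]].push_back(i);

    // Steps 3-5: probe with every window of t.
    for (std::size_t j = 0; j < ht.size(); ++j) {
        const auto it = table.find(ht[j]);
        if (it == table.end()) continue;
        for (std::size_t i : it->second) {
            if (sigS[i] != sigT[j]) continue;  // cheap signature check first
            if (s.compare(i, L, t, j, L) == 0) {
                return {static_cast<long long>(i), static_cast<long long>(j)};
            }
        }
    }
    return {-1, -1};
}
```

For "algorithms" and "logarithm" with L = 5, the shared substring is "rithm", which starts at index 4 in both strings.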

Two-dimensional expansion

http://novoland.github.io/%E7%AE%97%E6%B3%95/2014/07/26/Hash%20&%20Rabin-Karp%E5%AD%97%E7%AC%A6%E4%B8%B2% E6%9f%a5%e6%89%be%e7%ae%97%e6%b3%95.html
References:
http://blog.csdn.net/yanghua_kobe/article/details/8914970
http://novoland.github.io/%E7%AE%97%E6%B3%95/2014/07/26/Hash%20&%20Rabin-Karp%E5%AD%97%E7%AC%A6%E4%B8%B2% E6%9f%a5%e6%89%be%e7%ae%97%e6%b3%95.html
http://blog.csdn.net/chenhanzhun/article/details/39895077
