[Knowledge Point] string hash

Source: Internet
Author: User
Tags: modulus, blizzard

1. Preface

The main string algorithms have been covered before. Now let's talk about something that is not really an algorithm but is very commonly used: string hashing.

2. The concept of hash

I won't go into the detailed concept of hashing. Its role is to express a complex state in a simple form that is easier to compare and check for duplicates. Similar tricks are used in search and in dynamic programming. It also matters a great deal in practice, which is why things like Blizzard's hash algorithm exist. Hashing is also central to cryptography, with MD5 as a classic example.

There are many ways to hash a string; the focus here is on the simplest and most commonly used ones.

3. BKDRHash

For contest participants, BKDRHash should be the first choice. According to various tests and studies, it has the fewest collisions among the simple hashes, and the procedure is easy to understand. Let's look at a short piece of code:

------------------------------------------------------------------------------------------------------

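#include <cstring>

#define X 131
#define MOD 1000000007
#define MAXN 100005  // assumed upper bound on the string length

int h[MAXN];   // h[i] = hash of the prefix a[0..i]
char a[MAXN];

void GetHash()
{
    int len = strlen(a);
    for (int i = 0; i < len; i++) {
        // The hash of the empty prefix is 0; 64-bit math avoids overflow in prev * X.
        long long prev = (i > 0) ? h[i - 1] : 0;
        h[i] = (prev * X + (unsigned char)a[i]) % MOD;
    }
}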

------------------------------------------------------------------------------------------------------

This method is similar to positional number notation (like binary, but in base X). Each character of string a is weighted by a power of X that depends on its position. Ignoring the modulus for the moment, when X is at least the number of distinct characters, hash values and strings map one to one, and the hash value is only meaningful when that one-to-one correspondence holds.

But obviously there is a serious problem. If strings use only 10 distinct characters and have length at most 8, the hash value is bounded by 10^8, and the hash values can still be used as indices into an array. In many cases, though, neither the string length nor the character set is that small. This is where I was confused at first, as there seemed to be no solution. At this point we can only sacrifice perfect correctness to fit the constraints.

For a number that exceeds a certain limit, we take it modulo some value. Does that look like a joke? Let's briefly analyze how often duplicates occur: on random data, with hash values taken modulo 1e9+7, the collision rate is around 0.05%. In algorithm contests this error rate is almost negligible.

Someone will ask: since the accuracy is not 100%, what if the problem setter is nasty and builds data specifically to break your hash? Keep in mind that string hashing is very flexible. As the code above shows, X and MOD are defined by us and do not depend on the problem's data (except that X must not be smaller than the size of the character set).

In the code above X is set to 131. I do not know why this particular number is used. Many online solutions and tutorials also use 131, but in fact the restriction is not that strict. Any number greater than or equal to the size of the character set works; for example, if the problem says the strings are all lowercase letters, X >= 26 is enough. X should not be something that test data can target.
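Since X and MOD are our own choice, one way to keep them easy to swap is to pass them in as parameters. The following is only a minimal sketch: the function name bkdrHash and its signature are made up for illustration, and it assumes that mod * x fits in a 64-bit integer.

```cpp
#include <cassert>
#include <string>

// BKDR-style polynomial hash with a caller-chosen base x and modulus mod.
// Assumption: mod * x fits in 64 bits (true for mod near 1e9 and a small x).
long long bkdrHash(const std::string& s, long long x, long long mod)
{
    assert(x > 0 && mod > 0);
    long long h = 0;
    for (unsigned char c : s) {
        // Shift the running value by one base-x "digit", then add the character.
        h = (h * x + c) % mod;
    }
    return h;
}
```

For example, bkdrHash("abc", 131, 1000000007) evaluates (97 * 131 + 98) * 131 + 99 = 1677554, which is below the modulus and therefore unchanged by it.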

The key is the choice of MOD. Going by various statements online and from seniors, there are many tricks for picking it. The larger the value the better, of course, and on top of that it is best to pick a prime. Why so much care? In NOIP and other relatively basic contests and exams it may not matter, but in bigger ones such as HNOI/CTSC there have reportedly been test sets built against specific moduli. Values such as 1e9+7, 1e6+7 and 1e9+1 are common, so some setters deliberately target them and some contestants lose points. In that case pick your own modulus: any fairly large number will do, preferably a prime. You could start from something like your birthday, say 19990522, which nobody will guess (that particular value is even, so a nearby prime would be the better pick).

On average, with pseudo-random data, wrong answers start to appear once MOD <= 5e6, so choose it with care.
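To see how a modulus behaves on your own data, a small experiment along the following lines may help. It is a sketch under assumptions: the helper names (hashWith, countCollisions, makeWords) are hypothetical, the test strings are random lowercase words from a seeded generator, and the numbers you get depend entirely on the data, so none are claimed here.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <string>
#include <unordered_set>
#include <vector>

// Polynomial hash with base x and modulus mod (assumes mod * x fits in 64 bits).
long long hashWith(const std::string& s, long long x, long long mod)
{
    assert(x > 0 && mod > 0);
    long long h = 0;
    for (unsigned char c : s) h = (h * x + c) % mod;
    return h;
}

// Counts how many strings land on a hash value that an earlier string already took.
// The caller is expected to pass pairwise distinct strings.
std::size_t countCollisions(const std::vector<std::string>& words, long long x, long long mod)
{
    std::unordered_set<long long> seen;
    std::size_t collisions = 0;
    for (const std::string& w : words) {
        if (!seen.insert(hashWith(w, x, mod)).second) ++collisions;
    }
    return collisions;
}

// Generates n distinct random lowercase strings of length len.
// Assumes n does not exceed 26^len, otherwise the loop never finishes.
std::vector<std::string> makeWords(std::size_t n, std::size_t len, unsigned seed)
{
    std::mt19937 rng(seed);
    std::unordered_set<std::string> unique;
    while (unique.size() < n) {
        std::string w(len, 'a');
        for (char& c : w) c = static_cast<char>('a' + rng() % 26);
        unique.insert(w);
    }
    return std::vector<std::string>(unique.begin(), unique.end());
}
```

Calling countCollisions on the same word list with a tiny modulus and then with a large one shows the trend: with 1000 distinct words and a modulus of 97, at least 903 collisions are unavoidable, because only 97 hash values exist.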

3. Multiple hash values

I believe many people, like me, still have worries, so there are several ways to make the collision rate smaller. In an algorithm contest that 0.05% hardly seems worth mentioning, but for large projects such as software, systems and games, which face a huge number of users worldwide, it can easily cause bugs. Blizzard has its own efficient and ingenious hash function, which is not covered here; they apparently still did not fully trust it and decided to check with multiple hashes.

Suppose we give each string two hash values (BKDRHash and another kind; see the various tutorials on the web), and we consider two strings equal if and only if both hash values match. By the multiplication principle, the collision rate becomes the product of the collision rates of the two hash methods, which is a very small number.

Similarly, you can use 3 hashes or 4 hashes, but the cost rises as the number grows, so two or three are generally enough. Blizzard's scheme has a collision rate of 1 : 1.88894659314786e+22, about 10^(-22.3), which is safe enough for a game program.
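The two-hash comparison can be sketched as follows. As a simplification, this sketch does not use two different kinds of hash; it runs the same polynomial hash with two (base, modulus) pairs. The constants 13331 and 998244353 and the names polyHash, doubleHash and probablyEqual are example choices of mine, not something the text prescribes.

```cpp
#include <cassert>
#include <string>
#include <utility>

using HashPair = std::pair<long long, long long>;

// Polynomial hash with base x and modulus mod (assumes mod * x fits in 64 bits).
long long polyHash(const std::string& s, long long x, long long mod)
{
    assert(x > 0 && mod > 0);
    long long h = 0;
    for (unsigned char c : s) h = (h * x + c) % mod;
    return h;
}

// Two hash values per string, computed with two example parameter sets.
HashPair doubleHash(const std::string& s)
{
    return { polyHash(s, 131, 1000000007LL), polyHash(s, 13331, 998244353LL) };
}

// Strings are treated as equal only when both hash values match.
// A "true" result is still probabilistic; a "false" result is always correct.
bool probablyEqual(const std::string& a, const std::string& b)
{
    return doubleHash(a) == doubleHash(b);
}
```

Storing the pair instead of a single value doubles the memory per string, which is the rising cost that comes with each extra hash.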

4. Hash with chaining

Compared with chaining, the multi-hash approach above is easier to understand, and Blizzard never uses chaining, but as a method it still has to be mentioned here. Chaining is the way to truly guarantee perfect correctness. As before, code first.

------------------------------------------------------------------------------------------------------

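#include <vector>

#define MOD 1000003  // number of buckets (assumed value; must be small enough to allocate)

std::vector<int> h[MOD];  // h[t] holds every original value whose remainder is t

void Add(int o)  // o is assumed to be non-negative
{
    int t = o % MOD;
    for (size_t i = 0; i < h[t].size(); i++)
        if (h[t][i] == o) return;  // already stored, nothing to do
    h[t].push_back(o);
}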

------------------------------------------------------------------------------------------------------

The container that holds the hash values is a vector, which saves space and acts like a linked list (you can also write your own linked list). For a number whose value after the modulus is t, the corresponding vector is h[t]. We store the original, pre-modulus value o in h[t], so checking for a conflict only requires scanning that vector.

Correctness is indeed guaranteed this way, but in exchange the time and space complexity are less than ideal. It can of course be used when the limits allow it.
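The same idea carries over from integers to strings: each bucket keeps the original strings, and a lookup compares them in full, so a hash collision can never produce a wrong answer. Below is a minimal sketch; the class name ChainedStringSet, its methods and the base 131 inside indexOf are illustrative assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

class ChainedStringSet {
public:
    explicit ChainedStringSet(std::size_t bucketCount) : buckets_(bucketCount)
    {
        assert(bucketCount > 0);
    }

    // Returns true if s was newly inserted, false if it was already present.
    bool add(const std::string& s)
    {
        std::vector<std::string>& chain = buckets_[indexOf(s)];
        for (const std::string& stored : chain)
            if (stored == s) return false;  // full comparison, not just the hash
        chain.push_back(s);
        return true;
    }

    bool contains(const std::string& s) const
    {
        for (const std::string& stored : buckets_[indexOf(s)])
            if (stored == s) return true;
        return false;
    }

private:
    // Polynomial hash reduced modulo the number of buckets.
    std::size_t indexOf(const std::string& s) const
    {
        unsigned long long h = 0;
        for (unsigned char c : s) h = (h * 131 + c) % buckets_.size();
        return static_cast<std::size_t>(h);
    }

    std::vector<std::vector<std::string>> buckets_;
};
```

Even with a single bucket, where every string collides, the answers stay correct; only the scan gets slower, which is the time cost mentioned above.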

5. Summary

String hashing is extremely versatile. It can often stand in for KMP, suffix arrays (SA) and similar algorithms.
