Data structure: Hash Table


http://www.cnblogs.com/lucifer1982/archive/2008/06/18/1224319.html

Angel Lucifer

Introduction

This article still does not cover parallelism or concurrency.

Many Chinese books translate "hash table" one way, though I personally prefer another rendering; in this article I will simply say hash table.

A hash table supports key-value based insert, retrieval, and delete operations.

For example, in .NET 1.x, we can use it like this:

using System.Collections;

namespace Lucifer.CSharp.Sample
{
    class Program
    {
        public static void Main()
        {
            Hashtable table = new Hashtable();

            // Insert operations
            table[1] = "A";
            table.Add(2, "B");
            table[3] = "C";

            // Retrieval operations
            string a = (string)table[1];
            string b = (string)table[2];
            string c = (string)table[3];

            // Delete operations
            table.Remove(1);
            table.Remove(2);
            table.Remove(3);
        }
    }
}

In .NET 2.0 and above, we use the following:

using System.Collections.Generic;

namespace Lucifer.CSharp.Sample
{
    class Program
    {
        public static void Main()
        {
            Dictionary<int, string> table =
                new Dictionary<int, string>();

            // Insert operations
            table[1] = "A";
            table.Add(2, "B");
            table[3] = "C";

            // Retrieval operations
            string a = table[1];
            string b = table[2];
            string c;
            table.TryGetValue(3, out c);

            // Delete operations
            table.Remove(1);
            table.Remove(2);
            table.Remove(3);
        }
    }
}

It is well known that if we know an index into an array, we can access the value at that position directly. In the same way, in a hash table, all we have to do is determine the position of a value in the table from its key; the key's only role is to indicate a position. Locating a value by key this way means lookup time drops from O(n) for sequential search, or O(log n) for binary search, to O(1).

So how do we convert a key that might be a string, a number, or something else into a table index? In .NET, this step is done by the GetHashCode method. Of course, the hash table still needs to compute further on the hash code, but for now let us assume that calling the key's GetHashCode method is enough to find the value. In fact, external developers really do not need to do any more work than that. This is why the Object class has a GetHashCode virtual method. When we use collections such as Stack&lt;T&gt;, List&lt;T&gt;, or Queue&lt;T&gt;, there is no need to care about GetHashCode; but if you want to use Dictionary&lt;TKey, TValue&gt; or HashSet&lt;T&gt; (new in .NET 3.5), you must override GetHashCode correctly, otherwise these collections will not work properly. Of course, the .NET primitive types pose no problem, because Microsoft has already implemented good overrides for them.
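To make the override requirement concrete, here is a minimal sketch of a custom key type used with Dictionary&lt;TKey, TValue&gt;. The Point type, its fields, and the 31 multiplier are this sketch's own illustrative choices, not something prescribed by .NET; the only real rule is that equal keys must return equal hash codes.

```csharp
using System;
using System.Collections.Generic;

// Illustrative key type; field names and the 31 multiplier are arbitrary choices.
struct Point : IEquatable<Point>
{
    public readonly int X;
    public readonly int Y;
    public Point(int x, int y) { X = x; Y = y; }

    public bool Equals(Point other) { return X == other.X && Y == other.Y; }
    public override bool Equals(object obj) { return obj is Point && Equals((Point)obj); }

    // Equal points must return equal hash codes, or Dictionary lookups will fail.
    public override int GetHashCode() { return unchecked(X * 31 + Y); }
}

class Program
{
    static void Main()
    {
        var table = new Dictionary<Point, string>();
        table[new Point(1, 2)] = "A";
        // A different but equal instance finds the same entry.
        Console.WriteLine(table[new Point(1, 2)]);  // A
    }
}
```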

In data structure textbooks, the work done by the GetHashCode method is called a "hash function".

Hash Function

So what does a hash function do? Typically, it maps a number, or a type that can be converted to a number, to a value with a fixed number of bits. For example, .NET's GetHashCode method returns a 32-bit signed integer. When we map 64 or more bits down to 32 bits, an awkward problem clearly arises: two or more different keys may hash to the same location, causing a collision. This is hard to avoid, because there are more possible keys than locations. Of course, if collisions could be avoided entirely, that would be perfect; a hash function that achieves this is called a perfect hash function.
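A collision is easy to demonstrate. With the common 31-multiplier string hash (the same scheme Java's String.hashCode uses, shown later in this article), the two-character strings "Aa" and "BB" hash to the same value; the sketch below is purely illustrative:

```csharp
using System;

class CollisionDemo
{
    // Java-style 31-multiplier string hash, for demonstration only.
    static int Hash31(string s)
    {
        int h = 0;
        foreach (char c in s)
            h = unchecked(31 * h + c);
        return h;
    }

    static void Main()
    {
        // 'A'=65, 'a'=97: 31*65 + 97 = 2112; 'B'=66: 31*66 + 66 = 2112.
        Console.WriteLine(Hash31("Aa"));  // 2112
        Console.WriteLine(Hash31("BB"));  // 2112
    }
}
```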

In definition and implementation, a hash function resembles a pseudo-random number generator (PRNG): generally, a hash function whose output is comparable to a PRNG's is of acceptable quality. In theory, a perfect hash function exists and never lets data collide. In practice, it is very hard to find such a function, and harder still to find an application that can use one; even its minimal variant (the minimal perfect hash function) has quite limited applicability.

In practice, data comes in all sorts of arrangements. Some are very random, others highly structured. It is difficult for one hash function to generalize over all data types, or even over all data of one particular type. All we can do is try to find the hash function that best suits our needs. This is also one of the reasons you must override the GetHashCode method yourself.

Here are the two main criteria for analyzing and selecting a hash function:

    1. Data distribution. This measures how well the hash function spreads its hash values. Analyzing it requires knowing the number of collisions that occur within the data set, that is, the number of non-unique hash values.
    2. Efficiency. This measures how quickly the hash function generates a hash value. In theory, hash functions are very fast, but note that a hash function does not always run in O(1) time.

So how do you implement a hash function? There are basically two broad approaches:

    1. Addition and multiplication. The idea is to traverse the data and build up the hash value through some form of arithmetic, usually multiplying the running value by a prime number at each step. Currently there is no mathematical proof of the relationship between prime numbers and hash quality, but in practice certain primes give very good results.
    2. Shifting. As the name implies, the hash value is obtained through bit-shift operations; each step's result is accumulated, and the final value is returned.

Beyond these, there are many other ways to compute a hash value, but they are all variants or combinations of the two approaches above. To be honest, writing a good hash function is largely a matter of experience; there is no cure-all. Fortunately, our predecessors left behind a number of classic hash function implementations, which we will look at next. Note that the hash functions described in this article must not be used in areas such as encryption or digital signatures.

The hash functions for integer and floating-point types are simple, so they are not elaborated here; interested readers can inspect their GetHashCode implementations with a decompiler such as Reflector. It is worth mentioning that floating-point types must take care that +0.0 and -0.0 hash to the same value, and that the 128-bit decimal type needs its own implementation.

A few string hash functions are described in detail next.

Let's first look at what Java's string hash function looks like. Note that the code in this article is written in C#, here and below:

public int JavaHash(string str)
{
    int hashCode = 0;
    for (int i = 0; i < str.Length; i++)
    {
        // Java's String.hashCode uses 31 as the multiplier.
        hashCode = 31 * hashCode + str[i];
    }
    return hashCode;
}

The hash function above is, generally speaking, quite good, although if the strings are very long we may need to modify it. It actually comes from K &amp; R's "The C Programming Language". The prime multiplier can be replaced by 131, 1313, 13131, 131313, and so on. It looks similar to the hash function below.

public int DJBHash(string str)
{
    int hashCode = 5381;
    for (int i = 0; i < str.Length; i++)
    {
        hashCode = ((hashCode << 5) + hashCode) + str[i];
    }
    return hashCode;
}

This function was first published by Professor Daniel J. Bernstein in the newsgroup comp.lang.c, and it is one of the most efficient hash functions known.

Now let's take a look at the string hash function in .NET. The code is as follows:

public unsafe int DotNetHash(string str)
{
    fixed (char* charPtr = new String(str.ToCharArray()))
    {
        int hashCode = (5381 << 16) + 5381;
        int numeric = hashCode;
        int* intPtr = (int*)charPtr;

        // Each int reads two chars at once; advance four chars per iteration.
        for (int i = str.Length; i > 0; i -= 4)
        {
            hashCode = ((hashCode << 5) + hashCode +
                (hashCode >> 27)) ^ intPtr[0];
            if (i <= 2)
            {
                break;
            }
            numeric = ((numeric << 5) + numeric +
                (numeric >> 27)) ^ intPtr[1];
            intPtr += 2;
        }
        return hashCode + numeric * 1566083941;
    }
}

The code above is actually a variant of a hash function Donald Knuth gives in "The Art of Computer Programming", Volume 3. Because Knuth's hash function is problematic in some cases, .NET does not adopt it wholesale. Knuth's hash function is as follows:

public int DEKHash(string str)
{
    int hashCode = str.Length;
    for (int i = 0; i < str.Length; i++)
    {
        hashCode = ((hashCode << 5) ^ (hashCode >> 27)) ^ str[i];
    }
    return hashCode;
}

In addition, there is a hash function widely used on UNIX platforms, the ELF hash. The code is as follows:

public int ELFHash(string str)
{
    int hashCode = 0;
    int numeric = 0;
    for (int i = 0; i < str.Length; i++)
    {
        hashCode = (hashCode << 4) + str[i];
        if ((numeric = hashCode & unchecked((int)0xF0000000)) != 0)
        {
            hashCode ^= (hashCode >> 24);
        }
        hashCode &= ~numeric;
    }
    return hashCode;
}

As mentioned earlier, hash functions bring collisions. So how do we resolve them? Stay tuned for the next part.

Finally, a question for you: which is faster, multiplication or bit shifting?

In the previous part, we learned that hash functions can cause keys to collide.

So how does the .NET Hashtable class solve this problem?

Very simply: by probing.

We first use the hash function GetHashCode() to obtain the key's hash value. To guarantee that the value falls within the array's index range, we take it modulo the array size. This gives the actual position of the key's value within the array, i.e. f(k) = (GetHashCode() &amp; 0x7FFFFFFF) % Array.Length.

When the hash values of several keys coincide (that is, when a collision occurs), the algorithm tries to place the value at the next suitable position; if that position is already occupied, it keeps looking until it finds a free one. The more collisions there are, the more probes are needed and the lower the efficiency. (Linear probing, quadratic probing, and double hashing all search this way; they differ only in how the probe offset is computed. The .NET Hashtable class uses double hashing.)
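The index computation and a double-hashing probe sequence can be sketched as follows. This is a simplified illustration, not the exact internals of Hashtable; the step formula is one common way to derive a second hash:

```csharp
using System;

class ProbeSketch
{
    // Map a key's hash code into the array's index range.
    static int BucketIndex(int hashCode, int tableSize)
    {
        return (hashCode & 0x7FFFFFFF) % tableSize;
    }

    // Double hashing: the probe step comes from a second hash of the key,
    // so different colliding keys follow different probe sequences.
    static int Probe(int hashCode, int attempt, int tableSize)
    {
        int h = hashCode & 0x7FFFFFFF;
        int step = 1 + h % (tableSize - 1);  // never zero
        return (h + attempt * step) % tableSize;
    }

    static void Main()
    {
        // Hash codes 5 and 16 both land in bucket 5 of an 11-slot table...
        Console.WriteLine(BucketIndex(5, 11));   // 5
        Console.WriteLine(BucketIndex(16, 11));  // 5
        // ...but their probe steps differ, so their sequences diverge.
        Console.WriteLine(Probe(5, 1, 11));      // (5 + 1*6) % 11 = 0
        Console.WriteLine(Probe(16, 1, 11));     // (16 + 1*7) % 11 = 1
    }
}
```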

If the hash table is close to saturation, finding a suitable free position becomes difficult, and collisions become very likely. At that point, the hash table must expand. How do we decide when to expand? Based on the internal array's capacity and the load factor: when the number of elements reaches array size × load factor, it is time to expand.

The default load factor of the .NET Hashtable class appears to be 1.0, but its effective default is actually 0.72: Microsoft decided this value was not easy for developers to remember, so it is presented as 1.0, and every load factor passed to the constructor is multiplied by 0.72 inside the Hashtable class. This is a demanding number; sometimes raising or lowering the load factor by 0.01 can change your Hashtable's access efficiency by 50%. The reason is that the load factor determines the table's capacity, the capacity affects the collision probability of keys, and that in turn affects performance. 0.72 is a well-balanced value arrived at through extensive experiments at Microsoft. (The appropriate value also depends on the collision-resolution algorithm; 0.72 is not necessarily suitable for hash tables of other designs. Java's HashMap&lt;K, V&gt;, for example, uses a default load factor of 0.75.)

Expansion is a surprisingly time-consuming internal operation; Hashtable's write efficiency is only about 1/10 of its read efficiency, and frequent expansion is one of the reasons. When expanding, the hash table allocates a new, larger array, copies the contents of the old array into it, and re-hashes everything. Even the size of the new array is carefully chosen: a hash table's initial capacity is generally a prime number, and when expanding, the new array's size is set to a prime close to double the original array's size. To avoid the extra overhead of generating primes at runtime, .NET keeps a precomputed array of commonly used primes, as shown below:

internal static readonly int[] primes =
{
    3, 7, 11, 17, 23, 29, 37, 47, 59, 71, 89, 107,
    131, 163, 197, 239, 293, 353, 431, 521, 631, 761,
    919, 1103, 1327, 1597, 1931, 2333, 2801, 3371,
    4049, 4861, 5839, 7013, 8419, 10103, 12143, 14591,
    17519, 21023, 25229, 30293, 36353, 43627, 52361,
    62851, 75431, 90523, 108631, 130363, 156437,
    187751, 225307, 270371, 324449, 389357, 467237,
    560689, 672827, 807403, 968897, 1162687, 1395263,
    1674319, 2009191, 2411033, 2893249, 3471899,
    4166287, 4999559, 5999471, 7199369
};

When the required size exceeds the largest precomputed prime, a prime-generation algorithm is used to obtain a prime close to twice the current size. Under normal circumstances we may not store that much content, and the careful reader may notice how much memory this consumes. Indeed, it is quite expensive in memory. For example, suppose we want to add 8 elements to a Hashtable of capacity 11. Since 8/11 &gt; 0.72, the table must expand, and by the algorithm the prime near 2 × 11 is 23. See how wasteful that is; even if we set the capacity to 17 through the constructor, 9 slots are wasted. If you need key-value mapping and are more demanding about memory, consider a dictionary or map built on a red-black tree.
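The expansion arithmetic in this example can be checked directly. The 0.72 factor and the primes come from the discussion above; the helper names are this sketch's own:

```csharp
using System;

class ExpandSketch
{
    // Prefix of the precomputed prime table shown above.
    static readonly int[] primes = { 3, 7, 11, 17, 23, 29, 37, 47, 59, 71 };

    static bool NeedsExpand(int count, int capacity, double loadFactor)
    {
        return count > capacity * loadFactor;
    }

    // Smallest precomputed prime at least twice the current capacity.
    static int NextCapacity(int capacity)
    {
        foreach (int p in primes)
            if (p >= capacity * 2) return p;
        return capacity * 2 + 1;  // fallback; the real code generates a prime
    }

    static void Main()
    {
        // 8 elements in an 11-slot table: 8 > 11 * 0.72 = 7.92, so expand.
        Console.WriteLine(NeedsExpand(8, 11, 0.72));  // True
        Console.WriteLine(NextCapacity(11));          // 23
    }
}
```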

What about Dictionary&lt;TKey, TValue&gt;?

It does not use the Hashtable approach; instead it uses a more popular, space-saving scheme: separate chaining.

Using separate chaining, Dictionary&lt;TKey, TValue&gt; internally maintains an array of linked lists. For this array of lists L0, L1, ..., Lm-1, the hash function tells us into which list an element x should be inserted and, during lookup, which list contains x. The idea is that although searching a linked list is linear, if each list is short enough the search is very fast (and that is indeed the case; lookup, insert, and delete are not always O(1)). In particular, this scheme is not limited by the load factor.
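A minimal separate-chaining table makes the idea concrete. This is a sketch of the technique only; Dictionary&lt;TKey, TValue&gt;'s real layout is different and more compact:

```csharp
using System;
using System.Collections.Generic;

// Sketch of a separate-chaining hash table: one list of pairs per bucket.
class ChainedTable<TKey, TValue>
{
    private readonly List<KeyValuePair<TKey, TValue>>[] buckets;

    public ChainedTable(int size)
    {
        buckets = new List<KeyValuePair<TKey, TValue>>[size];
        for (int i = 0; i < size; i++)
            buckets[i] = new List<KeyValuePair<TKey, TValue>>();
    }

    private int Index(TKey key)
    {
        return (key.GetHashCode() & 0x7FFFFFFF) % buckets.Length;
    }

    public void Add(TKey key, TValue value)
    {
        buckets[Index(key)].Add(new KeyValuePair<TKey, TValue>(key, value));
    }

    public bool TryGet(TKey key, out TValue value)
    {
        // Colliding keys share a bucket; scanning one chain is linear.
        foreach (var pair in buckets[Index(key)])
            if (EqualityComparer<TKey>.Default.Equals(pair.Key, key))
            {
                value = pair.Value;
                return true;
            }
        value = default(TValue);
        return false;
    }
}

class Program
{
    static void Main()
    {
        var table = new ChainedTable<int, string>(11);
        table.Add(1, "A");
        table.Add(12, "B");  // 12 % 11 == 1: collides with key 1, same chain
        string v;
        table.TryGet(12, out v);
        Console.WriteLine(v);  // B
    }
}
```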

In this case, the common load factor is 1.0; a lower load factor does not significantly improve performance, yet it requires extra space. The default load factor of Dictionary&lt;TKey, TValue&gt; is 1.0, and Microsoft evidently saw no need to let it be modified: none of the Dictionary&lt;TKey, TValue&gt; constructors accept a load factor. Java's HashMap&lt;K, V&gt; defaults to 0.75, the rationale being that it reduces retrieval time. In my own tests, Java's HashMap&lt;K, V&gt; did retrieve slightly faster than .NET's Dictionary&lt;TKey, TValue&gt;, but the gap was very small. At the same time, HashMap&lt;K, V&gt;'s insertion time was far worse than Dictionary&lt;TKey, TValue&gt;'s, nearly 3 to 8 times the latter. At first I thought this was an illusion, because HashMap&lt;K, V&gt; does not use a modulo operation but a shift, and its capacity grows in powers of 2, both of which should be speedups. I am quite puzzled by this and would welcome an explanation.

The attraction of separate chaining is not only that performance is unaffected when the load factor rises moderately, but also that it can avoid the time-consuming re-hashing on expansion.

Finally, when using Hashtable or Dictionary&lt;TKey, TValue&gt; in an application, try to estimate the number of elements to be inserted in advance, since this effectively avoids expansion and re-hashing operations. At the same time, keep the load factor at 1.0 whenever possible.

