The basic principle and implementation of the hash table


This post introduces the principle and implementation of the hash table, a common data structure. Since my own knowledge is limited, the article may contain inaccurate or unclear passages; corrections are welcome.

I. Overview

A symbol table is a data structure for storing key-value pairs. The arrays we use every day can be seen as a special kind of symbol table in which the "key" is an array index and the value is the corresponding array element. In other words, when all the keys in a symbol table are small integers, we can implement the symbol table with an array: the index serves as the key, and the element at that index is the value associated with the key. This representation, however, only works when all keys are relatively small integers; otherwise we might need an enormous array. A hash table is an "upgrade" of this strategy that supports arbitrary keys without such restrictions. For a symbol table implemented with a hash table, looking up a key takes two steps:

    • First, we use a hash function to convert the given key into an "array index". Ideally, different keys would map to different indexes, but in practice different keys can map to the same index; this is called a collision. Ways to resolve collisions are described in detail later.
    • With the index in hand, we can access the corresponding key-value pair just as we would access an array element.

This is the core idea of the hash table, and it is a classic example of the tradeoff between time and space. If space were unlimited, we could use an enormous array and use keys directly as array indexes; since space is unrestricted, keys could be arbitrarily large, and finding any key would take only a single array access. Conversely, if there were no time limit on lookups, we could keep all key-value pairs in a linked list, minimizing space usage but searching sequentially. In practice both time and space are limited, so we must strike a balance between the two, and the hash table finds a good balance point between them. A further advantage of hash tables is that we can adjust this time-space tradeoff simply by tuning the parameters of the hashing algorithm, without changing any other part of the code.

II. Hash functions

Before introducing hash functions, let's cover a few basic hash table concepts. Inside a hash table we use buckets to hold the key-value pairs; the array index mentioned earlier is the bucket number, which determines which bucket of the hash table a given key is stored in. The number of buckets a hash table owns is called the capacity of the hash table.

Now suppose our hash table has M buckets, numbered 0 to M-1. The job of the hash function is to convert any given key into an integer in [0, M-1]. We have two basic requirements for a hash function: it should be fast to compute, and it should spread the keys among the buckets as evenly as possible. Different types of keys need different hash functions to ensure a good hash distribution.

The hash function we use should satisfy the uniform hashing hypothesis as far as possible. The following definition of the hypothesis comes from Sedgewick's Algorithms:

(Uniform hashing hypothesis) The hash function we use uniformly and independently distributes all keys among the integers between 0 and M-1.

There are two keywords in the definition above. The first is uniform: each key has M candidate bucket numbers, and uniformity requires that each of the M values be chosen with equal probability. The second is independent: whether each bucket number is chosen is independent of whether any other bucket number has been chosen. Satisfying both uniformity and independence ensures that keys are distributed across the hash table as evenly as possible, so we will not end up with many key-value pairs hashed into the same bucket while many other buckets sit empty.
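
To get an intuitive feel for what an even distribution looks like (this demo is my own illustration, not from the original text), the following sketch hashes a batch of random string keys into M buckets and tallies how many keys land in each; with a hash function that roughly satisfies the hypothesis, the counts should be close to one another:

import java.util.Random;

public class BucketDistributionDemo {
    public static void main(String[] args) {
        final int M = 16;                  // number of buckets, an arbitrary choice
        int[] counts = new int[M];
        Random random = new Random(42);    // fixed seed so the run is reproducible

        // hash 16000 random string keys and count how many fall into each bucket
        for (int i = 0; i < 16000; i++) {
            String key = "key-" + random.nextInt(1000000);
            int bucket = (key.hashCode() & 0x7fffffff) % M; // clear the sign bit, then mod M (explained in part 4 below)
            counts[bucket]++;
        }
        for (int b = 0; b < M; b++) {
            System.out.println("bucket " + b + ": " + counts[b]);
        }
    }
}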

Designing a good hash function that satisfies the uniform hashing hypothesis is obviously not easy. The good news is that we usually don't need to design one, because we can directly use existing, efficient implementations whose quality is backed by probability and statistics. For example, many commonly used classes in Java override the hashCode method (by default, the hashCode method of the Object class returns a value derived from the object's memory address), which returns a hash code for objects of that type. We can then usually obtain a bucket number by taking this hash code modulo the number of buckets M. Let's look at a few Java classes to see how hash functions are implemented for different data types.

1. The hashCode method of the String class

The hashCode method of the String class is as follows:

public int hashCode() {
    int h = hash;
    if (h == 0 && value.length > 0) {
        char val[] = value;
        for (int i = 0; i < value.length; i++) {
            h = 31 * h + val[i];
        }
        hash = h;
    }
    return h;
}

The value field referenced in hashCode is a char[] array storing the characters of the string. We can see that at the very beginning of the method the field hash is assigned to h; hash holds the result of a previous computation, so if this string's hash code has been computed before, we don't have to compute it again and can simply return the cached value. This hash-code caching strategy works only for immutable objects, because the hash code of an immutable object never changes.

From the code above we know that if h is 0, we are computing the hash code for the first time, and the body of the if statement is the actual computation. Suppose our string object str contains 4 characters, and ck denotes the k-th character of the string (counting from 0); then str's hash code equals 31 * (31 * (31 * c0 + c1) + c2) + c3.
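
As a quick sanity check (a demo of my own, not part of the JDK source), we can expand this formula by hand for a 4-character string and compare it with the built-in String.hashCode:

public class StringHashDemo {
    public static void main(String[] args) {
        String str = "abcd";
        int c0 = str.charAt(0), c1 = str.charAt(1), c2 = str.charAt(2), c3 = str.charAt(3);
        // 31 * (31 * (31 * c0 + c1) + c2) + c3, written out explicitly
        int manual = 31 * (31 * (31 * c0 + c1) + c2) + c3;
        System.out.println(manual == str.hashCode()); // prints true
    }
}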

2. The hashCode methods of numeric types

Here we take Integer and Double as examples to introduce the typical implementation of hashCode for the numeric types.

The hashCode method of the Integer class is as follows:

public int hashCode() {
    return Integer.hashCode(value);
}

public static int hashCode(int value) {
    return value;
}

Here value is the int that the Integer object wraps, so the hashCode method of the Integer class simply returns the wrapped value itself.

Let's take another look at the hashCode method of the Double class:

@Override
public int hashCode() {
    return Double.hashCode(value);
}

public static int hashCode(double value) {
    long bits = doubleToLongBits(value);
    return (int) (bits ^ (bits >>> 32));
}

We can see that the hashCode method of the Double class first converts its value to the 64-bit long representation, then returns the XOR of the low 32 bits and the high 32 bits as the hash code.
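
This bit manipulation is easy to reproduce (an illustrative snippet of mine, not JDK code):

public class DoubleHashDemo {
    public static void main(String[] args) {
        double value = 3.14;
        long bits = Double.doubleToLongBits(value);  // 64-bit IEEE 754 representation
        // XOR the high 32 bits into the low 32 bits, then truncate to an int
        int manual = (int) (bits ^ (bits >>> 32));
        System.out.println(manual == Double.valueOf(value).hashCode()); // prints true
    }
}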

3. The hashCode method of the Date class

The types introduced so far can all be viewed as numeric (a string can be viewed as an array of integers), so how do we compute the hash code of a non-numeric object? Here we take the Date class as a brief example. Its hashCode method is as follows:

public int hashCode() {
    long ht = this.getTime();
    return (int) ht ^ (int) (ht >> 32);
}

As we can see, the implementation is very simple: it returns the XOR of the low 32 bits and the high 32 bits of the timestamp that the Date object encapsulates. The Date class illustrates that, to compute a hash code for a non-numeric type, we should pick instance fields that differentiate instances of the class as the factors of the computation. For Date, two objects with the same timestamp are usually considered equal and therefore must have the same hash code. One point needs emphasizing: if two objects are equal (that is, calling equals returns true), their hash codes must be the same; the converse, however, does not hold, as two objects with the same hash code are not necessarily equal.
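
Here is a tiny illustration of that contract with Date (my own example):

import java.util.Date;

public class DateHashDemo {
    public static void main(String[] args) {
        Date d1 = new Date(1000L);  // two Date objects wrapping the same timestamp
        Date d2 = new Date(1000L);
        System.out.println(d1.equals(d2));                  // true
        System.out.println(d1.hashCode() == d2.hashCode()); // true, as the contract requires
    }
}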

4. Getting the bucket number from the hash code

We have introduced several ways to compute an object's hash code; how do we get the bucket number once we have the hash code? A straightforward approach is to take the hash code modulo the capacity (the number of buckets) and use the remainder as the bucket number. However, hashCode in Java returns an int, and int in Java is signed, so using the returned hash code directly might give us a negative number, and a bucket number obviously cannot be negative. So we first convert the returned hash code to a non-negative integer, then take it modulo the capacity to get the key's bucket number. The code is as follows:

private int hash(K key) {
    return (key.hashCode() & 0x7fffffff) % M;
}
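
To see why the mask matters, consider a string whose hashCode is negative ("polygenelubricants" is a well-known example); this little check is my own illustration:

public class BucketNumberDemo {
    public static void main(String[] args) {
        int M = 10; // number of buckets
        String key = "polygenelubricants"; // a string whose hashCode is negative
        System.out.println(key.hashCode());                    // a negative number
        System.out.println(key.hashCode() % M);                // negative: unusable as a bucket number
        System.out.println((key.hashCode() & 0x7fffffff) % M); // always in [0, M-1]
    }
}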

Now that we know how to get a bucket number from a key, let's move on to the second step of hash table lookup: dealing with collisions.

III. Using the zipper method (separate chaining) to handle collisions

Different collision-handling methods give different hash table implementations. The first implementation we introduce uses the zipper method (separate chaining) to handle collisions. A hash table implemented this way keeps a linked list in each bucket. Initially all the lists are empty. When a key is hashed to a bucket, that key becomes the first node of the list in that bucket; if another key is then hashed into the same bucket (that is, a collision occurs), the second key becomes the second node of the list, and so on. This way, when the number of buckets is M and the number of key-value pairs stored in the hash table is N, each bucket's list contains N/M nodes on average. When we look up a key, we first determine its bucket via the hash function, which takes O(1) time, and then compare the keys of the nodes in that bucket with the given key, finding the key-value pair in O(N/M) time if it is present. The lookup operation therefore takes O(N/M) time, and since we can usually guarantee that N is at most a constant multiple of M, the time complexity of hash table lookup is O(1); by the same reasoning, insertion is also O(1).

With the description above understood, implementing a hash table based on the zipper method is easy. For simplicity, we directly use SeqSearchST as the linked list inside each bucket (its code is given after the hash table). The reference code is as follows:

public class ChainingHashMap<K, V> {
    private int num;                 // total number of key-value pairs in the hash table
    private int capacity;            // number of buckets
    private SeqSearchST<K, V>[] st;  // array of linked lists

    public ChainingHashMap(int initialCapacity) {
        capacity = initialCapacity;
        st = (SeqSearchST<K, V>[]) new SeqSearchST[capacity];
        for (int i = 0; i < capacity; i++) {
            st[i] = new SeqSearchST<>();
        }
    }

    private int hash(K key) {
        return (key.hashCode() & 0x7fffffff) % capacity;
    }

    public V get(K key) {
        return st[hash(key)].get(key);
    }

    public void put(K key, V value) {
        st[hash(key)].put(key, value);
    }
}
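
A short usage sketch (hypothetical driver code, assuming the SeqSearchST class shown below is on the classpath):

public class ChainingDemo {
    public static void main(String[] args) {
        ChainingHashMap<String, Integer> map = new ChainingHashMap<>(16);
        map.put("apple", 1);
        map.put("banana", 2);
        map.put("apple", 3);                  // overwrites the old value for "apple"
        System.out.println(map.get("apple")); // 3
        System.out.println(map.get("pear"));  // null: the key is absent
    }
}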

In the implementation above we fixed the number of buckets. When we know for certain that the number of key-value pairs to insert will stay within a constant multiple of the number of buckets, a fixed bucket count is perfectly workable. But if the number of key-value pairs can grow far beyond the number of buckets, we need the ability to adjust the bucket count dynamically. The ratio of the number of key-value pairs in a hash table to the number of buckets is called the load factor. Typically, the smaller the load factor, the shorter the lookup time and the greater the space usage; with a large load factor, lookups take longer but space usage drops. For example, HashMap in the Java standard library is a hash table based on the zipper method, with a default load factor of 0.75. HashMap decides when to grow the bucket count using the relation loadFactor = maxSize / capacity, where maxSize is the maximum number of key-value pairs it can hold before resizing, and loadFactor and capacity (the number of buckets) are either specified by the user at construction or given default values. When the number of key-value pairs in a HashMap reaches maxSize, the number of buckets is increased.
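
Our ChainingHashMap could be given the same ability. The following is only a sketch of mine: it assumes SeqSearchST is extended with a hypothetical keys() method returning an Iterable<K> over the keys in its list, and that put is updated to maintain the num field; growth at a load factor of 0.75 mirrors HashMap's default.

// inside ChainingHashMap; calling "if (num >= 0.75 * capacity) resize(2 * capacity);"
// at the start of put would mirror HashMap's growth policy
private void resize(int newCapacity) {
    ChainingHashMap<K, V> temp = new ChainingHashMap<>(newCapacity);
    for (int i = 0; i < capacity; i++) {
        for (K key : st[i].keys()) {       // keys() is a hypothetical helper, not shown above
            temp.put(key, st[i].get(key)); // rehash each pair into the larger table
        }
    }
    this.st = temp.st;
    this.capacity = temp.capacity;
}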

The code above uses SeqSearchST. This is a symbol table implemented on top of a linked list; it supports adding key-value pairs and uses sequential search to find a given key. Its code is as follows:

public class SeqSearchST<K, V> {
    private Node first;

    private class Node {
        K key;
        V val;
        Node next;

        public Node(K key, V val, Node next) {
            this.key = key;
            this.val = val;
            this.next = next;
        }
    }

    public V get(K key) {
        for (Node node = first; node != null; node = node.next) {
            if (key.equals(node.key)) {
                return node.val;
            }
        }
        return null;
    }

    public void put(K key, V val) {
        // first check whether the key already exists in the list
        Node node;
        for (node = first; node != null; node = node.next) {
            if (key.equals(node.key)) {
                node.val = val;
                return;
            }
        }
        // the key is not in the list: prepend a new node
        first = new Node(key, val, first);
    }
}

IV. Using linear probing to handle collisions

1. Fundamentals and implementation

Linear probing is a method from another family of hash table implementation strategies, called open addressing. The main idea of open addressing is to store N key-value pairs in an array of size M, where M > N; the empty entries of the array are used to resolve collisions.

The main idea of linear probing is that when a collision occurs (a key is hashed to an array position that already holds a key-value pair), we examine the next position of the array; this is called a linear probe. A linear probe can produce three kinds of results:

    • Hit: the key at that position is the same as the key being looked up;
    • Miss: that position is empty;
    • Neither: the key at that position differs from the key being looked up, so we probe the next position.

When we look up a key, we first get an array index from the hash function, then check whether the key at the corresponding position is the same as the given key; if not, we keep probing (wrapping around to the beginning if we reach the end of the array) until we either find the key or hit an empty position. From this probing process we can see that if the array were full, inserting a new key would fall into an infinite loop.

With the above principles understood, it is not hard to implement a hash table based on linear probing. Here we use an array keys to hold the keys in the hash table and an array values to hold the values; the elements at the same position in the two arrays together form one key-value pair of the hash table. The specific code is as follows:

public class LinearProbingHashMap<K, V> {
    private int num;       // number of key-value pairs in the hash table
    private int capacity;  // size of the arrays
    private K[] keys;
    private V[] values;

    public LinearProbingHashMap(int capacity) {
        keys = (K[]) new Object[capacity];
        values = (V[]) new Object[capacity];
        this.capacity = capacity;
    }

    private int hash(K key) {
        return (key.hashCode() & 0x7fffffff) % capacity;
    }

    public V get(K key) {
        int index = hash(key);
        while (keys[index] != null && !key.equals(keys[index])) {
            index = (index + 1) % capacity;
        }
        return values[index]; // the value if the key exists in the hash table, otherwise null
    }

    public void put(K key, V value) {
        int index = hash(key);
        while (keys[index] != null && !key.equals(keys[index])) {
            index = (index + 1) % capacity;
        }
        if (keys[index] == null) { // new key: occupy the empty slot
            keys[index] = key;
            values[index] = value;
            num++;
            return;
        }
        values[index] = value;     // existing key: overwrite the value
    }
}
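
A short usage sketch (hypothetical driver code):

public class LinearProbingDemo {
    public static void main(String[] args) {
        LinearProbingHashMap<String, Integer> map = new LinearProbingHashMap<>(16);
        map.put("one", 1);
        map.put("two", 2);
        map.put("one", 10);                   // overwrites the existing value
        System.out.println(map.get("one"));   // 10
        System.out.println(map.get("three")); // null
    }
}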

2. Dynamically adjusting the size of the array

In the implementation above, the size of the array is fixed and cannot be adjusted dynamically. In practical applications, when the load factor (the ratio of the number of key-value pairs to the size of the array) approaches 1, the time complexity of the lookup operation approaches O(n); and when the load factor equals 1, the while loop in our implementation becomes an infinite loop. Obviously we want neither lookups degraded to O(n) nor an infinite loop, so it is necessary to grow the array dynamically to keep the time complexity of lookups constant. Conversely, when the number of key-value pairs is very small and space is tight, the array can be shrunk dynamically, depending on the actual situation.

To resize the array dynamically, simply add the following check at the beginning of the put method above:

if (num == capacity / 2) {
    resize(2 * capacity);
}

The logic of the resize method is also simple:

private void resize(int newCapacity) {
    LinearProbingHashMap<K, V> temp = new LinearProbingHashMap<>(newCapacity);
    for (int i = 0; i < capacity; i++) {
        if (keys[i] != null) {
            temp.put(keys[i], values[i]);
        }
    }
    keys = temp.keys;
    values = temp.values;
    capacity = temp.capacity;
}
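
Note that the check num == capacity / 2 triggers growth exactly when the load factor reaches 1/2, and doubling the capacity brings it back down to 1/4, which keeps lookups cheap per the analysis below. If memory matters, a symmetric shrink can be added in the same way, for example calling resize(capacity / 2) when num falls to capacity / 8 so that the load factor stays between 1/8 and 1/2 (this shrink is a suggestion of mine, not shown in the code above).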

Regarding the relationship between the load factor and the performance of the lookup operation, here is a conclusion from Algorithms (Sedgewick et al.):

In a linear-probing hash table of size M with N = αM keys (α is the load factor), if the hash function satisfies the uniform hashing hypothesis, the average numbers of probes required for search hits and search misses are, respectively, ~ 1/2 * (1 + 1/(1 - α)) and ~ 1/2 * (1 + 1/(1 - α)^2).

Regarding this conclusion, we only need to know that when α is about 1/2, the average numbers of probes required for a search hit and a search miss are 1.5 and 2.5 respectively: substituting α = 1/2 gives 1/2 * (1 + 2) = 1.5 and 1/2 * (1 + 4) = 2.5. Another point is that as α approaches 1 the accuracy of these estimates decreases, but in practice we never let the load factor get close to 1: to maintain good performance, our implementation above keeps α no greater than 1/2.

V. References

Algorithms (4th Edition), Robert Sedgewick and Kevin Wayne
