In the previous articles in this series, we introduced sequential search based on unordered linked lists, binary search based on ordered arrays, balanced search trees, and red-black trees. Their average-case and worst-case time complexities are as follows:
It can be seen that, in terms of time complexity, red-black trees achieve O(log N) time for insertion, lookup, and deletion in the average case.
Is there an even more efficient data structure? The answer is yes, and this article introduces it: the hash table (also called a hash map).
What is a hash table
A hash table is a structure that stores data as key-value pairs (key-indexed): we simply supply the key we are looking for, and the table returns its corresponding value.
The idea of hashing is very simple. If all the keys are small integers, we can use a plain array: the key is the array index and the element at that index is the value, so the value for any key can be accessed instantly. This works for simple keys; hashing extends the idea to handle more complex key types.
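As a minimal sketch of this idea (the key range 0..999 below is just an illustration), direct addressing looks like this:

```csharp
// Direct addressing: the key itself is the array index, so put and get are O(1).
// This only works when keys are small non-negative integers within a known range.
string[] table = new string[1000];   // supports keys 0..999
table[42] = "answer";                // put(42, "answer")
string value = table[42];            // get(42) -> "answer"
```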
Using hash-based lookup takes two steps:
- Use a hash function to convert the search key into an array index. Ideally, different keys would convert to different index values, but in practice several keys may hash to the same index. Handling that situation is the second step of hash lookup: resolving the conflict.
- Resolve hash collisions. There are many ways to deal with collisions; the zipper method (separate chaining) and linear probing are described later in this article.
A hash table is a classic example of a trade-off between time and space. If there were no memory limit, we could use the key directly as an array index, and every lookup would take O(1) time; if there were no time limit, we could use an unordered array and sequential search, which needs very little memory. A hash table uses a moderate amount of both time and space to strike a balance between these two extremes, and we only need to adjust the hash function to trade one for the other.
Hash functions
The first step of hash lookup is to use a hash function to map keys to indexes; this mapping function is the hash function. If we have an array of size M, we need a hash function that converts any key into an index within that array's range (0 ~ M-1). The hash function should be easy to compute and should distribute all the keys evenly. For example, hashing on the last three digits of a phone number is better than the first three, because the first three digits repeat very frequently; similarly, using the last few digits of an ID number is better than using the leading digits, which encode the year of birth and repeat often.
In practice, keys are not always numbers: they may be strings, or a combination of several values, so we need to implement suitable hash functions ourselves.
1. Positive integers
The most common way to hash positive integers is modular hashing (the remainder method): for an array of size M and any positive integer k, compute the remainder of k divided by M. M is usually chosen to be a prime number.
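As a minimal sketch (the method name and the table size 97 are only illustrative), modular hashing looks like this:

```csharp
// Modular hashing: map a positive integer key k into the range 0..m-1.
// m is the table size and is usually chosen to be a prime, e.g. 97.
static int Hash(int k, int m)
{
    return k % m;
}
```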
2. String
When we use a string as a key, we can treat it as a large integer and apply modular hashing, taking each character of the string into account. For example:

```csharp
private static int GetHashCode(string str)
{
    char[] s = str.ToCharArray();
    int hash = 0;
    for (int i = 0; i < s.Length; i++)
    {
        hash = s[i] + (31 * hash);
    }
    return hash;
}
```
The computation above is Horner's method for evaluating a string hash, which corresponds to the formula (where L is the length of the string):

h = s[0]·31^(L−1) + … + s[L−3]·31^2 + s[L−2]·31^1 + s[L−1]·31^0
For example, to get the hash value of "call": the character 'c' has Unicode value 99, 'a' is 97, and 'l' is 108, so the hash of "call" is 99·31^3 + 97·31^2 + 108·31^1 + 108·31^0 = 108 + 31·(108 + 31·(97 + 31·99)) = 3045982.
If hashing every character is too time-consuming, we can save time by sampling only every N-th character, for example roughly one character out of every 8:

```csharp
private static int GetHashCode(string str)
{
    char[] s = str.ToCharArray();
    int hash = 0;
    int skip = Math.Max(1, s.Length / 8);
    for (int i = 0; i < s.Length; i += skip)
    {
        hash = s[i] + (31 * hash);
    }
    return hash;
}
```
However, in some cases different strings produce the same hash value; this is the hash collision mentioned earlier, as with the following four strings:
If we hash only every 8th character, these strings all produce the same hash value. The rest of this article looks at how to resolve such collisions.
Resolving hash collisions
Zipper method (separate chaining with linked lists)
With a hash function, we can convert keys to array indexes (0 ~ M-1), but when two or more keys hash to the same index we need a way to handle the conflict.
A straightforward approach is to have each of the M array elements point to a linked list, where each node of the list stores a key-value pair whose key hashes to that index. This is the zipper method (separate chaining). The figure below describes it clearly.
In the diagram, "John Smith" and "Sandra Dee" both map to index 152 through the hash function; that index points to a linked list in which both key-value pairs are stored.
The basic idea of this method is to choose M large enough that all the lists stay as short as possible, so lookups remain efficient. A separate-chaining hash table works in two steps: first find the linked list corresponding to the key's hash value, then search sequentially along that list for the key. Here we implement our hash table using the sequential-search lookup table SequentSearchSymbolTable introduced in the article on symbol tables implemented with unordered lists; of course, you could also use the .NET built-in LinkedList.
First we define the total number of lists M, and within the class we declare an array of SequentSearchSymbolTable, so that each index maps to one such lookup table.
```csharp
public class SeperateChainingHashSet<TKey, TValue> : SymbolTables<TKey, TValue>
    where TKey : IComparable<TKey>, IEquatable<TKey>
{
    private int M;                                         // hash table size
    private SequentSearchSymbolTable<TKey, TValue>[] st;   // array of list-based lookup tables

    public SeperateChainingHashSet() : this(997) { }

    public SeperateChainingHashSet(int m)
    {
        this.M = m;
        st = new SequentSearchSymbolTable<TKey, TValue>[m];
        for (int i = 0; i < m; i++)
        {
            st[i] = new SequentSearchSymbolTable<TKey, TValue>();
        }
    }

    private int Hash(TKey key)
    {
        return (key.GetHashCode() & 0x7fffffff) % M;
    }

    public override TValue Get(TKey key)
    {
        return st[Hash(key)].Get(key);
    }

    public override void Put(TKey key, TValue value)
    {
        st[Hash(key)].Put(key, value);
    }
}
```
You can see that the implementation works as follows:
- The Get method retrieves the value for a given key. It first uses the Hash method to find the key's index, i.e. the SequentSearchSymbolTable in the array that holds the element, and then calls that lookup table's Get method to find the value for the key.
- The Put method stores a key-value pair. It first computes the key's hash value, finds the SequentSearchSymbolTable in the array that should hold the element, and then calls that lookup table's Put method to store the pair.
- The Hash method computes the key's hash value. The bitwise AND with 0x7fffffff removes the sign bit, and the remainder operation then maps the key into the range 0 ~ M-1, which is the index range of our lookup-table array. A short usage sketch follows this list.
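Here is a hypothetical usage sketch of the class above (the table size 31, the keys, and the values are only illustrations, and it assumes the SymbolTables base class exposes Put and Get as shown):

```csharp
var table = new SeperateChainingHashSet<string, int>(31);
table.Put("John Smith", 152);
table.Put("Sandra Dee", 152);         // a colliding key simply lands in the same bucket's list
int value = table.Get("Sandra Dee");  // 152
```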
When implementing a hash table based on the zipper method, the goal is to choose an array size M that neither wastes memory on many empty lists nor wastes lookup time on lists that are too long. An advantage of separate chaining is that the choice of M is not critical: if there are more keys than expected, lookups merely take a bit longer than they would with a larger array, and we could also replace the linked lists with a more efficient structure; if there are fewer keys than expected, some space is wasted but lookups are fast. So when memory is not tight we can choose M large enough to make lookup time essentially constant, and when memory is tight, choosing M as large as possible still improves performance by roughly a factor of M.
Linear probing
The linear probing method resolves hash collisions by open addressing. The basic idea is to use an array of size M to hold N key-value pairs, where M > N, and to rely on the empty slots in the array to resolve collisions. As shown in the figure below:
Compared with the previous zipper-method figure, in this figure "Ted Baker" has a unique hash value of 153, but slot 153 is already occupied by "Sandra Dee". "Sandra Dee" and "John Smith" both hash to 152; when "Sandra Dee" was inserted, 152 was already occupied, so the search moved down, found 153 free, and stored the pair at 153. Then when "Ted Baker" hashed to 153 it found the slot occupied, continued down, found 154 free, and stored its value at 154.
The simplest open-addressing scheme is linear probing: when a collision occurs, i.e. a key's hash value is already occupied by another key, we simply check the next position in the table by adding 1 to the index. A linear probe has three possible outcomes:
- Hit: the key at that position is the same as the key being searched for
- Miss: the position is empty (no key is stored there)
- Continue the search: the key at that position differs from the key being searched for, so examine the next position.
Implementing linear probing is also very simple: we only need two arrays of the same size, one recording the keys and one recording the values.
```csharp
public class LinearProbingHashSet<TKey, TValue> : SymbolTables<TKey, TValue>
    where TKey : IComparable<TKey>, IEquatable<TKey>
{
    private int N;        // total number of key-value pairs in the symbol table
    private int M = 16;   // size of the linear-probing table
    private TKey[] keys;
    private TValue[] values;

    public LinearProbingHashSet()
    {
        keys = new TKey[M];
        values = new TValue[M];
    }

    private int Hash(TKey key)
    {
        return (key.GetHashCode() & 0x7fffffff) % M;
    }

    public override TValue Get(TKey key)
    {
        for (int i = Hash(key); keys[i] != null; i = (i + 1) % M)
        {
            if (key.Equals(keys[i])) { return values[i]; }
        }
        return default(TValue);
    }

    public override void Put(TKey key, TValue value)
    {
        int i = Hash(key);
        for (; keys[i] != null; i = (i + 1) % M)
        {
            if (keys[i].Equals(key))   // key already present: overwrite it with the new value
            {
                values[i] = value;
                return;
            }
        }
        // found an empty slot: insert the new key-value pair
        keys[i] = key;
        values[i] = value;
        N++;
    }
}
```
The linear probing approach is simple, but it can cause keys with equal hash values to cluster together: a collision that occurs on insertion remains in the table and must be stepped over again on every later search.
Performance Analysis
As we can see, a hash table stores and finds data in two steps. The first step maps the key to an array index via the hash function; this can be regarded as taking only constant time. The second step resolves collisions when hash values clash; the zipper method and linear probing introduced above are discussed separately below:
For the zipper method, lookup efficiency depends on the length of the lists. In general we should keep the size between M/8 and M/2: if it grows beyond M/2 we can enlarge the table, and if it drops into the 0 ~ M/8 range we can shrink it.
The same is true for linear probing, except that dynamically resizing the array requires re-hashing all existing values and inserting them into the new table.
Whether we use the zipper method or linear probing, dynamically resizing the lists or array keeps query efficiency high, but the cost of that resizing must also be taken into account: doubling the hash table requires a large number of probes and re-insertions, and in many cases this amortized cost needs to be considered.
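A minimal sketch of such a resize for the linear-probing table above might look like the following (it assumes a constructor overload that takes a capacity, which the earlier sketch does not define, and it would live inside LinearProbingHashSet):

```csharp
// Re-hash every existing key-value pair into a larger table.
// Field and method names (keys, values, M, Put) follow the sketch above.
private void Resize(int capacity)
{
    var temp = new LinearProbingHashSet<TKey, TValue>(capacity);  // assumed capacity constructor
    for (int i = 0; i < M; i++)
    {
        if (keys[i] != null)
        {
            temp.Put(keys[i], values[i]);   // re-hash into the larger arrays
        }
    }
    keys = temp.keys;
    values = temp.values;
    M = temp.M;
}
```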
Hash Collision Attack
We know that if the hash function is chosen poorly, a large number of keys can map to the same index. Then, whether we use the zipper method or open addressing to resolve the conflicts, subsequent lookups require many probes or traversals, and the hash table's lookup efficiency degrades from constant time. The figure below clearly describes such a degraded hash table:
A hash-table attack works by carefully constructing keys so that, after passing through the hash function, they all map to the same index (or a few indexes). The hash table then degenerates into a single linked list, and operations such as insert and lookup degrade from O(1) to linked-list traversal. This consumes a great deal of CPU and makes the system unresponsive, achieving a denial of service (DoS). ASP.NET was affected by this problem as well: several programming languages that used "non-randomized" hashes were found to have hash-collision DoS security vulnerabilities.
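A small self-inflicted illustration of this degradation (not an actual attack payload): a deliberately bad IEqualityComparer that maps every key to the same hash code turns a Dictionary into one long collision chain, so each insert and lookup becomes O(n).

```csharp
using System;
using System.Collections.Generic;

// Every key gets the same hash code, so all entries collide into one chain.
class ConstantHashComparer : IEqualityComparer<string>
{
    public bool Equals(string x, string y) => string.Equals(x, y);
    public int GetHashCode(string obj) => 42;   // worst-case hash: everything collides
}

class CollisionDemo
{
    static void Main()
    {
        var dict = new Dictionary<string, int>(new ConstantHashComparer());
        for (int i = 0; i < 10000; i++)
        {
            dict["key" + i] = i;   // each insert walks the entire chain: O(n) per operation
        }
        Console.WriteLine(dict.Count);   // 10000, but inserted in roughly O(n^2) total time
    }
}
```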
In the internal implementation of string hashing in .NET, this problem is mitigated by hash-code randomization: a threshold on the number of collisions is set, and once it is exceeded a randomized hash function is used instead, which is also a way to prevent hash-table degradation. The following is the implementation of the GetHashCode method of the string type in the BCL; you can see that, under the corresponding conditional compilation flag, a randomized hash is used.
```csharp
[ReliabilityContract(Consistency.WillNotCorruptState, Cer.MayFail), SecuritySafeCritical, __DynamicallyInvokable]
public override unsafe int GetHashCode()
{
    if (HashHelpers.s_UseRandomizedStringHashing)
    {
        return InternalMarvin32HashString(this, this.Length, 0L);
    }
    fixed (char* str = ((char*)this))
    {
        char* chPtr = str;
        int num = 0x15051505;
        int num2 = num;
        int* numPtr = (int*)chPtr;
        int length = this.Length;
        while (length > 2)
        {
            num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[0];
            num2 = (((num2 << 5) + num2) + (num2 >> 0x1b)) ^ numPtr[1];
            numPtr += 2;
            length -= 4;
        }
        if (length > 0)
        {
            num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[0];
        }
        return (num + (num2 * 0x5d588b65));
    }
}
```
Implementation of hashing in .NET
We can view the implementation of the Dictionary type in .NET through the online reference source. Any value added to the Dictionary as a key first has its hash code computed, and the hash code is then mapped to one of the buckets:
```csharp
public Dictionary(int capacity, IEqualityComparer<TKey> comparer)
{
    if (capacity < 0) ThrowHelper.ThrowArgumentOutOfRangeException(ExceptionArgument.capacity);
    if (capacity > 0) Initialize(capacity);
    this.comparer = comparer ?? EqualityComparer<TKey>.Default;
}
```
When the Dictionary is initialized, if a capacity is passed in, the buckets are set up by calling the Initialize method:

```csharp
private void Initialize(int capacity)
{
    int size = HashHelpers.GetPrime(capacity);
    buckets = new int[size];
    for (int i = 0; i < buckets.Length; i++) buckets[i] = -1;
    entries = new Entry[size];
    freeList = -1;
}
```
Now look at the Add method of Dictionary, which internally calls the Insert method:
```csharp
private void Insert(TKey key, TValue value, bool add)
{
    if (key == null)
    {
        ThrowHelper.ThrowArgumentNullException(ExceptionArgument.key);
    }

    if (buckets == null) Initialize(0);
    int hashCode = comparer.GetHashCode(key) & 0x7FFFFFFF;
    int targetBucket = hashCode % buckets.Length;

#if FEATURE_RANDOMIZED_STRING_HASHING
    int collisionCount = 0;
#endif

    for (int i = buckets[targetBucket]; i >= 0; i = entries[i].next)
    {
        if (entries[i].hashCode == hashCode && comparer.Equals(entries[i].key, key))
        {
            if (add)
            {
                ThrowHelper.ThrowArgumentException(ExceptionResource.Argument_AddingDuplicate);
            }
            entries[i].value = value;
            version++;
            return;
        }
#if FEATURE_RANDOMIZED_STRING_HASHING
        collisionCount++;
#endif
    }

    int index;
    if (freeCount > 0)
    {
        index = freeList;
        freeList = entries[index].next;
        freeCount--;
    }
    else
    {
        if (count == entries.Length)
        {
            Resize();
            targetBucket = hashCode % buckets.Length;
        }
        index = count;
        count++;
    }

    entries[index].hashCode = hashCode;
    entries[index].next = buckets[targetBucket];
    entries[index].key = key;
    entries[index].value = value;
    buckets[targetBucket] = index;
    version++;

#if FEATURE_RANDOMIZED_STRING_HASHING
    if (collisionCount > HashHelpers.HashCollisionThreshold && HashHelpers.IsWellKnownEqualityComparer(comparer))
    {
        comparer = (IEqualityComparer<TKey>)HashHelpers.GetRandomizedEqualityComparer(comparer);
        Resize(entries.Length, true);
    }
#endif
}
```
First it computes the key's hash code, then maps the hash code to the target bucket by taking it modulo the length of the bucket array. It then walks the linked list stored in that bucket: if an entry with the same key is found and duplicate keys are not allowed (add is true), an exception is thrown; if they are allowed (add is false), the previous value is overwritten and the method returns.
If no matching key is found, the new entry is placed in a free slot of the target bucket; when there is not enough free space, the table is expanded (Resize) and the key is re-hashed to a new target bucket. It is important to note that the Resize operation is relatively expensive.
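A short usage example of the behavior just described (Add throws on a duplicate key, while the indexer overwrites):

```csharp
var d = new Dictionary<string, int>();
d.Add("a", 1);      // insert path with add == true
d["a"] = 2;         // insert path with add == false: overwrites the existing value
// d.Add("a", 3);   // would throw ArgumentException: an item with the same key already exists
```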
Summary
In the previous articles we introduced sequential search based on unordered lists, binary search based on ordered arrays, balanced search trees, and red-black trees. This article introduced the last kind of symbol table in this series on search algorithms, the hash table, along with hash functions and two methods for resolving hash collisions: the zipper method (separate chaining) and linear probing. The worst-case and average-case time complexities of the various search algorithms' operations are as follows:
When actually writing code, choosing the appropriate data structure depends on the specific data size, the required lookup efficiency, and the time and space constraints. I hope this article and the previous ones in the series are helpful to you.
Resources
- PHP Hash Table Collision Attack principle
- Is it really O(1)?
- String.GetHashCode Method
- .NET Dictionary source code