.NET's Dictionary<TKey, TValue> is a very common key-value data structure; it is in fact the well-known hash table. .NET also has a type called Hashtable. Both types are hash tables and both store key-value pairs; the difference is that one is generic and the other is not, and their internal implementations differ somewhat. Today let's look into .NET's Dictionary<TKey, TValue> and some related issues.
If you find errors in the text, please point them out.
What is a hash table
Wikipedia defines a hash table like this:
In computing, a hash table (also hash map) is a data structure used to implement an associative array, a structure that can map keys to values.
It is a data structure that reaches a storage location directly from the key. It appears in every data structures textbook, so we won't study it in depth here. But a couple of concepts need mentioning, because they matter a lot for understanding Dictionary's internal implementation.
More on hash tables: Wikipedia, Hashtable blog
Collisions and how to handle them
Since the hash function we use is not a perfect hash function, and the memory we use for storage is limited, collisions cannot be avoided. Handling collisions is therefore a very important factor when designing a hash table.
There are a number of ways to deal with collisions, such as open addressing and separate chaining. Dictionary uses separate chaining with linked lists.
The Wikipedia diagram of separate chaining makes the idea clear.
Dictionary does this with two arrays. Each slot of the buckets array holds only an index, which points to an entry in the entries array. When two keys map to the same bucket, the new entry is linked into that bucket's chain: its next field points to the entry the bucket pointed to before, and the bucket now points to the new entry.
Load factor
The load factor exists because in open addressing, as the array fills up, the probability of a collision grows, and open addressing resolves collisions by probing, which costs a lot of performance. A chart on Wikipedia compares the CPU cache misses of separate chaining and linear probing at different load factors.
With the method Dictionary uses, the load factor does not have a significant impact on performance, so Dictionary uses 1 by default and does not provide any interface for setting it.
How Dictionary is implemented internally
Let's start by introducing several important fields in Dictionary:
int[] buckets
Entry[] entries
IEqualityComparer<TKey> comparer
- The first two are the two arrays mentioned above, which together implement separate chaining with linked lists.
- When a new key-value pair is added to Dictionary, the hash code of the key has to be computed, and when there is a collision the two keys have to be compared for equality. The comparer does both. So why not just call the key's own overridden GetHashCode and Equals methods? This is described later in this article.
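To make the comparer's role a bit more concrete, here is a minimal sketch. It assumes the bucket index is computed as (comparer.GetHashCode(key) & 0x7FFFFFFF) % buckets.Length, which is how I read the .NET Framework reference source; the names ComparerDemo and BucketIndex are made up for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class ComparerDemo
{
    // Hypothetical helper: maps a key to a bucket index through a comparer.
    public static int BucketIndex<TKey>(TKey key, IEqualityComparer<TKey> comparer, int bucketCount)
    {
        int hashCode = comparer.GetHashCode(key) & 0x7FFFFFFF; // clear the sign bit
        return hashCode % bucketCount;
    }

    public static void Main()
    {
        // The same key type behaves differently depending on the comparer passed in.
        var ordinal = new Dictionary<string, int>();
        var ignoreCase = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);
        ordinal["Key"] = 1;
        ignoreCase["Key"] = 1;
        Console.WriteLine(ordinal.ContainsKey("KEY"));    // False
        Console.WriteLine(ignoreCase.ContainsKey("KEY")); // True

        // An int hashes to itself, so key 12 with 3 buckets lands in bucket 0.
        Console.WriteLine(BucketIndex(12, EqualityComparer<int>.Default, 3)); // 0
    }
}
```

Because both hashing and equality go through the comparer, swapping the comparer changes which keys count as "the same" without touching the key type.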
Insert
Let's use an example to illustrate what Dictionary does on insertion.
Dictionary<int, string> dict = new Dictionary<int, string>();
dict.Add(0, "zero");
dict.Add(12, "twelve");
dict.Add(15, "fifteen");
dict.Add(4, "four");
The "figure" below shows how the two arrays change during the insertions; read it alongside the source code to see what happens.
------------------
After Add(0, "zero"):
buckets     entries
[0]  0      [0] hashCode=0, key=0, next=-1, value="zero"
[1] -1      [1] empty
[2] -1      [2] empty
------------------
After Add(12, "twelve"):
buckets     entries
[0]  1      [0] hashCode=0, key=0, next=-1, value="zero"
[1] -1      [1] hashCode=12, key=12, next=0, value="twelve"
[2] -1      [2] empty
------------------
After Add(15, "fifteen"):
buckets     entries
[0]  2      [0] hashCode=0, key=0, next=-1, value="zero"
[1] -1      [1] hashCode=12, key=12, next=0, value="twelve"
[2] -1      [2] hashCode=15, key=15, next=1, value="fifteen"
------------------
After Add(4, "four"):
buckets     entries
[0]  0      [0] hashCode=0, key=0, next=-1, value="zero"
[1]  2      [1] hashCode=12, key=12, next=-1, value="twelve"
[2] -1      [2] hashCode=15, key=15, next=-1, value="fifteen"
[3] -1      [3] hashCode=4, key=4, next=-1, value="four"
[4]  3      [4] empty
[5]  1      [5] empty
[6] -1      [6] empty
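The first three insertions can be replayed with a small sketch. This is not the real Dictionary code: ChainedInsertSketch and its members are hypothetical names, the capacity is fixed at 3, and duplicate checks and growth are left out so that only the linking step is visible.

```csharp
using System;

public class ChainedInsertSketch
{
    public struct Entry
    {
        public int HashCode;
        public int Next;      // index of the next entry in the same chain, -1 at the end
        public int Key;
        public string Value;
    }

    public readonly int[] Buckets;
    public readonly Entry[] Entries;
    private int count;

    public ChainedInsertSketch(int capacity)
    {
        Buckets = new int[capacity];
        for (int i = 0; i < capacity; i++) Buckets[i] = -1; // -1 means "empty bucket"
        Entries = new Entry[capacity];
    }

    // No duplicate check and no growth: this sketch only shows the linking.
    public void Add(int key, string value)
    {
        if (count == Entries.Length)
            throw new InvalidOperationException("full; growth is not part of this sketch");
        int hashCode = key.GetHashCode() & 0x7FFFFFFF;
        int bucket = hashCode % Buckets.Length;
        Entries[count] = new Entry { HashCode = hashCode, Next = Buckets[bucket], Key = key, Value = value };
        Buckets[bucket] = count; // the new entry becomes the head of the chain
        count++;
    }

    public static void Main()
    {
        var map = new ChainedInsertSketch(3);
        map.Add(0, "zero");
        map.Add(12, "twelve");
        map.Add(15, "fifteen");
        Console.WriteLine(string.Join(",", map.Buckets)); // 2,-1,-1
        Console.WriteLine(map.Entries[2].Next);            // 1
        Console.WriteLine(map.Entries[1].Next);            // 0
    }
}
```

All three keys share bucket 0, so the chain reads 15, then 12, then 0, which is the third state in the figure.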
Expansion
On the last insertion in the example above, Dictionary expanded: its capacity grew from 3 to 7. The elements in entries barely change during the expansion, but their next fields do, because after the expansion hashCode % length no longer maps them to the same bucket, so they no longer need to share a chain. I think two points are worth mentioning here.
- How is the size after expansion chosen, and why is it 7 in the example above?
- What happens to the two arrays during an expansion?
For the first question: because the array length in Dictionary is limited, the position in the buckets array is obtained from key.GetHashCode() % length, and then the entry's next is updated. We want the modulo to cause as few collisions as possible, and a prime length does a good job of spreading the remainders across the array.
Dictionary first doubles the current capacity, then looks in a table of primes for the nearest prime that is not smaller than this value. The prime table looks like this:
public static readonly int[] primes = {
    3, 7, 11, 17, 23, 29, 37, 47, 59, 71, 89, 107, 131, 163, 197, 239, 293, 353, 431, 521, 631, 761, 919,
    1103, 1327, 1597, 1931, 2333, 2801, 3371, 4049, 4861, 5839, 7013, 8419, 10103, 12143, 14591,
    17519, 21023, 25229, 30293, 36353, 43627, 52361, 62851, 75431, 90523, 108631, 130363, 156437,
    187751, 225307, 270371, 324449, 389357, 467237, 560689, 672827, 807403, 968897, 1162687, 1395263,
    1674319, 2009191, 2411033, 2893249, 3471899, 4166287, 4999559, 5999471, 7199369};
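The size selection can be sketched roughly as follows. This is a shortened illustration, not the framework code: PrimeSizing is a hypothetical class, only the first few primes of the table are included, and the real implementation also handles sizes beyond the table, which is skipped here.

```csharp
using System;

public static class PrimeSizing
{
    // First few values of the prime table quoted above.
    static readonly int[] Primes = { 3, 7, 11, 17, 23, 29, 37, 47, 59, 71, 89, 107, 131, 163, 197 };

    // Smallest tabulated prime that is >= min.
    public static int GetPrime(int min)
    {
        foreach (int p in Primes)
            if (p >= min) return p;
        throw new ArgumentOutOfRangeException(nameof(min), "beyond this shortened table");
    }

    // Double the old size, then round up to a prime.
    public static int ExpandPrime(int oldSize)
    {
        return GetPrime(2 * oldSize);
    }

    public static void Main()
    {
        Console.WriteLine(ExpandPrime(3)); // 7
        Console.WriteLine(ExpandPrime(7)); // 17
    }
}
```

With an old size of 3, doubling gives 6 and the first prime that is not smaller is 7, which is the capacity seen in the example.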
As for the second question: each entry stores its hash code, so nothing stored in the hash table has to be re-hashed when it grows. We only need to take each stored hash code modulo the new length and link the entry into its new bucket, which is very fast.
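A rough sketch of that re-bucketing step, assuming non-negative stored hash codes. RebucketSketch and Rebucket are hypothetical names, and only the hash codes are tracked since keys and values do not move.

```csharp
using System;

public static class RebucketSketch
{
    // Rebuilds buckets and next links from stored hash codes; no key is re-hashed.
    public static int[] Rebucket(int[] hashCodes, int newLength, out int[] next)
    {
        int[] buckets = new int[newLength];
        for (int i = 0; i < newLength; i++) buckets[i] = -1;
        next = new int[hashCodes.Length];
        for (int i = 0; i < hashCodes.Length; i++)
        {
            int bucket = hashCodes[i] % newLength;
            next[i] = buckets[bucket]; // previous head of this bucket, or -1
            buckets[bucket] = i;
        }
        return buckets;
    }

    public static void Main()
    {
        int[] next;
        // Hash codes of keys 0, 12 and 15, moved from length 3 to length 7.
        int[] buckets = Rebucket(new[] { 0, 12, 15 }, 7, out next);
        Console.WriteLine(string.Join(",", buckets)); // 0,2,-1,-1,-1,1,-1
        Console.WriteLine(string.Join(",", next));    // -1,-1,-1
    }
}
```

The three entries that shared bucket 0 now sit in buckets 0, 5 and 1, so every next link becomes -1; adding key 4 afterwards fills bucket 4.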
Find
Compute key.GetHashCode() % length to get the bucket, then walk that bucket's chain to find an equal key.
Delete
Same as lookup: find the bucket, walk the chain to the matching entry, then unlink it from the chain.
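Lookup and removal can be sketched on top of the state reached after adding 0, 12 and 15 with length 3. ChainLookupSketch is a hypothetical class with int keys only; the real Dictionary additionally recycles the freed slot, which is left out here.

```csharp
using System;

public class ChainLookupSketch
{
    // State after adding 0, 12, 15 with length 3 (all three share bucket 0).
    int[] buckets = { 2, -1, -1 };
    int[] keys = { 0, 12, 15 };
    int[] next = { -1, 0, 1 };

    // Returns the entry index for key, or -1 if it is not present.
    public int FindEntry(int key)
    {
        int hashCode = key.GetHashCode() & 0x7FFFFFFF;
        for (int i = buckets[hashCode % buckets.Length]; i >= 0; i = next[i])
            if (keys[i] == key) return i;
        return -1;
    }

    // Walks the chain like FindEntry, remembering the previous entry so it can unlink.
    public bool Remove(int key)
    {
        int hashCode = key.GetHashCode() & 0x7FFFFFFF;
        int bucket = hashCode % buckets.Length;
        int last = -1;
        for (int i = buckets[bucket]; i >= 0; last = i, i = next[i])
        {
            if (keys[i] != key) continue;
            if (last < 0) buckets[bucket] = next[i]; // removing the head of the chain
            else next[last] = next[i];               // bypass the removed entry
            next[i] = -1;
            return true;
        }
        return false;
    }

    public static void Main()
    {
        var map = new ChainLookupSketch();
        Console.WriteLine(map.FindEntry(12)); // 1
        Console.WriteLine(map.Remove(12));    // True
        Console.WriteLine(map.FindEntry(12)); // -1
        Console.WriteLine(map.FindEntry(0));  // 0
    }
}
```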
Some performance considerations
When we use Dictionary, the usual habit is code like the insertion example above. That is not a problem when a built-in type is the key, but we need to be careful when a custom value type (struct) is the key. There is an easily overlooked problem that can cause a lot of unnecessary overhead.
When we need to define some custom structures and put them in a collection, we tend to use value types instead of classes, since value types perform much better than classes when they only hold data. (See Choosing Between Class and Struct.)
Let's start with an experiment comparing value types and classes as keys. The experiment code is below; it inserts from 1000 to 10,000 items and measures the time required.
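using System;
using System.Collections.Generic;
using System.Diagnostics;

// Change "struct" to "class" to run the reference-type variant.
public struct CustomKey
{
    public int Field1;
    public int Field2;

    public override int GetHashCode()
    {
        return Field1.GetHashCode() ^ Field2.GetHashCode();
    }

    public override bool Equals(object obj)
    {
        CustomKey key = (CustomKey)obj; // unboxes when CustomKey is a struct
        return this.Field1 == key.Field1 && this.Field2 == key.Field2;
    }
}

public static class Program
{
    public static void Main()
    {
        Dictionary<CustomKey, int> dict = new Dictionary<CustomKey, int>();
        int tryCount = 50;
        // The loop header was garbled in the source; the start and step of 1000
        // are assumed from the "1000 to 10,000" description.
        for (int count = 1000; count < 10000; count += 1000)
        {
            double totalTime = 0.0; // reset per count so each line is an average for that count
            for (int j = 0; j < tryCount; j++)
            {
                Stopwatch watcher = Stopwatch.StartNew();
                for (int i = 0; i < count; i++)
                {
                    CustomKey key = new CustomKey() { Field1 = i * 2, Field2 = i * 2 + 1 };
                    dict.Add(key, i);
                }
                watcher.Stop();
                dict.Clear();
                totalTime += watcher.ElapsedMilliseconds;
            }
            Console.WriteLine("{0},{1}", count, totalTime / tryCount);
        }
    }
}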
The result was not what I expected.
WTF? Isn't the value type supposed to be faster? Orz...
Here we have to come back to the IEqualityComparer<TKey> comparer mentioned above: all comparisons inside Dictionary go through this instance. We didn't specify one, so Dictionary used EqualityComparer<TKey>.Default. Let's look at the source code to see where this Default comes from. In CreateComparer we can see that if our type is not byte, does not implement the IEquatable<T> interface, is not Nullable<T>, and is not an enum, an ObjectEqualityComparer<T>() is created for us by default.
The Equals and GetHashCode methods of ObjectEqualityComparer<T> don't look like anything special, so where is the problem?
With performance problems around value types, the first thing that comes to mind is the cost of boxing and unboxing. Is there any such operation here? Let's look at the following two pieces of IL code to find out.
IL code of ObjectEqualityComparer.Equals(T x, T y)
// Methods
.method public hidebysig virtual instance bool Equals(!T x, !T y) cil managed
{
    // Method begins at RVA 0x62a39
    // Code size 50 (0x32)
    .maxstack 8
    IL_0000: ldarg.1
    IL_0001: box !T
    IL_0006: brfalse.s IL_0026
    IL_0008: ldarg.2
    IL_0009: box !T
    IL_000e: brfalse.s IL_0024
    IL_0010: ldarga.s x
    IL_0012: ldarg.2
    IL_0013: box !T
    IL_0018: constrained. !T
    IL_001e: callvirt instance bool System.Object::Equals(object)
    IL_0023: ret
    IL_0024: ldc.i4.0
    IL_0025: ret
    IL_0026: ldarg.2
    IL_0027: box !T
    IL_002c: brfalse.s IL_0030
    IL_002e: ldc.i4.0
    IL_002f: ret
    IL_0030: ldc.i4.1
    IL_0031: ret
} // End of method ObjectEqualityComparer`1::Equals

IL code of ObjectEqualityComparer.GetHashCode(T obj)
.method public hidebysig virtual instance int32 GetHashCode(!T obj) cil managed
{
    .custom instance void System.Runtime.TargetedPatchingOptOutAttribute::.ctor(string) = ( ... )
    // Method begins at RVA 0x62a6c
    // Code size 24 (0x18)
    .maxstack 8
    IL_0000: ldarg.1
    IL_0001: box !T
    IL_0006: brtrue.s IL_000a
    IL_0008: ldc.i4.0
    IL_0009: ret
    IL_000a: ldarga.s obj
    IL_000c: constrained. !T
    IL_0012: callvirt instance int32 System.Object::GetHashCode()
    IL_0017: ret
} // End of method ObjectEqualityComparer`1::GetHashCode
As you can see from the two pieces of code above, the default ObjectEqualityComparer implementation contains a lot of box operations (the box !T instructions), which box value types into reference types. This is expensive because it requires creating an object and copying the value into the newly created object. (There is also an unbox operation in the CustomKey.Equals method.)
How do we fix it?
I think avoiding the boxing should be enough, so let's create a comparer ourselves.
public class MykeyComparer : IEqualityComparer<CustomKey>
{
    #region IEqualityComparer<CustomKey> Members

    // Both methods take CustomKey directly, so a struct key is never boxed.
    public bool Equals(CustomKey x, CustomKey y)
    {
        return x.Field1 == y.Field1 && x.Field2 == y.Field2;
    }

    public int GetHashCode(CustomKey obj)
    {
        return obj.Field1.GetHashCode() ^ obj.Field2.GetHashCode();
    }

    #endregion
}
Let's change the experiment code slightly (Dictionary<CustomKey, int> dict = new Dictionary<CustomKey, int>(new MykeyComparer());) and run the test again. This time the results show a large performance improvement.
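Going by the CreateComparer rule described earlier, another option should be to implement IEquatable<T> on the struct itself, so that the default comparer is no longer ObjectEqualityComparer<T>. The sketch below is my own illustration, not something measured in the experiment above; PlainKey, EquatableKey and DefaultComparerDemo are hypothetical names, and the printed comparer type names are runtime internals that may vary.

```csharp
using System;
using System.Collections.Generic;

// A struct key that relies on the default comparer selection.
public struct PlainKey
{
    public int Field1;
    public int Field2;
}

// The same key, but implementing IEquatable<T> so Equals can be called without boxing.
public struct EquatableKey : IEquatable<EquatableKey>
{
    public int Field1;
    public int Field2;

    public bool Equals(EquatableKey other)
    {
        return Field1 == other.Field1 && Field2 == other.Field2;
    }

    public override bool Equals(object obj)
    {
        return obj is EquatableKey && Equals((EquatableKey)obj);
    }

    public override int GetHashCode()
    {
        return Field1.GetHashCode() ^ Field2.GetHashCode();
    }
}

public static class DefaultComparerDemo
{
    public static void Main()
    {
        // Prints the runtime's internal comparer type names; they may differ between runtimes.
        Console.WriteLine(EqualityComparer<PlainKey>.Default.GetType().Name);
        Console.WriteLine(EqualityComparer<EquatableKey>.Default.GetType().Name);

        var dict = new Dictionary<EquatableKey, int>();
        dict.Add(new EquatableKey { Field1 = 2, Field2 = 3 }, 1);
        Console.WriteLine(dict.ContainsKey(new EquatableKey { Field1 = 2, Field2 = 3 })); // True
    }
}
```

Whether this matches the speed of a hand-written comparer would need to be checked with the same experiment.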
Thread Safety
Dictionary is not thread-safe. For multi-threaded use, either do the synchronization yourself or use the thread-safe alternative, ConcurrentDictionary<TKey, TValue>.
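Both options look roughly like this. It is a minimal sketch: ThreadSafetyDemo and its two methods are hypothetical names, and a single lock around every access is only one of several ways to synchronize by hand.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class ThreadSafetyDemo
{
    // Option 1: let ConcurrentDictionary handle the synchronization.
    public static int CountWithConcurrentDictionary(int n)
    {
        var dict = new ConcurrentDictionary<int, int>();
        Parallel.For(0, n, i => dict.TryAdd(i, i * 2));
        return dict.Count;
    }

    // Option 2: keep a plain Dictionary and guard every access with a lock.
    public static int CountWithLock(int n)
    {
        var dict = new Dictionary<int, int>();
        object gate = new object();
        Parallel.For(0, n, i =>
        {
            lock (gate) { dict.Add(i, i * 2); }
        });
        return dict.Count;
    }

    public static void Main()
    {
        Console.WriteLine(CountWithConcurrentDictionary(10000)); // 10000
        Console.WriteLine(CountWithLock(10000));                 // 10000
    }
}
```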
That's all for now.