I have read a lot of articles online, but still felt I could not explain this clearly (forgive me for not reading a book; I picked up these basics from blogs). So today I decided to write about the hash table.
A hash table is also known as a hash map. In C#, the classic implementations are Hashtable and Dictionary, and Dictionary in particular is something we all use constantly. Both store data as key-value pairs that can be looked up by key, and lookups are fast. How is this implemented internally? Why are lookups fast? Does it have any drawbacks? Let's go through these one by one.
First, we allocate a container (an array) to hold the elements we want to insert. Since a dictionary stores key-value pairs, our container should look like this:
// Container entry
private struct Bucket
{
    public object key;
    public object value;
}

private Bucket[] buckets = null; // Container
For example, suppose I want to insert the key "3B" with the value 1. We call key.GetHashCode() to get a hash value (an int), reduce that value to an array index, and store the key-value pair at that index.
Here K stands for key and V stands for value.
With that in mind, we can start writing code.
using System;

class Program
{
    static void Main(string[] args)
    {
        MyHashTable table = new MyHashTable();
        table.Add("3B", 1);
        table.Add("WEA", 2);
        var a = table.Find("3B");
    }
}

public class MyHashTable
{
    // Container entry
    private struct Bucket
    {
        public object key;
        public object value;
    }

    private Bucket[] buckets = null; // Container

    public MyHashTable()
    {
        buckets = new Bucket[10];
    }

    public void Add(object key, object value)
    {
        // Hash -> unsigned 32-bit integer -> index within the array
        uint index = (uint)key.GetHashCode() % (uint)buckets.Length;
        Bucket temp = new Bucket() { key = key, value = value };
        buckets[index] = temp; // Overwrites whatever was stored at this index
    }

    public object Find(object key)
    {
        uint index = (uint)key.GetHashCode() % (uint)buckets.Length;
        var item = buckets[index];
        // Compare by value; != on object would only compare references
        if (!Equals(item.key, key)) throw new Exception("The key does not exist");
        return item.value;
    }
}
In Add, we call key.GetHashCode() to get the hash value (an int), cast it to a 32-bit unsigned integer, and then take it modulo the array length to get the index. For a = b % c with a non-negative b, a is always in the range 0 to c-1, so the index can never go out of bounds. But this raises two questions.
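To see why the cast to an unsigned integer matters, here is a minimal sketch. The IndexDemo class and the ToIndex helper are names made up for this illustration and are not part of the code above. GetHashCode() may return a negative int, and in C# the % operator keeps the sign of its left operand, so without the cast the index could come out negative.

```csharp
using System;

public static class IndexDemo
{
    // Hypothetical helper: maps any hash code to an index in [0, length - 1].
    public static int ToIndex(int hashCode, int length)
    {
        // Reinterpret the 32 bits as unsigned: -7 becomes 4294967289.
        uint positive = unchecked((uint)hashCode);
        return (int)(positive % (uint)length);
    }
}
```

For example, ToIndex(23, 10) gives 3 and ToIndex(-7, 10) gives 9, whereas the plain expression -7 % 10 evaluates to -7.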
Q1. What if different keys produce the same index? How do we store and find them?
table.Add("WEA", 2);
table.Add("QQ1", 2);
In my run, uint index = (uint)key.GetHashCode() % (uint)buckets.Length; gives 8 for both of these keys (the exact value depends on how the runtime hashes strings). From here on we call this a collision.
Q2. The container is initialized with a size of 10. If I insert more than 10 elements, I have to expand it. After the expansion the container size (the array length) has changed, so
uint index = (uint)key.GetHashCode() % (uint)buckets.Length
gives a different result for the same key. How do we deal with that?
A1. We first rework the container (the buckets array) so that each array element is no longer a single Bucket but a linked list of Buckets (if you don't know linked lists, just think of it as a collection). This lets several elements that map to the same index share one position. On lookup, we use the index to find the linked list, and then search it for the element with the matching key.
A2. We can rearrange the elements (equivalent to calling Add again for each of them).
For example, we initialize the array length to 10. When it is full, we extend the array to 20. Then, for every stored key, we recompute
uint index = (uint)key.GetHashCode() % 20; // 20 is the new array length
and insert the element into the new array at that index, so the arrangement is correct again. It is a kind of remapping.
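The remapping can be illustrated with integer keys, whose GetHashCode() in .NET returns the integer itself, which makes the indexes easy to predict. The RehashDemo class and its method names are assumptions made for this sketch only, and it ignores collisions to keep the idea visible.

```csharp
using System;

public static class RehashDemo
{
    // Same index formula as in the article.
    public static int GetIndex(object key, int length)
    {
        return (int)(unchecked((uint)key.GetHashCode()) % (uint)length);
    }

    // Hypothetical helper: moves every key from the old array into a larger one,
    // recomputing each index against the new length.
    public static object[] Grow(object[] oldSlots, int newLength)
    {
        var newSlots = new object[newLength];
        foreach (object key in oldSlots)
        {
            if (key == null) continue;                // empty slot
            newSlots[GetIndex(key, newLength)] = key; // remap (collisions ignored in this sketch)
        }
        return newSlots;
    }
}
```

With the key 37, the index is 7 while the length is 10 and 17 once the length is 20, so a lookup against the new array only works after the key has been moved.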
One more point: the container size (Length, i.e. the array length) is related to the number of elements added (ElementCount). For a fixed container size, the more elements there are, the more likely a collision becomes.
Say my container size is 2 and I add an element at index 0. The probability of a collision when I add the next element is 50%. If the container size is 3, the probability is only 33.33%.
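The arithmetic above generalizes if we assume the hash spreads keys uniformly over the slots (an idealization). The class and method names below are made up for this sketch; the second method goes one step further and estimates the chance that inserting several keys produces at least one collision.

```csharp
using System;

public static class CollisionDemo
{
    // Probability that the next key lands on an already occupied slot,
    // assuming every slot is equally likely.
    public static double NextInsertCollision(int occupiedSlots, int size)
    {
        return (double)occupiedSlots / size;
    }

    // Probability that inserting `keys` keys into `size` empty slots
    // causes at least one collision (same uniformity assumption).
    public static double AnyCollision(int keys, int size)
    {
        if (keys > size) return 1.0; // more keys than slots: a collision is certain
        double noCollision = 1.0;
        for (int i = 0; i < keys; i++)
        {
            // The (i + 1)-th key must avoid the i slots already taken.
            noCollision *= (double)(size - i) / size;
        }
        return 1.0 - noCollision;
    }
}
```

NextInsertCollision(1, 2) is 0.5 and NextInsertCollision(1, 3) is about 0.3333, matching the figures in the text.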
We cannot avoid collisions by making the container huge from the start, just as we would not cook one cup of rice in the canteen's oversized pot.
Nor can we simply let collisions pile up (on the grounds that we walk the linked list one by one anyway), because lookups would then lose their performance advantage and degrade into a linear search.
How do we strike a balance between the two?
float r = (float)ElementCount / Length; r is best kept between 0.6 and 0.7. I use 0.62 here (close to the golden ratio); for comparison, Microsoft's Hashtable scales its load factor by 0.72 internally. If r > 0.62, we expand the container.
using System;
using System.Collections.Generic;
using System.Linq;

class Program
{
    public static void Main()
    {
        MyHash set = new MyHash();
        List<string> list = new List<string>()
        {
            "EF", "AB", "FF", "GG", "ee", "ZF", "ASE", "FGE", "Qweg", "Qalspo", "Goo2", "1qwe", "wet93"
        };
        for (int i = 0; i < list.Count; i++)
        {
            var item = list[i];
            set.Add(item, i);
        }
        var value = set.Find("GG");
    }
}

public class MyHash
{
    // Container entry
    private struct Bucket
    {
        public object key;
        public object value;
    }

    private LinkedList<Bucket>[] buckets = null; // Container
    private int count;                           // Number of stored elements
    private int step = 10;                       // Growth increment per expansion

    public MyHash()
    {
        InitialBuckets(step);
    }

    private void InitialBuckets(int length)
    {
        if (buckets == null) // If empty, initialize the container
        {
            buckets = new LinkedList<Bucket>[length];
            return;
        }
        // Otherwise, expand the container
        length = length + buckets.Length;
        var newBuckets = new LinkedList<Bucket>[length];
        count = 0; // InsertIntoBuckets counts every element again
        foreach (LinkedList<Bucket> linkList in buckets)
        {
            if (linkList == null) continue;
            foreach (Bucket item in linkList)
            {
                int index = GetIndex(item.key, length);
                InsertIntoBuckets(index, item.key, item.value, newBuckets); // Rearrange
            }
        }
        buckets = newBuckets;
    }

    private int GetIndex(object key, int length)
    {
        return (int)((uint)key.GetHashCode() % (uint)length);
    }

    private void InsertIntoBuckets(int index, object key, object value, LinkedList<Bucket>[] buckets)
    {
        var linkList = buckets[index];
        if (linkList == null) linkList = new LinkedList<Bucket>();
        // Compare keys by value; == on object would only compare references
        if (linkList.Any(x => Equals(x.key, key)))
            throw new Exception("key already exists: " + key);
        Bucket item = new Bucket() { key = key, value = value };
        linkList.AddLast(item);
        buckets[index] = linkList;
        count++;
    }

    public void Add(object key, object value)
    {
        // If the load factor is greater than 0.62, expand the container
        if ((float)count / (float)buckets.Length > 0.62)
        {
            InitialBuckets(step);
        }
        int index = GetIndex(key, buckets.Length);
        InsertIntoBuckets(index, key, value, buckets);
    }

    public object Find(object key)
    {
        int index = GetIndex(key, buckets.Length);
        var linkList = buckets[index];
        if (linkList == null) throw new Exception("This key does not exist");
        // FirstOrDefault returns a default Bucket (null key) when nothing matches
        Bucket item = linkList.FirstOrDefault(x => Equals(x.key, key));
        if (item.key == null) throw new Exception("This key does not exist");
        return item.value;
    }
}
When we look up a value by its key, we convert the key to an index: (uint)key.GetHashCode() % (uint)length
So in the ideal case (different keys hash to different indexes, i.e. no collisions), the time complexity of a lookup is O(1). That is its advantage. If there are collisions, the search within a bucket is linear. So you can see that for the hash table as a data structure, what matters most is the hash algorithm itself.
Now you can see the advantage of the hash table: no matter how large the data set is, a lookup is O(1) (ideally). The price is extra space.
That is the general principle, but the Hashtable class we commonly use is not built this way. It resolves collisions with double hashing, and its probe sequence is defined as follows:
H(key, i) = H1(key) + i * H2(key)
The hash functions H1 and H2 are defined as follows:
H1(key) = key.GetHashCode()
H2(key) = 1 + (((H1(key) >> 5) + 1) % (buckets.Length - 1))
Because double hashing is used, the final value of H(key, i) may well exceed the container size, so H(key, i) is taken modulo the length, and the final hash address is:
Hash address = H(key, i) % buckets.Length
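The probe sequence defined by these formulas can be sketched as follows. This is only an illustration of the formulas as written above, not a copy of Microsoft's source. The DoubleHashDemo class, its method names, and clearing the sign bit of the hash code are assumptions made for this example.

```csharp
using System;

public static class DoubleHashDemo
{
    // H1: the key's hash code, with the sign bit cleared so that later
    // arithmetic stays non-negative (an assumption of this sketch).
    public static uint H1(object key)
    {
        return (uint)(key.GetHashCode() & 0x7FFFFFFF);
    }

    // H2: the step size, always in the range [1, length - 1].
    public static uint H2(object key, int length)
    {
        return 1 + (((H1(key) >> 5) + 1) % ((uint)length - 1));
    }

    // Hash address of the i-th probe: (H1 + i * H2) % length.
    public static int Probe(object key, int i, int length)
    {
        ulong h = (ulong)H1(key) + (ulong)i * H2(key, length);
        return (int)(h % (ulong)length);
    }
}
```

For the integer key 100 and a table of length 11, H1 is 100 and H2 is 5, so the first three probes are 100 % 11 = 1, 105 % 11 = 6, and 110 % 11 = 0. When the length is a prime number, the step and the length share no common factor, so the sequence eventually visits every slot.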
Expanding the container also has its subtleties.
Because this material is hard to grasp for readers who are new to the topic, I left it out of the article. Once the principle is understood, the rest becomes simple.
Microsoft's Dictionary implementation is also available; if you are interested, go and see how Microsoft does it:
http://referencesource.microsoft.com/#mscorlib (you will need to search for Dictionary, of course)