Parsing the Hash Table

Source: Internet
Author: User
Tags: repetition

I have read many articles online about hash tables, but none of them seemed to explain the topic clearly enough (forgive me for not having read a book on it; I picked up these basics by studying blogs). So today I decided to write about the hash table myself.

A hash table is also called a hash map. In C#, the classic examples are Hashtable and Dictionary, and Dictionary in particular is used everywhere. Both store key-value pairs, look values up by key, and do so quickly. How are they implemented internally? Why are lookups fast? Do they have any drawbacks? The following sections answer these questions one by one.

First, we need a container (an array) to hold the elements we insert. Since a dictionary stores key-value pairs, each slot of the container should look like this:

    // Container entry
    private struct Bucket
    {
        public object key;
        public object value;
    }

    private Bucket[] buckets = null; // Container

Suppose we want to insert the key "3B" with the value 1. We call key.GetHashCode() to get a hash value (an int), reduce that value to an index into the array, and store the key and value together at that index.
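The step from key to array index can be sketched as follows. This is only an illustration, not part of the original code: the names BucketIndexDemo and IndexFor are made up for the example, and int keys are used because an int's hash code is the value itself, which keeps the results predictable.

```csharp
using System;

public static class BucketIndexDemo
{
    // Maps any key to a slot in the range [0, length - 1].
    // The cast to uint reinterprets a negative hash code as a large
    // positive number, so the remainder is never negative.
    public static int IndexFor(object key, int length)
    {
        uint hash = unchecked((uint)key.GetHashCode());
        return (int)(hash % (uint)length);
    }
}
```

For instance, IndexFor(23, 10) is 3, and IndexFor(-7, 10) is 9 rather than a negative number.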

Here K stands for key and V stands for value.

With this in hand, we can start building code.

    using System;

    class Program
    {
        static void Main(string[] args)
        {
            MyHashTable table = new MyHashTable();
            table.Add("3B", 1);
            table.Add("WEA", 2);
            var a = table.Find("3B");
        }
    }

    public class MyHashTable
    {
        private struct Bucket
        {
            public object key;
            public object value;
        }

        private Bucket[] buckets = null; // Container

        public MyHashTable()
        {
            buckets = new Bucket[10];
        }

        public void Add(object key, object value)
        {
            uint index = (uint)key.GetHashCode() % (uint)buckets.Length;
            Bucket temp = new Bucket() { key = key, value = value };
            buckets[index] = temp; // Overwrites whatever was stored in this slot
        }

        public object Find(object key)
        {
            uint index = (uint)key.GetHashCode() % (uint)buckets.Length;
            var item = buckets[index];
            // Use Equals: != on two objects would only compare references
            if (!key.Equals(item.key))
                throw new Exception("The key does not exist");
            return item.value;
        }
    }

When adding, we call key.GetHashCode() to get the hash value (an int), cast it to a 32-bit unsigned integer, and take the remainder modulo the array length to get the index. For a = b % c with non-negative b, a lies in the range 0 to c-1, so the index can never go out of bounds. But this raises two problems.

Q1. What if different keys produce the same index after the computation? How do we store and find them?

    table.Add("WEA", 2);
    table.Add("QQ1", 2);

For both of these keys, the expression (uint)key.GetHashCode() % (uint)buckets.Length evaluated to 8 on the author's machine (string hash codes can differ between .NET runtimes). We call this situation a conflict in what follows.

Q2. The container is initialized with size 10. If I insert more than 10 elements, I have to expand it. But once the container size (the length of the array) changes, the expression

    uint index = (uint)key.GetHashCode() % (uint)buckets.Length;

gives different results than before. How do we deal with that?
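Both problems can be reproduced with a few lines. The sketch below is illustrative only: ConflictDemo, IndexFor, and Conflicts are hypothetical names, and int keys are used so that the indexes do not depend on how a particular runtime hashes strings.

```csharp
using System;

public static class ConflictDemo
{
    // Same index formula as in the article, with the length as a parameter.
    public static int IndexFor(object key, int length)
    {
        return (int)(unchecked((uint)key.GetHashCode()) % (uint)length);
    }

    // Q1: two different keys conflict when they map to the same index.
    public static bool Conflicts(object a, object b, int length)
    {
        return !a.Equals(b) && IndexFor(a, length) == IndexFor(b, length);
    }
}
```

With length 10, the keys 8 and 18 both map to index 8, so Conflicts(8, 18, 10) is true (Q1). After growing to length 20, IndexFor(18, 20) becomes 18 while IndexFor(8, 20) stays 8, so an element left at its old position could no longer be found (Q2).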

A1. We first change the container (the buckets array) so that each array element is no longer a single Bucket but a linked list of Buckets (if you are not familiar with linked lists, think of it as a collection). Elements whose keys convert to the same index can then share one position. On lookup, we use the index to reach that linked list and then search it for the element with the matching key.

A2. We can redistribute the existing elements (equivalent to calling the Add method again for each of them).

For example, suppose we initialize the array with length 10. When it is fully occupied, we extend the array to 20. Then, for every stored key, we recompute the index against the new length:

    uint index = (uint)key.GetHashCode() % 20;

Each element is then inserted into the new array at that index, and the arrangement is correct again. In effect, this is a remapping.
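The remapping step might look like the sketch below. It is a simplification under stated assumptions: keys are plain ints, each slot holds a List<int> in the spirit of A1, and RehashDemo and Rehash are hypothetical names not taken from the original code.

```csharp
using System;
using System.Collections.Generic;

public static class RehashDemo
{
    // Builds a new bucket array of newLength and re-inserts every key
    // at the index computed against the NEW length.
    public static List<int>[] Rehash(List<int>[] oldBuckets, int newLength)
    {
        var newBuckets = new List<int>[newLength];
        foreach (List<int> chain in oldBuckets)
        {
            if (chain == null) continue; // empty slot, nothing to move
            foreach (int key in chain)
            {
                int index = (int)(unchecked((uint)key.GetHashCode()) % (uint)newLength);
                if (newBuckets[index] == null) newBuckets[index] = new List<int>();
                newBuckets[index].Add(key);
            }
        }
        return newBuckets;
    }
}
```

If slot 8 of a length-10 table holds the keys 8 and 18, rehashing to length 20 leaves 8 in slot 8 and moves 18 to slot 18, so the two keys no longer share a position.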

One more point: the size of the container (Length, the length of the array) is related to the number of elements added (ElementCount). For a fixed container size, the more elements there are, the more likely a conflict becomes.

Say my container size is 2 and I have added one element at index 0. The probability of a conflict when I add the next element is 50%. If the container size is 3, that probability is only 33.33%.
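These numbers follow from a simple ratio, assuming the hash function spreads keys evenly over all slots (an idealization): if k of the m slots are already occupied, the next key hits an occupied slot with probability k/m. The symbols k and m are introduced here for illustration.

```latex
P(\text{conflict}) = \frac{k}{m}, \qquad
\left.\frac{k}{m}\right|_{k=1,\,m=2} = \frac{1}{2} = 50\%, \qquad
\left.\frac{k}{m}\right|_{k=1,\,m=3} = \frac{1}{3} \approx 33.33\%
```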

Yet we cannot avoid conflicts by making the container very large from the start, just as we would not cook one cup of rice in the canteen's oversized pot.

Nor can we simply let conflicts pile up (and walk the linked list one element at a time), because the table would then lose its lookup performance advantage and behave like a linear search.

How do we strike a balance between the two?

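Define float r = (float)ElementCount / Length (the cast matters, because dividing two ints in C# truncates the result). r is best kept between 0.6 and 0.7. This article uses 0.62 (close to the golden ratio); Microsoft's Hashtable uses 0.72. If r > 0.62, we expand the container.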
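The expansion check can be isolated into a small helper. This is a sketch only: LoadFactorDemo, MaxLoad, and ShouldExpand are hypothetical names, and the 0.62 threshold is simply the value the article's own code uses.

```csharp
using System;

public static class LoadFactorDemo
{
    // Threshold taken from the article's own code (0.62).
    public const float MaxLoad = 0.62f;

    // Returns true when the table should grow before the next insert.
    // The cast to float is essential: count / length on two ints would
    // truncate to 0 whenever count < length.
    public static bool ShouldExpand(int count, int length)
    {
        return (float)count / length > MaxLoad;
    }
}
```

With length 10, ShouldExpand(6, 10) is false and ShouldExpand(7, 10) is true, so once seven elements are stored the next insert expands the container first.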

    using System;
    using System.Collections.Generic;
    using System.Linq;

    class Program
    {
        public static void Main()
        {
            MyHash set = new MyHash();
            List<string> list = new List<string>()
            {
                "EF", "AB", "FF", "GG", "ee", "ZF", "ASE", "FGE", "Qweg", "Qalspo", "Goo2", "1qwe", "wet93"
            };
            for (int i = 0; i < list.Count; i++)
            {
                var item = list[i];
                set.Add(item, i);
            }
            var value = set.Find("GG");
        }
    }

    public class MyHash
    {
        // Container entry
        private struct Bucket
        {
            public object key;
            public object value;
        }

        private LinkedList<Bucket>[] buckets = null; // Container
        private int count;     // Number of saved elements
        private int step = 10; // Number of slots added per expansion

        public MyHash()
        {
            InitialBuckets(step);
        }

        private void InitialBuckets(int length)
        {
            if (buckets == null) // If empty, the container is initialized
            {
                buckets = new LinkedList<Bucket>[length];
                return;
            }
            // Otherwise, expand the container
            length = length + buckets.Length;
            var newBuckets = new LinkedList<Bucket>[length];
            count = 0;
            foreach (LinkedList<Bucket> linkList in buckets)
            {
                if (linkList == null)
                    continue;
                foreach (Bucket item in linkList)
                {
                    int index = GetIndex(item.key, length);
                    InsertIntoBuckets(index, item.key, item.value, newBuckets); // Rearrange
                }
            }
            buckets = newBuckets;
        }

        private int GetIndex(object key, int length)
        {
            return (int)((uint)key.GetHashCode() % (uint)length);
        }

        private void InsertIntoBuckets(int index, object key, object value, LinkedList<Bucket>[] buckets)
        {
            var linkList = buckets[index];
            if (linkList == null)
                linkList = new LinkedList<Bucket>();
            // Equals compares key contents; == on two objects compares references only
            if (linkList.Any(x => key.Equals(x.key)))
                throw new Exception("key already exists: " + key);
            Bucket item = new Bucket() { key = key, value = value };
            linkList.AddLast(item);
            buckets[index] = linkList;
            count++;
        }

        public void Add(object key, object value)
        {
            // If the ratio is greater than 0.62, expand the container
            if ((float)count / (float)buckets.Length > 0.62)
            {
                InitialBuckets(step);
            }
            int index = GetIndex(key, buckets.Length);
            InsertIntoBuckets(index, key, value, buckets);
        }

        public object Find(object key)
        {
            int index = GetIndex(key, buckets.Length);
            var linkList = buckets[index];
            if (linkList == null)
                throw new Exception("This key does not exist");
            Bucket item = linkList.FirstOrDefault(x => key.Equals(x.key));
            if (item.key == null)
                throw new Exception("This key does not exist");
            return item.value;
        }
    }

When we look up a value by key, we convert the key to an index with (uint)key.GetHashCode() % (uint)length.

So in the ideal case (different keys hash to different indexes, that is, no conflicts), the time complexity of a lookup is O(1). This is its advantage. If there are conflicts, the search within a slot is linear. So you can see that the most important part of the hash table data structure is the hash algorithm itself.

Now you can see the advantage of the hash table: no matter how large the data set is, a lookup is O(1) (ideally). The price it pays for this is extra space.

That is the general principle. However, the Hashtable class we commonly use is not built this way. It resolves conflicts with double hashing, and it probes addresses as follows:

    h(key, i) = h1(key) + i * h2(key)

The hash functions h1 and h2 are defined as follows:

    h1(key) = key.GetHashCode()

    h2(key) = 1 + (((h1(key) >> 5) + 1) % (buckets.Length - 1))

Because a second hash is added in, the value of h(key, i) is likely to exceed the container size, so h(key, i) is taken modulo the length, and the final hash address is:

    hash address = h(key, i) % buckets.Length
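The probe sequence defined by these formulas can be traced with a short sketch. It is not Microsoft's implementation: DoubleHashingDemo and Probe are hypothetical names, h1 is passed in directly instead of being read from a key, and masking off the sign bit is an assumption made here so that the shift and the remainder stay non-negative.

```csharp
using System;

public static class DoubleHashingDemo
{
    // Returns the i-th probe address for a key whose first hash is h1.
    // Assumptions: the sign bit of h1 is masked off, and length > 1.
    public static int Probe(int h1, int i, int length)
    {
        long first = h1 & 0x7FFFFFFF;
        // h2 is always in the range [1, length - 1], so the step is never 0.
        long h2 = 1 + (((first >> 5) + 1) % (length - 1));
        return (int)((first + i * h2) % length);
    }
}
```

For h1 = 100 and length 11, h2 works out to 5, and the first four probes (i = 0 to 3) are 1, 6, 0, and 5. When a slot is occupied, the next value of i gives the next slot to try.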

The way the container expands is also carefully designed.

Because this material is hard to follow for readers who are new to the topic, I did not put it into this article. Once the principle is understood, the rest becomes very simple.

Microsoft's Dictionary implementation is also available online. If you are interested, you can see how Microsoft does it:

http://referencesource.microsoft.com/#mscorlib (search for Dictionary there)
