Transferred from: http://www.java3z.com/cwbwebhome/article/article8/83560.html?id=4649
——————————————————————————————————————————————————————————————————
This article explores some of the principles and concepts behind hash tables and, based on them, designs a hash table for storing and locating data, then compares it with the HashMap class in the JDK. We proceed in seven steps:
One. Hash table concept
Two. Methods for constructing hash functions and where each applies
Three. Methods for handling hash conflicts and their characteristics
Four. The hash lookup process
Five. Implementing a scenario with hashed data: the lookup and insert algorithms
Six. The implementation of HashMap in the JDK
Seven. Comparison between the hash table and HashMap, and performance analysis
One. Hash Table Concept
In a hash table there is a definite relationship between a record's position in the table and its key. This lets us know in advance where the key being looked up sits in the table, so we can locate the record directly by its subscript.
1) A hash function is a mapping from the set of keys to a set of addresses. It can be defined flexibly,
as long as the address set does not exceed the allowed range.
2) Because a hash function is a compressing mapping, "conflicts" generally arise easily,
that is: key1 != key2, yet f(key1) = f(key2).
3) Conflicts can only be minimized, not avoided completely, because the key set is usually much larger (its elements include every possible key),
while the address set contains only the address values in the hash table. When constructing this special kind of "lookup table", you need to choose a "good" hash function (one with as few conflicts as possible),
and you also need a way to "handle conflicts".
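A conflict of this kind is easy to reproduce in Java. The sketch below is only an illustration: the class name CollisionDemo and the helper address() are hypothetical, and taking the non-negative remainder of hashCode() is just one assumed way of compressing a key into an address set of size m.

```java
// Illustrative sketch: CollisionDemo and address() are hypothetical names.
class CollisionDemo {
    // Maps a key to an address in [0, m) using the key's hashCode().
    static int address(Object key, int m) {
        return Math.floorMod(key.hashCode(), m);
    }

    public static void main(String[] args) {
        String key1 = "Aa";
        String key2 = "BB";
        // The keys differ, yet String.hashCode() yields 2112 for both,
        // so f(key1) = f(key2) for any table size m.
        System.out.println(key1.equals(key2));                      // false
        System.out.println(key1.hashCode() == key2.hashCode());     // true
        System.out.println(address(key1, 16) == address(key2, 16)); // true
    }
}
```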
Two. Methods for constructing hash functions and where each applies
Direct addressing method
Digit analysis method
Mid-square method
Folding method
Division-remainder method
Random number method
(1) Direct addressing method:
The hash function is a linear function of the key: H(key) = key or H(key) = a * key + b
(2) Digit analysis method:
Assume each key in the key set consists of s digits (u1, u2, ..., us). Analyze the key set as a whole
and extract the evenly distributed digits, or a combination of them, as the address.
This method is suitable when the frequency of each digit at each position of the keys can be estimated beforehand.
(3) Mid-square method:
Take the middle digits of the key's square as the storage address. Squaring the key serves to "widen the differences",
and at the same time the middle digits of the square are affected by every digit of the whole key.
(4) Folding method:
Split the key into several sections and take their sum as the hash address. There are two ways to sum the sections. Shift folding:
add the sections with their lowest digits aligned. Boundary folding: fold the key back and forth along the section boundaries, then align and add.
This method is suitable when the key has a particularly large number of digits.
(5) Division-remainder method:
Set the hash function to: H(key) = key MOD p (p ≤ m), where m is the table length and p is a prime not greater than m, or a composite number with no prime factor below 20
(6) Random number method:
Set the hash function to: H(key) = random(key), where random is a pseudo-random function
When actually building a table, which method to use for the hash function depends on the key set (including the range and form of the keys)
and on the hash table length (the range of hash addresses). The general principle is to make conflicts as unlikely as possible.
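Three of the constructions above can be sketched in a few lines of Java. This is a minimal sketch under stated assumptions, not a reference implementation: the class and method names are hypothetical, keys are assumed to be non-negative integers, the mid-square variant assumes keys of at most four decimal digits, and the folding variant assumes three-digit sections.

```java
// Illustrative sketch: HashFunctions and its methods are hypothetical names.
class HashFunctions {
    // Division-remainder method: H(key) = key MOD p, with p <= table length.
    static int divisionRemainder(int key, int p) {
        return Math.floorMod(key, p);
    }

    // Mid-square method: square the key and keep the middle four digits.
    // Assumes the key has at most four decimal digits (square <= 8 digits).
    static int midSquare(int key) {
        long square = (long) key * key;
        return (int) ((square / 100) % 10000);
    }

    // Folding method (shift folding): cut the key into three-digit sections,
    // add them with the lowest digits aligned, and drop the carry.
    static int shiftFold(long key) {
        long sum = 0;
        for (long rest = key; rest > 0; rest /= 1000) {
            sum += rest % 1000;
        }
        return (int) (sum % 1000);
    }

    public static void main(String[] args) {
        System.out.println(divisionRemainder(100, 13)); // 9
        System.out.println(midSquare(1234));            // 1234^2 = 1522756 -> 5227
        System.out.println(shiftFold(123456789L));      // 789 + 456 + 123 = 1368 -> 368
    }
}
```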
Three. Methods for handling hash conflicts and their characteristics
"Handling a conflict" really means finding the next hash address for the key that caused the conflict.
Open addressing method
Re-hashing method
Chain address method
(1) Open addressing method:
For the conflicting address H(key), derive an address sequence H0, H1, H2, ..., Hs (1 ≤ s ≤ m-1), with Hi = (H(key) + di) MOD m,
where i = 1, 2, ..., s; H(key) is the hash function; m is the hash table length; and di is the increment sequence.
(2) Chain address method:
All records with the same hash address are linked into the same linked list.
(3) Re-hashing method:
Construct several hash functions; when a conflict occurs, compute the next hash address with another hash function, until the conflict no longer occurs.
That is: Hi = RHi(key), i = 1, 2, ..., k, where the RHi are different hash functions. Characteristic: the computation time increases.
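The open addressing formula Hi = (H(key) + di) MOD m can be made concrete by listing the probe addresses it produces. The text does not say which increments di are meant, so the sketch below assumes two common textbook choices: linear probing (di = i) and quadratic probing (di = +1², -1², +2², -2², ...). The class and method names are hypothetical.

```java
import java.util.Arrays;

// Illustrative sketch: ProbeSequence and its methods are hypothetical names.
class ProbeSequence {
    // H0..Hs for linear probing, where d_i = i (assumed increment sequence).
    static int[] linear(int home, int m, int s) {
        int[] seq = new int[s + 1];
        for (int i = 0; i <= s; i++) {
            seq[i] = (home + i) % m;
        }
        return seq;
    }

    // H0..Hs for quadratic probing, where d_i = +1, -1, +4, -4, +9, ...
    static int[] quadratic(int home, int m, int s) {
        int[] seq = new int[s + 1];
        seq[0] = home;
        for (int i = 1; i <= s; i++) {
            int k = (i + 1) / 2;
            int d = (i % 2 == 1) ? k * k : -k * k;
            seq[i] = Math.floorMod(home + d, m);
        }
        return seq;
    }

    public static void main(String[] args) {
        // Home address 9 in a table of length m = 11.
        System.out.println(Arrays.toString(linear(9, 11, 3)));    // [9, 10, 0, 1]
        System.out.println(Arrays.toString(quadratic(9, 11, 4))); // [9, 10, 8, 2, 5]
    }
}
```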
Four. Hash lookup process
For a given value K, compute the hash address i = H(K). If r[i] = NULL, the lookup fails; if r[i].key = K, the lookup succeeds.
Otherwise "find the next address Hi" and repeat until r[Hi] = NULL (lookup fails) or r[Hi].key = K (lookup succeeds).
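The lookup process can be sketched as follows. This is a minimal sketch, assuming open addressing with linear probing, integer keys, H(K) = K MOD m, and no deletions; the class OpenAddressTable and its methods are hypothetical names, with the array r playing the role of r[i] in the description.

```java
// Illustrative sketch: OpenAddressTable is a hypothetical name.
// Assumes linear probing, int keys, H(K) = K MOD m, and no deletions.
class OpenAddressTable {
    private final Integer[] r; // r[i] == null marks an empty slot

    OpenAddressTable(int m) {
        r = new Integer[m];
    }

    private int h(int key) {
        return Math.floorMod(key, r.length);
    }

    // Inserts key; returns false if it is already present or the table is full.
    boolean insert(int key) {
        int home = h(key);
        for (int d = 0; d < r.length; d++) {
            int i = (home + d) % r.length;
            if (r[i] == null) {
                r[i] = key;
                return true;
            }
            if (r[i] == key) {
                return false;
            }
        }
        return false;
    }

    // Returns the slot holding key, or -1 when the lookup fails.
    int search(int key) {
        int home = h(key);
        for (int d = 0; d < r.length; d++) {
            int i = (home + d) % r.length;
            if (r[i] == null) {
                return -1; // empty slot reached: lookup unsuccessful
            }
            if (r[i] == key) {
                return i;  // key found: lookup successful
            }
        }
        return -1; // every slot probed without success
    }

    public static void main(String[] args) {
        OpenAddressTable table = new OpenAddressTable(11);
        table.insert(9);   // home slot 9
        table.insert(20);  // 20 MOD 11 = 9, taken, so it lands in slot 10
        table.insert(31);  // 31 MOD 11 = 9, slots 9 and 10 taken, wraps to slot 0
        System.out.println(table.search(31)); // 0
        System.out.println(table.search(42)); // -1
    }
}
```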
Five. Implementing a scenario with hashed data: the hash lookup and insert algorithms
Suppose we want to design a data table that holds the personal information of all students at Central South University. Because the number of enrolled students is not especially large (about 80,000)
and each student number is unique, we could simply apply the direct addressing method: declare an array of 100,000 slots and use each student number as the primary key.
Then every time we want to add or find a student, we just access the array directly by student number.
Clearly, though, this is a very naive design. The system scales and reuses poorly: what if one day the number of people exceeds 100,000?
What if it is used to hold other data? Or what if only 20 records need to be saved? Declaring an array of 100,000 slots is then obviously too wasteful.
If we need to store large amounts of data (such as the user records of large banks, perhaps some 3.5 billion?), the hash codes we compute
are very likely to conflict, so our system should be able to "handle conflicts". Here we handle them with the chaining method (the chain address method).
If the data volume is very large and keeps growing, and we handle conflicts only by chaining, a single chain may end up holding
tens of thousands of records. Searching such a linked list by a linear scan obviously performs very poorly. Therefore our system should also expand automatically:
when the number of stored records reaches a certain proportion of the capacity, the table grows, so that the load factor stays at a fixed level.
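The arithmetic behind automatic expansion is small enough to sketch. The names below are hypothetical, and the default capacity of 16, the load factor of 0.75 and the growth factor are assumptions chosen for illustration; the growth factor is left as a parameter because different implementations multiply the capacity by different amounts.

```java
// Illustrative sketch: LoadFactorDemo and its methods are hypothetical names.
class LoadFactorDemo {
    // Largest number of entries a table of this capacity holds before expanding.
    static int threshold(int capacity, float loadFactor) {
        return (int) (capacity * loadFactor);
    }

    // Capacity reached by repeatedly multiplying by growth (assumed >= 2)
    // until n entries fit under the threshold.
    static int capacityFor(int n, int capacity, int growth, float loadFactor) {
        while (n > threshold(capacity, loadFactor)) {
            capacity *= growth;
        }
        return capacity;
    }

    public static void main(String[] args) {
        System.out.println(threshold(16, 0.75f));             // 12
        System.out.println(capacityFor(20, 16, 2, 0.75f));    // 32
        // The 80,000 student records from the example, with doubling:
        System.out.println(capacityFor(80000, 16, 2, 0.75f)); // 131072
    }
}
```

The same table therefore serves 20 records with 32 slots and 80,000 records with 131,072 slots, without the caller choosing a size up front.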
To sum up, our basic requirements for this hash container are:
Meet the lookup requirements of a hash table (obviously)
Adapt automatically from small data volumes to large ones (automatic expansion)
Resolve conflicts with the chaining method
Well, now that the analysis has come this far, I'll cut the talk and get straight to the code.
import java.util.Objects;

public class MyMap<K, V> {
    private int size; // number of entries currently stored
    private static final int INIT_CAPACITY = 16; // default capacity
    private static final float LOAD_FACTOR = 0.75f; // default load factor
    private Entry<K, V>[] container; // the array that actually stores the data
    private final float loadFactor; // load factor of this map
    private int max; // largest size before expanding = capacity * loadFactor

    // Constructor with a caller-supplied capacity and load factor.
    // indexFor() masks with (length - 1), so the capacity must be a power of two.
    @SuppressWarnings("unchecked")
    public MyMap(int initCapacity, float loadFactor) {
        if (initCapacity <= 0 || Integer.bitCount(initCapacity) != 1)
            throw new IllegalArgumentException("Illegal initial capacity: " + initCapacity);
        if (loadFactor <= 0 || Float.isNaN(loadFactor))
            throw new IllegalArgumentException("Illegal load factor: " + loadFactor);
        this.loadFactor = loadFactor;
        max = (int) (initCapacity * loadFactor);
        container = (Entry<K, V>[]) new Entry[initCapacity];
    }

    // Constructor with default parameters.
    public MyMap() {
        this(INIT_CAPACITY, LOAD_FACTOR);
    }

    /**
     * Stores the pair (k, v). Null keys are not supported.
     *
     * @param k the key
     * @param v the value
     * @return true if a new entry was added; false if the key was already
     *         present (its value is replaced if it differs)
     */
    public boolean put(K k, V v) {
        // 1. Compute the hash of k. A hash algorithm that suits every type is
        // hard to write ourselves, so call the JDK's hashCode() method.
        int hash = k.hashCode();
        // 2. Expand before inserting, so the subscript is computed against
        // the table the entry will actually live in.
        if (size >= max) {
            reSize(container.length * 4);
        }
        // 3. Wrap all the information in an Entry and store it.
        Entry<K, V> temp = new Entry<>(k, v, hash);
        if (setEntry(temp, container)) {
            size++; // one more entry
            return true;
        }
        return false;
    }

    /**
     * Expands the table.
     *
     * @param newSize the new container size
     */
    @SuppressWarnings("unchecked")
    private void reSize(int newSize) {
        // 1. Declare the new array.
        Entry<K, V>[] newTable = (Entry<K, V>[]) new Entry[newSize];
        max = (int) (newSize * loadFactor);
        // 2. Copy the existing elements: traverse them all and store each again.
        for (int j = 0; j < container.length; j++) {
            Entry<K, V> entry = container[j];
            // Each array element is really a linked list, so walk the chain.
            while (null != entry) {
                Entry<K, V> next = entry.next; // remember the rest of the old chain
                entry.next = null; // drop the stale link before re-adding
                setEntry(entry, newTable);
                entry = next;
            }
        }
        // 3. Point the container at the new table.
        container = newTable;
    }

    /**
     * Adds the node temp to the given hash table. If the key already exists,
     * its value is replaced when it differs and no node is added.
     *
     * @param temp  the node to add
     * @param table the table to add it to
     * @return true if temp was linked into the table as a new node
     */
    private boolean setEntry(Entry<K, V> temp, Entry<K, V>[] table) {
        // 1. Find the subscript from the hash value.
        int index = indexFor(temp.hash, table.length);
        // 2. Find the element at that subscript.
        Entry<K, V> entry = table[index];
        // 3. If the bucket is occupied
        if (null != entry) {
            // 3.1 traverse the whole list, checking for an equal key.
            while (true) {
                // Compare references first, then fall back to equals().
                if (temp.hash == entry.hash
                        && (temp.key == entry.key || temp.key.equals(entry.key))) {
                    // Same key: replace the value if it differs, add nothing.
                    if (!Objects.equals(temp.value, entry.value)) {
                        entry.value = temp.value;
                    }
                    return false;
                }
                // Reached the tail of the chain: stop the loop.
                if (null == entry.next) {
                    break;
                }
                // Not at the tail yet: continue with the next element.
                entry = entry.next;
            }
            // 3.2 No equal key anywhere in the chain: hang temp on the tail.
            addEntry2Last(entry, temp);
            return true;
        }
        // 4. If the bucket is empty, temp becomes its first element.
        setFirstEntry(temp, index, table);
        return true;
    }

    private void addEntry2Last(Entry<K, V> entry, Entry<K, V> temp) {
        entry.next = temp;
        temp.next = null; // temp is the new tail
    }

    /**
     * Places the node temp at subscript index of the given table.
     */
    private void setFirstEntry(Entry<K, V> temp, int index, Entry<K, V>[] table) {
        table[index] = temp;
        // Important: the node heads a new list here, so any nodes it used to
        // link to must be dropped. Missing this line cost the author 7 hours
        // (during refactoring).
        temp.next = null;
    }

    /**
     * Looks up the value stored for k.
     *
     * @param k the key
     * @return the value, or null if the key is absent
     */
    public V get(K k) {
        // 1. Compute the hash of k.
        int hash = k.hashCode();
        // 2. Find the subscript from the hash value.
        int index = indexFor(hash, container.length);
        // 3. Find the linked list at that subscript.
        Entry<K, V> entry = container[index];
        // 4. Traverse the list and return the value of the first equal key.
        while (null != entry) {
            if (hash == entry.hash && (k == entry.key || entry.key.equals(k))) {
                return entry.value;
            }
            entry = entry.next;
        }
        // No equal key found (or the list was empty).
        return null;
    }

    /**
     * Computes the subscript in the container array from the hash code and
     * the length of the container array (a power of two).
     */
    public int indexFor(int hashcode, int containerLength) {
        return hashcode & (containerLength - 1);
    }

    /**
     * The class that actually holds the data. Because conflicts are resolved
     * by chaining, it is designed as a linked-list node.
     */
    private static class Entry<K, V> {
        Entry<K, V> next; // next node
        K key; // key
        V value; // value
        int hash; // hash code of the key, cached so it need not be recomputed

        Entry(K k, V v, int hash) {
            this.key = k;
            this.value = v;
            this.hash = hash;
        }
    }
}
When elements are added for the first time, the next of each element is already empty. During expansion in reSize(), however,
because conflicts are handled with a chain structure, the elements being re-added to the hash table still carry their old links, so the next of each re-added element must be set to NULL.
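The point about stale next pointers can be isolated in a small sketch. It is an illustration under assumptions, not the article's code: the names RehashDemo, Node and rehash are hypothetical, the new length is assumed to be a power of two, and nodes are inserted at the head of the new bucket (so the old pointer is overwritten rather than explicitly set to NULL; the first node in a bucket still ends up with next == null).

```java
// Illustrative sketch: RehashDemo, Node and rehash() are hypothetical names.
class RehashDemo {
    static final class Node {
        final int hash;
        Node next;

        Node(int hash) {
            this.hash = hash;
        }
    }

    // Moves every node of oldTable into a new table with newLength buckets.
    // newLength is assumed to be a power of two.
    static Node[] rehash(Node[] oldTable, int newLength) {
        Node[] newTable = new Node[newLength];
        for (Node head : oldTable) {
            Node node = head;
            while (node != null) {
                Node next = node.next;        // save the rest of the old chain first
                int index = node.hash & (newLength - 1);
                node.next = newTable[index];  // overwrite the stale link
                newTable[index] = node;
                node = next;                  // continue along the OLD chain
            }
        }
        return newTable;
    }

    static int chainLength(Node head) {
        int n = 0;
        for (Node node = head; node != null; node = node.next) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // One chain in bucket 1 of a 2-bucket table: hashes 1, 3, 5, 7.
        Node[] old = new Node[2];
        Node a = new Node(1), b = new Node(3), c = new Node(5), d = new Node(7);
        a.next = b; b.next = c; c.next = d;
        old[1] = a;
        Node[] grown = rehash(old, 4);
        System.out.println(chainLength(grown[1])); // 2 (hashes 1 and 5)
        System.out.println(chainLength(grown[3])); // 2 (hashes 3 and 7)
    }
}
```

If the saved next were read after the node had been re-linked, the loop would follow the new chain instead of the old one and either lose nodes or cycle.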
Seven. Performance analysis:
1. Because conflicts exist, the lookup cost cannot be guaranteed to reach O(1) in every case.
2. The average search length of a hash table is a function of the load factor α, not of n.
3. When a hash table is used to build a lookup table, an appropriate load factor can be chosen to keep the average search length within a bound.
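To make point 2 concrete, the sketch below evaluates the usual textbook approximations for the average search length as a function of α. The article itself gives no formulas, so these are brought in as an assumption: for chaining, about 1 + α/2 on a successful lookup and α + e^(-α) on an unsuccessful one; for linear probing, about (1 + 1/(1 - α))/2 on a successful lookup. The class and method names are hypothetical.

```java
// Illustrative sketch: AverageSearchLength and its methods are hypothetical
// names; the formulas are standard textbook approximations, not measurements.
class AverageSearchLength {
    // Chaining, successful lookup: about 1 + alpha / 2.
    static double chainingHit(double alpha) {
        return 1 + alpha / 2;
    }

    // Chaining, unsuccessful lookup: about alpha + e^(-alpha).
    static double chainingMiss(double alpha) {
        return alpha + Math.exp(-alpha);
    }

    // Linear probing, successful lookup: about (1 + 1 / (1 - alpha)) / 2.
    static double linearProbingHit(double alpha) {
        return 0.5 * (1 + 1 / (1 - alpha));
    }

    public static void main(String[] args) {
        // None of the formulas mentions n: only the load factor matters.
        for (double alpha : new double[] {0.5, 0.75, 0.9}) {
            System.out.printf("alpha=%.2f chaining hit=%.3f miss=%.3f linear hit=%.3f%n",
                    alpha, chainingHit(alpha), chainingMiss(alpha), linearProbingHit(alpha));
        }
    }
}
```

At α = 0.75 the chaining estimate is 1.375 comparisons per successful lookup, whether the table holds a thousand entries or a million.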
Finally, here is the performance of our hash map.
Test code
public class Test {
    public static void main(String[] args) {
        MyMap<String, String> mm = new MyMap<String, String>();
        long aBeginTime = System.currentTimeMillis(); // record the begin time
        for (int i = 0; i < 1000000; i++) {
            mm.put("" + i, "" + i * 100);
        }
        long aEndTime = System.currentTimeMillis(); // record the end time
        System.out.println("Insert time-->" + (aEndTime - aBeginTime));

        long lBeginTime = System.currentTimeMillis(); // record the begin time
        mm.get("" + 100000);
        long lEndTime = System.currentTimeMillis(); // record the end time
        System.out.println("Search time--->" + (lEndTime - lBeginTime));
    }
}
With 1,000,000 entries, storing them all takes a little over 1 s, and the measured search time is 0 (in milliseconds, the resolution of the timer used).
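For the comparison with the JDK class, the same measurement can be repeated against java.util.HashMap. The sketch below is only a starting point: the class name HashMapTiming and the helper fill() are hypothetical, the keys and values mirror the test above, and no timings are quoted here because they depend on the machine and the JVM.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: HashMapTiming and fill() are hypothetical names.
class HashMapTiming {
    // Fills a java.util.HashMap with n entries, using the same keys and
    // values as the test of the hand-written map.
    static Map<String, String> fill(int n) {
        Map<String, String> map = new HashMap<>();
        for (int i = 0; i < n; i++) {
            map.put("" + i, "" + i * 100);
        }
        return map;
    }

    public static void main(String[] args) {
        long insertBegin = System.currentTimeMillis();
        Map<String, String> map = fill(1000000);
        long insertEnd = System.currentTimeMillis();
        System.out.println("Insert time-->" + (insertEnd - insertBegin));

        long searchBegin = System.currentTimeMillis();
        String found = map.get("" + 100000);
        long searchEnd = System.currentTimeMillis();
        System.out.println("Search time--->" + (searchEnd - searchBegin));
        System.out.println("Found value: " + found);
    }
}
```

A single millisecond-resolution measurement is coarse; repeating the run several times, or timing many lookups in a loop, gives a fairer comparison.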
Hash table (HashMap) analysis and implementation (Java)