Java Data structures and algorithms (13)--Hash table

Source: Internet
Author: User

A hash table (also called a hash map) is a data structure that is accessed directly by key (key-value storage). It is built on an array: each key is mapped to an array index, which is what makes lookups fast. In an unsorted array or a linked list, finding a key usually means traversing the whole structure, which takes O(N) time, and even a balanced tree needs O(log N). A hash table needs only O(1) time on average.

An important question here is how to convert a key into an array index. The function that does this is called a hash function, and the conversion process is called hashing.

1. Introducing the hash function

Everyone has used a dictionary. Its advantage is that the index at the front lets us jump quickly to the word we are looking for. If we want to load every word of an English dictionary, from a to zyzzyva (the last word in the Oxford dictionary), into computer memory for fast reading and writing, a hash table is a good choice.

Let us narrow the scope and store 5,000 English words in memory. We might give each word its own array cell, so that the array has 5,000 cells and each word can be reached through its array index. The idea sounds perfect, but how do we connect a word to an array index?

First we need to establish a relationship between a word and a number (the array index):

ASCII is one such encoding: a is 97, b is 98, and so on up to 122 for z. Since every word is built from these 26 letters, we do not need numbers as large as the ASCII codes. Instead we design a similar encoding of our own: a is 1, b is 2, and so on up to 26 for z.

How do we combine the numbers of the individual letters into one number that represents the whole word?

  ① Adding the numbers

A simple first method is to add up the numbers of all the letters in the word and use the sum as the array index.

For example, the word cats is converted into a number:

cats = 3 + 1 + 20 + 19 = 43

The word cats is then stored at array index 43, and every English word can be converted to an index in the same way. But is this really feasible?

Suppose we agree that a word has at most 10 letters. Then the last word of the dictionary is zzzzzzzzzz, which is converted to:

zzzzzzzzzz = 26*10 = 260

So the word codes range from 1 to 260. That range is clearly too small for 5,000 words, so some positions must hold several words: on average each array cell has to store about 19 words (5,000 divided by 260).

How can we solve this problem?

  The first approach: let each array cell hold a sub-array or a sub-list. Storing data items this way is indeed fast, but finding one particular word among all the words that share a cell is still slow.

  The second approach: why should so many words share the same cell? The reason is that our encoding does not spread the words out enough, so the array can represent too few distinct values. We need to widen the range of array indexes so that each position holds only one word.

For the second approach, the question is: how do we widen the range of array indexes?
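
The additive scheme from this subsection can be sketched in a few lines of Java. This is only an illustration: the class name AdditiveHash and the method hash are made up for this example, and the code assumes words made of lowercase a-z only.

```java
// Illustrative sketch of the "add the letter codes" scheme.
// Assumption: the input contains only lowercase letters a-z.
class AdditiveHash {
    // Maps a..z to 1..26 and sums the codes of all letters in the word.
    static int hash(String word) {
        int sum = 0;
        for (int i = 0; i < word.length(); i++) {
            sum += word.charAt(i) - 'a' + 1;
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(hash("cats"));        // 3 + 1 + 20 + 19 = 43
        System.out.println(hash("zzzzzzzzzz"));  // 26 * 10 = 260
        // Words with the same letters always share an index:
        System.out.println(hash("act") == hash("cat"));  // true
    }
}
```

The last line shows why the range is too small in practice: any two words with the same letters in a different order land on the same index.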

  ② Multiplying by powers

We decompose a word into the numbers of its letters, multiply each number by the appropriate power of 27 (there are 26 possible letters plus the space, 27 characters in all), and add the products. This gives each word a unique number.

For example, convert the word cats to a number:

cats = 3*27^3 + 1*27^2 + 20*27^1 + 19*27^0 = 59049 + 729 + 540 + 19 = 60337

This process does create a unique number for every word, but note that we have only computed a 4-letter word. For a long word, such as the 10-letter zzzzzzzzzz, the term 27^9 alone is more than 7,000,000,000,000. The result is enormous, and it is impossible to allocate an array that large in real memory.

So the problem with this scheme is that, although every word gets a unique index, only a small fraction of the cells hold a word and the vast majority are empty. What we need now is a way to compress the huge integer range produced by the powers-of-27 scheme into an acceptable array range.
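
As a rough sketch of the powers-of-27 encoding, the following Java code computes the number for a word. The names PowerHash and encode are hypothetical, and the code assumes lowercase a-z words of at most 10 letters so that the result fits in a long.

```java
// Illustrative sketch of the powers-of-27 encoding.
// Assumptions: lowercase a-z only, at most 10 letters (so the value fits in a long).
class PowerHash {
    static long encode(String word) {
        long value = 0;
        for (int i = 0; i < word.length(); i++) {
            int code = word.charAt(i) - 'a' + 1;  // a=1 ... z=26
            value = value * 27 + code;             // same sum, evaluated left to right
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(encode("cats"));        // 60337
        System.out.println(encode("zzzzzzzzzz"));  // 205891132094648
    }
}
```

Multiplying the running value by 27 at each step gives the same sum as the formula above without computing each power separately. The second result shows how quickly the numbers outgrow any array that could be allocated.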

For the English dictionary, suppose again that there are only 5,000 words. We choose an array of capacity 10,000 to store them (we will explain later why the array should be about twice as large as the data). We then need to compress the range from 0 to over 7,000,000,000,000 into the range from 0 to 9,999.

The method is to take the remainder, that is, what is left over after one integer is divided by another. Suppose first that we want to compress the numbers 0-199 (called largeNumber) into the numbers 0-9 (called smallNumber). The latter range has 10 numbers, so the variable smallRange is 10, and the conversion is:

smallNumber = largeNumber % smallRange

The remainder of any number divided by 10 lies between 0 and 9, so this compresses the numbers 0-199 into 0-9, a compression ratio of 20:1.

We can use the same method to compress the unique number that represents a word into an array index:

arrayIndex = largeNumber % smallRange

  This is the hash function. It hashes (transforms) a number from a large range into a number in a small range, and the small range corresponds to the indexes of the array. An array into which data is inserted using a hash function is called a hash table.
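
Putting the two steps together, a string hash function might look like the sketch below. The names ModHash and hash are illustrative. The code assumes lowercase a-z words and an arraySize small enough that arraySize * 27 fits in an int. Taking the remainder after every letter, instead of once at the end, gives the same index but keeps the intermediate value small.

```java
// Illustrative sketch: powers-of-27 encoding compressed with the % operator.
// Assumptions: lowercase a-z only; arraySize * 27 must fit in an int.
class ModHash {
    static int hash(String word, int arraySize) {
        int index = 0;
        for (int i = 0; i < word.length(); i++) {
            int code = word.charAt(i) - 'a' + 1;       // a=1 ... z=26
            index = (index * 27 + code) % arraySize;   // reduce at every step
        }
        return index;
    }

    public static void main(String[] args) {
        System.out.println(157 % 10);                 // 7: largeNumber 157 becomes smallNumber 7
        System.out.println(hash("cats", 10000));      // 60337 % 10000 = 337
        System.out.println(hash("zzzzzzzzzz", 10000)); // always in the range 0-9999
    }
}
```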

2. Collisions

  Compressing a huge range of numbers into a much smaller one means that several different words will inevitably hash to the same array index. This is called a collision (or conflict).

Collisions might seem to make the hashing scheme unworkable. Recall, however, that we chose an array about twice as large as the data to be stored, so roughly half of the cells are empty. When a collision occurs, one approach is to find an empty cell elsewhere in the array in a systematic way and store the word there, instead of at the index given by the hash function. This approach is called open addressing. For example, suppose the word cats hashes to 5421 but that position is already occupied by the word parsnip; we can then try to store cats at position 5422, right after parsnip.

The other approach, mentioned earlier, is to let each array cell hold a linked list (or sub-array) instead of storing words directly in the array. When a collision occurs, the new data item is simply added to the list at that array index. This approach is called separate chaining (the chain address method).

3. Open addressing

In open addressing, if a data item cannot be stored at the array index computed by the hash function, another position must be found. There are three methods: linear probing, quadratic probing, and double hashing.

① Linear probing

Linear probing searches for an empty cell linearly. For example, if 5421 is the position where an item should be inserted but it is already occupied, we try 5422; if 5422 is also occupied, we try 5423, and so on, incrementing the array index until an empty position is found. It is called linear probing because it steps along the array one cell at a time looking for an empty cell.

Full code:

package com.ys.hash;

public class MyHashTable {
    private DataItem[] hashArray; // each DataItem holds the information of one data item
    private int arraySize;        // current size of the array
    private int itemNum;          // number of items actually stored in the array
    private DataItem nonItem;     // marker placed in a cell whose item was deleted

    public MyHashTable(int arraySize) {
        this.arraySize = arraySize;
        hashArray = new DataItem[arraySize];
        nonItem = new DataItem(-1); // deleted data items are marked with key -1
    }

    // Determines whether the array is full
    public boolean isFull() {
        return (itemNum == arraySize);
    }

    // Determines whether the array is empty
    public boolean isEmpty() {
        return (itemNum == 0);
    }

    // Prints the array contents
    public void display() {
        System.out.println("Table:");
        for (int j = 0; j < arraySize; j++) {
            if (hashArray[j] != null) {
                System.out.print(hashArray[j].getKey() + " ");
            } else {
                System.out.print("** ");
            }
        }
    }

    // Converts a key to an array index (keys are assumed to be non-negative)
    public int hashFunction(int key) {
        return key % arraySize;
    }

    // Inserts a data item
    public void insert(DataItem item) {
        if (isFull()) {
            // Extend the hash table
            System.out.println("Hash table full, re-hashing...");
            extendHashTable();
        }
        int key = item.getKey();
        int hashVal = hashFunction(key);
        // Step forward until an empty cell or a deleted cell is found
        while (hashArray[hashVal] != null && hashArray[hashVal].getKey() != -1) {
            ++hashVal;
            hashVal %= arraySize;
        }
        hashArray[hashVal] = item;
        itemNum++;
    }

    /**
     * An array has a fixed size and cannot grow, so extending the hash table means
     * creating a larger array and inserting the data from the old array into it.
     * The hash table computes the position of an item from the array size, so items
     * cannot keep the positions they had in the old array. They therefore cannot be
     * copied directly: the old array must be traversed in order and each item inserted
     * into the new array with insert(). This process is called re-hashing.
     * It is time-consuming, but necessary if the array is to be extended.
     */
    public void extendHashTable() {
        int num = arraySize;
        itemNum = 0;     // recount, because every item is re-inserted below
        arraySize *= 2;  // double the array size
        DataItem[] oldHashArray = hashArray;
        hashArray = new DataItem[arraySize];
        for (int i = 0; i < num; i++) {
            DataItem item = oldHashArray[i];
            // Skip empty cells and deleted markers
            if (item != null && item.getKey() != -1) {
                insert(item);
            }
        }
    }

    // Deletes a data item
    public DataItem delete(int key) {
        if (isEmpty()) {
            System.out.println("Hash Table is empty!");
            return null;
        }
        int hashVal = hashFunction(key);
        // Probe at most arraySize cells, so a table without empty cells cannot loop forever
        for (int probes = 0; probes < arraySize && hashArray[hashVal] != null; probes++) {
            if (hashArray[hashVal].getKey() == key) {
                DataItem temp = hashArray[hashVal];
                hashArray[hashVal] = nonItem; // nonItem marks a deleted cell; its key is -1
                itemNum--;
                return temp;
            }
            ++hashVal;
            hashVal %= arraySize;
        }
        return null;
    }

    // Finds a data item
    public DataItem find(int key) {
        int hashVal = hashFunction(key);
        for (int probes = 0; probes < arraySize && hashArray[hashVal] != null; probes++) {
            if (hashArray[hashVal].getKey() == key) {
                return hashArray[hashVal];
            }
            ++hashVal;
            hashVal %= arraySize;
        }
        return null;
    }

    public static class DataItem {
        private int iData;

        public DataItem(int iData) {
            this.iData = iData;
        }

        public int getKey() {
            return iData;
        }
    }
}

Note that when the hash table becomes too full, we need to extend the array. The data items cannot be placed in the new array at the same positions they had in the old one; each insertion position must be recalculated from the new array size. This is time-consuming, so in general we try to estimate the amount of data in advance and choose a suitable array size up front, so that the table never has to grow.
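
To see why positions change when the array grows, here is a small sketch that re-inserts the keys of an old table into a larger one. It is a simplified stand-in for the idea, not the class above: RehashSketch and rehash are hypothetical names, keys are plain non-negative Integer values, empty cells are null, and newSize is assumed to be at least the number of stored keys.

```java
import java.util.Arrays;

// Illustrative sketch of re-hashing into a larger array with linear probing.
// Assumptions: non-negative keys, null means "empty", newSize >= number of keys.
class RehashSketch {
    static Integer[] rehash(Integer[] oldTable, int newSize) {
        Integer[] newTable = new Integer[newSize];
        for (Integer key : oldTable) {
            if (key == null) {
                continue;                       // nothing stored in this cell
            }
            int index = key % newSize;          // position depends on the NEW size
            while (newTable[index] != null) {
                index = (index + 1) % newSize;  // linear probing on collision
            }
            newTable[index] = key;
        }
        return newTable;
    }

    public static void main(String[] args) {
        // In a table of size 5: 10 -> 0, 6 -> 1, 12 -> 2, 4 -> 4
        Integer[] oldTable = {10, 6, 12, null, 4};
        System.out.println(Arrays.toString(rehash(oldTable, 10)));
        // prints [10, null, 12, null, 4, null, 6, null, null, null]
    }
}
```

Key 6 moves from index 1 to index 6, which is why the old array cannot simply be copied cell by cell.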

In addition, when the hash table becomes fairly full, each insertion has to probe many positions, because many cells are already occupied by items inserted earlier. Runs of consecutive filled cells form, which is called clustering. The fuller the array, the more likely clustering becomes.

It is like a crowd: when someone faints in a shopping mall, a crowd slowly gathers. The first people gather because they saw the person fall; later ones gather because they want to know what everyone else is looking at. The bigger the crowd, the more people it attracts.

② Load factor

The ratio of the number of data items stored in a hash table to the table length is called the load factor. For example, a hash table of 10,000 cells holding 6,667 items has a load factor of 2/3. When the load factor is small, clusters are few and short; when the load factor is large, clustering becomes severe.
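
The load factor is a single division, shown here as a small sketch. The class and method names are hypothetical.

```java
// Illustrative sketch: load factor = stored items / table length.
class LoadFactor {
    static double loadFactor(int itemCount, int arraySize) {
        return (double) itemCount / arraySize;  // cast avoids integer division
    }

    public static void main(String[] args) {
        System.out.println(loadFactor(6667, 10000)); // 0.6667, about 2/3
        System.out.println(loadFactor(5000, 10000)); // 0.5, the table is half empty
    }
}
```

The second call corresponds to the earlier choice of a 10,000-cell array for 5,000 words.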

Linear probing steps through the cells one at a time, so when the load factor is large, clusters form frequently. What if we probed cells that are farther apart instead of stepping one cell at a time? That is the idea of quadratic probing, described next.

③ Quadratic probing

Quadratic probing is a way to prevent clustering. The idea is to probe cells that are far apart rather than cells adjacent to the original position.

In linear probing, if the original index computed by the hash function is x, the probes go to x+1, x+2, x+3, and so on. In quadratic probing, the probes go to x+1, x+4, x+9, x+16, and so on: the distance from the original position is the square of the step number. Quadratic probing eliminates the clustering seen in linear probing (primary clustering) but produces a subtler problem called secondary clustering. For example, suppose 184, 302, 420, and 544 are inserted in that order and all hash to 7. Then 302 needs a probe at distance 1, 420 needs probes up to distance 4, and 544 needs probes up to distance 9. Every further key that hashes to 7 needs a longer probe sequence. Secondary clustering is not a serious problem, but quadratic probing is not often used because there is a better alternative: double hashing.
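
The two probe sequences can be compared with a short sketch. The names ProbeSequences, linear, and quadratic are hypothetical, and the table size of 59 is an arbitrary value chosen for the example. Only the home index 7 comes from the text above.

```java
import java.util.Arrays;

// Illustrative sketch: cells tried after a collision at index `home`.
class ProbeSequences {
    // Linear probing: home+1, home+2, home+3, ... (wrapping around the array)
    static int[] linear(int home, int count, int arraySize) {
        int[] cells = new int[count];
        for (int step = 1; step <= count; step++) {
            cells[step - 1] = (home + step) % arraySize;
        }
        return cells;
    }

    // Quadratic probing: home+1, home+4, home+9, ... (distance = step squared)
    static int[] quadratic(int home, int count, int arraySize) {
        int[] cells = new int[count];
        for (int step = 1; step <= count; step++) {
            cells[step - 1] = (home + step * step) % arraySize;
        }
        return cells;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(linear(7, 4, 59)));    // [8, 9, 10, 11]
        System.out.println(Arrays.toString(quadratic(7, 4, 59))); // [8, 11, 16, 23]
    }
}
```

Every key whose home index is 7 follows exactly the same quadratic sequence, which is the secondary clustering described above.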

  

④ Double hashing

To eliminate both primary and secondary clustering, we can use another method: double hashing.

Secondary clustering arises because the probe steps generated by quadratic probing are always the same: 1, 4, 9, 16, and so on. What we need is a probe sequence that depends on the key rather than being identical for every key. Then different keys use different probe sequences even if they hash to the same array index.

The method is to hash the key a second time with a different hash function and use the result as the step size. For a given key the step size stays constant throughout the probe, but different keys use different step sizes.

The second hash function must have the following properties:

One, it must differ from the first hash function.

Two, it must never output 0 (otherwise the step would be zero, every probe would stay in place, and the algorithm would loop forever).

Experts have found that a hash function of the following form works well: stepSize = constant - key % constant, where constant is a prime smaller than the array capacity.

Double hashing requires the table capacity to be a prime number. Suppose the table length is 15 (indexes 0-14), which is not prime, and a particular key hashes to 0 with a step size of 5. The probe sequence is then 0, 5, 10, 0, 5, 10, and so on forever. The algorithm only ever tries these three cells, so it can never find an empty cell elsewhere, and it ends up looping endlessly. If the array size is 13, which is prime, the probe sequence eventually visits every cell: 0, 5, 10, 2, 7, 12, 4, 9, 1, 6, 11, 3, 8. As long as there is a vacancy in the table, it will be found.
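
The effect of a prime table size can be checked with a small sketch that lists the cells a fixed step visits before the sequence returns to its start. The names DoubleHashProbe, stepSize, and cycle are hypothetical. The formula for the step size is the one given above.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of double-hashing probe sequences.
class DoubleHashProbe {
    // Second hash function: always in the range 1..constant, never 0.
    // Assumes a non-negative key.
    static int stepSize(int key, int constant) {
        return constant - key % constant;
    }

    // Cells visited from `start` with a fixed `step`, until the sequence repeats.
    static List<Integer> cycle(int start, int step, int arraySize) {
        List<Integer> visited = new ArrayList<>();
        int index = start;
        do {
            visited.add(index);
            index = (index + step) % arraySize;
        } while (index != start);
        return visited;
    }

    public static void main(String[] args) {
        System.out.println(cycle(0, 5, 15)); // [0, 5, 10]: only three cells are ever tried
        System.out.println(cycle(0, 5, 13)); // all 13 cells, starting 0, 5, 10, 2, 7, 12, ...
        System.out.println(stepSize(7, 5));  // 5 - 7 % 5 = 3
    }
}
```

With a prime table size, every step size from 1 up to the table size minus 1 shares no factor with the size, so the probe sequence reaches every cell.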

Full double hashing code:

package com.ys.hash;

public class HashDouble {
    private DataItem[] hashArray; // each DataItem holds the information of one data item
    private int arraySize;        // current size of the array
    private int itemNum;          // number of items actually stored in the array
    private DataItem nonItem;     // marker placed in a cell whose item was deleted

    public HashDouble() {
        this.arraySize = 13; // prime, as double hashing requires
        hashArray = new DataItem[arraySize];
        nonItem = new DataItem(-1); // deleted data items are marked with key -1
    }

    // Determines whether the array is full
    public boolean isFull() {
        return (itemNum == arraySize);
    }

    // Determines whether the array is empty
    public boolean isEmpty() {
        return (itemNum == 0);
    }

    // Prints the array contents
    public void display() {
        System.out.println("Table:");
        for (int j = 0; j < arraySize; j++) {
            if (hashArray[j] != null) {
                System.out.print(hashArray[j].getKey() + " ");
            } else {
                System.out.print("** ");
            }
        }
    }

    // First hash function: converts a key to an array index (keys are assumed non-negative)
    public int hashFunction1(int key) {
        return key % arraySize;
    }

    // Second hash function: step size in the range 1..5, never 0
    public int hashFunction2(int key) {
        return 5 - key % 5;
    }

    // Inserts a data item
    public void insert(DataItem item) {
        if (isFull()) {
            // Extend the hash table
            System.out.println("Hash table full, re-hashing...");
            extendHashTable();
        }
        int key = item.getKey();
        int hashVal = hashFunction1(key);
        int stepSize = hashFunction2(key); // probe step computed by the second hash function
        while (hashArray[hashVal] != null && hashArray[hashVal].getKey() != -1) {
            hashVal += stepSize; // probe forward by the given step
            hashVal %= arraySize;
        }
        hashArray[hashVal] = item;
        itemNum++;
    }

    /**
     * An array has a fixed size and cannot grow, so extending the hash table means
     * creating a larger array and inserting the data from the old array into it.
     * The hash table computes the position of an item from the array size, so items
     * cannot keep the positions they had in the old array. They therefore cannot be
     * copied directly: the old array must be traversed in order and each item inserted
     * into the new array with insert(). This process is called re-hashing.
     * It is time-consuming, but necessary if the array is to be extended.
     */
    public void extendHashTable() {
        int num = arraySize;
        itemNum = 0; // recount, because every item is re-inserted below
        // Roughly double the size, but keep it prime so every step size reaches every cell
        arraySize = nextPrime(arraySize * 2);
        DataItem[] oldHashArray = hashArray;
        hashArray = new DataItem[arraySize];
        for (int i = 0; i < num; i++) {
            DataItem item = oldHashArray[i];
            // Skip empty cells and deleted markers
            if (item != null && item.getKey() != -1) {
                insert(item);
            }
        }
    }

    // Returns the smallest prime greater than or equal to n
    private static int nextPrime(int n) {
        while (!isPrime(n)) {
            n++;
        }
        return n;
    }

    private static boolean isPrime(int n) {
        if (n < 2) {
            return false;
        }
        for (int d = 2; d * d <= n; d++) {
            if (n % d == 0) {
                return false;
            }
        }
        return true;
    }

    // Deletes a data item
    public DataItem delete(int key) {
        if (isEmpty()) {
            System.out.println("Hash Table is empty!");
            return null;
        }
        int hashVal = hashFunction1(key);
        int stepSize = hashFunction2(key);
        // Probe at most arraySize cells, so a table without empty cells cannot loop forever
        for (int probes = 0; probes < arraySize && hashArray[hashVal] != null; probes++) {
            if (hashArray[hashVal].getKey() == key) {
                DataItem temp = hashArray[hashVal];
                hashArray[hashVal] = nonItem; // nonItem marks a deleted cell; its key is -1
                itemNum--;
                return temp;
            }
            hashVal += stepSize;
            hashVal %= arraySize;
        }
        return null;
    }

    // Finds a data item
    public DataItem find(int key) {
        int hashVal = hashFunction1(key);
        int stepSize = hashFunction2(key);
        for (int probes = 0; probes < arraySize && hashArray[hashVal] != null; probes++) {
            if (hashArray[hashVal].getKey() == key) {
                return hashArray[hashVal];
            }
            hashVal += stepSize;
            hashVal %= arraySize;
        }
        return null;
    }

    public static class DataItem {
        private int iData;

        public DataItem(int iData) {
            this.iData = iData;
        }

        public int getKey() {
            return iData;
        }
    }
}

4. Separate chaining (chain address method)

In open addressing, a collision is resolved by probing for an empty cell. The other approach, separate chaining, places a linked list in each cell of the hash table. The key of a data item is mapped to a cell of the hash table as usual, and the item itself is inserted into that cell's list. Other items that map to the same cell are simply added to the list; there is no need to look for an empty cell in the original array.

  

Ordered linked list:

package com.ys.hash;

public class SortLink {
    private LinkNode first;

    public SortLink() {
        first = null;
    }

    public boolean isEmpty() {
        return (first == null);
    }

    // Inserts the node so that keys stay in ascending order
    public void insert(LinkNode node) {
        int key = node.getKey();
        LinkNode previous = null;
        LinkNode current = first;
        while (current != null && current.getKey() < key) {
            previous = current;
            current = current.next;
        }
        node.next = current; // link to the rest of the list, also when inserting at the head
        if (previous == null) {
            first = node;
        } else {
            previous.next = node;
        }
    }

    // Deletes the first node with the given key, if there is one
    public void delete(int key) {
        if (isEmpty()) {
            System.out.println("Linked is EMPTY!!!");
            return;
        }
        LinkNode previous = null;
        LinkNode current = first;
        while (current != null && current.getKey() != key) {
            previous = current;
            current = current.next;
        }
        if (current == null) {
            return; // key not found, nothing to delete
        }
        if (previous == null) {
            first = first.next;
        } else {
            previous.next = current.next;
        }
    }

    // Finds the node with the given key; stops early because the list is sorted
    public LinkNode find(int key) {
        LinkNode current = first;
        while (current != null && current.getKey() <= key) {
            if (current.getKey() == key) {
                return current;
            }
            current = current.next; // advance to the next node
        }
        return null;
    }

    public void displayLink() {
        System.out.println("Link(first->last)");
        LinkNode current = first;
        while (current != null) {
            current.displayLink();
            current = current.next;
        }
        System.out.println("");
    }

    // Static nested class, so nodes can be created without a SortLink instance
    static class LinkNode {
        private int iData;
        public LinkNode next;

        public LinkNode(int iData) {
            this.iData = iData;
        }

        public int getKey() {
            return iData;
        }

        public void displayLink() {
            System.out.println(iData + " ");
        }
    }
}

Separate chaining:

package com.ys.hash;

import com.ys.hash.SortLink.LinkNode;

public class HashChain {
    private SortLink[] hashArray; // the array holds linked lists
    private int arraySize;

    public HashChain(int size) {
        arraySize = size;
        hashArray = new SortLink[arraySize];
        // new only creates empty references, so initialize every cell with an empty list
        for (int i = 0; i < arraySize; i++) {
            hashArray[i] = new SortLink();
        }
    }

    public void displayTable() {
        for (int i = 0; i < arraySize; i++) {
            System.out.print(i + ":");
            hashArray[i].displayLink();
        }
    }

    // Converts a key to an array index (keys are assumed to be non-negative)
    public int hashFunction(int key) {
        return key % arraySize;
    }

    public void insert(LinkNode node) {
        int key = node.getKey();
        int hashVal = hashFunction(key);
        hashArray[hashVal].insert(node); // add directly to the list
    }

    public LinkNode delete(int key) {
        int hashVal = hashFunction(key);
        LinkNode temp = find(key);
        hashArray[hashVal].delete(key); // find the item in the list and delete it
        return temp;
    }

    public LinkNode find(int key) {
        int hashVal = hashFunction(key);
        LinkNode node = hashArray[hashVal].find(key);
        return node;
    }
}

In separate chaining, the load factor (the ratio of the number of data items to the size of the hash table) behaves differently from open addressing. An array of N cells may hold N or more data items, so the load factor is typically about 1 or greater than 1 (some lists will contain two or more items).

Locating the initial cell takes O(1) time, while searching the list takes time proportional to M, the average number of items per list, that is, O(M) time.

5. Buckets

Another method, similar to separate chaining, uses a sub-array in each cell instead of a linked list. Such an array is called a bucket.

This method is clearly not as effective as a linked list, because it is hard to choose the bucket capacity. If the capacity is too small, the bucket may overflow; if it is too large, memory is wasted. A linked list is allocated dynamically and has no such problem, so buckets are generally not used.
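
The overflow problem is easy to see in a minimal sketch of a bucket-based table. BucketTable and its methods are hypothetical names, and the code assumes non-negative int keys and a bucket capacity fixed at construction time.

```java
// Illustrative sketch of a hash table that uses fixed-size buckets.
// Assumptions: non-negative int keys; bucket capacity is fixed up front.
class BucketTable {
    private final int[][] buckets; // one fixed-size sub-array per cell
    private final int[] counts;    // number of keys currently stored in each bucket

    BucketTable(int arraySize, int bucketCapacity) {
        buckets = new int[arraySize][bucketCapacity];
        counts = new int[arraySize];
    }

    // Returns false when the target bucket is already full (overflow).
    boolean insert(int key) {
        int cell = key % buckets.length;
        if (counts[cell] == buckets[cell].length) {
            return false;
        }
        buckets[cell][counts[cell]] = key;
        counts[cell]++;
        return true;
    }

    boolean contains(int key) {
        int cell = key % buckets.length;
        for (int i = 0; i < counts[cell]; i++) {
            if (buckets[cell][i] == key) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        BucketTable table = new BucketTable(10, 2);
        System.out.println(table.insert(7));   // true
        System.out.println(table.insert(17));  // true, same bucket as 7
        System.out.println(table.insert(27));  // false, the bucket for index 7 is full
    }
}
```

With a capacity of 2, the third key that hashes to index 7 has nowhere to go, while the other nine buckets sit empty.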

6. Summary

A hash table is built on an array and stores data in key-value form: a hash function maps each key to an array index. If a key hashes to a cell that is already occupied, a collision occurs. There are two ways to resolve collisions: open addressing and separate chaining. In open addressing, the colliding item is placed elsewhere in the array. In separate chaining, each cell holds a linked list, and all items that map to the same array index are inserted into that list.
