A hash table is a data structure that provides fast insertion, deletion, and lookup. Regardless of how much data the table holds, these operations take close to constant time, that is, O(1). This is much faster than a tree, whose operations typically take O(log N) time.
Disadvantages: a hash table is array-based, and arrays are difficult to resize once created. When a hash table becomes nearly full, performance degrades severely. In addition, there is no way to traverse the data items in any particular order (for example, from largest to smallest).
Suppose you need to use a word as the key (array subscript) to retrieve a value (data). You can break the word into its letters, convert each letter to a numeric code (a = 1, b = 2, c = 3, ... z = 26, space = 27), multiply each code by the corresponding power of 27 (27 because there are 27 possible characters, including the space), and add the results. Each word then corresponds to a unique number.
For example, cats: 3*27^3 + 1*27^2 + 20*27^1 + 19*27^0 = 60337
This scheme makes the array far too long, with only a few subscripts actually holding data.
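As a quick check of the encoding scheme above, here is a minimal sketch in Java (the class and method names are invented for illustration):

```java
public class WordCode {
    // Encode a word as a base-27 number (a=1 ... z=26, space=27),
    // following the letter-code scheme described in the text.
    static long encode(String word) {
        long result = 0;
        for (char c : word.toCharArray()) {
            int digit = (c == ' ') ? 27 : (c - 'a' + 1);
            result = result * 27 + digit; // shift left one base-27 place, add digit
        }
        return result;
    }

    public static void main(String[] args) {
        // 3*27^3 + 1*27^2 + 20*27^1 + 19*27^0
        System.out.println(encode("cats"));
    }
}
```

Running it reproduces the number computed by hand for cats, 60337.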
Hashing
arrayIndex = hugeNumber % arraySize; a hash function converts a number from a large range into a number in a small range.
Use the remainder operator (%) to map the large integer range onto the range of array subscripts, with the array sized at twice the number of items to be stored. The following is an example of such a hash function:
arraySize = wordNumber * 2;
arrayIndex = hugeNumber % arraySize;
The resulting array is expected to have the following characteristic: on average, one value for every two array cells. Some cells hold no value, while others may be the target of multiple values.
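A tiny sketch of this modulo mapping, using the number computed for cats earlier and a hypothetical store of 10,000 words (the figures are illustrative, not from the original):

```java
public class ModHash {
    public static void main(String[] args) {
        int wordNumber = 10_000;          // hypothetical number of words to store
        int arraySize = wordNumber * 2;   // array sized at twice the data, per the text
        long hugeNumber = 60337;          // the base-27 code for "cats"
        int arrayIndex = (int) (hugeNumber % arraySize);
        System.out.println(arrayIndex);
    }
}
```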
Collisions
Compressing a large numeric range into a small one has a price: you cannot guarantee that every word maps to an empty cell in the array. Suppose the word zoo needs to be inserted, and after hashing the word its subscript is obtained, but that cell already holds a different word. This situation is called a "collision".
Solution 1 - Open Addressing
As mentioned above, the array is sized at twice the amount of data to be stored, so half of its cells are empty. When a collision occurs, a systematic method is used to find an empty cell in the array and the word is placed there, rather than at the subscript produced by the hash function. This approach is called "open addressing".
Solution 2 - Separate Chaining
Create an array whose cells each hold a linked list of words, rather than storing words in the array directly. When a collision occurs, the new data item is simply appended to the linked list at that array subscript. This approach is called "separate chaining".
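A minimal separate-chaining sketch (this is not the article's implementation; the class and method names are invented):

```java
import java.util.LinkedList;

public class ChainHash {
    // Each array cell holds a linked list of all keys that hash there.
    private final LinkedList<Integer>[] table;

    @SuppressWarnings("unchecked")
    ChainHash(int size) {
        table = new LinkedList[size];
        for (int i = 0; i < size; i++) {
            table[i] = new LinkedList<>();
        }
    }

    void insert(int key) {
        table[key % table.length].add(key); // collisions just extend the list
    }

    boolean find(int key) {
        return table[key % table.length].contains(key);
    }

    public static void main(String[] args) {
        ChainHash h = new ChainHash(10);
        h.insert(23);
        h.insert(33);                   // collides with 23 (both hash to 3), chains instead
        System.out.println(h.find(33));
        System.out.println(h.find(7));
    }
}
```

Note that unlike open addressing, the table never "fills up"; the lists simply grow, and performance degrades gradually as they lengthen.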
Open Addressing
There are three ways to find another position in the array: linear probing, quadratic probing, and double hashing.
1) Linear Probing
Linear probing searches for an empty cell sequentially. If index 21 is where the data should go but that cell is occupied, then 22, 23, and so on are tried, incrementing the subscript until an empty cell is found.
Insertion
A hash table performs best when the data items fill about half, or at most two-thirds, of the table. Even then, the filled cells are unevenly distributed: sometimes there is a run of empty cells, sometimes a run of filled ones.
In a hash table, a run of consecutively filled cells is called a cluster. As more and more data items are added, clusters grow longer and longer; this is called "clustering".
Deletion
In a hash table, the search algorithm hashes the key and then probes cell by cell along the array. If it encounters an empty cell before finding the key, the search fails.
Deletion cannot simply set a cell's data item to null: that would leave an empty cell in the middle of a cluster, and the search algorithm would give up partway through. Instead, a data item with a special key replaces the deleted item, marking the cell as deleted rather than empty.
public class DataItem {
    private int i;

    public DataItem(int i) {
        this.i = i;
    }

    public int getKey() {
        return i;
    }

    public void printf() {
        System.out.println("data -> " + i);
    }
}
public class HashTable {
    private DataItem[] itemArray;
    private int arraySize;
    private DataItem nonItem; // placeholder for deleted items

    public HashTable(int size) {
        this.arraySize = size;
        itemArray = new DataItem[arraySize];
        nonItem = new DataItem(-1);
    }

    public void display() {
        for (DataItem data : itemArray) {
            if (data != null) {
                data.printf();
            }
        }
    }

    public int hashFuc(int key) {
        return key % arraySize;
    }

    public void insert(DataItem item) {
        int key = item.getKey();
        int hashVal = hashFuc(key);
        DataItem tItem;
        while ((tItem = itemArray[hashVal]) != null && tItem.getKey() != -1) {
            if (tItem.getKey() == key) {
                itemArray[hashVal] = item; // same key: replace in place
                return;
            }
            hashVal++;            // go to next cell
            hashVal %= arraySize; // wrap around if necessary
        }
        itemArray[hashVal] = item;
    }

    public DataItem delete(int key) {
        int hashVal = hashFuc(key);
        DataItem item;
        while ((item = itemArray[hashVal]) != null) { // until empty cell
            if (item.getKey() == key) {
                itemArray[hashVal] = nonItem; // mark as deleted
                return item;
            }
            hashVal++;
            hashVal %= arraySize;
        }
        return null;
    }

    public DataItem find(int key) {
        int hashVal = hashFuc(key);
        DataItem item;
        while ((item = itemArray[hashVal]) != null) { // until empty cell
            if (item.getKey() == key) {
                return item;
            }
            hashVal++;
            hashVal %= arraySize;
        }
        return null;
    }
}
public static void main(String[] args) {
    HashTable t = new HashTable(10);
    t.insert(new DataItem(39));
    t.insert(new DataItem(51));
    t.insert(new DataItem(23));
    t.insert(new DataItem(25));
    t.insert(new DataItem(23));
    t.insert(new DataItem(10));
    t.insert(new DataItem(9));
    t.delete(25);
    t.insert(new DataItem(79));
    t.insert(new DataItem(81));
    t.display();
}
Print result:
data -> 10
data -> 51
data -> 9
data -> 23
data -> 79
data -> 81
data -> 39
Expanding the Array
When the hash table becomes too full, the array must be expanded. The only way is to create a new, larger array and insert every data item from the old array into it. Because the hash function computes a data item's position from the array size, items cannot simply be copied across; instead, the old array is traversed in order and insert() is called to place each data item into the new array. This is called "rehashing".
The expanded capacity is usually about twice the original. In fact, the array size should be a prime number, so the new array ends up a little more than twice as large.
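The rehashing step can be sketched as follows. This is a simplified standalone version using bare int keys, with -1 marking an empty cell; the names and sample numbers are invented for illustration:

```java
import java.util.Arrays;

public class Rehash {
    // Move every key from the old table into a larger table,
    // recomputing each position because it depends on the NEW size.
    static int[] rehash(int[] oldTable, int newSize) {
        int[] newTable = new int[newSize];
        Arrays.fill(newTable, -1);                 // -1 marks an empty cell
        for (int key : oldTable) {
            if (key == -1) continue;               // skip empty cells
            int i = key % newSize;                 // hash against the new size
            while (newTable[i] != -1) {
                i = (i + 1) % newSize;             // linear probing on collision
            }
            newTable[i] = key;
        }
        return newTable;
    }

    public static void main(String[] args) {
        int[] old = {10, 51, -1, 23, -1};          // keys placed for capacity 5
        int[] grown = rehash(old, 11);             // ~2x, rounded up to a prime
        System.out.println(Arrays.toString(grown));
    }
}
```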
A good hash function must distribute the original data evenly across the hash array. For example, if most of the data is even and the array capacity is also even, the data will not be evenly distributed after hashing:
Take 2, 4, 6, 8, 10, 12, and so on. Modulo 6 they yield 2, 4, 0, 2, 4, 0, giving only three distinct hash values and many collisions. Modulo 7 they yield 2, 4, 6, 1, 3, 5, with no collisions at all.
Similarly, if the data consists of multiples of 3 and the array capacity is also a multiple of 3, collisions occur after hashing. Using a prime capacity reduces the probability of collisions and spreads the data more evenly.
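The arithmetic above can be verified directly (a throwaway demo, not part of the hash table code):

```java
public class PrimeDemo {
    public static void main(String[] args) {
        int[] keys = {2, 4, 6, 8, 10, 12};
        // Even capacity 6: even keys collapse onto 2, 4, 0 repeatedly.
        for (int k : keys) System.out.print(k % 6 + " ");
        System.out.println();
        // Prime capacity 7: the same keys spread over six distinct cells.
        for (int k : keys) System.out.print(k % 7 + " ");
        System.out.println();
    }
}
```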
The following code finds the next prime number at or above a given minimum:
private int getPrime(int min) {
    for (int j = min; ; j++) {
        if (isPrime(j)) {
            return j;
        }
    }
}

private boolean isPrime(int num) {
    if (num < 2) {
        return false; // 0 and 1 are not prime
    }
    for (int j = 2; j * j <= num; j++) {
        if (num % j == 0) {
            return false;
        }
    }
    return true;
}
2) Quadratic Probing
Linear probing suffers from clustering: once a cluster forms it tends to grow, data items that hash into the cluster's range must step through it cell by cell, and performance worsens.
Load factor: the ratio of data items in the hash table to the table length. loadFactor = nItems / arraySize;
Quadratic probing aims to prevent clustering. The idea is to probe cells farther away rather than adjacent ones.
The step is the square of the probe number. If the original subscript in the hash table is x, linear probing tries x + 1, x + 2, x + 3, ..., while quadratic probing tries x + 1^2, x + 2^2, x + 3^2, ... (that is, x + 1, x + 4, x + 9, ...).
Quadratic probing eliminates the clustering seen in linear probing, known as "primary clustering". However, it produces a subtler clustering problem of its own: all keys that hash to the same position follow the same probe sequence when looking for an empty cell (the step offsets are always 1, 4, 9, 16, 25, 36, ...).
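The quadratic probe sequence can be sketched with a minimal insert loop (invented for illustration; the keys are chosen so that all three collide at slot 0):

```java
import java.util.Arrays;

public class QuadProbe {
    public static void main(String[] args) {
        int size = 11;
        int[] table = new int[size];
        Arrays.fill(table, -1);               // -1 marks an empty cell
        int[] keys = {22, 33, 44};            // all hash to 0 when size is 11
        for (int key : keys) {
            int home = key % size;            // home slot x
            int i = home;
            int step = 1;
            while (table[i] != -1) {
                i = (home + step * step) % size; // probe x+1^2, x+2^2, x+3^2, ...
                step++;
            }
            table[i] = key;
        }
        // 22 lands at 0, 33 at 0+1=1, 44 at 0+4=4
        System.out.println(Arrays.toString(table));
    }
}
```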
3) Double Hashing
To eliminate both primary and secondary clustering, another method can be used: double hashing. What is needed is a probe sequence that depends on the key itself, so that different keys mapped to the same array subscript still use different probe sequences.
The key is hashed a second time, with a different hash function, and the result is used as the step size. For a given key the step size stays constant throughout the probe, but different keys use different step sizes.
Experience shows that the second hash function must satisfy the following conditions:
It must differ from the first hash function.
It must never produce 0 (otherwise there is no step: every probe lands on the same cell and the loop never terminates).
stepSize = constant - (key % constant), where constant is a prime smaller than the array capacity. For example: stepSize = 5 - key % 5;
public class HashTable2 {
    private DataItem[] itemArray;
    private int arraySize;
    private DataItem nonItem; // placeholder for deleted items

    public HashTable2(int size) {
        this.arraySize = size;
        itemArray = new DataItem[arraySize];
        nonItem = new DataItem(-1);
    }

    public void display() {
        for (DataItem data : itemArray) {
            if (data != null) {
                data.printf();
            }
        }
    }

    public int hashFuc1(int key) {
        return key % arraySize;
    }

    public int hashFuc2(int key) {
        // Non-zero, less than the array size, and different from hashFuc1.
        // The array size must be relatively prime to 5, 4, 3, and 2.
        return 5 - key % 5;
    }

    public void insert(DataItem item) {
        int key = item.getKey();
        int hashVal = hashFuc1(key);
        int stepSize = hashFuc2(key);
        DataItem tItem;
        while ((tItem = itemArray[hashVal]) != null && tItem.getKey() != -1) {
            if (tItem.getKey() == key) {
                itemArray[hashVal] = item; // same key: replace in place
                return;
            }
            hashVal += stepSize;  // add the key-dependent step
            hashVal %= arraySize; // wrap around if necessary
        }
        itemArray[hashVal] = item;
    }

    public DataItem delete(int key) {
        int hashVal = hashFuc1(key);
        int stepSize = hashFuc2(key);
        DataItem item;
        while ((item = itemArray[hashVal]) != null) { // until empty cell
            if (item.getKey() == key) {
                itemArray[hashVal] = nonItem; // mark as deleted
                return item;
            }
            hashVal += stepSize;
            hashVal %= arraySize;
        }
        return null;
    }

    public DataItem find(int key) {
        int hashVal = hashFuc1(key);
        int stepSize = hashFuc2(key);
        DataItem item;
        while ((item = itemArray[hashVal]) != null) { // until empty cell
            if (item.getKey() == key) {
                return item;
            }
            hashVal += stepSize;
            hashVal %= arraySize;
        }
        return null;
    }
}
The table capacity must be a prime number.
Double hashing requires the table capacity to be prime. Why? Suppose the capacity is not prime: say the table length is 15 (subscripts 0-14), a particular key maps to subscript 0, and its step size is 5. The probe sequence is then 0, 5, 10, 0, 5, 10, ... The algorithm keeps trying only these three cells forever; it can never reach any other empty cell, and the loop never terminates.
If the capacity is 13, a prime, the probe sequence visits every cell: 0, 5, 10, 2, 7, 12, 4, 9, 1, 6, 11, 3, 8. As long as there is a single empty cell in the table, it will be found. Because a prime capacity shares no divisor with any smaller step size, the probe sequence never falls into a short cycle and eventually checks all cells.
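The two probe sequences above can be printed directly (a throwaway demo using the text's numbers; the class and method names are invented):

```java
public class ProbeSequence {
    // Print the first n cells visited from a start slot with a fixed step.
    static void probe(int capacity, int start, int step, int n) {
        StringBuilder sb = new StringBuilder();
        int i = start;
        for (int k = 0; k < n; k++) {
            sb.append(i).append(' ');
            i = (i + step) % capacity;
        }
        System.out.println(sb.toString().trim());
    }

    public static void main(String[] args) {
        probe(15, 0, 5, 6);   // non-prime capacity: stuck cycling 3 cells
        probe(13, 0, 5, 13);  // prime capacity: visits all 13 cells
    }
}
```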