Java Collections: Hashing and Hash Conflicts

Source: Internet
Author: User

First, Hash

A hash table (also called a hash map) is a data structure that finds a record's storage location directly from its key. It accesses records by computing a function of the key, which maps the data being looked up to a position in the table and so speeds up the lookup. This mapping function is called a hash function, and the array that holds the records is called the hash table.

    • The two keys to implementing a hash table: designing the hash function and resolving hash conflicts
1. Hash function

Java objects already have a hashCode() method, so why is a separate hash function needed? hashCode() returns a signed 32-bit int, which is a very large range. If the bucket array could cover every int value, no further hashing would be needed, but memory does not allow an array that large. The hash function therefore maps the original hash code onto a much smaller array. The idea is to convert very long or variable-length integer data into a fixed-length hash value (ideally, different objects should not share a hash value). The common approach is the modulo method, which is also what the JDK uses.

    • HashMap's hash implementation:
    static final int hash(Object key) {
        int h;
        return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
    }

    static int indexFor(int h, int length) {
        return h & (length - 1);
    }

The first function, hash, is known as the "perturbation function". The second, indexFor, was removed in JDK 8 and its body was merged into putVal. I regard the two together as one complete hash function.
h & (length - 1) is in effect a modulo operation. Suppose the array's initial length is 16: length - 1 is 15, or 00001111 in binary. A key whose hash is 20, or 00010100 in binary, ANDed with 00001111 gives 00000100, which is 4 in decimal.
For this to work, the array length must be a power of 2. The masking also means that when length = 16 the high bits of the hash are discarded and only the low 4 bits remain. An int has 32 bits, so would conflicts not be far too likely if only the low 4 bits were used?
This is where the perturbation function comes in: it XORs the high 16 bits of the hash code with its low 16 bits, so that the high bits also influence the index.
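The effect of the perturbation step can be checked with a small sketch. The class and method names below (HashSpreadDemo, spread, indexFor) are illustrative and not part of the JDK, and the two sample hash codes were picked by hand so that they differ only in their high 16 bits.

```java
// Illustrative sketch only; HashSpreadDemo, spread and indexFor are hypothetical names.
class HashSpreadDemo {
    // Perturbation step: XOR the high 16 bits into the low 16 bits.
    static int spread(int h) {
        return h ^ (h >>> 16);
    }

    // Bucket index for a table whose length is a power of 2.
    static int indexFor(int h, int length) {
        return h & (length - 1);
    }

    public static void main(String[] args) {
        int length = 16;

        // The example from the text: 20 & 15 is 4, the same as 20 % 16.
        System.out.println(indexFor(20, length));                    // prints 4

        // Two hand-picked hash codes that differ only in their high 16 bits.
        int a = 0x00010004;
        int b = 0x00020004;

        // Without perturbation both land in bucket 4.
        System.out.println(indexFor(a, length) + " " + indexFor(b, length));   // prints 4 4

        // With perturbation the high bits take part and the buckets differ.
        System.out.println(indexFor(spread(a), length) + " "
                + indexFor(spread(b), length));                       // prints 5 6
    }
}
```

The masking trick only equals a modulo when length is a power of 2, which is why the sketch fixes length at 16.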

Even such an effective hash function only reduces the probability of conflicts between different objects; it cannot guarantee that none occur. To keep the benefits of a hash table, the conflicts themselves must be resolved.

Second, resolving hash conflicts

1. Open addressing (linear probing, quadratic probing, pseudo-random probing)

Open addressing resolves a conflict by probing: when a conflict occurs, a probe sequence is formed in the hash table and the cells are examined one by one along that sequence until either the given key is found or an open address (an empty cell) is reached. For an insert, reaching an open address means the new node can be stored in that cell. For a lookup, reaching an open address means the key is not in the table, so the lookup fails.

As the table fills up, entries cluster together, probe sequences become very long, and later insertions can be very time-consuming. Performance typically degrades severely once the table is more than two-thirds full, so a key design point is to keep the table no more than half full, and two-thirds full at most.

    • When a hash table is built with open addressing, every cell in the table (more strictly, the key stored in each cell) must be empty before the table is built.
    • How an empty cell is represented depends on the specific application.

Depending on how the probe sequence is formed, open addressing can be divided into linear probing, linear compensation probing, random probing, and so on.

(1) Linear probing

    • The basic idea is to treat the hash table T[0..m-1] as a circular array. If the initial probe address is d (that is, H(key) = d), the longest probe sequence is:

d, d+1, d+2, ..., m-1, 0, 1, ..., d-1
That is, probing starts at address d: first T[d], then T[d+1], ..., up to T[m-1], then wraps around to T[0], T[1], ..., until T[d-1] has been probed.

    • Probing terminates in one of three cases:

① If the cell currently probed is empty, the lookup fails (for an insert, the key is written into that cell);
② If the cell currently probed contains the key, the lookup succeeds, whereas an insert fails;
③ If no empty cell has been found by the time T[d-1] is probed, both lookup and insert fail (the table is full and must be enlarged).

    • In the general form of open addressing, the probe sequence of linear probing is:

Hi = (H(key) + i) % m, 0 ≤ i ≤ m-1    // the increment at step i is i
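The probe sequence and the three termination cases can be put together in a minimal sketch. It assumes int keys and H(key) = key mod m; the class name LinearProbingTable and its methods are hypothetical, and deletion and resizing are deliberately left out.

```java
// Minimal sketch; LinearProbingTable and its method names are hypothetical.
class LinearProbingTable {
    private final Integer[] slots;   // null marks an empty (open) cell

    LinearProbingTable(int m) {
        slots = new Integer[m];
    }

    // H(key): assumed to be a plain modulo; floorMod keeps the result non-negative.
    private int home(int key) {
        return Math.floorMod(key, slots.length);
    }

    // Returns true if the key was stored, false if it was already present or the table is full.
    boolean insert(int key) {
        int d = home(key);
        for (int i = 0; i < slots.length; i++) {
            int pos = (d + i) % slots.length;   // Hi = (H(key) + i) % m
            if (slots[pos] == null) {           // case 1: open cell, write the key here
                slots[pos] = key;
                return true;
            }
            if (slots[pos] == key) {            // case 2: key already present, insert fails
                return false;
            }
        }
        return false;                           // case 3: wrapped around, table is full
    }

    // Returns the cell index holding the key, or -1 if the key is absent.
    int find(int key) {
        int d = home(key);
        for (int i = 0; i < slots.length; i++) {
            int pos = (d + i) % slots.length;
            if (slots[pos] == null) {           // case 1: open cell, lookup fails
                return -1;
            }
            if (slots[pos] == key) {            // case 2: lookup succeeds
                return pos;
            }
        }
        return -1;                              // case 3: every cell probed
    }

    public static void main(String[] args) {
        LinearProbingTable t = new LinearProbingTable(7);
        t.insert(3);
        t.insert(10);   // home address 3 is taken, lands in 4
        t.insert(17);   // home address 3 is taken, lands in 5
        System.out.println(t.find(17));   // prints 5
        System.out.println(t.find(24));   // prints -1
    }
}
```

Keys 3, 10 and 17 all have home address 3 in a table of size 7, so they end up in the adjacent cells 3, 4 and 5.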

    • Linear probing is conceptually clear and simple to implement, but it has the following disadvantages:

① The table's capacity cannot be fully used, and expansion is expensive: entries previously marked as deleted have to be purged and the position of every element recomputed. With frequent deletions and insertions, efficiency becomes very low.
② Deletion is awkward. When a record is deleted from the hash table HT, its cell ought to be set to empty, but that cannot be done; the cell can only be marked as deleted, otherwise later lookups would be affected.
③ Linear probing is prone to clustering, in which records stored in the table become linked into contiguous runs. The longer a contiguous run of occupied addresses is (that is, the more hash addresses of different keys sit next to each other), the more likely a new record is to conflict with it when it joins the table. A long run therefore grows faster than a short one, so once a cluster appears (with its conflicts), it causes further clustering.
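The deletion mark described in point ② can be sketched as follows. This is one possible layout, assuming int keys, H(key) = key mod m, and a separate state array; the class name TombstoneTable and the three state constants are hypothetical.

```java
// Minimal sketch; TombstoneTable and the state constants are hypothetical names.
class TombstoneTable {
    private static final byte EMPTY = 0, USED = 1, DELETED = 2;

    private final int[] keys;
    private final byte[] state;   // all cells start as EMPTY

    TombstoneTable(int m) {
        keys = new int[m];
        state = new byte[m];
    }

    private int home(int key) {
        return Math.floorMod(key, keys.length);
    }

    // Returns the cell index holding the key, or -1 if the key is absent.
    int find(int key) {
        int d = home(key);
        for (int i = 0; i < keys.length; i++) {
            int pos = (d + i) % keys.length;
            if (state[pos] == EMPTY) {
                return -1;                      // a truly empty cell ends the search
            }
            if (state[pos] == USED && keys[pos] == key) {
                return pos;
            }
            // DELETED cells are skipped, so keys stored after them stay reachable.
        }
        return -1;
    }

    boolean insert(int key) {
        if (find(key) >= 0) {
            return false;                       // already present
        }
        int d = home(key);
        for (int i = 0; i < keys.length; i++) {
            int pos = (d + i) % keys.length;
            if (state[pos] != USED) {           // EMPTY or DELETED cells can be reused
                keys[pos] = key;
                state[pos] = USED;
                return true;
            }
        }
        return false;                           // table is full
    }

    // Marks the cell as deleted instead of emptying it.
    boolean remove(int key) {
        int pos = find(key);
        if (pos < 0) {
            return false;
        }
        state[pos] = DELETED;
        return true;
    }

    public static void main(String[] args) {
        TombstoneTable t = new TombstoneTable(7);
        t.insert(3);
        t.insert(10);
        t.insert(17);
        t.remove(10);                       // cell 4 becomes a deletion mark
        System.out.println(t.find(17));     // prints 5: the search skips the mark
        t.insert(24);                       // reuses the marked cell 4
        System.out.println(t.find(24));     // prints 4
    }
}
```

If remove set the cell back to EMPTY instead, find(17) would stop at cell 4 and wrongly report the key as absent.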

(2) Linear compensation probing

The basic idea of linear compensation probing is to change the step of linear probing from 1 to Q, replacing Hi = (H(key) + i) % m in the algorithm above with Hi = (H(key) + Q) % m. Here Q grows according to a fixed rule (1, 4, 9, ...), so the data is spread out enough that clustering is less likely. Whether every cell in the table can be reached depends on how Q and m are chosen; a constant Q that is coprime with m, for example, visits every cell. (Other probing rules exist; only this one is discussed here.)
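To see what such probe sequences look like, the sketch below prints two variants for a home address d: a constant step Q, and offsets that follow the squares 0, 1, 4, 9, .... Reading the growing Q as square offsets is an interpretation of the text, and the class and method names (ProbeSequences, fixedStep, squares) are hypothetical.

```java
import java.util.Arrays;

// Illustrative sketch; ProbeSequences, fixedStep and squares are hypothetical names.
class ProbeSequences {
    // Constant step Q: Hi = (d + i * Q) % m for i = 0 .. m-1.
    static int[] fixedStep(int d, int q, int m) {
        int[] seq = new int[m];
        for (int i = 0; i < m; i++) {
            seq[i] = (d + i * q) % m;
        }
        return seq;
    }

    // Square offsets 0, 1, 4, 9, ...: Hi = (d + i * i) % m for i = 0 .. count-1.
    static int[] squares(int d, int m, int count) {
        int[] seq = new int[count];
        for (int i = 0; i < count; i++) {
            seq[i] = (d + i * i) % m;
        }
        return seq;
    }

    public static void main(String[] args) {
        // Q = 3 and m = 8 are coprime, so all 8 cells are visited exactly once.
        System.out.println(Arrays.toString(fixedStep(2, 3, 8)));
        // prints [2, 5, 0, 3, 6, 1, 4, 7]

        // Square offsets scatter the probes instead of walking adjacent cells.
        System.out.println(Arrays.toString(squares(3, 11, 6)));
        // prints [3, 4, 7, 1, 8, 6]
    }
}
```

With square offsets the sequence is not guaranteed to reach every cell, so a real table would still need a load-factor limit or a fallback.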

(3) Random probing

The basic idea of random probing is to change the step of linear probing from a constant to a random number, that is, Hi = (H(key) + RN) % m, where RN is a random number. In a real program, a random number generator should produce a random sequence that is used as the successive probe steps. Different keys then have different probe sequences, which avoids or reduces clustering. For the same reason as in linear probing, in both linear compensation probing and random probing a deleted record must also be marked as deleted rather than emptied.
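One practical detail is that a lookup must be able to replay the sequence used at insertion time, so the generator has to be seeded deterministically. The sketch below seeds java.util.Random with the key itself, which is an assumption made here for illustration rather than something the text prescribes; RandomProbe and sequence are hypothetical names.

```java
import java.util.Arrays;
import java.util.Random;

// Illustrative sketch; RandomProbe and sequence are hypothetical names.
class RandomProbe {
    // Returns the first `count` probe positions for `key` in a table of size m.
    // Assumes m >= 2 and count >= 1.
    static int[] sequence(int key, int m, int count) {
        int d = Math.floorMod(key, m);        // H(key), assumed to be a plain modulo
        Random rnd = new Random(key);         // same key, same seed, same sequence
        int[] seq = new int[count];
        seq[0] = d;                           // the home address is probed first
        for (int i = 1; i < count; i++) {
            int rn = 1 + rnd.nextInt(m - 1);  // pseudo-random offset in [1, m-1]
            seq[i] = (d + rn) % m;            // Hi = (H(key) + RN) % m
        }
        return seq;
    }

    public static void main(String[] args) {
        // The same key always yields the same probe sequence.
        System.out.println(Arrays.toString(sequence(42, 11, 5)));
        System.out.println(Arrays.toString(sequence(42, 11, 5)));
        // A different key with the same home address usually probes different cells.
        System.out.println(Arrays.toString(sequence(53, 11, 5)));
    }
}
```

Positions may repeat within one sequence, so this sketch does not guarantee that every cell is eventually probed.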

2. Chain address method (zipper method)

The zipper method resolves conflicts by linking all keys that are synonyms (keys with the same hash address) into one singly linked list. If the chosen table length is m, the hash table can be defined as an array of m head pointers, T[0..m-1]. Every node whose hash address is i is inserted into the singly linked list whose head pointer is T[i]. Each element of T should initially be a null pointer. With the zipper method the load factor α can be greater than 1, but in general α ≤ 1 is used.
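A minimal sketch of this structure, assuming int keys and H(key) = key mod m, might look like the following. The class name ChainedTable and its methods are hypothetical; this is not how java.util.HashMap is written, only an illustration of the head-pointer array.

```java
// Minimal sketch; ChainedTable and its method names are hypothetical.
class ChainedTable {
    private static final class Node {
        final int key;
        Node next;

        Node(int key, Node next) {
            this.key = key;
            this.next = next;
        }
    }

    private final Node[] heads;   // T[0..m-1], every head pointer starts as null

    ChainedTable(int m) {
        heads = new Node[m];
    }

    private int index(int key) {
        return Math.floorMod(key, heads.length);
    }

    boolean contains(int key) {
        for (Node n = heads[index(key)]; n != null; n = n.next) {
            if (n.key == key) {
                return true;
            }
        }
        return false;
    }

    // Inserts at the head of the list for the key's hash address.
    boolean insert(int key) {
        if (contains(key)) {
            return false;
        }
        int i = index(key);
        heads[i] = new Node(key, heads[i]);
        return true;
    }

    // Deletion simply unlinks the node; no deletion mark is needed.
    boolean remove(int key) {
        int i = index(key);
        Node prev = null;
        for (Node n = heads[i]; n != null; prev = n, n = n.next) {
            if (n.key == key) {
                if (prev == null) {
                    heads[i] = n.next;
                } else {
                    prev.next = n.next;
                }
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        ChainedTable t = new ChainedTable(4);
        // 1, 5 and 9 are synonyms (hash address 1); 6 keys in 4 cells gives a load factor of 1.5.
        for (int k : new int[] {1, 5, 9, 2, 3, 4}) {
            t.insert(k);
        }
        t.remove(5);
        System.out.println(t.contains(5));   // prints false
        System.out.println(t.contains(9));   // prints true
    }
}
```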

    • Advantages of the zipper method over open addressing

① Conflict handling is simple and there is no clustering: keys that are not synonyms never conflict, so the average search length is short;
② Node space on each linked list is allocated dynamically, so the method suits cases where the table length cannot be determined before the table is built;
③ To reduce conflicts, open addressing requires a small load factor α, which wastes a lot of space when nodes are large. The zipper method can take α ≥ 1, and when nodes are large the extra pointer field is negligible, so space is saved;
④ Deleting a node is easy in a table built with the zipper method: simply remove the node from its linked list. In a table built with open addressing, a deleted node's cell cannot simply be emptied, because that would cut off the lookup path of synonym nodes inserted after it. This is because in every open addressing scheme an empty cell (an open address) is the condition for a failed lookup. Deletion in such a table can therefore only mark the node as deleted instead of actually removing it.

    • Disadvantages of the zipper method

When a hash conflict occurs, extra space is needed for the list nodes, and maintaining the conflict lists takes some time and effort. On expansion, every element must be rehashed and assigned a new address, which makes the algorithm more complex and cumbersome.

3. Rehashing (for reference)

Rehashing means preparing more than one way to compute the hash: if the first result collides, another algorithm is used, and so on, until one of them yields a position that does not conflict in the current table.
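As a rough sketch of this idea, the table below tries a short list of hash functions in order. The three functions are arbitrary choices made for illustration, the names RehashTable, insert and find are hypothetical, and deletion is not supported.

```java
import java.util.function.IntUnaryOperator;

// Minimal sketch; RehashTable and the three hash functions are hypothetical choices.
class RehashTable {
    private final Integer[] slots;            // null marks an empty cell
    private final IntUnaryOperator[] hashes;  // tried in order until one does not collide

    RehashTable(int m) {
        slots = new Integer[m];
        hashes = new IntUnaryOperator[] {
            k -> Math.floorMod(k, m),
            k -> Math.floorMod(k / m, m),
            k -> Math.floorMod(k * 31 + 7, m)
        };
    }

    // Returns true if stored, false if the key is present or every function collided.
    boolean insert(int key) {
        for (IntUnaryOperator h : hashes) {
            int pos = h.applyAsInt(key);
            if (slots[pos] == null) {
                slots[pos] = key;
                return true;
            }
            if (slots[pos] == key) {
                return false;
            }
        }
        return false;
    }

    // Replays the same functions in the same order; returns the cell index or -1.
    int find(int key) {
        for (IntUnaryOperator h : hashes) {
            int pos = h.applyAsInt(key);
            if (slots[pos] == null) {
                return -1;
            }
            if (slots[pos] == key) {
                return pos;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        RehashTable t = new RehashTable(11);
        t.insert(5);    // first function gives cell 5
        t.insert(16);   // first function collides at 5, second gives cell 1
        t.insert(27);   // first function collides at 5, second gives cell 2
        System.out.println(t.find(16));   // prints 1
        System.out.println(t.find(27));   // prints 2
        System.out.println(t.find(49));   // prints -1
    }
}
```

A fixed list of functions can run out, so a fuller design would need a fallback such as probing or resizing.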

4. Public overflow area (for reference)

A public overflow area stores conflicting entries in a separate place, outside the main table. The concrete implementation is not discussed here (it is not commonly used).
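For orientation only, one simple way to picture the idea is a base array plus a single shared list for everything that collides. The sketch assumes int keys and H(key) = key mod m; OverflowTable and its methods are hypothetical names, and a linear scan of the overflow list is used purely for brevity.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch; OverflowTable and its method names are hypothetical.
class OverflowTable {
    private final Integer[] base;                            // main table, null marks an empty cell
    private final List<Integer> overflow = new ArrayList<>(); // shared area for all conflicts

    OverflowTable(int m) {
        base = new Integer[m];
    }

    private int index(int key) {
        return Math.floorMod(key, base.length);
    }

    boolean contains(int key) {
        int pos = index(key);
        if (base[pos] == null) {
            return false;   // nothing ever hashed here, so nothing overflowed from here
        }
        return base[pos] == key || overflow.contains(key);
    }

    boolean insert(int key) {
        if (contains(key)) {
            return false;
        }
        int pos = index(key);
        if (base[pos] == null) {
            base[pos] = key;        // no conflict: store in the main table
        } else {
            overflow.add(key);      // conflict: store in the overflow area
        }
        return true;
    }

    int overflowSize() {
        return overflow.size();
    }

    public static void main(String[] args) {
        OverflowTable t = new OverflowTable(7);
        t.insert(3);
        t.insert(10);   // conflicts with 3, goes to the overflow area
        t.insert(4);
        System.out.println(t.contains(10));     // prints true
        System.out.println(t.overflowSize());   // prints 1
    }
}
```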
