An In-Depth Look at the hashCode Method

Source: Internet
Author: User

Why is hashCode so important to objects?

An object's hashCode method implements a simple hash algorithm. It is far simpler than a full-fledged hash algorithm, but how it is implemented reflects the programmer's skill and directly affects how quickly your objects can be stored and retrieved. Different hashCode implementations can make access to the same objects differ in performance by a factor of hundreds.

Let's look at two important data structures in Java: HashMap and Hashtable. They differ in several ways: they have different inheritance hierarchies, different constraints on keys and values (whether null is allowed), and different thread-safety guarantees. Their underlying implementation principle is the same, however, so only Hashtable is used in the description below:

In Java, an array is generally the fastest structure for data access. But when choosing a container for a somewhat larger data set, a Hashtable offers faster lookups than an array. The reasons are explained below.

When storing data, Hashtable first performs a bitwise AND of the key object's hash code with 0x7fffffff. An object's hash code can be negative, and this operation guarantees a non-negative integer. It then takes the result modulo the table length to obtain the index of the value object in the table:

index = (o.hashCode() & 0x7fffffff) % hs.length;

The value object is placed directly at that index. For writes this is the same as an array: an object is put at an index position. For lookups, however, Hashtable applies the same algorithm to the key, obtains the index directly, and fetches the value object from that position, whereas an array has to be scanned and compared element by element. So when the data volume is somewhat larger, Hashtable lookups outperform an array.
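The index calculation described above can be sketched as follows. This is a minimal illustration rather than the actual Hashtable source; the class name BucketIndexDemo, the method name indexFor, and the table length 11 are assumptions chosen for the example.

```java
// Hypothetical demo class; not part of the JDK.
class BucketIndexDemo {
    // Clears the sign bit so the hash is non-negative,
    // then maps it into the range [0, tableLength).
    static int indexFor(Object key, int tableLength) {
        return (key.hashCode() & 0x7fffffff) % tableLength;
    }

    public static void main(String[] args) {
        int tableLength = 11;      // assumed table size for illustration
        Integer negativeKey = -5;  // Integer.hashCode() is the int value itself
        // Without the mask, -5 % 11 would be negative and unusable as an index.
        System.out.println(indexFor(negativeKey, tableLength));
        System.out.println(indexFor("hello", tableLength));
    }
}
```

The mask only clears the sign bit, so non-negative hash codes pass through unchanged before the modulo step.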

Different objects usually have different hash codes, but different hash codes can still produce the same index once the remainder modulo the table length is taken.

In extreme cases, a large number of objects map to the same index. This is the issue that matters most for Hashtable performance: hash collisions. The common kind of collision is that different key objects end up at the same index. A much rarer kind arises when the number of objects in a set exceeds the range of int: since a hash code is an int, some elements must share the same hash code, and they will share the same index no matter what. This extreme situation is rare and can be ignored for now. Either way, equal hash codes always yield the same index after the modulo step, so different objects with the same hash code necessarily land on the same index.
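Both kinds of collision can be seen in a short sketch. The class name CollisionDemo and the table length 11 are assumptions for illustration; the two strings are a well-known pair that happen to share a hash code under String.hashCode().

```java
// Hypothetical demo class; not part of the JDK.
class CollisionDemo {
    static int indexFor(int hash, int tableLength) {
        return (hash & 0x7fffffff) % tableLength;
    }

    public static void main(String[] args) {
        // Different hash codes, same index: 7 and 18 both land in slot 7
        // of a table of length 11.
        System.out.println(indexFor(7, 11) == indexFor(18, 11)); // prints true

        // Different objects, same hash code, therefore always the same index.
        System.out.println("Aa".hashCode() == "BB".hashCode());  // prints true
        System.out.println("Aa".equals("BB"));                   // prints false
    }
}
```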

In fact, a well-designed Hashtable generally distributes its elements fairly evenly, because the table length always grows in proportion to the actual number of elements (the load factor is generally 0.75). As a result, most index positions hold only one object and only a few hold several. Each position in the Hashtable therefore stores a linked list. For a position with only one object, the list has a single entry whose next is null. Each entry's hashcode, key, and value fields store the hash code, key, and value of the object at that position. When another object with the same index arrives, it is linked into the list as a further node. When several objects share an index, the entry matching the queried key is found by walking the list and comparing hash code and key.
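The linked-list layout described above can be sketched as a very small chained hash table. This is a simplified illustration under stated assumptions, not the JDK implementation: the class name TinyHashtable, the initial length 11, and the growth rule (double plus one) are choices made for the example, and new entries are linked in at the head of the list for brevity.

```java
// Hypothetical, simplified chained hash table for illustration only.
class TinyHashtable<K, V> {
    // One node of a bucket's linked list.
    private static final class Entry<K, V> {
        final int hash;
        final K key;
        V value;
        Entry<K, V> next;

        Entry(int hash, K key, V value, Entry<K, V> next) {
            this.hash = hash;
            this.key = key;
            this.value = value;
            this.next = next;
        }
    }

    private static final float LOAD_FACTOR = 0.75f;
    private Entry<K, V>[] table = newTable(11); // assumed initial length
    private int size;

    @SuppressWarnings("unchecked")
    private static <K, V> Entry<K, V>[] newTable(int length) {
        return (Entry<K, V>[]) new Entry[length];
    }

    private static int indexFor(int hash, int length) {
        return (hash & 0x7fffffff) % length;
    }

    V get(K key) {
        int hash = key.hashCode();
        for (Entry<K, V> e = table[indexFor(hash, table.length)]; e != null; e = e.next) {
            // Compare the cheap int first, then fall back to equals().
            if (e.hash == hash && e.key.equals(key)) {
                return e.value;
            }
        }
        return null;
    }

    // Returns the previous value for the key, or null if there was none.
    V put(K key, V value) {
        int hash = key.hashCode();
        int index = indexFor(hash, table.length);
        for (Entry<K, V> e = table[index]; e != null; e = e.next) {
            if (e.hash == hash && e.key.equals(key)) {
                V old = e.value;
                e.value = value;
                return old;
            }
        }
        if (size >= table.length * LOAD_FACTOR) {
            rehash();
            index = indexFor(hash, table.length);
        }
        table[index] = new Entry<>(hash, key, value, table[index]);
        size++;
        return null;
    }

    // Grows the table and redistributes every entry using its stored hash.
    private void rehash() {
        Entry<K, V>[] old = table;
        table = newTable(old.length * 2 + 1);
        for (Entry<K, V> head : old) {
            for (Entry<K, V> e = head; e != null; ) {
                Entry<K, V> next = e.next;
                int index = indexFor(e.hash, table.length);
                e.next = table[index];
                table[index] = e;
                e = next;
            }
        }
    }

    int size() {
        return size;
    }
}
```

Storing the hash code in each entry means a lookup can reject most non-matching entries with an int comparison, and a rehash never has to call hashCode() again.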

From the above, the first thing with a major impact on the access performance of HashMap and Hashtable is to give the elements in the data structure hash codes that differ as much as possible. Different hash codes do not guarantee different indexes, but equal hash codes always produce the same index, which leads directly to hash collisions.

If an object has many attributes, having all of them take part in the hash is obviously a clumsy design. An object's hashCode() method is called implicitly in many places, for example on every insertion into or lookup in a hash-based collection, so if too many attributes take part in the hash, the cost of each call grows considerably. Choosing which attributes take part in the hash is therefore a real test of programming skill.

Typically, a hashCode method returns attribute1.hashCode() + attribute2.hashCode() ... [+ super.hashCode()].
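A minimal sketch of this pattern is shown below. The class PersonKey and its two fields are hypothetical and exist only for illustration. The hashCode() method uses the additive form from the text; the extra weightedHash() method shows, for comparison, the order-sensitive combination provided by java.util.Objects.hash, which many codebases prefer because a plain sum gives the same result when attribute values are swapped.

```java
import java.util.Objects;

// Hypothetical value class used only for illustration.
final class PersonKey {
    private final String name;
    private final int birthYear;

    PersonKey(String name, int birthYear) {
        this.name = Objects.requireNonNull(name);
        this.birthYear = birthYear;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof PersonKey)) return false;
        PersonKey other = (PersonKey) o;
        return birthYear == other.birthYear && name.equals(other.name);
    }

    @Override
    public int hashCode() {
        // Additive form: one term per attribute that takes part in the hash.
        return name.hashCode() + Integer.hashCode(birthYear);
    }

    // Alternative combination, shown for comparison only.
    int weightedHash() {
        return Objects.hash(name, birthYear);
    }
}
```

Whichever combination is used, the attributes in hashCode() should be the same ones compared in equals(), so that equal objects always report equal hash codes.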

Every call to this method recomputes the hash codes of all the attributes that take part in the hash, even if none of the object's attributes have changed. If you instead cache the current hash code along with a flag, you only need to recompute it when an attribute involved in the hash changes; otherwise the cached value is returned, which can improve performance considerably. Of course, because Java objects carry mutable state, it can be difficult to know when one of the attributes involved has changed.
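The caching idea can be sketched with a validity flag that every setter clears. The class CachedHashAccount and its fields are hypothetical, and the sketch assumes single-threaded use and that every change to a hashed attribute goes through a setter.

```java
// Hypothetical mutable class; names are assumptions for illustration.
class CachedHashAccount {
    private String owner;
    private long number;

    private int cachedHash;
    private boolean hashValid; // false until first computed, and after any change

    CachedHashAccount(String owner, long number) {
        this.owner = owner;
        this.number = number;
    }

    void setOwner(String owner) {
        this.owner = owner;
        hashValid = false; // invalidate the cache
    }

    void setNumber(long number) {
        this.number = number;
        hashValid = false; // invalidate the cache
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof CachedHashAccount)) return false;
        CachedHashAccount other = (CachedHashAccount) o;
        return number == other.number && owner.equals(other.owner);
    }

    @Override
    public int hashCode() {
        if (!hashValid) {
            // Recompute only when an attribute has changed since the last call.
            cachedHash = owner.hashCode() + Long.hashCode(number);
            hashValid = true;
        }
        return cachedHash;
    }
}
```

Note that changing a hashed attribute while the object is in use as a key in a hash table still makes that entry unreachable under its new hash code, with or without caching, so the technique fits best when such attributes rarely change.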

The default implementation typically converts the object's internal address into an integer and uses it as the hash code. This generally gives each object a different hash code, because different objects have different internal addresses. However, the Java language does not let programmers obtain the internal address of an object, so a good deal of technique goes into generating a distinct hash code for each object.
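The default behaviour can be observed through System.identityHashCode, which returns the value Object.hashCode() would give even for a class that overrides hashCode(). The classes below are hypothetical and exist only for illustration; how the JVM derives the default value is implementation-specific and is not assumed here.

```java
// Hypothetical demo; Plain and Valued are illustrative class names.
class IdentityHashDemo {
    static final class Plain { } // inherits Object.hashCode()

    static final class Valued {
        final int id;

        Valued(int id) {
            this.id = id;
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof Valued && ((Valued) o).id == id;
        }

        @Override
        public int hashCode() {
            return Integer.hashCode(id);
        }
    }

    public static void main(String[] args) {
        Plain a = new Plain();
        Plain b = new Plain();
        // Usually different, but uniqueness is not promised.
        System.out.println(a.hashCode() + " vs " + b.hashCode());

        Valued v = new Valued(42);
        System.out.println(v.hashCode());               // prints 42
        System.out.println(System.identityHashCode(v)); // a JVM-chosen value
    }
}
```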

Sampling, from among many attributes, a few whose hash codes are evenly distributed is a trade-off between performance and diversity. If all attributes take part in the hash, the diversity of hash codes improves greatly, but performance suffers. If only a few attributes are sampled, a large number of hash collisions can occur in extreme cases. For example, if a "person" object is hashed by gender rather than by name or date of birth, there are only two or so possible hash code values, so more than half of the objects collide. Where possible, it is therefore a good choice to generate a dedicated sequence number and derive the hash code from it (provided that generating the sequence is cheaper than hashing all attributes; otherwise it is better to hash all attributes directly).
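The effect of hashing on a low-diversity attribute can be made visible by counting how many distinct table slots a set of hash codes occupies. The sketch below is illustrative only: the class name HashSpreadDemo, the sample names, and the table length 11 are assumptions, and gender is modelled as a boolean purely to keep the example short.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical demo class; not part of the JDK.
class HashSpreadDemo {
    // Counts how many different slots the given hash codes occupy.
    static int distinctIndexes(int[] hashes, int tableLength) {
        Set<Integer> used = new HashSet<>();
        for (int h : hashes) {
            used.add((h & 0x7fffffff) % tableLength);
        }
        return used.size();
    }

    public static void main(String[] args) {
        String[] names = {"Ann", "Bob", "Cid", "Dee", "Eve", "Fay"};
        boolean[] female = {true, false, false, true, true, true};

        int[] byGender = new int[names.length];
        int[] byName = new int[names.length];
        for (int i = 0; i < names.length; i++) {
            byGender[i] = Boolean.hashCode(female[i]); // only two possible values
            byName[i] = names[i].hashCode();
        }

        // Hashing by gender can never use more than two slots.
        System.out.println("by gender: " + distinctIndexes(byGender, 11));
        System.out.println("by name:   " + distinctIndexes(byName, 11));
    }
}
```

With only two possible hash codes, every additional person lengthens one of two linked lists, so lookups degrade toward a linear scan no matter how large the table grows.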

To balance the performance and diversity of hash codes, you can consult a book on algorithm design. The result does not have to be excellent, as long as it minimizes the clustering of hash values. The important thing to remember is that hashCode has a major impact on the performance of our programs, and we should keep it in mind throughout program design.
