Understanding the features and principles of Hashtable in real-world development (I)

Source: Internet
Author: User
Tags php source code

Hashtable is an indispensable everyday tool for most modern programmers. ASP.NET programmers, for example, deal with Application items and Cache items every day, and both are implemented with Hashtable. We also routinely use Hashtable or related structures, such as NameValueCollection, to store configuration parameters and data columns. .NET 2.0 added System.Collections.Generic.Dictionary, which resembles Hashtable at first glance and also has the advantage of generics. Can we say that Dictionary will replace Hashtable? How is Hashtable implemented? What scenarios is it suited to? What are its advantages and disadvantages? The official Microsoft documentation is not clear on these points, so we may as well do some preliminary research ourselves and compare against the Java and PHP implementations.

In a narrow sense, Hashtable can be a specific type name, such as the .NET System.Collections.Hashtable class or the Java java.util.Hashtable class. Broadly speaking, it refers to a data structure, the hash table, which covers multiple types such as HashMap and the Dictionary mentioned at the beginning of this article. Despite the varied names, they all belong to the category of hash tables. Unless otherwise specified, the term hashtable below refers to the hash table in the broad sense.

The original definition and basic principles of the hash table are described in any data structures textbook. In short, a hash table can retrieve records by keyword (a typical example is a string key) because it internally maintains a mapping F between keywords and record storage locations, that is, index numbers in the internal array. When searching, you only need to compute the index F(k) for the given key k through the mapping F, and you can then fetch the target data directly from the array, hashtable[k] = hashtable.internalArray[F(k)], without traversing and comparing the array. This mapping F is called the hash function.

Two important properties of the hash function F:
[1] The hash function can be defined freely, as long as the integer F(k) stays within the lower and upper bounds of the array in which the hash table stores its data.
[2] k can take any value, but F(k) is confined to a fixed range. Different keywords may therefore map to the same hash value, which produces a conflict.
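The two properties can be illustrated with a minimal Python sketch. The names hash_of, f and ARRAY_LENGTH are hypothetical, and the character-code hash is only an assumed toy rule, not the algorithm of any real runtime.

```python
def hash_of(key: str) -> int:
    """Toy hash (an assumption): combine the character codes of the key."""
    h = 0
    for ch in key:
        h = h * 31 + ord(ch)
    return h


def f(key: str, length: int) -> int:
    """Map a key to an index in [0, length) of the internal array."""
    return hash_of(key) % length


ARRAY_LENGTH = 11
keys = [f"key{i}" for i in range(20)]
indexes = [f(k, ARRAY_LENGTH) for k in keys]

# Property [1]: every index stays inside the bounds of the array.
in_range = all(0 <= i < ARRAY_LENGTH for i in indexes)

# Property [2]: 20 keys cannot occupy 11 slots without sharing, so at
# least two different keys must map to the same index (a conflict).
has_conflict = len(set(indexes)) < len(keys)

print(in_range, has_conflict)  # True True
```

Whatever rule hash_of uses, the second result follows from counting alone: there are more keys than slots.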

Note that both computing the hash function and handling conflicts incur overhead, and the latter is especially expensive. This raises two key issues: how to design the algorithm of the function F, and how to handle conflicts, so that the hash table is as efficient as possible.

Different languages and runtime environments have different solutions, and their ideas can differ considerably. For example, .NET's System.Collections.Hashtable and Java's java.util.Hashtable share a name, but their internal algorithms differ, which also produces performance differences.

Here we select several common implementations for in-depth analysis:
[1] .NET 2.0, System.Collections.Hashtable
[2] .NET 2.0, System.Collections.Generic.Dictionary<K, V>
[3] Java, java.util.HashMap (a lightweight implementation of java.util.Hashtable)
[4] PHP 5. PHP is a weakly typed language; the hash table is transparent to programmers and is implemented behind the scenes by the runtime.

Note: the .NET source code above comes from Reflector decompilation; for the Java source code, see the JDK; for the PHP source code, see the PHP SDK. For convenience, some pseudocode is used below.

.NET's System.Collections.Hashtable (hereinafter referred to as Hashtable) is a traditional implementation that is highly representative of the classic style. Data structure textbooks generally introduce hash tables using similar principles. (Of course, the textbook versions are simple and primitive, and there is still a gap between them and real implementations.)

The actual data in Hashtable is stored in an internal array (which, like any ordinary array, has a fixed capacity and lower and upper bounds, and is accessed by numeric index). When you request the value hashtable[k], Hashtable performs the following processing:

[1] To guarantee that F(k) falls in the range 0 <= F(k) < array.length, the key step of the function F is the modulo operation. The actual storage location is F(k) = hashOf(k) % array.length. As for how hashOf(k) is computed, it can, for example, be derived from the ASCII codes of the keyword according to some rule.

[2] If several keys produce the same hash value, that is, F(k1) = F(k2), and position F(k1) is already occupied, Hashtable handles the conflict with the "open addressing" method. Concretely, it changes hashOf(k2) % array.length to (hashOf(k2) + d(k2)) % array.length to obtain another position for the data of keyword k2, where d is an increment function. If the conflict persists, the increment is applied again, and so on until an empty slot in the array is found. When k2 is looked up later, the search starts at hashOf(k2) % array.length; if k2 is not there, the search advances by the increment d(k2) until it is found. The more consecutive conflicts there are, the more probes are needed and the lower the efficiency.

[3] When the amount of inserted data reaches the capacity limit of Hashtable, the internal array is expanded (a larger array is allocated and the data is copied over; because array.length has changed, F(k) must be recalculated for all existing data). Resizing is therefore a very expensive internal operation. Hashtable's write efficiency is only about 1/10 of its read efficiency, and frequent resizing is one factor.
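The three steps above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the decompiled .NET code: the class OpenAddressingTable, the helpers hash_of and next_prime, and the particular increment d(k) = 1 + hashOf(k) % (length - 1) are all hypothetical choices, and deletion is left out.

```python
def hash_of(key: str) -> int:
    """Toy hash (an assumption): combine the character codes of the key."""
    h = 0
    for ch in key:
        h = (h * 31 + ord(ch)) & 0x7FFFFFFF
    return h


def next_prime(n: int) -> int:
    """Smallest prime >= n, found by trial division."""
    n = max(n, 2)
    while any(n % i == 0 for i in range(2, int(n ** 0.5) + 1)):
        n += 1
    return n


class OpenAddressingTable:
    LOAD_FACTOR = 0.72  # value quoted in the article

    def __init__(self, capacity: int = 11):
        self.slots = [None] * next_prime(capacity)
        self.count = 0

    def _probe(self, key):
        """Yield F(k), then keep advancing by the increment d(k)."""
        length = len(self.slots)
        index = hash_of(key) % length            # step [1]: modulo
        step = 1 + hash_of(key) % (length - 1)   # assumed d(k), never 0
        for _ in range(length):
            yield index
            index = (index + step) % length      # step [2]: next slot

    def put(self, key, value):
        # Step [3]: expand before the load limit would be exceeded
        # (kept simple: updating an existing key may also trigger it).
        if self.count + 1 > int(len(self.slots) * self.LOAD_FACTOR):
            self._resize()
        for index in self._probe(key):
            slot = self.slots[index]
            if slot is None:
                self.slots[index] = (key, value)
                self.count += 1
                return
            if slot[0] == key:
                self.slots[index] = (key, value)
                return

    def get(self, key):
        for index in self._probe(key):
            slot = self.slots[index]
            if slot is None:
                break                # an empty slot ends the search
            if slot[0] == key:
                return slot[1]
        raise KeyError(key)

    def _resize(self):
        """Allocate a larger prime-sized array and re-insert every item."""
        old = [s for s in self.slots if s is not None]
        self.slots = [None] * next_prime(2 * len(self.slots))
        self.count = 0
        for key, value in old:
            self.put(key, value)     # F(k) is recomputed for the new length


table = OpenAddressingTable()
for i in range(8):
    table.put(f"key{i}", i)
print(len(table.slots), table.get("key5"))  # 23 5
```

Because the array length is prime and the step lies between 1 and length - 1, the probe sequence visits every slot once before repeating, so an insert always finds a free slot while the table is below its load limit.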

How F(k) is obtained is the key to a hash table and fundamentally determines many of its important characteristics. For .NET's System.Collections.Hashtable, the algorithm of the hash function F determines the following:

[1] The larger the array capacity array.length, the smaller the chance of conflict. Because the range of F(k) equals array.length, the values of F(k) become more varied as array.length grows and are less likely to repeat.

[2] The array capacity array.length should be a "relatively large prime number", so that F(k) = hashOf(k) % array.length has a smaller chance of conflict after the modulo operation. Imagine an extreme example: if array.length = 2, then F(k) is 0 whenever hashOf(k) is even. The actual capacity of a hash table therefore generally follows a rule; unlike an ordinary array, it cannot be set arbitrarily.

[3] As the number of inserted items grows, empty slots in the Hashtable array become scarcer and the next conflict becomes more likely, which seriously hurts efficiency. You therefore cannot wait until the array is full before resizing it. In .NET, expansion starts when the ratio of inserted items to array capacity reaches 0.72. This 0.72 is called the load factor. It is a sensitive number: raising or lowering the load factor by 0.01 may in some cases raise or lower your Hashtable access efficiency by 50%, because the load factor determines array.length, array.length affects the probability of F(k) conflicts, and that in turn affects performance. 0.72 is a balanced value that Microsoft arrived at through long-term experiments. (The appropriate value also depends on the F(k) algorithm; 0.72 is not necessarily suitable for hash tables with other structures.)

[4] The initial array.length of Hashtable is at least 11, and each expansion grows to at least "a prime number no less than 2 times the current capacity". Here is an example to show how much space Hashtable wastes.

Assume that a Hashtable is created with the default constructor and eight values are inserted in sequence. Because 8 / 0.72 > 11, Hashtable expands automatically, and the new capacity is the prime number no less than 11 * 2, that is, 23. So only eight guests come to dinner, yet a banquet with 23 seats has to be laid out. To avoid this, you can use the constructor that takes a capacity parameter when initializing Hashtable so that the capacity is set directly to 17, but 9 slots are still wasted.

Having done the arithmetic, the reader may ask why the initial capacity cannot be 13: 13 is a prime number, and 13 * 0.72 > 8. That would be ideal, but dynamically determining whether a number is prime takes a lot of time, so the capacity values in .NET Hashtable come from an internal preset series and can only be 3, 7, 11, 17, 23... so, unfortunately, 13 is not available. (Note: the capacity is computed dynamically only when array.length > 0x6dda89; normally we do not store that much data.)
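The arithmetic of this example can be checked with a short Python sketch. It follows the article's description rather than the decompiled source; the names PRESET_PRIMES, max_items, capacity_for and expanded_capacity are hypothetical, and the list holds only the five preset values quoted above, since the rest of the series is elided.

```python
# Only the preset capacities quoted in the article; the real series
# continues, but the remaining values are elided there.
PRESET_PRIMES = [3, 7, 11, 17, 23]
LOAD_FACTOR = 0.72


def max_items(capacity: int) -> int:
    """Number of items a table of this capacity holds before expanding."""
    return int(capacity * LOAD_FACTOR)


def capacity_for(item_count: int) -> int:
    """Smallest preset capacity that holds item_count without expanding."""
    for prime in PRESET_PRIMES:
        if max_items(prime) >= item_count:
            return prime
    raise ValueError("beyond the preset values quoted in the article")


def expanded_capacity(capacity: int) -> int:
    """Smallest preset prime that is at least twice the current capacity."""
    for prime in PRESET_PRIMES:
        if prime >= 2 * capacity:
            return prime
    raise ValueError("beyond the preset values quoted in the article")


# Default capacity 11 holds 7 items, so the 8th insert expands it to 23.
print(max_items(11), expanded_capacity(11))  # 7 23

# Sizing for 8 items up front skips 13 (not in the series) and lands on 17.
print(capacity_for(8), capacity_for(8) - 8)  # 17 9
```

A capacity of 13 would hold int(13 * 0.72) = 9 items, which is why it looks ideal on paper; it is excluded only because it is not in the preset series.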

.NET Hashtable reduces conflicts in this way, trading space for read/write speed. If your project is sensitive to memory usage, for example when developing a very large B/S (browser/server) website with ASP.NET, it is well worth reviewing where Hashtable is really needed: could a different approach, such as a custom struct or a plain array, do the job more efficiently?
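The space-for-speed trade-off can be made visible with a small simulation. The Python sketch below is an illustration under assumptions, not a benchmark of .NET: it fills a prime-sized open-addressing table with random integer keys up to a chosen load and reports the average number of probes per successful lookup. The names CAPACITY and average_probes, the fixed seed, and the increment rule are all hypothetical choices.

```python
import random

CAPACITY = 1009  # a prime, as the article recommends


def average_probes(load: float, seed: int = 1) -> float:
    """Fill the table to the given load; return mean probes per lookup."""
    rng = random.Random(seed)
    keys = rng.sample(range(10**9), int(CAPACITY * load))  # distinct keys
    slots = [None] * CAPACITY

    def probe(key):
        """Start at key % CAPACITY, then advance by an assumed d(k)."""
        index = key % CAPACITY
        step = 1 + key % (CAPACITY - 1)
        while True:
            yield index
            index = (index + step) % CAPACITY

    # Insert: take the first empty slot on the probe sequence.
    for key in keys:
        for index in probe(key):
            if slots[index] is None:
                slots[index] = key
                break

    # Look up every key and count how many slots had to be inspected.
    total = 0
    for key in keys:
        for probes, index in enumerate(probe(key), start=1):
            if slots[index] == key:
                total += probes
                break
    return total / len(keys)


for load in (0.50, 0.72, 0.95):
    print(load, round(average_probes(load), 2))
```

A fuller table should need noticeably more probes per lookup than a half-empty one, which is the cost the load factor keeps in check by spending extra array slots.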

 

From: http://blog.csdn.net/cloudandy/archive/2007/05/11/1604273.aspx

 

 
