Hash table
A hash table (Hashtable), also known as a hash, is a collection of key/value pairs organized according to the hash code of the key. A Hashtable object is made up of buckets (bucket) that contain the elements of the collection. A bucket is a virtual subgroup of the elements within the Hashtable, which makes finding and retrieving an element easier and faster than in most collections.
A hash function is an algorithm that returns a numeric hash code based on a key. The key is the value of some property of the object being stored. When an object is added to a Hashtable, it is stored in the bucket associated with its hash code. When a value is searched for in the Hashtable, a hash code is generated for that value and the bucket associated with that hash code is searched. For example, "student" and "teacher" would be placed in different buckets, while "dog" and "God" might be placed in the same bucket. A Hashtable therefore performs best when every key is unique. Four advantages of hashing are listed below.
- No sorting is required beforehand.
- Lookup speed is independent of the amount of data.
- It provides the strong cryptographic confidentiality (security) needed for digital signatures.
- Data can be compressed to save space.
Hash tables are widely used in the Linux kernel, and most of the language features in the PHP core are implemented on top of hash tables. Why is the hash table so powerful? Because it makes both storing and looking up data efficient, and storage and lookup are the two most common operations in programming.
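To make this concrete, here is a minimal sketch of a separately chained hash table in C. The structure names, the 8-bucket table size, and the trivial modulo hash are illustrative choices of mine, not taken from any particular implementation.

```c
#include <stdio.h>
#include <stdlib.h>

#define NBUCKETS 8  /* illustrative bucket count */

struct node {
    unsigned int key;
    int value;
    struct node *next;   /* chain of elements sharing a bucket */
};

static struct node *table[NBUCKETS];

/* Trivial hash: map the key onto one of NBUCKETS buckets. */
static unsigned int hash(unsigned int key)
{
    return key % NBUCKETS;
}

/* O(1) on average: compute the bucket, prepend to its chain. */
static void insert(unsigned int key, int value)
{
    struct node *n = malloc(sizeof(*n));
    n->key = key;
    n->value = value;
    n->next = table[hash(key)];
    table[hash(key)] = n;
}

/* O(1) on average: only one bucket's chain is scanned. */
static struct node *lookup(unsigned int key)
{
    struct node *n;
    for (n = table[hash(key)]; n; n = n->next)
        if (n->key == key)
            return n;
    return NULL;
}

int main(void)
{
    insert(42, 1);
    insert(50, 2);            /* 50 % 8 == 42 % 8 == 2: same bucket */
    struct node *n = lookup(42);
    if (n)
        printf("42 -> %d\n", n->value);
    return 0;
}
```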
Hash table in the Linux kernel
Anyone who has read the Linux kernel source may notice that it does not contain many complex data structures: the doubly linked list (list) and the list-based hash table account for most of the data structures used. Why does the kernel rely so heavily on these two? Around this question (mainly the hash table), I will try to work out the intent with my own understanding.
First, both of these data structures are very simple, in two respects: simple to understand and simple to use. That means the readability and maintainability of the code are better than with more complex data structures, and the risk of bugs is lower. Philosophically, this is also in line with the K.I.S.S. principle.
Second, the kernel is performance-oriented software. Is it worth losing performance for simplicity of design and maintenance? Should we lean more toward performance? I cannot remember where I heard it, but many commercial routing products store routing entries in a binary tree to get O(log n) route lookup, and their authors criticize Linux for organizing routing entries into a hash table, claiming the resulting performance is too poor for commercial use. There is some truth in that, but let us analyze it carefully: is the performance of a hash table really worse than that of a binary tree? Insertion and deletion in a binary tree cost O(log n); in a hash table they cost O(1) on average and O(n) in the worst case. If the number of buckets m is chosen large enough and the hash function is good enough, the lookup cost is O(n/m) (when m <= n). When m > n/log(n), the average performance of the hash table is better than the binary tree, and when m >= n its cost approaches O(1). The value of m can be made tunable, which also shows the kernel's customizability. But do not be blindly optimistic: all of this presupposes a good enough hash function.
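To put rough, illustrative numbers on this trade-off (a back-of-the-envelope calculation, not a benchmark): with n = 1,000,000 entries, a balanced binary tree needs about log2(1,000,000) ≈ 20 comparisons per lookup. A hash table with m = 65,536 buckets averages n/m ≈ 15 elements per bucket, already in the same ballpark, and with m = 1,048,576 buckets the average chain length drops below one element, so lookups approach O(1).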
The merits and demerits of hash functions
How do we judge whether a hash function is good or bad?
The Chinese term for hash literally means "scattered arrangement", which captures the idea well: a good hash function should distribute all elements evenly, avoiding or at least minimizing collisions (collision) between them. Let me stress again that the hash function must be chosen carefully: if, unluckily, all elements collide with one another, the hash table degenerates into a linked list, its performance drops sharply, and the time complexity falls to O(n). There is absolutely no room for wishful thinking here, because the danger is real. Historically, a weakness in a Linux kernel hash function was exploited to construct a large number of colliding elements, leaving the system open to denial of service (DoS). So today most kernel hash functions mix in a random number as a parameter, so that the final value cannot easily be predicted. This adds a second, security-oriented requirement: a hash function should preferably be one-way and seeded with random numbers. Speaking of one-way, you might think of the one-way hash functions MD4 and MD5; unfortunately, they are unsuitable here, because a hash-table hash function needs to be very fast.
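As an illustration of such seeding, here is a minimal sketch using the well-known FNV-1a hash with a random value mixed into its initial state. Using FNV-1a, and seeding it this way, is my choice for the example, not the kernel's actual scheme.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/* Seeded FNV-1a: an attacker who cannot learn the seed cannot
 * precompute a set of keys that all land in one bucket. */
static uint32_t seed;

static void hash_seed_init(void)
{
    srand((unsigned)time(NULL));   /* illustrative only; a real
                                      system should use a CSPRNG */
    seed = (uint32_t)rand();
}

static uint32_t hash_bytes(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint32_t h = 2166136261u ^ seed;   /* FNV offset basis, seeded */
    size_t i;

    for (i = 0; i < len; i++) {
        h ^= p[i];
        h *= 16777619u;                /* FNV prime */
    }
    return h;
}
```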
Feeling helpless? But who said you have to build everything yourself behind closed doors? Look at what predecessors have done, and fully embrace the spirit of borrowing; I call this practice "subduing the enemy without fighting", the supreme strategy of the military classics. The jhash used in the Linux kernel is a tried-and-tested hash function that can simply be CPMSed (Copy, Paste, Modify, Save). Its author, Bob Jenkins, has also published on his website hash functions for predictable data, such as the perfect (perfect) hash, among others; readers can pick what suits them, and those interested in digging further can stand on his shoulders.
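For reference, here is a minimal kernel-style sketch of how jhash might be used to pick a bucket. jhash() is declared in include/linux/jhash.h; the random initval and the 256-bucket table are illustrative choices of mine.

```c
#include <linux/types.h>
#include <linux/jhash.h>
#include <linux/random.h>

#define MY_HASH_BITS  8
#define MY_HASH_SIZE  (1 << MY_HASH_BITS)   /* 256 buckets, illustrative */

static u32 my_hash_rnd;   /* random seed mixed into every hash */

static void my_hash_init(void)
{
    get_random_bytes(&my_hash_rnd, sizeof(my_hash_rnd));
}

static unsigned int my_bucket(const void *key, u32 len)
{
    /* jhash(key, length, initval): seeding with a random initval
     * makes the bucket of any given key unpredictable to attackers. */
    return jhash(key, len, my_hash_rnd) & (MY_HASH_SIZE - 1);
}
```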
What is a bucket?
An English explanation of the term bucket reads:
Hash table lookup operations are often O(n/m) (where n is the number of objects in the table and m is the number of buckets), which is close to O(1), especially when the hash function has spread the hashed objects evenly through the hash table, and there are more hash buckets than objects to be stored.
This can be understood as:
The result of one hash value corresponds to an address that can hold two buckets, so a hash collision can be resolved:
- When the first piece of data hashes to this address, it is stored in the first bucket.
- When, for whatever reason, a second piece of data later hashes to the same address, it is stored in the second bucket.
A hash table consisting of 5 buckets with 7 elements:
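(The original illustration is missing here; the sketch below is my reconstruction, assuming separate chaining and an arbitrary placement of the 7 elements.)

```
bucket 0 -> [a] -> [b]
bucket 1 -> [c]
bucket 2 -> [d] -> [e] -> [f]
bucket 3 -> (empty)
bucket 4 -> [g]
```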
The Linux hash functions such as hash_long() compute with the golden ratio. Since the number of buckets (bits) has to be determined jointly by the hash function and the expected collision rate, how do we choose the bucket count for a function like hash_long()?
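Here is a minimal sketch of this golden-ratio (Fibonacci) multiplicative hashing, modeled on the kernel's hash_32()/hash_long(); the exact constant has varied across kernel versions, so treat the value below as illustrative.

```c
#include <stdint.h>

/* Golden-ratio-derived constant; recent kernels use 0x61C88647,
 * older ones used the prime 0x9E370001 — illustrative here. */
#define GOLDEN_RATIO_32 0x61C88647u

/* Multiply by the golden-ratio constant, then keep the top `bits`
 * bits: the multiplication mixes the key, and the high bits are
 * the best-distributed, so they select the bucket. */
static inline uint32_t hash_32(uint32_t val, unsigned int bits)
{
    return (val * GOLDEN_RATIO_32) >> (32 - bits);
}

/* Example: a 10-bit result indexes a table of 1024 buckets:
 *     uint32_t bucket = hash_32(key, 10);
 */
```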
In general, the hash algorithm should be chosen based on the characteristics of the data, rather than clinging stubbornly to a single one.
For example, for a hash table of IP addresses, 65,536 buckets work well, using the low 16 bits of the IP address as the key. This method gives a lower collision rate than general-purpose functions such as hash_long() and jhash.
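A minimal sketch of that idea (the function and constant names are mine): the low 16 bits of the address index directly into 65,536 buckets, with no further mixing needed.

```c
#include <stdint.h>

#define IP_HASH_BUCKETS 65536   /* 2^16 buckets, one per low-16-bit value */

/* For IPv4 addresses, the low 16 bits (the host part in many
 * networks) vary the most, so using them directly as the bucket
 * index spreads entries well. */
static inline unsigned int ip_bucket(uint32_t ip)
{
    return ip & 0xFFFFu;
}
```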
In fact, this is a compromise between space and performance. I could take the maximum size of my problem space as the bucket count, which guarantees the key values are fully dispersed, but that would waste a great deal of space. Make the table too small, though, and lookup efficiency suffers. My feeling is that the right size still has to be found by experiment. Personally, I think what makes the hash table more flexible than other lookup data structures is precisely this customizability: it can be tuned to the specific situation to achieve the best result.
Buckets (bucket) in the Linux kernel hash table