Linux kernel hash tables and buckets



Hash tables (Hashtable) in the Linux kernel are also called "scatter tables" (散列表). A hash table is a collection of key/value pairs organized according to the hash code of the index key, and it is composed of buckets, each containing a subset of the elements in the collection. A bucket is a virtual sub-group of the table's elements, which makes searching and retrieval easier and faster in most collections. A hash function is an algorithm that returns a numeric hash code based on the index key (the key is some attribute value of the stored object). When an object is added to the hash table, it is stored in the bucket associated with its hash code; when you search for a value, the hash code is computed from the key and the corresponding bucket is searched. For example, "student" and "teacher" may be placed in different buckets, while "dog" and "god" may land in the same bucket. The table therefore performs best when the index key uniquely identifies the element being retrieved.

Hashing has the following advantages: no sorting is required in advance; search speed is independent of the amount of data; it underpins the cryptography of digital signatures, where its one-wayness protects confidentiality; and fixed-size digests can compress data to save space.

Anyone who has read the Linux kernel source code will notice that it does not contain many complex data structures: linked lists and list-based hash tables account for the vast majority. Why are these two data structures so widely used by the kernel? Around this question (mainly the hash table), I will try to work out the intent with my own understanding.

First, both data structures are very simple, both to understand and to use. This also means that code readability and maintainability are better than with more complex data structures, and the risk of bugs is lower. Philosophically, this is also in line with the K.I.S.S. principle.

Second, the kernel is performance-oriented software. If performance is sacrificed for simplicity of design and maintenance, is that not a losing trade? Shouldn't the balance lean more toward performance? I can't remember where I heard it, but much commercial routing software stores route entries in binary-tree structures, so that route lookup costs O(log n), and its proponents criticize Linux for organizing route entries in hash tables, calling the performance poor and unsuitable for commercial use. Is the performance of a hash table really worse than that of a binary tree? Insertion and deletion in a balanced binary tree cost O(log n). For a hash table they cost O(1) in the best case and O(n) in the worst; if the table has enough buckets (m) and the hash function is good enough, the average cost is O(n/m) (when m <= n). When m > n/log(n), the average performance of the hash table beats the binary tree, and when m >= n the cost approaches O(1). The value of m can be adjusted, which also shows the kernel's customizability.

However, do not be blindly optimistic: all of this rests on a good enough hash function. How can we judge whether a hash function is good or bad? The Chinese word for hash, 散列, literally means "scattered arrangement": a good hash function spreads all elements evenly across the buckets and avoids, or at least reduces, collisions.
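
To put concrete numbers on the m > n/log(n) threshold above (the figures are my own illustration, not from the kernel):

    n = 1,000,000 entries -> a balanced tree needs about log2(n) ~ 20 comparisons per lookup
    m = 65,536    buckets -> average chain length n/m ~ 15, already beating the tree
    m = 1,048,576 buckets -> average chain length ~ 1, effectively O(1)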
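
The following is a minimal userspace sketch of the bucket-and-chain scheme described above. It illustrates the idea only; it is not kernel code (the kernel's own API lives in <linux/hashtable.h>), and names such as ht_insert and ht_lookup are my own:

    #include <stdio.h>
    #include <string.h>

    #define NBUCKETS 8                      /* m: number of buckets */

    struct entry {
        const char   *key;                  /* index key */
        int           value;
        struct entry *next;                 /* chain of colliding entries */
    };

    static struct entry *buckets[NBUCKETS]; /* the hash table itself */

    /* Toy string hash (h = h*31 + c).  A real table would use something
     * stronger, e.g. the kernel's jhash. */
    static unsigned int hash(const char *key)
    {
        unsigned int h = 0;
        while (*key)
            h = h * 31 + (unsigned char)*key++;
        return h % NBUCKETS;                /* pick a bucket */
    }

    static void ht_insert(struct entry *e)
    {
        unsigned int b = hash(e->key);
        e->next = buckets[b];               /* O(1): push onto the chain */
        buckets[b] = e;
    }

    static struct entry *ht_lookup(const char *key)
    {
        /* Walks one chain only: on average n/m entries. */
        for (struct entry *e = buckets[hash(key)]; e; e = e->next)
            if (strcmp(e->key, key) == 0)
                return e;
        return NULL;
    }

    int main(void)
    {
        struct entry dog = { "dog", 1, NULL }, god = { "god", 2, NULL };
        ht_insert(&dog);
        ht_insert(&god);                    /* "dog" and "god" may share a bucket */
        printf("god -> %d\n", ht_lookup("god")->value);
        return 0;
    }

Insertion is O(1) because it only pushes onto one chain; lookup walks a single chain, whose expected length is n/m when the hash function scatters keys evenly.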
It is necessary to remind everyone that the choice of hash function must be made carefully. If, unfortunately, all elements collide, the hash table degrades into a linked list and its performance is crippled: the time complexity drops straight to O(n). Never trust to luck here, because that is quite dangerous. Historically, weaknesses in Linux kernel hash functions have been exploited to construct large numbers of elements that all collide in the hash table, resulting in denial of service (DoS). That is why most hash functions in the kernel mix in a random number as a parameter, so that the final value cannot be predicted, or at least not easily. This raises a second, security-related requirement: the hash function should preferably be one-way and should be seeded with a random number.

Speaking of one-way functions, you may think of the one-way digest functions MD4 and MD5. Unfortunately, they are not suitable, because a hash-table hash function must above all be fast. Not prepared to start from nothing? Who told you to build the wheel all over again! Let's look at what our predecessors did and fully carry forward the spirit of "borrowism". As Sun Tzu put it, to subdue the enemy without fighting is the supreme art of war; is that not the best strategy? The jhash used in the Linux kernel is a tested and proven hash function that can simply be adopted by CPMS (Copy, Paste, Modify, Save). On his website, Bob Jenkins, the author of jhash, has also published a series of other hash functions, such as perfect hash functions for predictable data; you can choose among them, and if you are interested in further research you can stand on his shoulders.

An explanation of buckets: a hash-table lookup costs O(n/m) on average (where n is the number of objects in the table and m is the number of buckets), which is close to O(1), especially when the hash function has spread the objects evenly across the table and there are more buckets than objects to store. This is also how collisions are resolved: the slot corresponding to one hash result can hold more than one entry. The first object hashed there takes the first position in the bucket; when, for whatever reason, a second object later hashes to the same place, it is stored in the bucket behind the first.

Linux's hash_long function is computed using golden-ratio multiplication. Since the number of buckets (the bits parameter) has to be chosen together with the hash function and the expected collision rate, how do we determine the bucket count for functions such as hash_long? In general, the hash algorithm should be chosen from the characteristics of the data instead of sticking to one generic function. For a hash table that stores IP addresses, for example, using 65536 buckets and the last 16 bits of the address as the index gives a collision rate no higher than what generic functions such as hash_long and jhash achieve.
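
Below is a small userspace sketch of the two ideas just mentioned: the golden-ratio multiplication behind the kernel's hash_32/hash_long, the effect of mixing in a random seed, and the data-aware low-16-bit alternative for IP addresses. The constant 0x61C88647 is the GOLDEN_RATIO_32 value from <linux/hash.h>; using time() as the seed is a userspace stand-in for a kernel random number, and the XOR mixing is a simplified illustration of passing a random initval to jhash, not the kernel's actual code:

    #include <stdio.h>
    #include <stdint.h>
    #include <time.h>

    #define GOLDEN_RATIO_32 0x61C88647u  /* 2^32 / golden ratio, as in <linux/hash.h> */

    /* Kernel-style hash_32(): multiply by the golden-ratio constant and
     * keep the top 'bits' bits, which are the best-mixed ones. */
    static uint32_t hash_32(uint32_t val, unsigned bits)
    {
        return (val * GOLDEN_RATIO_32) >> (32 - bits);
    }

    /* Seeded variant: mixing an unpredictable value into the input is the
     * same defensive idea as giving jhash() a random initval; an attacker
     * can no longer precompute a flood of colliding keys. */
    static uint32_t hash_32_seeded(uint32_t val, unsigned bits, uint32_t seed)
    {
        return hash_32(val ^ seed, bits);
    }

    int main(void)
    {
        uint32_t seed = (uint32_t)time(NULL);  /* stand-in for a kernel random seed */
        uint32_t ip = (192u << 24) | (168u << 16) | (1u << 8) | 7u; /* 192.168.1.7 */

        /* Generic golden-ratio bucket vs. the data-aware choice from the
         * text: with 65536 buckets, just take the low 16 bits of the IP. */
        printf("hash_32 bucket   : %u\n", (unsigned)hash_32(ip, 16));
        printf("seeded bucket    : %u\n", (unsigned)hash_32_seeded(ip, 16, seed));
        printf("low-16-bit bucket: %u\n", (unsigned)(ip & 0xFFFF));
        return 0;
    }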
This is really a compromise between the problem domain and performance. I could take m as large as the whole problem space; that certainly guarantees the key values are scattered, since every key gets its own bucket, but it wastes a great deal of space. Make the table too small, on the other hand, and retrieval efficiency suffers. It seems the balance still has to be found by testing. In addition, I personally think that what makes the hash more flexible than other searchable data structures is precisely its customizability: it can be adjusted according to the actual situation to achieve the optimal effect.
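
As a concrete illustration of the "take the maximum of the problem space" option (the numbers are mine): for 32-bit IPv4 keys, direct addressing means one chain head per possible key,

    2^32 buckets x 8 bytes per pointer = 32 GiB

for the bucket array alone, with zero collisions guaranteed. That cost is prohibitive, which is why a much smaller table plus a good hash function, validated by testing, is the usual choice.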
 
