Hash tables and hash functions are part of the university data structures curriculum. In practice we often use the Hashtable structure: when storing key-value pairs, Hashtable gives better lookup performance than ArrayList. Why is that? And while we enjoy the high performance, what price do we pay? (I have been watching The Red-Top Merchant Hu Xueyan these days, which has a classic line: before you enjoy what others cannot, you must first endure the hardship and humiliation that others cannot bear.) Is using Hashtable then a deal that is all profit and no cost? The following analysis looks at this question; comments are welcome.
One, why hashing makes key-value lookups fast
Anyone who has studied data structures knows that in linear lists and trees, the position of a record in the structure is arbitrary: there is no fixed relationship between a record's position and its key, so finding a record takes a series of key comparisons. Such lookups are comparison-based. In .NET, collection structures such as Array, ArrayList, and List use this storage model.
For example, suppose we have data for a class of students, including name, gender, age, and student number. Suppose the data is:
Name     | Gender | Age | Student number
Tom      | Male   | 15  | 1
John Doe | Female | 14  | 2
Harry    | Male   | 14  | 3
Suppose we look up by name with a lookup function FindByName(string name):
1) Find "Tom"
One comparison against the first row is enough.
2) Find "Harry"
Compare with the first row: no match.
Compare with the second row: no match.
Compare with the third row: match.
These two cases are the best case and the worst case, so the average lookup takes (1+3)/2 = 2 comparisons; that is, the average number of comparisons is (total number of records + 1)/2.
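The comparison counts above can be checked with a minimal Python sketch. The list `students` and the helpers `find_by_name` and `average_comparisons` are hypothetical names chosen for illustration; they are not the .NET function mentioned in the text.

```python
# Hypothetical records mirroring the student table above.
students = ["Tom", "John Doe", "Harry"]

def find_by_name(names, target):
    """Linear search; returns (index, comparisons), with index -1 if absent."""
    comparisons = 0
    for index, name in enumerate(names):
        comparisons += 1
        if name == target:
            return index, comparisons
    return -1, comparisons

def average_comparisons(names):
    """Average comparisons over one successful lookup of every record."""
    total = sum(find_by_name(names, name)[1] for name in names)
    return total / len(names)
```

For three records the average comes out as (1 + 2 + 3) / 3 = 2, which agrees with (3 + 1) / 2.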
Some optimized algorithms, such as binary search over sorted data, make lookups more efficient, but their complexity is still on the order of log2 N.
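As an illustration of a comparison-based lookup that stays within the log2 N range, here is a small Python sketch of binary search over sorted keys. The function name `binary_search` and the probe counter are assumptions made for this example.

```python
def binary_search(sorted_keys, target):
    """Binary search; returns (index, probes), with index -1 if absent."""
    probes = 0
    low, high = 0, len(sorted_keys) - 1
    while low <= high:
        mid = (low + high) // 2
        probes += 1  # one probe per element examined
        if sorted_keys[mid] == target:
            return mid, probes
        if sorted_keys[mid] < target:
            low = mid + 1
        else:
            high = mid - 1
    return -1, probes
```

With 1024 sorted keys, no lookup needs more than 11 probes, compared with up to 1024 for a linear scan. It is still more than the single step that hashing aims for.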
How can we find a record faster? The ideal is to go straight to the record's location in one step, with a time complexity of O(1), the fastest lookup possible. Suppose we assign every record a number in advance according to a known rule and arrange the records by those numbers. To look up a record, we only need to compute its number from the rule and then go directly to that position in the linear sequence of records.
Note that this description contains two concepts. The first is the rule for numbering the students, which in data structures is called a hash function. The second is the structure in which the students are arranged according to that rule, which is called a hash table.
Taking the students above as an example again, suppose the student number is the rule. The teacher has a rule table at hand and seats the students in order according to it. To find John Doe, the teacher first applies the rule: John Doe's number is 2, so he sits in seat 2, and the teacher walks straight over: "John Doe, ha, there you are!"
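The seating example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `seat_of` plays the role of the hash function, `seats` plays the role of the hash table, and both names (and the table size) are hypothetical.

```python
TABLE_SIZE = 4  # seats 0..3; seat 0 stays empty in this toy example

def seat_of(student_number):
    """Hash function: the numbering rule. Here the number is the seat itself."""
    return student_number

# Hash table: students arranged according to the rule.
seats = [None] * TABLE_SIZE
for name, number in [("Tom", 1), ("John Doe", 2), ("Harry", 3)]:
    seats[seat_of(number)] = name

def find_by_number(student_number):
    """One computation and one array access, with no comparisons against other records."""
    return seats[seat_of(student_number)]
```

Calling `find_by_number(2)` goes directly to seat 2, just as the teacher does.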
Look at the overall process:
From the diagram above, you can see that a hash table can be described as two parts: one holds the position numbers of the records, the other holds the records themselves, and a set of rules describes the connection between a record and its number. How are these rules usually built?
a) Direct addressing method:
In my previous article comparing GetHashCode() performance, the GetHashCode() function of an integer returns the integer itself, which is direct addressing. For example, take a set of values from 0 to 100 that represent people's ages.
Then the hash table built by direct addressing is:
0           | 1          | 2           | 3           | 4           | 5           | ...
0 years old | 1 year old | 2 years old | 3 years old | 4 years old | 5 years old | ...
This addressing method is simple and convenient. It suits cases where the original data can be expressed as numbers or has a clear sequential relationship.
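A minimal Python sketch of direct addressing for the age example follows. The names `MAX_AGE`, `hash_age`, and `build_age_table` are hypothetical, and counting how many people have each age is just one possible use of such a table.

```python
MAX_AGE = 100  # ages 0..100, as in the example above

def hash_age(age):
    """Direct addressing: the key itself is the table address."""
    if not 0 <= age <= MAX_AGE:
        raise ValueError("age out of range")
    return age

def build_age_table(ages):
    """Count how many people have each age, one slot per possible age."""
    table = [0] * (MAX_AGE + 1)
    for age in ages:
        table[hash_age(age)] += 1
    return table
```

The table always has 101 slots, however few ages are actually stored, which is the space cost of this kind of rule.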
b) Digit analysis method:
Suppose we have the following data describing some people's dates of birth:
Year | Month | Day
75   | 10    | 1
75   | 12    | 10
75   | 02    | 14
Analysis shows that the year and the first digit of the month are almost always the same, so using them would make collisions very likely, while the last three digits vary considerably. The last three digits are therefore used as the hash address.
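The analysis can be sketched in Python as follows. The sketch assumes each date is written as a zero-padded six-digit string (YYMMDD), which the table does not state explicitly; the function names are hypothetical.

```python
# Birth dates from the table above, assumed zero-padded to YYMMDD.
birth_dates = ["751001", "751210", "750214"]

def distinct_digits_per_position(keys):
    """For each digit position, count how many different digits occur."""
    return [len(set(column)) for column in zip(*keys)]

def digit_analysis_hash(key, positions=(3, 4, 5)):
    """Build the hash address from the chosen (most varied) digit positions."""
    return int("".join(key[i] for i in positions))
```

On this tiny sample the first two positions never vary, so they carry no information, while the last position varies the most. With the last three positions the three dates map to 1, 210, and 214.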
c) Mid-square method:
Take the middle digits of the square of the key as the hash address.
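A short Python sketch of the mid-square idea, assuming integer keys and a caller-chosen number of middle digits. The name `mid_square_hash` and the `digits` parameter are hypothetical.

```python
def mid_square_hash(key, digits=2):
    """Square the key and take `digits` digits from the middle of the result."""
    squared = str(key * key).zfill(digits)  # pad very small squares
    start = (len(squared) - digits) // 2
    return int(squared[start:start + digits])
```

For example, 1234 squared is 1522756, and the three middle digits are 227.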
d) Folding method:
Split the key into parts with the same number of digits (the last part may have fewer digits), then add the parts together and discard the carry; the result is the hash address. For example, take the key 20-1445-4547-3:
    5473
  + 4454
  +  201
  = 10128
Discard the carry 1 and take 0128 as the hash address.
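The folding example above can be reproduced with a small Python sketch. The function name `folding_hash` and the choice to split from the right in four-digit parts are assumptions that match the worked sum.

```python
def folding_hash(key_digits, part_length=4):
    """Split the digits into parts from the right, add them, drop the carry."""
    digits = key_digits.replace("-", "")
    parts = []
    end = len(digits)
    while end > 0:
        start = max(0, end - part_length)
        parts.append(int(digits[start:end]))
        end = start
    total = sum(parts)
    # Keeping only the lowest part_length digits discards the carry.
    return total % (10 ** part_length)
```

`folding_hash("20-1445-4547-3")` adds 5473, 4454, and 201 to get 10128 and keeps 128, which is the address 0128 when written with four digits.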
e) Remainder method:
Divide the key by a number p that is no greater than the hash table length m, and take the remainder as the hash address: H(key) = key MOD p (p <= m).
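In Python the formula H(key) = key MOD p is a one-liner. The sketch below assumes non-negative integer keys; the function name `remainder_hash` and the sample values (a table length of 13 with p = 13) are chosen only for illustration.

```python
def remainder_hash(key, p):
    """H(key) = key MOD p, where p must not exceed the table length m."""
    if p <= 0:
        raise ValueError("p must be positive")
    return key % p

# Hypothetical usage: a table of length m = 13 with p = 13.
m = 13
table = [None] * m
for key in (10, 22, 31):
    table[remainder_hash(key, m)] = key
```

Because the remainder is always between 0 and p - 1, every address fits inside a table of length m as long as p <= m.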
f) Random number method
Choose a random function and take its value for the key as the hash address, that is, H(key) = random(key), where random is a random function. This method is usually used when the keys have different lengths.
In summary, the job of a hash function is to spread the keys reasonably evenly over a sequential structure of a specified size through some transformation. The more dispersed the keys, the lower the time complexity of a lookup and the higher the space complexity.
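One way to read H(key) = random(key) is a pseudo-random generator seeded with the key, so that the same key always gives the same address. The Python sketch below follows that reading, which is an assumption rather than something the text specifies; `random_hash` and `table_size` are hypothetical names.

```python
import random

def random_hash(key, table_size):
    """Seed a private generator with the key and draw one address from it."""
    rng = random.Random(key)  # same key, same seed, same address
    return rng.randrange(table_size)
```

Keys of any length, such as "Tom" and "John Doe", are handled the same way, which fits the note that this method suits keys of unequal length.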
Two, what do we pay for using a hash?
Hashing is a typical trade of space for time. Take an array of length 100: to search it, we only need to traverse it and compare against each record. In terms of space, if the array stores byte data, it occupies 100 bytes. Now suppose we use hashing. As we said, a hash needs a rule that fixes the relationship between a key and its storage location, so it needs a hash table of fixed length. The data is still a 100-byte array, and if we assume another 100 bytes are needed to record the relationship between keys and positions, the total space becomes 200 bytes.
The size of the table that records the rule can also vary with the rule. In the LZW algorithm, for example, a very long byte array of pixels has to be mapped onto a table that records the relationship between positions and keys, and the algorithm recommends a table size that can be expressed as a 12-bit integer. How can an arbitrarily long pixel array be spread over such a fixed-length table? The LZW algorithm uses variable-length codes, which will be covered in an in-depth introduction to LZW.
Note: the most prominent problem with hash tables is collisions, that is, two keys may be mapped by the hash function to the same index position. This issue will be explained in the next article.
Note: the reason for this brief introduction to hashing is to better understand the LZW algorithm, and the reason for studying LZW is to better understand the GIF file structure. Finally, I will describe in detail how a GIF file is composed and how to manipulate this type of file efficiently.