Hash Algorithm Principles in Detail

Source: Internet
Author: User
I. Concept

A hash table is a data structure that stores data as key-value pairs indexed by key: given a key, we can look up its corresponding value directly.

The idea behind hashing is simple. If all the keys are small integers, a plain unordered array is enough: the key serves as the index and the array element at that index holds the corresponding value, so the value for any key can be accessed quickly. Hashing extends this simple case to keys of more complex types.

There are two steps to using a hash lookup:

1. Use a hash function to convert the lookup key into an array index. Ideally, different keys map to different indexes, but in practice several keys may hash to the same index, which is why a second step is needed.

2. Handle hash collisions. There are many ways to do this; separate chaining (the "zipper" method) and linear probing are introduced later in this article.
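The two steps can be sketched in C as follows. This is only a minimal illustration, not the article's own implementation: it assumes integer keys and values, a hypothetical fixed table size of 7, and separate chaining as the collision strategy. The names `table_put`, `table_get`, and `hash_index` are made up for this sketch.

```c
#include <assert.h>
#include <stdlib.h>

#define TABLE_SIZE 7  // hypothetical table size, chosen only for illustration

// One node in a bucket's linked list.
struct node {
    int key;
    int value;
    struct node *next;
};

// Every bucket starts out empty (NULL) because the array has static storage.
struct node *table[TABLE_SIZE];

// Step 1: convert the key to an array index.
unsigned hash_index(int key)
{
    return (unsigned)key % TABLE_SIZE;
}

// Inserts or updates a key. Returns 0 on success, -1 if allocation fails.
int table_put(int key, int value)
{
    unsigned i = hash_index(key);
    for (struct node *n = table[i]; n != NULL; n = n->next) {
        if (n->key == key) {
            n->value = value;  // key already present: overwrite
            return 0;
        }
    }
    struct node *n = malloc(sizeof *n);
    if (n == NULL)
        return -1;
    n->key = key;
    n->value = value;
    n->next = table[i];        // push onto the front of the chain
    table[i] = n;
    return 0;
}

// Step 2: walk the chain at the index to resolve collisions.
// Returns 1 and stores the value in *value_out if the key is found, else 0.
int table_get(int key, int *value_out)
{
    assert(value_out != NULL);
    for (struct node *n = table[hash_index(key)]; n != NULL; n = n->next) {
        if (n->key == key) {
            *value_out = n->value;
            return 1;
        }
    }
    return 0;
}
```

With this table size, keys 3 and 10 both map to index 3, yet each can still be retrieved because the lookup compares keys while walking the chain.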

A hash table is a classic example of a trade-off between time and space. With no memory limit, the key itself could be used directly as an array index, and every lookup would take O(1) time. With no time limit, an unordered array with sequential search would do, using very little memory. A hash table uses a moderate amount of both time and space to strike a balance between these two extremes, and the balance can be tuned by adjusting the hashing algorithm.

In a hash table there is a fixed relationship between a record's position in the table and its key. Given a key, we can therefore compute its position in advance and fetch the record directly by index, which brings the average search length (ASL) close to 0.

1. A hash function is a mapping from the set of keys to a set of addresses. It can be defined very flexibly, as long as every address it produces stays within the allowed range.

2. Because a hash function is a compressing mapping, collisions arise easily: key1 ≠ key2 while f(key1) = f(key2).

3. Collisions can only be minimized, not avoided altogether, because the key set is usually large (it contains every possible key) while the address set contains only the addresses in the hash table.

When constructing such a lookup table, you therefore need a method for handling collisions in addition to a "good" hash function, that is, one that produces as few collisions as possible.

II. Hash function construction methods

1. Direct Addressing method:

The direct addressing method uses the key k itself, or a linear function of it, as the hash address: H(k) = k or H(k) = a × k + b, where a and b are constants.

Example 1: a demographic table records the number of people at each age from 1 to 100. With age as the key and the hash function returning the key itself, the table looks as shown in Figure (1):

Address   A1    A2    ......   A99   A100
Age       1     2     ......   99    100
Number    980   800   ......   495   107

To find the number of people of a given age, go directly to the corresponding entry. For example, to find the number of people aged 99, read the 99th entry.

Address         A0     A1     ......   A19    A20
Year of birth   1980   1981   ......   1999   2000
Number          980    800    ......   495    107

If instead we want to count people by year of birth, as in the table above, we can use the year of birth minus 1980 as the address: f(key) = key - 1980.

This hash function is simple and never produces collisions between different keys, but it is a rather special case: in practice, key values are rarely contiguous. A hash table built this way can waste a great deal of space, so the method is not widely applicable.

This method is only suitable when the size of the address set equals the size of the key set.
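The birth-year example can be written as a short C sketch. The names `births`, `direct_address`, `FIRST_YEAR`, and `LAST_YEAR` are assumptions made for illustration. With f(key) = key - 1980, the year 1980 maps to slot 0 and the year 2000 maps to slot 20, so 21 slots are enough.

```c
#include <assert.h>

#define FIRST_YEAR 1980
#define LAST_YEAR  2000

// One slot per year of birth; births[i] holds the count for year FIRST_YEAR + i.
int births[LAST_YEAR - FIRST_YEAR + 1];

// Direct addressing: H(key) = key - 1980, a linear function of the key.
int direct_address(int year)
{
    assert(year >= FIRST_YEAR && year <= LAST_YEAR);
    return year - FIRST_YEAR;
}
```

Storing a count is then a single array access, for example `births[direct_address(1981)] = 800;`, and no two years can ever share a slot.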

2. Digital Analysis Method:

Suppose each key in a set consists of s digits (u1, u2, ..., us). Analyze the key set as a whole and extract several evenly distributed digits, or a combination of them, as the address.

In other words, digital analysis takes the more uniformly distributed digit positions of the keys as the hash address. When keys have many digits, analyze them, discard the unevenly distributed positions, and use the remaining ones as the hash value. The method is only suitable when all key values are known in advance: by analyzing their distribution, a large key range is mapped onto a small address range.

Example 2: construct a hash table for n = 80 data elements with table length m = 100. Without loss of generality, only 8 of the keys are analyzed here:

k1=61317602 k2=61326875 k3=62739628 k4=61343634

k5=62706815 k6=62774638 k7=61381262 k8=61394220

Looking at these 8 keys from left to right, the values in digit positions 1, 2, 3, and 6 are concentrated on a few values and are unsuitable for the hash address, while positions 4, 5, 7, and 8 are more evenly distributed, so any two of them can be chosen. If the last two digits are chosen, the hash addresses of the 8 keys are 2, 75, 28, 34, 15, 38, 62, and 20.

This method is suitable when the frequency of each digit in every position of the keys can be estimated in advance.
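As a rough sketch of Example 2, the C code below picks digit positions 7 and 8 out of an eight-digit key. The helper names `digit_at` and `digit_analysis_hash` are hypothetical, and the choice of positions is hard-coded here; in practice it would come from analyzing the actual key set.

```c
#include <assert.h>

// Returns the decimal digit at 1-based position pos, counted from the left,
// of a key that is assumed to have exactly ndigits digits.
int digit_at(long key, int pos, int ndigits)
{
    assert(pos >= 1 && pos <= ndigits);
    for (int i = ndigits; i > pos; i--)
        key /= 10;             // drop the digits to the right of pos
    return (int)(key % 10);
}

// Hash for Example 2: digits 7 and 8 (the last two) form the address.
int digit_analysis_hash(long key)
{
    return digit_at(key, 7, 8) * 10 + digit_at(key, 8, 8);
}
```

Choosing a different pair of evenly distributed positions, such as 4 and 5, only requires changing the two position arguments.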

3. Folding Method:

Split the key into several parts and take their sum as the hash address. There are two ways to add the parts. Shift superposition: align the lowest digits of the parts and add them. Boundary superposition: fold the key back and forth along the part boundaries, then align and add.

More precisely, the folding method divides the key into parts with the same number of digits (the last part may have fewer) and takes the sum of these parts, discarding the final carry, as the hash address. It is suitable when keys have many digits and the digits in each position are roughly uniformly distributed.

In shift superposition, the lowest digits of the parts are aligned and the parts are added. In boundary superposition, the key is folded back and forth along the part boundaries from one end to the other, so every second part is reversed before the aligned parts are added.

Example 4: the hash table length is 1000, the key is key = 110108331119891, and the allowed address space is three decimal digits. The two kinds of superposition give:

Shift superposition      Boundary superposition

        8 9 1                    8 9 1
        1 1 9                    9 1 1
        3 3 1                    3 3 1
        1 0 8                    8 0 1
    +   1 1 0                +   1 1 0
    ---------                ---------
    (1) 5 5 9                (3) 0 4 4

Fig. (2) Hash addresses obtained by the folding method

Shift superposition gives the hash address 559, and boundary superposition gives 44. If the key is a string rather than a number, it can first be converted to a number, for example by using the ASCII codes or ordinal values of its characters.

This method is suitable when the keys have a particularly large number of digits.
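Both kinds of superposition from Example 4 can be sketched in C. The function names are hypothetical, and the sketch assumes three-digit parts taken from the low end of the key, with a key whose digit count is a multiple of three, as in the example.

```c
#include <assert.h>

// Reverses a three-digit part, keeping leading zeros: 108 -> 801, 119 -> 911.
unsigned reverse3(unsigned part)
{
    assert(part < 1000);
    return (part % 10) * 100 + (part / 10 % 10) * 10 + part / 100;
}

// Shift superposition: add the three-digit parts as they are,
// then discard the carry beyond three digits.
unsigned fold_shift(unsigned long long key)
{
    unsigned long long sum = 0;
    while (key > 0) {
        sum += key % 1000;
        key /= 1000;
    }
    return (unsigned)(sum % 1000);
}

// Boundary superposition: every second part, counted from the low end,
// is reversed before the parts are added.
unsigned fold_boundary(unsigned long long key)
{
    unsigned long long sum = 0;
    int reversed = 0;
    while (key > 0) {
        unsigned part = (unsigned)(key % 1000);
        sum += reversed ? reverse3(part) : part;
        reversed = !reversed;
        key /= 1000;
    }
    return (unsigned)(sum % 1000);
}
```

For the key 110108331119891, `fold_shift` adds 891 + 119 + 331 + 108 + 110 = 1559 and keeps 559, while `fold_boundary` adds 891 + 911 + 331 + 801 + 110 = 3044 and keeps 44.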

4. Mid-square method

This is a common way to construct a hash function. First square the key, then, depending on the size of the available address space, take several digits from the middle of the square as the hash address.

The hash function is H(key) = "the middle digits of key^2". Squaring widens the differences between keys, and the middle digits of the square depend on every digit of the key, so different keys are unlikely to collide and the resulting hash addresses are more uniform.

Example 5: if the hash table length is 1000, the middle three digits of the squared key can be used, as shown in the table:

Key     Square of the key   Hash function value
1234    1522756             227
2143    4592449             924
4132    17073424            734
3214    10329796            297

A hash function for the mid-square method is given below.

// Mid-square hash function, assuming the key is a 32-bit integer.
// Returns the middle 10 bits (bits 11 to 20) of key * key.
int Hash(int key)
{
    // Square the key in unsigned arithmetic so that overflow wraps around
    // modulo 2^32 instead of being undefined behavior.
    unsigned int square = (unsigned int)key * (unsigned int)key;

    // Discard the low 11 bits.
    square >>= 11;

    // Return the low 10 bits of what remains (the middle 10 bits of key * key).
    return (int)(square % 1024);
}

This method is suitable when, in every digit position of the keys, certain digits occur with high frequency.
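The decimal version from Example 5 can be sketched in the same way. The function name `mid_square_decimal` is hypothetical, and the rule "drop the two lowest digits of the square, keep the next three" is an assumption that happens to reproduce the four table rows for four-digit keys; other key lengths would need a different cut.

```c
#include <assert.h>

// Mid-square hash for four-digit decimal keys and a table of length 1000:
// square the key, drop the two lowest decimal digits, keep the next three.
unsigned mid_square_decimal(unsigned key)
{
    assert(key <= 9999);  // keeps the square within 8 decimal digits
    unsigned long long square = (unsigned long long)key * key;
    return (unsigned)(square / 100 % 1000);
}
```

For example, 1234 squared is 1522756; dropping the last two digits leaves 15227, and the last three digits of that are 227.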


5. Subtraction method

The subtraction method subtracts a fixed value from the key to obtain the storage location of the data.

Example 7: a company has 100 employees, with employee numbers from 1001 to 1100. The subtraction method takes the employee number minus 1000 as the storage location: the data for employee 1001 is the first record, the data for employee 1002 is the second record, and so on. This covers every employee, because no numbers below 1001 are in use.

6. Radix conversion method

Treat the decimal key x as if it were written in another base, such as 13, convert that base-13 number to decimal, and extract some of the resulting digits as the hash value of x. The new base is generally chosen to be larger than the original one, and the two bases should be coprime.

Example: Hash(80127429) = (80127429)_13 = 8×13^7 + 0×13^6 + 1×13^5 + 2×13^4 + 7×13^3 + 4×13^2 + 2×13^1 + 9 = (502432641)_10. If the middle three digits are taken as the hash value, Hash(80127429) = 432.
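The conversion in this example can be sketched in C as follows. The function names are made up for illustration, and taking the "middle three digits" as digits 4 to 6 of a nine-digit result is an assumption that fits this one example.

```c
#include <assert.h>

// Reads the decimal digits of x as if they were digits in the given base
// and returns the resulting value, e.g. 80127429 read in base 13.
unsigned long long reinterpret_in_base(unsigned long long x, unsigned base)
{
    assert(base >= 10);  // every decimal digit must be a valid digit in base
    unsigned long long result = 0;
    unsigned long long weight = 1;  // base^0, base^1, base^2, ...
    while (x > 0) {
        result += (x % 10) * weight;
        weight *= base;
        x /= 10;
    }
    return result;
}

// Hash for the example: convert using base 13, then take the middle three
// digits of the nine-digit decimal result.
unsigned radix_hash(unsigned long long x)
{
    return (unsigned)(reinterpret_in_base(x, 13) / 1000 % 1000);
}
```

Here `reinterpret_in_base(80127429, 13)` evaluates to 502432641, and dividing by 1000 before taking the remainder modulo 1000 isolates the digits 432.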

To obtain a good hash function, several methods can be combined, for example converting the radix first and then folding or squaring. Any combination is acceptable as long as the resulting hash values are evenly distributed.

7. Division (remainder) method:

Let the hash table length be m and let p be the largest prime number less than or equal to m. The hash function is

H(k) = k % p, where % is the remainder (modulo) operation.

For example, suppose the keys to be hashed are (18, 75, 60, 43, 54, 90, 46), the table length is m = 10, and p = 7. Then

H(18) = 18 % 7 = 4    H(75) = 75 % 7 = 5    H(60) = 60 % 7 = 4

H(43) = 43 % 7 = 1    H(54) = 54 % 7 = 5    H(90) = 90 % 7 = 6

H(46) = 46 % 7 = 4

There are several collisions here. To reduce collisions, larger values of m and p can be used, such as m = p = 13, which gives:

H(18) = 18 % 13 = 5    H(75) = 75 % 13 = 10    H(60) = 60 % 13 = 8

H(43) = 43 % 13 = 4    H(54) = 54 % 13 = 2     H(90) = 90 % 13 = 12

H(46) = 46 % 13 = 7

There are no collisions now, as shown in Figure 8.25.

Address   0   1   2   3   4   5   6   7   8   9   10  11  12
Key               54      43  18      46  60      75      90
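The comparison between p = 7 and p = 13 can be checked with a small C sketch. The names `division_hash` and `count_collisions` are hypothetical; the collision count is simply the number of keys that land on an address already used by an earlier key.

```c
#include <assert.h>
#include <stddef.h>

// Division method: H(k) = k % p.
unsigned division_hash(unsigned key, unsigned p)
{
    assert(p > 0);
    return key % p;
}

// Counts the keys whose address is already taken by an earlier key in the array.
size_t count_collisions(const unsigned *keys, size_t n, unsigned p)
{
    size_t collisions = 0;
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < i; j++) {
            if (division_hash(keys[i], p) == division_hash(keys[j], p)) {
                collisions++;
                break;  // count each colliding key once
            }
        }
    }
    return collisions;
}
```

For the keys (18, 75, 60, 43, 54, 90, 46), three keys collide when p = 7 (60, 54, and 46), and none collide when p = 13.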
