# The principles of common hash algorithms

Source: Internet
Author: User


A hash table is designed for fast access and is a typical example of trading space for time. As the name implies, the data structure can be understood as a linear table whose elements are not packed tightly: there may be gaps between them.

A hash table (also called a scatter table) is a data structure that is accessed directly by key value. That is, it finds a record by mapping the key to a location in the table, which speeds up the search. The mapping function is called a hash function, and the array that holds the records is called the hash table.

For example, to store 70 elements we might allocate space for 100. The ratio 70/100 = 0.7 is called the load factor. We do this for the sake of fast access: each element is assigned a storage location computed by a fixed function h whose results are as random and evenly distributed as possible, which avoids a linear scan. Because of this randomness, however, collisions are inevitable. A collision occurs when the hash function h maps two elements to the same address; the two elements are then called synonyms. It is like 70 people sitting down to dinner in a restaurant with 100 chairs. The value computed by the hash function is the address of a storage unit, and each storage unit is called a bucket. For a hash table with m buckets, the values of the hash function should lie in [0, m-1].
Resolving collisions is a complex issue. How often collisions occur depends mainly on:
(1) The hash function. The values of a good hash function should be distributed as evenly as possible.
(2) The method used to handle collisions.
(3) The size of the load factor. A high load factor makes collisions more likely, while a low one wastes a lot of space, so neither extreme is good. The load factor and the hash function are interrelated.
Ways to resolve collisions:
(1) Linear probing: after a collision, probe forward linearly until the nearest empty position is found. The disadvantage is clustering. During a lookup, keys that are not synonyms may also sit in the probe sequence, which hurts efficiency.
(2) Double hashing: after a collision at position d, use a second hash function to produce a number c that is coprime with the bucket count m, then probe (d + n*c) % m, so that the probe sequence jumps across the table.
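
The two probing strategies above can be sketched as follows. This is a minimal illustration rather than a production table: the class name `ProbingSketch`, the `EMPTY` marker, the restriction to non-negative int keys, and the second hash function `1 + key % (m - 1)` are all assumptions made for the example. The table length m is assumed to be prime, so that every step c in [1, m-1] is coprime with m.

```java
import java.util.Arrays;

final class ProbingSketch {
    static final int EMPTY = -1; // assumed marker for a free bucket; keys must be non-negative

    static int[] newTable(int m) {
        int[] table = new int[m];
        Arrays.fill(table, EMPTY);
        return table;
    }

    // Linear probing: after a collision at d, try d+1, d+2, ... modulo m.
    // Returns the bucket that was used, or -1 if the table is full.
    static int insertLinear(int[] table, int key) {
        int m = table.length;
        int d = key % m;
        for (int n = 0; n < m; n++) {
            int slot = (d + n) % m;
            if (table[slot] == EMPTY) {
                table[slot] = key;
                return slot;
            }
        }
        return -1;
    }

    // Double hashing: probe (d + n*c) % m, where the step c comes from a second hash.
    // Returns the bucket that was used, or -1 if no free bucket was found.
    static int insertDouble(int[] table, int key) {
        int m = table.length;        // assumed prime, so c is always coprime with m
        int d = key % m;
        int c = 1 + key % (m - 1);   // hypothetical second hash, always in [1, m-1]
        for (int n = 0; n < m; n++) {
            int slot = (d + n * c) % m;
            if (table[slot] == EMPTY) {
                table[slot] = key;
                return slot;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        int[] linear = newTable(7);
        int[] dbl = newTable(7);
        for (int key : new int[] {10, 17, 24}) { // all three keys hash to bucket 3
            System.out.println(key + ": linear -> " + insertLinear(linear, key)
                    + ", double -> " + insertDouble(dbl, key));
        }
    }
}
```

With m = 7, the three synonyms land in consecutive buckets 3, 4, 5 under linear probing, while double hashing scatters them to buckets 3, 2, 4.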
Common methods for constructing hash functions

A hash function makes access to a collection of data more efficient, because data elements can be located more quickly through it:

1. Direct addressing method: use the key itself, or a linear function of the key, as the hash address. That is, H(key) = key or H(key) = a·key + b, where a and b are constants (such hash functions are called self functions).

2. Digit analysis method: analyze the set of keys. Given the dates of birth of a group of employees, for example, we may find that the leading digits (the year) are largely the same, so using them would make collisions very likely, whereas the digits for the month and day vary widely. Using those trailing digits to form the hash address significantly reduces the chance of a collision. Digit analysis therefore means finding the regularities in the keys and using the most varied digits to build hash addresses with a low collision probability.

3. Mid-square method: square the key and take the middle few digits as the hash address.

4. Folding method: split the key into parts with the same number of digits (the last part may be shorter), then take the sum of these parts, discarding the carry, as the hash address.

5. Random number method: choose a random function and use the random value of the key as the hash address. This is usually used when keys differ in length.

6. Division-remainder method: take the remainder of the key divided by a number p that is no greater than the hash table length m. That is, H(key) = key MOD p, p <= m. The modulo can be applied to the key directly, or after a folding or mid-square operation. The choice of p is very important: it is generally a prime or m itself, and a poorly chosen p easily produces synonyms.
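
Three of the construction methods above (mid-square, folding, and division-remainder) can be sketched on decimal keys as follows. The class name `HashConstruction`, the method names, and the choice to work on decimal digits of non-negative keys are assumptions made for illustration, not part of the original description.

```java
final class HashConstruction {
    // Mid-square: square the key and keep `digits` decimal digits from the middle.
    // Assumes the square has at least `digits` digits.
    static int midSquare(int key, int digits) {
        String square = Long.toString((long) key * key);
        int start = (square.length() - digits) / 2;
        return Integer.parseInt(square.substring(start, start + digits));
    }

    // Folding: cut the key into groups of `width` digits (the last group may be
    // shorter), add the groups, and drop the carry beyond `width` digits.
    static int fold(long key, int width) {
        String digits = Long.toString(key);
        long modulus = 1;
        for (int i = 0; i < width; i++) {
            modulus *= 10;
        }
        long sum = 0;
        for (int i = 0; i < digits.length(); i += width) {
            int end = Math.min(i + width, digits.length());
            sum += Long.parseLong(digits.substring(i, end));
        }
        return (int) (sum % modulus);
    }

    // Division-remainder: H(key) = key MOD p, with p no greater than the table length.
    static int divisionRemainder(int key, int p) {
        return key % p;
    }

    public static void main(String[] args) {
        System.out.println(midSquare(1234, 3));          // 1234^2 = 1522756, middle digits 227
        System.out.println(fold(123456789L, 3));         // 123 + 456 + 789 = 1368, carry dropped: 368
        System.out.println(divisionRemainder(1234, 97)); // 1234 = 12 * 97 + 70
    }
}
```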
Performance analysis for lookups

Looking up a key in a hash table is essentially the same process as building the table. Some keys are found directly at the address produced by the hash function; others collide at that address and must be located by following the collision-handling method. Whatever the collision-handling method, a lookup after a collision is still a process of comparing the given value with stored keys. The efficiency of a hash table lookup is therefore still measured by the average search length.

The number of key comparisons during a lookup depends on how many collisions occur: fewer collisions mean a more efficient search, and more collisions mean a less efficient one. The factors that affect the number of collisions are therefore the factors that affect search efficiency. There are three of them:

1. Whether the hash function is uniform;

2. The method used to handle collisions;

3. The load factor of the hash table.

The load factor of a hash table is defined as: α = (number of elements in the table) / (length of the hash table)

α indicates how full the hash table is. Because the table length is fixed, α is proportional to the number of elements in the table: the larger α is, the more elements the table holds and the more likely a collision becomes; the smaller α is, the fewer elements it holds and the less likely a collision becomes.

In fact, the average search length of a hash table is a function of the load factor α; different collision-handling methods simply give different functions.
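
For reference, the commonly quoted textbook approximations for the average length of a successful search, under the usual assumption of a uniform hash function, illustrate how the same α gives different costs for different collision-handling methods. These formulas are not stated in the text above and are quoted here as standard results:

```latex
\begin{aligned}
\text{linear probing:} \quad & S(\alpha) \approx \frac{1}{2}\left(1 + \frac{1}{1-\alpha}\right) \\
\text{double hashing:} \quad & S(\alpha) \approx -\frac{1}{\alpha}\,\ln(1-\alpha)
\end{aligned}
```

With the load factor α = 0.7 from the earlier example, these give roughly 2.17 and 1.72 comparisons respectively.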

Having covered the basic definition of hashing, we cannot skip some famous hash algorithms. MD5 and SHA-1 are arguably the most widely used, and both are based on the design of MD4. So what are they?

Here is a quick look:

(1) MD4

MD4 (RFC 1320) was designed by Ronald L. Rivest of MIT in 1990; MD stands for Message Digest. It is suited to fast software implementation on processors with a 32-bit word length, because it is built on bitwise operations over 32-bit operands.

(2) MD5

MD5 (RFC 1321) is an improved version of MD4 that Rivest produced in 1991. It still processes the input in 512-bit blocks, and its output is the concatenation of four 32-bit words, the same as MD4. MD5 is more complex and slower than MD4, but safer: it holds up better against cryptanalysis and differential attacks.

(3) SHA-1 and others

SHA-1 was designed by NIST and the NSA for use with DSA. It produces a 160-bit hash value for inputs shorter than 2^64 bits, and therefore resists brute-force attacks better. The design of SHA-1 is based on the same principles as MD4 and is modeled on that algorithm.

Collisions are inevitable in a hash table: different keys may produce the same hash address, that is, key1 ≠ key2 but hash(key1) = hash(key2). When building a hash table, therefore, you must choose not only a good hash function but also a method for handling collisions. A hash table can be described as follows: using the chosen hash function H(key) and the chosen collision-handling method, a set of keys is mapped onto a finite, contiguous address set (interval), and the image of each key in that address set is used as the storage location of the corresponding record. Such a table is called a hash table.

For a dynamic lookup table, 1) the table length is not fixed in advance; 2) when the table is designed, only the range of the keys is known, not the exact keys. In general, therefore, a functional relationship has to be established in which f(key) gives the position of the record with key `key` in the table. This function f(key) is usually called the hash function. (Note: it is not necessarily a mathematical function.)

A hash function is a mapping from a set of keys to a set of addresses. It can be defined very flexibly, as long as the size of the address set does not exceed the allowed range.

In practice, hash functions have to be constructed, and only a well-constructed one works well.

So what's the use of these hash algorithms?

The application of hash algorithms in information security is mainly reflected in the following three areas:

(1) File verification

The verification algorithms we are most familiar with are parity checks and CRC checks. Neither can resist deliberate tampering: they can detect, and to some extent correct, channel errors introduced during transmission, but they cannot protect data against malicious modification.

The "digital fingerprint" property of the MD5 hash algorithm has made it the most widely used file integrity checksum algorithm, and many UNIX systems provide a command for computing MD5 checksums.

(2) Digital signatures

Hash algorithms are also an important part of modern cryptosystems. Because asymmetric algorithms are slow, one-way hash functions play an important role in digital signature protocols. Signing the hash value, also known as the digital digest, can be considered statistically equivalent to signing the file itself. This approach has other advantages as well.

(3) Authentication protocols

One such authentication protocol is known as challenge-response authentication. When the transmission channel can be eavesdropped on but not tampered with, it is a simple and secure method.
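
The text does not spell out the protocol, so the following is only one possible sketch of a challenge-response exchange, assuming both sides already share a secret and using HMAC-SHA256 from the Java standard library as the keyed hash. The class name `ChallengeResponseSketch`, the 16-byte challenge size, and the choice of HMAC are assumptions made for illustration.

```java
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.security.SecureRandom;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

final class ChallengeResponseSketch {
    // Server side: issue a fresh random challenge for every login attempt.
    static byte[] newChallenge() {
        byte[] challenge = new byte[16];
        new SecureRandom().nextBytes(challenge);
        return challenge;
    }

    // Client side: prove knowledge of the secret without sending the secret itself.
    static byte[] respond(byte[] sharedSecret, byte[] challenge) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(sharedSecret, "HmacSHA256"));
            return mac.doFinal(challenge);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // Server side: recompute the expected response and compare in constant time.
    static boolean verify(byte[] sharedSecret, byte[] challenge, byte[] response) {
        return MessageDigest.isEqual(respond(sharedSecret, challenge), response);
    }
}
```

An eavesdropper sees only the challenge and the response. Because the hash is one-way, neither reveals the secret, and a recorded response is useless once the server issues a new challenge.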

File hash value

The digital digest of a file (its MD5 hash, for example) is computed by a hash function. Whatever the length of the file, the hash function produces a fixed-length number. Unlike an encryption algorithm, a hash algorithm is an irreversible one-way function. With a high-security hash algorithm such as MD5 or SHA, it is almost impossible for two different files to produce the same hash by accident. Any alteration to the file can therefore be detected.
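
As a small illustration of the fixed-length, change-sensitive digest described above, the sketch below uses `java.security.MessageDigest` from the Java standard library. The class name `DigestSketch` and its helper methods are assumptions for the example; a real file check would feed the file's bytes to the same API.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class DigestSketch {
    // Returns the digest of `data` as lowercase hex, e.g. for "MD5" or "SHA-1".
    static String hexDigest(String algorithm, byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance(algorithm);
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(data)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Convenience wrapper for text input (UTF-8 encoded).
    static String hexDigestOfText(String algorithm, String text) {
        return hexDigest(algorithm, text.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        System.out.println(hexDigestOfText("MD5", "abc")); // 900150983cd24fb0d6963f7d28e17f72
        System.out.println(hexDigestOfText("MD5", "abd")); // a one-character change gives a different digest
    }
}
```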

The term hash function also has another meaning. In practice, a hash function maps a large range onto a small one, usually to save space and make the data easy to store. Hash functions are also widely used for lookups. Before using a hash function, therefore, keep a few limitations in mind:

1. The main principle of hashing is to map a large range onto a small one, so the number of values you actually put in must be equal to or smaller than the size of the small range. Otherwise there will be a great many collisions.
2. Because a hash approximates a one-way function, you can use it to encrypt data.
3. Different applications place different demands on the hash function. A hash function used for encryption is judged mainly by how close it comes to a one-way function, whereas a hash function used for lookups is judged mainly by the collision rate of its mapping onto the small range.
Hash functions for encryption have been discussed extensively elsewhere, and the author's blog covers them in more detail. This article therefore only explores hash functions used for lookups.
The main input of a hash function is an array (for example, a string), and its output is usually an int. All the examples below follow this convention.
In general, hash functions can be roughly divided into the following categories:
1. Additive hash;
2. Bitwise hash;
3. Multiplication hash;
4. Division hash;
5. Table-lookup hash;
6. Mixed hash.
The following sections describe how each of these methods is used in practice.

1. Additive hash
An additive hash adds the input elements one by one to form the final result. A standard additive hash is constructed as follows:

static int additiveHash(String key, int prime)
{
    int hash, i;
    for (hash = key.length(), i = 0; i < key.length(); i++)
        hash += key.charAt(i);
    return (hash % prime);
}

Here prime can be any prime number. As you can see, the result lies in the range [0, prime-1].
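
One consequence of simply summing the characters is that their order is ignored, so any two strings made of the same characters collide. The sketch below demonstrates this; the class name `AdditiveHashDemo` and the sample words are chosen purely for illustration.

```java
final class AdditiveHashDemo {
    // Same construction as the additive hash: start from the length, add every character.
    static int additiveHash(String key, int prime) {
        int hash = key.length();
        for (int i = 0; i < key.length(); i++) {
            hash += key.charAt(i);
        }
        return hash % prime;
    }

    public static void main(String[] args) {
        // Anagrams contain the same characters, so their sums are identical.
        System.out.println(additiveHash("listen", 31) == additiveHash("silent", 31)); // true
    }
}
```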

2. Bitwise hash
This type of hash function mixes the input elements thoroughly by using various bitwise operations (most commonly shifts and XOR). For example, a standard rotating hash is constructed as follows:

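static int rotatingHash(String key, int prime)
{
    int hash, i;
    for (hash = key.length(), i = 0; i < key.length(); ++i)
        hash = (hash << 4) ^ (hash >>> 28) ^ key.charAt(i); // rotate left by 4 (unsigned shift), then mix in the character
    return Math.floorMod(hash, prime); // hash may be negative; floorMod keeps the result in [0, prime-1]
}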

The main feature of this type of hash function is that it shifts first and then applies various bit operations. The hash step in the code above can, for example, be replaced by any of the following variants:

1.  hash = (hash << 5) ^ (hash >>> 27) ^ key.charAt(i);

2.  hash += key.charAt(i);
    hash += (hash << 10);
    hash ^= (hash >> 6);

3.  if ((i & 1) == 0)
    {
        hash ^= (hash << 7) ^ key.charAt(i) ^ (hash >> 3);
    }
    else
    {
        hash ^= ~((hash << 11) ^ key.charAt(i) ^ (hash >> 5));
    }

4.  hash += (hash << 5) + key.charAt(i);

5.  hash = key.charAt(i) + (hash << 6) + (hash >> 16) - hash;

6.  hash ^= ((hash << 5) + key.charAt(i) + (hash >> 2));

3. Multiplication hash
This type of hash function exploits the way multiplication decorrelates its inputs. The best-known use of this property is the random number generator that squares a value and keeps part of the digits, although that algorithm does not work well. For example:

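static int bernstein(String key)
{
    int hash = 0;
    int i;
    for (i = 0; i < key.length(); ++i)
        hash = 33 * hash + key.charAt(i); // Bernstein's hash: multiply by 33, add the character
    return hash;
}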

The hashCode() method of the String class in JDK 5.0 also uses a multiplication hash, except that its multiplier is 31. Recommended multipliers include 131, 1313, 13131, 131313, and so on.
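
The relationship to `String.hashCode()` can be checked directly. The sketch below is a generic multiplication hash with a configurable multiplier; the class name `MultiplicativeHashDemo` and the method name are assumptions for illustration. With a multiplier of 31 it reproduces the value documented for Java's `String.hashCode()`.

```java
final class MultiplicativeHashDemo {
    // hash = multiplier * hash + next character, starting from 0.
    static int multiplicativeHash(String key, int multiplier) {
        int hash = 0;
        for (int i = 0; i < key.length(); i++) {
            hash = multiplier * hash + key.charAt(i);
        }
        return hash;
    }

    public static void main(String[] args) {
        String word = "hello";
        System.out.println(multiplicativeHash(word, 31) == word.hashCode()); // true
        System.out.println(multiplicativeHash(word, 131)); // a different multiplier, a different hash
    }
}
```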
Well-known hash functions of this kind include:

// 32-bit FNV algorithm
int M_SHIFT = 0;
int M_MASK = 0xFFFFFFFF; // assumed: M_MASK is not defined in the original; all bits set keeps the full hash
public int FNVHash(byte[] data)
{
    int hash = (int) 2166136261L;
    for (byte b : data)
        hash = (hash * 16777619) ^ b;
    if (M_SHIFT == 0)
        return hash;
    return (hash ^ (hash >> M_SHIFT)) & M_MASK;
}

And the improved FNV algorithm:

public static int FNVHash1(String data)
{
    final int p = 16777619;
    int hash = (int) 2166136261L;
    for (int i = 0; i < data.length(); i++)
        hash = (hash ^ data.charAt(i)) * p;
    hash += hash << 13;
    hash ^= hash >> 7;
    hash += hash << 3;
    hash ^= hash >> 17;
    hash += hash << 5;
    return hash;
}

Besides multiplying by a fixed number, it is also common to multiply by a number that keeps changing, for example:

static int RSHash(String str)
{
    int b = 378551;
    int a = 63689;
    int hash = 0;

    for (int i = 0; i < str.length(); i++)
    {
        hash = hash * a + str.charAt(i);
        a = a * b;
    }
    return (hash & 0x7FFFFFFF);
}

Although the Adler-32 algorithm is not as widely used as CRC32, it is probably the most famous of the multiplication hashes. For an introduction, see the RFC 1950 specification.

4. Division hash
Like multiplication, division also appears to decorrelate its inputs. Because division is so slow, however, this approach is almost never used in practice. Note that dividing the hash result by a prime, as in the earlier examples, serves only to limit the range of the result. If you do not need it to limit the range, you can replace "hash % prime" with code such as: hash = hash ^ (hash>>10) ^ (hash>>20).
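
A sketch of how that replacement might be used, assuming a table whose size is a power of two so that the range can be limited with a bit mask instead of a division. The class name `FoldAndMask`, the method name, and the mask are assumptions made for illustration.

```java
final class FoldAndMask {
    // Folds the high bits of the hash into the low bits, then keeps the low `bits` bits.
    // `bits` is assumed to be between 1 and 30.
    static int reduce(int hash, int bits) {
        int mask = (1 << bits) - 1;
        hash = hash ^ (hash >> 10) ^ (hash >> 20);
        return hash & mask;
    }

    public static void main(String[] args) {
        int hash = 1 << 20;                    // only a high bit is set
        System.out.println(hash & 1023);       // plain masking loses it: 0
        System.out.println(reduce(hash, 10));  // folding first keeps its influence: 1
    }
}
```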
5. Table-lookup hash
The most famous example of a table-lookup hash is the CRC family of algorithms. Although the CRC algorithms are not inherently table-based, a lookup table is the fastest way to implement them. The following is an implementation of CRC32:

static int[] crctab = {
0x00000000, 0x77073096, 0xee0e612c, 0x990951ba, 0x076dc419, 0x706af48f, 0xe963a535, 0X9E6495A3, 0x0edb8832, 0X79DCB8A4, 0XE0D5E91E, 0x97d2d988, 0X09B64C2B, 0X7EB17CBD, 0xe7b82d07, 0x90bf1d91, 0x1db71064, 0x6ab020f2, 0xf3b97148, 0x84be41de, 0x1adad47d, 0x6ddde4eb, 0xf4d4b551, 0x83d385c7, 0x136c9856, 0x646ba8c0, 0xfd62f97a, 0x8a65c9ec, 0x14015c4f, 0X63066CD9, 0xfa0f3d63, 0x8d080df5, 0x3b6e20c8, 0x4c69105e, 0xd56041e4, 0xa2677172, 0x3c03e4d1, 0x4b04d447, 0XD20D85FD, 0xa50ab56b, 0X35B5A8FA, 0x42b2986c, 0xdbbbc9d6, 0xacbcf940, 0x32d86ce3, 0x45df5c75, 0XDCD60DCF, 0xabd13d59, 0x26d930ac, 0x51de003a, 0xc8d75180, 0xbfd06116, 0x21b4f4b5, 0x56b3c423, 0xcfba9599, 0xb8bda50f, 0x2802b89e, 0x5f058808, 0XC60CD9B2, 0xb10be924, 0x2f6f7c87, 0x58684c11, 0xc1611dab, 0xb6662d3d, 0x76dc4190, 0x01db7106, 0X98D220BC, 0xefd5102a, 0x71b18589, 0x06b6b51f, 0x9fbfe4a5, 0xe8b8d433, 0X7807C9A2, 0x0f00f934, 0x9609a88e, 0xe10e9818, 0X7F6A0DBB, 0x086d3d2d, 0x91646c97, 0XE6635C01, 0x6b6b51f4, 0x1c6c6162, 0x856530d8, 0xf262004e, 0x6c0695ed, 0x1b01a57b, 0x8208f4c1, 0xf50fc457, 0x65b0d9c6, 0x12b7e950, 0x8bbeb8ea, 0xfcb9887c, 0X62DD1DDF, 0x15da2d49, 0X8CD37CF3, 0xfbd44c65, 0x4db26158, 0x3ab551ce, 0xa3bc0074, 0xd4bb30e2, 0x4adfa541, 0x3dd895d7, 0xa4d1c46d, 0XD3D6F4FB, 0x4369e96a, 0X346ED9FC, 0xad678846, 0xda60b8d0, 0x44042d73, 0x33031de5, 0xaa0a4c5f, 0XDD0D7CC9, 0x5005713c, 0X270241AA, 0xbe0b1010, 0xc90c2086, 0x5768b525, 0x206f85b3, 0xb966d409, 0xce61e49f, 0x5edef90e, 0x29d9c998, 0xb0d09822, 0XC7D7A8B4, 0x59b33d17, 0x2eb40d81, 0xb7bd5c3b, 0xc0ba6cad, 0xedb88320, 0x9abfb3b6, 0x03b6e20c, 0x74b1d29a, 0xead54739, 0X9DD277AF, 0x04db2615, 0x73dc1683, 0xe3630b12, 0x94643b84, 0x0d6d6a3e, 0x7a6a5aa8, 0xe40ecf0b, 0x9309ff9d, 0x0a00ae27, 0x7d079eb1, 0xf00f9344, 0x8708a3d2, 0x1e01f268, 0x6906c2fe, 0xf762575d, 0X806567CB,
0x196c3671, 0x6e6b06e7, 0xfed41b76, 0X89D32BE0, 0x10da7a5a, 0X67DD4ACC, 0xf9b9df6f, 0x8ebeeff9, 0x17b7be43, 0x60b08ed5, 0xd6d6a3e8, 0xa1d1937e, 0x38d8c2c4, 0x4fdff252, 0xd1bb67f1, 0xa6bc5767, 0X3FB506DD, 0x48b2364b, 0xd80d2bda, 0XAF0A1B4C, 0x36034af6, 0x41047a60, 0XDF60EFC3, 0xa867df55, 0x316e8eef, 0x4669be79, 0xcb61b38c, 0xbc66831a, 0x256fd2a0, 0x5268e236, 0xcc0c7795, 0xbb0b4703, 0x220216b9, 0x5505262f, 0xc5ba3bbe, 0xb2bd0b28, 0x2bb45a92, 0x5cb36a04, 0xc2d7ffa7, 0xb5d0cf31, 0x2cd99e8b, 0x5bdeae1d, 0x9b64c2b0, 0xec63f226, 0x756aa39c, 0x026d930a, 0x9c0906a9, 0xeb0e363f, 0x72076785, 0x05005713, 0x95bf4a82, 0xe2b87a14, 0x7bb12bae, 0x0cb61b38, 0x92d28e9b, 0xe5d5be0d, 0x7cdcefb7, 0x0bdbdf21, 0x86d3d2d4, 0xf1d4e242, 0x68ddb3f8, 0x1fda836e, 0X81BE16CD, 0xf6b9265b, 0x6fb077e1, 0x18b74777, 0x88085ae6, 0xff0f6a70, 0X66063BCA, 0x11010b5c, 0x8f659eff, 0xf862ae69, 0x616bffd3, 0x166ccf45, 0xa00ae278, 0XD70DD2EE, 0x4e048354, 0X3903B3C2, 0xa7672661, 0xd06016f7, 0X4969474D, 0x3e6e77db, 0XAED16A4A, 0xd9D65ADC, 0x40df0b66, 0x37d83bf0, 0xa9bcae53, 0xdebb9ec5, 0x47b2cf7f, 0x30b5ffe9, 0xbdbdf21c, 0xcabac28a, 0x53b39330, 0x24b4a3a6, 0xbad03605, 0xcdd70693, 0x54de5729, 0X23D967BF, 0xb3667a2e, 0xc4614ab8, 0X5D681B02, 0x2a6f2b94, 0xb40bbe37, 0XC30C8EA1, 0X5A05DF1B, 0x2d02ef8d
};
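static int crc32(String key, int hash)
{
    int i;
    // As in the original, the hash parameter is overwritten with the key length as the starting value.
    for (hash = key.length(), i = 0; i < key.length(); ++i)
        hash = (hash >>> 8) ^ crctab[(hash ^ key.charAt(i)) & 0xff]; // mask after XOR so the index stays in 0..255
    return hash;
}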
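
When a checksum has to match the values produced by other tools, Java's standard library already ships a table-driven CRC-32 in `java.util.zip.CRC32`. The sketch below shows its use; the class name `Crc32LibraryDemo` and the helper method are assumptions for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

final class Crc32LibraryDemo {
    // Returns the standard CRC-32 of the UTF-8 bytes of `text` as an unsigned value in a long.
    static long crc32Of(String text) {
        CRC32 crc = new CRC32();
        crc.update(text.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }

    public static void main(String[] args) {
        // "123456789" is the conventional check input for CRC algorithms.
        System.out.println(Long.toHexString(crc32Of("123456789"))); // cbf43926
    }
}
```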

Other famous examples of table-lookup hashing are Universal Hashing and Zobrist Hashing. Their tables are randomly generated.
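
To illustrate what a randomly generated table looks like in practice, here is a minimal sketch of simple tabulation hashing for 32-bit keys, the same XOR-of-random-entries idea that Zobrist hashing applies to board positions. The class name `TabulationHashSketch`, the seed parameter, and the 4 x 256 table layout are assumptions made for the example.

```java
import java.util.Random;

final class TabulationHashSketch {
    // One table of 256 random ints for each of the 4 bytes of the key.
    private final int[][] table = new int[4][256];

    TabulationHashSketch(long seed) {
        Random rng = new Random(seed); // seeded so that the tables are reproducible
        for (int[] row : table) {
            for (int j = 0; j < row.length; j++) {
                row[j] = rng.nextInt();
            }
        }
    }

    // XOR together the random entry selected by each byte of the key.
    int hash(int key) {
        int h = 0;
        for (int i = 0; i < 4; i++) {
            h ^= table[i][(key >>> (8 * i)) & 0xff];
        }
        return h;
    }

    public static void main(String[] args) {
        TabulationHashSketch hasher = new TabulationHashSketch(42L);
        System.out.println(hasher.hash(12345) == hasher.hash(12345)); // true: same key, same hash
    }
}
```

Because the hash is a plain XOR of table entries, changing one byte of the key only requires XORing out the old entry and XORing in the new one, which is what makes Zobrist hashing cheap to update incrementally.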

6. Mixed hash
A mixed hash algorithm combines the methods above. Many common hash algorithms, such as MD5 and Tiger, fall into this category. They are rarely used as lookup-oriented hash functions.

7. Evaluation of hash algorithms
The page http://www.burtleburtle.net/bob/hash/doobs.html provides an evaluation of several popular hash algorithms. Our recommendations for hash functions are as follows:

1. Hashing strings. The simplest choice is a basic multiplication hash; with a multiplier of 33 it hashes English words very well (strings of fewer than 6 lowercase letters are guaranteed not to collide). A slightly more complex option is the FNV algorithm (and its improved form), which offers good speed and quality for longer strings.

public override unsafe int GetHashCode()
{   // Microsoft System.String string hashing algorithm
    fixed (char* str = ((char*) this))
    {
        char* chPtr = str;
        int num = 0x15051505;
        int num2 = num;
        int* numPtr = (int*) chPtr;
        for (int i = this.Length; i > 0; i -= 4)
        {
            num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[0];
            if (i <= 2)
            {
                break;
            }
            num2 = (((num2 << 5) + num2) + (num2 >> 0x1b)) ^ numPtr[1];
            numPtr += 2;
        }
        return (num + (num2 * 0x5d588b65));
    }
}

2. Hashing long arrays. You can use an algorithm like http://burtleburtle.net/bob/c/lookup3.c, which processes several bytes at a time and is quite fast.
