Reprinted from: http://blog.csdn.net/zxycode007/article/details/6999984

A hash table is built for **fast access** and is a typical **space-for-time trade-off**. As the name suggests, it can be understood as a linear table whose elements are not packed tightly together; there may be gaps between them.

A hash table is a data structure that is accessed directly by key. That is, it maps the key to a position in the table and reads the record there, which speeds up searching. This mapping function is called a hash function, and the array storing the records is called a hash table.

For example, to store 70 elements we might allocate space for 100; 70/100 = 0.7 is called the load factor. The spare room serves the goal of fast access. A fixed function H assigns each element a storage location and spreads the results as randomly and evenly as possible, which avoids a linear, traversal-style search. This randomness inevitably leads to conflicts (collisions): a conflict occurs when the hash function H gives two elements the same address, and those two elements are called synonyms. It is like 70 people entering a restaurant with 100 chairs, where two people may head for the same chair. The result of the hash function is a storage unit address, and each storage unit is called a bucket. If a hash table has m buckets, the value range of the hash function should be [0, m-1].
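As a rough illustration of these terms, the following Java sketch computes a load factor and maps string keys to one of m buckets. The class and method names are hypothetical, and using Java's built-in `String.hashCode()` as the function H is an assumption made only for this example. The strings "Aa" and "BB" happen to have the same `hashCode()` (2112), so they are synonyms for any m.

```java
// Hypothetical demo class; String.hashCode() stands in for the hash function H.
class BucketDemo {
    // Map a key to a bucket index in [0, m-1].
    static int bucketOf(String key, int m) {
        // floorMod keeps the index non-negative even when hashCode() is negative.
        return Math.floorMod(key.hashCode(), m);
    }

    // Load factor = stored elements / available buckets.
    static double loadFactor(int elements, int buckets) {
        return (double) elements / buckets;
    }

    public static void main(String[] args) {
        int m = 100;
        System.out.println(loadFactor(70, m));
        // Both keys land in the same bucket, so they collide.
        System.out.println(bucketOf("Aa", m) + " " + bucketOf("BB", m));
    }
}
```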

Resolving conflicts is a complex problem. How often conflicts occur depends mainly on:

(1) The hash function. The values of a good hash function should be distributed as evenly as possible.

(2) The conflict handling method.

(3) The load factor. A larger table (a smaller load factor) is not necessarily better, because it wastes a lot of space; the appropriate load factor also depends on the hash function.

Methods for resolving conflicts:

(1) Linear probing: after a conflict, probe forward one slot at a time until the nearest empty position is found. The disadvantage is clustering: during access, keys that are not synonyms may also end up in the probe sequence, which hurts efficiency.

(2) Double hashing: after a conflict at position d, use a second hash function to generate a number c that is coprime with the number of buckets m, then probe (d + n * c) % m for n = 1, 2, ... so that the probe sequence jumps across the table.
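The two strategies can be sketched in Java as follows. This is a minimal sketch, not a full hash table: the class name is hypothetical, keys are plain ints, and H(key) = key mod m is assumed as the primary hash function.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; H(key) = key mod m is an assumption.
class ProbeDemo {
    // Linear probing: try d, d+1, d+2, ... (mod m) until an empty slot is found.
    // Returns the slot used, or -1 if the table is full.
    static int insertLinear(Integer[] table, int key) {
        int m = table.length;
        int d = Math.floorMod(key, m);
        for (int n = 0; n < m; n++) {
            int slot = (d + n) % m;
            if (table[slot] == null) {
                table[slot] = key;
                return slot;
            }
        }
        return -1;
    }

    // Double hashing: the probe sequence (d + n * c) % m for n = 0, 1, ..., m-1.
    // c must be coprime with m so that every bucket is visited exactly once.
    static List<Integer> doubleHashProbes(int d, int c, int m) {
        List<Integer> seq = new ArrayList<>();
        for (int n = 0; n < m; n++) {
            seq.add((d + n * c) % m);
        }
        return seq;
    }
}
```

With 7 buckets, inserting 10, 17 and 4 in that order fills slots 3, 4 and 5. The key 17 is a synonym of 10, but 4 is not, and it is still pushed along by the cluster. By contrast, `doubleHashProbes(3, 3, 7)` visits 3, 6, 2, 5, 1, 4, 0.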

**Common Methods for constructing Hash Functions**

A hash function makes access to a data sequence faster and more effective, because data elements can be located more quickly. Common construction methods are:

**1. Direct addressing:** Use the key itself, or a linear function of the key, as the hash address. That is, H(key) = key or H(key) = a * key + b, where a and b are constants (such a hash function is called a self function).

**2. Digit analysis:** Analyze the data set, for example the birth dates (year, month, day) of a group of employees. The leading digits, which encode the year, are roughly the same for everyone, so using them would cause many conflicts. The trailing digits, which encode the month and the day, vary widely, so building the hash address from them significantly reduces the probability of conflict. Digit analysis therefore means finding the regularities in the keys and using them to construct hash addresses with a low probability of conflict.

**3. Mid-square method:** Square the key and take the middle digits of the result as the hash address.

**4. Folding method:** Split the key into several parts with the same number of digits (the last part may have fewer), and use the sum of these parts, with the carry discarded, as the hash address.

**5. Random number method:** Choose a random function and use the random function value of the key as the hash address. This is usually used when keys differ in length.

**6. Division-remainder method:** Use the remainder of the key divided by a number p that is no larger than the table length m as the hash address. That is, H(key) = key mod p, p <= m. The modulo can be applied to the key directly, or after folding or squaring. The choice of p is very important: it is generally a prime number or m itself. A poor choice of p easily produces synonyms.
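Several of these constructions can be sketched in a few lines of Java. The class name, the digit positions kept by the mid-square method, and the group width used for folding are illustrative assumptions; keys are assumed to be non-negative.

```java
// Hypothetical helper class; digit positions and widths are illustrative choices.
class HashConstruction {
    // Direct addressing: H(key) = a * key + b.
    static int direct(int key, int a, int b) {
        return a * key + b;
    }

    // Mid-square: square the key and keep two middle digits of the square
    // (the hundreds and thousands places, i.e. the middle of a six-digit square).
    static int midSquare(int key) {
        long square = (long) key * key;
        return (int) ((square / 100) % 100);
    }

    // Folding: cut the decimal digits into groups of `width` from the left
    // (the last group may be shorter), add the groups, and drop the carry.
    static int fold(long key, int width) {
        String digits = Long.toString(key);
        int modulus = 1;
        for (int i = 0; i < width; i++) {
            modulus *= 10;
        }
        int sum = 0;
        for (int i = 0; i < digits.length(); i += width) {
            int end = Math.min(i + width, digits.length());
            sum += Integer.parseInt(digits.substring(i, end));
        }
        return sum % modulus;
    }

    // Division-remainder: H(key) = key mod p, with p <= m.
    static int divisionRemainder(int key, int p) {
        return Math.floorMod(key, p);
    }
}
```

For example, 321 squared is 103041, so `midSquare(321)` keeps the digits 30; `fold(123456789, 3)` adds 123 + 456 + 789 = 1368 and drops the carry to give 368; `divisionRemainder(1234, 97)` is 70.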

**Search Performance Analysis**

The search process in a hash table is basically the same as the process of building it. Some keys are found directly at the address computed by the hash function; others collide at that address and must be searched for using the conflict handling method. In the conflict handling methods described above, searching after a conflict is still a matter of comparing the given value with stored keys, so the average search length is still used to measure hash search efficiency.

The number of key comparisons during a search depends on the number of conflicts: fewer conflicts mean higher search efficiency, and more conflicts mean lower efficiency. The factors that affect the number of conflicts are therefore the factors that affect search efficiency. There are three:

1. Whether the hash function is uniform;

2. The conflict handling method;

3. The fill factor of the hash table.

The fill factor of the hash table is defined as: α = number of elements in the table / length of the hash table

α indicates how full the hash table is. Since the table length is fixed, α is proportional to the number of elements in the table. The larger α is, the more elements the table holds and the more likely conflicts are; the smaller α is, the fewer elements the table holds and the less likely conflicts are.

In fact, the average search length of a hash table is a function of the fill factor α, though the function differs between conflict handling methods.
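To make the dependence on α concrete, the sketch below evaluates the classic textbook approximations for linear probing and for double hashing. These formulas are not taken from this article; they are the standard estimates derived under idealized uniform-hashing assumptions, and the class and method names are hypothetical.

```java
// Hypothetical helper class; formulas are the standard textbook approximations.
class SearchLength {
    // Linear probing, successful search: (1 + 1/(1 - a)) / 2
    static double linearProbingHit(double a) {
        return 0.5 * (1 + 1 / (1 - a));
    }

    // Linear probing, unsuccessful search: (1 + 1/(1 - a)^2) / 2
    static double linearProbingMiss(double a) {
        return 0.5 * (1 + 1 / ((1 - a) * (1 - a)));
    }

    // Double hashing, successful search: -ln(1 - a) / a
    static double doubleHashingHit(double a) {
        return -Math.log(1 - a) / a;
    }

    // Double hashing, unsuccessful search: 1 / (1 - a)
    static double doubleHashingMiss(double a) {
        return 1 / (1 - a);
    }
}
```

At α = 0.5 these give about 1.5 and 2.5 comparisons for linear probing, and about 1.39 and 2 for double hashing. All four grow without bound as α approaches 1.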

Having covered the basic definition of hashing, we should mention some famous hash algorithms. MD5 and SHA-1 are among the most widely used hash algorithms, and both are designed on the basis of MD4. So what are they?

Here is a brief introduction:

**(1) MD4**

MD4 (RFC 1320) was designed by Ronald L. Rivest of MIT in 1990. MD stands for Message Digest. It is suited to fast software implementation on processors with a 32-bit word length, because it is built on operations over 32-bit operands.

**(2) MD5**

MD5 (RFC 1321) is Rivest's 1991 improvement of MD4. It still processes the input in 512-bit blocks, and its output is the concatenation of four 32-bit words, the same as MD4. MD5 is more complex than MD4 and a little slower, but it is more secure and resists cryptanalysis and differential attacks better.

**(3) SHA-1 and others**

SHA-1 was designed by NIST and the NSA for use with DSA. For an input shorter than 2^64 bits it produces a 160-bit hash value, so it resists brute-force attacks better. SHA-1 is designed on the same principles as MD4 and is modeled on that algorithm.

Collision: **different keys may map to the same hash address**, that is, key1 ≠ key2 while H(key1) = H(key2). Therefore, when building a hash table you need not only a good hash function but also a method for handling conflicts. A hash table can thus be described as follows: using the chosen hash **function** H(key) and the selected **conflict handling method**, map a set of keys onto a **finite**, **contiguous** address set, and use **the image of each key in the address set as the storage location of the corresponding record**. A table built this way is called a hash table.

For a dynamic search table, 1) the table length is not known in advance; 2) when designing the table, only the range of the keys is known, not the exact keys. We therefore usually establish a functional relationship and use f(key) as the position in the table of the record whose key is key. This function f(key) is usually called a hash function. (Note: it is not necessarily a mathematical function.)

A hash function is a mapping from the set of keys to a set of addresses. It can be defined very flexibly, as long as the size of the address set does not exceed the permitted range.

In practice a hash function has to be constructed, and only a well-constructed one works well.

So what are the purposes of these hash algorithms?

The application of the hash algorithm in information security is mainly reflected in the following three aspects:

**(1) File verification**

The familiar verification algorithms are parity checks and CRC checks. Neither can resist tampering: to a certain extent they can detect and correct channel errors in data transmission, but they cannot prevent malicious corruption of data.

The "digital fingerprint" property of the MD5 hash algorithm has made it the most widely used file integrity checksum algorithm. Many Unix systems provide a command for computing MD5 checksums.

**(2) Digital signature**

Hash algorithms are also an important part of modern cryptographic systems. Because asymmetric algorithms are slow, one-way hash functions play an important role in digital signature protocols. Signing the hash value of a file, also called its digital digest, can be regarded as statistically equivalent to signing the file itself. Such a protocol has other advantages as well.

**(3) Authentication protocol**

This kind of authentication protocol is also known as the challenge-response mode: it is a simple and secure method when the transmission channel can be eavesdropped on but not tampered with.
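One possible shape of such an exchange is sketched below. It is an assumption-laden illustration rather than a specific protocol from this article: the class and method names are hypothetical, the two parties are assumed to share a secret in advance, and SHA-256 over the challenge followed by the secret is chosen only for simplicity (a real system would more likely use a keyed construction such as HMAC).

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.security.SecureRandom;

// Hypothetical sketch of a hash-based challenge-response exchange.
class ChallengeResponse {
    // The verifier picks a fresh random challenge for every login attempt.
    static byte[] newChallenge() {
        byte[] nonce = new byte[16];
        new SecureRandom().nextBytes(nonce);
        return nonce;
    }

    // The prover answers with hash(challenge || secret); the secret itself
    // never crosses the channel, so an eavesdropper only sees the digest.
    static byte[] respond(byte[] challenge, String secret) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            md.update(challenge);
            md.update(secret.getBytes(StandardCharsets.UTF_8));
            return md.digest();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // The verifier recomputes the expected response and compares.
    static boolean verify(byte[] challenge, String secret, byte[] response) {
        return MessageDigest.isEqual(respond(challenge, secret), response);
    }
}
```

Because each challenge is random, a response captured by an eavesdropper is of no use for a later challenge.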

**File hash value**

The MD5 hash, the digital digest of a file, is calculated by a hash function. Regardless of the file length, the hash function returns a fixed-length number. Unlike an encryption algorithm, a hash algorithm is an irreversible one-way function. With a highly secure hash algorithm it is practically impossible for two different files to produce the same hash result, so any modification of a file can be detected. Practical collision attacks are now known for MD5 and SHA-1, so a newer algorithm such as SHA-256 is the safer choice when deliberate tampering is a concern.
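A digest can be computed with the standard `java.security.MessageDigest` class, as in the following sketch. The class and method names are hypothetical; the input is a byte array, which for a file would be its contents.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper class for computing hex-encoded digests.
class DigestDemo {
    // Returns the digest of `data` as a lowercase hex string.
    // `algorithm` is a standard name such as "MD5", "SHA-1" or "SHA-256".
    static String hexDigest(String algorithm, byte[] data) {
        try {
            byte[] digest = MessageDigest.getInstance(algorithm).digest(data);
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalArgumentException(e);
        }
    }
}
```

The MD5 digest of the three bytes "abc" is 900150983cd24fb0d6963f7d28e17f72, one of the test vectors in RFC 1321. Changing a single byte of the input produces a completely different digest, while the output length stays fixed (32 hex digits for MD5, 40 for SHA-1).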

The term hash function also has a broader meaning. In practice, a hash function maps a large range onto a small range, usually to save space and make data easier to store. Hash functions are also often used for searching. Before using a hash function, you should understand the following points:

1. The main principle of hashing is to map a large range onto a small one. The number of actual input values must therefore be comparable to the size of the small range or smaller; otherwise there will be many conflicts.

2. Because a hash approximates a one-way function, it can be used to protect data in cryptographic applications.

3. Different applications place different demands on the hash function. A hash function used for encryption is judged mainly by how close it comes to a one-way function, while a hash function used for searching is judged mainly by the collision rate of its mapping onto the small range.

Hash functions used in encryption have been discussed at length elsewhere, and the author's blog gives a more detailed introduction. This article therefore only discusses hash functions used for searching.

The input of a hash function is usually an array (such as a string), and its output is generally an int. The discussion below follows this convention.

Generally, hash functions can be divided into the following categories:

1. Additive hash;

2. Bitwise hash;

3. Multiplication hash;

4. Division hash;

5. Table lookup hash;

6. Hybrid hash.

The following sections describe how each of these methods is used in practice.

**1. Additive hash**

An additive hash adds the input elements one by one to form the final result. The standard additive hash is structured as follows:

    static int additiveHash(String key, int prime)
    {
        int hash, i;
        // start from the key length, then add every character code
        for (hash = key.length(), i = 0; i < key.length(); i++)
            hash += key.charAt(i);
        return (hash % prime);
    }

Here, prime is any prime number. The result lies in the range [0, prime-1].
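One consequence of pure addition is that the order of the characters does not matter, so anagrams always collide. The sketch below repeats the additive hash in a hypothetical wrapper class to show this; the choice of 31 as the prime is arbitrary.

```java
// Hypothetical demo class wrapping the additive hash shown in the text.
class AdditiveHashDemo {
    static int additiveHash(String key, int prime) {
        int hash = key.length();
        for (int i = 0; i < key.length(); i++) {
            hash += key.charAt(i);   // addition is commutative: order is lost
        }
        return hash % prime;
    }

    public static void main(String[] args) {
        // "abc" and "cba" contain the same characters, so both sums are 3 + 97 + 98 + 99 = 297.
        System.out.println(additiveHash("abc", 31));
        System.out.println(additiveHash("cba", 31));
    }
}
```

Both calls return 297 % 31 = 18. This is one reason the later categories mix the running hash with shifts or multiplications before adding the next character.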

**2. Bitwise hash**

This type of hash function mixes the input elements thoroughly using various bitwise operations (usually shifts and XOR). For example, the standard rotating hash is structured as follows:

    static int rotatingHash(String key, int prime)
    {
        int hash, i;
        for (hash = key.length(), i = 0; i < key.length(); i++)
            // rotate left by 4 bits (>>> is the unsigned shift), then XOR in the next character
            hash = (hash << 4) ^ (hash >>> 28) ^ key.charAt(i);
        // floorMod keeps the result in [0, prime-1] even when hash is negative
        return Math.floorMod(hash, prime);
    }

The main feature of this type of hash function is to shift first and then apply various bitwise operations. For example, the hash computation in the code above can be replaced by any of the following variants:

    // Variant 1
    hash = (hash << 5) ^ (hash >>> 27) ^ key.charAt(i);

    // Variant 2
    hash += key.charAt(i);
    hash += (hash << 10);
    hash ^= (hash >> 6);

    // Variant 3
    if ((i & 1) == 0)
    {
        hash ^= (hash << 7) ^ key.charAt(i) ^ (hash >> 3);
    }
    else
    {
        hash ^= ~((hash << 11) ^ key.charAt(i) ^ (hash >> 5));
    }

    // Variant 4
    hash += (hash << 5) + key.charAt(i);

    // Variant 5
    hash = key.charAt(i) + (hash << 6) + (hash >> 16) - hash;

    // Variant 6
    hash ^= ((hash << 5) + key.charAt(i) + (hash >> 2));

**3. Multiplication hash**

This type of hash function exploits the decorrelating effect of multiplication (the best-known use of this property is in a random number generation algorithm, although that algorithm does not work well). For example:

    static int bernstein(String key)
    {
        int hash = 0;
        int i;
        // multiply the running hash by 33 and add the next character
        for (i = 0; i < key.length(); ++i)
            hash = 33 * hash + key.charAt(i);
        return hash;
    }

The hashCode() method of the String class in JDK 5.0 also uses a multiplication hash, with a multiplier of 31. Other recommended multipliers are 131, 1313, 13131, 131313, and so on.
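The relationship to `String.hashCode()` can be checked directly. In the sketch below, the class and method names are hypothetical; the method is a generic multiplication hash with a configurable multiplier, starting from 0.

```java
// Hypothetical demo class: a multiplication hash with a configurable multiplier.
class MultiplicativeHashDemo {
    static int multiplicativeHash(String key, int multiplier) {
        int hash = 0;
        for (int i = 0; i < key.length(); i++) {
            // int overflow simply wraps around, which is the intended behavior
            hash = multiplier * hash + key.charAt(i);
        }
        return hash;
    }

    public static void main(String[] args) {
        String s = "hello";
        // With multiplier 31 the result matches the JDK's own String.hashCode().
        System.out.println(multiplicativeHash(s, 31) == s.hashCode());
    }
}
```

With multiplier 31 the function reproduces `String.hashCode()`, whose documented formula is s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]. Swapping in 33 or 131 gives the other variants mentioned in the text.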

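Well-known hash functions of this kind include:

    // 32-bit FNV algorithm
    // m_shift and m_mask are fields of the enclosing class; m_mask is not shown in this listing
    int m_shift = 0;

    public int FNVHash(byte[] data)
    {
        int hash = (int) 2166136261L;   // FNV offset basis
        for (byte b : data)
            // 16777619 is the FNV prime; b & 0xff treats the byte as unsigned
            hash = (hash * 16777619) ^ (b & 0xff);
        if (m_shift == 0)
            return hash;
        // fold the high bits into the low bits, then mask to the desired width
        return (hash ^ (hash >>> m_shift)) & m_mask;
    }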

And the improved FNV algorithm:

    public static int FNVHash1(String data)
    {
        final int p = 16777619;
        int hash = (int) 2166136261L;
        for (int i = 0; i < data.length(); i++)
            hash = (hash ^ data.charAt(i)) * p;
        // extra mixing steps applied once after the loop
        hash += hash << 13;
        hash ^= hash >> 7;
        hash += hash << 3;
        hash ^= hash >> 17;
        hash += hash << 5;
        return hash;
    }

Besides multiplying by a fixed number, it is also common to multiply by a number that keeps changing, for example:

    static int RSHash(String str)
    {
        int b = 378551;
        int a = 63689;
        int hash = 0;
        for (int i = 0; i < str.length(); i++)
        {
            hash = hash * a + str.charAt(i);
            a = a * b;   // the multiplier itself changes on every step
        }
        return (hash & 0x7FFFFFFF);   // clear the sign bit
    }

Although the Adler-32 algorithm is not as widely used as CRC32, it is probably the best-known of the multiplication hashes. For details, see RFC 1950.
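For reference, Adler-32 as defined in RFC 1950 keeps two running sums modulo 65521. The sketch below is a straightforward, unoptimized version in a hypothetical class, checked against the JDK's own `java.util.zip.Adler32`.

```java
import java.util.zip.Adler32;

// Hypothetical demo class: a plain Adler-32 next to the JDK implementation.
class Adler32Demo {
    static final int MOD = 65521;   // largest prime below 2^16

    static long adler32(byte[] data) {
        int a = 1;   // sum of all bytes, plus 1
        int b = 0;   // sum of all intermediate values of a
        for (byte x : data) {
            a = (a + (x & 0xff)) % MOD;
            b = (b + a) % MOD;
        }
        return ((long) b << 16) | a;
    }

    // Reference value from the standard library, for comparison.
    static long reference(byte[] data) {
        Adler32 ref = new Adler32();
        ref.update(data, 0, data.length);
        return ref.getValue();
    }
}
```

For the ASCII bytes of "Wikipedia" both methods return 0x11E60398. Taking the modulus on every byte is slow; practical implementations defer it across many bytes.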

**4. Division hash**

Division, like multiplication, has the same apparent decorrelating effect. However, division is so slow that this approach finds almost no real application. Note that the hash results shown earlier are reduced modulo a prime only to bound the range of the result. If no such bound is needed, "hash % prime" can be replaced with the following code: hash = hash ^ (hash >> 10) ^ (hash >> 20).

**5. Table lookup hash**

The best-known example of table lookup hashing is the CRC family of algorithms. Although CRC algorithms are not inherently table-based, table lookup is the fastest way to implement them. An implementation based on the CRC32 table is as follows:

    // Standard CRC-32 lookup table for the reflected polynomial 0xEDB88320,
    // computed once at class-load time (crctab[1] == 0x77073096, crctab[128] == 0xEDB88320).
    static final int[] crctab = new int[256];
    static
    {
        for (int n = 0; n < 256; n++)
        {
            int c = n;
            for (int k = 0; k < 8; k++)
                c = ((c & 1) != 0) ? (c >>> 1) ^ 0xEDB88320 : (c >>> 1);
            crctab[n] = c;
        }
    }

    int crc32(String key, int hash)
    {
        int i;
        for (hash = key.length(), i = 0; i < key.length(); ++i)
            // mask the index to 8 bits so that any char stays inside the table
            hash = (hash >>> 8) ^ crctab[(hash ^ key.charAt(i)) & 0xff];
        return hash;
    }

Other examples of table lookup hashing are universal hashing and Zobrist hashing. Their tables are randomly generated.
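As a minimal sketch of the Zobrist idea, the class below fills a table with random integers, one per (position, symbol) pair, and hashes a state by XOR-ing the entries it selects. The class name, the fixed seed, and the representation of a state as an int array are assumptions made for the example.

```java
import java.util.Random;

// Hypothetical sketch of Zobrist hashing over a fixed-length array of symbols.
class ZobristDemo {
    final int[][] table;   // table[position][symbol], filled with random ints

    ZobristDemo(int positions, int symbols, long seed) {
        Random rng = new Random(seed);
        table = new int[positions][symbols];
        for (int[] row : table) {
            for (int s = 0; s < symbols; s++) {
                row[s] = rng.nextInt();
            }
        }
    }

    // Full hash: XOR of the table entry chosen at every position.
    int hash(int[] state) {
        int h = 0;
        for (int pos = 0; pos < state.length; pos++) {
            h ^= table[pos][state[pos]];
        }
        return h;
    }

    // Incremental update: XOR out the old symbol, XOR in the new one.
    int update(int h, int pos, int oldSym, int newSym) {
        return h ^ table[pos][oldSym] ^ table[pos][newSym];
    }
}
```

Because XOR is its own inverse, changing one position costs two table lookups instead of a full rehash, which is why the technique is popular for game boards.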

**6. Hybrid hash**

A hybrid hash algorithm combines the methods above. Many common hash algorithms, such as MD5 and Tiger, fall into this category. They are rarely used as search-oriented hash functions.

**7. Evaluation of hash algorithms**

The page http://www.burtleburtle.net/bob/hash/doobs.html evaluates several popular hash algorithms. Our recommendations for choosing a hash function are as follows:

1. String hashing. The simplest approach is the basic multiplication hash. With a multiplier of 33 it hashes English words well (no collisions among words of fewer than 6 lowercase letters). For something more sophisticated, use the FNV algorithm (or its improved form), which offers good speed and quality for longer strings.

    public override unsafe int GetHashCode()
    {   // Microsoft System.String hash algorithm
        fixed (char* str = ((char*) this))
        {
            char* chPtr = str;
            int num = 0x15051505;
            int num2 = num;
            int* numPtr = (int*) chPtr;
            for (int i = this.Length; i > 0; i -= 4)
            {
                num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[0];
                if (i <= 2)
                {
                    break;
                }
                num2 = (((num2 << 5) + num2) + (num2 >> 0x1b)) ^ numPtr[1];
                numPtr += 2;
            }
            return (num + (num2 * 0x5d588b65));
        }
    }

2. Hashing long arrays. You can use the algorithm at http://burtleburtle.net/bob/c/lookup3.c, which is reasonably fast.