ArticleDirectory
1. Hash Introduction
Hash is a frequently used data structure to implement certain functions. In Java and C ++, there is a encapsulated Data Structure: C ++ STL map Java has hashmap
Treemap.
In computing theory, there is no hash function, but one-way function. The so-called one-way function is a complicated definition. You can look at the computing theory or cryptography data. One-way function is described in the language "Human class": If a function is given input, it is easy to calculate the result. When a result is given, it is difficult to calculate the input. This is a single function. Various encryption letters
All numbers can be considered as the approximation of unidirectional functions. A hash function (or a hash function) can also be considered as an approximation of a unidirectional function. That is, it is close to the one-way function definition.
The hash function has another meaning. In practice, the hash function maps a large range to a small range. The purpose of ing a large scope to a small scope is often to save space and make data easy to save. In addition, hash functions are often used for search. Therefore, before using the hash function, you need to understand its limitations:
1. The main principle of hash is to map a large range to a small range. Therefore, the number of actual values you enter must be equal to or smaller than a small range. Otherwise, there will be many conflicts.
2. As hash approaches unidirectional functions, you can use it to encrypt data.
3. different applications have different requirements on the hash function. For example, the hash function used for encryption mainly considers the gap between it and a single function, the hash function used for searching mainly considers the conflict rate mapped to a small range.
There have been too many discussions about hash functions used in encryption. I will give you a more detailed introduction in the author's blog. Therefore, this article only discusses the hash functions used for searching.
The main object used by the hash function is an array (such as a string), and its target is generally an int type. We will describe this method as follows.
2. Common hash Algorithm
Sometimes the hash function is a compressed image, so conflicts are inevitable. Therefore, when creating the hash function, you must not only set a good hash function, but also set a method to handle conflicts, create a table by hash and hash the list.
- Direct addressing
: The size of the address set and the keyword set is the same.
- Digital Analysis
: Hash as needed
Select the appropriate hash algorithm for the characteristics of keywords, and try to find
Differences
- Square method: the center after the square of the keyword is used as the hash address. The number of digits in the middle after the square of a number is related to each digit of the number. The number of digits obtained is determined by the table length. For example, if the table length is 512, = 2 ^ 9, you can take the 9-bit binary number after the square as the hash address.
- Folding Method: when there are many digits in the keyword, And the number distribution on each digit in the keyword is roughly even, you can use the folding method to obtain the hash address,
- Except for P, you can select P as the prime number, or do not include a combination of quality factors smaller than 20.
- Random Number Method: it is more appropriate to use this method to construct a hash function when the keyword is not equal.
In actual work, different hash functions need to be used according to different situations:
- Considerations: time required for calculating hash functions, hardware commands, and other factors.
- Keyword Length
- Hash table size
- Keyword Distribution
- Record the search frequency. (Huffeman tree)
How to handle conflicts:
- Open address method: the current detection is further hashed
As long as the hash table is filled up, a non-conflicting address can always be found.
Only when the table length is a prime number can we always find a non-conflicting address. Random detection and hash depend on the pseudo-random series.
- Hash method: Clustering is not easy, but the computing time is increased.
- In the Chord protocol, consistent hash is applied.
Calculate the average length of successful searches and the average length of unsuccessful searches for different hash functions and conflict handling methods. These results mainly depend on the fill factor and fill factor, which can limit the average search length to a certain range.
Generally, hash functions can be divided into the following categories:
1. Add hash;
2. bitwise operation hash;
3. Multiplication hash;
4. Division hash;
5. query table hash;
6. Hybrid hash;
The following describes in detail the use of the above methods in practice.
1. Add hash
The so-called addition hash is to add the input elements one by one to form the final result. The structure of the standard addition hash is as follows:
- Static int
Additivehash (string key, int prime)
- {
-
Int hash, I;
-
For (hash = key. Length (), I = 0; I <key. Length (); I ++)
Hash + = key. charat (I );
-
Return (hash % prime );
- }
CopyCode
Here, prime is any prime number. We can see that the value of the result is [0, prime-1].
Binary operation hash
This type of hash function uses a variety of bitwise operations (usually shift and XOR) to fully mix input elements. For example, the structure of the standard rotating hash is as follows:
- Static int
Rotatinghash (string key, int prime)
- {
-
Int hash, I;
-
For (hash = key. Length (), I = 0; I
-
Hash = (hash <4> 28) ^ key. charat (I );
-
Return (hash % prime );
- }
Copy code
First shift, and then perform a variety of bitwise operations is the main feature of this type of hash function. For example, the hash calculation code above can also have the following variants:
- Hash =
(Hash <5> 27) ^ key. charat (I );
- Hash + =
Key. charat (I );
- Hash + = (hash
<10 );
- Hash ^ = (hash
> 6 );
- If (I & 1)
= 0)
- {
- Hash ^ =
(Hash <7> 3 );
- }
- Else
- {
-
Hash ^ = ~ (Hash <11> 5 ));
- }
- Hash + =
(Hash <5>
- Hash =
Key. charat (I) + (hash <6> 16 )? Hash;
- Hash ^ =
(Hash <5> 2 ));
Copy code
Three-way hash
This type of hash function uses the non-relevance of multiplication (this property of multiplication is most famous for its random number generation algorithm, although this algorithm is not effective ). For example,
- Static int
Bernstein (string key)
- {
-
Int hash = 0;
-
Int I;
-
For (I = 0; I
-
Return hash;
- }
Copy code
The hashcode () method of the string class in jdk5.0 also uses multiplication hash. However, it uses a multiplier of 31. The recommended multiplier is 131,131 3, 13131,131 313, and so on.
The famous hash functions used in this method include:
- // 32-bit FNV Algorithm
- Int m_shift =
0;
Public int fnvhash (byte [] data)
{
-
Int hash = (INT) 2166136261l;
-
For (byte B: Data)
-
Hash = (hash * 16777619) ^ B;
-
If (m_shift = 0)
-
Return hash;
-
Return (hash ^ (hash> m_shift ))
& M_mask;
- }
Copy code
And the improved FNV algorithm:
- Public static
Int fnvhash1 (string data)
- {
-
Final int P = 16777619;
-
Int hash = (INT) 2166136261l;
-
For (INT I = 0; I
-
Hash = (hash ^
Data. charat (I) * P;
-
Hash + = hash <13;
-
Hash ^ = hash> 7;
-
Hash + = hash <3;
-
Hash ^ = hash> 17;
-
Hash + = hash <5;
-
Return hash;
- }
Copy code
In addition to multiplying a fixed number, it is common to multiply it by a constantly changing number, for example:
- Static int
Rshash (string Str)
- {
-
Int B = 378551;
-
Int A = 63689;
-
Int hash = 0;
-
-
For (INT I = 0; I <Str. Length (); I ++)
-
{
-
Hash = hash * A + Str. charat (I );
-
A = A * B;
-
}
-
Return (hash & 0x7fffffff );
- }
Copy code
Although the adler32 algorithm is not widely used in CRC32, it may be the most famous one in multiplication hash. For more information, see the RFC 1950 standard.
Division hash
Division, like multiplication, also has seemingly non-relevance. However, because division is too slow, this method almost cannot find the real application. Note that the hash result we see earlier is divided by a prime to ensure the range of results. If you do not need it to limit a range, you can use the following code to replace "hash % prime": Hash = hash ^ (hash> 10)
^ (Hash> 20 ).
Five-Table hash
The most famous example of table hash is the CRC series algorithm. Although the CRC algorithms are not table-based algorithms, table-based algorithms are the fastest way to implement them. The implementation of CRC32 is as follows:
-
- Static int
Crctl [256] = {
- 0x00000000, 0x77073096,
0xee0e612c, 0x990951ba, 0x076dc419, 0x706af48f, 0xe963a535, 0x9e6495a3,
0x0edb8832, 0x79dcb8a4, 0xe0d5e91e, 0x97d2d988, 0x09b64c2b, 0x7eb17cbd,
0xe7b82d07, 0x90bf1d91, 0x1db71064, 0x6ab020f2, 0xf3b97148, 0x84be41de,
0x1adad47d, 0x6ddde4eb, 0xf4d4b551, 0x83d1_c7, 0x136c9856, 0x646ba8c0,
0xfd62f97a, 0x8a65c9ec, 0x14015c4f, 0x63066cd9, 0xfa0f3d63, 0x8d080df5,
0x3b6e20c8, 0x4c69105e, 0xd56041e4, 0xa2677172, 0x3c03e4d1, 0x4b04d447,
0xd20d85fd, 0xa50ab56b, 0x35b5a8fa, 0x42b2986c, 0xdbbbc9d6, 0xacbcf940,
0x32d86fe3, 0x45df5c75, 0xdcd60dcf, 0xabd13d59, 0x26d930ac, 0x51de003a,
0xc8d75180, 0xbfd06116, 0x21b4f4b5, 0x56b3c423, 0xcfba9599, 0xb8bda50f,
0x2802b89e, 0x5f058808, 0xc60cd9b2, 0xb10be924, 0x2f6f7c87, 0x58684c11,
0xc1611dab, 0xb6662d3d, 0x76dc4190, 0x01db7106, 0x98d220bc, 0xefd5102a,
0x71b18589, 0x06b6b51f, 0x9fbfe4a5, 0xe8b8d433, 0x7807c9a2, 0x0f00f934,
0x9609a88e, 0xe10e9818, 0x7f6a0dbb, 0x086d3d2d, 0x91646c97, 0xe6635c01,
0x6b6b51f4, 0x1c6c6162, 0x856530d8, 0xf262004e, 0x6c0695ed, 0x1b01a57b, 0x8208f4c1,
0xf50fc457, 0x65b0d9c6, 0x12b7e950, 0x8bbeb8ea, 0xfcb9887c, 0x62dd1ddf,
0x15da2d49, 0x8cd37cf3, 0xfbd44c65, 0x4db26158, 0x3ab551ce, 0xa3bc0074,
0xd4bb30e2, 0x4adfa541, 0x3dd895d7, 0xa4d1c46d, 0xd3d6f4fb, 0x4425e96a,
0x0000ed9fc, 0xad678846, 0xda60b8d0, 0x000042d73, 0x33031de5, 0xaa0a4c5f,
0xdd0d7cc9, 0x5005713c, 0x270241aa, 0xbe0b1010, 0xc90c2086, 0x5768b525,
0x206f85b3, 0xb966d409, 0xce61e49f, 0x5edef90e, 0x29d9c998, 0xb0d09822,
0xc7d7a8b4, 0x59b33d17, 0x2eb40d81, 0xb7bd5c3b, 0xc0ba6cad, 0xedb88320, 0x9abfb3b6,
0x03b6e20c, 0x74b1d29a, 0xead54739, 0x9dd277af, 0x04db2615, 0x73dc1683,
0xe3630b12, 0x94643b84, 0x0d6d6a3e, 0x7a6a5aa8, 0xe40ecf0b, 0x9309ff9d,
0x0a00ae27, 0x7d079eb1, 0xf00f9344, 0x8708a3d2, 0x1e01f268, 0x6906c2fe,
0xf762575d, 0x806567cb,
- 0x196c3671,
0x6e6b06e7, 0xfed41b76, 0x89d32be0, 0x10da7a5a, 0x67dd4acc, 0xf9b9df6f,
0x8ebeeff9, 0x17b7be43, 0x60b08ed5, 0xd6d6a3e8, 0xa1d1937e, 0x38d8c2c4,
0x4fdff252, 0xd1bb67f1, 0xa6bc5767, 0x3fb506dd, 0x48b2364b, 0xd80d2133,
0xaf0a1b4c, 0x36034af6, 0x000047a60, 0xdf60efc3, 0xa867df55, 0x0000e8eef,
0x4669be79, 0xcb61b38c, 0xbc66831a, 0x256fd2a0, 0x5268e236, 0xcc0c7795,
0xbb0b4703, 0x220216b9, 0x5505262f, 0xc5ba3bbe, 0xb2bd0b28, 0x2bb45a92,
0x5cb36a04, 0xc2d7ffa7, 0xb5d0cf31, 0x2cd99e8b, 0x5bdeae1d, 0x9b64c2b0,
0xec63f226, 0x756aa39c, 0x026d930a, 0x9c0906a9, 0xeb0e363f, 0x72076785,
0x05005713, 0x95bf4a82, 0xe2b87a14, 0x7bb12bae, 0x0cb61b38, 0x92d28e9b,
0xe5d5be0d, 0x7cdcefb7, 0x0bdbdf21, 0x86d3d2d4, 0xf1d4e242, 0x68ddb3f8,
0x1fda836e, 0x81be16cd, 0xf6b9265b, 0x6fb077e1, 0x18b74777, 0x88085ae6,
0xff0f6a70, 0x66063bca, 0x11010b5c, 0x8f659eff, 0xf862ae69, 0x616bffd3,
0x166ccf45, 0xa00ae278, 0xd70dd2ee, 0x4e048354, 0x3903b3c2, 0xa7672661,
0xd06016f7, 0x4969474d, 0x3e6e77db, 0xaed16a4a, 0xd9d65adc, 0x40df0b66,
0x37d83bf0, 0xa9bcae53, 0xdebb9ec5, 0x47b2cf7f, 0x30b5ffe9, 0xbdbdf21c,
0xcabac28a, 0x53b39330, 0x24b4a3a6, 0xbad03605, 0xcdd70693, 0x54de5729,
0x23d967bf, 0xb3667a2e, 0xc4614ab8, 0x5d681b02, 0x2a6f2b94, 0xb40bbe37,
0xc30c8ea1, 0x5a05df1b, 0x2d02ef8d
-
- };
-
- Int CRC32 (string
Key, int hash)
-
- {
-
- Int I;
-
- For
(Hash = key. Length (), I = 0; I
-
Hash = (hash> 8) ^ crctl [(hash & 0xff) ^ K. charat (I)];
-
- Return hash;
-
- }
Copy code
Examples of hash in the query table are: Universal hashing and Zobrist hashing. Their tables are all randomly generated.
Hybrid hash
The hybrid hash algorithm utilizes the preceding methods. Various common hash algorithms, such as MD5 and tiger, are in this range. They are rarely used in search-oriented hash functions.
7. Comment on the hash algorithm
The http://www.burtleburtle.net/bob/hash/doobs.html page provides a comment on several popular hash algorithms. Our suggestions for the hash function are as follows:
1. String hash. The simplest way is to use the basic multiplication hash. When the multiplier is 33, it has a good hash effect for English words (there are no conflicts if there are less than 6 lowercase letters ). To be more complex, you can use the FNV algorithm (and its improved form). It provides good speed and performance for long strings.
2. Hash of the long array. You can use the formula http://burtleburtle.net/bob/c/lookup3.c to calculate the speed.
3. hash3.1 C ++ provided in the standard library
Standard STL sequence container
Vector, String, deque, and list.
Ø standard STL associated containers
Set, Multiset, map, and multimap.
Ø non-standard sequence container
Slist and rope.
Ø non-standard associated containers
Hash_set, hash_multiset, hash_map, and hash_multimap.
Some standard non-STL containers
Array, bitset, valarray, stack, queue, and priority_queue.
For more information about how to select container types, see: http://blog.csdn.net/alais/article/details/1180942
The underlying layer of map (multimap) is based on the red/black tree and can be accessed through subscript
The underlying layer of set (Multiset) is based on a balanced binary tree. Indirect element access must be performed through the iterator.
Java 3.2
Java 2's collections framework mainly includes two interfaces and their extensions and implementation classes: Collection interface and map interface. The difference between the two lies in that the former stores a group of objects, while the latter stores some key/value pairs.
Some complicated issues need to be implemented in the container, and the comparable interface must be implemented.
In this way, the collections sort function can be called. The implementation class of sortedmap is treemap.
If you use collections in multiple threads, you can consider calling classes in the Java. Concurrent. util package.
Advanced utility in Java. util. ConcurrentProgramClass-thread security set, thread pool, signal and synchronization Tool