An explanation of hash algorithms

Source: Internet
Author: User
Tags: bitwise, crc32, md5, hash, rfc

A hash table is designed for fast access and is a typical example of trading space for time. As the name implies, the data structure can be thought of as a linear table whose elements are not packed tightly together; there may be gaps between them.

A hash table is a data structure that is accessed directly by key. It locates a record by mapping the key to a position in the table, which speeds up lookups. The mapping function is called a hash function, and the array that holds the records is called the hash table. For example, to store 70 elements we might allocate space for 100 elements; the ratio 70/100 = 0.7 is called the load factor. The extra space is also there for the sake of fast access. Each element is assigned a storage location by a fixed function H whose results are distributed as randomly and evenly as possible, which avoids a linear search that traverses the table. This randomness, however, inevitably leads to a problem: collisions (conflicts). A collision occurs when the hash function H gives two elements the same address; the two elements are then called "synonyms". This is similar to 70 people going to a restaurant with 100 chairs, where two diners may still be sent to the same chair. The value computed by the hash function is a storage unit address, and each storage unit is called a bucket. For a hash table with m buckets, the values of the hash function should lie in the range [0, m-1].

Collision resolution is a complex issue. How often collisions occur depends mainly on:

(1) The hash function. The values of a good hash function should be distributed as evenly as possible.

(2) The method used to handle collisions.

(3) The size of the load factor. A high load factor makes collisions more likely, while a low one wastes a lot of space; the load factor and the hash function are interrelated.

Ways to resolve collisions:

(1) Linear probing: after a collision, probe forward linearly to find the nearest empty position. The drawback is clustering (pile-up): during a lookup, keys that are not synonyms may also lie in the probe sequence, which reduces efficiency.

(2) Double hashing: after a collision at position d, a second hash function produces a number c that is coprime with the bucket count m, and the positions (d + n*c) % m are probed for n = 1, 2, ..., so that the probe sequence jumps across the table.
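The two probe sequences described above can be compared with a minimal sketch. The class and method names are hypothetical, and the table size m = 7 and step c = 3 are illustrative values chosen so that c and m are coprime.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the two probing strategies; names and parameters are illustrative.
class ProbeSequences {
    // Linear probing: d, d+1, d+2, ... (mod m)
    static List<Integer> linear(int d, int m, int count) {
        List<Integer> seq = new ArrayList<>();
        for (int n = 0; n < count; n++) {
            seq.add((d + n) % m);
        }
        return seq;
    }

    // Double hashing: d, d+c, d+2c, ... (mod m), where c is coprime with m
    static List<Integer> doubleHash(int d, int c, int m, int count) {
        List<Integer> seq = new ArrayList<>();
        for (int n = 0; n < count; n++) {
            seq.add((d + n * c) % m);
        }
        return seq;
    }

    public static void main(String[] args) {
        System.out.println(linear(5, 7, 7));        // [5, 6, 0, 1, 2, 3, 4]
        System.out.println(doubleHash(5, 3, 7, 7)); // [5, 1, 4, 0, 3, 6, 2]
    }
}
```

Because c and m are coprime, the double-hashing sequence visits every bucket exactly once before repeating, while neighbouring start positions no longer share the same run of probes.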

Common methods of constructing hash functions

A hash function makes access to a data sequence more efficient, because data elements can be located more quickly through it:

  1. Direct addressing: take the key, or a linear function of the key, as the hash address. That is, H(key) = key or H(key) = a * key + b, where a and b are constants (this kind of hash function is also called a self-function).

  2. Digit analysis: analyze the set of data. Take the birth dates of a group of employees as an example: the leading digits (the year) are roughly the same for everyone, so using them would make collisions very likely, whereas the later digits for the month and day vary greatly. Using those later digits to form the hash address reduces the chance of collisions significantly. Digit analysis therefore looks for patterns in the digits and uses them to construct hash addresses with a low probability of collision.

  3. Mid-square method: take the middle digits of the square of the key as the hash address.

  4. Folding: split the key into several parts with the same number of digits (the last part may have a different number), then take the sum of these parts, discarding the carry, as the hash address.

  5. Random number method: choose a random function and take the random value of the key as the hash address. This is usually used when keys have different lengths.

  6. Division (remainder) method: take the remainder of the key divided by a number p that is not greater than the hash table length m as the hash address. That is, H(key) = key MOD p, with p <= m. The modulo can be applied to the key directly, or after folding or the mid-square operation. The choice of p is very important: generally a prime or m is used, and a poor choice of p easily produces synonyms.

Performance analysis for lookups
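Before turning to lookup cost, here is a minimal sketch of three of the construction methods listed above for non-negative integer keys. The digit positions (the middle three digits, groups of three decimal digits) are illustrative assumptions, not values prescribed by the text.

```java
// Sketch of the mid-square, folding, and division methods; all names are hypothetical.
class HashConstruction {
    // Mid-square: square the key, drop the last two digits, keep the next three.
    static int midSquare(int key) {
        long sq = (long) key * key;
        return (int) ((sq / 100) % 1000);
    }

    // Folding: split the key into groups of three decimal digits, add them,
    // and discard the carry beyond three digits.
    static int fold(long key) {
        long sum = 0;
        while (key > 0) {
            sum += key % 1000;
            key /= 1000;
        }
        return (int) (sum % 1000);
    }

    // Division: H(key) = key MOD p, with p no greater than the table length.
    static int division(long key, int p) {
        return (int) (key % p);
    }

    public static void main(String[] args) {
        System.out.println(midSquare(1234));         // 1234^2 = 1522756 -> 227
        System.out.println(fold(123456789L));        // 789 + 456 + 123 = 1368 -> 368
        System.out.println(division(123456789L, 97)); // 39
    }
}
```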

The lookup process for a hash table is basically the same as the process of building the table. Some keys can be found directly at the address produced by the hash function; others collide at that address and must be located using the collision-handling method. In the collision-handling methods described, a lookup after a collision is still a process of comparing the given value with keys. The efficiency of hash table lookup is therefore still measured by the average search length.

During a lookup, the number of key comparisons depends on how many collisions occur: fewer collisions mean higher lookup efficiency, and more collisions mean lower efficiency. The factors that affect the number of collisions are therefore the factors that affect lookup efficiency. There are three of them:

1. Whether the hash function is uniform;

2. The method used to handle collisions;

3. The load factor of the hash table.

The load factor of the hash table is defined as: α = number of elements in the table / length of the hash table

α indicates how full the hash table is. Since the table length is fixed, α is proportional to the number of elements in the table: the larger α is, the more elements the table holds and the more likely collisions become; the smaller α is, the less likely collisions are.

In fact, the average search length of a hash table is a function of the load factor α, and each collision-handling method has its own function.
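As an illustration of how the average search length depends on α, the sketch below evaluates the classical textbook approximations for a successful search. These formulas are not given in the text above; they are the standard results derived under the assumption of a uniform hash function, and the class and method names are hypothetical.

```java
// Textbook approximations of the average search length (successful search)
// as a function of the load factor alpha, assuming a uniform hash function.
class AverageSearchLength {
    // Linear probing: (1 + 1 / (1 - alpha)) / 2
    static double linearProbing(double alpha) {
        return 0.5 * (1 + 1 / (1 - alpha));
    }

    // Double hashing (and other pseudo-random probing): -ln(1 - alpha) / alpha
    static double doubleHashing(double alpha) {
        return -Math.log(1 - alpha) / alpha;
    }

    // Separate chaining: 1 + alpha / 2
    static double chaining(double alpha) {
        return 1 + alpha / 2;
    }

    public static void main(String[] args) {
        for (double alpha : new double[] {0.5, 0.7, 0.9}) {
            System.out.printf("alpha=%.1f linear=%.2f double=%.2f chaining=%.2f%n",
                    alpha, linearProbing(alpha), doubleHashing(alpha), chaining(alpha));
        }
    }
}
```

Note that none of these expressions contains the number of stored elements directly: under these assumptions the cost depends only on α, which is why the load factor is the quantity to tune.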

Having covered the basic definition of hashing, we should mention some well-known hash algorithms. MD5 and SHA-1 are among the most widely used, and both are based on the design of MD4. So what are they?

Here's a quick look:

  (1) MD4

MD4 (RFC 1320) was designed by Ronald L. Rivest of MIT in 1990; MD stands for Message Digest. It is suited to fast software implementation on 32-bit processors, because it is based on bitwise operations on 32-bit operands.

  (2) MD5

MD5 (RFC 1321) is Rivest's improved version of MD4, developed in 1991. It still processes the input in 512-bit blocks, and its output is the concatenation of four 32-bit words (128 bits), the same as MD4. MD5 is more complex than MD4 and somewhat slower, but it is more secure, with better resistance to cryptanalysis and differential attacks.

  (3) SHA-1 and others

SHA-1 was designed by NIST and the NSA for use with DSA. It produces a 160-bit hash value for inputs shorter than 2^64 bits, and therefore offers better resistance to brute-force attacks. SHA-1 is designed on the same principles as MD4 and mimics that algorithm.
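The output sizes mentioned above can be checked with the standard `java.security.MessageDigest` class. The following is only a small sketch; the class name and helper methods are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch: compute a digest and report its size in bits.
class DigestSizes {
    static byte[] digest(String algorithm, String text) {
        try {
            MessageDigest md = MessageDigest.getInstance(algorithm);
            return md.digest(text.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    static int bits(String algorithm) {
        return digest(algorithm, "").length * 8;
    }

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("MD5 bits:   " + bits("MD5"));   // 128
        System.out.println("SHA-1 bits: " + bits("SHA-1")); // 160
        System.out.println(hex(digest("SHA-1", "abc")));
        // a9993e364706816aba3e25717850c26c9cd0d89d
    }
}
```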

Collisions (conflicts) are unavoidable in a hash table: different keys may yield the same hash address, that is, key1 ≠ key2 but hash(key1) = hash(key2). Therefore, when building a hash table, you need not only a good hash function but also a method for handling collisions. A hash table can be described as follows: according to the chosen hash function H(key) and the selected collision-handling method, a set of keys is mapped onto a finite, contiguous set of addresses (an interval), and the "image" of each key in that address set is used as the storage location of the corresponding record. Such a table is called a hash table.

For dynamic lookup tables, 1) the table length is not fixed; 2) when the lookup table is designed, only the range of the keys is known, not the exact keys. In general, therefore, a functional relationship has to be established in which f(key) gives the location of the record with that key in the table. This function f(key) is usually called a hash function. (Note: this function is not necessarily a mathematical function.)

A hash function is a mapping from a set of keys to a set of addresses. It can be defined very flexibly, as long as the size of the address set does not exceed the allowed range.

In practice, a hash function has to be constructed, and only a well-constructed one works well.
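A concrete instance of key1 ≠ key2 with hash(key1) = hash(key2) is easy to find with Java's built-in `String.hashCode()`. The sketch below is only an illustration; the class name and the `bucket` helper are hypothetical.

```java
// Sketch: two different keys that receive the same hash value and bucket.
class CollisionDemo {
    // Map a hash code to a bucket index in [0, m-1].
    static int bucket(String key, int m) {
        return Math.floorMod(key.hashCode(), m);
    }

    public static void main(String[] args) {
        System.out.println("Aa".hashCode()); // 2112
        System.out.println("BB".hashCode()); // 2112
        System.out.println(bucket("Aa", 100) == bucket("BB", 100)); // true
    }
}
```

Since both keys land in the same bucket for every table size, the table still needs a collision-handling method to tell them apart.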

So what are these hash algorithms used for?

The application of hash algorithms in information security is mainly reflected in the following three areas:

  (1) File verification

The verification algorithms we are most familiar with are parity checks and CRC checks. Neither can resist data tampering: they can to some extent detect and correct channel errors in data transmission, but they cannot prevent malicious modification of the data.

The "digital fingerprint" property of the MD5 hash algorithm has made it one of the most widely used file integrity checksum algorithms, and many Unix systems provide a command for computing an MD5 checksum.

  (2) Digital signatures

Hash algorithms are also an important part of modern cryptographic systems. Because asymmetric algorithms are slow, one-way hash functions play an important role in digital signature protocols. Signing the hash value, also known as the "digital digest", can be considered statistically equivalent to signing the file itself. Such a protocol has other advantages as well.

  (3) Authentication protocols

This kind of authentication protocol is also known as challenge-response authentication. It is a simple and secure method when the transmission channel can be eavesdropped on but not tampered with.
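The text does not spell the protocol out, so the following is only a rough sketch of one common shape of challenge-response authentication: the verifier sends a random challenge, and the prover returns a hash over the challenge and a shared secret, so the secret itself never travels over the channel. The class and method names, the choice of SHA-256, and the message layout are all assumptions made for illustration; a production design would more likely use an HMAC.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.security.SecureRandom;

// Sketch of hash-based challenge-response; not a hardened implementation.
class ChallengeResponse {
    // Verifier side: create a fresh random challenge for every attempt.
    static byte[] newChallenge() {
        byte[] challenge = new byte[16];
        new SecureRandom().nextBytes(challenge);
        return challenge;
    }

    // Prover side: hash the challenge together with the shared secret.
    static byte[] response(byte[] challenge, String secret) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            md.update(challenge);
            md.update(secret.getBytes(StandardCharsets.UTF_8));
            return md.digest();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Verifier side: recompute the expected response and compare.
    static boolean verify(byte[] challenge, String secret, byte[] received) {
        return MessageDigest.isEqual(response(challenge, secret), received);
    }
}
```

An eavesdropper sees only the challenge and the hash. Because the challenge changes on every attempt, a recorded response cannot simply be replayed.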

File hash value

The digital digest of a file (for example, its MD5 hash) is computed by a hash function. Regardless of the length of the file, the hash function produces a fixed-length number. Unlike an encryption algorithm, a hash algorithm is an irreversible one-way function. With a high-security hash algorithm such as MD5 or SHA, it is almost impossible for two different files to accidentally produce the same hash result. Therefore, once a file has been modified, the change can be detected.
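A minimal sketch of this check: the content is read in chunks, so the same code works for input of any length, and the resulting digest always has the same size. The class and method names are hypothetical; for a real file, the stream could come from `java.nio.file.Files.newInputStream`.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch: MD5 digest of a stream, read in fixed-size chunks.
class StreamDigest {
    static String md5Hex(InputStream in) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                md.update(buffer, 0, n);
            }
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest()) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    static String md5Hex(String text) {
        return md5Hex(new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8)));
    }

    public static void main(String[] args) {
        System.out.println(md5Hex("abc")); // 900150983cd24fb0d6963f7d28e17f72
        System.out.println(md5Hex("abd")); // a different 32-character value
    }
}
```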

The term hash function also has another meaning. In practice, a hash function is a mapping from a large range to a small range. The goal of mapping a large range to a small one is usually to save space and make the data easy to store. In addition, hash functions are often applied to lookups. Therefore, before using a hash function, you need to understand several of its properties and limitations:

1. The main principle of hashing is to map a large range to a small range, so the number of values you actually store must be no larger than the small range. Otherwise there will be many collisions.

2. Because a hash approximates a one-way function, it can be used to protect data in a way similar to encryption, although the original input cannot be recovered from the hash.

3. Different applications place different requirements on the hash function. For example, a hash function used for encryption is judged mainly by how close it comes to a one-way function, while a hash function used for lookups is judged mainly by its collision rate when mapping to a small range. Hash functions for encryption have been discussed extensively, and the author's blog has a more detailed introduction, so this article only discusses hash functions used for lookups. The input of such a hash function is mainly an array (for example, a string), and its output is generally an int.

The discussion below follows this approach. Generally speaking, hash functions can be roughly divided into the following categories:

1. Addition hash; 2. Bitwise hash; 3. Multiplication hash; 4. Division hash; 5. Table lookup hash; 6. Mixed hash.

Each of these methods and its practical use is described in detail below.

1. Addition hash

An addition hash adds up the input elements one by one to form the final result. The standard addition hash is constructed as follows:

    static int additiveHash(String key, int prime) {
        int hash, i;
        // Start from the key length and add every character code.
        for (hash = key.length(), i = 0; i < key.length(); i++)
            hash += key.charAt(i);
        return (hash % prime);
    }

  

Here prime can be any prime number, and the result lies in the range [0, prime-1].
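One consequence of pure addition is worth seeing in a small usage sketch: because addition ignores order, keys made of the same characters always collide. The wrapper class name is hypothetical, and 31 is just an example prime.

```java
// Usage sketch for the addition hash: anagrams always collide.
class AdditiveHashDemo {
    static int additiveHash(String key, int prime) {
        int hash = key.length();
        for (int i = 0; i < key.length(); i++) {
            hash += key.charAt(i);
        }
        return hash % prime;
    }

    public static void main(String[] args) {
        System.out.println(additiveHash("abc", 31)); // (3 + 97 + 98 + 99) % 31 = 18
        System.out.println(additiveHash("cba", 31)); // 18 as well
        System.out.println(additiveHash("listen", 31) == additiveHash("silent", 31)); // true
    }
}
```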

2. Bitwise hash

This type of hash function mixes the input elements thoroughly by using various bitwise operations (most commonly shift and XOR). For example, the standard rotating hash is constructed as follows:

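    static int rotatingHash(String key, int prime) {
        int hash, i;
        // Shift the running hash, then XOR in the next character.
        for (hash = key.length(), i = 0; i < key.length(); ++i)
            hash = (hash << 4) ^ (hash >> 28) ^ key.charAt(i);
        // Note: in Java the running hash can become negative, and so can this result.
        return (hash % prime);
    }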

  

The main feature of this type of hash function is that it shifts first and then applies other bitwise operations. The hash computation in the code above also has several variants, which typically differ in the shift amounts and in how the operations are combined.
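As one possible variant, the sketch below uses the standard `Integer.rotateLeft` method so that the bits shifted out on the left re-enter on the right, and `Math.floorMod` so that the bucket index is never negative. This is an illustrative assumption, not the author's listing; the class and method names are hypothetical.

```java
// Sketch of a rotate-and-XOR hash built on Integer.rotateLeft.
class RotatingHashDemo {
    static int rotating(String key) {
        int hash = key.length();
        for (int i = 0; i < key.length(); i++) {
            // Rotate the running hash by 4 bits, then XOR in the next character.
            hash = Integer.rotateLeft(hash, 4) ^ key.charAt(i);
        }
        return hash;
    }

    // Reduce the hash to a bucket index in [0, prime-1], even if hash is negative.
    static int bucket(int hash, int prime) {
        return Math.floorMod(hash, prime);
    }

    public static void main(String[] args) {
        System.out.println(Integer.toHexString(rotating("abc"))); // 5743
        System.out.println(Integer.toHexString(rotating("cba"))); // 5541
    }
}
```

Unlike a plain sum of character codes, the result depends on the order of the characters, so keys made of the same letters in a different order usually get different values.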

  

3. Multiplication hash

This type of hash function takes advantage of the way multiplication decorrelates its inputs. The best-known use of this property is the random number generation method that squares a number and extracts part of its digits, although that algorithm does not perform well. For example:

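    static int bernstein(String key) {
        int hash = 0;
        int i;
        // Multiply the running hash by 33 and add the next character
        // (33 is the multiplier commonly attributed to Bernstein's hash).
        for (i = 0; i < key.length(); ++i)
            hash = 33 * hash + key.charAt(i);
        return hash;
    }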
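The same scheme can be parameterized by its multiplier, as in the sketch below. With a multiplier of 31 it reproduces the value of Java's own `String.hashCode()`, which makes the result easy to check. The class and method names are hypothetical.

```java
// Sketch: multiplication hash with a configurable multiplier.
class MultiplicativeHashDemo {
    static int multHash(String key, int multiplier) {
        int hash = 0;
        for (int i = 0; i < key.length(); i++) {
            hash = multiplier * hash + key.charAt(i);
        }
        return hash;
    }

    public static void main(String[] args) {
        System.out.println(multHash("abc", 33)); // 108966
        System.out.println(multHash("abc", 31)); // 96354
        System.out.println("abc".hashCode());    // 96354
    }
}
```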

  

The hashCode() method of the String class in JDK 5.0 also uses a multiplication hash, with a multiplier of 31. The recommended multipliers are 131, 1313, 13131, 131313, and so on. Well-known hash functions of this kind include:

32-bit FNV algorithm:

    int m_shift = 0;
    int m_mask = 0xFFFFFFFF; // assumption: the mask is not defined in the listing; all bits are kept here

    public int fnvHash(byte[] data) {
        int hash = (int) 2166136261L;
        for (byte b : data)
            hash = (hash * 16777619) ^ b;
        if (m_shift == 0)
            return hash;
        return (hash ^ (hash >> m_shift)) & m_mask;
    }

  

And the improved FNV algorithm:

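    public static int FNVHash1(String data) {
        final int p = 16777619;
        int hash = (int) 2166136261L;
        for (int i = 0; i < data.length(); i++)
            hash = (hash ^ data.charAt(i)) * p;
        // Final mixing steps. Assumption: the shift amounts 13 and 17 follow
        // commonly circulated versions of this function.
        hash += hash << 13;
        hash ^= hash >> 7;
        hash += hash << 3;
        hash ^= hash >> 17;
        hash += hash << 5;
        return hash;
    }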
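The two listings differ in the order of the multiply and XOR steps. In the FNV literature these orders are usually called FNV-1 (multiply, then XOR) and FNV-1a (XOR, then multiply). The sketch below isolates just that difference, without the extra mixing steps, and treats each byte as unsigned; the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Sketch: 32-bit FNV-1 versus FNV-1a, which differ only in the order of operations.
class FnvDemo {
    static final int OFFSET_BASIS = (int) 2166136261L;
    static final int PRIME = 16777619;

    // FNV-1: multiply first, then XOR in the byte.
    static int fnv1(byte[] data) {
        int hash = OFFSET_BASIS;
        for (byte b : data) {
            hash *= PRIME;
            hash ^= (b & 0xff);
        }
        return hash;
    }

    // FNV-1a: XOR in the byte first, then multiply.
    static int fnv1a(byte[] data) {
        int hash = OFFSET_BASIS;
        for (byte b : data) {
            hash ^= (b & 0xff);
            hash *= PRIME;
        }
        return hash;
    }

    public static void main(String[] args) {
        byte[] a = "a".getBytes(StandardCharsets.US_ASCII);
        System.out.println(Integer.toHexString(fnv1(a)));  // 50c5d7e
        System.out.println(Integer.toHexString(fnv1a(a))); // e40c292c
    }
}
```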

  

Besides multiplying by a fixed number, it is also common to multiply by a number that keeps changing, for example:

    static int rsHash(String str) {
        int b = 378551;
        int a = 63689;
        int hash = 0;
        for (int i = 0; i < str.length(); i++) {
            hash = hash * a + str.charAt(i);
            a = a * b; // the multiplier changes on every step
        }
        return (hash & 0x7FFFFFFF);
    }

  

Although the Adler-32 algorithm is not as widely used as CRC32, it is probably the best-known of the multiplication hashes. For an introduction, see the RFC 1950 specification.
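For orientation, here is a minimal sketch of Adler-32 as RFC 1950 describes it: two running sums modulo 65521, combined into one 32-bit value. The result is compared with the standard `java.util.zip.Adler32` class. The class and method names of the sketch are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Adler32;

// Sketch: Adler-32 computed by hand and with the standard library.
class Adler32Demo {
    static long adler32(byte[] data) {
        int a = 1; // sum of all bytes, plus one
        int b = 0; // sum of the successive values of a
        for (byte x : data) {
            a = (a + (x & 0xff)) % 65521;
            b = (b + a) % 65521;
        }
        return ((long) b << 16) | a;
    }

    static long library(byte[] data) {
        Adler32 adler = new Adler32();
        adler.update(data, 0, data.length);
        return adler.getValue();
    }

    public static void main(String[] args) {
        byte[] text = "Wikipedia".getBytes(StandardCharsets.US_ASCII);
        System.out.println(Long.toHexString(adler32(text))); // 11e60398
        System.out.println(Long.toHexString(library(text))); // 11e60398
    }
}
```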

4. Division hash

Division, like multiplication, appears to decorrelate its inputs. However, because division is slow, this approach has almost no real applications. Note that the "hash % prime" step in the earlier functions, where the result is divided by a prime, is only intended to restrict the range of the result. If you do not need to restrict the range, you can replace "hash % prime" with the following code: hash = hash ^ (hash >> 10) ^ (hash >> 20).

5. Table lookup hash

The best-known example of a table lookup hash is the CRC family of algorithms. Although the CRC algorithms are not inherently table-based, a lookup table is the fastest way to implement them. Here is an implementation of CRC32:

    static int[] crctab = {
        0x00000000, 0x77073096, 0xee0e612c, 0x990951ba, 0x076dc419, 0x706af48f, 0xe963a535, 0x9e6495a3,
        0x0edb8832, 0x79dcb8a4, 0xe0d5e91e, 0x97d2d988, 0x09b64c2b, 0x7eb17cbd, 0xe7b82d07, 0x90bf1d91,
        0x1db71064, 0x6ab020f2, 0xf3b97148, 0x84be41de, 0x1adad47d, 0x6ddde4eb, 0xf4d4b551, 0x83d385c7,
        0x136c9856, 0x646ba8c0, 0xfd62f97a, 0x8a65c9ec, 0x14015c4f, 0x63066cd9, 0xfa0f3d63, 0x8d080df5,
        0x3b6e20c8, 0x4c69105e, 0xd56041e4, 0xa2677172, 0x3c03e4d1, 0x4b04d447, 0xd20d85fd, 0xa50ab56b,
        0x35b5a8fa, 0x42b2986c, 0xdbbbc9d6, 0xacbcf940, 0x32d86ce3, 0x45df5c75, 0xdcd60dcf, 0xabd13d59,
        0x26d930ac, 0x51de003a, 0xc8d75180, 0xbfd06116, 0x21b4f4b5, 0x56b3c423, 0xcfba9599, 0xb8bda50f,
        0x2802b89e, 0x5f058808, 0xc60cd9b2, 0xb10be924, 0x2f6f7c87, 0x58684c11, 0xc1611dab, 0xb6662d3d,
        0x76dc4190, 0x01db7106, 0x98d220bc, 0xefd5102a, 0x71b18589, 0x06b6b51f, 0x9fbfe4a5, 0xe8b8d433,
        0x7807c9a2, 0x0f00f934, 0x9609a88e, 0xe10e9818, 0x7f6a0dbb, 0x086d3d2d, 0x91646c97, 0xe6635c01,
        0x6b6b51f4, 0x1c6c6162, 0x856530d8, 0xf262004e, 0x6c0695ed, 0x1b01a57b, 0x8208f4c1, 0xf50fc457,
        0x65b0d9c6, 0x12b7e950, 0x8bbeb8ea, 0xfcb9887c, 0x62dd1ddf, 0x15da2d49, 0x8cd37cf3, 0xfbd44c65,
        0x4db26158, 0x3ab551ce, 0xa3bc0074, 0xd4bb30e2, 0x4adfa541, 0x3dd895d7, 0xa4d1c46d, 0xd3d6f4fb,
        0x4369e96a, 0x346ed9fc, 0xad678846, 0xda60b8d0, 0x44042d73, 0x33031de5, 0xaa0a4c5f, 0xdd0d7cc9,
        0x5005713c, 0x270241aa, 0xbe0b1010, 0xc90c2086, 0x5768b525, 0x206f85b3, 0xb966d409, 0xce61e49f,
        0x5edef90e, 0x29d9c998, 0xb0d09822, 0xc7d7a8b4, 0x59b33d17, 0x2eb40d81, 0xb7bd5c3b, 0xc0ba6cad,
        0xedb88320, 0x9abfb3b6, 0x03b6e20c, 0x74b1d29a, 0xead54739, 0x9dd277af, 0x04db2615, 0x73dc1683,
        0xe3630b12, 0x94643b84, 0x0d6d6a3e, 0x7a6a5aa8, 0xe40ecf0b, 0x9309ff9d, 0x0a00ae27, 0x7d079eb1,
        0xf00f9344, 0x8708a3d2, 0x1e01f268, 0x6906c2fe, 0xf762575d, 0x806567cb, 0x196c3671, 0x6e6b06e7,
        0xfed41b76, 0x89d32be0, 0x10da7a5a, 0x67dd4acc, 0xf9b9df6f, 0x8ebeeff9, 0x17b7be43, 0x60b08ed5,
        0xd6d6a3e8, 0xa1d1937e, 0x38d8c2c4, 0x4fdff252, 0xd1bb67f1, 0xa6bc5767, 0x3fb506dd, 0x48b2364b,
        0xd80d2bda, 0xaf0a1b4c, 0x36034af6, 0x41047a60, 0xdf60efc3, 0xa867df55, 0x316e8eef, 0x4669be79,
        0xcb61b38c, 0xbc66831a, 0x256fd2a0, 0x5268e236, 0xcc0c7795, 0xbb0b4703, 0x220216b9, 0x5505262f,
        0xc5ba3bbe, 0xb2bd0b28, 0x2bb45a92, 0x5cb36a04, 0xc2d7ffa7, 0xb5d0cf31, 0x2cd99e8b, 0x5bdeae1d,
        0x9b64c2b0, 0xec63f226, 0x756aa39c, 0x026d930a, 0x9c0906a9, 0xeb0e363f, 0x72076785, 0x05005713,
        0x95bf4a82, 0xe2b87a14, 0x7bb12bae, 0x0cb61b38, 0x92d28e9b, 0xe5d5be0d, 0x7cdcefb7, 0x0bdbdf21,
        0x86d3d2d4, 0xf1d4e242, 0x68ddb3f8, 0x1fda836e, 0x81be16cd, 0xf6b9265b, 0x6fb077e1, 0x18b74777,
        0x88085ae6, 0xff0f6a70, 0x66063bca, 0x11010b5c, 0x8f659eff, 0xf862ae69, 0x616bffd3, 0x166ccf45,
        0xa00ae278, 0xd70dd2ee, 0x4e048354, 0x3903b3c2, 0xa7672661, 0xd06016f7, 0x4969474d, 0x3e6e77db,
        0xaed16a4a, 0xd9d65adc, 0x40df0b66, 0x37d83bf0, 0xa9bcae53, 0xdebb9ec5, 0x47b2cf7f, 0x30b5ffe9,
        0xbdbdf21c, 0xcabac28a, 0x53b39330, 0x24b4a3a6, 0xbad03605, 0xcdd70693, 0x54de5729, 0x23d967bf,
        0xb3667a2e, 0xc4614ab8, 0x5d681b02, 0x2a6f2b94, 0xb40bbe37, 0xc30c8ea1, 0x5a05df1b, 0x2d02ef8d
    };

    static int crc32(String key, int hash) {
        int i;
        // As in the original listing, the incoming hash is replaced by the key length.
        // The unsigned shift (>>>) and the 0xFF mask keep the table index in 0..255.
        for (hash = key.length(), i = 0; i < key.length(); ++i)
            hash = (hash >>> 8) ^ crctab[(hash ^ key.charAt(i)) & 0xFF];
        return hash;
    }
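The table above does not have to be typed in by hand. Its entries are the values produced by the reflected CRC-32 polynomial 0xEDB88320, so it can be generated with a short loop. The sketch below does that and also computes the standard CRC-32 (initial value 0xFFFFFFFF and a final inversion, which the listing above does not apply), comparing it with `java.util.zip.CRC32`. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Sketch: generate the CRC-32 lookup table and compute a standard CRC-32.
class Crc32Table {
    static int[] makeTable() {
        int[] table = new int[256];
        for (int n = 0; n < 256; n++) {
            int c = n;
            for (int k = 0; k < 8; k++) {
                // Shift right; XOR in the polynomial whenever a 1 bit falls out.
                c = (c & 1) != 0 ? (c >>> 1) ^ 0xEDB88320 : c >>> 1;
            }
            table[n] = c;
        }
        return table;
    }

    static long crc32(byte[] data) {
        int[] table = makeTable();
        int crc = 0xFFFFFFFF;
        for (byte b : data) {
            crc = (crc >>> 8) ^ table[(crc ^ b) & 0xff];
        }
        return (crc ^ 0xFFFFFFFF) & 0xFFFFFFFFL;
    }

    public static void main(String[] args) {
        byte[] check = "123456789".getBytes(StandardCharsets.US_ASCII);
        CRC32 reference = new CRC32();
        reference.update(check, 0, check.length);
        System.out.println(Integer.toHexString(makeTable()[1]));  // 77073096
        System.out.println(Long.toHexString(crc32(check)));       // cbf43926
        System.out.println(Long.toHexString(reference.getValue())); // cbf43926
    }
}
```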

Other notable examples of table lookup hashes are Universal Hashing and Zobrist Hashing. Their tables are generated randomly.
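To illustrate a randomly generated table, here is a minimal sketch of Zobrist hashing for a small board: every (square, piece) pair gets a random 64-bit number, and the hash of a position is the XOR of the numbers for all occupied squares. The board encoding, the sizes, and all names are assumptions made for the example.

```java
import java.util.Random;

// Sketch of Zobrist hashing; board[s] holds a piece index, or -1 for an empty square.
class ZobristDemo {
    final long[][] table; // table[square][piece]

    ZobristDemo(int squares, int pieces, long seed) {
        Random rng = new Random(seed);
        table = new long[squares][pieces];
        for (int s = 0; s < squares; s++) {
            for (int p = 0; p < pieces; p++) {
                table[s][p] = rng.nextLong();
            }
        }
    }

    // Hash a whole position from scratch.
    long hash(int[] board) {
        long h = 0;
        for (int s = 0; s < board.length; s++) {
            if (board[s] >= 0) {
                h ^= table[s][board[s]];
            }
        }
        return h;
    }

    // Incremental update: XOR the piece out of its old square and into the new one.
    long move(long h, int piece, int from, int to) {
        return h ^ table[from][piece] ^ table[to][piece];
    }
}
```

Because XOR is its own inverse, moving a piece costs two table lookups instead of rehashing the whole position.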
6. Mixed hash

A mixed hash algorithm combines the methods above. Various common hash algorithms, such as MD5 and Tiger, fall into this category. They are rarely used as lookup-oriented hash functions.
