A hash (the term is sometimes translated and sometimes simply transliterated) converts an input of arbitrary length (also called the pre-image) into a fixed-length output by means of a hash algorithm. The output is the hash value. The conversion is a compression mapping: the space of hash values is usually much smaller than the space of inputs. Different inputs may hash to the same output, so the input cannot be uniquely determined from the hash value. Simply put, a hash function compresses a message of any length into a fixed-length message digest.
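As a small illustration of the fixed-length property, the following sketch uses Python's standard hashlib module. The choice of SHA-256 and the sample inputs are assumptions made for this example; the text itself does not prescribe a particular algorithm.

```python
import hashlib


def digest_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()


short_input = b"a"               # one byte
long_input = b"a" * 1_000_000    # one million bytes

# Both digests have the same length (64 hex characters = 256 bits),
# regardless of how long the input was.
print(len(digest_hex(short_input)), len(digest_hex(long_input)))
```

Because the output space is finite while the input space is not, collisions must exist in principle, even though none is expected between these two inputs.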
HASH function (computer algorithm field)
Basic Concepts
* If a record whose keyword equals K exists in the structure, it must be stored at location f(K), so the record can be retrieved directly without any comparisons. The correspondence f is called a hash function, and a table built in advance on this basis is a hash table.
* Different keywords may yield the same hash address, that is, key1 ≠ key2 but f(key1) = f(key2). This phenomenon is called a collision, and keywords with the same function value are called synonyms with respect to the hash function. In summary, a hash function H(key) together with a method of handling collisions maps a set of keywords onto a finite, contiguous set of addresses (an interval), and the "image" of each keyword in that address set is used as the storage location of its record. Such a table is called a hash table, the mapping process is called hashing, and the resulting storage location is called the hash address.
* If, for any keyword in the keyword set, the hash function is equally likely to map it to any address in the address set, the function is called a uniform hash function. The aim is to give each keyword an effectively "random" address and thereby reduce collisions.
Properties
All hash functions share one basic property: if two hash values produced by the same function differ, the two original inputs must differ as well. This follows from the fact that a hash function is deterministic. On the other hand, the mapping from inputs to outputs is not one-to-one: if two hash values are equal, the two inputs are probably equal, but this is not guaranteed. If you compute the hash of some data and then change part of the input, a hash function with strong mixing produces a completely different hash value.
Typical hash functions have an infinite domain, such as byte strings of arbitrary length, and a finite range, such as fixed-length bit strings. In some cases a hash function can be designed as a one-to-one correspondence between a domain and a range of the same size. A one-to-one hash function is also called a permutation. Reversibility can be obtained by applying a series of reversible "mixing" operations to the input value.
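The idea of building a permutation from reversible mixing steps can be sketched as follows. This is only an illustration on 32-bit integers: the names mix32 and unmix32 and the multiplier constant are assumptions chosen for the example, not functions named in the text. Each step (an xor with a right-shifted copy, and a multiplication by an odd constant modulo 2^32) can be undone, so the whole function can be undone.

```python
MASK32 = 0xFFFFFFFF
MULTIPLIER = 0x45D9F3B                    # assumed constant; any odd value is invertible mod 2**32
INVERSE = pow(MULTIPLIER, -1, 1 << 32)    # modular inverse (requires Python 3.8+)


def mix32(x: int) -> int:
    """Scramble a 32-bit integer using only reversible steps."""
    x &= MASK32
    x ^= x >> 16                      # reversible: the top 16 bits are left unchanged
    x = (x * MULTIPLIER) & MASK32     # reversible: the multiplier is odd
    x ^= x >> 16
    return x


def unmix32(y: int) -> int:
    """Undo mix32 by applying the inverse steps in reverse order."""
    y &= MASK32
    y ^= y >> 16                      # an xor-shift by half the width is its own inverse
    y = (y * INVERSE) & MASK32
    y ^= y >> 16
    return y


print(unmix32(mix32(123456789)) == 123456789)
```

Since every 32-bit input maps to a distinct 32-bit output, this function has no collisions at all, which is exactly what distinguishes a permutation from a compressing hash.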
Common HASH functions
· Direct remainder method: f(x) := x mod maxM, where maxM is generally a prime number not too close to 2^t.
· Multiplication and rounding method: f(x) := trunc((x / maxX) * maxlongit) mod maxM, mainly used for real numbers.
· Mid-square method: f(x) := (x * x div 1000) mod 1000000. Taking the middle digits of the square means that each digit of the result depends on more of the input.
Construction method
A hash function makes access to a data sequence faster and more efficient, because data elements can be located more quickly. (For the detailed construction methods, see [Hash Table Construction Method] in the hash function entry.)
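As a concrete reference point, the three common functions given above (direct remainder, multiplication and rounding, and squaring) might be written as follows. The constants MAX_M, MAX_X and MAX_LONGINT are assumed values standing in for the text's maxM, maxX and maxlongit, which the text does not specify.

```python
import math

MAX_M = 997                 # assumed table size: a prime not close to a power of two
MAX_X = 1000.0              # assumed upper bound on the real-valued inputs
MAX_LONGINT = 2**31 - 1     # assumed stand-in for "maxlongit"


def direct_remainder(x: int) -> int:
    """f(x) = x mod maxM."""
    return x % MAX_M


def multiply_and_round(x: float) -> int:
    """f(x) = trunc((x / maxX) * maxlongit) mod maxM, intended for real numbers."""
    return math.trunc((x / MAX_X) * MAX_LONGINT) % MAX_M


def mid_square(x: int) -> int:
    """f(x) = (x * x div 1000) mod 1000000: keep middle digits of the square."""
    return (x * x // 1000) % 1000000


print(direct_remainder(1000), multiply_and_round(512.25), mid_square(123456))
```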
1. Direct addressing method: Take the keyword itself, or a linear function of it, as the hash address, that is, H(key) = key or H(key) = a · key + b, where a and b are constants (such a hash function is called a self function).
2. Digit analysis method
3. Mid-square method
4. Folding method
5. Random number method
6. Division remainder method: Take the remainder of the keyword divided by a number p no larger than the hash table length m as the hash address, that is, H(key) = key MOD p, p <= m. The modulo can be applied to the keyword directly, or after folding, squaring, and so on. The choice of p matters a great deal: p is generally taken to be a prime, or m itself. A poorly chosen p readily produces synonyms.
Methods of Handling Conflicts
1. Open addressing method: Hi = (H(key) + di) MOD m, i = 1, 2, …, k (k <= m-1), where H(key) is the hash function, m is the hash table length, and di is an increment sequence, which can be chosen in the following three ways:
1). di = 1, 2, 3, ..., m-1, called linear probing;
2). di = 1^2, -(1^2), 2^2, -(2^2), 3^2, ..., ±(k^2) (k <= m/2), called quadratic probing;
3). di = a pseudo-random number sequence, called pseudo-random probing.
2. Rehashing method: Hi = RHi(key), i = 1, 2, ..., k, where the RHi are different hash functions. When synonyms collide at an address, another hash function is used to compute a new address, and this continues until no collision occurs. This method is less prone to clustering, but it increases computation time.
3. Chain address method (also called the zipper method, or separate chaining)
4. Establishing a common overflow area
Search performance analysis
Looking up a key in a hash table follows essentially the same process as building the table. Some keys are found directly at the address produced by the hash function; others collided at that address and must be located according to the collision-handling method. In each of the collision-handling methods introduced above, the search after a collision is still a process of comparing the given value with stored keys. The efficiency of hash table lookup is therefore still measured by the average search length. The number of key comparisons during a search depends on the number of collisions: fewer collisions mean a more efficient search, and more collisions mean a less efficient one. The factors that affect the number of collisions are therefore the factors that affect search efficiency. There are three of them:
1. Whether the hash function is uniform;
2. Methods for dealing with conflicts;
3. The fill factor of the hash table.
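The influence of the third factor can be observed with a small experiment. The sketch below is a minimal open-addressing table using linear probing (Hi = (H(key) + di) MOD m with H(key) = key MOD m) that measures the average number of key comparisons in a successful search at several fill levels, where the fill level is the number of stored elements divided by the table length. The class name, the table size of 1009, and the random integer keys are all assumptions made for the illustration.

```python
import random


class LinearProbingTable:
    """Open-addressing hash table with linear probing (illustrative only)."""

    def __init__(self, m: int):
        self.m = m
        self.slots = [None] * m

    def _probe(self, key: int):
        # Visit H(key), H(key)+1, H(key)+2, ... modulo m.
        home = key % self.m
        for d in range(self.m):
            yield (home + d) % self.m

    def insert(self, key: int) -> None:
        for addr in self._probe(key):
            if self.slots[addr] is None or self.slots[addr] == key:
                self.slots[addr] = key
                return
        raise RuntimeError("table is full")

    def search_length(self, key: int) -> int:
        """Comparisons needed to find key; 0 if the key is absent."""
        for comparisons, addr in enumerate(self._probe(key), start=1):
            if self.slots[addr] is None:
                return 0
            if self.slots[addr] == key:
                return comparisons
        return 0


def average_search_length(m: int, n: int, seed: int = 0) -> float:
    """Insert n distinct random keys into a table of length m and
    return the mean number of comparisons over successful searches."""
    rng = random.Random(seed)
    keys = rng.sample(range(10 * m), n)
    table = LinearProbingTable(m)
    for k in keys:
        table.insert(k)
    return sum(table.search_length(k) for k in keys) / n


for fill in (0.25, 0.5, 0.9):
    print(fill, round(average_search_length(1009, int(fill * 1009)), 2))
```

The exact figures depend on the keys drawn, but the average search length should grow noticeably as the table fills up.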
The fill factor of a hash table is defined as:
α = number of elements in the table / length of the hash table
α indicates how full the hash table is. Because the table length is fixed, α is proportional to the number of elements stored in the table: the larger α is, the more elements the table holds and the more likely collisions become; the smaller α is, the fewer elements the table holds and the less likely collisions become. In fact, the average search length of a hash table is a function of the fill factor α, although the exact function differs from one collision-handling method to another.
With the basic definition of hashing in place, some well-known hash algorithms deserve mention. MD5 and SHA-1 are arguably the most widely used hash algorithms, and both are based on the design of MD4.
Introduction to common hash algorithms:
(1) MD4: MD4 (RFC 1320) was designed by Ronald L. Rivest of MIT in 1990. MD stands for Message Digest. It is suited to fast software implementation on 32-bit processors, since it is built from bitwise operations on 32-bit operands.
(2) MD5: MD5 (RFC 1321) is Rivest's improved version of MD4, from 1991. Its input is still processed in 512-bit blocks, and its output is the concatenation of four 32-bit words, the same as MD4. MD5 is more complex than MD4 and somewhat slower, but it is more secure and holds up better against cryptanalysis, including differential attacks.
(3) SHA-1 and others: SHA-1 was designed by NIST and the NSA for use with DSA. It produces a 160-bit hash value for inputs shorter than 2^64 bits, so it has better resistance to brute-force attacks.
SHA-1 was designed on the same principles as MD4 and closely follows that algorithm.
Hash function applications
Because hash functions have such varied applications, they are often designed for one specific application. A cryptographic hash function, for example, is designed on the assumption that an adversary is trying to find an original input with the same hash value. A well-designed cryptographic hash function is a "one-way" operation: given a hash value, there is no practical way to compute an input that produces it, which makes forgery difficult. Functions designed for cryptographic hashing, such as MD5, are widely used as checksum functions. When software is downloaded, for instance, the file is accepted as correct only after its check code has been verified. This code may change when environmental factors change, such as the machine configuration or IP address, in order to protect the source file. Error detection and repair functions are mainly used to identify cases where data has been disturbed by a random process. When a hash function is used as a checksum, a relatively short hash value is enough to verify whether data of any length has been altered.
Error correction
A hash function gives a straightforward way to detect errors that occur during data transmission. The sender applies a hash function to the data to be sent and transmits the result along with the original data. The receiver applies the same hash function to the received data. If the two results differ, the data was corrupted in transit. This is called a redundancy check. For error correction, a distribution of likely perturbations is assumed, at least approximately. Perturbations to a message string fall into two categories: large (improbable) errors and small (probable) errors. For the second category, the requirement is restated as follows.
Given H(x) and x + s, x can be computed efficiently as long as s is small enough. A hash function with this property is called an error-correcting code. Two important classes of such codes are cyclic redundancy checks and Reed-Solomon codes.
Audio recognition
For applications such as matching an MP3 file against a known list, one possible approach is a traditional hash function such as MD5, but that approach is very sensitive to time shifts, CD read errors, different audio compression algorithms, and volume adjustments. MD5-like methods help to quickly find audio files that are strictly identical (in terms of their binary data), but finding audio files that are the same in content requires more advanced algorithms. Contrary to an assumption common among people who do not follow the industry closely, hash functions that are robust to such small differences do exist. Most existing hash algorithms are not robust in this sense, but a few are robust enough to identify music played through speakers in a noisy room. A practical example is the Shazam [1] service. The user dials a specific number and holds the telephone's microphone near the speaker playing the music. The service analyzes the music and compares it with known hash values stored in its database, and the user then receives the title of the identified music (for a fee).
Information Security
The application of hash algorithms in information security is mainly reflected in the following 3 aspects:
(1) File check: The checks we are most familiar with are the parity check and the CRC check. Neither can resist deliberate tampering: they can detect, and to some extent correct, channel errors in data transmission, but they cannot prevent malicious alteration of data. The "digital fingerprint" property of the MD5 hash algorithm has made it the most widely used file integrity checksum algorithm, and many Unix systems provide a command to compute MD5 checksums.
(2) Digital signature: Hash algorithms are also an important part of modern cryptosystems. Because asymmetric algorithms are slow, one-way hash functions play an important role in digital signature protocols. Digitally signing the hash value, also known as the "digital digest", can be considered statistically equivalent to digitally signing the file itself, and such a protocol has other advantages as well.
(3) Authentication protocol: This kind of authentication protocol is also called the challenge-authentication mode. When the transmission channel can be eavesdropped on but not tampered with, it is a simple and secure method.
The above covers some basic background on hashing and related topics.
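The text does not spell out the steps of the challenge-authentication mode, so the following is only one plausible sketch of such an exchange, not the specific protocol the article has in mind. It assumes that both parties already share a secret, and it uses HMAC with SHA-256 from Python's standard library; the function names are hypothetical. The verifier sends a fresh random challenge, the other party returns a keyed hash of it, and the secret itself never crosses the channel.

```python
import hashlib
import hmac
import secrets


def make_challenge() -> bytes:
    """Verifier side: produce a fresh, unpredictable challenge."""
    return secrets.token_bytes(16)


def respond(shared_secret: bytes, challenge: bytes) -> str:
    """Prover side: hash the challenge together with the shared secret."""
    return hmac.new(shared_secret, challenge, hashlib.sha256).hexdigest()


def verify(shared_secret: bytes, challenge: bytes, response: str) -> bool:
    """Verifier side: recompute the expected response and compare."""
    expected = respond(shared_secret, challenge)
    return hmac.compare_digest(expected, response)


secret = b"example shared secret"      # assumed to be agreed on in advance
challenge = make_challenge()
print(verify(secret, challenge, respond(secret, challenge)))
```

An eavesdropper who records one exchange learns a response to one challenge only; since each challenge is fresh, replaying it later does not help.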
Hash function
(1) Remainder method: First estimate the number of entries in the whole hash table. Then divide each original value by this estimate to obtain a quotient and a remainder, and use the remainder as the hash value. Because collisions are quite likely with this method, any search algorithm must be able to detect that a collision has occurred and provide an alternative.
(2) Folding method: This method is used when the original value is a number. The original value is split into several parts, the parts are added together, and the last four digits of the sum (or some other number of digits) are used as the hash value.
(3) Base conversion method: When the original value is a number, it can be converted to a different base. For example, a decimal original value can be converted to hexadecimal to form the hash. To make all hash values the same length, the high-order digits can be dropped.
(4) Data rearrangement method: This method simply reorders the digits of the original value. For example, the third to sixth digits can be reversed, and the rearranged digits used as the hash value.
Hash functions are not universal. A hash function that gives good results in a database, for example, may be unusable in cryptography or error checking. Several well-known hash functions are used in cryptography, including MD2, MD4, and MD5. The hash value produced for a digital signature is called a message digest. There is also the Secure Hash Algorithm (SHA), a standard algorithm that produces a larger (160-bit) message digest and is similar to MD4.
The hash value of a file
As is widely known, emule is based on P2P (short for peer-to-peer, referring to software for client-to-client file transfer on a peer-to-peer network). It uses the Multisource File Transfer Protocol (MFTP). The protocol defines a series of settings for transmission and compression, and these make each file unique and traceable across the network. The MD5-Hash digital digest of a file is computed with a hash function. Regardless of the file's length, the hash function computes a fixed-length number. Unlike an encryption algorithm, a hash algorithm is an irreversible one-way function. With a highly secure hash algorithm such as MD5 or SHA, it is almost impossible for two different files to produce the same hash result, so any modification to a file can be detected. When we place a file in emule for sharing, emule automatically computes the file's hash value with its hash algorithm. This value is the file's unique identity: it summarizes the file's basic information and is then submitted to the connected server.
When someone requests this file for download, the hash value lets them confirm that the file being downloaded is the one they want. The value becomes even more important once other attributes of the file, such as its name, have changed. The server also provides the address, port, and other details of the user holding the file, so that emule knows where to download it from. Typically we first search for a file. With this information in hand, emule sends a request to the connected server asking for the file with that hash value, and the server returns information about the users who hold it. Our client can then talk directly to a user who has the file to see whether it can be downloaded from them. A file's hash value in emule is fixed and unique, and serves as a digest of the file's information. The hash value is the same no matter whose machine the file is on and no matter how much time has passed. When we download and upload files, emule uses this value to identify the file.
Hash files
The emule log often shows that emule is hashing a file. This is the file verification function of the hash algorithm, part of which was described earlier in the article. The full process is quite complicated, and the same basic principle is used in software such as ftp and bt. Emule transfers files in blocks, and each transferred block must be compared and verified; if a block is wrong, it must be downloaded again. In the meantime the relevant information is written to the met file until the whole task is complete. At that point the part file is renamed and moved to the incoming folder with a move command, and the met file is deleted automatically. This is why we sometimes encounter a hash file failure: the error means that the information in the met file cannot be matched with the part file.
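The block-by-block verification described above can be sketched in a few lines. This is not emule's actual file format or algorithm: the tiny block size, the use of SHA-256, and the function names are all assumptions chosen to keep the example short. The point is only that hashing each block separately lets a client identify which blocks must be downloaded again.

```python
import hashlib
from typing import List

BLOCK_SIZE = 4  # bytes; unrealistically small, for demonstration only


def block_hashes(data: bytes, block_size: int = BLOCK_SIZE) -> List[str]:
    """Split data into fixed-size blocks and hash each block separately."""
    return [
        hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]


def corrupted_blocks(received: bytes, expected: List[str],
                     block_size: int = BLOCK_SIZE) -> List[int]:
    """Return the indices of blocks whose hash does not match the expected one."""
    actual = block_hashes(received, block_size)
    return [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]


original = b"abcdefghijkl"
published = block_hashes(original)       # what the sharer would publish
received = b"abcdXfghijkl"               # one byte damaged in the second block

print(corrupted_blocks(received, published))   # indices of blocks to fetch again
```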
In addition, emule sometimes hashes intensively at startup. There are two cases. One is the first run, when information must be extracted from every file. The other is when the previous session was shut down improperly, so that error recovery has to be performed this time.
Research on hash algorithms has always been at the frontier of information science, and its importance has grown with the spread of network technology. The security checks behind the information we exchange on the Internet every day, and key mechanisms of the operating systems we use, rely on hashing. For anyone interested in information security it is a key that opens up the field, and it is also a focus of research in the hacker world.
Userhash
The userhash is similar to the above. When we use emule for the first time, emule automatically generates a value, which is also unique and serves as our identity in the emule world. As long as you do not uninstall emule or delete config, your userhash never changes. The credit system works through this value: credits in emule are stored and identified by it, and it is independent of your id and your user name. However you change those, your userhash stays the same, which helps keep the system fair. The userhash is also an information digest, but what it summarizes is not a file but each user.
Hash table
The hash table is a major application of hash functions. With a hash table, data records can be found quickly by keyword. (Note: a keyword is not secret like a key used in encryption, but both are used to "unlock" or access data.) In an English dictionary, for example, the keywords are English words, and the associated records contain the definitions of those words. In this case, the hash function must map a string of letters to an index into the hash table's internal array.
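The dictionary example can be made concrete with a small sketch. The class below maps an English word to an index into an internal array and resolves collisions by keeping a short list in each slot (the chain address method). The class name, the table size, and the polynomial string hash with multiplier 31 are assumptions made for the illustration, not details given in the text.

```python
class ChainedHashTable:
    """Word -> definition table using separate chaining (illustrative only)."""

    def __init__(self, size: int = 101):
        self.buckets = [[] for _ in range(size)]

    def _index(self, word: str) -> int:
        # Polynomial hash over the characters, reduced modulo the table size.
        h = 0
        for ch in word:
            h = (h * 31 + ord(ch)) % len(self.buckets)
        return h

    def put(self, word: str, definition: str) -> None:
        bucket = self.buckets[self._index(word)]
        for i, (existing, _) in enumerate(bucket):
            if existing == word:              # overwrite an existing entry
                bucket[i] = (word, definition)
                return
        bucket.append((word, definition))     # new entry, possibly a synonym

    def get(self, word: str):
        for existing, definition in self.buckets[self._index(word)]:
            if existing == word:
                return definition
        return None                           # word is not in the table


table = ChainedHashTable()
table.put("hash", "to chop into small pieces")
table.put("table", "a piece of furniture with a flat top")
print(table.get("hash"), "|", table.get("missing"))
```

A lookup touches only the one bucket selected by the hash, so its cost depends on how many synonyms share that bucket rather than on the total number of words stored.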
The nearly unattainable ideal for a hash table's hash function is to map every key to a unique index (see perfect hashing), because that guarantees direct access to every record in the table. A good hash function (including most cryptographic hash functions) has output that is uniformly distributed and effectively random, so on average it takes only one or two probes (depending on the load factor) to find the target. Equally important, a random hash function is very unlikely to have a very high collision rate. A small number of collisions, which can be estimated, is nevertheless practically unavoidable (see birthday paradox). In many cases, heuristic hash functions produce fewer collisions than random hash functions. A heuristic function exploits the similarity between similar keywords. For example, a heuristic function can be designed so that file names such as FILE0000.CHK, FILE0001.CHK, FILE0002.CHK, and so on map to consecutive slots in the table, which means such a sequence causes no collisions. By contrast, a hash function that beats a random one on a good set of keywords often performs much worse on a bad set. Such bad sets of keywords occur naturally, not only in attacks. A hash table with a poorly performing hash function sees its lookups degenerate into time-consuming linear searches.
Breaking of MD5 and SHA-1
On August 17, 2004, at the International Cryptography Conference held in Santa Barbara, California, Professor Wang Xiaoyun of Shandong University presented, for the first time at an international conference, the results she and her research team had obtained in attacking four well-known cryptographic algorithms: MD5, HAVAL-128, MD4, and RIPEMD. In February of the following year, SHA-1 was broken as well.
Linux commands: hash
The hash command is used to display, add to, and clear the hash table. The syntax of the command is shown below.
Syntax
hash [-l] [-r] [-p <path> <name>] [-t <command>]
Option description
Option              Description
-l                  Display the hash table, including paths
-r                  Clear the hash table
-p <path> <name>    Add an entry to the hash table
-t <command>        Display the full path of the specified command
HASH command
hash: display a # each time the data in the data buffer is transferred