Hash Tables in Detail

Source: Internet
Author: User
Tags md5 hash rfc sha1 file transfer protocol

A hash table is a data structure that is accessed directly by key. It locates a record by mapping the key to a position in the table, which speeds up lookups. The mapping function is called a hash function, and the array that holds the records is called the hash table. More formally: given a table M and a function f(key), if substituting any key into f yields the address in M of the record containing that key, then M is a hash table and f(key) is a hash function.
Catalogue
    1. Basic concepts
    2. Common methods
    3. Handling conflicts
    4. Find performance
    5. Practical application
    6. String
Basic Concepts
  • If a record's key is k, the record is stored at location f(k), so it can be retrieved directly without any comparisons. The correspondence f is called a hash function, and a table built on this idea is a hash table.
  • Different keys may yield the same hash address: k1 ≠ k2 but f(k1) = f(k2). This is called a collision, and keys with the same function value are called synonyms under that hash function. In summary, a hash function f(k) together with a collision-handling method maps a set of keys onto a finite, contiguous range of addresses, and the "image" of each key in that range is used as the storage location of its record. The resulting table is a hash table, the mapping process is called hashing (or building the hash table), and the resulting storage location is called a hash address.
  • If, for every key in the key set, the hash function is equally likely to map it to any address in the address set, the function is called a uniform hash function. Hashing a key then yields what is effectively a "random address", which reduces collisions. [1]
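To make collisions and synonyms concrete, here is a minimal C sketch. The hash function f(k) = k mod 7, the table size, and the names `f` and `is_synonym` are illustrative assumptions, not something the text above prescribes.

```c
#include <stdbool.h>

/* Hypothetical table size, chosen only for illustration. */
#define TABLE_SIZE 7

/* f(k): map a non-negative integer key to an address in [0, TABLE_SIZE). */
unsigned f(unsigned k)
{
    return k % TABLE_SIZE;
}

/* Two distinct keys are synonyms when they hash to the same address. */
bool is_synonym(unsigned k1, unsigned k2)
{
    return k1 != k2 && f(k1) == f(k2);
}
```

With this choice of f, the keys 10 and 17 both map to address 3 and are therefore synonyms, while 10 and 11 map to different addresses.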
Common Methods
A hash function makes access to a data set more efficient, because each element can be located quickly by computing its hash. Different hash functions are used in practice depending on the situation. The factors usually considered are:
  • the time required to compute the hash function
  • the length of the keys
  • the size of the hash table
  • the distribution of the keys
  • how often records are looked up

1. Direct addressing: use the key itself, or a linear function of the key, as the hash address. That is, H(key) = key or H(key) = a·key + b, where a and b are constants (such a hash function is called an identity function). If the slot H(key) is already occupied, move on to the next slot until an empty one is found, and store the record there.

2. Digit analysis: analyze the set of keys first. Take the birth dates of a group of employees: the leading digits (the year) are largely the same, so using them would make collisions very likely, whereas the digits for the month and day vary widely. Building the hash address from those later digits significantly reduces the chance of a collision. Digit analysis therefore means finding the regularities in the key digits and using them to construct hash addresses with a low probability of collision.

3. Mid-square method: when it is not possible to tell which digits of the key are evenly distributed, first square the key and then take the middle digits of the square, as many as needed, as the hash address. This works because the middle digits of the square depend on every digit of the key, so different keys are more likely to produce different hash addresses. [2]

Example: take each letter's position in the English alphabet as its internal code. K is 11, E is 05, Y is 25, A is 01, and B is 02, so the internal code of the keyword "keya" is 11052501; the codes of "kyab", "akey", and "bkey" are obtained the same way. After squaring the code, take the 7th to 9th digits (counting from the left of the 15-digit square, padded with leading zeros where necessary) as the keyword's hash address, as shown in the table below:
Keyword   Internal code   Square of internal code   Hash address H(k)
keya      11052501        122157778355001           778
kyab      11250102        126564795010404           795
akey      01110525        001233265775625           265
bkey      02110525        004454315775625           315
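The table can be checked with a short C sketch of the mid-square method. The function names `internal_code` and `mid_square` are hypothetical, and the code assumes keywords of at most four ASCII letters so that the square fits in 15 decimal digits. Digits 7 to 9 from the left of a 15-digit number are obtained by dividing by 10^6 and taking the remainder modulo 1000.

```c
#include <ctype.h>
#include <stdint.h>

/* Encode each letter as its two-digit alphabet position (a=01 ... z=26)
   and concatenate the pairs, e.g. "keya" -> 11052501.
   Assumes the word contains only ASCII letters, at most four of them. */
uint64_t internal_code(const char *word)
{
    uint64_t code = 0;
    for (; *word; ++word)
        code = code * 100 + (uint64_t)(tolower((unsigned char)*word) - 'a' + 1);
    return code;
}

/* Square the internal code and return digits 7..9 (1-based, counted from
   the left of the 15-digit, zero-padded square) as the hash address. */
unsigned mid_square(const char *word)
{
    uint64_t code = internal_code(word);
    uint64_t square = code * code;
    return (unsigned)((square / 1000000ULL) % 1000ULL);
}
```

Under these assumptions, `mid_square("keya")` yields 778 and `mid_square("bkey")` yields 315, matching the hash addresses listed in the table.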
[2]

4. Folding method: split the key into several parts with the same number of digits (the last part may be shorter), then take the sum of these parts, discarding any carry out of the top digit, as the hash address. There are two ways to add the parts: shift folding and boundary folding. Shift folding aligns the lowest digits of all parts and adds them. Boundary folding folds the key back and forth along the part boundaries, from one end to the other, and then aligns and adds.

5. Random number method: choose a random function and use the random value of the key as its hash address. This is usually used when keys differ in length.

6. Division (remainder) method: take the remainder of the key divided by some number p not greater than the table length m as the hash address: H(key) = key MOD p, p ≤ m. The modulo can be applied to the key directly, or after folding or the mid-square operation. The choice of p matters: p is generally taken to be a prime or m, and a poorly chosen p easily produces synonyms. [3]

Handling Conflicts
1. Open addressing: H_i = (H(key) + d_i) MOD m, i = 1, 2, ..., k (k ≤ m-1), where H(key) is the hash function, m is the table length, and d_i is the increment sequence. Three increment sequences are possible:
1.1. d_i = 1, 2, 3, ..., m-1, called linear probing;
1.2. d_i = 1^2, -1^2, 2^2, -2^2, 3^2, ..., ±k^2 (k ≤ m/2), called quadratic probing;
1.3. d_i = a pseudo-random number sequence, called pseudo-random probing.
2. Rehashing: H_i = RH_i(key), i = 1, 2, ..., k, where the RH_i are different hash functions. When synonyms collide on an address, another hash function is computed, and so on until no collision occurs. This method is less prone to clustering but increases computation time.
3. Separate chaining (the "zipper" method).
4. Creating a common overflow area.

Find Performance
The lookup process for a hash table is essentially the same as the process of building it. Some keys can be found directly at the address produced by the hash function; others collide at that address and must be located using the collision-handling method. In the collision-handling methods described above, the search after a collision is still a process of comparing the given value with keys. The efficiency of hash table lookup is therefore still measured by the average search length.

During a lookup, the number of key comparisons depends on how many collisions occur: fewer collisions mean higher lookup efficiency, and more collisions mean lower efficiency. So the factors that affect the number of collisions are the factors that affect lookup efficiency. There are three:
1. whether the hash function is uniform;
2. the method of handling conflicts;
3. the load factor of the hash table.

The load factor is defined as α = (number of elements in the table) / (length of the hash table). α measures how full the table is. Since the table length is fixed, α is proportional to the number of elements in the table: the larger α is, the more elements the table holds and the more likely collisions become; the smaller α is, the less likely they are. In fact, the average search length of a hash table is a function of the load factor α, though the function differs for different collision-handling methods.

With the basic definition of hashing understood, some well-known hash algorithms deserve mention. MD5 and SHA-1 are among the most widely used hash algorithms, and both are based on the design of MD4.
So what are they? Here is a quick look:

⑴ MD4
MD4 (RFC 1320) was designed by Ronald L. Rivest of MIT in 1990; MD stands for Message Digest. It is designed for fast software implementation on 32-bit processors and is built from bitwise operations on 32-bit operands.

⑵ MD5
MD5 (RFC 1321) is Rivest's 1991 improvement on MD4. It still processes the input in 512-bit blocks, and its output is the concatenation of four 32-bit words, the same as MD4. MD5 is more complex than MD4 and somewhat slower, but it is more secure, with better resistance to cryptanalysis and to differential attacks.

⑶ SHA-1 and others
SHA-1 was designed by NIST and the NSA for use with DSA. For inputs shorter than 2^64 bits it produces a 160-bit hash value, and so offers better resistance to brute-force attacks. The design of SHA-1 is based on the same principles as MD4 and is modeled on that algorithm.

So what are these hash algorithms used for? In information security, hash algorithms are applied mainly in the following three areas:

⑴ File checksums
Parity checks and CRC checks are the more familiar checksum algorithms. Neither can resist tampering: they can detect channel errors during transmission to some extent, but they cannot prevent malicious modification of the data. The "digital fingerprint" property of MD5 has made it the most widely used file-integrity checksum algorithm, and many Unix systems provide a command for computing MD5 checksums.

⑵ Digital signatures
Hash algorithms are also an important part of modern cryptosystems. Because asymmetric algorithms are slow, one-way hash functions play an important role in digital signature protocols. Signing a hash value, also known as a "digital digest", can be considered statistically equivalent to signing the file itself. Such a protocol has other advantages as well.
⑶ Authentication protocols
This kind of authentication protocol is also called the challenge-response mode. It is a simple and secure method when the transmission channel can be eavesdropped on but not tampered with.

The breaking of MD5 and SHA-1
At the international cryptology conference held in Santa Barbara, California, on August 17, 2004, Professor Wang Xiaoyun of Shandong University first announced the results of her research team: the breaking of four well-known cryptographic algorithms, MD5, HAVAL-128, MD4, and RIPEMD. In February 2005 a break of SHA-1 was announced.

Practical Application
The above is some basic background on hashing and related topics. So what role does it play in eMule?

As is well known, eMule is peer-to-peer (P2P) software. It uses the "Multisource File Transfer Protocol" (MFTP). The protocol defines a set of standards for transmission, compression, and packaging, as well as for credits, and eMule computes an MD5-Hash for each file, making the file unique and traceable across the network.

What is the hash value of a file?
The MD5-Hash of a file is its digital digest, computed by a hash function. Regardless of the file's length, the hash function produces a fixed-length number. Unlike encryption algorithms, a hash algorithm is an irreversible one-way function. With a high-security hash algorithm such as MD5 or SHA, it is almost impossible for two different files to produce the same hash result, so any modification to a file can be detected.

When we put a file into eMule for sharing, eMule automatically generates the file's hash value using the hash algorithm. This value is the file's unique identifier; it contains the file's basic information and is submitted to the connected server.
When someone requests a download of the file, the hash value lets them know whether the file being downloaded is the one they want. This matters especially when other attributes of the file (such as its name) have been changed. The server also provides the addresses, ports, and other information of the users who currently hold the file, so eMule knows where to download it from.

In general, when we search for a file, eMule takes this information and sends the server a request for files with the same hash value. The server returns information about the users who hold the file, and our client can then communicate directly with those users to see whether the file can be downloaded from them.

A file's hash value in eMule is fixed and unique. It is equivalent to a message digest of the file: no matter whose machine the file is on and no matter how much time passes, the value stays the same. When we upload or download a file, eMule uses this value to identify it.

So what is the userhash?
Similarly, when we first use eMule, it automatically generates a value that is unique and serves as our identity in the eMule world. As long as you do not uninstall eMule or delete its config, your userhash never changes. The credit system works through this value: eMule stores credits and recognizes identities by userhash, which is independent of your ID and user name. However you change those, your userhash stays the same, which helps ensure fairness. It too is in effect a digest, not of a file's information but of each user's.
So what does "hashing a file" mean?
We often see in the eMule log that eMule is hashing a file. This uses the file-checksum function of the hash algorithm described earlier. It is actually a fairly complex process, and FTP, BT, and similar software use the same basic principle. eMule transfers files in blocks, and each block is checked after transfer; if a block is wrong, it is downloaded again. During this time the relevant information is written to the met file. When the whole task is complete, the part file is renamed and moved to the Incoming folder, and the met file is deleted automatically. So we sometimes encounter a hash failure, meaning the information in the met file is wrong and cannot be matched with the part file. Sometimes eMule also hashes heavily at startup. There are two cases: one is the first run, when all file information has to be hashed; the other is when the computer was not shut down properly last time, in which case eMule is running an error check.

Research on hash algorithms has long been a frontier of information science. With network technology so widespread today, its importance is ever more prominent: hashing is at work in the security verification we perform on the Internet every day and in the key mechanisms of the operating systems we use. For anyone interested in information security, it is a key that opens up the field, and it is also a focus of research in the hacker world.

In ordinary linear tables and trees, the position of a record in the structure is arbitrary and has no fixed relationship to the record's key, so finding a record requires a series of comparisons against keys. Such search methods are based on comparison, and their efficiency depends on the number of comparisons.
Ideally, the desired record could be found directly, without any comparison. This requires a fixed correspondence f between a record's storage location and its key, so that each key corresponds to a unique storage location in the structure. A lookup then only needs to compute the image f(K) of the given value K under f. If the structure contains a record whose key equals K, it must be at storage location f(K), so the record can be obtained directly without comparisons. The correspondence f is called a hash function, and a table built on this idea is a hash table.

Collisions are unavoidable in a hash table: different keys may yield the same hash address, that is, key1 ≠ key2 but hash(key1) = hash(key2). Keys with the same function value are called synonyms under that hash function. Therefore, building a hash table requires not only a good hash function but also a method for handling conflicts. A hash table can be described as follows: using a chosen hash function H(key) and a chosen collision-handling method, a set of keys is mapped onto a finite, contiguous address set (interval), and the image of each key in the address set is used as the storage location of the corresponding record. Such a table is called a hash table.

For dynamic lookup tables, 1) the table length is not fixed in advance, and 2) when the table is designed, only the range of the keys is known, not the exact keys. In general, therefore, a function f(key) is established that gives the position in the table of the record with that key; this function f(key) is usually called a hash function.
(Note: this function is not necessarily a mathematical function.) A hash function is a mapping from the set of keys to the set of addresses. Its definition is flexible, as long as the size of the address set does not exceed the allowable range. In practice a hash function has to be constructed, and a well-constructed one works well. Purposes: encryption and conflict resolution. Hash functions are very widely used; the software given in the source as "bit elf" (probably the BitSpirit client) uses one, for example. For details, consult a textbook on data structures and algorithms. The well-known ELFhash algorithm:
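/* MOD is not defined in the original listing; a prime table size is assumed here. */
#define MOD 1000003UL

int ELFhash(const char *key)
{
    unsigned long h = 0;
    while (*key) {
        /* Shift in the next character, 4 bits at a time. */
        h = (h << 4) + (unsigned char)*key++;
        /* If the top 4 of the low 32 bits are set, fold them back in. */
        unsigned long g = h & 0xF0000000UL;
        if (g)
            h ^= g >> 24;
        h &= ~g;
    }
    return (int)(h % MOD);
}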
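As one possible way to put the pieces together, the following C sketch combines an ELF-style string hash like the one above with the open addressing method (linear probing, d_i = 1, 2, ..., m-1) from the Handling Conflicts section, and reports the load factor α. The table length, the fixed-size global table, and the names `elf_hash`, `table_insert`, `table_find`, and `load_factor` are assumptions made for this example only; deletion is deliberately left out.

```c
#include <stddef.h>
#include <string.h>

#define TABLE_LEN 11  /* hypothetical table length m, kept small for illustration */

/* Slots hold pointers to caller-owned strings; NULL marks an empty slot. */
static const char *slots[TABLE_LEN];
static size_t count;

/* ELF-style string hash reduced modulo the table length. */
static unsigned elf_hash(const char *key)
{
    unsigned long h = 0;
    while (*key) {
        h = (h << 4) + (unsigned char)*key++;
        unsigned long g = h & 0xF0000000UL;
        if (g)
            h ^= g >> 24;
        h &= ~g;
    }
    return (unsigned)(h % TABLE_LEN);
}

/* Probe (H(key) + i) mod m for i = 0 .. m-1.
   Returns the slot index holding key, or -1 if it is absent. */
int table_find(const char *key)
{
    unsigned h = elf_hash(key);
    for (unsigned i = 0; i < TABLE_LEN; ++i) {
        unsigned idx = (h + i) % TABLE_LEN;
        if (slots[idx] == NULL)
            return -1;               /* an empty slot ends the probe sequence */
        if (strcmp(slots[idx], key) == 0)
            return (int)idx;
    }
    return -1;
}

/* Insert key using the same probe sequence.
   Returns the slot index used, or -1 if the table is full. */
int table_insert(const char *key)
{
    unsigned h = elf_hash(key);
    for (unsigned i = 0; i < TABLE_LEN; ++i) {
        unsigned idx = (h + i) % TABLE_LEN;
        if (slots[idx] == NULL) {
            slots[idx] = key;
            ++count;
            return (int)idx;
        }
        if (strcmp(slots[idx], key) == 0)
            return (int)idx;         /* key already present */
    }
    return -1;
}

/* Load factor: number of stored elements divided by the table length. */
double load_factor(void)
{
    return (double)count / TABLE_LEN;
}
```

A lookup follows exactly the probe sequence the insertion used, which is why the text describes searching as essentially the same process as building the table. As the load factor approaches 1, probe sequences get longer, in line with the discussion of α above.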

