Hash Lookup
The lookup algorithms discussed so far have a time complexity of O(n) or O(log2 n); their efficiency depends on the number of comparisons.
Even for a lookup table organized as a sorted tree, each comparison in which the key is not equal to the data element has two possible outcomes, "greater than" or "less than", so the next step can go in one of two directions. O(log2 n) is therefore already optimal for comparison-based lookup.
A hash table uses a different approach: in the best case its lookup time reaches O(1), that is, once the key is given, the element can be found immediately.
Hash tables are used widely in practice, and many programming languages have hash lookup built in, for example Java's Hashtable and HashSet.
Basic ideas
Establish a deterministic correspondence between the storage address of a record and its key, so that the element being looked up can be found with a single access and without any comparison.
Hash function f: key → storage location. That is, the hash function is the correspondence between a key and its storage location (the hash address); given the key, this function yields the storage location.
For example, given an array that holds 10 elements
[0 1 2 3 4 5 6 7 8 9]
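Given the hash function f(K) = K mod 10, the array index of the element for a key is obtained immediately.
Given the hash function f(K) = K mod 2, the search range is also narrowed, but not nearly as effectively as with the function above.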
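The difference between the two functions can be checked with a small sketch. It assumes the keys are the integers 0 to 9 from the array above; the helper name group_by_address is made up for this illustration.

```python
# Assumption: the keys are the ten integers 0..9 from the example array.
keys = list(range(10))

def f10(k):
    """f(K) = K mod 10: every key gets its own address."""
    return k % 10

def f2(k):
    """f(K) = K mod 2: keys are only split into two groups."""
    return k % 2

def group_by_address(hash_fn, keys):
    """Map each hash address to the list of keys that land on it."""
    groups = {}
    for k in keys:
        groups.setdefault(hash_fn(k), []).append(k)
    return groups

slots_mod10 = group_by_address(f10, keys)
slots_mod2 = group_by_address(f2, keys)

print(len(slots_mod10))  # 10 addresses, one key each
print(slots_mod2)        # {0: [0, 2, 4, 6, 8], 1: [1, 3, 5, 7, 9]}
```

With K mod 10 a lookup goes straight to the element; with K mod 2 it still has to search among five candidates.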
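Hash function collisions
If different keys are mapped by the hash function to the same hash address, this is called a collision (ki ≠ kj, H(ki) = H(kj)).
Synonyms: two different keys that have the same function value are called synonyms of that hash function.
A hash function cannot avoid collisions completely; it can only reduce them. In other words, a good hash function spreads the hash addresses of the mapped keys as uniformly as possible.
When designing a hash table, handling collisions effectively is just as important as choosing a good hash function.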
Hash table:
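Using a chosen hash function H(key) and a collision handling method, a set of keys is mapped onto a finite, contiguous address range, and the image of each key is used as the storage location of its record in the table. This mapping process is called building the hash table, or hashing. The resulting storage location is called the hash address.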
The construction of the hash function
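A hash function is a mapping, and its definition is very flexible: it only has to make the hash value of every key fall within the range allowed by the table length. The main criteria for judging how good a hash function is are: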
The structure of the hash function is simple;
The addresses obtained by mapping keys through the hash function are equally probable, that is, uniform; such a hash function is called a uniform hash function.
A good hash function reduces collisions between keys.
Direct addressing method: use the key itself, or a linear function of the key, as the hash address, that is, H(key) = key or H(key) = a × key + b, where a and b are constants. With direct addressing the address set has the same size as the key set, so no collisions occur. It is nevertheless seldom used in practice, because the distribution of the keys must be known in advance.
Digit analysis method: take several digits of the key, or a combination of them, as the hash address.
Mid-square method: square the key and take the middle digits of the result as the hash address.
Folding method: split the key into several parts with the same number of digits (the last part may be shorter), then take the sum of these parts as the hash address.
Division method (modulo): take the remainder of the key divided by a number p that is not greater than the table length m as the hash address: H(key) = key mod p, p ≤ m. The crucial point in this method is the choice of p; a poorly chosen p readily produces synonyms. As a rule of thumb, for a table of length m, p is usually a prime less than or equal to m (preferably close to m), or a composite number with no prime factor less than 20 (for example, 23 × 29).
Random number method: H(key) = random(key), that is, the value of a pseudo-random function of the key is used as the hash address. This method is appropriate when the keys have unequal lengths.
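Several of these constructions can be sketched in a few lines. The function names and the default parameters (two middle digits, three-digit parts) are choices made for this illustration only, and keys are assumed to be non-negative integers.

```python
def direct_address(key, a=1, b=0):
    """Direct addressing: H(key) = a * key + b."""
    return a * key + b

def division_hash(key, p):
    """Division method: H(key) = key mod p, with p <= table length."""
    return key % p

def mid_square_hash(key, digits=2):
    """Mid-square method: square the key, keep the middle digits."""
    squared = str(key * key)
    start = max((len(squared) - digits) // 2, 0)
    return int(squared[start:start + digits])

def folding_hash(key, part_len=3):
    """Folding method: split the decimal digits into parts and add them.

    The last part may be shorter. The sum can exceed the table range,
    so in practice it would still be reduced to a valid address.
    """
    text = str(key)
    parts = [int(text[i:i + part_len]) for i in range(0, len(text), part_len)]
    return sum(parts)

print(direct_address(5, a=2, b=3))  # 13
print(division_hash(100, 13))       # 9
print(mid_square_hash(1234))        # 1234 * 1234 = 1522756 -> 22
print(folding_hash(12345678))       # 123 + 456 + 78 = 657
```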
Summary: choose the hash function to suit the specific problem, taking the following into account:
Computation time of the hash function
Key length
Hash table size (hash address range)
Distribution of the keys
Record lookup frequency
Collision handling methods
Collision: the location at the hash address computed from the key already holds a record.
A hash function can only make the hash addresses uniformly distributed; it cannot avoid collisions, so collision handling is unavoidable.
The basic idea of collision handling: the process may produce an address sequence Hi, i = 1, 2, ..., k. Each time a hash address Hi is obtained and still collides, the chosen method computes the next hash address Hi+1, until an address that does not collide is found.
In other words, handling a collision means finding another storage location for the colliding element.
Open addressing
The collision handling function is
Hi = (H(key) + di) mod m,
i = 1, 2, ..., k (k ≤ m-1)
where H(key) is the hash function, m is the length of the hash table, and di is the increment sequence.
There are three ways to choose di:
di = 1, 2, 3, ..., m-1, called linear probing;
di = 1², -1², 2², -2², ..., k², -k² (k ≤ m/2), called quadratic probing;
di is a pseudo-random number sequence, called pseudo-random probing.
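The formula Hi = (H(key) + di) mod m with linear increments can be sketched as follows. This is a minimal illustration, not a complete table: it assumes integer keys, uses H(key) = key mod m, marks free slots with None, and does not support deletion. The names probe_sequence, insert and search are hypothetical.

```python
EMPTY = None  # marker for an unused slot (assumption of this sketch)

def probe_sequence(key, m):
    """Yield H(key), then Hi = (H(key) + di) mod m for di = 1, ..., m-1."""
    home = key % m              # H(key): division method, for illustration
    for d in range(m):          # d = 0 is the home address itself
        yield (home + d) % m

def insert(table, key):
    """Store key in the first free slot of its probe sequence."""
    for addr in probe_sequence(key, len(table)):
        if table[addr] is EMPTY or table[addr] == key:
            table[addr] = key
            return addr
    raise RuntimeError("hash table is full")

def search(table, key):
    """Return the slot holding key, or None if the key is absent."""
    for addr in probe_sequence(key, len(table)):
        if table[addr] is EMPTY:
            return None         # an empty slot ends the probe sequence
        if table[addr] == key:
            return addr
    return None

table = [EMPTY] * 11
for k in (22, 33, 44, 12, 23):
    insert(table, k)

print(table[:5])          # [22, 33, 44, 12, 23]
print(search(table, 23))  # 4
print(search(table, 55))  # None
```

Keys 22, 33 and 44 all hash to address 0, so they fill slots 0, 1 and 2. Keys 12 and 23 hash to address 1 and are pushed further along to slots 3 and 4.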
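Rehashing method
Construct several hash functions. When a collision occurs, use a different hash function to compute the next hash address, until no collision occurs. That is: Hi = RHi(key), i = 1, 2, ..., k, where the RHi are a set of different hash functions. On the first collision RH1 is used, on the second collision RH2, and so on, until some Hi no longer collides.
Advantage: clustering is unlikely to occur.
Disadvantage: increased computation time.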
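A rough sketch of rehashing is shown below. The three functions h, rh1 and rh2 are invented purely for illustration (the text does not prescribe any particular family), keys are assumed to be non-negative integers, and deletion is not handled.

```python
def h(key, m):
    """Initial hash function H(key)."""
    return key % m

def rh1(key, m):
    """RH1, used after the first collision (illustrative choice)."""
    return (key // m) % m

def rh2(key, m):
    """RH2, used after the second collision (illustrative choice)."""
    return (key * 7 + 3) % m

HASH_FUNCTIONS = (h, rh1, rh2)

def insert_rehash(table, key):
    """Try each hash function in turn until a free slot is found."""
    m = len(table)
    for fn in HASH_FUNCTIONS:
        addr = fn(key, m)
        if table[addr] is None or table[addr] == key:
            table[addr] = key
            return addr
    raise RuntimeError("every hash function collided")

def search_rehash(table, key):
    """Follow the same sequence of hash functions as insertion."""
    m = len(table)
    for fn in HASH_FUNCTIONS:
        addr = fn(key, m)
        if table[addr] is None:
            return None
        if table[addr] == key:
            return addr
    return None

table = [None] * 11
for k in (15, 26, 48):
    insert_rehash(table, k)

print(search_rehash(table, 26))  # 2
print(search_rehash(table, 48))  # 9
print(search_rehash(table, 59))  # None
```

All three keys hash to address 4 under h. Key 26 is placed by rh1 at address 2, while 48 collides again under rh1 and is placed by rh2 at address 9.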
Chaining method
Method: store all records whose keys are synonyms (that is, have the same hash address) in one singly linked list, and keep the head pointers of these lists in a one-dimensional array.
If the hash table length is m, define a one-dimensional array of pointers:
Recnode *linkhash[m];
where Recnode is the node type and every element is initially null.
All records with hash address k are inserted into the linked list whose head pointer is linkhash[k].
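The linkhash array of head pointers can be modelled roughly as below. The sketch is in Python rather than C, assumes integer keys with H(key) = key mod m, and inserts new nodes at the head of the list; the function names are hypothetical.

```python
class RecNode:
    """One node in a singly linked list of synonyms."""
    def __init__(self, key, next_node=None):
        self.key = key
        self.next = next_node

def make_table(m):
    """linkhash: m head pointers, all initially empty."""
    return [None] * m

def chain_insert(linkhash, key):
    """Insert key into the list headed by linkhash[H(key)]."""
    k = key % len(linkhash)
    node = linkhash[k]
    while node is not None:
        if node.key == key:
            return                           # key already stored
        node = node.next
    linkhash[k] = RecNode(key, linkhash[k])  # new node becomes the head

def chain_search(linkhash, key):
    """Return the node holding key, or None if the key is absent."""
    node = linkhash[key % len(linkhash)]
    while node is not None:
        if node.key == key:
            return node
        node = node.next
    return None

def chain_keys(linkhash, k):
    """List the keys stored in chain k, head first."""
    keys, node = [], linkhash[k]
    while node is not None:
        keys.append(node.key)
        node = node.next
    return keys

linkhash = make_table(7)
for key in (7, 14, 3, 10):
    chain_insert(linkhash, key)

print(chain_keys(linkhash, 0))  # [14, 7]
print(chain_keys(linkhash, 3))  # [10, 3]
```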
Lookup in a hash table follows essentially the same procedure as construction: apply the hash function and the collision handling function until the lookup either succeeds or fails.
Hash Lookup Analysis
The hash lookup process shows that:
A hash table establishes a direct mapping between a key and the storage address of its record;
Because of collisions, a hash lookup still involves key comparisons, and its performance depends on the number of comparisons;
Hash lookup efficiency is therefore still evaluated by the ASL (average search length).
Hash function
The quality of the hash function affects how often collisions occur. A uniform hash function, however, makes collisions equally likely for any set of keys, so the hash function is not the decisive factor for the ASL.
Collision handling method
The collision handling methods introduced above (linear probing, quadratic probing, pseudo-random probing, rehashing, chaining) each give a different ASL.
The load factor, which measures how full the hash table is, also affects the ASL of the hash table:
α = number of records in the table / length of the hash table
The smaller the load factor, the less likely collisions are; the larger it is, the more likely they are and the more comparisons a lookup needs.
Conclusion: the average search length (ASL) of a hash table depends on the load factor, not on the number of records.
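The effect of the load factor (fill factor) can be observed with a small experiment. The sketch below is only an empirical illustration under stated assumptions: it uses linear probing with H(key) = key mod m, random integer keys from a seeded generator, and counts the probes needed to place each key, which equals the comparisons a later successful search for that key would need. The function name and the table size 1009 are arbitrary choices.

```python
import random

def average_search_length(m, load_factor, seed=0):
    """Average probes per successful search in a linear-probing table.

    Requires 0 < load_factor < 1 so that the table never fills up.
    """
    rng = random.Random(seed)
    n = int(m * load_factor)              # number of records to store
    keys = rng.sample(range(10 * m), n)   # n distinct random keys
    table = [None] * m
    total_probes = 0
    for key in keys:
        addr = key % m
        probes = 1
        while table[addr] is not None:    # collision: try the next slot
            addr = (addr + 1) % m
            probes += 1
        table[addr] = key
        total_probes += probes
    return total_probes / n

for alpha in (0.25, 0.5, 0.75, 0.9):
    print(alpha, round(average_search_length(1009, alpha), 2))
```

The printed averages should grow as the load factor approaches 1. Repeating the run with a larger m at the same load factor should give similar values, which is the sense in which the ASL depends on the load factor rather than on the number of records.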
Designing the hash function and the collision handling function are the two core problems of hash table design. When both are designed well, a hash table outperforms other lookup tables, which is why it is so widely used.