A survey of hash tables and hash functions

As society becomes increasingly information-driven, data has replaced computation as the center of information processing. Storage demand is growing explosively, and storage has become a bottleneck that urgently needs to be addressed. The hash table is an effective way to store massive amounts of information. This paper describes in detail the design of hash tables, collision resolution schemes, and dynamic hash tables. It also briefly introduces applications of hash functions in similarity matching, image retrieval, distributed caching, and cryptography.

After many years of development, there are now many high-performance hash functions and hash tables. This paper introduces the design principles of various hash functions and different hash table implementations, to help readers choose the most suitable hashing algorithm for the characteristics and performance requirements of their problem and design the best solution.

**Keywords: hash function; hash table; storage; cryptography; distributed cache.**

1. Hash table

1.1 Hash table

The hash table is the most important application of hash functions. A hash table is a data structure that implements an associative array and is widely used for fast lookup. A hash function computes the hash value of a key, and the key is stored in the bucket indexed by that hash value. Unlike data structures such as binary trees, stacks, and sequences, a hash table generally supports insertion, lookup, and deletion in O(1) time.

If a hash table T has m buckets and stores n elements, the load factor of T is n/m. It is an important parameter for evaluating a hash table.

A hash table is a classic tradeoff between time and space. If storage were unlimited, the key itself could be used as the storage address, and the access time would be O(1). Conversely, if access time were unlimited, sequential search would need only minimal storage. A hash table aims for O(1) lookup time while using as little storage as possible [1].

1.2 Conflicts

Because |U| > m, the pigeonhole principle [2] guarantees that some distinct keys must share the same hash value, and the birthday paradox [3] suggests that this is likely even when the table is far from full. Such an event is called a collision (conflict). When designing a hash table, one should on the one hand minimize the number of collisions by choosing the most appropriate hash function (described in detail below), and on the other hand design a collision resolution method suited to the actual application [4]. There are two main ways of resolving collisions.

**Chaining (link method) [1][4][11]**

Each bucket s is implemented with a second-level data structure, which can be an array, list, stack, tree, and so on, depending on the needs of the application. Every key x with h(x) = s is stored in bucket s. If |S| is large enough and the hash function is uniform enough that no bucket holds more than about 3 elements, the second-level data structure can be a simple list.
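The idea can be sketched in a few lines of Python. This is only an illustrative sketch: the class name `ChainedHashTable` and its methods are hypothetical, and Python's built-in `hash()` stands in for h(x).

```python
class ChainedHashTable:
    """Hash table that resolves collisions by chaining (one list per bucket)."""

    def __init__(self, num_buckets=8):
        self.buckets = [[] for _ in range(num_buckets)]
        self.size = 0

    def _index(self, key):
        # Built-in hash() plays the role of h(x); any hash function would do.
        return hash(key) % len(self.buckets)

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)  # overwrite an existing key
                return
        bucket.append((key, value))
        self.size += 1

    def get(self, key, default=None):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        return default

    def remove(self, key):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                del bucket[i]
                self.size -= 1
                return True
        return False

    def load_factor(self):
        # n / m: number of stored elements divided by number of buckets.
        return self.size / len(self.buckets)
```

With chaining the load factor may exceed 1, since each bucket can hold several elements.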

**Open addressing method [1][4][11]**

When position h(x) is already occupied, open addressing computes a probe sequence h_1(x), h_2(x), h_3(x), ... until an empty bucket is found. It needs no linked lists and no dynamic memory allocation, and its space utilization is high, but deletion is either unsupported or requires a more complex mechanism. There are three techniques for computing the probe sequence: linear probing, quadratic probing, and double hashing. The performance of linear probing and quadratic probing depends on the distribution of the keys and the quality of the hash function [8]. For general applications, double hashing performs better [9].

Deletion is cumbersome with open addressing. There are two ways to do it: 1) mark the slot as deleted instead of emptying it; 2) empty the slot and re-insert the key-value pairs that follow it, until an empty (NULL) slot is reached.
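A minimal sketch of open addressing with linear probing, using the first deletion strategy (a deletion mark, often called a tombstone). The names `LinearProbingTable`, `_EMPTY`, and `_DELETED` are hypothetical, and the table does not resize.

```python
_EMPTY = object()    # slot that has never been used
_DELETED = object()  # tombstone: slot was used, then deleted


class LinearProbingTable:
    def __init__(self, capacity=8):
        self.keys = [_EMPTY] * capacity
        self.values = [None] * capacity

    def _probe(self, key):
        # Probe sequence h(x), h(x)+1, h(x)+2, ... (mod capacity).
        start = hash(key) % len(self.keys)
        for step in range(len(self.keys)):
            yield (start + step) % len(self.keys)

    def put(self, key, value):
        first_free = None
        for i in self._probe(key):
            slot = self.keys[i]
            if slot is _EMPTY:
                if first_free is None:
                    first_free = i
                break  # key cannot be stored beyond an empty slot
            if slot is _DELETED:
                if first_free is None:
                    first_free = i  # remember a reusable tombstone
            elif slot == key:
                self.values[i] = value
                return
        if first_free is None:
            raise RuntimeError("table is full")
        self.keys[first_free] = key
        self.values[first_free] = value

    def get(self, key, default=None):
        for i in self._probe(key):
            slot = self.keys[i]
            if slot is _EMPTY:
                return default
            if slot is not _DELETED and slot == key:
                return self.values[i]
        return default

    def remove(self, key):
        for i in self._probe(key):
            slot = self.keys[i]
            if slot is _EMPTY:
                return False
            if slot is not _DELETED and slot == key:
                # Mark instead of emptying, so later probes keep going.
                self.keys[i] = _DELETED
                self.values[i] = None
                return True
        return False
```

The tombstone keeps lookups correct for keys stored further along the same probe sequence, and `put` can reuse tombstoned slots.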

**Comparison of open addressing and chaining (zipper method)**

(1) Linear probing and quadratic probing suffer from clustering, so their average search length is longer. Chaining handles collisions simply and has no clustering, so its average search length is shorter.

(2) Because the nodes of each linked list are allocated dynamically in chaining, it is better suited to applications where the size of the hash table cannot be determined before it is built.

(3) To maintain good performance, open addressing generally needs a small load factor α, so it wastes space when there are many nodes. Chaining can work with α ≥ 1, and when the nodes are large the extra pointer field is negligible. In addition, as α increases, the performance of open addressing degrades sharply, whereas chaining degrades more gently [11]. Chaining is therefore better suited to large-scale applications.

(4) Chaining needs additional space for pointers, so when the nodes are small, open addressing saves space. If the space saved on pointers is used to enlarge the hash table, the load factor becomes smaller, which reduces collisions in open addressing and thus improves the average search speed.

(5) With open addressing the whole hash table lies in one contiguous block of memory, so for small-scale applications open addressing (especially linear probing) has better cache performance.

(6) In a hash table built with chaining, if the chains are doubly linked lists, a node can be deleted in O(1) time. In a hash table built with open addressing, a deleted node's slot cannot simply be set to empty, because that would cut the probe sequences that pass through it. Deletion can only mark the node as deleted; it cannot actually remove the node [1][4].

Although hash tables have many different implementations, their principles are essentially the same: a hash table has a two-level logical structure, and the hash value is the entry point to the second level. When the hash function is good, searching the second level takes O(1) time, so the whole lookup takes O(1) time. In chaining the second level is the linked list of each bucket, and in open addressing the probe sequence can be regarded as the second level.

**Coalesced hashing [12]**

Coalesced hashing combines the advantages of open addressing and chaining. Compared with open addressing, it does not suffer from primary or secondary clustering. In addition, because of its locality, its cache performance is better, and it maintains good performance even when the load factor is relatively large. As with open addressing, however, deletion is expensive.

**Cuckoo hashing[13]**

The main principle is insertion by eviction: when a collision occurs, the element already in the bucket is pushed out and re-inserted at another position in the table. Its main advantage is that lookup time is O(1) in every case. The disadvantage is that when the table holds many elements, inserting a new element can trigger a long chain of re-insertions. The evictions may also form a loop, in which case the hash table has to be rebuilt.
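The eviction process can be sketched as follows. This is a simplified illustration with two tables: the class `CuckooHashTable`, the two crude hash functions derived from Python's `hash()`, and the `max_kicks` limit used to detect a probable loop are all assumptions of the sketch, and `None` is not supported as a key.

```python
class CuckooHashTable:
    def __init__(self, capacity=11, max_kicks=32):
        self.capacity = capacity
        self.max_kicks = max_kicks
        self.tables = [[None] * capacity, [None] * capacity]

    def _h(self, which, key):
        # Two hypothetical hash functions; real implementations use
        # stronger, independent functions.
        if which == 0:
            return hash(key) % self.capacity
        return (hash(key) // self.capacity) % self.capacity

    def contains(self, key):
        # Worst-case O(1): at most two slots are inspected.
        return any(self.tables[w][self._h(w, key)] == key for w in (0, 1))

    def insert(self, key):
        if self.contains(key):
            return
        which = 0
        for _ in range(self.max_kicks):
            i = self._h(which, key)
            # Place the key and pick up whatever occupied the slot.
            key, self.tables[which][i] = self.tables[which][i], key
            if key is None:
                return
            which = 1 - which  # the evicted key tries its other table
        # Too many evictions suggests a loop: rebuild with larger tables.
        self._rebuild(key)

    def _rebuild(self, pending):
        old = [k for t in self.tables for k in t if k is not None]
        old.append(pending)
        self.capacity = 2 * self.capacity + 1
        self.tables = [[None] * self.capacity, [None] * self.capacity]
        for k in old:
            self.insert(k)
```

Lookups never probe more than two positions, which is where the constant worst-case lookup time comes from; the cost is pushed onto insertion.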

**Hopscotch Hashing[14]**

The principle is that when a collision occurs, the new element is inserted into a bucket in the neighborhood of its original bucket, that is, at a distance from the original bucket no greater than a constant. The advantages are that lookup time is O(1) in every case and that performance remains good even when the table is nearly full, that is, as α approaches 1. The disadvantage is that inserting an element may move a large number of elements in the table, and the scheme is more complex to implement.

**Robin Hood hashing[15]**

For an element x, let h(x) be the position where it should be stored and h'(x) the position where it is actually stored. Then d(x) = h'(x) - h(x) is the distance between the two positions, that is, the probe length. When x is inserted and a collision occurs, suppose y is stored at the position being probed, so that this position is h'(y). When h (x)

**2-choice hashing**

Two hash functions h1 and h2 are used, each with its own hash table implemented by chaining. When x is inserted, h1(x) and h2(x) are computed, and x is placed in whichever of the two buckets holds fewer elements. A search has to examine both bucket h1(x) and bucket h2(x).
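A small sketch of 2-choice hashing. The salted SHA-256 helper `_h` is a hypothetical way to obtain two different hash functions, and `TwoChoiceTable` is an illustrative name.

```python
import hashlib


def _h(key, salt, m):
    # Hypothetical salted hash: SHA-256 of "salt:key", reduced modulo m.
    digest = hashlib.sha256(f"{salt}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % m


class TwoChoiceTable:
    def __init__(self, m=16):
        self.m = m
        self.t1 = [[] for _ in range(m)]  # table addressed by h1
        self.t2 = [[] for _ in range(m)]  # table addressed by h2

    def _buckets(self, key):
        return self.t1[_h(key, 1, self.m)], self.t2[_h(key, 2, self.m)]

    def insert(self, key):
        b1, b2 = self._buckets(key)
        if key in b1 or key in b2:
            return
        # Put the key into the less loaded of its two candidate buckets.
        (b1 if len(b1) <= len(b2) else b2).append(key)

    def contains(self, key):
        # A search has to look at both candidate buckets.
        b1, b2 = self._buckets(key)
        return key in b1 or key in b2

    def max_bucket_len(self):
        return max(len(b) for b in self.t1 + self.t2)
```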

1.3 Dynamic hash table [16][21]

The collision-resolution methods described above assume that the number of buckets in the hash table stays the same: the table size is fixed by a parameter at initialization, and during use only operations such as inserting, deleting, and finding elements are allowed, not changing the number of buckets.

In practice, when the hash table is small and holds few elements, the methods above are fully adequate. They are not sufficient, however, when there are many elements or when the data is skewed (concentrated in a few buckets). Moreover, as elements are inserted and deleted, the size of the hash table must be adjusted continually to keep the load factor α within a certain range, so that the table keeps good time and space performance. Dynamic hashing addresses this by adjusting the number of buckets as the number of elements in the hash table grows. Dynamic hashing does not require re-inserting all elements of the hash table (reorganization); instead, it expands the buckets incrementally on top of the existing structure.

**Linear hashing [17][18]**

Linear hashing splits the buckets one after another in a fixed order. A pointer s marks the next bucket to be split. Whenever some bucket overflows, bucket s is split and s advances to the next bucket, so s eventually reaches the overflowing bucket and splits it. Note that the bucket to split is always determined by s, regardless of whether the bucket that just received the element is full or has overflowed. Overflow is most easily handled by adding overflow pages. The hash functions used are:

h_i(x) = x mod 2^i, i = 1, 2, 3, 4, ...

As the hash table grows, i increases. After an element is deleted, if its bucket becomes empty, the bucket can be merged with an adjacent bucket. Berkeley DB uses linear hashing.
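The splitting rule can be illustrated with a short in-memory sketch. The class `LinearHashTable` and its parameters are hypothetical; buckets are plain Python lists, so a list longer than `bucket_capacity` plays the role of a bucket with overflow pages. With two initial buckets, the address functions are x mod 2^i as in the formula above.

```python
class LinearHashTable:
    def __init__(self, initial_buckets=2, bucket_capacity=2):
        self.n0 = initial_buckets   # number of buckets in round 0
        self.level = 0              # current round
        self.split = 0              # s: index of the next bucket to split
        self.cap = bucket_capacity
        self.buckets = [[] for _ in range(initial_buckets)]

    def _addr(self, key):
        h = hash(key)
        idx = h % (self.n0 * 2 ** self.level)
        if idx < self.split:
            # This bucket was already split in the current round,
            # so the next, finer hash function applies.
            idx = h % (self.n0 * 2 ** (self.level + 1))
        return idx

    def contains(self, key):
        return key in self.buckets[self._addr(key)]

    def insert(self, key):
        bucket = self.buckets[self._addr(key)]
        if key in bucket:
            return
        bucket.append(key)
        if len(bucket) > self.cap:
            # Any overflow triggers a split of bucket s, which is
            # not necessarily the bucket that overflowed.
            self._split()

    def _split(self):
        old = self.buckets[self.split]
        self.buckets[self.split] = []
        self.buckets.append([])     # image bucket at s + n0 * 2**level
        self.split += 1
        if self.split == self.n0 * 2 ** self.level:
            self.level += 1         # round finished: table has doubled
            self.split = 0
        for k in old:
            self.buckets[self._addr(k)].append(k)
```

Each split moves only the elements of one bucket, so the table grows one bucket at a time instead of being rebuilt.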

**Extendible hashing [16][19][20]**

Extendible hashing applies a trie (dictionary tree) to bucket lookup. It introduces a directory array of bucket pointers: when a bucket overflows, the bucket is split, the directory array is doubled, and the affected bucket pointers are adjusted. The advantage of extendible hashing is that buckets grow dynamically, and efficiency improves because the traditional practice of doubling the number of buckets is replaced by the cheaper operation of doubling the directory entries. It also has problems. When the hashed data is unevenly distributed or heavily skewed, that is, when a large number of hash values share the same prefix, the directory becomes very large while the vast majority of its pointers refer to the same bucket, so space utilization is very low. In addition, the directory array grows exponentially, and as the number of elements grows, the directory array occupies more and more space.

Dynamic hashing adapts to changes in the size of the hash table by splitting or merging buckets [20]. This uses space efficiently and keeps the performance overhead low. Each change to the structure of the hash table affects only the elements of a few buckets, which avoids the delay of rebuilding the whole table, and the number of elements in each bucket stays small.

1.4 Distributed Hash Table

If all data is stored on a single machine, a failure of that machine causes data loss, and a large number of simultaneous requests causes delays. The distributed hash table (DHT) was introduced to address this: its data is spread over multiple nodes, each node stores only part of the data, and together the nodes provide the storage of the whole system. DHTs are often used in peer-to-peer (P2P) networks. The four most common protocols are Chord [22], CAN (Content Addressable Network) [23], Pastry [24][25], and Tapestry [26]. A distributed hash table should be scalable, fault tolerant, and self-organizing. Distributed hash tables are generally built on consistent hashing, which was first proposed for network caching and is described in detail in the section on caching below.

2. Hash function

2.1 Definition

A hash function h is a mapping that takes input of arbitrary length to a fixed-length integer value.

h: U -> {0, 1, ..., m-1}

Here U is the set of all possible key values, called the universe, and S denotes the complete set of hash values {0, 1, ..., m-1}. Keys can be thought of as strings of 0s and 1s, that is, as binary representations. The original English meaning of "hash" is to mix, to jumble, or to re-express. In a sense, hashing is the opposite of sorting: sorting arranges the elements of a collection in some order, such as dictionary order, whereas hashing computes hash values that break the original relationships between the elements, so that the elements are as unordered and random as possible.

Let X be the set of key values that occur in a particular application. The keys in X are rarely evenly distributed in U. For the corresponding hash values to be evenly distributed in S, the hash function should preferably have the avalanche effect, meaning that a change in any single bit of the key changes a large fraction of the bits of the hash value.

The performance of a hash function is also affected by the computing platform [7].

2.2 Properties

Generally, an ideal hash function should have the following characteristics.

1. Computability [1]. The hash value can be computed in a short time with existing computing power.

2. Determinism. The same input always produces the same hash value no matter how many times it is evaluated; the result is independent of external conditions such as time.

3. Universality

4. Uniformity [6]. Keys should be mapped with equal probability to every hash value in S, that is, the resulting hash values are evenly distributed in S. Note that a uniform distribution here does not require a random distribution: if the hash values are randomly distributed they are also uniformly distributed, but a uniform distribution does not imply a random one. The chi-squared test [5] can be used to evaluate how uniformly a hash function distributes its values.

5. Independence [1]. A key can be mapped to any hash value, regardless of the hash values to which other keys have already been mapped [4].
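As a rough illustration of the uniformity check mentioned in item 4, the following sketch computes Pearson's chi-squared statistic for a list of hash values against a uniform expectation. The function name is hypothetical, and turning the statistic into a p-value (with m - 1 degrees of freedom) is left to a statistics library.

```python
from collections import Counter


def chi_squared_statistic(hash_values, m):
    """Pearson's chi-squared statistic for hash values in range(m),
    measured against a perfectly uniform distribution."""
    n = len(hash_values)
    expected = n / m  # expected number of keys per bucket
    counts = Counter(hash_values)
    return sum((counts.get(b, 0) - expected) ** 2 / expected
               for b in range(m))
```

A statistic near m - 1 is what a uniformly random assignment would typically give; a much larger value indicates that some buckets are overloaded.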

2.3 Hashing methods

There are several simple ways to construct a hash function:

**Division method**

One of the simpler ways to design a hash function is the division method, which takes the remainder of the key k divided by m as the hash value, namely:

h(k) = k mod m

For example, if the hash table size is m = 23 and k = 45, then h(k) = 22.

Avoid choosing m = 2^p, because then h(k) is just the number formed by the p lowest-order bits of k, whereas a good hash function should take every bit of the key into account. A prime not too close to an exact power of 2 is often a good choice for m [4].
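A short example of the division method and of the problem with m = 2^p. The function name and the three sample keys are made up for illustration; the keys differ only in their high-order bits.

```python
def division_hash(k, m):
    """Division method: h(k) = k mod m."""
    return k % m


print(division_hash(45, 23))  # 22

# Three keys that share the same 4 low-order bits (0110).
keys = [0b0001_0110, 0b1111_0110, 0b1010_0110]

# With m = 2**4 only the low 4 bits matter, so all keys collide.
print({division_hash(k, 2 ** 4) for k in keys})  # {6}

# With the prime m = 23 the high-order bits also influence the result.
print(sorted(division_hash(k, 23) for k in keys))  # [5, 16, 22]
```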

**Multiplication method [4]**

The multiplication method is defined as follows:

h(k) = ⌊m (kA mod 1)⌋

First, the key k is multiplied by a constant A (0 < A < 1) and the fractional part of kA is taken. That value is then multiplied by m and the result is rounded down.

The value of m is not critical. For ease of computation, m is usually chosen to be a power of 2, so that the multiplication can be implemented with simple shift operations.
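A floating-point sketch of the multiplication method. The constant A = (sqrt(5) - 1) / 2 is a commonly suggested value and is an assumption here, not something the text prescribes; production code would normally use integer arithmetic and shifts instead of floats.

```python
import math

# Commonly suggested constant, about 0.618 (an assumption of this sketch).
A = (math.sqrt(5) - 1) / 2


def multiplication_hash(k, m, a=A):
    """Multiplication method: h(k) = floor(m * (k*a mod 1))."""
    fractional = (k * a) % 1  # fractional part of k*a
    return math.floor(m * fractional)


print(multiplication_hash(123456, 2 ** 14))  # 67
```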

**Folding method [35]**

The folding method divides the key k into parts with the same number of digits (the last part may be shorter) and adds the parts together, discarding the final carry, to obtain the hash value. This method is suitable when k has many digits and the digits in each position are roughly uniformly distributed.
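For example, a decimal version of the folding method might look like this. It assumes a non-negative integer key, and the part length of 3 digits is an arbitrary choice for the sketch.

```python
def folding_hash(k, part_digits=3):
    """Split the decimal digits of k into parts, add them,
    and drop the carry out of the top digit."""
    digits = str(k)
    parts = [int(digits[i:i + part_digits])
             for i in range(0, len(digits), part_digits)]
    return sum(parts) % 10 ** part_digits


print(folding_hash(123456789))  # 123 + 456 + 789 = 1368 -> 368
```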

**Mid-square method [35]**

The mid-square method squares the key k and takes the middle digits of the result as the hash value. The middle digits of k^2 depend on every digit of k, so if k is randomly distributed, the resulting hash values are randomly distributed as well.
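A decimal sketch of the mid-square method. It assumes a non-negative integer key; the choice of r = 4 middle digits and the way the middle is located are illustrative.

```python
def mid_square_hash(k, r=4):
    """Return the middle r decimal digits of k squared."""
    squared = str(k * k)
    if len(squared) <= r:
        return int(squared)
    start = (len(squared) - r) // 2
    return int(squared[start:start + r])


print(mid_square_hash(4567))  # 4567**2 = 20857489 -> 8574
```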

**Digit analysis method [35]**

The idea of digit analysis is this: if all the keys that may appear in the hash table are known in advance, some of the digits of each key can be taken as the hash value. Which digits to select is determined by the actual application, with the aim of avoiding collisions as far as possible.

**Random number method [35]**

Choose a random function and take its value on the key as the hash value. This method is usually used when the keys vary in length.

There are many ways to construct a hash function, and in practice the appropriate one should be chosen for the situation at hand. The factors usually considered are the length and distribution of the keys, the range of hash values, and so on. For example, when the key is an integer, the division or multiplication method can be used; when the key is a decimal (fractional) number, the random number method is a better choice.

2.4 Perfect hashing [4][16][39][40]

When the set of keys is fixed and every key can be enumerated, a perfect hash function can be designed. Its defining property is that it has no collisions, that is, it is an injective function, such as a mapping of the 12 months to {0, ..., 11}.

The most commonly used scheme is the one proposed by Fredman et al. in 1984 [39][4]. Its design is described below.

The scheme uses two levels of hashing, with universal hashing at each level. The first level is the same as a hash table with chaining: a hash function h chosen from a universal family of hash functions H maps the n keys into m buckets. The second level does not chain the elements of bucket j in a list; instead, bucket j is itself implemented as a hash table. To ensure that the second-level hash table has no collisions, its size is set to the square of the number of elements in the bucket. This may seem to make the total storage large, but with a suitably chosen hash function the total storage can be kept to O(n).

If a perfect hash function additionally maps the n keys onto a contiguous range of n values, it is called a minimal perfect hash function.

Dynamic perfect hashing supports inserting elements into the table and deleting them from it [40].
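The two-level construction can be sketched for a static set of integer keys. This is an illustrative sketch only: the universal family ((a*k + b) mod P) mod m, the prime `P`, the retry threshold of 4n, and the class name `PerfectHashSet` are assumptions of the example, and keys are assumed to be non-negative integers smaller than `P`.

```python
import random

P = 2_147_483_647  # prime 2**31 - 1, assumed larger than every key


def _make_hash(m, rng):
    # Draw one function from the family ((a*k + b) mod P) mod m.
    a, b = rng.randrange(1, P), rng.randrange(0, P)
    return lambda k: ((a * k + b) % P) % m


class PerfectHashSet:
    def __init__(self, keys, seed=0):
        rng = random.Random(seed)
        keys = list(set(keys))
        n = max(len(keys), 1)
        # Level 1: retry until the total second-level space is O(n).
        while True:
            self.h1 = _make_hash(n, rng)
            groups = [[] for _ in range(n)]
            for k in keys:
                groups[self.h1(k)].append(k)
            if sum(len(g) ** 2 for g in groups) <= 4 * n:
                break
        # Level 2: bucket j gets n_j**2 slots; retry until collision-free.
        self.level2 = []
        for g in groups:
            size = len(g) ** 2
            while True:
                h2 = _make_hash(size, rng) if size else None
                slots = [None] * size
                collision = False
                for k in g:
                    i = h2(k)
                    if slots[i] is not None:
                        collision = True
                        break
                    slots[i] = k
                if not collision:
                    break
            self.level2.append((h2, slots))

    def __contains__(self, k):
        # Exactly two hash evaluations, no probing and no chains.
        h2, slots = self.level2[self.h1(k)]
        return bool(slots) and slots[h2(k)] == k
```

Because each second-level table has quadratic size, a randomly chosen function is collision-free with good probability, so only a few retries are expected.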

2.5 Applications

**Hash table**

The hash table is the most important application of hash functions. How uniformly the hash function distributes keys has a very large impact on hash table performance, because a uniform distribution reduces the overhead of collisions.

**Distributed cache**

In the traditional setting, a hash function computes the hash value of a data item, and the hash value is used as the address of a cache slot. When a collision occurs, the data already at that address is simply discarded, so the implementation is relatively simple.

Distributed cache systems in a networked environment are generally based on consistent hashing [27][28]. In short, consistent hashing organizes the hash value space into a virtual ring. Each server is mapped onto the ring with the same hash function that is used for the data key k, and each data item is stored on the first server encountered when walking clockwise from the item's position. The load on the server nodes is then relatively balanced, which largely avoids wasting resources.

If a server goes down, however, its entire load falls on the next server on the ring. Virtual nodes [29] were introduced to solve this problem. All data keys are mapped to a set of virtual nodes, which outnumber the servers, and each virtual node is in turn mapped to a real server. The data is actually stored on the physical server that corresponds to its virtual node. When a server goes down, the number of virtual nodes stays fixed and only the virtual nodes of the unavailable server are remapped, so only the data of the failed node has to be migrated. This minimizes cache redistribution when servers are added or removed.

In a dynamic distributed cache system, the design of the hashing algorithm is the key point. A well-distributed algorithm keeps the load relatively balanced across the service nodes, largely avoiding both wasted resources and overloaded servers. Consistent hashing with virtual nodes effectively reduces the cost and risk of data migration caused by changes in the server hardware environment, making the distributed cache system more efficient and stable.
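A compact sketch of a consistent-hash ring with virtual nodes. In this variant each server is placed on the ring at several positions (its virtual nodes); the class name, the `server#index` naming of virtual nodes, the choice of SHA-256, and the default of 100 virtual nodes per server are all assumptions made for illustration. `lookup` assumes the ring is not empty.

```python
import bisect
import hashlib


def _hash(s):
    # Position on the ring: first 8 bytes of SHA-256 as an integer.
    return int.from_bytes(hashlib.sha256(s.encode()).digest()[:8], "big")


class ConsistentHashRing:
    def __init__(self, servers=(), vnodes=100):
        self.vnodes = vnodes
        self.ring = []  # sorted list of (position, server)
        for server in servers:
            self.add_server(server)

    def add_server(self, server):
        # Each server appears on the ring as `vnodes` virtual nodes.
        for v in range(self.vnodes):
            bisect.insort(self.ring, (_hash(f"{server}#{v}"), server))

    def remove_server(self, server):
        self.ring = [(p, s) for p, s in self.ring if s != server]

    def lookup(self, key):
        # First virtual node at or clockwise after the key's position,
        # wrapping around at the end of the ring.
        i = bisect.bisect_left(self.ring, (_hash(key), ""))
        return self.ring[i % len(self.ring)][1]
```

When a server is removed, only the keys that were assigned to its virtual nodes move, and they are spread over the remaining servers rather than landing on a single neighbor.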

**Bloom Filters[30][31][32]**

A Bloom filter determines whether an element belongs to a set, where the set is represented by a bit array. Compared with other data structures that represent sets, the biggest advantage of the Bloom filter is that it takes much less space, which is critical when processing massive amounts of data. The time needed to add an element or to test whether an element belongs to the set is O(1), regardless of how many elements the set already contains. It also has an obvious disadvantage: an element that is not in the set may be reported as present (a false positive). Bloom filters are therefore unsuitable for applications that require zero errors.

Here is how a Bloom filter works. Initially, the Bloom filter is an array of m bits, all set to 0. Suppose the current set is S = {x_1, x_2, ..., x_n}. The Bloom filter uses k independent hash functions h_1, ..., h_k, each of which maps every element of the set into the range {1, ..., m}. For each element x, the position h_i(x) (1 ≤ h_i(x) ≤ m) given by the i-th hash function is set to 1, for 1 ≤ i ≤ k. If a position is set to 1 more than once, only the first time has any effect. To test whether x is in the set, compute the k hash values of x (for example, k = 4). If all positions h_i(x) (1 ≤ i ≤ k) are 1, x is considered to be in the set (although it may in fact not be); otherwise x is definitely not in the set.

The traditional Bloom filter does not support removing elements from the set. The counting Bloom filter [33][34] was introduced to solve this problem. It expands each bit of the bit array into a small counter. When an element is inserted, the k counters at its hash positions are each incremented by 1 (k is the number of hash functions); when the element is deleted, the same k counters are each decremented by 1.
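The procedure can be sketched as follows. The k hash functions are simulated here by salting SHA-256 with the index i, which is an assumption of the sketch; positions run from 0 to m-1 rather than 1 to m, and one byte is used per bit for clarity.

```python
import hashlib


class BloomFilter:
    def __init__(self, m=1024, k=4):
        self.m, self.k = m, k
        self.bits = bytearray(m)  # one byte per bit, all initially 0

    def _positions(self, item):
        # k hash functions simulated by salting SHA-256 with i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1  # setting an already-set bit changes nothing

    def __contains__(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits[p] for p in self._positions(item))
```

A typical use is `bf = BloomFilter(); bf.add("apple"); "apple" in bf`, which always evaluates to true for added items; items that were never added are usually, but not always, reported as absent.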
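A counting variant can be sketched the same way, with a list of integer counters in place of the bit array. As before, the salted SHA-256 hash functions and the class name are assumptions; real implementations use small fixed-width counters, which can overflow.

```python
import hashlib


class CountingBloomFilter:
    def __init__(self, m=1024, k=4):
        self.m, self.k = m, k
        self.counters = [0] * m

    def _positions(self, item):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.counters[p] += 1

    def __contains__(self, item):
        return all(self.counters[p] > 0 for p in self._positions(item))

    def remove(self, item):
        # Only decrement for items that appear to be present, so that
        # counters never become negative.
        if item not in self:
            return False
        for p in self._positions(item):
            self.counters[p] -= 1
        return True
```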

**Removing duplicate data**

A hash function can be used to remove duplicate elements from a collection. The hash values of all elements are computed and the elements are stored in a hash table T. Duplicate elements necessarily fall into the same bucket T[i], so it is enough to compare the elements within each bucket and remove the duplicates. Removing duplicates this way is much more efficient than comparing every pair of elements.
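A minimal sketch of bucket-based deduplication. The function name and bucket count are arbitrary; in everyday Python the built-in `set` or `dict` does the same job.

```python
def deduplicate(items, num_buckets=16):
    """Remove duplicates, keeping the first occurrence of each element."""
    buckets = [[] for _ in range(num_buckets)]
    result = []
    for x in items:
        bucket = buckets[hash(x) % num_buckets]
        if x not in bucket:  # duplicates can only be in the same bucket
            bucket.append(x)
            result.append(x)
    return result


print(deduplicate([3, 1, 3, 2, 1, 17]))  # [3, 1, 2, 17]
```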

**Cryptographic hash function**

In cryptography and data security, hash functions are an important tool for building efficient, secure, and reliable digital signatures and authentication, and they are an important building block of security authentication protocols. For security reasons, cryptographic hash functions must satisfy stricter requirements than other hash functions [41]:

One-way (preimage resistance): for any given hash value h, it is computationally infeasible to find a k such that H(k) = h.

Weak collision resistance: for any given x, it is computationally infeasible to find a y with y ≠ x such that H(x) = H(y).

Strong collision resistance: it is computationally infeasible to find any pair (x, y) with x ≠ y such that H(x) = H(y).

Commonly used cryptographic hash functions such as MD5 and SHA-1 are built on block-cipher-style constructions. In 2005, Professor Xiaoyun Wang of China showed that the cost of finding a SHA-1 collision can be reduced from the theoretical 2^80 to 2^69 [42]. In October 2012, NIST selected the Keccak algorithm as the SHA-3 standard; Keccak has good performance and strong resistance to cryptanalysis.
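In Python, cryptographic hash functions are available through the standard `hashlib` module. The helper names below are hypothetical; the second helper counts how many output bits differ between two digests, which gives a rough feel for the avalanche behavior expected of these functions.

```python
import hashlib


def hex_digest(data: bytes, algorithm: str = "sha256") -> str:
    """Hex digest of `data` using a hashlib algorithm name,
    for example "sha256" or "sha3_256"."""
    return hashlib.new(algorithm, data).hexdigest()


def differing_bits(a_hex: str, b_hex: str) -> int:
    """Number of bit positions in which two hex digests differ."""
    return bin(int(a_hex, 16) ^ int(b_hex, 16)).count("1")


d1 = hex_digest(b"abc")
d2 = hex_digest(b"abd")  # input differs in a single character
print(d1)
print(differing_bits(d1, d2), "of 256 bits differ")
```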

**Locality-sensitive hashing [43]**

The basic idea of locality-sensitive hashing (LSH) is that after two data points that are close in the original data space undergo the same mapping or projection, they remain close in the new data space with high probability, while points that are far apart are mapped to the same bucket with low probability. In this respect LSH is the exact opposite of hash functions designed for the avalanche effect. Common locality-sensitive hashes include MinHash [43], Nilsimsa hash [44], and SimHash [45].

LSH has a wide range of applications that involve similarity search over large amounts of data, such as detecting duplicate web pages, finding similar articles, image retrieval, music retrieval, and fingerprint matching.

LSH provides a way to find, in a massive high-dimensional dataset, the data point or points that are approximately nearest to a query point. Note that LSH does not guarantee finding the data closest to the query point. Rather, it reduces the number of data points that have to be compared while ensuring that the nearest neighbors are found with high probability.
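As one concrete instance, MinHash can be sketched in a few lines. The fraction of signature positions on which two sets agree estimates their Jaccard similarity. The seeded SHA-256 hash family and the function names are assumptions of the sketch, and the token collections are assumed to be non-empty.

```python
import hashlib


def _h(seed, token):
    # One member of a hypothetical hash family, selected by `seed`.
    digest = hashlib.sha256(f"{seed}:{token}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(tokens, num_hashes=64):
    """For each hash function, keep the minimum hash over all tokens."""
    return [min(_h(seed, t) for t in tokens) for seed in range(num_hashes)]


def estimated_similarity(sig_a, sig_b):
    """Fraction of positions where two signatures agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


doc1 = {"the", "quick", "brown", "fox", "jumps"}
doc2 = {"the", "quick", "brown", "dog", "jumps"}
print(estimated_similarity(minhash_signature(doc1), minhash_signature(doc2)))
```

Similar sets tend to share many signature positions, so signatures (or bands of them) can be used as bucket keys to find candidate near-duplicates without comparing every pair of documents.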

**Database index [46]**

The hash index is a common index type in databases. Compared with a B-tree index, its lookups are faster, and it is commonly used as a secondary index.

2.6 Common hash functions

Pearson hashing

SipHash

One-at-a-Time

lookup3

MurmurHash

CityHash

FarmHash

xxHash

FNV-1a

MD5

MD4

SHA-1

SHA-3

CRC32

SpookyHash

Summary

There is no universal hash table or hash function; each has the application scenario that suits it best. It pays to master the various hashing algorithms and understand their strengths and weaknesses. When facing a concrete problem, analyze the characteristics of the data and the performance requirements, then select or design the most appropriate hash table and hash function.

References:

[1] Robert Sedgewick. Algorithms in C.

[2] Herstein, I. N. Topics in Algebra [M]. Waltham: Blaisdell Publishing Company, 1964: 90.

[3] W. W. Rouse Ball. Mathematical Recreations and Essays [M]. New York: Macmillan, 1960: 45.

[4] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, Clifford Stein. Introduction to Algorithms [M].

[5] Pearson, Karl. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling [J]. Philosophical Magazine, 1900, Series 5, 50 (302): 157–175.

[6] Mahima Singh, Deepak Garg. Choosing Best Hashing Strategies and Hash Functions.

[7] Fast Hashing on the Pentium.

[8] On the k-Independence Required by Linear Probing and Minwise Independence.

[9] How Caching Affects Hashing.

[10] A Probabilistic Study on Combinatorial Expanders and Hashing.

[11] The Art of Computer Programming.

[12] J. S. Vitter and W.-C. Chen. Design and Analysis of Coalesced Hashing. Oxford University Press, New York, NY, 1987. ISBN 0-19-504182-8.

[13] Pagh, Rasmus; Rodler, Flemming Friche (2001). "Cuckoo Hashing". Algorithms - ESA 2001. Lecture Notes in Computer Science 2161. pp. 121–133.

[14] Herlihy, Maurice; Shavit, Nir; Tzafrir, Moran (2008). "Hopscotch Hashing". DISC '08: Proceedings of the 22nd International Symposium on Distributed Computing. Arcachon, France: Springer-Verlag. pp. 350–364.

[15] Celis, Pedro (1986). Robin Hood Hashing (Technical report). Computer Science Department, University of Waterloo. CS-86-14.

[16] Advanced Data Structures.

[17] Litwin, Witold (1980), "Linear hashing: A new tool for file and table addressing", Proc. 6th Conference on Very Large Databases: 212–223.

[18] Larson, Per-Åke (April 1988), "Dynamic Hash Tables", Communications of the ACM 31 (4): 446–457, doi:10.1145/42404.42410

[19] Fagin, R.; Nievergelt, J.; Pippenger, N.; Strong, H. R. (September 1979), "Extendible Hashing - A Fast Access Method for Dynamic Files", ACM Transactions on Database Systems 4 (3): 315–344, doi:10.1145/320083.320092

[20] Database System Concepts.

[21] Dynamic Hashing Schemes.

[22] Stoica, I.; Morris, R.; Karger, D.; Kaashoek, M. F.; Balakrishnan, H. (2001). "Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications". ACM SIGCOMM Computer Communication Review 31 (4): 149. doi:10.1145/964723.383071

[23] Ratnasamy et al. (2001). "A Scalable Content-Addressable Network". In Proceedings of ACM SIGCOMM 2001.

[24] A. Rowstron and P. Druschel (Nov 2001). "Pastry: Scalable, decentralized object location and routing for large-scale peer-to-peer systems". IFIP/ACM International Conference on Distributed Systems Platforms (Middleware), Heidelberg, Germany: 329–350.

[25] A. Rowstron, A-M. Kermarrec, M. Castro and P. Druschel (Nov 2001). "Scribe: The design of a large-scale event notification infrastructure". NGC2001, UCL, London.

[26] Zhao, Ben Y.; Huang, Ling; Stribling, Jeremy; Rhea, Sean C.; Joseph, Anthony D.; Kubiatowicz, John D. (2004). "Tapestry: A Resilient Global-scale Overlay for Service Deployment". IEEE Journal on Selected Areas in Communications 22 (1): 41–53. doi:10.1109/JSAC.2003.818784.

[27] Karger, D.; Lehman, E.; Leighton, T.; Panigrahy, R.; Levine, M.; Lewin, D. (1997). "Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web". Proceedings of the Twenty-ninth Annual ACM Symposium on Theory of Computing. ACM Press, New York, NY, USA. pp. 654–663. doi:10.1145/258533.258660.

[28] Karger, D.; Sherman, A.; Berkheimer, A.; Bogstad, B.; Dhanidina, R.; Iwamoto, K.; Kim, B.; Matkins, L.; Yerushalmi, Y. (1999). "Web Caching with Consistent Hashing". Computer Networks 31 (11): 1203–1213. doi:10.1016/S1389-1286(99)00055-9

[29] Dynamo: Amazon's Highly Available Key-value Store.

[30] Bloom, Burton H. (1970), "Space/Time Trade-offs in Hash Coding with Allowable Errors", Communications of the ACM 13 (7): 422–426.

[31] Broder, Andrei; Mitzenmacher, Michael (2005), "Network Applications of Bloom Filters: A Survey", Internet Mathematics 1 (4): 485–509.

[32] M. Mitzenmacher. Compressed Bloom Filters. IEEE/ACM Transactions on Networking 10:5 (2002), 604-612.

[33] Kirsch, Adam; Mitzenmacher, Michael (2006), "Less Hashing, Same Performance: Building a Better Bloom Filter", in Azar, Yossi; Erlebach, Thomas (eds.), Algorithms – ESA 2006, 14th Annual European Symposium, Lecture Notes in Computer Science 4168, Springer-Verlag, pp. 456–467.

[34] Rottenstreich, Ori; Kanizo, Yossi; Keslassy, Isaac (2012), "The Variable-Increment Counting Bloom Filter", 31st Annual IEEE International Conference on Computer Communications, INFOCOM, pp. 1880–1888.

[35] Min data structure

[36] Carter, Larry; Wegman, Mark N. (1979). "Universal Classes of Hash Functions". Journal of Computer and System Sciences 18 (2): 143–154. doi:10.1016/0022-0000(79)90044-8. Conference version in STOC '77.

[37] Miltersen, Peter Bro. "Universal Hashing". Archived from the original, June 2009.

[38] Kaser, Owen; Lemire, Daniel (2013). "Strongly universal string hashing is fast". Computer Journal (Oxford University Press). arXiv:1202.4961. doi:10.1093/comjnl/bxt070.

[39] Fredman, M. L., Komlós, J., and Szemerédi, E. 1984. Storing a Sparse Table with O(1) Worst Case Access Time. J. ACM 31, 3 (June 1984), 538-544.

[40] Dietzfelbinger, M., Karlin, A., Mehlhorn, K., Meyer auf der Heide, F., Rohnert, H., and Tarjan, R. E. 1994. Dynamic Perfect Hashing: Upper and Lower Bounds. SIAM J. Comput. 23, 4 (1994), 738-761.

[41] Cryptography and Network Security: Principles and Practice, 5th Edition.

[42] Xiaoyun Wang, Yiqun Lisa Yin, and Hongbo Yu. Finding Collisions in the Full SHA-1.

[43] A. Rajaraman and J. Ullman (2010). "Mining of Massive Datasets".

[44] Damiani et al. (2004). "An Open Digest-based Technique for Spam Detection".

[45] Charikar, Moses S. (2002). "Similarity estimation techniques from rounding algorithms". Proceedings of the 34th Annual ACM Symposium on Theory of Computing 2002 (ACM 1-58113-495-9/02/0005).

[46] Database System Concepts, 6th Edition.

Http://www.cnblogs.com/qiaoshanzi/p/5295554.html
