Mr. Zhang Yingpa joined Baishan Cloud Technology in 2016, where he is mainly responsible for object storage research and development and for solving cross-datacenter data distribution and repair problems. With a target of 100 PB of stored data, he led the team through the design, implementation, and deployment of the cross-network distributed storage system, separated "cold" and "hot" data, and compressed the cost of cold data down to 1.2x redundancy.
From 2006 to 2015, Mr. Zhang worked at Sina, responsible for the architecture design, collaboration process, coding and implementation standards, and most of the feature implementation of a cross-IDC, petabyte-scale cloud storage service supporting Sina Weibo, micro-disk, video, SAE, music, software downloads, and other internal Sina storage needs. From 2015 to 2016, he served as a senior technical expert in the United States, designing a cross-datacenter, multi-PB object storage solution: he designed and implemented a highly concurrent, highly reliable multi-copy replication strategy and optimized erasure coding to cut I/O overhead by 90%.
In software development, a hash table is equivalent to randomly putting n keys into b buckets, storing n items of data in a space of b units.
We found some interesting phenomena in the hash table:
The distribution of keys in a hash table
When the number of keys equals the number of buckets (n/b = 1):
37% of the buckets are empty.
37% of the buckets contain exactly 1 key.
26% of the buckets contain more than 1 key (hash collisions).
The figure below visualizes the number of keys in each bucket when n = b = 20 (buckets sorted by key count):
Our first impression of a hash table is often that if keys are placed into all the buckets randomly, the key counts will be fairly uniform, with each bucket expected to hold 1 key.
In fact, the distribution of keys across buckets is very uneven when n is small; only as n grows does it tend toward even.
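As a quick sanity check of the 37% / 37% / 26% figures above, the following simulation sketch (not part of the original article; parameters chosen for illustration) throws n keys into b = n buckets at random and counts the three bucket categories:

import random
from collections import Counter

def bucket_stats(n, b, trials=100):
    # Fraction of empty, 1-key and multi-key buckets,
    # averaged over several random trials.
    empty = one = multi = 0
    for _ in range(trials):
        buckets = [0] * b
        for _ in range(n):
            # random.randrange stands in for a uniform hash function
            buckets[random.randrange(b)] += 1
        counts = Counter(buckets)
        empty += counts[0]
        one += counts[1]
        multi += b - counts[0] - counts[1]
    total = float(b * trials)
    return empty / total, one / total, multi / total

# n = b: roughly 37% empty, 37% with one key, 26% with collisions
print(bucket_stats(10000, 10000))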
The effect of the number of keys on the three bucket categories
The following table shows how the value of n/b affects the counts of the three bucket categories when b is fixed and n increases (the conflict rate is the proportion of buckets containing more than 1 key):
More intuitively, the chart below shows how the empty-bucket rate and the conflict rate change with the value of n/b:
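Since the original table and chart are not reproduced here, the following sketch, based on the formulas derived in the "Calculation" section below (empty-bucket rate ≈ e^(-n/b), 1-key rate ≈ (n/b) * e^(-n/b)), computes both rates for a few illustrative n/b values:

import math

def empty_rate(load):
    # probability that a bucket receives no key; load = n/b
    return math.exp(-load)

def conflict_rate(load):
    # probability that a bucket receives more than one key
    return 1 - math.exp(-load) - load * math.exp(-load)

for load in (0.5, 0.75, 1.0, 2.0, 5.0, 10.0):
    print(load, round(empty_rate(load), 4), round(conflict_rate(load), 4))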
The effect of the number of keys on bucket uniformity
The numbers above are useful references when n/b is small, but as n/b increases, the numbers of empty buckets and of buckets with exactly 1 key both approach 0, and most buckets contain multiple keys.
When n/b exceeds 1 (a bucket is allowed to store multiple keys), what we observe shifts to the distribution of key counts within the buckets.
The table below shows how uneven the per-bucket key counts become when n/b is larger.
To describe the degree of this unevenness, we use the ratio between the maximum and minimum key counts across buckets: (most - fewest) / most.
The following table lists the key distribution for b = 100 as n increases.
It can be seen that as the average number of keys per bucket increases, the unevenness gradually decreases.
Unlike the proportions of empty buckets or 1-key buckets, which are determined by n/b alone, this uniformity depends not only on the value of n/b but also on b itself, as discussed later. We do not use the variance commonly used in statistics to describe how unevenly the keys are distributed, because in software development we more often need to budget memory and other resources for the worst case.
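The table itself is not reproduced here, but a small simulation sketch along the same lines (parameters chosen for illustration) measures (most - fewest) / most for b = 100 as n grows:

import random

def unevenness(n, b):
    # (most - fewest) / most for one random assignment of n keys to b buckets
    buckets = [0] * b
    for _ in range(n):
        buckets[random.randrange(b)] += 1
    most, fewest = max(buckets), min(buckets)
    return (most - fewest) / most

for n in (100, 1000, 10000, 100000, 1000000):
    print(n, round(unevenness(n, 100), 3))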
Load factor: n/b < 0.75
The load factor α = n/b is a concept commonly used to describe the characteristics of a hash table.
Usually, for a memory-based hash table, n/b ≤ 0.75. This setting saves space while keeping the key collision rate relatively low; a low collision rate means infrequent hash relocation, so hash table insertion is faster.
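Plugging n/b = 0.75 into the formulas derived in the "Calculation" section below gives an empty-bucket rate of e^(-0.75) ≈ 47% and a conflict rate of 1 - e^(-0.75) - 0.75 * e^(-0.75) ≈ 17%; this is where the 47% figure used below comes from.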
Linear probing is a frequently used algorithm for resolving hash collisions at insert time: when a collision occurs at a bucket, it checks the buckets after it one step at a time until it finds an empty bucket. It is therefore very sensitive to hash collisions.
In the n/b = 0.75 scenario, if linear probing is not used (for example, multiple keys are stored in a bucket with a linked list), about 47% of the buckets are empty; if linear probing is used, about half of those 47% of buckets will be filled by probing.
Many in-memory hash table implementations choose n/b ≤ 0.75 as the capacity limit of the hash table, not only because the collision rate rises as n/b increases but, more importantly, because the efficiency of linear probing drops as n/b increases.
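As an illustration only (not the implementation of any particular library), a minimal linear-probing insert can be sketched like this:

def lp_insert(table, key, value):
    # table is a fixed-size list of (key, value) slots; None means empty
    b = len(table)
    i = hash(key) % b
    for _ in range(b):
        if table[i] is None or table[i][0] == key:
            table[i] = (key, value)
            return
        i = (i + 1) % b  # collision: probe the next bucket
    raise RuntimeError("hash table is full")

table = [None] * 8
lp_insert(table, "foo", 1)
lp_insert(table, "bar", 2)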
Hash table tips:
A hash table is an algorithm that trades a certain amount of wasted space for efficiency. Low time overhead (O(1)), low space waste, and a low collision rate cannot all be achieved at once;
Hash tables are only suitable for pure in-memory data structures:
a hash table trades wasted space for faster access, and wasting disk space is intolerable, while wasting a little memory is acceptable;
hash tables are only suitable for storage media with fast random access; data stored on disk more often uses B-trees or other ordered data structures;
Most high-level languages (built-in hash tables, hash sets, etc.) keep n/b ≤ 0.75;
A hash table does not distribute keys evenly when n/b is small.
Load factor: n/b > 1
Another kind of hash table implementation is used specifically to store a large number of keys. When n/b > 1.0, linear probing no longer works (there are not enough buckets to store every key). Here a bucket holds more than 1 key: chaining is generally used, linking all keys that fall into the same bucket into a list, so multiple keys can be stored while collisions are resolved; see the sketch below.
Linked lists are only suitable when n/b is not very large, because list lookup takes O(n) time. For very large n/b, a tree is sometimes used instead of a list to manage the keys within a bucket.
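A minimal chaining sketch (illustrative only), with a plain Python list as the chain in each bucket:

class ChainedHash:
    def __init__(self, nbuckets):
        self.buckets = [[] for _ in range(nbuckets)]

    def insert(self, key, value):
        # append to (or update in) the chain of the bucket the key hashes to
        chain = self.buckets[hash(key) % len(self.buckets)]
        for i, (k, _) in enumerate(chain):
            if k == key:
                chain[i] = (key, value)
                return
        chain.append((key, value))

    def get(self, key):
        # O(chain length) scan within one bucket
        chain = self.buckets[hash(key) % len(self.buckets)]
        for k, v in chain:
            if k == key:
                return v
        raise KeyError(key)

h = ChainedHash(4)
h.insert("foo", 1)
print(h.get("foo"))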
One usage scenario with a large n/b is randomly assigning the users of a web site to multiple web servers, where each web server serves multiple users. In most cases, we want this assignment to be as even as possible so that every web server's resources are used effectively.
This requires us to pay attention to the uniformity of the hash. So, in what follows, we assume the hash function is completely random and look at how the uniformity changes with n and b.
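For example (server names and hash choice are hypothetical), such an assignment can be as simple as a hash modulo the number of servers; the servers play the role of buckets and the users the role of keys:

import hashlib

servers = ["web-1", "web-2", "web-3"]

def assign(user_id):
    # map a user to a server: servers are the buckets, users are the keys
    digest = hashlib.sha1(user_id.encode()).digest()
    return servers[int.from_bytes(digest, "big") % len(servers)]

print(assign("alice"))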
The larger n/b is, the more evenly the keys are distributed.
When n/b is large enough, the empty-bucket rate approaches 0 and the number of keys in each bucket tends toward the average. The expected number of keys per bucket is:
avg = n/b
Defining a bucket that holds exactly the average number of keys, n/b, as 100%, the chart below shows the distribution of per-bucket key counts for b = 20 and n/b = 10, 100, and 1000, respectively.
As you can see, as n/b increases, the gap between the maximum and minimum per-bucket key counts gradually narrows. The following table lists how the uniformity of the key distribution changes as b and n/b increase:
Conclusion:
Calculation
Most of the above results come from program simulation; now let's work out how to compute these values mathematically.
Number of buckets in each category
Number of empty buckets
For a single key, the probability that it does not land in a particular bucket is: (b - 1) / b
The probability that none of the n keys lands in a particular bucket is: ((b - 1) / b)^n
It is known (from the limit that defines the natural constant e) that (1 - 1/b)^b ≈ e^(-1) when b is large.
The empty-bucket rate is therefore: ((b - 1) / b)^n = (1 - 1/b)^n ≈ e^(-n/b)
The number of empty buckets is: b * e^(-n/b)
Number of buckets with exactly 1 key
With n keys, each key has probability 1/b of falling into a particular bucket, while each of the other keys misses that bucket with probability 1 - 1/b. So the probability that a particular bucket contains exactly 1 key is: n * (1/b) * (1 - 1/b)^(n-1) ≈ (n/b) * e^(-n/b)
The number of buckets with exactly 1 key is: b * (n/b) * e^(-n/b) = n * e^(-n/b)
Buckets with multiple keys
The rest are buckets containing more than 1 key; their number is: b * (1 - e^(-n/b) - (n/b) * e^(-n/b))
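As a quick check against the opening observation, setting n = b (that is, n/b = 1) gives an empty-bucket rate of e^(-1) ≈ 37%, a 1-key-bucket rate of 1 * e^(-1) ≈ 37%, and a multi-key-bucket rate of 1 - 2 * e^(-1) ≈ 26%, matching the 37% / 37% / 26% figures observed above.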
How evenly keys are distributed across buckets
Similarly, the probability that a bucket contains exactly i keys is: any i of the n keys all land in this bucket, each with probability 1/b, while the other n - i keys miss this bucket, each with probability 1 - 1/b, that is: C(n, i) * (1/b)^i * (1 - 1/b)^(n-i)
This is the well-known binomial distribution.
We can use the binomial distribution to estimate the maximum and minimum per-bucket key counts.
Approximation by the normal distribution
When n and b are both large, the binomial distribution can be approximated by a normal distribution to estimate the uniformity of the key distribution.
Let p = 1/b. The probability that a bucket contains exactly i keys is: C(n, i) * p^i * (1 - p)^(n-i), which for large n is close to a normal density with mean μ = n * p and standard deviation σ = sqrt(n * p * (1 - p)).
The probability that a bucket contains no more than x keys is: Φ((x - μ) / σ), where Φ is the cumulative distribution function of the standard normal distribution.
Therefore, the expected number of buckets containing no more than x keys is: b * Φ((x - μ) / σ)
The minimum per-bucket key count can be estimated as follows: if the number of buckets containing no more than x keys is 1, then that single bucket is the one with the fewest keys. We only need to find the smallest x such that the expected number of buckets containing no more than x keys is 1; this x is the minimum per-bucket key count.
Calculating the minimum key count x
The x we want therefore satisfies: b * Φ((x - μ) / σ) = 1, that is, Φ((x - μ) / σ) = 1/b
Φ is the cumulative distribution function of the standard normal distribution. When its argument is close to 0, it can be approximated by the series: Φ(x) ≈ 1/2 + (e^(-x^2/2) / sqrt(2π)) * (x + x^3/3 + x^5/(3*5) + x^7/(3*5*7) + ...)
Solving this equation directly is difficult, but since we only need to find x, we can simply try x over the range [0, μ] until we find an x for which the expected number of buckets containing no more than x keys is 1.
This x can be regarded as the minimum per-bucket key count, and the unevenness of the hash table can be described by the difference between the maximum and minimum key counts. Because the normal distribution is symmetric, the maximum key count can be represented as μ + (μ - x). Finally, the ratio of the difference between the maximum and minimum key counts to the maximum is: ((μ + (μ - x)) - x) / (μ + (μ - x)) = 2(μ - x) / (2μ - x)
(μ is the mean, i.e., n/b)
Program Simulation
The following Python script simulates the distribution of keys across buckets; its output can be compared with the calculations above to verify them.
import sys
import math
import time
import hashlib


def normal_pdf(x, mu, sigma):
    x = float(x)
    mu = float(mu)

    m = 1.0 / math.sqrt(2 * math.pi) / sigma
    n = math.exp(-(x - mu) ** 2 / (2 * sigma * sigma))

    return m * n


def normal_cdf(x, mu, sigma):
    # integral over (-oo, x)
    x = float(x)
    mu = float(mu)
    sigma = float(sigma)

    # convert to standard form
    x = (x - mu) / sigma

    # series expansion of the standard normal CDF around 0
    s = x
    v = x
    for i in range(1, 100):
        v = v * x * x / (2 * i + 1)
        s += v

    return 0.5 + s / (2 * math.pi) ** 0.5 * math.e ** (-x * x / 2)


def difference(nbucket, nkey):
    nbucket, nkey = int(nbucket), int(nkey)

    # Approximate the binomial distribution by a normal distribution
    # and find the bucket with the fewest keys.
    #
    # The probability that a bucket has exactly i keys is:
    #     normal_pdf(i, mu, sigma)        # probability density function
    #
    # The probability that a bucket has 0 ~ i keys is:
    #     normal_cdf(i, mu, sigma)        # cumulative distribution function
    #
    # The smallest x for which the expected number of buckets with
    # 0 ~ x keys reaches 1 (i.e. normal_cdf(x, mu, sigma) ~ 1/nbucket)
    # is taken as the minimal per-bucket key count.
    p = 1.0 / nbucket
    mu = nkey * p
    sigma = math.sqrt(nkey * p * (1 - p))

    target = 1.0 / nbucket
    minimal = mu
    while minimal > 0:
        xx = normal_cdf(minimal, mu, sigma)
        if xx <= target:
            break
        minimal -= 1

    return minimal, (mu - minimal) * 2 / (mu + (mu - minimal))


def difference_simulation(nbucket, nkey):
    t = str(time.time())
    nbucket, nkey = int(nbucket), int(nkey)

    buckets = [0] * nbucket

    for i in range(nkey):
        hsh = hashlib.sha1((t + str(i)).encode()).digest()
        buckets[int.from_bytes(hsh, 'big') % nbucket] += 1

    buckets.sort()
    nmin, nmax = buckets[0], buckets[-1]

    return nmin, float(nmax - nmin) / nmax


if __name__ == "__main__":
    nbucket, nkey = sys.argv[1:]

    minimal, rate = difference(nbucket, nkey)
    print('by normal distribution:')
    print('    min_bucket:', minimal)
    print('    difference:', rate)

    minimal, rate = difference_simulation(nbucket, nkey)
    print('by simulation:')
    print('    min_bucket:', minimal)
    print('    difference:', rate)
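If the script is saved as, say, hash_bucket.py (file name hypothetical), it takes the bucket count and key count as command-line arguments, e.g. python3 hash_bucket.py 1000 10000, and prints the minimum per-bucket key count and the (most - fewest) / most ratio, first as estimated by the normal approximation and then as measured by simulation.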