Hash probability problem

Last Update:2016-04-15 Source: Internet

Author: User


Hashing is a sharp tool that is used all the time when processing huge amounts of data. We rely on hashes constantly, but have you thought through some of their properties? With a little knowledge of probability and expectation we can analyze some interesting questions about hashing, such as:

The average number of items in each bucket
The average number of lookups
The average number of conflicts
The average number of empty buckets
The expected number of items needed for every bucket to hold at least one item

This article resolves conflicts by separate chaining: different objects with the same hash value are appended to the linked list of the same bucket.

The expected number of items on each bucket

Hash n different items into a hash table of size k. How many items does each bucket hold on average? Any single item item(i) is hashed to bucket 1 with probability 1/k, so after all n items are hashed, the expected number of items in bucket 1 is C(number of items) = n/k. We picked bucket 1 only to keep the description simple; the same expectation holds for any particular bucket, so n/k is the average number of items per bucket.

A program that simulates this is as follows:

    // Fragment of the MyHash class. K, N, getStrings(...) and hash(...) are
    // members of that class which are not shown in this article.

    /**
     * The expected number of items in each bucket when n strings are hashed
     * into a hash table of size K.
     *
     * @return the average number of items per bucket for one run
     */
    private double expectedItemNum() {
        // The number of buckets is K
        int[] bucket = new int[K];
        // Generate the test strings
        List<String> strings = getStrings(N);
        // Hash every string into its bucket
        for (int i = 0; i < strings.size(); i++) {
            int h = hash(strings.get(i), 37);
            bucket[h]++;
        }
        // Calculate the average number of items per bucket
        long sum = 0;
        for (int itemNum : bucket)
            sum += itemNum;
        return 1.0 * sum / K;
    }

    /**
     * Runs the test several times and averages the expected number of
     * items in each bucket.
     */
    private static void expectedItemNumTest() {
        MyHash myHash = new MyHash();
        // Test 100 times
        int tryNum = 100;
        double sum = 0;
        for (int i = 0; i < tryNum; i++) {
            double count = myHash.expectedItemNum();
            sum += count;
        }
        // Take the average of the 100 tests
        double fact = sum / tryNum;
        System.out.println("k=" + K + " n=" + N);
        System.out.println("Simulated expected number of items: " + fact);
        double expected = N * 1.0 / K;
        System.out.println("Estimated expected number n/k: " + expected);
    }

The output is shown below. The value computed with the formula agrees with the simulated one up to floating-point rounding, which suggests the formula is right; after all, practice is the sole criterion for testing truth. In this first case the agreement is guaranteed by construction, because the bucket counts always sum to n and every run therefore yields exactly n/k.

k=1000 n=618
Simulated expected number of items: 0.6180000000000007
Estimated expected number n/k: 0.618

The expected number of empty buckets

When n different items are hashed into a hash table of size k, how many buckets are empty on average? Take bucket 1 again as the example. Any item item(i) misses bucket 1 with probability $1 - 1/k$, so after all n items are hashed, the probability that none of them landed in bucket 1 is $(1 - 1/k)^n$. This is the probability that any given bucket is empty. There are k buckets, so the expected number of empty buckets is $C(\text{empty buckets}) = k(1 - 1/k)^n$. This formula is awkward to evaluate, and in a program it may even come out as zero. Since $(1 - 1/k)^k \approx e^{-1}$ for large k, it can be converted into a form that is easy to compute:

$$C(\text{empty buckets}) = k\left(1-\frac{1}{k}\right)^{n} = k\left(1-\frac{1}{k}\right)^{k\cdot\frac{n}{k}} \approx k e^{-n/k} \qquad (1)$$
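Formula (1) is easy to check numerically. The following is only a minimal sketch, not the author's collapsed test program: the class name `EmptyBucketSim` and its method names are made up for illustration, and an ideal hash function is assumed, modelled here by a uniformly random bucket index from `java.util.Random`.

```java
import java.util.Random;

/** Sketch: compare simulated empty-bucket counts with k(1-1/k)^n and k*e^(-n/k). */
class EmptyBucketSim {

    /** Throws n items into k buckets uniformly at random and counts the empty buckets. */
    static int countEmptyBuckets(int n, int k, Random rng) {
        boolean[] occupied = new boolean[k];
        for (int i = 0; i < n; i++) {
            // Assumption: an ideal hash, modelled as a uniform random bucket index.
            occupied[rng.nextInt(k)] = true;
        }
        int empty = 0;
        for (boolean b : occupied) {
            if (!b) {
                empty++;
            }
        }
        return empty;
    }

    /** Average number of empty buckets over the given number of trials. */
    static double simulate(int n, int k, int trials, long seed) {
        Random rng = new Random(seed);
        long total = 0;
        for (int t = 0; t < trials; t++) {
            total += countEmptyBuckets(n, k, rng);
        }
        return (double) total / trials;
    }

    /** Exact expectation k(1 - 1/k)^n. */
    static double exact(int n, int k) {
        return k * Math.pow(1.0 - 1.0 / k, n);
    }

    /** Approximation k * e^(-n/k) from formula (1). */
    static double approx(int n, int k) {
        return k * Math.exp(-(double) n / k);
    }

    public static void main(String[] args) {
        int n = 618;
        int k = 1000;
        System.out.println("simulated: " + simulate(n, k, 1000, 42L));
        System.out.println("exact:     " + exact(n, k));
        System.out.println("approx:    " + approx(n, k));
    }
}
```

Using `1.0 / k` rather than `1 / k` matters here: with integer operands the division would be truncated before the power is taken.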

Again we check the formula with a simulation, which outputs:

k=1000 n=618
Simulated expected number of empty buckets: 539.0
Estimated expected number of empty buckets ke^(-n/k): 539.021403076357

The expected number of conflicts

Here the n items are all different. Whenever an item is hashed into a bucket that already holds another item, we count one conflict. The number of conflicts is awkward to compute directly, but we know that C(number of conflicts) = n - C(number of occupied buckets), and that C(number of occupied buckets) = k - C(number of empty buckets), so we get:

$$C(\text{conflicts}) = n - \left(k - k e^{-n/k}\right) \qquad (2)$$
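The counting argument behind formula (2) can be sketched in a few lines. This is an illustration under assumptions, not the author's test code: `CollisionSim` and its methods are hypothetical names, and the hash function is again replaced by a uniformly random bucket index.

```java
import java.util.Random;

/** Sketch: count conflicts directly and compare with n - (k - k*e^(-n/k)). */
class CollisionSim {

    /** Inserts n items; an insert into an already occupied bucket counts as one conflict. */
    static int countCollisions(int n, int k, Random rng) {
        boolean[] occupied = new boolean[k];
        int collisions = 0;
        for (int i = 0; i < n; i++) {
            // Assumption: an ideal hash, modelled as a uniform random bucket index.
            int b = rng.nextInt(k);
            if (occupied[b]) {
                collisions++;
            } else {
                occupied[b] = true;
            }
        }
        return collisions;
    }

    /** Average number of conflicts over the given number of trials. */
    static double simulate(int n, int k, int trials, long seed) {
        Random rng = new Random(seed);
        long total = 0;
        for (int t = 0; t < trials; t++) {
            total += countCollisions(n, k, rng);
        }
        return (double) total / trials;
    }

    /** Estimate from formula (2): n - (k - k*e^(-n/k)). */
    static double estimate(int n, int k) {
        return n - (k - k * Math.exp(-(double) n / k));
    }

    public static void main(String[] args) {
        int n = 618;
        int k = 1000;
        System.out.println("simulated: " + simulate(n, k, 1000, 42L));
        System.out.println("estimate:  " + estimate(n, k));
    }
}
```

Every insert either occupies a new bucket or causes a conflict, so the count returned by `countCollisions` always equals n minus the number of occupied buckets, which is exactly the identity used in the derivation.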

A simulation of this case outputs:

k=1000 n=618
Simulated number of conflicts: 157.89
Estimated expected number of conflicts n-(k-ke^(-n/k)): 157.02140307635705

The probability of no conflict

What is the probability that no conflict occurs after n items are hashed? Consider the items one by one. The first item, item(1), can be hashed into any bucket. Once its bucket is fixed, item(2) must land in one of the other k-1 buckets, item(3) in one of the remaining k-2 buckets, and so on. Therefore

$$P(\text{no conflict}) = \frac{k}{k}\times\frac{k-1}{k}\times\frac{k-2}{k}\times\cdots\times\frac{k-(n-1)}{k}$$

This product is awkward to compute, but when k is large and n is comparatively small, we have the approximation

$$P(\text{no conflict}) \approx e^{-\frac{n(n-1)}{2k}}$$
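To see how close the approximation is, the exact product can be evaluated term by term and compared with the exponential. The sketch below is an illustration only; the class `NoCollisionProbability` and its method names are assumptions and do not come from the original post.

```java
/** Sketch: exact no-conflict probability versus the approximation e^(-n(n-1)/(2k)). */
class NoCollisionProbability {

    /** Exact product (k/k) * ((k-1)/k) * ... * ((k-(n-1))/k). */
    static double exact(int n, int k) {
        if (n > k) {
            return 0.0; // more items than buckets: a conflict is certain
        }
        double p = 1.0;
        for (int i = 0; i < n; i++) {
            p *= (double) (k - i) / k;
        }
        return p;
    }

    /** Approximation e^(-n(n-1)/(2k)), reasonable when k is large and n is small. */
    static double approx(int n, int k) {
        return Math.exp(-(double) n * (n - 1) / (2.0 * k));
    }

    public static void main(String[] args) {
        int n = 50;
        int k = 1000;
        System.out.println("exact no-conflict probability:  " + exact(n, k));
        System.out.println("approx no-conflict probability: " + approx(n, k));
        System.out.println("approx conflict probability:    " + (1.0 - approx(n, k)));
    }
}
```

With k = 365 and n = 23 the exact product is about 0.493, the familiar birthday-problem value. The product is computed in floating point one factor at a time, so no large factorials are needed.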

A simulation of this case gives the output below. The approximation has a small error only when k is large and n is comparatively small.

k=1000 n=50
Simulated probability of no conflict: 0.29
Estimated probability of no conflict e^(-n(n-1)/(2k)): 0.29375770032353277
Simulated probability of a conflict: 0.71
Estimated probability of a conflict 1-e^(-n(n-1)/(2k)): 0.7062422996764672

The expected number of items needed for every bucket to hold at least one item

When a hash table is used in practice, we do not know in advance how many items will be hashed. Making the table too large wastes space, so the usual approach is to start with an initial size and, once the number of hashed items exceeds a certain threshold, double the number of buckets and rehash the elements already stored. Java's HashMap source shows this: every call to put checks the size, and when n > k × load factor the HashMap is rebuilt (in the version shown below, only if the target bucket is also already occupied).

    public V put(K key, V value) {
        if (...)
            return ...;
        ...
        modCount++;
        addEntry(hash, key, value, i);
        return null;
    }

    /**
     * Adds a new entry with the specified key, value and hash code to
     * the specified bucket.  It is the responsibility of this
     * method to resize the table if appropriate.
     *
     * Subclass overrides this to alter the behavior of put method.
     */
    void addEntry(int hash, K key, V value, int bucketIndex) {
        if ((size >= threshold) && (null != table[bucketIndex])) {
            resize(2 * table.length);
            hash = (null != key) ? hash(key) : 0;
            bucketIndex = indexFor(hash, table.length);
        }

        createEntry(hash, key, value, bucketIndex);
    }

Now suppose that, instead of rebuilding the hash table as soon as n exceeds a fixed threshold, we rebuild it only once every bucket holds at least one item. How many items must be inserted, on average, before that happens? To compute this expectation, let $X_i$ be the number of insertions needed to go from $i-1$ occupied buckets to $i$ occupied buckets. Clearly $X_1 = 1$. $X_2$ is the number of insertions needed, after the first item, until a second bucket is occupied; in principle it can be any positive integer. Items are inserted one at a time, and each insertion has two possible outcomes, independent of the earlier insertions: the item lands in the bucket that is already occupied, or it lands in a new bucket. The probability of landing in a new bucket is $p = \frac{k-1}{k}$, so $X_2$ follows a geometric distribution and $E(X_2) = 1/p = \frac{k}{k-1}$. In the same way, once two buckets are occupied, the probability that an item lands in a new bucket is $\frac{k-2}{k}$, so $E(X_3) = \frac{k}{k-2}$, and in general $E(X_j) = \frac{k}{k-j+1}$.

Now define the random variable $X = X_1 + X_2 + \cdots + X_k$. X is exactly the number of items that must be inserted until every bucket is occupied.

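$$\begin{aligned}
E(X) &= \sum_{j=1}^{k} E(X_j) \\
     &= \sum_{j=1}^{k} \frac{k}{k-j+1} \\
     &= k\sum_{j=1}^{k} \frac{1}{k-j+1} \\
     &= k\sum_{i=1}^{k} \frac{1}{i} \qquad (\text{substituting } i = k-j+1)
\end{aligned}$$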

The sum $H_k = \sum_{i=1}^{k} \frac{1}{i}$ is an interesting quantity called the harmonic number (Harmonic_number). It has no finite limit as k grows, but mathematicians have given us bounds for it in terms of $\ln k$:

$$\frac{1}{4} + \ln k \le H_k \le 1 + \ln k$$

So $E(X) = kH_k = \Theta(k\ln k)$: on average about $k\ln k$ items have to be inserted before every bucket holds at least one item.
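The value k·H_k can be compared with a direct simulation of the filling process. The following is a rough sketch under assumptions: `FillAllBuckets` and its methods are hypothetical names, and the hash function is modelled as a uniformly random bucket index.

```java
import java.util.Random;

/** Sketch: number of insertions needed until every one of k buckets is occupied. */
class FillAllBuckets {

    /** Harmonic number H_k = 1 + 1/2 + ... + 1/k. */
    static double harmonic(int k) {
        double h = 0.0;
        for (int i = 1; i <= k; i++) {
            h += 1.0 / i;
        }
        return h;
    }

    /** Expected number of insertions E(X) = k * H_k. */
    static double expectedInsertions(int k) {
        return k * harmonic(k);
    }

    /** Inserts items at random until no bucket is empty and returns how many were inserted. */
    static int insertionsUntilFull(int k, Random rng) {
        boolean[] occupied = new boolean[k];
        int filled = 0;
        int inserted = 0;
        while (filled < k) {
            // Assumption: an ideal hash, modelled as a uniform random bucket index.
            int b = rng.nextInt(k);
            inserted++;
            if (!occupied[b]) {
                occupied[b] = true;
                filled++;
            }
        }
        return inserted;
    }

    /** Average number of insertions over the given number of trials. */
    static double simulate(int k, int trials, long seed) {
        Random rng = new Random(seed);
        long total = 0;
        for (int t = 0; t < trials; t++) {
            total += insertionsUntilFull(k, rng);
        }
        return (double) total / trials;
    }

    public static void main(String[] args) {
        int k = 100;
        System.out.println("simulated: " + simulate(k, 2000, 7L));
        System.out.println("k * H_k:   " + expectedInsertions(k));
        System.out.println("k * ln k:  " + k * Math.log(k));
    }
}
```

Individual runs vary a lot around the mean, so the simulated average only settles near k·H_k after many trials.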

Summary

Expected number of items in each bucket: when n items are hashed into a hash table of size k, each bucket holds $n/k$ items on average.
Expected number of empty buckets: when n items are hashed into a hash table of size k, the average number of empty buckets is $ke^{-n/k}$.
Expected number of conflicts: a conflict occurs when an item is hashed into a bucket that already holds an item. When n items are hashed into a hash table of size k, the average number of conflicts is $n - (k - ke^{-n/k})$.
Probability of no conflict: when n items are hashed into a hash table of size k, with k large and n comparatively small, the probability that no conflict occurs is approximately $e^{-n(n-1)/(2k)}$.
Harmonic number: $H_k = \sum_{i=1}^{k} \frac{1}{i}$ is called the harmonic number, and $H_k = \Theta(\log k)$.

This article mainly follows reference [1]. Writing it was a chance to review combinatorics and probability theory and to understand hashing a little more deeply, so that the performance of a self-designed hash structure can be estimated with some confidence. I also learned how to insert formulas directly into the blog; before that I typed them in MathType first.


This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
