Hash probability problem

Source: Internet
Author: User

Hashing is a sharp tool that we reach for constantly when processing large amounts of data, yet we rarely stop to think about its properties. With some basic knowledge of probability and expectation we can analyze several interesting questions about hash tables, such as:

    • The average number of items on each bucket
    • Average number of lookups
    • Average number of collisions
    • Average number of empty buckets
    • Expected number of items that must be inserted before every bucket holds at least one item

This article resolves collisions by separate chaining: distinct objects with the same hash value are appended to the linked list of the same bucket.

The expected number of items on each bucket

  Suppose n distinct items are hashed into a hash table of size k. How many items does each bucket hold on average? Any single item item(i) is hashed to the 1st bucket with probability 1/k, so after all n items are hashed the expected number of items in the 1st bucket is C(number of items) = n/k. The first bucket is chosen only for convenience; the same expectation holds for any particular bucket. This is the average number of items per bucket.

The simulation code is as follows:

    // Fragment of the MyHash class. Requires: import java.util.List;
    // The fields k and n and the helpers getStrings(...) and hash(...)
    // are defined elsewhere in the class and are not shown here.

    /**
     * Expected number of items per bucket when n strings are hashed
     * into a hash table of size k.
     *
     * @return the average number of items per bucket for one run
     */
    private double expectedItemNum() {
        // The table has k buckets
        int[] bucket = new int[k];
        // Generate the test strings
        List<String> strings = getStrings(n);
        // Hash every string into its bucket
        for (int i = 0; i < strings.size(); i++) {
            int h = hash(strings.get(i), 37);
            bucket[h]++;
        }
        // Compute the average number of items per bucket
        int sum = 0;
        for (int itemNum : bucket)
            sum += itemNum;
        return 1.0 * sum / k;
    }

    /**
     * Runs the test several times and averages the expected number
     * of items per bucket.
     */
    private static void expectedItemNumTest() {
        MyHash myHash = new MyHash();
        // Run the test 100 times
        int tryNum = 100;
        double sum = 0;
        for (int i = 0; i < tryNum; i++) {
            double count = myHash.expectedItemNum();
            sum += count;
        }
        // Average over the 100 runs
        double fact = sum / tryNum;
        System.out.println("k=" + k + " n=" + n);
        System.out.println("Simulated expected number: " + fact);
        double expected = n * 1.0 / k;
        System.out.println("Estimated expected number n/k: " + expected);
    }

The output is shown below. The value computed with the formula is very close to the simulated value, which supports the formula; practice is, after all, the sole criterion for testing truth.

    k=1000 n=618
    Simulated expected number: 0.6180000000000007
    Estimated expected number n/k: 0.618
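Because the bucket counts in any single run always add up to n, averaging over all k buckets returns n/k in every run, apart from floating-point rounding. A stricter check of the per-bucket claim is to count how many items land in one fixed bucket and average that count over many runs. The following is a minimal sketch under stated assumptions: the class `SingleBucketLoad` and its `simulate` method are hypothetical names, and an ideal hash function is modeled by a uniform draw from `java.util.Random` rather than by the article's `hash` helper.

```java
import java.util.Random;

// Hypothetical sketch: estimates the mean number of items that land in bucket 0.
final class SingleBucketLoad {

    /** Average, over `trials` runs, of how many of n items fall into bucket 0 of k. */
    static double simulate(int n, int k, int trials, long seed) {
        Random rnd = new Random(seed);
        long total = 0;
        for (int t = 0; t < trials; t++) {
            for (int i = 0; i < n; i++) {
                // Assumption: an ideal hash behaves like a uniform draw from [0, k).
                if (rnd.nextInt(k) == 0) {
                    total++;
                }
            }
        }
        return (double) total / trials;
    }

    public static void main(String[] args) {
        int n = 618, k = 1000;
        System.out.println("simulated items in bucket 0: " + simulate(n, k, 20000, 42L));
        System.out.println("formula n/k:                 " + (double) n / k);
    }
}
```

Unlike the all-bucket average, the value printed here fluctuates from seed to seed, so agreement with n/k is an actual test of the formula.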

The expected number of empty buckets

Suppose n distinct items are hashed into a hash table of size k. How many buckets are empty on average? Take the 1st bucket again. Any item item(i) misses the 1st bucket with probability (1 - 1/k), so after all n items are hashed the probability that none of them landed in the 1st bucket is (1 - 1/k)^n. This is the probability that any given bucket is empty. There are k buckets, so the expected number of empty buckets is C(number of empty buckets) = k(1 - 1/k)^n. This formula is awkward to evaluate and can run into numerical problems in a program, so we convert it into a form that is easier to compute, using the fact that (1 - 1/k)^k is close to 1/e when k is large:

$$C(\text{number of empty buckets}) = k\left(1-\frac{1}{k}\right)^{n} = k\left[\left(1-\frac{1}{k}\right)^{k}\right]^{n/k} \approx k e^{-n/k} \qquad (1)$$

Simulating again gives the following output:

    k=1000 n=618
    Simulated expected number of empty buckets: 539.0
    Estimated expected number of empty buckets ke^(-n/k): 539.021403076357
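A simulation of this kind could look roughly like the sketch below. It is not the author's code: the class `EmptyBuckets` and its method names are hypothetical, and bucket choice is modeled by a uniform draw from `java.util.Random`, which assumes an ideal hash function. It also evaluates the exact expression k(1 - 1/k)^n so the size of the approximation error in equation (1) can be seen.

```java
import java.util.Random;

// Hypothetical sketch: empty buckets after hashing n items into k buckets.
final class EmptyBuckets {

    /** Approximation k * e^(-n/k) from equation (1). */
    static double estimate(int n, int k) {
        return k * Math.exp(-(double) n / k);
    }

    /** Exact expectation k * (1 - 1/k)^n. */
    static double exact(int n, int k) {
        return k * Math.pow(1.0 - 1.0 / k, n);
    }

    /** Mean number of empty buckets over `trials` simulated runs. */
    static double simulate(int n, int k, int trials, long seed) {
        Random rnd = new Random(seed);
        long emptyTotal = 0;
        for (int t = 0; t < trials; t++) {
            boolean[] used = new boolean[k];
            int occupied = 0;
            for (int i = 0; i < n; i++) {
                int b = rnd.nextInt(k); // assumed ideal, uniform hash
                if (!used[b]) {
                    used[b] = true;
                    occupied++;
                }
            }
            emptyTotal += k - occupied;
        }
        return (double) emptyTotal / trials;
    }

    public static void main(String[] args) {
        int n = 618, k = 1000;
        System.out.println("simulated: " + simulate(n, k, 2000, 7L));
        System.out.println("exact:     " + exact(n, k));
        System.out.println("estimate:  " + estimate(n, k));
    }
}
```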

The expected number of collisions

The n items are all distinct. Whenever an item is hashed to a bucket that already holds another item, we count one collision. Computing the number of collisions directly is awkward, but we know that C(number of collisions) = n - C(number of occupied buckets), and that C(number of occupied buckets) = k - C(number of empty buckets), so we get:

$$C(\text{number of collisions}) = n - \left(k - k e^{-n/k}\right) \qquad (2)$$

A simulation gives the following output:

    k=1000 n=618
    Simulated number of collisions: 157.89
    Estimated expected number of collisions n-(k-ke^(-n/k)): 157.02140307635705
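The counting rule above (one collision each time an item lands in an already occupied bucket) can be sketched as follows. The class `CollisionCount` and its methods are hypothetical names, and the hash function is again assumed to be ideal and modeled with `java.util.Random`.

```java
import java.util.Random;

// Hypothetical sketch: counts collisions when n items are hashed into k buckets.
final class CollisionCount {

    /** Approximation n - (k - k * e^(-n/k)) from equation (2). */
    static double estimate(int n, int k) {
        return n - (k - k * Math.exp(-(double) n / k));
    }

    /** Mean number of collisions over `trials` simulated runs. */
    static double simulate(int n, int k, int trials, long seed) {
        Random rnd = new Random(seed);
        long collisions = 0;
        for (int t = 0; t < trials; t++) {
            boolean[] used = new boolean[k];
            for (int i = 0; i < n; i++) {
                int b = rnd.nextInt(k); // assumed ideal, uniform hash
                if (used[b]) {
                    collisions++;       // bucket already holds an item
                } else {
                    used[b] = true;     // first item in this bucket
                }
            }
        }
        return (double) collisions / trials;
    }

    public static void main(String[] args) {
        int n = 618, k = 1000;
        System.out.println("simulated: " + simulate(n, k, 2000, 11L));
        System.out.println("estimate:  " + estimate(n, k));
    }
}
```

Every item either opens a new bucket or collides, so the collision count in a run is always n minus the number of occupied buckets, which is the identity the derivation relies on.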

The probability of no collision

Consider the probability that no collision occurs when n items are hashed. The first item, item(1), can be hashed to any bucket. Once item(1) is placed, the second item, item(2), must land in one of the other k - 1 buckets, and so on. It follows that

$$P(\text{no collision}) = \frac{k}{k}\times\frac{k-1}{k}\times\frac{k-2}{k}\times\cdots\times\frac{k-(n-1)}{k}$$

This probability is awkward to compute, but when k is large and n is small relative to k we have

$$P(\text{no collision}) \approx e^{-\frac{n(n-1)}{2k}}$$
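The step from the exact product to the exponential form can be filled in with the first-order approximation $1 - x \approx e^{-x}$, which is accurate when $x$ is small. This is a standard argument, given here as a sketch and not necessarily the derivation used in reference [1]:

$$
\begin{aligned}
P(\text{no collision}) &= \prod_{i=0}^{n-1}\left(1-\frac{i}{k}\right) \\
&\approx \prod_{i=0}^{n-1} e^{-i/k} \\
&= e^{-\frac{1}{k}\sum_{i=0}^{n-1} i} \\
&= e^{-\frac{n(n-1)}{2k}}
\end{aligned}
$$

Every factor needs $i/k$ to be small, which is why the approximation requires $n$ to be small relative to $k$.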

A simulation gives the output below. The approximation has a small error only when k is large and n is small by comparison.

    k=1000 n=50
    Simulated probability of no collision: 0.29
    Estimated probability of no collision e^(-n(n-1)/(2k)): 0.29375770032353277
    Simulated probability of a collision: 0.71
    Estimated probability of a collision 1-e^(-n(n-1)/(2k)): 0.7062422996764672
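To see how far the approximation is from the exact product, both can be evaluated directly and compared with a simulation. The sketch below is illustrative only: `NoCollisionProbability` and its methods are hypothetical names, and the simulation assumes an ideal hash modeled by `java.util.Random`.

```java
import java.util.Random;

// Hypothetical sketch: probability that n items hashed into k buckets never collide.
final class NoCollisionProbability {

    /** Exact product (k/k) * ((k-1)/k) * ... * ((k-(n-1))/k). */
    static double exact(int n, int k) {
        if (n > k) {
            return 0.0; // more items than buckets: a collision is certain
        }
        double p = 1.0;
        for (int i = 0; i < n; i++) {
            p *= (double) (k - i) / k;
        }
        return p;
    }

    /** Approximation e^(-n(n-1)/(2k)). */
    static double approx(int n, int k) {
        return Math.exp(-(double) n * (n - 1) / (2.0 * k));
    }

    /** Fraction of simulated runs in which no two items share a bucket. */
    static double simulate(int n, int k, int trials, long seed) {
        Random rnd = new Random(seed);
        int clean = 0;
        for (int t = 0; t < trials; t++) {
            boolean[] used = new boolean[k];
            boolean collided = false;
            for (int i = 0; i < n && !collided; i++) {
                int b = rnd.nextInt(k); // assumed ideal, uniform hash
                collided = used[b];
                used[b] = true;
            }
            if (!collided) {
                clean++;
            }
        }
        return (double) clean / trials;
    }

    public static void main(String[] args) {
        int n = 50, k = 1000;
        System.out.println("exact:     " + exact(n, k));
        System.out.println("approx:    " + approx(n, k));
        System.out.println("simulated: " + simulate(n, k, 20000, 3L));
    }
}
```

For k = 1000 and n = 50 the exact product is slightly smaller than the approximation, because $1 - x \le e^{-x}$ for every factor.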

Expected number of items needed for every bucket to hold at least one item

  In practice we do not know in advance how many items will be hashed. Making the table too large wastes space, so a table usually starts at some initial size; once the number of hashed items exceeds a certain count, the table is doubled in size and the existing elements are rehashed into it. The source of Java's HashMap shows this: every put that adds an entry checks the size, and once n reaches k × load factor the table is rebuilt.

    public V put(K key, V value) {
        if (...)
            return ...;
        ...
        modCount++;
        addEntry(hash, key, value, i);
        return null;
    }

    /**
     * Adds a new entry with the specified key, value and hash code to
     * the specified bucket.  It is the responsibility of this
     * method to resize the table if appropriate.
     *
     * Subclass overrides this to alter the behavior of put method.
     */
    void addEntry(int hash, K key, V value, int bucketIndex) {
        if ((size >= threshold) && (null != table[bucketIndex])) {
            resize(2 * table.length);
            hash = (null != key) ? hash(key) : 0;
            bucketIndex = indexFor(hash, table.length);
        }

        createEntry(hash, key, value, bucketIndex);
    }
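The grow-and-rehash policy described above can also be sketched independently of HashMap's internals. The class below is a hypothetical, simplified chained set and not the java.util.HashMap implementation: the name `ResizingChainedSet`, the initial size of 4, and the exact resize condition are assumptions made for illustration. The load factor of 0.75 is chosen to match HashMap's documented default.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

// Hypothetical sketch, not the java.util.HashMap implementation.
final class ResizingChainedSet {
    // Assumed for this sketch; 0.75 is also HashMap's documented default.
    private static final double LOAD_FACTOR = 0.75;

    private List<LinkedList<String>> buckets = newBuckets(4);
    private int size = 0;

    private static List<LinkedList<String>> newBuckets(int k) {
        List<LinkedList<String>> b = new ArrayList<LinkedList<String>>(k);
        for (int i = 0; i < k; i++) {
            b.add(new LinkedList<String>());
        }
        return b;
    }

    private static int indexFor(String key, int k) {
        return Math.floorMod(key.hashCode(), k);
    }

    /** Adds key if absent; doubles the table once size exceeds k * LOAD_FACTOR. */
    boolean add(String key) {
        LinkedList<String> chain = buckets.get(indexFor(key, buckets.size()));
        if (chain.contains(key)) {
            return false;
        }
        chain.add(key);
        size++;
        if (size > buckets.size() * LOAD_FACTOR) {
            resize(2 * buckets.size());
        }
        return true;
    }

    boolean contains(String key) {
        return buckets.get(indexFor(key, buckets.size())).contains(key);
    }

    /** Rehashes every stored key into a new bucket array of size newK. */
    private void resize(int newK) {
        List<LinkedList<String>> old = buckets;
        buckets = newBuckets(newK);
        for (LinkedList<String> chain : old) {
            for (String key : chain) {
                buckets.get(indexFor(key, newK)).add(key);
            }
        }
    }

    int capacity() { return buckets.size(); }

    int size() { return size; }

    public static void main(String[] args) {
        ResizingChainedSet set = new ResizingChainedSet();
        for (int i = 0; i < 100; i++) {
            set.add("item" + i);
        }
        System.out.println("size=" + set.size() + " capacity=" + set.capacity());
    }
}
```

Doubling keeps the average chain length bounded by the load factor, at the cost of an occasional full rehash.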

Now suppose that, instead of rebuilding the hash table as soon as n exceeds some threshold, we rebuild it only once every bucket holds at least one item. How large must n be, on average, for every bucket to hold at least one item? To compute this expectation, let $X_i$ denote the number of items inserted from the moment $i-1$ buckets are occupied until the moment $i$ buckets are occupied. Clearly $X_1 = 1$. $X_2$ is the number of insertions needed, after the first item is in place, until two buckets are occupied; in principle it can be any positive integer. We insert one item at a time, and each insertion has two possible outcomes, independently of the others: the item lands in the bucket that is already occupied, or it lands in a new bucket. $X_2$ is the number of insertions up to and including the one that occupies a new bucket. The probability of landing in a new bucket at this stage is $p = \frac{k-1}{k}$, so $X_2$ is geometrically distributed and $E(X_2) = 1/p = \frac{k}{k-1}$. In the same way, once two buckets are occupied, the probability that an item lands in a new bucket is $\frac{k-2}{k}$, so $E(X_3) = \frac{k}{k-2}$.

Now define the random variable $X = X_1 + X_2 + \cdots + X_k$. $X$ is exactly the number of items that must be inserted until every bucket is occupied.

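$$
\begin{aligned}
E(X) &= \sum_{j=1}^{k} E(X_j) \\
&= \sum_{j=1}^{k} \frac{k}{k-j+1} \\
&= k\sum_{j=1}^{k} \frac{1}{k-j+1} \\
&= k\sum_{i=1}^{k} \frac{1}{i} \qquad (\text{substituting } i = k-j+1)
\end{aligned}
$$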

The sum above is an interesting quantity called the harmonic number, $H_k = \sum_{i=1}^{k} \frac{1}{i}$. It does not converge to a limit as $k$ grows, but mathematicians have given us the following bounds:

$$\frac{1}{4} + \ln k \le H_k \le 1 + \ln k$$

So $E(X) = kH_k = O(k \ln k)$: after about $k \ln k$ items have been inserted, we expect every bucket to hold at least one item.
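The result can be checked numerically by computing $kH_k$ and comparing it with a simulation that keeps inserting items until no bucket is empty. This is a minimal sketch under stated assumptions: `CouponCollector` and its methods are hypothetical names, and an ideal hash is modeled by `java.util.Random`.

```java
import java.util.Random;

// Hypothetical sketch: insertions needed until all k buckets are occupied.
final class CouponCollector {

    /** Harmonic number H_k = 1 + 1/2 + ... + 1/k. */
    static double harmonic(int k) {
        double h = 0.0;
        for (int i = 1; i <= k; i++) {
            h += 1.0 / i;
        }
        return h;
    }

    /** Expected number of insertions, E(X) = k * H_k. */
    static double expectedInsertions(int k) {
        return k * harmonic(k);
    }

    /** Mean number of insertions until every bucket is occupied, over `trials` runs. */
    static double simulate(int k, int trials, long seed) {
        Random rnd = new Random(seed);
        long total = 0;
        for (int t = 0; t < trials; t++) {
            boolean[] used = new boolean[k];
            int occupied = 0;
            while (occupied < k) {
                int b = rnd.nextInt(k); // assumed ideal, uniform hash
                total++;
                if (!used[b]) {
                    used[b] = true;
                    occupied++;
                }
            }
        }
        return (double) total / trials;
    }

    public static void main(String[] args) {
        int k = 1000;
        System.out.println("k * H_k:   " + expectedInsertions(k));
        System.out.println("k * ln k:  " + k * Math.log(k));
        System.out.println("simulated: " + simulate(k, 2000, 5L));
    }
}
```

The number of insertions varies widely from run to run, so a few thousand runs are averaged before comparing with $kH_k$.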

Summary of conclusions

  • Expected number of items per bucket: hashing n items into a hash table of size k gives an average of n/k items per bucket.
  • Expected number of empty buckets: hashing n items into a hash table of size k leaves an average of $k e^{-n/k}$ empty buckets.
  • Expected number of collisions: a collision occurs when an item is hashed to a bucket that already holds an item. Hashing n items into a hash table of size k produces an average of $n - (k - k e^{-n/k})$ collisions.
  • Probability of no collision: when k is large and n is small relative to k, hashing n items into a hash table of size k causes no collision with probability approximately $e^{-n(n-1)/(2k)}$.
  • Harmonic number: $H_k = \sum_{i=1}^{k} \frac{1}{i}$ is called the harmonic number, and $H_k = \Theta(\log k)$.

This article mainly follows reference [1]. Writing this post let me review some combinatorics and probability theory and understand hashing a little more deeply, so that I can reason about the performance of hash structures I design myself. I also learned how to insert formulas directly on Blog Park; previously I typed them in MathType first.
