[Application of reservoir sampling] How to Select k elements from n elements with equal probability

Source: Internet
Author: User
How do you select k elements from n elements so that every element is equally likely to be chosen? This is the reservoir sampling problem. The algorithm can be described as follows:

 Init: a reservoir of size k, holding the first k elements

 For i = k + 1 To n
     m = random(1, i)
     If (m <= k)
         Swap the m-th value and the i-th value
 End

Someone has posted a proof online; it is reproduced here first:

[Reposted]

Proof:

 

Element I is selected with probability k/I.
For example, if k is 1000, element 1001 is selected with probability 1000/1001 and element 1002 with probability 1000/1002, which is consistent with our intuition.
 
The proof follows.
 Assume that the current element is I + 1. By the rule above, element I + 1 is selected with probability k/(I + 1); that is, element I + 1 ends up in the reservoir with probability k/(I + 1).
 Now consider the first I elements. If each of them is also in the reservoir with probability k/(I + 1), the algorithm is correct.

This can be proved by induction on I, for k < I <= N:
 1. When I = k + 1, the reservoir has capacity k. Element k + 1 is selected with probability k/(k + 1), and each of the first k elements remains in the reservoir with probability k/(k + 1). The conclusion clearly holds.
 2. Assume the conclusion holds when J = I: element I is selected with probability k/I, and each of the first I - 1 elements is in the reservoir with probability k/I.
 Now verify the case J = I + 1:
 that is, when element I + 1 is selected with probability k/(I + 1), each of the first I elements must be in the reservoir with probability k/(I + 1).
 The probability that one of the first I elements is in the reservoir has two parts: ① the element is in the reservoir before element I + 1 is processed, and ② it is not replaced when element I + 1 is processed.
 ① By the hypothesis in step 2, before element I + 1 is processed, each of the first I elements is in the reservoir with probability k/I.
 ② Consider the probability of being replaced:
 First, element I + 1 must be selected (otherwise nothing is replaced), which has probability k/(I + 1). Second, the element to be replaced is chosen at random from the k elements in the reservoir, so a given one is chosen with probability 1/k.
 Probability that a given one of the first I elements is replaced = k/(I + 1) * 1/k = 1/(I + 1)
 Probability that it is not replaced = 1 - 1/(I + 1) = I/(I + 1)
 Combining ① and ② by the multiplication rule:
 the probability that one of the first I elements is in the reservoir is k/I * I/(I + 1) = k/(I + 1).
 This completes the proof.
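The result of the proof can also be checked empirically. The following is a minimal sketch and not part of the original proof: `reservoir_sample` and `inclusion_frequency` are illustrative names, `std::mt19937` is an assumed choice of generator, and the pool stores the values 1..n directly. With n = 10 and k = 3, every frequency should come out close to k/n = 0.3.

```cpp
#include <cassert>
#include <random>
#include <vector>

// Reservoir sampling over the values 1..n, keeping k of them.
// (Illustrative name and signature, not from the original text.)
std::vector<int> reservoir_sample(int n, int k, std::mt19937& gen) {
    assert(k >= 1 && k <= n);
    std::vector<int> pool(k);
    for (int i = 1; i <= k; ++i) pool[i - 1] = i;       // first k elements fill the pool
    for (int i = k + 1; i <= n; ++i) {
        std::uniform_int_distribution<int> pick(1, i);  // m = random(1, i)
        int m = pick(gen);
        if (m <= k) pool[m - 1] = i;                    // happens with probability k/i
    }
    return pool;
}

// Fraction of trials in which each value 1..n ends up in the pool.
// Index 0 of the result is unused.
std::vector<double> inclusion_frequency(int n, int k, int trials, unsigned seed) {
    assert(trials > 0);
    std::mt19937 gen(seed);
    std::vector<int> hits(n + 1, 0);
    for (int t = 0; t < trials; ++t)
        for (int v : reservoir_sample(n, k, gen)) ++hits[v];
    std::vector<double> freq(n + 1, 0.0);
    for (int v = 1; v <= n; ++v)
        freq[v] = static_cast<double>(hits[v]) / trials;
    return freq;
}
```

Calling `inclusion_frequency(10, 3, 100000, 12345u)` and printing entries 1 to 10 is one way to compare the early elements (which start in the pool) with the late ones.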

 

I recently came across several methods for this sampling problem; they are summarized below.

Problem: extract m elements from 1, 2, 3, ..., N with equal probability.

1. Use the reservoir sampling described above

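// Needs <algorithm>, <cstdlib>, <iostream>; assumes "using namespace std;".
// Both space and time are O(N).
void sample_pool(const int N, const int m)
{
    int i, rd;
    int *x = new int[N];
    for (i = 0; i < N; i++)
        x[i] = i + 1;
    for (i = m; i < N; i++)
    {
        rd = rand() % (i + 1);   // uniform index in [0, i]
        if (rd < m)              // true with probability m / (i + 1)
            swap(x[i], x[rd]);
    }
    for (i = 0; i < m; i++)
        cout << x[i] << " ";
    delete[] x;
    x = NULL;
}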

2. To select m out of N, first decide whether the first element is taken, then select the rest from the remaining N - 1 elements: m - 1 of them if the first was taken, m otherwise.

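// Needs <cstdlib>, <iostream>; assumes "using namespace std;".
void sample_rand(const int N, const int m)
{
    int select = m, i, rd;   // select: how many still have to be chosen
    int remain = N;          // remain: how many candidates are left
    for (i = 0; i < N; i++)
    {
        rd = rand() % remain;
        if (rd < select)     // true with probability select / remain
        {
            cout << i << " ";
            select--;
        }
        remain--;
    }
}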

 The method above is a classic, proposed by Knuth in The Art of Computer Programming. It uses O(1) extra space and O(N) time. The proof of its probability is also simple: a short derivation shows that every element is selected with equal probability. In addition, exactly m elements are selected in the end: if too few have been selected so far, then once remain == select every remaining element is selected.
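For the first two elements, that derivation can be written out as follows. This is a sketch using N and m as in the problem statement; the same argument extends to later elements by conditioning on how many have been selected so far.

```latex
\begin{aligned}
P(\text{element 1 selected}) &= \frac{m}{N},\\
P(\text{element 2 selected})
  &= \frac{m}{N}\cdot\frac{m-1}{N-1} + \frac{N-m}{N}\cdot\frac{m}{N-1}
   = \frac{m\,(m-1+N-m)}{N(N-1)}
   = \frac{m}{N}.
\end{aligned}
```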

3. View the sample as a set: we need to choose m distinct elements from N and store them in a set, so a set data structure can do the whole job.

This can be done with the set in the STL.

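// Needs <cstdlib>, <iostream>, <set>; assumes "using namespace std;" and m <= N.
void sample_set(const int N, const int m)
{
    set<int> s;
    while ((int)s.size() < m)
        s.insert(rand() % N);   // duplicates are ignored by the set
    for (set<int>::iterator it = s.begin(); it != s.end(); it++)
        cout << *it << " ";
}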

4. Shuffle an increasing sequence.

For i = [0, n)

Swap(x[i], x[rand(i, n-1)]);

Someone has proved that it is enough to shuffle only the first m positions.

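// Needs <algorithm>, <cstdlib>, <iostream>; assumes "using namespace std;".
void sample_shuf(const int N, const int m)
{
    int i, j;
    int *x = new int[N];
    for (i = 0; i < N; i++)
        x[i] = i + 1;
    for (i = 0; i < m; i++)
    {
        j = i + rand() % (N - i);   // uniform index in [i, N-1]
        swap(x[i], x[j]);
    }
    sort(x, x + m);
    for (i = 0; i < m; i++)         // print the m selected values
        cout << x[i] << " ";
    delete[] x;
    x = NULL;
}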
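The pseudocode's `rand(i, n-1)` is not a standard C++ function. As a minimal sketch, the same partial shuffle can be written with the `<random>` header, which provides a uniform integer in a closed range directly. The name `sample_partial_shuffle`, the use of `std::mt19937`, and returning a vector instead of printing are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Choose m distinct values from 1..n by shuffling only the first m positions.
// (Illustrative name and signature.)
std::vector<int> sample_partial_shuffle(int n, int m, std::mt19937& gen) {
    assert(m >= 0 && m <= n);
    std::vector<int> x(n);
    std::iota(x.begin(), x.end(), 1);                       // x = 1, 2, ..., n
    for (int i = 0; i < m; ++i) {
        std::uniform_int_distribution<int> pick(i, n - 1);  // j uniform in [i, n-1]
        std::swap(x[i], x[pick(gen)]);
    }
    x.resize(m);                                            // keep the m chosen values
    std::sort(x.begin(), x.end());                          // sorted, as in sample_shuf
    return x;
}
```

A caller would create one generator, for example `std::mt19937 gen(std::random_device{}());`, and pass it to every call so that successive samples are not correlated by reseeding.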

Several questions about sampling:

1. Given a random number generator that generates numbers in the range (1, 5) uniformly, how can you use it to build a random number generator that generates numbers in the range (1, 7) uniformly?

Answer: use rejection sampling.

  First, call the (1, 5) generator twice and combine the two results as a two-digit base-5 number to form a (1, 25) generator: ([Gen][gen]) in base 5, where each [] is one base-5 digit. In decimal this is x = Gen * 5 + gen. The value ranges from 6 to 30, and a simple shift (subtracting 5) converts it to the range 1 to 25. Then distribute the values 1-21 evenly over the 7 outputs: 21 is a multiple of 7, so every three values map to one output (you could also keep only 1-7 and discard everything else, but then the accepted range is too small and the efficiency is poor): 1-3 -> 1, 4-6 -> 2, ..., 19-21 -> 7. These outputs are equally probable. If a number in 22-25 is generated, there are two ways to proceed:

  (1) Rejection sampling: discard it and draw again.

  (2) If a number in 22-25 is obtained, directly reuse the result this generator produced last time. Someone has proved that this method also gives equal probability; it is the idea behind the Metropolis algorithm.

Base-5 notation    Corresponding decimal    After subtracting 5

11 12 13 14 15      6  7  8  9 10            1  2  3  4  5
21 22 23 24 25     11 12 13 14 15            6  7  8  9 10
31 32 33 34 35     16 17 18 19 20           11 12 13 14 15
41 42 43 44 45     21 22 23 24 25           16 17 18 19 20
51 52 53 54 55     26 27 28 29 30           21 22 23 24 25
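Option (1), rejection sampling, can be sketched as follows. Here `rand5` is a hypothetical stand-in for the given generator (the text only says that such a generator exists), built on `std::mt19937` for illustration; `rand7` follows the mapping in the table above.

```cpp
#include <cassert>
#include <random>

// Stand-in for the given generator: uniform on 1..5.
// (Hypothetical helper, assumed for illustration.)
int rand5(std::mt19937& gen) {
    std::uniform_int_distribution<int> d(1, 5);
    int r = d(gen);
    assert(r >= 1 && r <= 5);
    return r;
}

// Uniform on 1..7, built from two rand5 calls plus rejection.
int rand7(std::mt19937& gen) {
    for (;;) {
        int x = rand5(gen) * 5 + rand5(gen) - 5;  // uniform on 1..25
        if (x <= 21)
            return (x - 1) / 3 + 1;               // 1-3 -> 1, 4-6 -> 2, ..., 19-21 -> 7
        // x in 22..25: reject and draw again
    }
}
```

Each round succeeds with probability 21/25, so on average about 25/21 rounds (roughly 2.4 calls to `rand5`) are needed per output.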

2. Generate a random permutation for a deck of cards

Answer:

  Working from back to front, at step k generate a random number j in the range 1 to k, then swap the cards at positions j and k. It is easy to see that the final arrangement is equally likely to be any permutation.

For k = N down to 1

  j = rand(1, k)

  Swap the cards at positions j and k

End

  You can also run this process from front to back, but then the random index is drawn from the range k to N.

For k = 1 to N

  j = rand(k, N)

  Swap the cards at positions j and k

End
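The back-to-front version can be sketched in C++ as follows. This is a minimal illustration under stated assumptions: `make_deck` and `shuffle_deck` are made-up names, cards are labelled 1..n, and indices are 0-based, so k runs from n-1 down to 1 and j is drawn from 0..k.

```cpp
#include <cassert>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// A deck of n cards labelled 1..n (the labels are an assumption for illustration).
std::vector<int> make_deck(int n) {
    assert(n >= 0);
    std::vector<int> deck(n);
    std::iota(deck.begin(), deck.end(), 1);
    return deck;
}

// Back-to-front shuffle: at step k, swap position k with a random position in [0, k].
void shuffle_deck(std::vector<int>& deck, std::mt19937& gen) {
    for (int k = static_cast<int>(deck.size()) - 1; k >= 1; --k) {
        std::uniform_int_distribution<int> pick(0, k);
        std::swap(deck[k], deck[pick(gen)]);
    }
}
```

Note that j may equal k, in which case the card stays where it is; excluding that case would make some permutations unreachable.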
