Suppose there is a very large input stream, too large to hold in memory, and it can be read only once. How can we randomly pick n records from this stream? What method gives n records, each chosen with equal probability, without storing the whole stream?
The idea is this: let m be the number of records read so far. If m <= n, just keep the record. For m > n, generate a random number r = rand(m) in [0, m), and replace A[r] with the new record if r falls in [0, n).
The procedure is as follows:
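#include <cstdlib>
#include <iostream>
using namespace std;

// Reads integers from standard input and keeps a random sample of n of them
// in output, which must have room for n values.
void sample(int n, int* output)
{
    int m = 0;  // number of records read so far
    int val;
    // The first n records are always kept.
    while (m < n && cin >> val)
        output[m++] = val;
    // The m-th record (m > n) is kept with probability n/m and then
    // overwrites a uniformly chosen slot. This is equivalent to drawing one
    // r in [0, m) and replacing output[r] when r < n.
    // Note: rand() % m is only approximately uniform and cannot exceed
    // RAND_MAX, so very long streams need a better generator.
    while (cin >> val)
    {
        m++;
        if (rand() % m < n)
            output[rand() % n] = val;
    }
}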
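The same scheme can be sketched with the `<random>` header, which avoids the range limit of `rand()`. This is only a minimal sketch under some assumptions: the names `reservoir_sample` and `sample_from_text` are hypothetical and not part of the original program, the records are integers, and the caller supplies the stream and the random engine. It draws a single r in [0, m) and replaces slot r when r < n, as described above.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <random>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not from the original program): keeps a uniform random
// sample of up to n integers read from `in`, using a caller-supplied engine.
std::vector<int> reservoir_sample(std::istream& in, std::size_t n, std::mt19937& rng)
{
    assert(n > 0);
    std::vector<int> reservoir;
    reservoir.reserve(n);
    std::size_t m = 0;  // number of records read so far
    int val;
    while (in >> val) {
        ++m;
        if (reservoir.size() < n) {
            reservoir.push_back(val);  // the first n records are always kept
        } else {
            // Draw r uniformly from [0, m). The new record replaces slot r
            // only when r lands in [0, n), which happens with probability n/m.
            std::uniform_int_distribution<std::size_t> dist(0, m - 1);
            std::size_t r = dist(rng);
            if (r < n) reservoir[r] = val;
        }
    }
    return reservoir;
}

// Hypothetical usage example on an in-memory stream; std::cin works the same way.
std::vector<int> sample_from_text(const std::string& text, std::size_t n, unsigned seed)
{
    std::istringstream in(text);
    std::mt19937 rng(seed);
    return reservoir_sample(in, n, rng);
}
```

If the stream holds fewer than n records, the sketch simply returns all of them in their original order.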
The correctness of this method can be proved as follows:
1. Since a total of n records are kept, the probability of picking any particular one of them is 1/n (set this aside for now; it is used later).
2. We first find the probability that a record ends up among the n kept records. For the sampling to be uniform, every one of the s records (s being the total number of records) must be equally likely to be obtained, i.e. with probability 1/s.
For the first of the first n records to end up in the sample, it must not be replaced by any later record.
First find the probability that it is not replaced by record n+1.
Record k (k > n) is taken with probability n/k, so for record n+1:
probability that it is taken: n/(n+1)
probability that it is not taken: 1/(n+1)
probability that it is taken and replaces the first record: (n/(n+1)) * (1/n)
Therefore the probability that the first record is not replaced by record n+1 is 1 - (n/(n+1)) * (1/n),
which simplifies to n/(n+1).
That handles record n+1; the later records are treated the same way. The probability that the first record is not replaced by:
record n+2: (n+1)/(n+2)
record n+3: (n+2)/(n+3)
...
record s-1: (s-2)/(s-1)
record s: (s-1)/s
The probability that the first record is never replaced is the product of these probabilities, namely:
(n/(n+1)) * ((n+1)/(n+2)) * ((n+2)/(n+3)) * ... * ((s-2)/(s-1)) * ((s-1)/s)
Here the numerator of each factor cancels the denominator of the factor before it, so the product simplifies to n/s.
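The cancellation can be checked numerically. The following is a small sketch, with a hypothetical helper name `survival_product`, that multiplies the per-record factors k/(k+1) for k = n, ..., s-1 so the result can be compared with n/s.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: multiplies the probabilities k/(k+1) that the first
// record survives the arrival of record k+1, for k = n, n+1, ..., s-1.
double survival_product(int n, int s)
{
    assert(n > 0 && s >= n);
    double p = 1.0;
    for (int k = n; k < s; ++k)
        p *= static_cast<double>(k) / (k + 1);
    return p;
}
```

For instance, `survival_product(3, 10)` should agree with 3/10 up to floating-point rounding.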
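3. Now use the probability from step 1. The first record is one of the first n records, and the final result is n random records, so the probability from step 2 is multiplied by the 1/n from step 1: (n/s) * (1/n) = 1/s, which is the result we expected. The same holds for a record arriving at position k > n: it is taken with probability n/k and then survives the remaining records with probability (k/(k+1)) * ... * ((s-1)/s) = k/s, so it is also kept with probability n/s. Since every record is obtained with probability 1/s, the sampling is equiprobable.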
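The result can also be checked empirically. The sketch below is an illustrative experiment, not part of the original program: the function name `inclusion_frequencies`, the use of `std::mt19937`, and the choice of records 0..s-1 are all assumptions. It repeats the sampling many times and reports how often each record ends up among the n kept records, which should be close to n/s for every record.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical experiment: runs the sampling scheme `trials` times over the
// records 0..s-1 and returns, for each record, the fraction of runs in which
// it ended up among the n kept records.
std::vector<double> inclusion_frequencies(std::size_t n, std::size_t s,
                                          int trials, unsigned seed)
{
    assert(n > 0 && s >= n && trials > 0);
    std::mt19937 rng(seed);
    std::vector<int> hits(s, 0);
    std::vector<std::size_t> reservoir(n);
    for (int t = 0; t < trials; ++t) {
        for (std::size_t i = 0; i < s; ++i) {
            if (i < n) {
                reservoir[i] = i;  // the first n records are always kept
                continue;
            }
            // Record i is the (i + 1)-th record seen, so r is drawn from [0, i].
            std::uniform_int_distribution<std::size_t> dist(0, i);
            std::size_t r = dist(rng);
            if (r < n) reservoir[r] = i;
        }
        for (std::size_t kept : reservoir) ++hits[kept];
    }
    std::vector<double> freq(s);
    for (std::size_t i = 0; i < s; ++i)
        freq[i] = static_cast<double>(hits[i]) / trials;
    return freq;
}
```

With n = 3 and s = 10, each returned frequency should come out near 0.3, and the frequencies should sum to n.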