Problem definition:
You are given a list of n elements. n is large, and you do not know its value in advance. Your task is to select k elements from these n elements at random. You may traverse the list only once. Your algorithm must return exactly k elements, and the selection must be uniformly random (every element has the same probability of being selected).
Reservoir sampling algorithm:
The algorithm selects k distinct elements from a sequence of n elements so that each element is selected with probability k/n. The procedure is:
First, build a reservoir of size k and place the first k elements of the sequence in it.
Then, starting with element k+1, the i-th element is accepted with probability k/i; if accepted, it replaces a uniformly chosen element of the reservoir. After all elements have been traversed, the reservoir holds k randomly selected elements. The time complexity is O(n).
Its pseudo-code is as follows:
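Init: a reservoir of size k, filled with the first k elements
For i = k+1 to n
    M = random(1, i);    // uniform integer in [1, i], both ends inclusive
    if (M <= k)
        replace the M-th value in the reservoir with the i-th value
End for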
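The pseudo-code above can be sketched in Python roughly as follows. This is only an illustrative version, not a reference implementation: the function name reservoir_sample and the optional rng parameter are assumptions introduced here, and the input is assumed to be any iterable that can be read once.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int,
                     rng: Optional[random.Random] = None) -> List[T]:
    """Return k items chosen uniformly at random from stream in one pass.

    Hypothetical helper: if the stream holds fewer than k items,
    all of them are returned.
    """
    rng = rng or random.Random()
    reservoir: List[T] = []
    for i, item in enumerate(stream, start=1):  # i is the 1-based position
        if i <= k:
            # The first k items fill the reservoir directly.
            reservoir.append(item)
        else:
            # randint is inclusive on both ends, so m <= k has probability k/i.
            m = rng.randint(1, i)
            if m <= k:
                reservoir[m - 1] = item  # overwrite the m-th slot
    return reservoir


sample = reservoir_sample(iter(range(1, 1001)), 5)
print(sample)
```

The stream is consumed through an iterator, so its length never needs to be known, which matches the single-pass requirement in the problem definition.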
Proof that each number is selected with probability k/n:
For the i-th number (i <= k), the probability of being placed in the reservoir during the first k steps is 1. At step k+1 it is replaced only with probability k/(k+1) * 1/k = 1/(k+1), so it stays with probability k/(k+1). In general, at step m it stays with probability (m-1)/m. After the n-th number has been read, the probability that the i-th number (i <= k) is selected = the probability of entering the reservoir * the probability of not being replaced at each later step, that is:
1 * k/(k+1) * (k+1)/(k+2) * ... * (n-1)/n = k/n
For the j-th number (j > k), the probability of being selected is: the probability of entering the reservoir when it appears * the probability of not being replaced afterwards, namely:
k/j * j/(j+1) * ... * (n-1)/n = k/n
This completes the proof.
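The two telescoping products can also be checked with exact arithmetic. The sketch below is only a sanity check under the assumptions of the proof; the function name inclusion_probability and the values k = 3, n = 10 are chosen here purely for illustration.

```python
from fractions import Fraction


def inclusion_probability(j: int, k: int, n: int) -> Fraction:
    """Exact probability that the j-th element (1-based) ends up in the reservoir."""
    # Probability of entering the reservoir: 1 for the first k elements,
    # k/j for every later element.
    p = Fraction(1) if j <= k else Fraction(k, j)
    # At each later step m the element survives with probability
    # 1 - (k/m) * (1/k) = (m-1)/m.
    first_later_step = max(j, k) + 1
    for m in range(first_later_step, n + 1):
        p *= Fraction(m - 1, m)
    return p


k, n = 3, 10
probabilities = [inclusion_probability(j, k, n) for j in range(1, n + 1)]
print(all(p == Fraction(k, n) for p in probabilities))  # prints True
```

Using Fraction instead of floating point keeps every factor exact, so the comparison with k/n is an equality rather than an approximation.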