Given a data stream of length n, select k items from it uniformly at random. n is very large (possibly larger than your memory and disk capacity) and not known in advance, and the stream can be traversed only once. How can we obtain k truly random items?
The method is:
1. Define an array of length k and store the first k items of the stream in it.
2. Keep reading the stream. When the i-th item arrives (k < i <= n), generate a random integer r from 1 to i. If r <= k, replace the item at position r of the array with the i-th item.
After completing these two steps, the array holds k items drawn at random from a data stream of length n.
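The two steps above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function name reservoir_sample and the rng parameter are assumptions made for this example, and the random integer is drawn with the standard library's randint, which includes both endpoints.

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Return k items chosen uniformly at random from an iterable of unknown length."""
    reservoir = []
    for i, item in enumerate(stream, start=1):  # i = number of items seen so far
        if i <= k:
            # Step 1: the first k items fill the array.
            reservoir.append(item)
        else:
            # Step 2: draw r uniformly from 1..i and replace slot r if r <= k.
            r = rng.randint(1, i)
            if r <= k:
                reservoir[r - 1] = item
    return reservoir

# The stream is consumed once and its length is never asked for.
sample = reservoir_sample(iter(range(1, 1001)), k=5)
print(sample)
```

If the stream holds fewer than k items, this sketch simply returns all of them. Memory use stays at k items no matter how long the stream is.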
The next step is to show that, for a stream of n items, each item ends up in the array with probability k/n.
Proof:
We show by mathematical induction on i (k < i <= n) that after i items have been read, each of them is in the array with probability k/i.
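1. When i = k+1, the (k+1)-th item is put into the array with probability k/(k+1), and each of the first k items is displaced with probability (k/(k+1)) * (1/k) = 1/(k+1), so it stays with probability k/(k+1). Every one of the first k+1 items is therefore in the array with probability k/(k+1).
2. Assume that after i items have been read, every item is in the array with probability k/i.
3. Prove that after i+1 items have been read, every item is in the array with probability k/(i+1).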
For the (i+1)-th item, the probability that it is put into the array is clearly k/(i+1), since the random number drawn from 1 to i+1 must be at most k.
For any one of the previous i items, the probability that it is in the array is k/i (by step 2), and the probability that it is still in the array after the (i+1)-th item has been processed is "the probability that it is in the array and is not displaced by the (i+1)-th item".
"The probability of being displaced by the (i+1)-th item" is (k/(i+1)) * (1/k) = 1/(i+1): the new item must enter the array, and it must land on this particular slot.
"The probability of not being displaced by the (i+1)-th item" is 1 - 1/(i+1) = i/(i+1).
"The probability that it is in the array and is not displaced by the (i+1)-th item" is (k/i) * (i/(i+1)) = k/(i+1).
So after i+1 items have been read, each item is in the array with probability k/(i+1).
This completes the proof. Taking i = n gives the claimed probability k/n.
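As a sanity check on the result, the probabilities can be computed exactly for a small stream by enumerating every possible sequence of random draws. The sketch below is only an illustration under assumed names (inclusion_probabilities is a helper invented for this example). It replays the two steps of the method for each sequence of draws and counts how often each item survives.

```python
from fractions import Fraction
from itertools import product

def inclusion_probabilities(n, k):
    """Exact probability that each of items 1..n ends up in the array."""
    # One outcome = one value of r for every step i = k+1..n; all outcomes are equally likely.
    ranges = [range(1, i + 1) for i in range(k + 1, n + 1)]
    counts = [0] * n
    total = 0
    for draws in product(*ranges):
        reservoir = list(range(1, k + 1))         # step 1: the first k items
        for i, r in zip(range(k + 1, n + 1), draws):
            if r <= k:                            # step 2: replace slot r
                reservoir[r - 1] = i
        for item in reservoir:
            counts[item - 1] += 1
        total += 1
    return [Fraction(c, total) for c in counts]

probs = inclusion_probabilities(n=6, k=2)
print(probs)
```

For n = 6 and k = 2 there are 3 * 4 * 5 * 6 = 360 equally likely outcomes, and every entry of probs equals k/n = 1/3, in agreement with the proof. The enumeration grows factorially, so it is only practical for very small n.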
Introduction to Reservoir algorithm