Brief Introduction
Function: reservoir sampling is a random sampling algorithm. Even for a very large data set, it guarantees that every element has the same chance of being selected.
Characteristics: its time complexity is only O(n), a single pass over the data, and it saves a great deal of memory because only the k sampled items are kept.
Introducing the Problem
Many large companies have used this algorithm in interview questions. Take Google as an example; one of its questions is a reservoir sampling problem:
I have a linked list of length n, where n is very large and I do not know its exact value. How can I write an algorithm, as efficient as possible, that returns k completely random elements from it?
There are two constraints on this problem:
1. Be efficient, that is, use as little memory as possible.
2. Return values that are as random as possible.
If we drop constraint 1, the problem is simple: load all the data into memory, compute the length of the list, and then use a random function to pick k random positions.
This is not very efficient, though: everything has to be loaded into memory, and if the data set is very large the computation may not be possible at all.
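For comparison, the naive approach might look like the following minimal sketch. The class name `NaiveSample` and the method `sample` are made up for illustration, and a shuffle is used here as one simple way to pick k distinct random positions. It assumes the whole data set fits in a `List`, which is exactly the assumption that fails for very large inputs.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper for illustration: requires the whole data set in memory.
final class NaiveSample {
    // Returns k distinct elements chosen uniformly at random from data.
    static <T> List<T> sample(List<T> data, int k, Random random) {
        if (k < 0 || k > data.size()) {
            throw new IllegalArgumentException("k must be between 0 and data.size()");
        }
        List<T> copy = new ArrayList<>(data); // O(n) extra memory
        Collections.shuffle(copy, random);    // uniform random permutation
        return new ArrayList<>(copy.subList(0, k));
    }
}
```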
Note that the question contains a small hint: the linked list. A linked list is a chained storage structure formed by data nodes linked end to end.
Since it is a linked list, it can be processed one node at a time, without loading all the data into memory. Processing node by node is still rather abstract, so let us restate the problem in another form:
We have a 1 TB text file on the hard drive and want to draw a few lines from it at random, using as little memory as possible while keeping the choice completely random.
Loading everything into memory, as considered before, is not suitable, but other methods come to mind. For example, read one line at a time into memory, increment a counter, and discard the line; when the end is reached, the counter holds the total number of lines, and k random line numbers can be computed from it. How do we then retrieve those lines? Traverse the file again, keeping a line counter, and check whether the current line number is one of the k random numbers; if it is, keep the contents of that line.
Two traversals like this get the job done, but traversing 1 TB of data twice is very time-consuming.
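The two-pass idea can be sketched roughly as follows. The class `TwoPassSample` is hypothetical, and the source is modeled as an `Iterable` that can be read twice (a stand-in for reopening the file); the sketch also assumes the row count fits in an `int` and that k is small relative to it.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

// Hypothetical helper for illustration: two full passes over a re-readable source.
final class TwoPassSample {
    static <T> List<T> sample(Iterable<T> source, int k, Random random) {
        // Pass 1: count the rows without keeping any of them.
        int count = 0;
        for (T ignored : source) {
            count++;
        }
        if (k < 0 || k > count) {
            throw new IllegalArgumentException("k must be between 0 and the row count");
        }
        // Pick k distinct row numbers in [0, count).
        Set<Integer> chosen = new HashSet<>();
        while (chosen.size() < k) {
            chosen.add(random.nextInt(count));
        }
        // Pass 2: keep only the rows whose number was chosen.
        List<T> result = new ArrayList<>(k);
        int row = 0;
        for (T item : source) {
            if (chosen.contains(row)) {
                result.add(item);
            }
            row++;
        }
        return result;
    }
}
```

Memory use stays small (only the k chosen row numbers and the k kept rows), but every row is read twice, which is the cost the text points out.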
There is a better solution: the reservoir sampling algorithm.
A Concrete Example of the Reservoir Sampling Algorithm
We first look at how the reservoir sampling algorithm works in a concrete case, and then describe it in the abstract.
Suppose we have 10,000 numbers and want to draw 10 of them at random.
The sample set, an array of 10,000 numbers, is denoted S.
The result, an array of 10 random numbers, is denoted R.
The first 10 numbers of S are copied into R.
The first iteration of the algorithm goes like this:
- Start iterating from the 11th number (index 10) and generate a random integer j from 0 to 10. If j < 10 (say j = 4), replace the 5th item of R, R[4], with the 11th item of S, S[10].
After all iterations are complete, the array R is the random sample we wanted.
Abstract Concepts
\(s[N]\): the sample set
\(r[k]\): the result set
\(N\): the size of the array \(s\)
\(j\): the random number generated in each iteration
\(k\): the number of items to draw at random
\(i\): the iteration count
Steps
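- Copy the first \(k\) elements of \(s\) into \(r\).
- In the \(i\)-th iteration (starting from \(i = 1\)), generate a random integer \(j\) in the range \(0\) to \(k+i-1\), inclusive. The \(-1\) appears because array indices start at 0.
- If \(j < k\), replace \(r[j]\) with the element currently being visited: \(r[j] = s[k+i-1]\).
- When the traversal ends, the array \(r\) holds the result.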
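The steps above can be sketched as a method that reads from an `Iterator`, so the total length never has to be known in advance, as in the linked-list question. The class name `Reservoir` and its signature are assumptions made for this illustration, and the sketch assumes fewer than `Integer.MAX_VALUE` items.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

// Hypothetical helper for illustration: one pass, O(k) memory.
final class Reservoir {
    // Returns k items from the iterator (or all of them if there are fewer than k).
    static <T> List<T> sample(Iterator<T> items, int k, Random random) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        List<T> r = new ArrayList<>(k);
        int seen = 0; // number of items read so far
        while (items.hasNext()) {
            T item = items.next();
            seen++;
            if (r.size() < k) {
                // The first k items fill the reservoir directly.
                r.add(item);
            } else {
                // j is uniform over 0..seen-1, where seen-1 is the 0-based index of the current item.
                int j = random.nextInt(seen);
                if (j < k) {
                    r.set(j, item);
                }
            }
        }
        return r;
    }
}
```

A call such as `Reservoir.sample(list.iterator(), 10, new Random())` would then draw 10 items from any `java.util.List` or other iterable source.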
Algorithm Implementation (Java)
import java.util.Random;

public class ReservoirSampling {
    public static void main(String[] args) {
        int[] S = new int[10000];
        int N = S.length;
        Random random = new Random();

        // Generate an array of 10,000 numbers
        for (int r = 0; r < N; r++) {
            S[r] = random.nextInt(10000);
        }

        int K = 10;
        int[] R = new int[K];

        // Fill R with the first K numbers of S
        for (int f = 0; f < K; f++) {
            R[f] = S[f];
        }

        // Traverse the rest of S and replace elements of R according to the algorithm
        int j;
        for (int i = K; i < S.length; i++) {
            j = random.nextInt(i + 1); // uniform over 0..i, that is, i + 1 possible values
            if (j < K) {
                R[j] = S[i];
            }
        }

        // Print the resulting array R
        for (int i = 0; i < R.length; i++) {
            System.out.println(R[i]);
        }
    }
}
To summarize: the algorithm obtains k random numbers in a single pass through the loop. It is very efficient on large data sets, which makes it well suited to our scenario.
But why is the resulting sample completely random?
Take the concrete example again. In the first iteration, i = 10, the random number ranges from 0 to 10, 11 values in total, so the probability that a replacement happens is \(10/11\). In the second iteration it becomes \(10/12\), in the third \(10/13\), in the fourth \(10/14\), and so on.
So it seems the probability is different each time, but that is not the right thing to look at. What matters is the probability of finally ending up in the array \(r\). The 11th number has a relatively high probability of entering \(r\), but it also has a high probability of being replaced later. So what is the probability that each number eventually remains in \(r\)?
The proof on Wikipedia explains this clearly, so I follow it here.
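Suppose the file has \(N\) rows, and consider the \(n\)-th row (note that \(n\) here is the ordinal, not the total). The probability that the \(n\)-th row is put into the reservoir when the loop reaches it is \(k/n\) for \(n > k\), denoted \(p_n\).
Once a row is in the reservoir, it survives the step for a later row \(m\) with probability \(1 - p_m/k = (m-1)/m\), so the probability of finally being extracted is:
\[ p_n \times \left(1 - \frac{p_{n+1}}{k}\right) \times \left(1 - \frac{p_{n+2}}{k}\right) \times \cdots \times \left(1 - \frac{p_N}{k}\right) = \frac{k}{n} \times \frac{n}{n+1} \times \frac{n+1}{n+2} \times \cdots \times \frac{N-1}{N} = \frac{k}{N} \]
The first \(k\) rows are placed directly, so for them \(p_n = 1\), and the same product taken from row \(k+1\) onward also gives \(k/N\).
So the probability that each row is extracted is the same, equal to \(k/N\).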
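The claim that every row is kept with the same probability can also be checked empirically. The sketch below is only an illustration: the class `ReservoirCheck`, the sizes (20 elements, 5 kept), and the trial count are arbitrary choices, and it assumes the number of elements is at least the reservoir size.

```java
import java.util.Random;

// Hypothetical experiment for illustration: how often does each position survive?
final class ReservoirCheck {
    // Samples k of the indices 0..n-1, `trials` times, and counts how often each index is kept.
    static int[] countSelections(int n, int k, int trials, Random random) {
        int[] hits = new int[n];
        int[] r = new int[k];
        for (int t = 0; t < trials; t++) {
            for (int i = 0; i < k; i++) {
                r[i] = i; // fill the reservoir with the first k indices
            }
            for (int i = k; i < n; i++) {
                int j = random.nextInt(i + 1); // uniform over 0..i
                if (j < k) {
                    r[j] = i;
                }
            }
            for (int kept : r) {
                hits[kept]++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        int n = 20, k = 5, trials = 200_000;
        int[] hits = countSelections(n, k, trials, new Random());
        for (int i = 0; i < n; i++) {
            System.out.printf("element %2d kept with frequency %.4f (k/n = %.4f)%n",
                    i, (double) hits[i] / trials, (double) k / n);
        }
    }
}
```

With enough trials, the observed frequency for every element, early or late, should settle close to k/n.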
This is very ingenious. Whenever we face a scenario like this, we can consider using reservoir sampling for random extraction.