This problem comes from Problem 10 of Column 12 in Programming Pearls, which reads as follows:
How could you select one of n objects at random, where you see the objects sequentially and do not know the value of n beforehand? For concreteness, how would you read a text file, and select and print one random line, when you don't know the number of lines in advance?
The problem can be restated more simply: how do you pick one line at random from a file without knowing the total number of lines?
The first question to ask is whether we have solved a similar problem before. If the number of lines in the file is known, the task is easy: generate a random line number with the rand function of the C runtime library. Here, however, the number of lines is not known, so what can we do? We need an approach that still guarantees every line is chosen with equal probability, that is, uniformly at random. That approach is reservoir sampling.
With this idea in hand, a solution follows. To select a single line, call the currently selected line choice. Take the first line as choice outright. On reading the second line, replace choice with it with probability 1/2; on reading the third line, replace choice with it with probability 1/3; and so on. In pseudocode:
i = 0
while more input lines
    with probability 1.0/++i
        choice = this input line
print choice
The trick is that this construction lets us prove that, once n lines have been scanned, each of them is the current choice with probability 1/n. In other words, every line is equally likely to be selected, which is exactly a uniform random selection.
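The pseudocode above might be sketched in Python roughly as follows. This is only an illustration: the function name pick_random_line and the optional rng parameter are assumptions introduced here, not part of the original problem statement.

```python
import random


def pick_random_line(lines, rng=random):
    """Return one item chosen uniformly at random from an iterable
    whose length is not known in advance (None if it is empty)."""
    choice = None
    for i, line in enumerate(lines, start=1):
        # Replace the current choice with the i-th line with probability 1/i.
        # For i == 1 the test always succeeds, so the first line is always taken.
        if rng.random() < 1.0 / i:
            choice = line
    return choice
```

Since an open text file is itself an iterable of lines, passing a file object to pick_random_line would read the file once and keep only one line in memory at a time.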
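For a line to end up as the final choice, it must be selected when it is read and then never be replaced by a later line.
For the single-line case, the proof that every line is equally likely is as follows. Line i is selected when it is read with probability 1/i, and it survives the reading of each later line j (j = i+1, ..., n) with probability (j-1)/j. The probability that line i is the final choice is therefore (1/i) * (i/(i+1)) * ((i+1)/(i+2)) * ... * ((n-1)/n) = 1/n, which is the same for every i.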
Looking back, the problem can be generalized: how do you randomly draw k items from a sample space whose size is unknown or very large?
The answer follows by analogy. First put the first k items into the reservoir. On reaching item k+1, decide with probability k/(k+1) whether to move it into the reservoir; if so, it replaces a reservoir item chosen uniformly at random. In general, item i (i > k) enters the reservoir with probability k/i. Continuing this way, for a sample space of any size n, each item is selected with probability k/n. In other words, every item is equally likely to be selected.
Pseudocode:
Init: a reservoir of size k, filled with the first k values
for i = k+1 to N
    M = random(1, i)    // uniform integer in [1, i], both ends inclusive
    if (M <= k)
        swap the Mth value and the ith value
end for
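As a rough sketch, the k-item procedure above could be written in Python like this. The function name reservoir_sample and the optional rng parameter are assumptions made for illustration. Instead of swapping, the sketch simply overwrites the chosen reservoir slot, since the evicted value is never needed again.

```python
import random


def reservoir_sample(items, k, rng=random):
    """Return k items chosen uniformly at random from an iterable of
    unknown length (all of the items if there are fewer than k)."""
    reservoir = []
    for i, item in enumerate(items, start=1):
        if i <= k:
            # The first k items fill the reservoir directly.
            reservoir.append(item)
        else:
            # randint is inclusive at both ends, so m <= k holds with probability k/i.
            m = rng.randint(1, i)
            if m <= k:
                # m is uniform over 1..k here, so a random slot is replaced.
                reservoir[m - 1] = item
    return reservoir
```

For example, reservoir_sample(range(1000), 5) would return a list of five distinct numbers from 0 to 999, and only those five are held in memory at any time.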
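The proof is as follows. Each of the first k items starts in the reservoir, and item i (i > k) enters it with probability k/i. When a later item j is read, it enters with probability k/j and evicts any particular reservoir slot with probability 1/k, so an item already in the reservoir is evicted with probability 1/j and survives with probability (j-1)/j. For i > k, the probability of being in the final reservoir is (k/i) * (i/(i+1)) * ... * ((n-1)/n) = k/n; for i <= k, it is (k/(k+1)) * ((k+1)/(k+2)) * ... * ((n-1)/n) = k/n.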
Reservoir sampling solves a whole class of problems, so I have summarized it here. The method is truly ingenious, but I still have not found where the idea comes from; knowing why and how someone would arrive at this solution would make it even more meaningful.