This problem comes from Question 10 in Column 12 of Programming Pearls, which reads as follows:
How could you select one of N objects at random, where you see the objects sequentially but you do not know the value of N beforehand? For concreteness, how would you read a text file, and select and print one random line, when you don't know the number of lines in advance?
The problem can be restated more simply: how do you pick one line uniformly at random from a file when the total number of lines is unknown?
First, have we seen a similar problem before? Certainly: when the number of lines is known, we can use the rand function of the C Runtime Library to generate a random line number and retrieve that line. Here, however, the number of lines is unknown. How can the problem be solved? We need an idea that guarantees every line is chosen with equal probability, that is, uniformly at random. That idea is reservoir sampling.
With this idea, a solution follows. Call the selected line choice. Start by taking the first line as choice; then replace choice with the second line with probability 1/2, with the third line with probability 1/3, and so on. The code can be described as follows:
i = 0
while more input lines
    with probability 1.0/++i
        choice = this input line
print choice
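The pseudocode above can be sketched in Python roughly as follows. This is only an illustration: the function name pick_one and the optional rng parameter are assumptions made for this example, not part of the original column.

```python
import random
from typing import Iterable, Optional, TypeVar

T = TypeVar("T")


def pick_one(items: Iterable[T], rng: Optional[random.Random] = None) -> Optional[T]:
    """Return one item chosen uniformly at random from a stream of unknown length.

    Returns None if the stream is empty. `rng` is a hypothetical hook that lets
    the caller pass a seeded generator for reproducible runs.
    """
    if rng is None:
        rng = random.Random()
    choice = None
    for i, item in enumerate(items, start=1):
        # Keep the i-th item with probability 1/i. For i == 1 the test always
        # succeeds, because rng.random() lies in [0.0, 1.0).
        if rng.random() < 1.0 / i:
            choice = item
    return choice
```

Since a Python file object yields its lines one at a time, passing an open text file to pick_one would select a random line in a single pass without counting the lines first.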
This method is ingenious because it guarantees by construction that each line is selected with probability 1/N (where N is the number of lines scanned so far); in other words, every line is equally likely, so the selection is uniformly random.
The proof is as follows:
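One standard argument, sketched here in LaTeX notation: line i ends up as the final choice exactly when it is picked at step i (probability 1/i) and no later line j replaces it (probability 1 - 1/j at each step). The product telescopes:

```latex
\begin{aligned}
P(\text{line } i \text{ is the final choice})
  &= \frac{1}{i} \prod_{j=i+1}^{N} \left(1 - \frac{1}{j}\right) \\
  &= \frac{1}{i} \cdot \frac{i}{i+1} \cdot \frac{i+1}{i+2} \cdots \frac{N-1}{N} \\
  &= \frac{1}{N}.
\end{aligned}
```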
Looking back at this problem, we can generalize it: how do we randomly select k items from a sample space whose size is unknown or very large?
By analogy, we get the answer: put the first k items into the reservoir. For the (k+1)-th item, decide with probability k/(k+1) whether to move it into the reservoir; if so, it replaces a reservoir item chosen uniformly at random. Continue in the same way, so that the i-th item enters the reservoir with probability k/i. For a sample space of any size n, each item is then selected with probability k/n; that is, every item is equally likely to be selected.
Pseudocode:
Init: a reservoir of size k, filled with the first k items
for i = k + 1 to n
    m = random(1, i)    // uniform integer in [1, i], both ends inclusive
    if (m <= k)
        replace the m-th item in the reservoir with the i-th item
end
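A minimal Python sketch of the same procedure, assuming the input arrives as an arbitrary iterable. The name reservoir_sample and the optional rng parameter are hypothetical and chosen only for this example.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(
    items: Iterable[T], k: int, rng: Optional[random.Random] = None
) -> List[T]:
    """Return k items sampled uniformly from a stream of unknown length.

    If the stream holds fewer than k items, all of them are returned.
    `rng` is a hypothetical hook for passing a seeded generator.
    """
    if rng is None:
        rng = random.Random()
    reservoir: List[T] = []
    for i, item in enumerate(items, start=1):
        if i <= k:
            # The first k items fill the reservoir directly.
            reservoir.append(item)
        else:
            # randint is inclusive on both ends, so m is uniform on [1, i]
            # and m <= k holds with probability k/i.
            m = rng.randint(1, i)
            if m <= k:
                reservoir[m - 1] = item
    return reservoir
```

Note that the comparison has to be m <= k when m is drawn from 1 to i inclusive; a strict m < k would accept each new item with probability (k-1)/i and would never replace the k-th slot.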
The proof is as follows:
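A sketch of the usual argument, in LaTeX notation. At step j > k, a given reservoir slot is overwritten only if the new item is accepted (probability k/j) and that slot is the one chosen (probability 1/k), so an item already in the reservoir survives step j with probability 1 - 1/j. The two cases below cover items that start in the reservoir and items that arrive later:

```latex
\begin{aligned}
\text{For } i \le k:\quad
P(\text{item } i \text{ is in the final sample})
  &= \prod_{j=k+1}^{n} \left(1 - \frac{k}{j} \cdot \frac{1}{k}\right)
   = \prod_{j=k+1}^{n} \frac{j-1}{j}
   = \frac{k}{n}, \\[6pt]
\text{For } i > k:\quad
P(\text{item } i \text{ is in the final sample})
  &= \frac{k}{i} \prod_{j=i+1}^{n} \frac{j-1}{j}
   = \frac{k}{i} \cdot \frac{i}{n}
   = \frac{k}{n}.
\end{aligned}
```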
Reservoir sampling covers a whole class of problems, and this is only a summary. I genuinely admire how clever the method is, but I still feel I have not grasped where the idea comes from: it is worth knowing not only that it works, but also why it works and how one would arrive at this solution.