This problem comes from Problem 10 of Column 12 in Programming Pearls, which reads as follows:
How could you select one of N objects at random, where you see the objects sequentially but you do not know the value of N beforehand? For concreteness, how would you read a text file, and select and print one random line, when you don't know the number of lines in advance?
The problem can be restated more simply: how do you pick one line from a file uniformly at random without knowing the total number of lines in advance?
First, have we seen a similar problem before? Of course: when we know the number of lines in the file, we can simply use the rand function of the C runtime library to generate a random line number and then retrieve that line. Here, however, the number of lines is unknown. How can the problem be solved? We need an idea that lets us make the selection so that every line is chosen with equal probability, that is, uniformly at random. That idea is reservoir sampling.
With this idea, we have a solution. Call the currently selected line choice. When we read the first line, we set choice to it. When we read the second line, we replace choice with it with probability 1/2; when we read the third line, we replace choice with it with probability 1/3; and so on. In general, the i-th line replaces choice with probability 1/i.
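The procedure above can be sketched in a few lines of Python. This is only a minimal illustration, not code from the original column: the function name pick_random_line and the rng parameter are assumptions introduced here, and Python's random.randrange stands in for the C rand function mentioned earlier.

```python
import random

def pick_random_line(lines, rng=random):
    """Return one uniformly random item from an iterable of unknown length.

    `pick_random_line` and `rng` are hypothetical names used for illustration.
    Returns None if the iterable is empty.
    """
    choice = None
    for i, line in enumerate(lines, start=1):
        # randrange(i) is 0 with probability 1/i, so the i-th line
        # replaces the current choice with probability 1/i.
        # For i == 1 this always succeeds, so the first line is always taken.
        if rng.randrange(i) == 0:
            choice = line
    return choice
```

Because a Python file object yields its lines one at a time, passing an open file to this function reads the file once and keeps only a single line in memory at any moment.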
What makes this method ingenious is that, by construction, every line ends up as choice with probability 1/N (where N is the number of lines scanned so far). In other words, every line is equally likely to be selected, so the selection is uniformly random.
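The 1/N claim can be checked with a short calculation. The sketch below assumes the replacement decisions at different lines are independent: line i is the final choice exactly when it is picked at step i (probability 1/i) and then not replaced at any later step j (probability 1 - 1/j each).

```latex
P(\text{line } i \text{ is the final choice})
  = \frac{1}{i} \prod_{j=i+1}^{N} \left(1 - \frac{1}{j}\right)
  = \frac{1}{i} \prod_{j=i+1}^{N} \frac{j-1}{j}
  = \frac{1}{i} \cdot \frac{i}{N}
  = \frac{1}{N}
```

The product telescopes, so the result does not depend on i.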
Looking back at this problem, we can generalize it: how do we randomly select K items from a sample space whose size is unknown or very large?
By analogy, we get the answer: put the first K items into the reservoir. For item K + 1, we decide with probability K/(K + 1) whether to put it into the reservoir; if so, we pick an item in the reservoir uniformly at random and replace it. We continue in the same way, accepting the i-th item with probability K/i. For a sample space of any size N, each item ends up in the reservoir with probability K/N. That is, every item is equally likely to be selected.
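The K-item version can be sketched the same way in Python. Again this is a minimal illustration under stated assumptions: the names reservoir_sample, items, k and rng are introduced here, not taken from the source. Drawing a single index j uniformly from 0 to i - 1 and testing j < k combines the two random decisions into one: the test succeeds with probability k/i, and when it does, j is a uniformly random slot in the reservoir.

```python
import random

def reservoir_sample(items, k, rng=random):
    """Return up to k items chosen uniformly at random from an iterable.

    `reservoir_sample`, `items`, `k` and `rng` are hypothetical names used
    for illustration. If the iterable has fewer than k items, all of them
    are returned.
    """
    reservoir = []
    for i, item in enumerate(items, start=1):
        if i <= k:
            # The first k items fill the reservoir directly.
            reservoir.append(item)
        else:
            # j is uniform over 0..i-1, so j < k holds with probability k/i;
            # in that case j is also a uniformly random reservoir slot.
            j = rng.randrange(i)
            if j < k:
                reservoir[j] = item
    return reservoir
```

With k = 1 this reduces to the single-line selection described earlier, since the i-th item then replaces the only slot with probability 1/i.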
For more information, see [1].
Reference
[1] Reservoir sampling, http://www.cnblogs.com/HappyAngel/archive/2011/02/07/1949762.html