Reservoir sampling

Source: Internet
Author: User

The problem comes from Question 10 in Column 12 of Programming Pearls, which reads as follows:

How could you select one of N objects at random, where you see the objects sequentially but you do not know the value of N beforehand? For concreteness, how would you read a text file, and select and print one random line, when you don't know the number of lines in advance?

The problem can be restated more simply: how do you pick one line uniformly at random from a file without knowing the total number of lines in advance?

First, consider the easier version of the problem. If the number of lines in the file is known, we can use the rand function of the C runtime library to generate a random line number and fetch that line. Here, however, the number of lines is unknown. How can the problem be solved then? We need a technique that still gives every line the same probability of being chosen, that is, a uniformly random choice. That technique is reservoir sampling.

With this idea, a solution follows. Call the line currently selected choice. Take the first line as choice; on reading the second line, replace choice with it with probability 1/2; on reading the third line, replace choice with it with probability 1/3; and so on, so that the i-th line replaces choice with probability 1/i. In pseudocode:

    i = 0
    while more input lines
        with probability 1.0 / ++i
            choice = this input line
    print choice
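The pseudocode above can be turned into a short runnable sketch. The version below is only an illustration in Python: the function name pick_one and the optional rng parameter are assumptions made here, not part of the original text.

```python
import random
from typing import Iterable, Optional, TypeVar

T = TypeVar("T")


def pick_one(items: Iterable[T], rng: Optional[random.Random] = None) -> Optional[T]:
    """Return one item chosen uniformly at random from a stream of unknown length.

    Returns None if the stream is empty. `pick_one` and `rng` are
    hypothetical names used only for this sketch.
    """
    rng = rng or random.Random()
    choice = None
    for i, item in enumerate(items, start=1):
        # The i-th item replaces the current choice with probability 1/i:
        # randrange(i) is uniform on 0..i-1, so it equals 0 with probability 1/i.
        if rng.randrange(i) == 0:
            choice = item
    return choice
```

Because an open text file is itself an iterable of lines, passing a file object to pick_one would select one random line in a single pass, without counting the lines first.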

This method is ingenious because it is constructed so that, at every point, each line read so far is the current choice with probability 1/N (where N is the number of lines scanned so far). In other words, every line is equally likely to be chosen, so the selection is uniformly random.

The proof is as follows:
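One standard way to argue this is sketched below; it is a reconstruction under the scheme described above (line i replaces choice with probability 1/i), not necessarily the author's original derivation. Line i ends up as the final choice if it is selected when it is read and no later line j replaces it.

```latex
\begin{align*}
P(\text{line } i \text{ is the final choice})
  &= \frac{1}{i} \prod_{j=i+1}^{N} \left(1 - \frac{1}{j}\right)
   = \frac{1}{i} \prod_{j=i+1}^{N} \frac{j-1}{j} \\
  &= \frac{1}{i} \cdot \frac{i}{i+1} \cdot \frac{i+1}{i+2} \cdots \frac{N-1}{N}
   = \frac{1}{N}.
\end{align*}
```

The product telescopes, so the result does not depend on i: every one of the N lines has probability 1/N.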

Looking back at the problem, we can generalize it: how do you randomly pick k numbers from a sample space whose size is unknown or very large?

By analogy, the answer is: put the first k numbers into the reservoir. For the (k+1)-th number, decide with probability k/(k+1) whether it goes into the reservoir; if it does, choose one of the items already in the reservoir at random and replace it. Continue in the same way, so that the i-th number enters the reservoir with probability k/i. For a sample space of any size n, each number then ends up selected with probability k/n. That is to say, every number is equally likely to be selected.

Pseudocode:

    init: a reservoir of size k, holding the first k values
    for i = k + 1 to n
        m = random(1, i)    // uniform integer, both ends inclusive
        if (m <= k)
            swap the m-th value and the i-th value
    end
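A minimal Python sketch of the same idea follows. The name reservoir_sample and the rng parameter are assumptions introduced for illustration. The sketch draws m uniformly from 1 to i inclusive and replaces a slot when m <= k, so that every one of the k slots can be replaced and the i-th item enters with probability k/i, as described in the text.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(items: Iterable[T], k: int,
                     rng: Optional[random.Random] = None) -> List[T]:
    """Return k items sampled uniformly from a stream of unknown length.

    If the stream has fewer than k items, all of them are returned.
    `reservoir_sample` and `rng` are hypothetical names for this sketch.
    """
    rng = rng or random.Random()
    reservoir: List[T] = []
    for i, item in enumerate(items, start=1):
        if i <= k:
            # The first k items fill the reservoir directly.
            reservoir.append(item)
        else:
            # randint is inclusive on both ends, so m <= k has probability k/i.
            m = rng.randint(1, i)
            if m <= k:
                reservoir[m - 1] = item
    return reservoir
```

Overwriting the slot, rather than swapping, is enough here because the stream is consumed once and the displaced item is never needed again.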

The proof is as follows:
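A sketch of one possible argument, assuming the scheme described above (the j-th number enters the reservoir with probability k/j and then replaces a uniformly chosen slot), is given below. It is a reconstruction, not necessarily the author's original proof. At step j > k, a given item already in the reservoir is displaced with probability (k/j)(1/k) = 1/j, so it survives with probability (j-1)/j.

```latex
\begin{align*}
\text{For } i \le k:\quad
P(\text{item } i \text{ is in the final reservoir})
  &= \prod_{j=k+1}^{n} \frac{j-1}{j} = \frac{k}{n}, \\
\text{For } i > k:\quad
P(\text{item } i \text{ is in the final reservoir})
  &= \frac{k}{i} \prod_{j=i+1}^{n} \frac{j-1}{j}
   = \frac{k}{i} \cdot \frac{i}{n} = \frac{k}{n}.
\end{align*}
```

In both cases the product telescopes, and every one of the n numbers is kept with the same probability k/n.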

Reservoir sampling is really a whole class of problems. To sum up: I sincerely admire how clever this method is, but I still feel I have not traced the idea back to its source. What is really worthwhile is understanding why this solution works and how one would come to think of it.
