Random Sampling from Massive Data: The Reservoir Sampling Algorithm

Source: Internet
Author: User

The problem comes from Problem 10 of Column 12 in Programming Pearls, which reads as follows:

How could you select one of N objects at random, where you see the objects sequentially but you do not know the value of N beforehand? For concreteness, how would you read a text file, and select and print one random line, when you don't know the number of lines in advance?

The problem can be restated more simply: how do we randomly pick one line from a file without knowing the total number of lines?

First, is there a similar problem we already know how to solve? Yes: when the number of lines in the file is known, we can use the rand function of the C runtime library to generate a random line number and then retrieve that line. Here, however, the number of lines is unknown. How can the problem be solved? We need a technique that makes every line equally likely to be chosen, that is, a uniform random choice. That technique is reservoir sampling.

With this concept, we have a solution. Let choice denote the currently selected line. Take the first line as choice. On reading the second line, replace choice with it with probability 1/2; on reading the third line, replace choice with it with probability 1/3; and so on, so that the i-th line replaces choice with probability 1/i.
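The steps above can be sketched in Python as follows. This is a minimal illustration, not the original author's code: the function name pick_random_line and the optional rng parameter are assumptions introduced here, and the input can be any iterable, such as an open text file.

```python
import random


def pick_random_line(lines, rng=random):
    """Return one uniformly chosen item from an iterable of unknown length.

    `lines` is consumed once, front to back; only the current choice is
    kept in memory. Returns None if the iterable is empty.
    """
    choice = None
    for i, line in enumerate(lines, start=1):
        # The i-th line replaces the current choice with probability 1/i.
        # For i == 1, randrange(1) is always 0, so the first line is always taken.
        if rng.randrange(i) == 0:
            choice = line
    return choice
```

For a file, the call could look like `pick_random_line(open(path))`, where path is a hypothetical file name; the file is read once and its length never needs to be known.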

This method is ingenious because it guarantees that, after N lines have been scanned, each of them is the current choice with probability 1/N (where N is the number of lines scanned so far). In other words, every line is equally likely to be chosen, which is exactly a uniform random selection.
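The 1/N claim can be checked with a short derivation. The indices i and j are notation introduced here for illustration: line i is the final choice only if it is picked when it is read (probability 1/i) and then not replaced by any later line j (each with probability 1 - 1/j).

```latex
P(\text{line } i \text{ is the final choice})
  = \frac{1}{i} \prod_{j=i+1}^{N} \left(1 - \frac{1}{j}\right)
  = \frac{1}{i} \prod_{j=i+1}^{N} \frac{j-1}{j}
  = \frac{1}{i} \cdot \frac{i}{N}
  = \frac{1}{N}
```

The product telescopes: each numerator cancels the previous denominator, leaving i/N.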

Looking back at this problem, we can generalize it: how do we randomly select k numbers from a sample space whose size is unknown or very large?

By analogy, we get the answer: put the first k numbers into the reservoir. For each later number, say the i-th one (i > k), keep it with probability k/i; for example, the (k + 1)-th number is kept with probability k/(k + 1). If it is kept, it replaces an item chosen uniformly at random from the reservoir. Continue this way until the input ends. For a sample space of any size n, each number ends up in the reservoir with probability k/n; that is, every number is equally likely to be selected.
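A minimal Python sketch of this k-item version follows. The function name reservoir_sample and the rng parameter are assumptions made for illustration. Drawing one index j uniformly from the i items seen so far and accepting it only when j < k combines the two steps described above: the item is kept with probability k/i, and the slot it overwrites is uniform over the reservoir.

```python
import random


def reservoir_sample(items, k, rng=random):
    """Return k items chosen uniformly at random from an iterable.

    The length of `items` need not be known in advance. If it holds fewer
    than k items, all of them are returned.
    """
    reservoir = []
    for i, item in enumerate(items, start=1):
        if i <= k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # j is uniform over [0, i); j < k happens with probability k/i,
            # and in that case j is also a uniform slot in the reservoir.
            j = rng.randrange(i)
            if j < k:
                reservoir[j] = item
    return reservoir
```

For example, `reservoir_sample(range(1_000_000), 10)` returns 10 numbers while holding only 10 items in memory at any time. The order of items inside the returned list is not itself uniformly shuffled.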

For more information, see [1].

 

Reference

[1] Reservoir sampling. http://www.cnblogs.com/HappyAngel/archive/2011/02/07/1949762.html
