Reservoir sampling: a classic sampling algorithm

Source: Internet
Author: User
Tags: shuffle

When data is read at random, true randomness cannot be guaranteed, because the computer's random function is only pseudo-random.

Setting the quality of the random function aside, how can we make sure that the data itself is sampled uniformly?

1. Shuffle functions provided by the system

Both C++ and Java provide a shuffle function that rearranges the elements of a container into a random order.

C++:

template <class RandomAccessIterator, class URNG>
void shuffle(RandomAccessIterator first, RandomAccessIterator last, URNG&& g);

Java:

static void shuffle(List<?> list);
static void shuffle(List<?> list, Random rnd);

These functions shuffle a fixed number of elements into a random order. They need the whole collection up front, so they cannot handle a data stream whose length is not known in advance.
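To illustrate the limitation, here is a minimal sketch of std::shuffle applied to a container that is fully in memory. The helper name shuffled_copy and its seed parameter are assumptions made for this example, not part of the standard library.

```cpp
#include <algorithm>
#include <cassert>
#include <random>
#include <vector>

// Returns a shuffled copy of `data`.
// shuffled_copy and `seed` are illustrative names, not a library API.
std::vector<int> shuffled_copy(std::vector<int> data, unsigned seed) {
    const std::vector<int> original = data;
    std::mt19937 gen(seed);                       // pseudo-random engine
    std::shuffle(data.begin(), data.end(), gen);  // needs both ends of the range
    // Shuffling only reorders: same elements, same count.
    assert(std::is_permutation(data.begin(), data.end(), original.begin()));
    return data;
}
```

Because std::shuffle takes both a begin and an end iterator, every element must already be available when it is called. That is exactly what a stream of unknown length cannot offer.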

2. Pick one number from a stream so that every number read so far is chosen with probability 1/(number of data read)

Assume that n numbers have been read and that the number currently retained is a_x. Each a_x (1 <= x <= n) is the retained number with probability 1/n.

When the (n+1)-th number a_{n+1} arrives, take a_{n+1} with probability 1/(n+1); otherwise keep a_x. Applying this rule to every new number keeps the choice uniform.

Proof by induction:

When n = 1, a_1 is obviously taken, with probability 1/1.

Assume that when n = k, the retained number is a_x, and that each a_x (1 <= x <= k) is retained with probability 1/k.

When n = k+1, take a_{k+1} with probability 1/(k+1); otherwise keep a_x.

(1) The probability that a_{k+1} is retained is 1/(k+1);

(2) The probability that a_x is still retained is (1/k) * (k/(k+1)) = 1/(k+1), that is, it was retained before and a_{k+1} was not taken.

So every one of the k+1 numbers is retained with probability 1/(k+1), and the rule holds for all n: take the new number a_{n+1} with probability 1/(n+1), otherwise keep a_x.
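As a concrete check of the induction, consider a hypothetical stream of three values a_1, a_2, a_3. Each value must first be taken when it arrives and then survive every later step. Under the rule above, the probabilities work out as follows:

```latex
\begin{align*}
P(a_1 \text{ kept}) &= 1 \cdot \left(1 - \tfrac{1}{2}\right) \cdot \left(1 - \tfrac{1}{3}\right) = \tfrac{1}{3} \\
P(a_2 \text{ kept}) &= \tfrac{1}{2} \cdot \left(1 - \tfrac{1}{3}\right) = \tfrac{1}{3} \\
P(a_3 \text{ kept}) &= \tfrac{1}{3}
\end{align*}
```

All three values end up with the same probability, 1/3, as the proof predicts.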

  The code is as follows:

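#include <cstdlib>
#include <iostream>
using namespace std;

// Take one number from the input stream uniformly: each number read
// is the final result with probability 1/(number of data read).
void RandNum() {
    int res = 0;
    int num = 1;   // how many numbers have been read so far
    cin >> res;    // the first number is always taken

    int tmp;
    while (cin >> tmp) {
        // rand() % (num + 1) + 1 lies in 1..num+1, so it exceeds num
        // with probability 1/(num + 1).
        if (rand() % (num + 1) + 1 > num)
            res = tmp;
        num++;
    }
    cout << "res=" << res << endl;
}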
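The same idea can be sketched with the <random> header, which avoids the slight bias of rand() % n and makes the routine easy to test. This is an illustrative variant, not the author's code: the function name sample_one is hypothetical, and a std::vector stands in for the stream.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Picks one element of `stream` in a single pass; each element is returned
// with probability 1/stream.size(). sample_one is an illustrative name.
template <class Rng>
int sample_one(const std::vector<int>& stream, Rng& gen) {
    assert(!stream.empty());
    int res = stream[0];  // the first value is always taken
    for (std::size_t num = 1; num < stream.size(); ++num) {
        // `num` values have been seen; take the next one with probability 1/(num+1).
        std::uniform_int_distribution<std::size_t> pick(0, num);
        if (pick(gen) == num) res = stream[num];
    }
    return res;
}
```

Called many times over the same three-element input, each element should come back in roughly one third of the runs.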

3. Pick k numbers from a stream so that every number read so far is kept with probability k/(number of data read)

Create an array of size k and store the first k numbers of the stream in it. This array is the so-called "reservoir".

For the n-th number a_n (n > k), take a_n with probability k/n and use it to replace one element of the "reservoir", each slot being chosen with probability 1/k; otherwise leave the "reservoir" unchanged. Applying this rule to every new number keeps the sample uniform.

Proof by induction:

When n = k, all k numbers read so far are in the "reservoir", so each is kept with probability k/k = 1.

Assume that when n = m (m >= k), each number read so far is in the "reservoir" with probability k/m.

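When n = m+1, a_{m+1} is taken with probability k/(m+1) and replaces one slot of the "reservoir" chosen with probability 1/k; otherwise the "reservoir" is unchanged. A number already in the "reservoir" is evicted only if a_{m+1} is taken and its own slot is the one chosen, which happens with probability (k/(m+1)) * (1/k) = 1/(m+1). The probability that an earlier number is still in the array after this step is therefore:

(k/m) * (1 - 1/(m+1)) = (k/m) * (m/(m+1)) = k/(m+1)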

So every number read, old or new, is in the "reservoir" with probability k/(m+1), and the rule holds for all n: take a_n with probability k/n and replace a slot chosen with probability 1/k; otherwise leave the "reservoir" unchanged.

The code is as follows:

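#include <cstdlib>
#include <iostream>
using namespace std;

// Take n numbers from the input stream uniformly: each number read
// is kept with probability n/(number of data read).
void RandKNum(int n) {
    int *myarray = new int[n];
    for (int i = 0; i < n; i++)   // fill the reservoir with the first n numbers
        cin >> myarray[i];

    int tmp = 0;
    int num = n;                  // how many numbers have been read so far
    while (cin >> tmp) {
        // rand() % (num + 1) + 1 lies in 1..num+1, so it is <= n
        // with probability n/(num + 1).
        if (rand() % (num + 1) + 1 <= n)
            myarray[rand() % n] = tmp;   // replace a randomly chosen slot
        num++;
    }

    for (int i = 0; i < n; i++)
        cout << myarray[i] << endl;
    delete[] myarray;
}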
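A common variant, often called Algorithm R, uses a single random draw both to decide whether the new value is taken and to choose the slot it replaces. The sketch below assumes a std::vector in place of the stream and uses the hypothetical name sample_k. It is an equivalent formulation, not the code above.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Keeps k elements of `stream` so that each element ends up in the result
// with probability k/stream.size(). sample_k is an illustrative name.
template <class Rng>
std::vector<int> sample_k(const std::vector<int>& stream, std::size_t k, Rng& gen) {
    assert(k > 0 && k <= stream.size());
    // The reservoir starts as the first k values.
    std::vector<int> reservoir(stream.begin(),
                               stream.begin() + static_cast<std::ptrdiff_t>(k));
    for (std::size_t n = k; n < stream.size(); ++n) {
        // stream[n] is the (n+1)-th value: draw j uniformly from 0..n.
        std::uniform_int_distribution<std::size_t> pick(0, n);
        std::size_t j = pick(gen);
        // j < k happens with probability k/(n+1), and j is then a uniform slot.
        if (j < k) reservoir[j] = stream[n];
    }
    return reservoir;
}
```

Drawing j from 0..n and replacing slot j only when j < k is equivalent to taking the value with probability k/(n+1) and then choosing one of the k slots with probability 1/k.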
