Reservoir Sampling algorithm

Source: Internet
Author: User

Brief introduction

Function: reservoir sampling is a sampling algorithm that guarantees a uniformly random sample, even when drawing from a very large set.

Characteristics: its time complexity is low, O(n), and it saves a great deal of memory because only the sampled items are kept.

Problem introduction

Many large companies have asked about this algorithm in interviews. Take Google as an example; here is one reservoir sampling question:

I have a linked list of length N. N is very large, and I do not know its exact value. How can I write an algorithm, as efficient as possible, that returns K completely random elements?

There are two restrictions on this problem:

1. Efficient, that is, economical with memory

2. The returned values should be as random as possible

If we drop restriction 1, the problem is simple: load all the data into memory, compute the length of the list, and then use a random function to pick a few random positions.

This is not very efficient. If the data set is very large, loading all of it into memory can make the computation impossible.

Note the small hint in the question: a linked list. A linked list is a chained storage structure in which data nodes are connected end to end.

Because it is a linked list, it can be processed one node at a time, without loading all the data into memory. If processing node by node is not vivid enough, the question can be restated in a different form:

We have a 1 TB text file on disk and want to pull out a few lines at random, using as little memory as possible while keeping the selection completely random.

Loading everything into memory, as considered earlier, is not suitable, but other methods come to mind. For example: read one row at a time into memory, increment a counter, and discard the row, until the counter holds the total number of rows; then compute K random numbers from that total. How do we then retrieve the corresponding rows? Read the file again, counting rows as we go, and check whether each row's number is among the K random numbers; if it is, keep the contents of that row.

Two passes over the data will do the job, but traversing 1 TB of data twice is very time-consuming.
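The two-pass idea can be sketched in Java roughly as follows. This is only an illustration under stated assumptions: the class name `TwoPassSampler` and the method `sampleLines` are hypothetical, the input is assumed to be a UTF-8 text file with one record per row, and the row numbers are drawn with a simple floating-point trick that is close to uniform but not exact for extremely large files.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

final class TwoPassSampler {
    /** Hypothetical helper: returns up to k distinct rows of the file, reading the file twice. */
    static List<String> sampleLines(Path file, int k, Random random) throws IOException {
        if (k < 0) {
            throw new IllegalArgumentException("k must not be negative");
        }
        // Pass 1: count the rows; only the counter stays in memory.
        long total = 0;
        try (BufferedReader reader = Files.newBufferedReader(file)) {
            while (reader.readLine() != null) {
                total++;
            }
        }
        // Choose min(k, total) distinct row numbers in the range 0..total-1.
        int wanted = (int) Math.min(k, total);
        Set<Long> chosen = new HashSet<>();
        while (chosen.size() < wanted) {
            chosen.add((long) (random.nextDouble() * total));
        }
        // Pass 2: read the file again and keep only the chosen rows.
        List<String> result = new ArrayList<>(wanted);
        long rowNumber = 0;
        try (BufferedReader reader = Files.newBufferedReader(file)) {
            String line;
            while (result.size() < wanted && (line = reader.readLine()) != null) {
                if (chosen.contains(rowNumber)) {
                    result.add(line);
                }
                rowNumber++;
            }
        }
        return result;
    }
}
```

A call might look like `TwoPassSampler.sampleLines(Paths.get("data.txt"), 10, new Random())`, where `data.txt` is a placeholder file name. The sketch keeps only K row numbers and K rows in memory, but every row is read from disk twice, which is the cost described above.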

There is a better solution: the reservoir sampling algorithm.

A concrete example of the reservoir sampling algorithm

We first look at how reservoir sampling works through a concrete example, and then describe it in abstract terms.

Suppose we have 10,000 numbers and want to draw 10 of them at random.

The 10,000 numbers form the sample set, an array denoted S.

The 10 random numbers form an array denoted R, short for result.

The first 10 numbers of S are copied into the array R.

The first iteration of the algorithm goes like this:

    • Start the iteration from the 11th number (subscript 10) and generate a random integer j from 0 to 10. If j<10 (for example, j=4), we replace the 5th item of the array R (R[4]) with the 11th item of the array S (S[10]).

Each later iteration does the same with the range widened by one: for the 12th number (subscript 11), j is drawn from 0 to 11, and so on. After iterating through all of S, the resulting array R is the random sample we asked for.

Abstract Concepts

\(s[N]\): the sample set

\(r[k]\): the result set

\(N\): the size of the array \(s\)

\(j\): the random number generated in each iteration

\(k\): the number of random items to select

\(i\): the iteration count, starting from 1

Steps

    • Take the first \(k\) numbers of \(s\) and put them into \(r\).

    • Start the traversal from \(s[k]\). In the \(i\)-th iteration (\(i = 1, 2, \dots\)) the element visited is \(s[k+i-1]\):

      Generate a random integer \(j\) in the range \(0\) to \(k+i-1\), inclusive. The \(-1\) is there because array subscripts start at 0.

      If \(j<k\), replace an element of \(r\): \(r[j] = s[k+i-1]\).

    • When the traversal ends, the array \(r\) is the result.
Algorithm implementation (Java)

        import java.util.Random;

        public class ReservoirSamplingDemo {
            public static void main(String[] args) {
                Random random = new Random();

                // Generate the sample set S of 10,000 numbers
                int[] S = new int[10000];
                int N = S.length;
                for (int r = 0; r < N; r++) {
                    S[r] = random.nextInt(10000);
                }

                // Fill the array R with the first K numbers of S
                int K = 10;
                int[] R = new int[K];
                for (int f = 0; f < K; f++) {
                    R[f] = S[f];
                }

                // Traverse the rest of S and replace elements of R according to the algorithm
                for (int i = K; i < N; i++) {
                    // j is uniform over 0..i inclusive, so S[i] enters R with probability K / (i + 1)
                    int j = random.nextInt(i + 1);
                    if (j < K) {
                        R[j] = S[i];
                    }
                }

                // Print the resulting array R
                for (int i = 0; i < R.length; i++) {
                    System.out.println(R[i]);
                }
            }
        }

To summarize: the algorithm obtains the k random numbers in a single pass through the data, so it is very efficient on large data sets and well suited to our application scenario.
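The original question involves a linked list whose length is not known in advance, so the same idea can also be sketched over an `Iterator` instead of an array. The class name `StreamReservoir` and the method `sample` below are hypothetical names chosen for illustration, and the sketch assumes the source holds fewer than `Integer.MAX_VALUE` items because it counts them with an `int`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

final class StreamReservoir {
    /** Hypothetical helper: returns up to k items chosen at random from a source of unknown length. */
    static <T> List<T> sample(Iterator<T> source, int k, Random random) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        List<T> reservoir = new ArrayList<>(k);
        int seen = 0; // number of items read so far (assumed to fit in an int)
        while (source.hasNext()) {
            T item = source.next();
            seen++;
            if (reservoir.size() < k) {
                // The first k items fill the reservoir directly.
                reservoir.add(item);
            } else {
                // j is uniform over 0..seen-1, so the item is kept with probability k / seen.
                int j = random.nextInt(seen);
                if (j < k) {
                    reservoir.set(j, item);
                }
            }
        }
        return reservoir;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("a", "b", "c", "d", "e", "f", "g");
        // Prints 3 of the 7 rows; the selection differs from run to run.
        System.out.println(sample(rows.iterator(), 3, new Random()));
    }
}
```

Only the k selected items and one counter are held in memory, and the source is read once, which matches the two restrictions of the interview question. If the source has fewer than k items, this sketch simply returns all of them.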

But why is the resulting sample completely random?

Take the concrete example again. In the first iteration, i=10, the random number ranges from 0 to 10, a total of 11 values, so the probability that the new number replaces an element of \(r\) is \(10/11\). In the second iteration this probability becomes \(10/12\), in the third \(10/13\), in the fourth \(10/14\), and so on.

So it seems that the probabilities are not equal. In fact that is not the case: what matters is the probability of being in the array \(r\) at the end. Although the 11th number enters \(r\) with a relatively high probability, it is also more likely to be replaced later. So what is the probability that each number eventually remains in \(r\)?
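As a worked instance of this question for the example above (10 numbers drawn from 10,000), the calculation for the 11th number can be sketched as follows. It assumes that each later number, say the \(m\)th, enters \(r\) with probability \(10/m\) and then overwrites one particular slot with probability \(1/10\), so a given element already in \(r\) is evicted with probability \(1/m\).

```latex
\begin{aligned}
P(\text{11th number stays in } r)
  &= \frac{10}{11} \times \left(1 - \frac{1}{12}\right) \times \left(1 - \frac{1}{13}\right) \times \dots \times \left(1 - \frac{1}{10000}\right) \\
  &= \frac{10}{11} \times \frac{11}{12} \times \frac{12}{13} \times \dots \times \frac{9999}{10000} \\
  &= \frac{10}{10000}
\end{aligned}
```

The product telescopes: every numerator cancels the previous denominator, leaving only the first numerator and the last denominator.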

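The proof on Wikipedia makes this clear.

When the loop reaches the \(n\)th row (for \(n > k\)), that row is extracted with probability \(k/n\), denoted \(p_n\). If the file has \(N\) rows, then for any \(n\)th row (note that \(n\) here is the ordinal, not the total), the probability of being extracted and still being in \(r\) at the end is:

\[ p_n \times \left(1 - \frac{1}{n+1}\right) \times \left(1 - \frac{1}{n+2}\right) \times \dots \times \left(1 - \frac{1}{N}\right) = \frac{k}{n} \times \frac{n}{n+1} \times \frac{n+1}{n+2} \times \dots \times \frac{N-1}{N} = \frac{k}{N} \]

Each factor \(1 - \frac{1}{m}\) is the probability that the row survives the arrival of row \(m\): row \(m\) enters \(r\) with probability \(k/m\) and then overwrites any one particular slot with probability \(1/k\). The first \(k\) rows start in \(r\) with probability 1 and pass through the same factors from row \(k+1\) onward, which again gives \(k/N\).

So the probability of being extracted is the same for every row, equal to \(k/N\).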
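The claim that every row ends up selected with the same probability can also be checked empirically. The sketch below is an illustration only: the class name `UniformityCheck`, the method `countSelections`, and the small sizes (20 items, 5 selected) are assumptions chosen to keep the run short, and the method requires that the number selected is not larger than the number of items.

```java
import java.util.Random;

final class UniformityCheck {
    /** Runs the sampler `trials` times over the indices 0..n-1 and counts how often each index is kept. */
    static int[] countSelections(int n, int k, int trials, Random random) {
        int[] counts = new int[n];
        int[] r = new int[k];
        for (int t = 0; t < trials; t++) {
            // Fill r with the first k indices.
            for (int i = 0; i < k; i++) {
                r[i] = i;
            }
            // Visit the remaining indices; index i enters r with probability k / (i + 1).
            for (int i = k; i < n; i++) {
                int j = random.nextInt(i + 1);
                if (j < k) {
                    r[j] = i;
                }
            }
            for (int kept : r) {
                counts[kept]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        int n = 20;
        int k = 5;
        int trials = 200_000;
        int[] counts = countSelections(n, k, trials, new Random());
        for (int i = 0; i < n; i++) {
            System.out.printf("index %2d kept in %.4f of trials (expected %.4f)%n",
                    i, (double) counts[i] / trials, (double) k / n);
        }
    }
}
```

With enough trials, the observed frequency for each index should be close to \(k/N\) (0.25 with these sizes), whether the index was among the first k or arrived later.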

The algorithm is very ingenious, so when we face this kind of scenario, we can consider using reservoir sampling for random extraction.
