Reservoir Sampling algorithm

Last Update:2018-06-01 Source: Internet

Author: User


Brief introduction

Function: reservoir sampling is a sampling algorithm that guarantees every element of a large set has the same chance of being selected.

Characteristics: its time complexity is only O(n), and it saves a great deal of memory because only the sampled values have to be kept in memory.

Problem introduction

Interview questions at many large companies have tested this algorithm. Take Google as an example; here is one reservoir sampling question:

I have a linked list of length N. N is very large and I do not know its exact value. How can I write an algorithm that returns K completely random elements as efficiently as possible?

This problem has two constraints:

1. Be efficient, that is, use as little memory as possible

2. Return values that are as random as possible

If we drop constraint 1, the problem is simple: load all the data into memory, compute the length of the list, and then use a random function to pick a few random positions.

This is not efficient: all the data has to be loaded into memory, and if the data set is very large the computation may not be possible at all.
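For comparison, the in-memory approach might look like the following minimal sketch. The class name `NaiveSampler` and its `sample` method are hypothetical names chosen for illustration; the sketch assumes the whole input already fits in a `List`, which is exactly the assumption that fails for very large data.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper for illustration only: needs the whole input in memory.
class NaiveSampler {
    static <T> List<T> sample(List<T> all, int k, Random random) {
        // Copy so the caller's list is left untouched: O(n) extra memory.
        List<T> copy = new ArrayList<>(all);
        // A uniform shuffle makes the first k positions a uniform random sample.
        Collections.shuffle(copy, random);
        return new ArrayList<>(copy.subList(0, Math.min(k, copy.size())));
    }

    public static void main(String[] args) {
        List<Integer> data = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            data.add(i);
        }
        System.out.println(sample(data, 5, new Random()));
    }
}
```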

Note the small hint in the question: linked list. A linked list is a chained storage structure in which data nodes are linked end to end.

Since it is a linked list, it can be processed one node at a time without loading all the data into memory. Processing node by node is still a little abstract, so here is the same problem in a different form:

We have a 1 TB text file on the hard drive and want to pull K lines from it at random, using as little memory as possible while keeping the selection completely random.

Loading everything into memory, as before, is not suitable, but other methods come to mind. For example, read one row at a time into memory, increment a counter, and discard the row, until the counter holds the total number of rows; then generate K random row numbers from that total. How do we get the data for those rows? Traverse the file again, counting rows as we go: if the current row number is one of the K random numbers, keep the contents of that row.

Traversing the file twice like this works, but traversing 1 TB of data twice is very time-consuming.
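The two-pass idea can be sketched roughly as follows. `TwoPassSampler` is a hypothetical name, and the sketch assumes the rows are exposed as an `Iterable<String>` that can be traversed twice and yields the same rows each time (for a real file this would mean opening and reading it twice).

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

// Hypothetical helper for illustration only: traverses the input twice.
class TwoPassSampler {
    static List<String> sample(Iterable<String> rows, int k, Random random) {
        // Pass 1: count the rows without keeping any of them.
        int total = 0;
        for (String ignored : rows) {
            total++;
        }
        // Pick min(k, total) distinct row numbers in [0, total).
        int wanted = Math.min(k, total);
        Set<Integer> chosen = new HashSet<>();
        while (chosen.size() < wanted) {
            chosen.add(random.nextInt(total));
        }
        // Pass 2: keep only the rows whose number was picked.
        List<String> result = new ArrayList<>();
        int rowNumber = 0;
        for (String row : rows) {
            if (chosen.contains(rowNumber)) {
                result.add(row);
            }
            rowNumber++;
        }
        return result;
    }
}
```

Only the chosen row numbers and the selected rows are held in memory, but every row is read twice, which is the cost the text objects to.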

So there is a better solution: the reservoir sampling algorithm.

A concrete example of the reservoir sampling algorithm

We first look at how reservoir sampling works in a concrete case, and then describe it in the abstract.

Suppose we have 10,000 numbers and want to draw 10 of them at random.

The sample set is an array of 10,000 numbers, denoted S.

The 10 random numbers go into an array denoted R, for result.

The first 10 numbers of S are copied into R.

The first iteration of the algorithm goes like this:

Start iterating from the 11th number (subscript 10) and generate a random integer j from 0 to 10. If j < 10 (say j = 4), replace the 5th item of R, R[4], with the 11th item of S, S[10].

Once the iteration has gone through all of S, the array R is the random sample we asked for.

Abstract Concepts

\(s\): the sample set

\(r\): the result set

\(N\): the size of the array \(s\)

\(j\): the random number generated in each iteration

\(k\): the number of random values to draw, which is also the size of \(r\)

\(i\): the number of the current iteration, counting from 1

Steps

Take the first \(k\) elements of \(s\) and put them into \(r\).
Start the traversal from \(s[k]\).

In the \(i\)th iteration (counting from 1), generate a random integer \(j\) in the range \(0\) to \(k+i-1\), inclusive. The \(-1\) is there because array subscripts start at 0.

If \(j<k\), replace \(r[j]\) with the current element: \(r[j] = s[k+i-1]\).

When the traversal ends, the array \(r\) is the result.
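The steps above can also be sketched for a source whose length is not known in advance, such as the linked list in the interview question. This is a minimal sketch, not a definitive implementation: `ReservoirSampler` and `sample` are hypothetical names, and the source is assumed to be available as an `Iterator`. The counter `seen` plays the role of \(k+i-1\), the zero-based position of the current element.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

// Hypothetical helper for illustration only: one pass, O(k) memory.
class ReservoirSampler {
    static <T> List<T> sample(Iterator<T> source, int k, Random random) {
        List<T> reservoir = new ArrayList<>(k);
        int seen = 0; // elements read so far; an int limits this sketch to about 2^31 elements
        while (source.hasNext()) {
            T item = source.next();
            if (seen < k) {
                // The first k elements fill the reservoir directly.
                reservoir.add(item);
            } else {
                // nextInt(seen + 1) is uniform over 0..seen, inclusive.
                int j = random.nextInt(seen + 1);
                if (j < k) {
                    reservoir.set(j, item);
                }
            }
            seen++;
        }
        return reservoir;
    }
}
```

A call such as `ReservoirSampler.sample(list.iterator(), 10, new Random())` never needs the length of `list`; if the source holds fewer than `k` elements, all of them are returned.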

Algorithm implementation (Java)

    import java.util.Random;

    public class ReservoirSampling {
        public static void main(String[] args) {
            int[] S = new int[10000];
            int N = S.length;
            Random random = new Random();
            // Generate an array of 10,000 numbers
            for (int r = 0; r < N; r++) {
                S[r] = random.nextInt(10000);
            }
            int K = 10;
            int[] R = new int[K];
            // Fill R with the first K numbers of S
            for (int f = 0; f < K; f++) {
                R[f] = S[f];
            }
            int j;
            // Traverse S and replace elements of R according to the algorithm
            for (int i = K; i < S.length; i++) {
                // nextInt(i + 1) is uniform over 0..i inclusive, as the steps require
                j = random.nextInt(i + 1);
                if (j < K) {
                    R[j] = S[i];
                }
            }
            // Print the result array R
            for (int i = 0; i < R.length; i++) {
                System.out.println(R[i]);
            }
        }
    }

To summarize: the algorithm obtains K random numbers in a single pass over the data, which is very efficient for large data sets and fits our scenario well.

But why are the selected numbers completely random?

Take the specific example. In the first iteration, i=10, the random number ranges over 0 to 10, 11 values in total, so the probability that the 11th number enters \(r\) is \(10/11\). In the second iteration the probability that the new number enters \(r\) is \(10/12\), in the third \(10/13\), in the fourth \(10/14\), and so on.

So it seems the probabilities differ from one iteration to the next. That is not the whole picture, though: what matters is the probability of being in \(r\) at the end. The 11th number has a relatively large probability of entering \(r\), but it also has a large probability of being replaced later. So what is the probability that each number finally remains in \(r\)?

The proof on Wikipedia makes this very clear.

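Let \(N\) be the total number of rows in the file and let \(n\) be the ordinal of a row (note that \(n\) is a position, not the total). For \(n > k\), the probability that the \(n\)th row is placed into \(r\) when it is read is \(k/n\), denoted by \(p_n\). When a later row \(m\) is read, it is placed into \(r\) with probability \(p_m = k/m\) and lands on any given slot with probability \(1/k\), so row \(n\) survives that step with probability \(1 - p_m/k = (m-1)/m\).

The probability that the \(n\)th row is in the final result is therefore:

\[ p_n \times \left(1 - \frac{p_{n+1}}{k}\right) \times \left(1 - \frac{p_{n+2}}{k}\right) \times \cdots \times \left(1 - \frac{p_N}{k}\right) = \frac{k}{n} \times \frac{n}{n+1} \times \frac{n+1}{n+2} \times \cdots \times \frac{N-1}{N} = \frac{k}{N} \]

The first \(k\) rows start in \(r\) with probability 1, and the same product of survival factors taken from \(m = k+1\) to \(N\) again gives \(k/N\). So every row is extracted with the same probability, \(k/N\).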
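The claim that every element ends up in the result with the same probability, k divided by the total number of elements, can also be checked empirically. The following is a rough simulation sketch, not a proof: `UniformityCheck` and `frequencies` are hypothetical names, the items are simply the indices 0 to n-1, and it assumes k is no larger than n.

```java
import java.util.Random;

// Hypothetical simulation for illustration only.
class UniformityCheck {
    // For each of n items, returns the fraction of trials in which the item
    // was in the final reservoir of size k (assumes k <= n).
    static double[] frequencies(int n, int k, int trials, Random random) {
        int[] hits = new int[n];
        int[] r = new int[k];
        for (int t = 0; t < trials; t++) {
            // Fill the reservoir with the first k items.
            for (int i = 0; i < k; i++) {
                r[i] = i;
            }
            // Process the remaining items exactly as in the algorithm.
            for (int i = k; i < n; i++) {
                int j = random.nextInt(i + 1); // uniform over 0..i, inclusive
                if (j < k) {
                    r[j] = i;
                }
            }
            for (int kept : r) {
                hits[kept]++;
            }
        }
        double[] freq = new double[n];
        for (int i = 0; i < n; i++) {
            freq[i] = (double) hits[i] / trials;
        }
        return freq;
    }

    public static void main(String[] args) {
        double[] f = frequencies(20, 5, 200_000, new Random());
        for (int i = 0; i < f.length; i++) {
            System.out.printf("item %d: %.4f%n", i, f[i]);
        }
    }
}
```

With 20 items and a reservoir of 5, each printed frequency should come out close to 5/20 = 0.25, for early and late items alike.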

It is very ingenious. When we face a scenario like this, we can consider using reservoir sampling for random extraction.


This article is an English version of an article originally published in Chinese on aliyun.com and is provided for information purposes only.
