Random Sampling from Large Data Sets: The Reservoir Algorithm

Last Update:2017-10-27 Source: Internet

Author: User

Recently, while optimizing a personalized recommendation system, we ran into the following problem. In our current system, the items recalled by each recommendation strategy are relatively fixed. As a result, a user sees the same results across multiple recommendation scenarios (when the same recall strategy is used in several scenarios) and across repeated requests, which lowers the utilization of traffic. The problem is worse for users with little behavior history: very little behavior data is available to act as triggers, so the recalled items are even more homogeneous.

The current solution is to add randomness to the recall phase of each recommendation strategy, so that users see similar but not identical results across scenarios and requests. The problem then becomes randomly sampling k results from N recalled results (the recall set has to be enlarged accordingly). There are two difficulties:

1. When N is very large, sampling k of the N items directly is slow. We also require sampling without replacement, so whenever a newly drawn sample duplicates one that was already drawn, it has to be redrawn. This hurts the performance of online computation, and the effect becomes more serious as N grows. We therefore need a sampling algorithm with low time complexity, for example O(N).

2. Each item recalled by a recommendation strategy carries a different weight (its similarity). We can use this information as well, so that sampling is not uniform but weighted by probability.
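To make the first difficulty concrete, here is a minimal sketch of the naive "draw and redraw on duplicates" approach. The class `NaiveRejectionSampler` and its `Result` type are hypothetical names introduced only for this illustration; they are not from the original system. The sketch counts how many random draws were needed, which exceeds k whenever duplicates are rejected.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

// Hypothetical illustration of sampling without replacement by redrawing duplicates.
class NaiveRejectionSampler {

    /** The sampled indices plus the number of random draws that were needed. */
    static final class Result {
        final List<Integer> indices;
        final long draws;

        Result(List<Integer> indices, long draws) {
            this.indices = indices;
            this.draws = draws;
        }
    }

    /** Samples k distinct indices from [0, n) by redrawing whenever a duplicate appears. */
    static Result sample(int n, int k, Random random) {
        if (k < 0 || k > n) {
            throw new IllegalArgumentException("need 0 <= k <= n");
        }
        Set<Integer> seen = new HashSet<>();
        List<Integer> indices = new ArrayList<>(k);
        long draws = 0;
        while (indices.size() < k) {
            int candidate = random.nextInt(n);
            draws++;
            // Set.add returns false for a duplicate, in which case the draw is wasted.
            if (seen.add(candidate)) {
                indices.add(candidate);
            }
        }
        return new Result(indices, draws);
    }

    public static void main(String[] args) {
        Result result = sample(1000, 900, new Random());
        System.out.println("draws needed for 900 of 1000: " + result.draws);
    }
}
```

The number of wasted draws depends on how close k is to n; in the extreme case k = n, the expected number of draws is n(1 + 1/2 + ... + 1/n), which grows like n ln n.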

The first problem can be solved with the reservoir sampling algorithm. Consider a simplified version of the problem first: randomly sample 1 item from n items.

Solution: always select the first object, replace the selection with the second object with probability 1/2, with the third object with probability 1/3, and so on, replacing it with the m-th object with probability 1/m. When the process finishes, every object has the same probability of being the selected one, namely 1/n, which can be proved as follows.

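Proof: the probability p that the m-th object is the final selection equals the probability of selecting the m-th object multiplied by the probability that none of the objects after it is selected, i.e.

$$p = \frac{1}{m} \cdot \prod_{i=m+1}^{n}\left(1 - \frac{1}{i}\right) = \frac{1}{m} \cdot \frac{m}{m+1} \cdot \frac{m+1}{m+2} \cdots \frac{n-1}{n} = \frac{1}{n}$$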
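The single-item selection rule described above can be sketched as follows. This is a minimal illustration, not code from the original system; the class name `SingleItemSampler` and the method `sampleOne` are assumptions made for the example. It reads the input through an `Iterator`, so the total number of items does not need to be known in advance.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

// Hypothetical illustration of sampling one item from a stream of unknown length.
class SingleItemSampler {

    /** Returns one element chosen uniformly at random, or null if the stream is empty. */
    static <T> T sampleOne(Iterator<T> stream, Random random) {
        T chosen = null;
        int seen = 0;
        while (stream.hasNext()) {
            T item = stream.next();
            seen++;
            // nextInt(seen) is uniform on [0, seen), so the m-th item
            // replaces the current choice with probability 1/m.
            if (random.nextInt(seen) == 0) {
                chosen = item;
            }
        }
        return chosen;
    }

    public static void main(String[] args) {
        List<String> items = Arrays.asList("a", "b", "c", "d");
        System.out.println(sampleOne(items.iterator(), new Random()));
    }
}
```

The first item is always kept because `nextInt(1)` can only return 0, which matches the "always select the first object" step.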

Now for the actual reservoir sampling problem: randomly sample k items from N items. The same idea applies. First read the first k objects into the "reservoir". Then select the (k+1)-th object with probability k/(k+1), the (k+2)-th object with probability k/(k+2), and so on, selecting the m-th object (m > k) with probability k/m. Whenever the m-th object is selected, it replaces an object chosen uniformly at random from the reservoir. In the end, each object is in the sample with probability k/N, which can be proved as follows.

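Proof: the probability p that the m-th object (m > k) ends up in the sample equals the probability of selecting the m-th object multiplied, for every later object, by (the probability that the later object is not selected + the probability that it is selected but does not replace the m-th object), i.e.

$$p = \frac{k}{m} \cdot \prod_{i=m+1}^{N}\left[\left(1 - \frac{k}{i}\right) + \frac{k}{i} \cdot \frac{k-1}{k}\right] = \frac{k}{m} \cdot \prod_{i=m+1}^{N}\frac{i-1}{i} = \frac{k}{m} \cdot \frac{m}{N} = \frac{k}{N}$$

For the first k objects (m <= k), the selection probability is 1 and the product starts at i = k+1, which again gives k/N.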

The actual code implementation is still relatively simple:

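    // Fill the reservoir with the first sampleNum items.
    List<Map<String, Object>> sampleList = new ArrayList<>();
    for (int i = 0; i < sampleNum; ++i) {
        sampleList.add(rawList.get(i));
    }
    // Item i (0-based) enters the reservoir with probability sampleNum / (i + 1).
    for (int i = sampleNum; i < rawListSize; ++i) {
        int j = r.nextInt(i + 1);
        if (j < sampleNum) {
            // Evict the item at the random slot j and append the new item.
            sampleList.remove(j);
            sampleList.add(rawList.get(i));
        }
    }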
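The fragment above depends on variables defined elsewhere (`rawList`, `sampleNum`, `rawListSize`, `r`). For readers who want something they can run directly, here is a self-contained sketch of the same idea. The class name `ReservoirSampler` and the generic `sample` method are assumptions made for this example. One deliberate difference: the sketch overwrites slot j in place with `set` instead of calling `remove` followed by `add`, which avoids shifting elements in the `ArrayList`; both variants evict a uniformly chosen slot.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical self-contained version of uniform reservoir sampling.
class ReservoirSampler {

    /** Returns min(sampleNum, rawList.size()) items sampled uniformly without replacement. */
    static <T> List<T> sample(List<T> rawList, int sampleNum, Random random) {
        if (sampleNum < 0) {
            throw new IllegalArgumentException("sampleNum must be non-negative");
        }
        int k = Math.min(sampleNum, rawList.size());
        // The first k items fill the reservoir.
        List<T> reservoir = new ArrayList<>(rawList.subList(0, k));
        for (int i = k; i < rawList.size(); i++) {
            // j is uniform on [0, i]; j < k happens with probability k / (i + 1).
            int j = random.nextInt(i + 1);
            if (j < k) {
                reservoir.set(j, rawList.get(i));
            }
        }
        return reservoir;
    }

    public static void main(String[] args) {
        List<Integer> raw = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            raw.add(i);
        }
        System.out.println(sample(raw, 5, new Random()));
    }
}
```

The whole pass makes one random draw per item after the first k, so the running time is O(N) regardless of how many candidates are rejected.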

Now for the second problem, which is weighted probabilistic sampling. Is there a weighted sampling algorithm based on the reservoir algorithm? There is; for more details, see the paper "Weighted random sampling with a reservoir".

Each sample has a weight w_i, which is transformed into a score for that sample: sampleScore = random(0, 1) ^ (1 / w_i). The sampling process is then the same as before: the samples are read one by one. A min-heap ordered by sampleScore is maintained for the first k samples. For each subsequent sample, its sampleScore is compared with the smallest sampleScore in the heap; if it is larger, the minimum is popped and the new sample is pushed, keeping the min-heap property. This continues until all samples have been traversed once.

The specific code is implemented as follows:

    // Min-heap ordering: the item with the smallest sampleScore sits at the head.
    Comparator<Map<String, Object>> cmp = new Comparator<Map<String, Object>>() {
        public int compare(Map<String, Object> e1, Map<String, Object> e2) {
            return Double.compare((Double) e1.get(sampleScoreField), (Double) e2.get(sampleScoreField));
        }
    };
    PriorityQueue<Map<String, Object>> pq = new PriorityQueue<>(sampleNum, cmp);
    // Score the first sampleNum items and put them all into the heap.
    for (int i = 0; i < sampleNum; ++i) {
        Map<String, Object> item = rawList.get(i);
        // score = u^(1/w); the 0.001 offset avoids dividing by a zero weight.
        double sampleScore = Math.pow(r.nextDouble(), 1.0 / (0.001 + MapUtils.getDoubleValue(item, weightField, 0.0)));
        item.put(sampleScoreField, sampleScore);
        pq.add(item);
    }
    // A later item replaces the heap minimum only if its score is larger.
    for (int i = sampleNum; i < rawListSize; ++i) {
        Map<String, Object> item = rawList.get(i);
        double sampleScore = Math.pow(r.nextDouble(), 1.0 / (0.001 + MapUtils.getDoubleValue(item, weightField, 0.0)));
        item.put(sampleScoreField, sampleScore);
        Map<String, Object> minItem = pq.peek();
        if (sampleScore > (Double) minItem.get(sampleScoreField)) {
            pq.remove();
            pq.add(item);
        }
    }
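The code above relies on surrounding context (`rawList`, `weightField`, `sampleScoreField`, and a `MapUtils` helper). As a rough, self-contained sketch of the same scoring scheme, the example below uses only the JDK. The class `WeightedReservoirSampler`, the nested `Scored` type, and the parallel `weights` array are assumptions introduced for illustration. Unlike the code above, which adds 0.001 to every weight, this sketch simply skips items whose weight is not positive; that is a choice made for the example, not the original behavior.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Random;

// Hypothetical self-contained version of weighted reservoir sampling with a min-heap.
class WeightedReservoirSampler {

    /** An item paired with the random score drawn for it. */
    static final class Scored<T> {
        final T item;
        final double score;

        Scored(T item, double score) {
            this.item = item;
            this.score = score;
        }
    }

    /** Samples up to k items without replacement, favoring items with larger weights. */
    static <T> List<T> sample(List<T> items, double[] weights, int k, Random random) {
        if (items.size() != weights.length) {
            throw new IllegalArgumentException("items and weights must have the same length");
        }
        List<T> result = new ArrayList<>();
        if (k <= 0) {
            return result;
        }
        // PriorityQueue requires an initial capacity of at least 1.
        int capacity = Math.max(1, Math.min(k, items.size()));
        PriorityQueue<Scored<T>> heap =
                new PriorityQueue<>(capacity, Comparator.comparingDouble((Scored<T> s) -> s.score));
        for (int i = 0; i < items.size(); i++) {
            if (weights[i] <= 0.0) {
                continue; // assumption of this sketch: non-positive weights are never sampled
            }
            // score = u^(1/w) with u uniform on [0, 1)
            double score = Math.pow(random.nextDouble(), 1.0 / weights[i]);
            if (heap.size() < k) {
                heap.add(new Scored<>(items.get(i), score));
            } else if (score > heap.peek().score) {
                heap.poll(); // drop the current minimum score
                heap.add(new Scored<>(items.get(i), score));
            }
        }
        for (Scored<T> scored : heap) {
            result.add(scored.item);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> items = new ArrayList<>();
        items.add("a");
        items.add("b");
        items.add("c");
        double[] weights = {5.0, 1.0, 1.0};
        System.out.println(sample(items, weights, 2, new Random()));
    }
}
```

Each heap operation costs O(log k), so a full pass over N items costs O(N log k) in the worst case. The returned list is in heap order, not sorted by score.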

That is all.

Copyright Notice:

The copyright of this article belongs to its author (bentuwuying). It was published at http://www.cnblogs.com/bentuwuying. If you reproduce it, please cite the source. Anyone who uses this article for commercial purposes without the author's consent will be held legally responsible.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
