Sampling Problems: Reading Notes on Programming Pearls

Source: Internet
Author: User

Problem: given two integers m and n with m < n, output a sorted list of m random integers. The integers are drawn from the range 0 to n-1, and each integer may appear at most once.

Method 1:

The method proposed in Knuth's book Seminumerical Algorithms scans the n numbers in order and selects each element that passes a random test.

The random test is best explained by an example, with m = 2 and n = 5. The first element, 0, is selected with probability 2/5. The probability that the second element, 1, is selected depends on whether 0 was selected: if it was, the probability is 1/4; otherwise it is 2/4. The overall probability that 1 is selected is therefore (2/5) * (1/4) + (3/5) * (2/4) = 2/5. Similarly, the probability that element 2 is selected depends on the first two choices. If neither was selected, which happens with probability (3/5) * (2/4), element 2 is selected with probability 2/3. If exactly one was selected, which happens with probability (2/5) * (3/4) + (3/5) * (2/4), the probability is 1/3. If both were selected, the probability is 0. The overall probability that 2 is selected is therefore (3/5) * (2/4) * (2/3) + [(2/5) * (3/4) + (3/5) * (2/4)] * (1/3) = 1/5 + 1/5 = 2/5. Continuing in the same way, every element is selected with probability 2/5.

In general, if s more elements must be selected from the r elements that remain, the next element is selected with probability s/r. Over the whole data set, every element has the same probability of being selected.
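The hand calculation above can be checked mechanically. The following is a minimal sketch, not part of the original notes: the helper name inclusionProbabilities is an assumption made for illustration. It tracks the probability that exactly k elements have been chosen so far and applies the s/r rule at every step.

```cpp
#include <cassert>
#include <vector>

// Exact probability that each index 0..n-1 is chosen when every element is
// picked with probability s/r (s still needed, r still unexamined).
// inclusionProbabilities is a hypothetical helper name used for illustration.
std::vector<double> inclusionProbabilities(int m, int n) {
    assert(0 <= m && m <= n);
    // state[k] = probability that exactly k elements have been chosen so far
    std::vector<double> state(m + 1, 0.0);
    state[0] = 1.0;
    std::vector<double> result(n, 0.0);
    for (int i = 0; i < n; ++i) {
        const int r = n - i;                   // elements not yet examined
        std::vector<double> next(m + 1, 0.0);
        for (int k = 0; k <= m; ++k) {
            if (state[k] == 0.0) continue;     // unreachable state
            const int s = m - k;               // elements still to be chosen
            const double p = static_cast<double>(s) / r;
            result[i] += state[k] * p;         // element i is chosen
            if (k < m) next[k + 1] += state[k] * p;
            next[k] += state[k] * (1.0 - p);   // element i is skipped
        }
        state = next;
    }
    return result;
}
```

For m = 2 and n = 5, every entry of the returned vector should come out as 2/5 up to floating-point rounding, in agreement with the worked example.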

The code for this idea is as follows:

select = m
remaining = n
for i = [0, n)
        if (bigrand() % remaining) < select
                print i
                --select
        --remaining

First, the algorithm selects exactly m elements, no more and no fewer. It never selects more than m, because once select reaches 0 no further integer can be selected. It never selects fewer than m, because when select equals remaining (select/remaining = 1) the test bigrand() % remaining < remaining always holds, so element i is always selected.

Second, every element is selected with the same probability, m/n. A C++ implementation follows. To time the method, and to make comparison with the later methods easier, the test selects 100,000 integers from a range of 268,435,455 (about 270 million, the largest int divided by 8). The original plan was to test with the largest int itself, about 2.15 billion, but method 3 must first allocate an array of that size, which is more memory than the program could obtain.

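#include <iostream>
#include <ctime>
#include <cstdlib>
#include <limits>
using namespace std;

// Knuth's selection sampling: print m of the integers 0..n-1 in increasing order.
// Note: rand() stands in for bigrand() here, which assumes RAND_MAX >= n-1.
void genknuth(int m, int n)
{
    time_t t_start, t_end;
    t_start = time(NULL);
    for (int i = 0; i != n; ++i)
        if ((rand() % (n-i)) < m)   // select i with probability m / (n-i)
        {
            cout << i << " ";
            --m;                    // one fewer element left to select
        }
    cout << endl;
    t_end = time(NULL);
    cout << "elapsed time: " << difftime(t_end, t_start) << " s" << endl;
}

int main()
{
    int m = 100000;
    int n = numeric_limits<int>::max() / 8;
    srand(time(NULL));
    genknuth(m, n);
    cout << "n = " << n << endl;
    return 0;
}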

The algorithm runs in O(n) time and needs only O(1) extra space beyond the m values it outputs, since each selected value is printed as soon as it is found. Selecting 100,000 numbers from the range of about 270 million with this algorithm took 4 seconds.
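The listing above uses rand(), whose maximum value RAND_MAX is only guaranteed to be at least 32767, while the pseudocode assumes a bigrand() that covers the whole range. The following is a minimal sketch of the same selection rule built on the standard <random> header instead. The function name sampleSequential and the choice of std::mt19937_64 are assumptions made for illustration, not part of the original notes.

```cpp
#include <cassert>
#include <random>
#include <vector>

// Selection sampling over 0..n-1 that returns the chosen values in
// increasing order. sampleSequential is a hypothetical name.
std::vector<int> sampleSequential(int m, int n, std::mt19937_64& gen) {
    assert(0 <= m && m <= n);
    std::vector<int> picked;
    picked.reserve(m);
    int select = m;                            // values still to be chosen
    for (int i = 0; i < n && select > 0; ++i) {
        const int remaining = n - i;           // values not yet examined
        std::uniform_int_distribution<int> dist(0, remaining - 1);
        if (dist(gen) < select) {              // true with probability select/remaining
            picked.push_back(i);
            --select;
        }
    }
    return picked;
}
```

The loop stops early once select reaches 0, which does not change the result because no later value could be chosen anyway. A caller would seed the generator once, for example with std::mt19937_64 gen(std::random_device{}()), and reuse it across calls.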

Method 2:

The running time of method 1 is proportional to the size of the search space n, which is still unacceptable for some applications, so further improvement is needed. One approach is to insert random numbers into a set until it holds m elements. The pseudocode is as follows:

initialize set S to empty
size = 0
while size < m do
        t = bigrand() % n
        if t is not in S
                insert t into S
                ++size
print the elements of S in sorted order

The C++ implementation follows. The set S is an STL set, which is typically implemented as a red-black tree. A set holds no duplicates: if the value to be inserted is already present, the insertion has no effect. Each insertion takes O(log m) time:

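#include <iostream>
#include <ctime>
#include <cstdlib>
#include <limits>
#include <set>
using namespace std;

// Insert random values into a set until it holds m distinct integers from 0..n-1.
// Note: rand() stands in for bigrand() here, which assumes RAND_MAX >= n-1.
void gensets(int m, int n)
{
    time_t t_start, t_end;
    t_start = time(NULL);
    set<int> S;
    while (S.size() < static_cast<size_t>(m))
        S.insert(rand() % n);       // a duplicate leaves the set unchanged
    // a set iterates in ascending order, so the output is already sorted
    for (set<int>::iterator iter = S.begin(); iter != S.end(); ++iter)
        cout << *iter << " ";
    cout << endl;
    t_end = time(NULL);
    cout << "elapsed time: " << difftime(t_end, t_start) << " s" << endl;
}

int main()
{
    int m = 100000;
    int n = numeric_limits<int>::max() / 8;
    srand(time(NULL));
    gensets(m, n);
    return 0;
}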

The algorithm runs in expected O(m log m) time when m is small relative to n, and uses O(m) space. Selecting 100,000 numbers from the same range of about 270 million took 2 seconds, noticeably faster than Knuth's method above.
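How many random draws the loop needs depends on how often a draw repeats a value already in the set. When the set holds k values, a draw is new with probability (n - k)/n, so it takes n/(n - k) draws on average to find the next one. The sketch below simply adds these terms up. The helper name expectedDraws is an assumption made for illustration.

```cpp
#include <cassert>

// Expected number of random draws the set-based method needs to collect
// m distinct values from 0..n-1. expectedDraws is a hypothetical helper.
double expectedDraws(int m, int n) {
    assert(0 <= m && m <= n);
    double total = 0.0;
    for (int k = 0; k < m; ++k) {
        // With k values already in the set, a draw is new with probability
        // (n - k) / n, so finding the next value takes n / (n - k) draws
        // on average.
        total += static_cast<double>(n) / (n - k);
    }
    return total;
}
```

For the test case in these notes (m = 100000, n = 268435455) the sum stays only slightly above m, on the order of 20 extra draws, so duplicates barely matter. As m approaches n the terms grow quickly and the method loses its advantage.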

Method 3:

Shuffle an array of n elements, then sort and output its first m elements. Later, Ashley Shepherd and Alex Woronow observed that only the first m elements of the array need to be shuffled. For how to generate a random permutation, see my article "shuffling program" and Wikipedia.

The C++ implementation of this method is as follows:

#include <iostream>
#include <ctime>
#include <cstdlib>
#include <limits>
#include <algorithm>
using namespace std;

// Generate a random number between i and j, both inclusive.
// Note: rand() stands in for bigrand() here, which assumes RAND_MAX >= j-i.
int randint(int i, int j)
{
    int ret = i + rand() % (j - i + 1);
    return ret;
}

// Shuffle only the first m positions of the array 0..n-1, then sort and print them.
void genshuf(int m, int n)
{
    time_t t_start, t_end;
    t_start = time(NULL);
    int i, j;
    int *x = new int[n];
    for (i = 0; i != n; ++i)
        x[i] = i;
    for (i = 0; i != m; ++i)
    {
        j = randint(i, n-1);
        int t = x[i]; x[i] = x[j]; x[j] = t; // swap x[i] and x[j]
    }
    sort(x, x + m);
    for (i = 0; i != m; ++i)
        cout << x[i] << " ";
    cout << endl;
    t_end = time(NULL);
    cout << "elapsed time: " << difftime(t_end, t_start) << " s" << endl;
    delete []x;
    x = NULL;
}

int main()
{
    int m = 100000;
    int n = numeric_limits<int>::max() / 8;
    srand(time(NULL));
    genshuf(m, n);
    return 0;
}


The algorithm runs in O(n + m log m) time and uses O(n) space. Selecting 100,000 numbers from the same range took 4 seconds, about the same as method 1. Part of that time goes to initializing the array. With the technique from Problem 1.9 of Programming Pearls, where an entry is initialized only when it is first used, the time complexity can be reduced to O(m log m). The O(n) space requirement, however, remains too large.
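One way to avoid both the O(n) initialization and the O(n) array is to keep only the entries that a swap has actually changed. The sketch below does this with a hash map. This is a swapped-in technique, not the Problem 1.9 method itself, and the name sampleSparseShuffle is an assumption made for illustration. Any position missing from the map is treated as still holding its initial value x[pos] = pos.

```cpp
#include <algorithm>
#include <cassert>
#include <random>
#include <unordered_map>
#include <vector>

// Partial shuffle over a "virtual" array x[pos] = pos. Only positions whose
// value has changed are stored. sampleSparseShuffle is a hypothetical name.
std::vector<int> sampleSparseShuffle(int m, int n, std::mt19937_64& gen) {
    assert(0 <= m && m <= n);
    std::unordered_map<int, int> moved;        // position -> current value
    auto valueAt = [&moved](int pos) {
        auto it = moved.find(pos);
        return it == moved.end() ? pos : it->second;
    };
    std::vector<int> picked;
    picked.reserve(m);
    for (int i = 0; i < m; ++i) {
        std::uniform_int_distribution<int> dist(i, n - 1);
        const int j = dist(gen);
        const int vi = valueAt(i);             // read both values before writing
        const int vj = valueAt(j);
        picked.push_back(vj);                  // value swapped into position i
        moved[j] = vi;                         // old x[i] moves to position j
    }
    std::sort(picked.begin(), picked.end());
    return picked;
}
```

Position i is never read again after step i, so its new value goes straight into the result instead of the map. Under these assumptions the map holds at most m entries, so the extra space is O(m), and the final sort dominates the running time.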

As for choosing between method 2 and method 3, a mathematical argument on Stack Overflow shows that when m < n, method 2 performs better than method 3.

References:

http://www.cnblogs.com/2010Freeze/archive/2012/02/27/2370284.html

http://hi.baidu.com/23star/blog/item/47f7314e5c3b0e01b2de0574.html

Taking random samples

A Sample of Brilliance, Programming Pearls
