Sampling Problems: Reading Notes on Programming Pearls

Last Update:2018-12-03 Source: Internet

Author: User

Question: given two integers m and n with m < n, output a sorted list of m random integers. Each integer lies in the range 0 to n-1 and may appear at most once.

Method 1:

The method proposed in Knuth's book Seminumerical Algorithms traverses the n numbers in order and selects each element that passes a random test.

An example illustrates the random test. Take m = 2 and n = 5. The first element, 0, is selected with probability 2/5. The probability that the second element, 1, is selected depends on whether 0 was selected: if 0 was selected, it is 1/4; otherwise, it is 2/4. Overall, 1 is selected with probability (2/5) * (1/4) + (3/5) * (2/4) = 2/5. Similarly, the probability that element 2 is selected depends on the first two choices. If neither was selected, 2 is selected with probability 2/3; if exactly one was selected, with probability 1/3; if both were selected, with probability 0. Neither of the first two is selected with probability (3/5) * (2/4) = 3/10, and exactly one is selected with probability (2/5) * (3/4) + (3/5) * (2/4) = 6/10, so 2 is selected with probability (3/10) * (2/3) + (6/10) * (1/3) = 2/5. Continuing in this way, every element is selected with probability 2/5.

In general, if s elements still have to be selected from the remaining r elements, the next element is selected with probability s/r. Over the whole data set, every element has the same probability of being selected.
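The claim that the s/r rule gives every element the same overall probability can be checked exactly for small cases. The sketch below is an illustration only, not code from the book: the function name selectionProbabilities is hypothetical, and it tracks the probability of having selected k elements so far, then applies the s/r rule at each position.

```cpp
#include <cassert>
#include <vector>

// Exact probability that each element i in [0, n) is selected when the next
// element is taken with probability s/r (s still needed, r remaining).
// Hypothetical helper written for this note.
std::vector<double> selectionProbabilities(int m, int n)
{
    assert(0 <= m && m <= n);
    // state[k] = probability that exactly k elements were selected so far
    std::vector<double> state(m + 1, 0.0);
    state[0] = 1.0;
    std::vector<double> result(n, 0.0);
    for (int i = 0; i < n; ++i) {
        int remaining = n - i;
        std::vector<double> next(m + 1, 0.0);
        for (int k = 0; k <= m; ++k) {
            if (state[k] == 0.0) continue;           // unreachable state
            double p = static_cast<double>(m - k) / remaining; // s / r
            result[i] += state[k] * p;               // element i is selected
            if (k < m) next[k + 1] += state[k] * p;
            next[k] += state[k] * (1.0 - p);         // element i is skipped
        }
        state = next;
    }
    return result;
}
```

For m = 2 and n = 5, every entry of the returned vector equals 2/5 up to rounding, matching the hand calculation above.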

The code for this idea is as follows:

    select = m
    remaining = n
    for i = [0, n)
            if (bigrand() % remaining) < select
                    print i
                    --select
            --remaining

First, this algorithm selects exactly m elements, no more and no fewer. It never selects more than m, because once select reaches 0 no further integer can be selected. It never selects fewer than m, because when select/remaining = 1 the current element is always selected: bigrand() % remaining < remaining always holds, so i is always selected.
Second, every element is selected with the same probability, m/n. The C++ implementation follows. To make the performance easy to compare with the later methods, the test selects 100,000 integers from a range of 268435455 values (about 270 million, the largest int divided by 8). The largest int itself, about 2.15 billion, would have been the natural range to test, but method 3 must first allocate an array of that size, which is more memory than the program could obtain.
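    #include <iostream>
    #include <ctime>
    #include <cstdlib>
    #include <limits>
    using namespace std;

    // Knuth's selection sampling: print m sorted random integers from [0, n).
    // Assumes RAND_MAX >= n - 1; otherwise rand() must be replaced by a
    // generator with a wider range (the bigrand() of the pseudocode).
    void genknuth(int m, int n)
    {
        time_t t_start, t_end;
        t_start = time(NULL);
        for (int i = 0; i != n; ++i)
            // select i with probability m / (n - i)
            if ((rand() % (n - i)) < m)
            {
                cout << i << " ";
                --m;
            }
        cout << endl;
        t_end = time(NULL);
        cout << "elapsed time: " << difftime(t_end, t_start) << " s" << endl;
    }

    int main()
    {
        int m = 100000;
        int n = numeric_limits<int>::max() / 8;
        srand(time(NULL));
        genknuth(m, n);
        cout << "n = " << n << endl;
        return 0;
    }
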
This algorithm needs only O(1) extra space, because each selected number is printed as soon as it is found (O(m) if the output is stored), and its time complexity is O(n). Selecting 100,000 numbers from the range of about 270 million took 4 seconds.
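The listing above depends on rand() covering the whole range, which is not guaranteed on every platform. As a hedged alternative, here is a minimal sketch of the same selection rule using the standard <random> header. The name sampleKnuth and the choice to return a vector instead of printing are assumptions made for this note, and the loop stops early once nothing is left to select, which does not change the result.

```cpp
#include <cassert>
#include <random>
#include <vector>

// Selection sampling: returns m distinct integers from [0, n) in increasing
// order. Hypothetical variant of genknuth that does not rely on RAND_MAX.
std::vector<int> sampleKnuth(int m, int n, std::mt19937 &rng)
{
    assert(0 <= m && m <= n);
    std::vector<int> out;
    out.reserve(m);
    int select = m;                          // elements still to pick
    for (int i = 0; i < n && select > 0; ++i) {
        int remaining = n - i;               // candidates not yet examined
        std::uniform_int_distribution<int> dist(0, remaining - 1);
        if (dist(rng) < select) {            // true with probability select/remaining
            out.push_back(i);
            --select;
        }
    }
    return out;
}
```

A caller would construct a generator once, for example std::mt19937 rng(std::random_device{}()), and pass it in. Because elements are visited in increasing order, the result needs no sorting.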
Method 2:
The time required by method 1 is proportional to the size of the search space, n, which is still unacceptable for some applications, so it needs further improvement. One approach is to insert random integers into a set until the set holds m elements. The pseudocode is as follows:
    initialize set S to empty
    size = 0
    while size < m do
            t = bigrand() % n
            if t is not in S
                    insert t into S
                    ++size
    print the elements of S in sorted order

The C++ implementation follows. The set S uses the set container provided by the STL, which is typically implemented as a red-black tree. A set never holds duplicates: if the value to be inserted is already present, the insertion has no effect. Each insertion takes O(log m) time:
    #include <iostream>
    #include <ctime>
    #include <cstdlib>
    #include <limits>
    #include <set>
    using namespace std;

    // Insert random integers from [0, n) into a set until it holds m values.
    // Assumes RAND_MAX >= n - 1, as in method 1.
    void gensets(int m, int n)
    {
        time_t t_start, t_end;
        t_start = time(NULL);
        set<int> S;
        while (static_cast<int>(S.size()) < m)
            S.insert(rand() % n);   // duplicates are silently ignored
        // a set iterates in sorted order
        for (set<int>::iterator iter = S.begin(); iter != S.end(); ++iter)
            cout << *iter << " ";
        cout << endl;
        t_end = time(NULL);
        cout << "elapsed time: " << difftime(t_end, t_start) << " s" << endl;
    }

    int main()
    {
        int m = 100000;
        int n = numeric_limits<int>::max() / 8;
        srand(time(NULL));
        gensets(m, n);
        return 0;
    }

The time complexity of the algorithm is O(m log m) (expected, when m is small relative to n), and the space complexity is O(m). Selecting 100,000 numbers from the same range of about 270 million took 2 seconds, visibly faster than Knuth's method above.
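The running time of method 2 depends on how many random draws are wasted on values already in the set. When the set holds k values, a draw succeeds with probability (n - k)/n, so it takes n/(n - k) draws on average to add the next value. The sketch below sums these terms; the function name expectedDraws is hypothetical and the code is an illustration of this arithmetic, not part of the original program.

```cpp
#include <cassert>

// Expected number of random draws method 2 makes before the set holds m
// distinct values from [0, n). Hypothetical helper written for this note.
double expectedDraws(int m, int n)
{
    assert(0 <= m && m <= n);
    double total = 0.0;
    for (int k = 0; k < m; ++k)
        // with k values already in the set, a draw is new with chance (n-k)/n
        total += static_cast<double>(n) / (n - k);
    return total;
}
```

For the test parameters used here (m = 100000, n = 268435455) the sum is only about 19 more than m, so almost no draws are wasted. As m approaches n the sum grows quickly, for example to 137/12, about 11.4, for m = n = 5.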
Method 3:
Shuffle an array of n elements, then sort and output the first m of them. Ashley Shepherd and Alex Woronow later observed that only the first m elements of the array need to be shuffled. For how to generate a random permutation, refer to my article "shuffling program" and Wikipedia.
The C++ implementation of this method is as follows:
    #include <iostream>
    #include <ctime>
    #include <cstdlib>
    #include <limits>
    #include <algorithm>
    using namespace std;

    // Generate a random number between i and j, both inclusive.
    // Assumes RAND_MAX >= j - i.
    int randint(int i, int j)
    {
        int ret = i + rand() % (j - i + 1);
        return ret;
    }

    void genshuf(int m, int n)
    {
        time_t t_start, t_end;
        t_start = time(NULL);
        int i, j;
        int *x = new int[n];
        for (i = 0; i != n; ++i)
            x[i] = i;
        // shuffle only the first m positions
        for (i = 0; i != m; ++i)
        {
            j = randint(i, n - 1);
            int t = x[i]; x[i] = x[j]; x[j] = t; // swap x[i] and x[j]
        }
        sort(x, x + m);
        for (i = 0; i != m; ++i)
            cout << x[i] << " ";
        cout << endl;
        t_end = time(NULL);
        cout << "elapsed time: " << difftime(t_end, t_start) << " s" << endl;
        delete[] x;
        x = NULL;
    }

    int main()
    {
        int m = 100000;
        int n = numeric_limits<int>::max() / 8;
        srand(time(NULL));
        genshuf(m, n);
        return 0;
    }

The time complexity of the algorithm is O(n + m log m), and the space complexity is O(n). Selecting 100,000 numbers from the same range of about 270 million took 4 seconds, similar to method 1. Part of that time is spent initializing the array. With the technique from Problem 1.9 of Programming Pearls, which initializes an array entry only when it is first used, the time complexity can be reduced to O(m log m). However, the O(n) space complexity is still too large.
As for choosing between method 2 and method 3, a mathematical argument on Stack Overflow shows that when m < n, method 2 performs better than method 3.
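One way to see what "initialize an entry only when it is used" buys is the sketch below. It is not the Problem 1.9 technique itself: it swaps in a hash map that records only the array entries that differ from the default x[i] = i, which also reduces the space to O(m). The name genshufSparse and the use of std::mt19937 are assumptions made for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <random>
#include <unordered_map>
#include <vector>

// Partial shuffle over a virtual array x[i] = i, storing only changed entries.
// Returns m distinct integers from [0, n) in increasing order.
// Hypothetical variant of genshuf; a hash map replaces the O(n) array.
std::vector<int> genshufSparse(int m, int n, std::mt19937 &rng)
{
    assert(0 <= m && m <= n);
    std::unordered_map<int, int> moved;      // index -> current value, if changed
    auto get = [&](int idx) {
        auto it = moved.find(idx);
        return it == moved.end() ? idx : it->second;
    };
    std::vector<int> out;
    out.reserve(m);
    for (int i = 0; i < m; ++i) {
        std::uniform_int_distribution<int> dist(i, n - 1);
        int j = dist(rng);
        int xi = get(i), xj = get(j);
        out.push_back(xj);                   // value that lands in x[i] after the swap
        moved[j] = xi;                       // x[j] receives the old x[i]
    }
    std::sort(out.begin(), out.end());
    return out;
}
```

Position i is never read again after step i, so only the entry at j has to be recorded. The loop does O(m) expected work and the final sort gives O(m log m) overall.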
References:
Http://www.cnblogs.com/2010Freeze/archive/2012/02/27/2370284.html
Http://hi.baidu.com/23star/blog/item/47f7314e5c3b0e01b2de0574.html
Taking random samples
A Sample of Brilliance
Programming Pearls

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
