Random selection of a massive log

Source: Internet
Author: User

The original question is as follows:

Assume that a 100 GB log file exists on disk, and each log occupies no more than 100 bytes. Select N logs at random from the file such that every log has the same probability of being selected.

Solution 1:

The simplest and most rigorous approach is to scan the logs twice.

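First pass: count the total number of logs, M.

Second pass: scan the logs again and select each one with probability (N - s)/(M - t), where s is the number of logs already selected and t is the number already scanned, stopping once N logs have been selected. The first log is therefore selected with probability N/M. Using a fixed probability of N/M for every log would not guarantee exactly N logs, and stopping early would favor logs near the start of the file.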
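
The two-pass approach can be sketched in Python as follows. This is a minimal sketch, not a definitive implementation: it assumes one log per line, and the function name and parameters are hypothetical. It keeps each log with probability (logs still needed)/(logs still unread), which returns exactly N logs with equal probability for each.

```python
import random

def two_pass_sample(path, n, rng=random):
    """Pick n lines uniformly at random using two sequential scans."""
    # Pass 1: count the total number of log lines, M.
    with open(path, "rb") as f:
        total = sum(1 for _ in f)
    if n > total:
        raise ValueError("cannot sample more lines than the file contains")

    # Pass 2: keep each line with probability (still needed) / (still unread).
    chosen = []
    needed = n
    with open(path, "rb") as f:
        for seen, line in enumerate(f):
            remaining = total - seen
            if rng.random() * remaining < needed:
                chosen.append(line)
                needed -= 1
                if needed == 0:
                    break  # nothing more to select, stop reading early
    return chosen
```

Because the second pass stops as soon as the N-th log is selected, it often reads less than the whole file, but the first pass always reads everything.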

Solution 2:

Because each log is at most 100 bytes long, the file contains at least K = 100 GB / 100 bytes = 10^9 logs. You can therefore sample while scanning, using an inflated probability such as 10 * N/K for each log. The number of logs selected this way will almost certainly be greater than N, while their total size is far smaller than 100 GB. Then randomly select N logs from this sample using the first method.
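
A minimal Python sketch of this oversample-then-pick idea, assuming one log per line. The function name, the `min_count` parameter (the lower bound K on the number of logs), and the `factor` parameter (the multiplier 10) are hypothetical names chosen for illustration.

```python
import random

def oversample_then_pick(path, n, min_count, factor=10, rng=random):
    """One disk scan: oversample by coin flip, then subsample in memory."""
    p = min(1.0, factor * n / min_count)  # inflated keep-probability
    candidates = []
    with open(path, "rb") as f:
        for line in f:
            if rng.random() < p:
                candidates.append(line)
    if len(candidates) < n:
        # Unlikely when factor is large, but possible; the caller can rescan.
        raise RuntimeError("too few candidates; retry with a larger factor")
    # Uniform choice of n items from the in-memory candidates.
    return rng.sample(candidates, n)
```

The file is read from disk only once; the second selection step runs over the much smaller in-memory candidate list.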

PS: Solutions 1 and 2 are relatively simple, but they share a serious problem: every log must be read, and reading 100 GB of data is very time-consuming. Even in the ideal case where the disk reaches the vendor's rated throughput of 80 MB/s, reading 100 GB takes 1250 s, that is, more than 20 minutes. Is there any way to reduce the amount of data read from the disk?
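
The read-time estimate can be checked with a few lines of Python, assuming decimal units (1 GB = 1000 MB) and a sustained sequential rate of 80 MB/s:

```python
total_mb = 100 * 1000         # 100 GB expressed in MB (decimal units)
throughput_mb_s = 80          # sequential read rate assumed in the text
seconds = total_mb / throughput_mb_s
minutes = seconds / 60
print(seconds, minutes)       # 1250 seconds, a little under 21 minutes
```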

Solution 3: Let the total file size be S = 100 GB and divide the file logically into B = 1,000,000 blocks, so that each block has size k = S/B = 100 KB. Then traverse the blocks logically, giving each block a 1/1000 probability of being chosen. About 1000 blocks are selected in this way and read by offset, and N logs are then randomly selected from those 1000 blocks. The amount of data actually read is about 100 MB, which is very small.

Note: With the third method, the probability of selecting each log may differ slightly, but given the huge data volume this difference can be ignored.
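
A rough Python sketch of the block-based approach, under several assumptions not stated in the original: logs are newline-delimited, a fixed number of distinct blocks is drawn (instead of flipping a coin for every block), and the partial lines at both edges of each block are discarded. The function and parameter names are hypothetical.

```python
import os
import random

def block_sample(path, n, num_blocks, blocks_to_read, rng=random):
    """Read a few randomly chosen blocks and sample n complete lines from them."""
    size = os.path.getsize(path)
    block_size = -(-size // num_blocks)  # ceiling division
    picked = rng.sample(range(num_blocks), blocks_to_read)
    lines = []
    with open(path, "rb") as f:
        for b in sorted(picked):         # ascending offsets keep seeks moving forward
            f.seek(b * block_size)
            chunk = f.read(block_size)
            parts = chunk.split(b"\n")
            # The first and last pieces may be cut by block boundaries: drop them.
            lines.extend(parts[1:-1])
    if len(lines) < n:
        raise RuntimeError("too few complete lines; read more blocks")
    return rng.sample(lines, n)
```

Dropping the edge pieces is one source of the slight non-uniformity mentioned in the note: a log that straddles a block boundary can never be selected in this sketch.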

From the first comment on this post I learned of an algorithm called reservoir sampling. The principle is as follows:

Call the selected line choice. Initially, take the first line as choice. Then replace choice with the second line with probability 1/2, with the third line with probability 1/3, and so on. The following pseudocode describes the procedure:
i = 0
while more input lines
    with probability 1.0 / ++i
        choice = this input line
print choice
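
The pseudocode above can be written as a small Python function. This is a sketch; the function name and the `rng` parameter are illustrative choices, not part of the original.

```python
import random

def random_line(lines, rng=random):
    """Return one item chosen uniformly from an iterable of unknown length."""
    choice = None
    for i, line in enumerate(lines, start=1):
        # Replace the current choice with probability 1/i.
        if rng.random() < 1.0 / i:
            choice = line
    return choice
```

Passing an open file object as `lines` selects one line in a single pass without knowing the line count in advance.
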
This method is ingenious because it can be proved that, after N lines have been scanned, each line has been selected with probability 1/N. In other words, every line is equally likely, so the random selection is uniform.
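The 1/N claim can be verified directly. Line i becomes choice at step i with probability 1/i, and it then survives each later step j only if line j does not replace it, which happens with probability 1 - 1/j. Multiplying these gives a telescoping product:

```latex
\[
P(\text{line } i \text{ is the final choice})
  = \frac{1}{i} \prod_{j=i+1}^{N} \left(1 - \frac{1}{j}\right)
  = \frac{1}{i} \cdot \frac{i}{i+1} \cdot \frac{i+1}{i+2} \cdots \frac{N-1}{N}
  = \frac{1}{N}
\]
```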
Looking back at this question, we can extend it: how do we randomly draw K numbers from a sample space that is very large or of unknown size?
By analogy, we get the answer: put the first K numbers into the reservoir. For the (K+1)-th number, decide with probability K/(K+1) whether to move it into the reservoir; if so, it replaces a randomly chosen item in the reservoir. Continue in the same way, admitting the i-th number with probability K/i. For a sample space of any size N, each number ends up selected with probability K/N. That is to say, all numbers are equally likely to be selected.
Pseudocode:
init: a reservoir of size k, holding the first k values
for i = k + 1 to n
    m = random(1, i)    // uniform integer in [1, i], both ends included
    if (m <= k)
        swap the m-th value and the i-th value
end
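
A minimal Python sketch of the size-k reservoir, written for a stream instead of an in-memory array, so the selected item overwrites a reservoir slot instead of being swapped. The function name is hypothetical. Note the comparison is `m <= k` with `randint` inclusive at both ends, so item i is admitted with probability k/i.

```python
import random

def reservoir_sample(items, k, rng=random):
    """Return k items chosen uniformly from an iterable of unknown length."""
    reservoir = []
    for i, item in enumerate(items, start=1):
        if i <= k:
            reservoir.append(item)       # fill the reservoir with the first k items
        else:
            m = rng.randint(1, i)        # uniform integer in 1..i, both ends included
            if m <= k:                   # happens with probability k/i
                reservoir[m - 1] = item  # evict the item in slot m
    return reservoir
```

For the log problem, passing an open file object as `items` and N as `k` selects N logs in one pass, though the whole file is still read.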
