An Efficient Method for Generating 13-Digit Random Numbers Without Duplicates

Source: Internet
Author: User

Problem description: find an efficient method for generating 13-digit random numbers with no duplicates.

 

Ideas:

1. Generate all the non-repeating random numbers in advance, store them, and take numbers from the store as needed.

2. Generate a random number and immediately compare it against all numbers generated so far. If it already exists, generate it again.

3. Find a good hash algorithm with no collisions (or with a low probability of collision).

4. Generate pseudo-random numbers with some algorithm that guarantees no similarity, or low similarity, within a certain order of magnitude.

Because the numbers must not repeat, no algorithm can be truly random. It can only prevent frequent collisions and similarity to a certain extent, so that the output appears random to the outside world.

 

 

Related methods and problems of idea 1:

Generate 1-10000000000000 in advance and shuffle them. Taking them out one after another then yields all 10 trillion values. The count of 13-digit numbers (10 trillion) is about 2 to the power of 43. Suppose each digit is stored in 4 bits (the minimum binary length that can express a digit up to 9). A 13-digit number then needs 52 bits, that is, about 7 bytes. Multiplied by 2 to the power of 43, the estimated total space is about 56 TB. Whatever storage optimization is used (such as the adjacent-grayscale compression algorithm from dynamic programming, or a database), the similarity of the data after shuffling is extremely low, so the optimized data is still at the TB level. An implementation based on this idea is therefore unrealistic.
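The storage estimate can be checked with a few lines of arithmetic. The sketch below uses Python only for illustration (the article does not prescribe a language), and the variable names are assumptions made for this example. It also shows the figure obtained when 10 trillion is not rounded down to 2 to the power of 43.

```python
import math

TOTAL = 10 ** 13          # count of 13-digit values, 0000000000000..9999999999999
BITS_PER_DIGIT = 4        # smallest width that can hold a digit 0..9
DIGITS = 13

bits_per_number = DIGITS * BITS_PER_DIGIT          # 52 bits
bytes_per_number = math.ceil(bits_per_number / 8)  # rounds 6.5 up to 7 bytes

log2_total = math.log2(TOTAL)                      # a little above 43
approx_bytes = bytes_per_number * 2 ** 43          # the article's rounded estimate
exact_bytes = bytes_per_number * TOTAL             # without rounding TOTAL to 2**43

TIB = 2 ** 40
print(f"log2(10^13) = {log2_total:.2f}")
print(f"rounded estimate: {approx_bytes / TIB:.0f} TiB")
print(f"unrounded:        {exact_bytes / TIB:.1f} TiB")
```

Either way the result is tens of terabytes, which supports the conclusion that pre-generating the whole range is impractical.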

 

Related methods and problems of idea 2:

 

1. Use the generated number as the primary key of a database table and let the database do the duplicate check. If the insert fails, the number already exists.

This method is easy to implement.

2. Keep the generated random numbers sorted, use binary search to look up each new number, and insert it at the appropriate position in the sorted sequence.

3. Build a balanced search tree by hand (a red-black tree or an AVL tree) and check whether the number already exists in the tree each time one is inserted.

 

All three methods can be implemented, but the third is the best in terms of efficiency. Relying on a database exception to detect duplicates clearly performs worst. For the second method, inserting into a linked list would be much cheaper, but binary search needs random access, so an array-like sequence must be used, and inserting into an array means shifting elements. The third method, using a red-black tree, is the most efficient.
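A minimal sketch of the second method, assuming Python's standard `bisect` module stands in for a hand-written binary search. The function names `insert_unique` and `generate_unique` are hypothetical. The lookup costs O(log n), but `list.insert` has to shift every later element, which is the array cost mentioned above.

```python
import bisect
import random

def insert_unique(sorted_values, value):
    """Insert value into the sorted list unless it is already present.

    Returns True if the value was new, False if it was a duplicate.
    """
    pos = bisect.bisect_left(sorted_values, value)   # binary search, O(log n)
    if pos < len(sorted_values) and sorted_values[pos] == value:
        return False                                 # duplicate: caller must redraw
    sorted_values.insert(pos, value)                 # O(n): shifts the tail of the array
    return True

def generate_unique(count, upper, rng):
    """Draw `count` distinct integers in [0, upper) by redrawing on duplicates."""
    seen = []
    while len(seen) < count:
        insert_unique(seen, rng.randrange(upper))
    return seen

numbers = generate_unique(1000, 10 ** 6, random.Random(42))
```

A balanced tree removes the O(n) shift, which is why the third method scales better for the same duplicate check.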

 

However, methods of this type share a common defect. Once more than 50% of the data has been generated, searching the tree is still efficient, but the tree (or array) has grown to almost terabytes in size. From this perspective it is not even as simple to implement as the first idea. In addition, suppose for example that eight out of every ten possible values have already been obtained. A newly generated number is then very likely to be one of those already drawn, and it has to be generated again. This analysis shows that the number of times the tree must be searched for each new random number grows as the tree grows, which is unacceptable.
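The cost of redrawing can be made concrete. If a fraction f of the range is already taken, each draw is new with probability 1 - f, so the expected number of draws per new number is 1 / (1 - f). The sketch below checks this by simulation on a small range. The 80% fill level, the range size and the helper names `expected_draws` and `draws_until_new` are assumptions chosen for this example.

```python
import random

def expected_draws(fill_fraction):
    """Expected number of draws to hit an unused value (geometric distribution)."""
    return 1.0 / (1.0 - fill_fraction)

def draws_until_new(used, upper, rng):
    """Count draws from [0, upper) until one is not in `used`."""
    draws = 1
    while rng.randrange(upper) in used:
        draws += 1
    return draws

rng = random.Random(0)
upper = 10_000
used = set(range(8_000))          # pretend 80% of the range is already taken

trials = 20_000
average = sum(draws_until_new(used, upper, rng) for _ in range(trials)) / trials
print(f"theory: {expected_draws(0.8):.1f} draws, simulated: {average:.2f}")
```

At 99% fill the same formula gives 100 draws per new number, each one requiring a lookup in the full structure.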

 

 

Related methods and problems of idea 3:

 

1. Hash the numbers 1-10000000000000 in sequence and check the results for collisions.

2. Use a strong hash algorithm such as MD5 to hash numbers within a fixed range. Different numbers give different results, and the result cannot be reversed. (The length must then be adjusted using a one-way, unique mapping.)

3. Borrow from the way a GUID is generated.

 

Method 1 has to store the results for collision detection. As with method 3 of idea 2, the storage volume is very large. Methods 2 and 3 share the same problem. An MD5 digest is 128 bits, usually written as 32 hexadecimal characters (0-9, a-f), with a shortened 16-character form also in use, and a GUID is likewise a 128-bit value. It is hard to find a mapping that reduces 16, 32, or even more characters to 13 digits and stays unique. Each position would have to be mapped from a larger character set (16 symbols for hexadecimal, or 62 for a 0-9a-zA-Z encoding, that is 10 + 26 + 26) down to the 10 digits 0-9, and the source set is vastly larger than the target set.
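To illustrate why folding a long digest into a short decimal string loses uniqueness, the sketch below hashes consecutive integers with MD5 (through Python's standard `hashlib`) and reduces each digest modulo a power of ten. The modulo reduction and the helper names `md5_to_digits` and `count_collisions` are assumptions made for this example, not the author's mapping. A small code length is used so that collisions show up quickly.

```python
import hashlib

def md5_to_digits(n, digits):
    """Hash the integer n with MD5 and fold the 128-bit digest into `digits` decimal digits."""
    digest = hashlib.md5(str(n).encode("ascii")).digest()   # 16 bytes = 128 bits
    value = int.from_bytes(digest, "big")
    return str(value % 10 ** digits).zfill(digits)

def count_collisions(inputs, digits):
    """Return how many inputs map to a code that an earlier input already produced."""
    seen = set()
    collisions = 0
    for n in inputs:
        code = md5_to_digits(n, digits)
        if code in seen:
            collisions += 1
        else:
            seen.add(code)
    return collisions

# 20,000 inputs into a space of 10**6 codes: the birthday estimate n*n/(2*m)
# predicts on the order of 200 collisions although only 2% of the space is used.
collisions = count_collisions(range(20_000), digits=6)
print("collisions:", collisions)
```

For the full 13-digit space the same birthday estimate suggests that collisions become likely after a few million values, far short of 10 trillion, so the results would still have to be stored and checked.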

 

 

Related methods and problems of idea 4:

This idea is recommended and easy to implement.

 

Segment link method:

 

First, we split the 13 digits into 6 + 6 + 1, which reduces the data volume of each segment to the order of millions (a 6-digit segment has at most 1000000 values). This split was chosen based on generation time and on the link method used afterwards.

 

The following are the measured times for generating six-digit random numbers:

 

Option 1: 1000000 random numbers are generated in 3 s.

Option 2: 1000000 random numbers are generated in 16 s.

Option 3: 1000000 random numbers are generated in 43 s.

Option 4: generating 1000000 random numbers takes more than 1 h.

Therefore, we chose the third option.

The generated data is as follows:

 

Use the same method to generate a second set of 6-digit random numbers.

 

For the link method, we connect horizontally adjacent, unrelated rows:

 

(Here we do not use a database Cartesian product. Instead, a program reads part of the data and automatically caches the next part when the current data runs out.)

With this method, we can obtain a maximum of 0.8 million records without duplicates, with no similarity within a million records (symmetric links). The overall space consumption also meets the requirements.

 

The following SQL script generates the random 6-digit keys in the database.

 

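USE Job
GO

-- Table of 6-digit keys; the unique index discards duplicate inserts instead of failing.
CREATE TABLE tb2 (id char(6))
CREATE UNIQUE INDEX IX_tb2 ON tb2 (id)
    WITH IGNORE_DUP_KEY
GO

DECLARE @dt datetime
SET @dt = GETDATE()
SET NOCOUNT ON
DECLARE @row int
SET @row = 800000        -- number of distinct keys wanted
WHILE @row > 0
BEGIN
    RAISERROR('need %d rows', 10, 1, @row) WITH NOWAIT
    SET ROWCOUNT @row    -- cap the insert at the number of rows still missing
    INSERT tb2 SELECT
        id = RIGHT(100000000 + CONVERT(bigint, ABS(CHECKSUM(NEWID()))), 6)
    FROM syscolumns c1, sysobjects o --, syscolumns c2
    SET @row = @row - @@ROWCOUNT     -- subtract the rows actually inserted, then retry
END
SET ROWCOUNT 0           -- remove the row cap so later statements in this session are not truncated
SELECT BeginDate = @dt, EndDate = GETDATE(), Second = DATEDIFF(Second, @dt, GETDATE())
GO

SELECT COUNT(*) FROM tb2
GO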

 

Data shuffling script:

SELECT IDENTITY(int, 1, 1) AS rownumber, * INTO tmp_tb FROM tb ORDER BY NEWID();

SELECT IDENTITY(int, 1, 1) AS rownumber, * INTO tmp_tb2 FROM tb2 ORDER BY NEWID();

 

Data merging is done under program control:

For the merge method, the key point is to record the position in the first table and the offset N between the two tables.
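A minimal sketch of how such a merge could work, assuming each shuffled table is held as a Python list of unique 6-digit strings and that the thirteenth digit is a simple counter. The names `make_segment_table`, `link_segments`, `offset` and `last_digit` are hypothetical and are not taken from the author's program. Row i of the first table is paired with row (i + offset) of the second, wrapping around at the end. Because the segments are unique within each table, every combination of position, offset and last digit yields a different 13-digit string.

```python
import random

def make_segment_table(size, rng):
    """Return `size` distinct 6-digit strings in random order."""
    return [f"{n:06d}" for n in rng.sample(range(10 ** 6), size)]

def link_segments(first, second, offset, last_digit):
    """Yield 13-digit numbers: first[i] + second[(i + offset) % len(second)] + last_digit."""
    n = len(second)
    for i, left in enumerate(first):
        right = second[(i + offset) % n]
        yield f"{left}{right}{last_digit}"

rng = random.Random(2024)
tb = make_segment_table(1000, rng)     # small stand-ins for the 800000-row tables
tb2 = make_segment_table(1000, rng)

# Walk a few offsets and last digits. The position in the first table and the
# current offset are the only state needed to resume generation later.
results = []
for last_digit in range(2):
    for offset in range(3):
        results.extend(link_segments(tb, tb2, offset, last_digit))
```

Only the two segment tables need to be stored, so the space cost stays at the size of two million-row tables rather than the full 13-digit range.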
