[Translation] use SQL to generate uneven random numbers

Source: Internet
Author: User

Just as "the random number generation process is very important, we cannot turn a blind eye to it" (derived from the Robert R-oak National Laboratory ), we hope that you will spend a little bit of time completing this task based on the results of SQL Server MVP Jeff moden. For the use of SQL to generate random numbers, we will focus on generating unevenly distributed random numbers (uniformly distributed random numbers) based on the non-uniformly distributed random numbers ); this includes some statistical distribution basics to help you get started.

As we know, random numbers are very important in simulation (especially Monte Carlo simulation), and random numbers play equally important roles in cryptography and other high-tech fields. In addition, in our SQL Server, we sometimes need to generate a large amount of random data to test SQL Performance.

Because I am not a statistician, Here we only look at the random numbers generated using SQL and can clearly see their randomness, however, we will not go deep into the mathematical principles to see whether the randomness is true "random" or "seemingly random". Our intention is:ArticleMediumAlgorithmAnd whether the algorithm is sufficient in non-critical fields.

In general, the technology of converting a random number from a uniform random number to an uneven random number is to multiply the Uniform Random Number by the cumulative distribution function (CDF) to reverse the target data. In practice, it is difficult to estimate whether the cumulative distribution function is effective for a specific distribution, even if it is a function that is close to each other. Fortunately, the guys who are much smarter than us have already analyzed a variety of distributed functions in multiple fields and we can use them directly to meet most of our needs.

 

Test Tool

In our tests, we used standard SQL technology to use the pseudo urn (even distributed random number) function to generate float parameters and pass them to the conversion function ., we will use the scalar function including the schemabinding keyword to solve performance problems. However, you may also want to use the same table value function to test whether the performance can be further improved. First, generate test data.

 -- Data definition and Setup Declare @ numberofrns int, @ Lambda float -- For the Poisson nurns , @ Gaussianmean float -- For the normal nurns , @ Gaussianstdev float, @ lambdaexp float -- For the exponential nurns , @ Weibullalpha float -- For the Weibo nurns , @ Weibullbeta float, @ laplaceu float -- For the Laplace nurns , @ Laplaceb floatselect @ numberofrns = 10000, @ Lambda = 4.0 -- Lambda for the Poisson distribution , @ Gaussianmean = 5 -- Mean for the normal distribution , @ Gaussianstdev = 1.5 -- Standard deviation for the normal distribution , @ Lambda daexp = 1.5 -- Lambda for the exponential distribution , @ Weibo lalpha = 1.0 -- Alpha (scale) for the Weibo Distribution , @ Weibo lbeta = 1.5 -- Beta (SHAPE) for the Weibo Distribution , @ Laplaceu = 4.0-- Mu (location) for the Laplace Distribution , @ Laplaceb = 1.0 -- Beta (scale) for the Laplace Distribution  -- Create type distribution as table (eventid int, eventprob float, cumprob float) Declare @ binomial as distribution, @ duniform as distribution, @ multinomial as distribution -- Simulate a coin toss with a binomial distribution Insert into @ binomialselect 0, 0.5, 0.5 Union all select 1, 0.5, 1.0 -- Events returned by this discrete uniform distribution are the 6  -- Maid with the second occurrence of 1 Insert into @ duniformselect 1, 1. /6 ., 1. /6. union all select 2, 1. /6 ., 2. /6. union all select 3, 1. /6 ., 3. /6. union all select 5, 1. /6 ., 4. /6. union all Select 8, 1. /6 ., 5. /6. union all select 13, 1. /6 ., 1. -- Events returned by this multinomial distribution are the 5  -- Mersenne primes discovered in 1952 by Raphael M. Robinson Insert into @ multinomialselect 521 ,. 10 ,. 10 Union all select 607 ,. 25 ,. 35 Union all select 1279 ,. 30 ,. 65 Union all select 2203 ,. 15 ,. 80 Union all select 2281 ,. 2, 1.

(Translator's note: I changed the @ numberofrns parameter from W to 1 W because it takes too long to test. If you like it, you can change it as needed .)

The following is a test tool that generates a nurns (Random Number of non-even distribution) for the expected target distribution ):

-- Create random numbers for the selected DistributionsSelect top (@ numberofrns) randomuniform = urn--, Randompoisson = DBO. rn_poisson (@ lambda, urn), Randombinomial = DBO. rn_multinomial (@ binomial, urn), randomduniform = DBO. rn_multinomial (@ duniform, urn), randommultinomial = DBO. rn_multinomial (@ multinomial, urn), randomnormal = DBO. rn_normal (@ gaussianmean, @ gaussianstdev, urn, Rand (checksum (newid (), randomponential = DBO. rn_exponential (@ lambdaexp, urn), randomweibo = DBO. rn_weibo (@ Wei ullalpha, @ Wei ullbeta, urn), randomlapbeta = DBO. rn_laplace (@ laplaceu, @ laplaceb, urn) into # myrandomnumbers from sys. all_columns A1 cross apply sys. all_columns A2 cross apply (select rand (checksum (newid () urn (URN)

 

Next, we will introduce each distribution type one by one, but before that, we will first explain the questions you may ask about our testing tool:

    • The algorithm for generating the Gaussian distribution (nurn) requires two urns.
    • We use the rn_multinomial function to generate three different distribution types, because they are more common than the other two (which will be explained in detail later)
    • Before creating the rn_multinomial function, you must first create a custom table type. The custom table type is introduced in SQL Server 2008. Therefore, you cannot use this method in SQL Server 2005. Another alternative is to use XML
    • I have attached the script to the attachment of the article. You first run function. SQL, and then run nurns. SQL. It takes about 4.5 minutes to generate 2.5 million data records)

 

Evenly distributed random number ( Uniform Random Numbers ) ----- How to use the uniform method?

Since the distribution of all our data is based on a uniform random number, let's take a look at how the random numbers generated by standard SQL are even. If these so-called "even numbers" are not so "even", they will more or less affect our subsequent results.

The urn probability between {0, 1} is very simple, that is, 0.5. The variance for this interval is 1/12. Use the built-in AVG () and VAR () functions of SQL Server to summarize the million pieces of data we have generated. The results are similar to what we expected.

 

Populationmean Populationvar Samplemean Samplevar
4 0.083333 0.499650832838227 0.0832923192890653

The following bar chart clearly shows the distribution of data in our predefined range:

As we can see, although the results are not very "standard", they are enough for us and do not need to be accurate. Note that the above results I have selected after multiple tests have not selected a specific result. If you run the SQL statement again, you will find that the results you get are similar.

 

Polynomial Random Number

Before we begin to discuss multiple distributions, Let's first look at the discrete distributions of the other two types.

The widely known probability of coin throwing is actually the spread of Bai song. This is used to indicate the probability of a Two-choice (positive or negative) occurrence, return 0 or 1 (or another number) to indicate whether an event has occurred (it can also be understood as "successful" or "failed "). Of course, in terms of coin throwing, the probability of front and back is equivalent, both of which are 50%. But Bai Song's distribution also allows others with a probability of not 50%. After all, life is not always so fair.

Discrete and even distribution describes more than two events. For example, when the dice are still applied, the probability of each side is the same. Of course, this is similar to the method for generating random integers in a given range. Jeff moden has provided a description here.

Polynomial random numbers are more extensive than the preceding two distribution types. They simulate different probabilities of each individual event in a series of events.

Remember, our events do not need to be a set of simple numbers, such as 0, 1, 2, and 3. It can also be a set of any numbers (including negative numbers. for example, in our polynomial distribution, we chose Raphael M. robinson discovered the mesenny prime number in 1952. For our discrete distribution, we adopted the first six numbers of the Fibonacci series starting with 1. We can also change the event type to float by modifying the eventid column of the User-Defined table type distribution from int to float.

Now let's look at the three table variables we set up for testing. We can find that:

    • @ Binomial defines two events (0, 1). the probability of an event occurring is p 50%, and the probability of an event not occurring is 1-P.
    • @ Duniform defines six events (1, 2, 3, 5, 8, 13). Just like throwing a dice, the probability of each event is 1/6.
    • @ Multinomial defines five events (521,607,127 9, 2203,228 1). Each event has a different probability.

 

 
Create Function DBO. rn_multinomial (@ multinomial distribution readonly, @ urn float) returns int-- Cannot use with schemabindingAsbegin return isnull (select top 1 eventid from @ multinomial where @ urn <cumprob order by cumprob)-- Handle unlikely case where urn = exactly 1.0, (Select max (eventid) from @ multinomial) End

For each distribution we test, we have carried out a large number of test results, the following table shows the post probability percentage:

 

Randombinomial binomialfreq eventprob actpercentofevents

0 499778 0.5 0.499778000000

1 500222 0.5 0.500222000000

Randomduniform duniformfreq eventprob actpercentofevents

1 166288 0.166666 0.166288000000

2 166620 0.166666 0.166620000000

3 166870 0.166666 0.166870000000

5 166539 0.166666 0.166539000000

8 166693 0.166666 0.166693000000

13 166990 0.166666 0.166990000000

Randommultinomial multinomialfreq eventprob actpercentofevents

521 99719 0.1 0.099719000000

607 249706 0.25 0.249706000000

1279 300376 0.3 0.300376000000

2203 149633 0.15 0.149633000000

2281 200566 0.2 0.200566000000

The above table shows that the probability is basically consistent with the expected probability.

Random Number of Gaussian (or normal) Distribution

In a recent discussion, GPO, a member of the SSC Forum, posted a question about generating random numbers based on Gaussian (normal) distribution. So I started to study this problem (which is actually the source of inspiration in this article) and used the box Mahler Transformation Method Based on my rn_gaussian function, you can ignore my post on how to use urn to generate the nurn algorithm.

Gaussian distribution is a continuous distribution. We need to explain it a little bit. It often uses "standard" samples near the mean for analysis.

In addition to the average value, the standard deviation must be specified. The following functions help us to understand the shape of the normal distribution of the sample data.

Create Function DBO. rn_normal (@ mean float, @ STDev float, @ urn1 float, @ urn2 float) returns float with schemabindingasbegin-- Based on the box-Muller TransformReturn (@ STDev * SQRT (-2 * log (@ urn1) * Cos (2 * ACOs (-1.) * @ urn2) + @ meanend

 

First, compare the overall (expected) mean and variance with the average and variance of the example to see if the two are close.

Populationmean populationstdev samplemean samplestdev

5 1.5 5.00049965700704 1.50020497041145

Then, let's look at the chart. The interval in the figure is plus or minus the standard deviation of three averages.

 

If you are familiar with the "normal" distribution, you will see how standard (normal) This figure is. We also recognize that 1,000,000 (998,596) of 99.86%) within the standard deviation of our three averages. This is what we expect.

 

Exponential random number

Exponential random distribution can be distributed using CDF (accumulative distribution function) and can be easily expressed using a close expression. Exponential random is applied to physics, Water Science, reliability, and wait time.

 
Create Function DBO. rn_exponential (@ Lambda float, @ urn float) returns float with schemabindingasbegin return-log (@ URN)/@ lambdaend

First, we need to know that the overall mean is 1/Lambda and the standard variance is the square of 1/Lambda. We can see that our overall mean and variance are very close to the sample data.

Populationmean populationvar samplemean samplevar

0.6667 0.4444 0.666 0.4444

We can look at the probability density curve provided by Wikipedia. When Lambda is set to 1.5 (Blue Line), it is very similar to our data.

Weber distributed random number

The Weber Distribution has been studied in the university. According to my understanding, the Weber Distribution is also very regular, so it is very suitable for us to generate a Non-Uniform Random Number (nurn) here ). weber has been applied in many fields of statistics and engineering, including weather forecasting, insurance, Water Science, survival analysis, and reliability engineering. (The university professors I teach this course will be proud of me) and other fields.

The formula for generating the Weber distribution can be found in Wikipedia. Here, the implementation of the rn_weibo function in this method is also very simple. The two parameters are the shape and Scale Parameters (@ Wei ullalpha, @ Wei ullbeta)

 
Create Function DBO. rn_weibo (@ Alpha float, @ beta float, @ urn float) returns float with schemabindingasbegin return power (-1. /@ alpha) * log (1. -@ URN), 1. // @ beta) End

Weber Distribution is not so easy to calculate because the expression uses gamma distribution. The following figure shows the image with the shape parameter 1.0 and the scale parameter 1.5, and compares it with the image provided by Wikipedia.

 

Laplace Random Number

Maybe it's because I am a geek, or because our college teacher is very awesome. I like this course very much. When I know that Laplace also invented the Laplace distributed distribution, I will add this distribution to this article to express my respect for Laplace.

The Laplace Distribution is a continuous distribution. Fortunately, its cumulative distribution function (CDF) is very simple. One of our functions is the location parameter and the other is the scale parameter.

 
Create Function DBO. rn_laplace (@ u float, @ B float, @ urn float) returns float with schemabindingasbegin return @ U-@ B * log (1-2 * ABS (@ urn-0.5 )) * case when 0 <@ urn-0.5 then 1 When 0> @ urn-0.5 then-1 else 0 endend

The Laplace Distribution has an easy-to-calculate mean (@ U) and a standard variance (2 * @ B ^ 2). So we once again compare our overall sample data with the sample data.

Populationmean populationvar samplemean samplevar

4 2 4.0009 1.9975

We once again compared our data with the data distribution chart provided by Wikipedia. Here we take @ B = 4 (red line in the figure)

Summary

From the research in this article, we draw a conclusion that the generated uneven random number is basically correct, at least it is correct to compare it with the data provided by Wikipedia. In addition, we also found the correspondence between the population average and the variance. All the sample data is stored in the attached Excel file.

In nature, random is ubiquitous in humans and other fields. The tool simulates these randomness to help us find patterns in this chaotic world.

"Innovation is actually introducing patterns in an uncertain nature." ---- Eric Hoffer

Scientific and engineering research often uses random numbers to simulate natural phenomena. I hope our research will help people in these fields.

Those who are curious about why we don't need to use the Bai song distribution to generate uneven random numbers. Because SQL built-in functions do not allow us to generate multiple random numbers (newid () and rand () are not used because they have "Side effects "), we will continue to look for a better way to generate random numbers in SQL Server.

This article describes the distribution of alpha, beta, gamma, and F in statistics. However, the mathematical principles behind the generation of non-Even random numbers are more complex, so interested high-end readers can find their own information.

I hope to thank those brave and fearless readers who have read this article, especially those who have followed the process. Thank you for your interest in non-Even random numbers and for providing more useful examples in this field.

 

 

Bytes ------------------------------------------------------------------------------------------------------------

Original article:Generating Non-Uniform Random numbers with SQL

Translated by careyson

Note: The concepts in this article are a little more, so I replaced all the links in the English Wikipedia with Baidu.

Attachment: Click to download

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.