Random data generation for machine learning algorithms

Last Update:2016-11-10 Source: Internet

Author: User


When learning machine learning algorithms, we often need data to validate an algorithm and tune its parameters, but it is not easy to find a data set that suits a particular type of algorithm. Fortunately, both NumPy and scikit-learn can generate random data. We can generate data for a given model ourselves, clean, normalize, and transform it, and then choose a model and algorithm to fit and predict. The following summarizes how scikit-learn and NumPy generate data samples.

1. NumPy random data generation API

NumPy is better suited to producing simple sampled data. The APIs are in the numpy.random module, and the commonly used ones are:

1) rand(d0, d1, ..., dn) generates an array of shape d0 x d1 x ... x dn. The values lie in the half-open interval [0, 1).

For example: np.random.rand(3,2,2) outputs a 3x2x2 array as follows

array([[[0.49042678, 0.60643763],
        [0.18370487, 0.10836908]],

       [[0.38269728, 0.66130293],
        [0.5775944 , 0.52354981]],

       [[0.71705929, 0.89453574],
        [0.36245334, 0.37545211]]])
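As a quick check of the shape and value range, the sketch below draws the same kind of 3x2x2 array. The fixed seed is an arbitrary choice made here only so that repeated runs give the same numbers; it is not part of the original example.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed, assumed here only for repeatability

samples = np.random.rand(3, 2, 2)

print(samples.shape)                                  # (3, 2, 2)
print(samples.min() >= 0.0 and samples.max() < 1.0)   # True
```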

2) randn(d0, d1, ..., dn) also generates an array of shape d0 x d1 x ... x dn, but the values follow the standard normal distribution N(0,1).

For example: np.random.randn(3,2) outputs the following 3x2 array, whose values are samples from N(0,1).

array([[-0.5889483 , -0.34054626],
       [-2.03094528, -0.21205145],
       [-0.20804811, -0.97289898]])

If you need samples from a normal distribution $N(\mu, \sigma^2)$, apply the transform $\sigma x + \mu$ to each value x generated by randn.

For example: 2*np.random.randn(3,2) + 1 outputs the following 3x2 array, whose values are samples from N(1,4).

array([[ 2.32910328, -0.677016  ],
       [-0.09049511,  1.04687598],
       [ 2.13493001,  3.30025852]])
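The scale-and-shift rule can be checked empirically. The following is a minimal sketch, assuming a large sample size and a fixed seed (both arbitrary choices made for this illustration): the sample mean should land near mu and the sample variance near sigma squared.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed so the run is repeatable

mu, sigma = 1.0, 2.0            # target distribution N(1, 4)
x = np.random.randn(100000)     # samples from N(0, 1)
y = sigma * x + mu              # samples from N(mu, sigma^2)

print(y.mean())  # close to mu = 1
print(y.var())   # close to sigma^2 = 4
```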

3) randint(low[, high, size]) generates random integers with the shape given by size, which can be an integer, a matrix shape, or a tensor shape. The values lie in the half-open interval [low, high). If high is omitted, the values lie in [0, low).

For example: np.random.randint(3, size=[2,3,4]) returns an array of shape 2x3x4. The values are integers in [0, 3), so the largest possible value is 2.

array([[[2, 1, 2, 1],
        [0, 1, 2, 1],
        [2, 1, 0, 2]],

       [[0, 1, 0, 0],
        [1, 1, 2, 1],
        [1, 0, 1, 2]]])

Another example: np.random.randint(3, 6, size=[2,3]) returns an array of shape 2x3. The values are integers in the half-open interval [3, 6).

array([[4, 5, 3],
       [3, 4, 5]])
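The half-open interval is easy to get wrong, so here is a small sketch that draws many values and lists the distinct ones. The sample size of 10000 is an arbitrary choice for illustration.

```python
import numpy as np

a = np.random.randint(3, size=10000)       # only low given: integers in [0, 3)
b = np.random.randint(3, 6, size=10000)    # integers in [3, 6)

# The upper bound never appears among the drawn values.
print(np.unique(a))   # distinct values, all from {0, 1, 2}
print(np.unique(b))   # distinct values, all from {3, 4, 5}
```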

4) random_integers(low[, high, size]) is similar to randint above. The difference is that the values are drawn from the closed interval [low, high]. It has been deprecated since NumPy 1.11 in favor of randint(low, high + 1).
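A closed interval can also be obtained from randint by raising the upper bound by one. The helper name closed_randint below is hypothetical, introduced only for this sketch, and is not a NumPy function.

```python
import numpy as np

def closed_randint(low, high, size=None):
    """Hypothetical helper: integers drawn uniformly from [low, high]."""
    # randint excludes its upper bound, so shift it up by one.
    return np.random.randint(low, high + 1, size=size)

rolls = closed_randint(1, 6, size=10000)   # e.g. simulated die rolls
print(rolls.min(), rolls.max())            # both stay within 1..6
```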

5) random_sample([size]) returns random floating-point numbers in the half-open interval [0.0, 1.0). For another interval [a, b), use the transform (b - a) * random_sample([size]) + a.

For example: (5-2)*np.random.random_sample(3) + 2 returns 3 random numbers in [2, 5).

array([2.87037573, 4.33790491, 2.1662832 ])
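The interval transform can be wrapped in a small function. This is a sketch only; the name uniform_interval is hypothetical and chosen for this illustration.

```python
import numpy as np

def uniform_interval(a, b, size=None):
    """Hypothetical helper: uniform floats in the half-open interval [a, b)."""
    # random_sample draws from [0, 1); scale by the width and shift by a.
    return (b - a) * np.random.random_sample(size) + a

values = uniform_interval(2, 5, size=1000)
print(values.min() >= 2, values.max() < 5)
```

NumPy's np.random.uniform(low, high, size) draws from the same half-open interval directly, so the helper is mainly useful for seeing the arithmetic spelled out.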

2. Introduction to the scikit-learn random data generation API

Scikit-learn's random data generators are in the sklearn.datasets module. Compared with NumPy, they can generate data suited to a specific machine learning model. The commonly used APIs are:

1) Use make_regression to generate regression model data

2) Use make_hastie_10_2, make_classification, or make_multilabel_classification to generate classification model data

3) Use make_blobs to generate clustering model data

4) Use make_gaussian_quantiles to generate grouped multidimensional normal distribution data

3. Scikit-learn random data generation examples

3.1 Regression model random data

Here we use make_regression to generate regression model data. The key parameters are n_samples (number of samples generated), n_features (number of features per sample), noise (standard deviation of the Gaussian noise added to the output), and coef (whether the regression coefficients are returned). The example code is as follows:

    import numpy as np
    import matplotlib.pyplot as plt
    # Jupyter notebook magic for inline plots; omit it in a plain script
    %matplotlib inline
    from sklearn.datasets import make_regression

    # X is the sample features, y is the sample output, coef is the regression coefficient;
    # 1000 samples in total, 1 feature per sample
    X, y, coef = make_regression(n_samples=1000, n_features=1, noise=10, coef=True)

    # Plot the samples and the line given by the regression coefficient
    plt.scatter(X, y, color='black')
    plt.plot(X, X * coef, color='blue', linewidth=3)
    plt.xticks(())
    plt.yticks(())
    plt.show()

[Figure: scatter plot of the samples in black, with the line y = coef * x drawn in blue]
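Without a plot, the generated data can still be sanity-checked numerically. The sketch below is an illustration, not part of the original example: it fixes random_state (an arbitrary choice) and fits a slope by ordinary least squares with np.linalg.lstsq, which should land close to the returned coefficient because make_regression uses a zero intercept by default.

```python
import numpy as np
from sklearn.datasets import make_regression

X, y, coef = make_regression(n_samples=1000, n_features=1, noise=10,
                             coef=True, random_state=0)

print(X.shape, y.shape)   # (1000, 1) (1000,)

# Least-squares slope through the origin (default bias is 0.0).
slope = np.linalg.lstsq(X, y, rcond=None)[0][0]
print(float(coef), slope)  # the two values should be close
```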

3.2 Classification model random data

Here we use make_classification to generate three-class classification data. The key parameters are n_samples (number of samples generated), n_features (number of features per sample), n_redundant (number of redundant features), and n_classes (number of output classes). The example code is as follows:

    import numpy as np
    import matplotlib.pyplot as plt
    %matplotlib inline
    from sklearn.datasets import make_classification

    # X1 is the sample features, Y1 is the sample class output; 400 samples in total,
    # 2 features per sample, 3 output classes, no redundant features, one cluster per class
    X1, Y1 = make_classification(n_samples=400, n_features=2, n_redundant=0,
                                 n_clusters_per_class=1, n_classes=3)
    plt.scatter(X1[:, 0], X1[:, 1], marker='o', c=Y1)
    plt.show()

[Figure: scatter plot of the 400 samples, colored by class]
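To confirm what make_classification returns, the following sketch inspects the shapes and the class counts. The random_state value is an arbitrary assumption added here for repeatability.

```python
import numpy as np
from sklearn.datasets import make_classification

X1, Y1 = make_classification(n_samples=400, n_features=2, n_redundant=0,
                             n_clusters_per_class=1, n_classes=3,
                             random_state=0)

print(X1.shape)          # (400, 2)
print(np.unique(Y1))     # [0 1 2]
print(np.bincount(Y1))   # samples per class, roughly balanced
```

The counts are not exactly equal in general, because the default flip_y setting reassigns a small fraction of labels at random.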

3.3 Clustering model random data

Here we use make_blobs to generate clustering model data. The key parameters are n_samples (number of samples generated), n_features (number of features per sample), centers (number of cluster centers, or a list of custom cluster centers), and cluster_std (standard deviation of each cluster, which controls how tightly the cluster is concentrated). An example is as follows:

    import numpy as np
    import matplotlib.pyplot as plt
    %matplotlib inline
    from sklearn.datasets import make_blobs

    # X is the sample features, y is the cluster label of each sample; 1000 samples in total,
    # 2 features per sample, 3 clusters with centers at [-1,-1], [1,1], [2,2]
    # and cluster standard deviations [0.4, 0.5, 0.2]
    X, y = make_blobs(n_samples=1000, n_features=2, centers=[[-1,-1], [1,1], [2,2]],
                      cluster_std=[0.4, 0.5, 0.2])
    plt.scatter(X[:, 0], X[:, 1], marker='o', c=y)
    plt.show()

[Figure: scatter plot of the 1000 samples, colored by cluster]
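The effect of centers and cluster_std can be seen by measuring each generated cluster. In this sketch the three centers, the standard deviations, and the random_state are illustrative assumptions; the per-cluster mean should sit near the requested center and the per-feature standard deviation near the requested value.

```python
import numpy as np
from sklearn.datasets import make_blobs

centers = [[-1, -1], [1, 1], [2, 2]]   # assumed centers for this sketch
stds = [0.4, 0.5, 0.2]                 # one standard deviation per cluster

X, y = make_blobs(n_samples=1000, n_features=2, centers=centers,
                  cluster_std=stds, random_state=0)

for k in range(3):
    pts = X[y == k]
    # Empirical center and spread of cluster k
    print(k, len(pts), pts.mean(axis=0).round(2), pts.std(axis=0).round(2))
```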

3.4 Grouped normal distribution data

We use make_gaussian_quantiles to generate grouped, multidimensional normal distribution data. The key parameters are n_samples (number of samples generated), n_features (dimension of the normal distribution), mean (mean of each feature), cov (covariance coefficient: the covariance matrix is cov times the identity matrix), and n_classes (number of groups the data is divided into by quantile of the normal distribution). An example is as follows:

    import numpy as np
    import matplotlib.pyplot as plt
    %matplotlib inline
    from sklearn.datasets import make_gaussian_quantiles

    # Generate a 2-D normal distribution and divide the data into 3 groups by quantile;
    # 1000 samples, the 2 feature means are 1 and 2, covariance coefficient is 2
    X1, Y1 = make_gaussian_quantiles(n_samples=1000, n_features=2, n_classes=3,
                                     mean=[1,2], cov=2)
    plt.scatter(X1[:, 0], X1[:, 1], marker='o', c=Y1)
    plt.show()

[Figure: scatter plot of the 1000 samples, colored by quantile group]
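Grouping by quantile means the classes form nested rings around the mean: class 0 holds the samples closest to the mean and the last class holds the farthest. The sketch below checks this by computing each sample's distance from the mean; the random_state is an arbitrary assumption for repeatability.

```python
import numpy as np
from sklearn.datasets import make_gaussian_quantiles

mean = np.array([1.0, 2.0])
X1, Y1 = make_gaussian_quantiles(n_samples=1000, n_features=2, n_classes=3,
                                 mean=mean, cov=2, random_state=0)

# Euclidean distance of every sample from the distribution mean
dist = np.linalg.norm(X1 - mean, axis=1)

for k in range(3):
    d = dist[Y1 == k]
    # Group size and the range of distances it covers
    print(k, len(d), d.min().round(2), d.max().round(2))
```

Each group's largest distance should not exceed the next group's smallest one, and the group sizes should be nearly equal.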

The above is a summary of random data generation. I hope it helps those who are learning machine learning algorithms.

(Reprints are welcome; please credit the source. Feedback is welcome at: [email protected])
