An Introduction to Random Data Generation for Machine Learning Algorithms

When learning machine learning algorithms, we often need data to validate an algorithm and tune its parameters, but it is not easy to find a dataset perfectly suited to a particular kind of algorithm. Fortunately, both NumPy and scikit-learn provide functions for random data generation: we can generate data for a given model ourselves, use the random data for cleaning, normalization, and transformation, and then choose a model and algorithm for fitting and prediction. The following is a summary of how scikit-learn and NumPy generate data samples.

1. NumPy Random Data Generation APIs

NumPy is better suited to producing simple sampled data. The APIs live in the numpy.random module; the commonly used ones are:

1) rand(d0, d1, ..., dn) generates an array of shape d0 x d1 x ... x dn whose values are uniformly distributed in the half-open interval [0, 1).

For example, np.random.rand(3, 2, 2) outputs a 3x2x2 array like the following:

array([[[0.49042678, 0.60643763],
        [0.18370487, 0.10836908]],

       [[0.38269728, 0.66130293],
        [0.5775944 , 0.52354981]],

       [[0.71705929, 0.89453574],
        [0.36245334, 0.37545211]]])


2) randn(d0, d1, ..., dn) also generates an array of shape d0 x d1 x ... x dn, but the values follow the standard normal distribution N(0, 1).

For example, np.random.randn(3, 2) outputs the following 3x2 array, containing samples from N(0, 1):

array([[-0.5889483 , -0.34054626],
       [-2.03094528, -0.21205145],
       [-0.20804811, -0.97289898]])

If you need samples from a general normal distribution N(μ, σ²), simply transform each value x generated by randn into σx + μ.

For example, 2*np.random.randn(3, 2) + 1 outputs the following 3x2 array, containing samples from N(1, 4):

array([[ 2.32910328, -0.677016  ],
       [-0.09049511,  1.04687598],
       [ 2.13493001,  3.30025852]])

3) randint(low[, high, size]) generates random integers; size can be an integer, a matrix dimension, or a tensor dimension. The values lie in the half-open interval [low, high).

For example, np.random.randint(3, size=[2, 3, 4]) returns an array of shape 2x3x4 whose values are integers in [0, 3), i.e. at most 2:

array([[[2, 1, 2, 1],
        [0, 1, 2, 1],
        [2, 1, 0, 2]],

       [[0, 1, 0, 0],
        [1, 1, 2, 1],
        [1, 0, 1, 2]]])

Another example: np.random.randint(3, 6, size=[2, 3]) returns an array of shape 2x3 with values in the half-open interval [3, 6):

array([[4, 5, 3],
       [3, 4, 5]])

4) random_integers(low[, high, size]) is similar to randint above; the difference is that the range of values is the closed interval [low, high]. (It has since been deprecated in NumPy in favor of randint.)
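Because random_integers is deprecated, a minimal sketch of obtaining the same closed interval [low, high] from randint may be useful; the bounds 3 and 6 below are purely illustrative and not from the original text:

```python
import numpy as np

# randint samples from the half-open interval [low, high), so passing
# high + 1 yields the closed interval [low, high] that random_integers
# used to cover.
low, high = 3, 6  # illustrative bounds
samples = np.random.randint(low, high + 1, size=1000)
print(samples.min(), samples.max())  # all values fall within [3, 6]
```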


5) random_sample([size]) returns random floats in the half-open interval [0.0, 1.0). For another interval [a, b), the output can be transformed as (b - a) * random_sample([size]) + a.

For example, (5 - 2) * np.random.random_sample(3) + 2 returns 3 random numbers in [2, 5):

array([2.87037573, 4.33790491, 2.1662832 ])
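All of the numpy.random calls above draw from a shared global generator, so when debugging an algorithm against generated data it helps to seed the generator for reproducibility. A minimal sketch:

```python
import numpy as np

# Seeding the global generator makes the "random" samples reproducible,
# which is handy when validating an algorithm against generated data.
np.random.seed(42)
a = np.random.rand(3, 2)

np.random.seed(42)  # re-seed with the same value
b = np.random.rand(3, 2)

print(np.array_equal(a, b))  # True: same seed, same samples
```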

2. Introduction to the Scikit-learn Random Data Generation APIs

Scikit-learn generates random data in the datasets module. Compared with NumPy, it can be used to generate data suited to a specific machine learning model. The commonly used APIs are:

1) use make_regression to generate regression model data

2) use make_hastie_10_2, make_classification, or make_multilabel_classification to generate classification model data

3) use make_blobs to generate clustering model data

4) use make_gaussian_quantiles to generate grouped multidimensional normal distribution data
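Of the classification generators listed above, make_hastie_10_2 is the only one not demonstrated in section 3, so a minimal sketch of it may be useful; the n_samples and random_state values below are arbitrary choices, not from the original text:

```python
import numpy as np
from sklearn.datasets import make_hastie_10_2

# make_hastie_10_2 draws 10 standard-normal features per sample and labels
# each sample +1 or -1, giving a binary classification benchmark problem.
X, y = make_hastie_10_2(n_samples=2000, random_state=0)

print(X.shape)       # (2000, 10)
print(np.unique(y))  # [-1.  1.]
```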

3. Scikit-learn Random Data Generation Examples

3.1 Regression Model Random Data

Here we use make_regression to generate regression model data. The key parameters are n_samples (number of samples generated), n_features (number of features per sample), noise (random noise in the samples), and coef (whether to return the regression coefficients). The example code is as follows:

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.datasets import make_regression

# X is the sample feature, y is the sample output, coef is the regression
# coefficient; 1,000 samples with 1 feature each
X, y, coef = make_regression(n_samples=1000, n_features=1, noise=10, coef=True)

# draw
plt.scatter(X, y, color='black')
plt.plot(X, X * coef, color='blue', linewidth=3)
plt.xticks(())
plt.yticks(())
plt.show()

The graph of the output is as follows:

3.2 Classification Model Random Data

Here we use make_classification to generate three-class classification model data. The key parameters are n_samples (number of samples generated), n_features (number of features per sample), n_redundant (number of redundant features), and n_classes (number of output classes). The example code is as follows:

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.datasets import make_classification

# X1 holds the sample features and Y1 the sample class labels; 400 samples,
# 2 features each, 3 output classes, no redundant features, one cluster per class
X1, Y1 = make_classification(n_samples=400, n_features=2, n_redundant=0,
                             n_clusters_per_class=1, n_classes=3)
plt.scatter(X1[:, 0], X1[:, 1], marker='o', c=Y1)
plt.show()


The graph of the output is as follows:

3.3 Clustering Model Random Data

Here we use make_blobs to generate clustering model data. The key parameters are n_samples (number of samples generated), n_features (number of features per sample), centers (number of cluster centers, or custom cluster centers), and cluster_std (the cluster standard deviation, which reflects how tightly each cluster is grouped). An example follows:

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.datasets import make_blobs

# X holds the sample features and y the sample cluster labels; 1,000 samples,
# 2 features each, 3 clusters centered at [-1,-1], [1,1] (assumed; the middle
# center is illegible in the source) and [2,2], with cluster standard
# deviations [0.4, 0.5, 0.2]
X, y = make_blobs(n_samples=1000, n_features=2,
                  centers=[[-1, -1], [1, 1], [2, 2]],
                  cluster_std=[0.4, 0.5, 0.2])
plt.scatter(X[:, 0], X[:, 1], marker='o', c=y)
plt.show()


The graph of the output is as follows:

3.4 Grouped Normal Distribution Mixture Data

We use make_gaussian_quantiles to generate grouped multidimensional normal distribution data. The key parameters are n_samples (number of samples generated), n_features (dimension of the normal distribution), mean (feature means), cov (the sample covariance coefficient), and n_classes (the number of groups the data is split into by quantile of the normal distribution). An example follows:

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.datasets import make_gaussian_quantiles

# Generate a 2-dimensional normal distribution whose samples are split into
# 3 groups by quantile; 1,000 samples, feature means 1 and 2, covariance
# coefficient 2
X1, Y1 = make_gaussian_quantiles(n_samples=1000, n_features=2, n_classes=3,
                                 mean=[1, 2], cov=2)
plt.scatter(X1[:, 0], X1[:, 1], marker='o', c=Y1)


The graph of the output is as follows:
