About fit and transform


Fit means fitting to the data: based on the data, it computes certain statistics, such as the mean and the variance. Many APIs then require these parameters for subsequent operations on the data, such as the transform described below.

Transform applies a transformation to the data; common transformations are standardization and normalization. Standardization requires the mean and the variance, and standardized data essentially approximates a standard normal distribution.

Most of the time, the training data and the test data need to be processed in turn. The training data is processed first, which means calling fit and then transform (or simply fit_transform). After that, the test data only needs transform, because fitting on the training data has already produced the mean and standard deviation.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

np.random.seed(42)

m = 100
X = 6 * np.random.rand(m, 1) - 3
y = 2 + X + 0.5 * X**2 + np.random.randn(m, 1)

X_train, X_val, y_train, y_val = train_test_split(
    X[:50], y[:50].ravel(), test_size=0.5, random_state=10)

poly_scaler = Pipeline([
    ("poly_features", PolynomialFeatures(degree=90, include_bias=False)),
    ("std_scaler", StandardScaler()),
])

# fit_transform learns the scaling statistics from the training data and applies them;
# transform reuses those statistics on the validation data
X_train_poly_scaled = poly_scaler.fit_transform(X_train)
X_val_poly_scaled = poly_scaler.transform(X_val)
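To confirm that transform on the validation set reuses the statistics learned during fit, one can inspect the fitted scaler step of the pipeline. This is a small check added for illustration; it is not part of the original example:

std_scaler = poly_scaler.named_steps["std_scaler"]
print(std_scaler.mean_[:5])   # per-feature means learned from X_train only
print(std_scaler.scale_[:5])  # per-feature standard deviations learned from X_train only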

The object involved here is the standard scaler. Its purpose is to keep a single feature with very large values from distorting the processing: to reduce the influence of any single feature on the whole, the mean of the data set is subtracted and the result is divided by the standard deviation. scikit-learn provides a primitive, function-style implementation of this:

>>> from sklearn import preprocessing
>>> import numpy as np
>>> X_train = np.array([[ 1., -1.,  2.],
...                     [ 2.,  0.,  0.],
...                     [ 0.,  1., -1.]])
>>> X_scaled = preprocessing.scale(X_train)
>>> X_scaled
array([[ 0.  ..., -1.22...,  1.33...],
       [ 1.22...,  0.  ..., -0.26...],
       [-1.22...,  1.22..., -1.06...]])
>>> X_scaled.mean(axis=0)
array([0., 0., 0.])
>>> X_scaled.std(axis=0)
array([1., 1., 1.])

Finally, we see that the scaled data meets the requirement: the mean is 0 and the standard deviation is 1. Note that the axis argument here is 0, which computes the statistic per column and returns one value per column (a single row); axis=1 would compute it per row and return one value per row (a single column). The axis handling comes up again below.
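As a cross-check on the claim above (subtract the column mean, divide by the column standard deviation), here is a minimal NumPy-only sketch using the same X_train; it is not in the original post but reproduces what preprocessing.scale computes:

import numpy as np

X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])

# subtract the per-column mean, divide by the per-column standard deviation
X_manual = (X_train - X_train.mean(axis=0)) / X_train.std(axis=0)

print(X_manual)               # same values as preprocessing.scale(X_train)
print(X_manual.mean(axis=0))  # [0. 0. 0.] (up to floating-point error)
print(X_manual.std(axis=0))   # [1. 1. 1.]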

This is the raw processing in sklearn; there is also a class that encapsulates the same processing: StandardScaler.

>>> scaler = preprocessing.StandardScaler().fit(X_train)
>>> scaler
StandardScaler(copy=True, with_mean=True, with_std=True)
>>> scaler.mean_
array([1.  ..., 0.  ..., 0.33...])
>>> scaler.scale_
array([0.81..., 0.81..., 1.24...])
>>> scaler.transform(X_train)
array([[ 0.  ..., -1.22...,  1.33...],
       [ 1.22...,  0.  ..., -0.26...],
       [-1.22...,  1.22..., -1.06...]])

This is the fit-transform pattern described above: after fit, the scaler holds the mean and the standard deviation, and transform then applies them to the data to produce the final matrix. Next, let us check whether that matrix really has mean 0 and standard deviation 1.

import numpy as np

formated_data = scaler.transform(X_train)
print(np.mean(formated_data, 0))  # per-column means
print(np.std(formated_data))      # standard deviation over all elements

The output:

[0. 0. 0.]

1.0

Note that np.mean is given a second argument, axis=0 (compute the mean per column and return one row). The result is completely different when no axis is passed: without an axis argument, np.mean returns a single value, the mean over all of the elements.
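A minimal illustration of the difference (the array values here are made up):

import numpy as np

a = np.array([[1., 2.],
              [3., 4.]])

print(np.mean(a, 0))  # per-column means: [2. 3.]
print(np.mean(a, 1))  # per-row means:    [1.5 3.5]
print(np.mean(a))     # mean over all elements: 2.5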

As mentioned above, transform is implemented here as standardization. What exactly does it do to the data, and why standardize at all?

First, standardization/normalization scales (maps) the data into a range, such as [0, 1] or [-1, 1], or [0, 255] for colors in graphics processing. The benefit is that features of different dimensions end up in a similar range of values, so gradient descent follows a simpler path (the elliptical contours become closer to circles).

The principle behind scaling is about units of measure. Take height and fingernail width: if both are measured in centimeters, they are not on the same order of magnitude, but if height is measured in meters, its range of values is actually similar to that of fingernail width. Scaling thus prevents one dimension from dominating the learning result.

Common standardization/normalization methods

1. StandardScaler (Z-score standardization): this is standardization; each element is processed with the following formula:

x = (x - μ) / σ

Standard scaling is only suitable when the data is approximately normally distributed; it changes the distribution of the data into a relatively standard normal distribution.

2. MinMaxScaler: this is in fact normalization. MinMaxScaler does not change the shape of the data distribution but rescales it according to a rule; the formula is as follows:

x = (x - min) / (max - min)

It is suitable when the data is fairly concentrated, with no outliers or very few outliers; otherwise an extreme max value skews the result. MinMaxScaler is also unstable: if new data is added, the min and max must be recomputed (see the sketch after this list).

3. RobustScaler: a scaler that is robust to outliers; if the data contains many outliers, use this method to process it.

4. Non-linear normalization: for data whose values span very different magnitudes, the data can be scaled with log, exponential, or arctangent mappings;

Log function: x = lg(x) / lg(max); arctangent function: x = atan(x) * 2 / pi
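A minimal side-by-side sketch of these scalers on a small made-up column containing one outlier; the data and the specific calls are chosen for illustration and are not from the original post:

import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# One feature column with an outlier (100.) to show how each scaler reacts
X = np.array([[1.], [2.], [3.], [4.], [100.]])

print(StandardScaler().fit_transform(X).ravel())  # z-score: (x - mean) / std
print(MinMaxScaler().fit_transform(X).ravel())    # (x - min) / (max - min); squashed by the outlier
print(RobustScaler().fit_transform(X).ravel())    # median and IQR; much less affected by the outlier

# Non-linear scalings from item 4 (the log form assumes positive data)
x = np.array([1., 10., 100., 1000.])
print(np.log10(x) / np.log10(x.max()))  # x = lg(x) / lg(max)
print(np.arctan(x) * 2 / np.pi)         # x = atan(x) * 2 / pi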

To summarize:

1) For classification and clustering algorithms that involve distance computations, and for PCA dimensionality reduction, StandardScaler performs better;

2) When distance and covariance are not involved, or the data does not follow a normal distribution, MinMaxScaler can be used;

Normality tests can be carried out with the SciPy library;
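For example, a minimal sketch using scipy.stats.normaltest on synthetic data (this code is illustrative and not from the original post):

import numpy as np
from scipy import stats

rng = np.random.RandomState(0)
sample = rng.normal(loc=0.0, scale=1.0, size=1000)

# D'Agostino-Pearson normality test: a small p-value would reject normality
statistic, p_value = stats.normaltest(sample)
print(statistic, p_value)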

What is covariance? It is a quantity that indicates whether two variables x and y vary together (i.e., whether they are linearly related).
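A quick illustration with NumPy and made-up data:

import numpy as np

x = np.array([1., 2., 3., 4., 5.])
y = 2 * x + 1                        # y grows with x
z = np.array([2., 5., 1., 4., 3.])   # no clear relationship with x

print(np.cov(x, y)[0, 1])  # 5.0  -> strong positive covariance
print(np.cov(x, z)[0, 1])  # 0.25 -> near zero, weak linear relationship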

Reference:

https://www.cnblogs.com/bjwu/p/8977141.html

The realization of normality test

72861387

http://scikit-learn.org/stable/modules/preprocessing.html#standardization-or-mean-removal-and-variance-scaling

51461696
