fit computes statistics from the data, so-called fitting: based on the data, it calculates indicators such as the mean and variance. Many APIs then require these parameters for subsequent operations on the data, such as the transform below.
transform deforms the data; common deformations are standardization and normalization. Standardization needs the mean and variance, and essentially maps the data toward a standard normal distribution.
Most of the time, training data and test data need to be processed in turn. Process the training data first: call fit and then transform, or use fit_transform directly. Then process the test data: a plain transform is enough, because fitting on the training data has already produced the mean and variance indicators.
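This pattern can be sketched with scikit-learn's StandardScaler (a minimal illustration; the tiny train/test arrays here are made up):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0], [2.0], [3.0]])  # "training" data
test = np.array([[2.0], [4.0]])          # "test" data

scaler = StandardScaler()
scaler.fit(train)                        # fit: compute mean and variance of the training data
train_scaled = scaler.transform(train)   # transform: apply (x - mean) / std

# fit_transform is equivalent to fit followed by transform
train_scaled2 = StandardScaler().fit_transform(train)

# the test data reuses the statistics learned from the training data
test_scaled = scaler.transform(test)
print(scaler.mean_)  # statistics come from the training data only
```

Note that `scaler.transform(test)` does not recompute anything: it applies the training mean and variance to the test data, which is exactly why a second fit is unnecessary.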
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

np.random.seed(42)
m = 100
X = 6 * np.random.rand(m, 1) - 3
y = 2 + X + 0.5 * X**2 + np.random.randn(m, 1)
X_train, X_val, y_train, y_val = train_test_split(X[:50], y[:50].ravel(), test_size=0.5, random_state=10)
poly_scaler = Pipeline([
    ("poly_features", PolynomialFeatures(degree=90, include_bias=False)),
    ("std_scaler", StandardScaler()),
])
X_train_poly_scaled = poly_scaler.fit_transform(X_train)
X_val_poly_scaled = poly_scaler.transform(X_val)
One of the objects involved here is the standard scaler, which is designed to prevent a single feature with very large values from dominating and distorting the processing. To reduce the effect of any single feature on the whole, each feature is centered by subtracting its mean and then divided by its standard deviation. sklearn also exposes this as a plain NumPy-level function:
>>> from sklearn import preprocessing
>>> import numpy as np
>>> X_train = np.array([[ 1., -1.,  2.],
...                     [ 2.,  0.,  0.],
...                     [ 0.,  1., -1.]])
>>> X_scaled = preprocessing.scale(X_train)
>>> X_scaled
array([[ 0.  ..., -1.22...,  1.33...],
       [ 1.22...,  0.  ..., -0.26...],
       [-1.22...,  1.22..., -1.06...]])
>>> X_scaled.mean(axis=0)
array([0., 0., 0.])
>>> X_scaled.std(axis=0)
array([1., 1., 1.])
Finally we see that the scaled data satisfies the requirement: the mean is 0 and the standard deviation is 1. (Note that axis=0 is specified here: it computes the statistic down each column and returns one row of values; axis=1 would compute per row and return one column.) Axis handling comes up again below.
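The axis behavior is easy to check directly (a small illustration; the array is arbitrary):

```python
import numpy as np

a = np.array([[1.0, 2.0],
              [3.0, 4.0]])

col_std = a.std(axis=0)  # axis=0: one standard deviation per column
row_std = a.std(axis=1)  # axis=1: one standard deviation per row
print(col_std)  # [1. 1.]
print(row_std)  # [0.5 0.5]
```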
This is the low-level processing in sklearn; there is also a class that encapsulates it: StandardScaler.
>>> scaler = preprocessing.StandardScaler().fit(X_train)
>>> scaler
StandardScaler(copy=True, with_mean=True, with_std=True)
>>> scaler.mean_
array([1.  ..., 0.  ..., 0.33...])
>>> scaler.scale_
array([0.81..., 0.81..., 1.24...])
>>> scaler.transform(X_train)
array([[ 0.  ..., -1.22...,  1.33...],
       [ 1.22...,  0.  ..., -0.26...],
       [-1.22...,  1.22..., -1.06...]])
This is the fit-transform pattern described above: after fit, the mean and std are available; transform then deforms the data into the final matrix. Below we check whether that matrix really satisfies mean 0 and standard deviation 1:
import numpy as np
formatted_data = scaler.transform(X_train)
print(np.mean(formatted_data, 0))
print(np.std(formatted_data))
Output:
[0. 0. 0.]
1.0
Note that np.mean is passed a second argument here, with value 0 (the mean of each column, returning one row). The result is completely different without it: with no axis argument, np.mean returns a single scalar, the mean over all elements.
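A quick demonstration of that difference (the array is arbitrary):

```python
import numpy as np

data = np.array([[1.0, 2.0],
                 [3.0, 4.0]])

overall = np.mean(data)        # no axis: one scalar over all elements -> 2.5
per_column = np.mean(data, 0)  # axis=0: one mean per column -> [2. 3.]
print(overall)
print(per_column)
```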
As mentioned in the description of transform above, the internal implementation is standardization. What exactly does it do to the data, and why standardize at all?
First, normalization/standardization scales (maps) the data into a range, such as [0, 1] or [-1, 1], or [0, 255] for colors in graphics processing. The benefit is that data along different dimensions falls within similar ranges of values, so gradient descent follows a simpler path (the elongated ellipse of the loss contours becomes closer to a circle), as shown in the figure:
The principle behind scaling is a matter of units. Take height and nail width: if both are measured in centimeters, the two are not on the same order of magnitude; but if height is expressed in meters, its range of values turns out to be similar to that of nail width. Scaling thus prevents one dimension from dominating the learning result.
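The effect can be seen on a small example (the height/nail-width numbers below are hypothetical):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# hypothetical sample: height in centimeters vs. nail width in centimeters
X = np.array([[160.0, 1.0],
              [175.0, 1.2],
              [182.0, 1.1],
              [158.0, 0.9]])

raw_range = X.max(axis=0) - X.min(axis=0)
print(raw_range)     # the raw ranges differ by well over an order of magnitude

X_scaled = StandardScaler().fit_transform(X)
scaled_range = X_scaled.max(axis=0) - X_scaled.min(axis=0)
print(scaled_range)  # after scaling, both features span comparable ranges
```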
Common normalization/standardization methods
1. StandardScaler (z-score standardization): this is standardization; each element is processed by the following formula:
x' = (x - μ) / σ
Standard scaling only suits scenarios where the data is approximately normally distributed; it changes the distribution of the data into a relatively standard normal distribution.
2. MinMaxScaler: this is in fact normalization. MinMaxScaler does not change the shape of the data distribution, but scales it according to a rule; the processing formula is as follows:
x' = (x - min) / (max - min)
It suits data that is fairly concentrated overall, with no outliers or very few outliers; otherwise a deviant max will make the calculation inaccurate. MinMaxScaler is also unstable: if new data is added, min and max need to be recalculated.
3. RobustScaler: a very robust algorithm; if the data has many outliers, use this method to process it (it centers by the median and scales by the interquartile range).
4. Non-linear normalization: for scenarios where the data spans very different magnitudes, it can be scaled using log, exponential, or arctangent functions.
Log function: x' = lg(x) / lg(max); arctangent function: x' = arctan(x) * 2 / π
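The behaviour of these scalers can be compared on a small sample containing one outlier (an illustrative sketch; the data is made up):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler

# one feature with a single large outlier (100.0)
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# MinMaxScaler: (x - min) / (max - min); the outlier squeezes the rest toward 0
mm = MinMaxScaler().fit_transform(X).ravel()
print(mm)

# RobustScaler: centers by the median, scales by the IQR; the outlier has little effect
rb = RobustScaler().fit_transform(X).ravel()
print(rb)

# non-linear log scaling: x' = lg(x) / lg(max) compresses the large value
lg = np.log10(X.ravel()) / np.log10(X.max())
print(lg)
```

With MinMaxScaler the four normal points all land below 0.04 while the outlier sits at 1.0; RobustScaler keeps the normal points spread between -1 and 0.5.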
Summary:
1) In classification and clustering algorithms that involve distance calculations, and in PCA dimensionality reduction, StandardScaler performs better;
2) When distance and covariance calculations are not involved, and the data does not follow a normal distribution, MinMaxScaler can be used.
Normality tests can be done with the scipy library.
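For example, the Shapiro-Wilk test in scipy.stats (a minimal sketch; the samples are randomly generated here):

```python
import numpy as np
from scipy import stats

rng = np.random.RandomState(0)
normal_sample = rng.normal(size=500)
uniform_sample = rng.uniform(size=500)

# Shapiro-Wilk test: the null hypothesis is that the sample is normally distributed
_, p_normal = stats.shapiro(normal_sample)
_, p_uniform = stats.shapiro(uniform_sample)

print(p_normal)    # typically large for a truly normal sample: normality not rejected
print(p_uniform)   # essentially zero: normality rejected
```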
What is covariance? A concept used to indicate the (linear) dependence between x and y: independent variables have zero covariance, although zero covariance alone does not guarantee independence.
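A quick look at covariance with np.cov (the samples are randomly generated for illustration):

```python
import numpy as np

rng = np.random.RandomState(42)
x = rng.rand(1000)
y_indep = rng.rand(1000)               # generated independently of x
y_dep = 2 * x + 0.1 * rng.randn(1000)  # linearly dependent on x

# np.cov returns the 2x2 covariance matrix; entry [0, 1] is cov(x, y)
c_indep = np.cov(x, y_indep)[0, 1]
c_dep = np.cov(x, y_dep)[0, 1]

print(c_indep)  # close to 0: no linear dependence detected
print(c_dep)    # clearly non-zero (roughly 2 * var(x))

# caveat: covariance only captures *linear* dependence; independence implies
# zero covariance, but zero covariance does not imply independence
```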
Reference:
https://www.cnblogs.com/bjwu/p/8977141.html
The realization of normality test: 72861387
http://scikit-learn.org/stable/modules/preprocessing.html#standardization-or-mean-removal-and-variance-scaling
About fit and transform: 51461696