fit computes statistics from the data, so-called fitting: based on the data, it calculates indicators such as the mean and variance. Many APIs then require these parameters for subsequent operations on the data, such as the transform below.
transform deforms the data; common deformations are standardization and normalization. Standardization needs the mean and variance, and essentially maps the data toward a standard normal distribution.
Most of the time, training data and test data need to be processed in turn. Process the training data first: call fit and then transform, or use fit_transform directly. Then process the test data: a plain transform is enough, because fitting on the training data has already produced the mean and variance indicators.
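This pattern can be sketched with scikit-learn's StandardScaler (a minimal illustration; the tiny train/test arrays here are made up):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0], [2.0], [3.0]])  # "training" data
test = np.array([[2.0], [4.0]])          # "test" data

scaler = StandardScaler()
scaler.fit(train)                        # fit: compute mean and variance of the training data
train_scaled = scaler.transform(train)   # transform: apply (x - mean) / std

# fit_transform is equivalent to fit followed by transform
train_scaled2 = StandardScaler().fit_transform(train)

# the test data reuses the statistics learned from the training data
test_scaled = scaler.transform(test)
print(scaler.mean_)  # statistics come from the training data only
```

Note that `scaler.transform(test)` does not recompute anything: it applies the training mean and variance to the test data, which is exactly why a second fit is unnecessary.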
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

np.random.seed(42)
m = 100
X = 6 * np.random.rand(m, 1) - 3
y = 2 + X + 0.5 * X**2 + np.random.randn(m, 1)
X_train, X_val, y_train, y_val = train_test_split(X[:50], y[:50].ravel(), test_size=0.5, random_state=10)
poly_scaler = Pipeline([
    ("poly_features", PolynomialFeatures(degree=90, include_bias=False)),
    ("std_scaler", StandardScaler()),
])
X_train_poly_scaled = poly_scaler.fit_transform(X_train)
X_val_poly_scaled = poly_scaler.transform(X_val)
One of the objects involved here is the standard scaler, which is designed to prevent a single feature with very large values from dominating and distorting the processing. To reduce the effect of any single feature on the whole, each feature is centered by subtracting its mean and then divided by its standard deviation. sklearn also exposes this as a plain NumPy-level function:
>>> from sklearn import preprocessing
>>> import numpy as np
>>> X_train = np.array([[ 1., -1.,  2.],
...                     [ 2.,  0.,  0.],
...                     [ 0.,  1., -1.]])
>>> X_scaled = preprocessing.scale(X_train)
>>> X_scaled
array([[ 0.  ..., -1.22...,  1.33...],
       [ 1.22...,  0.  ..., -0.26...],
       [-1.22...,  1.22..., -1.06...]])
>>> X_scaled.mean(axis=0)
array([0., 0., 0.])
>>> X_scaled.std(axis=0)
array([1., 1., 1.])
Finally we see that the scaled data satisfies the requirement: the mean is 0 and the standard deviation is 1. (Note that axis=0 is specified here: it computes the statistic down each column and returns one row of values; axis=1 would compute per row and return one column.) Axis handling comes up again below.
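The axis behavior is easy to check directly (a small illustration; the array is arbitrary):

```python
import numpy as np

a = np.array([[1.0, 2.0],
              [3.0, 4.0]])

col_std = a.std(axis=0)  # axis=0: one standard deviation per column
row_std = a.std(axis=1)  # axis=1: one standard deviation per row
print(col_std)  # [1. 1.]
print(row_std)  # [0.5 0.5]
```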
This is the low-level processing in sklearn; there is also a class that encapsulates it: StandardScaler.
>>> scaler = preprocessing.StandardScaler().fit(X_train)
>>> scaler
StandardScaler(copy=True, with_mean=True, with_std=True)
>>> scaler.mean_
array([1.  ..., 0.  ..., 0.33...])
>>> scaler.scale_
array([0.81..., 0.81..., 1.24...])
>>> scaler.transform(X_train)
array([[ 0.  ..., -1.22...,  1.33...],
       [ 1.22...,  0.  ..., -0.26...],
       [-1.22...,  1.22..., -1.06...]])
This is the fit-transform pattern described above: after fit, the mean and std are available; transform then deforms the data into the final matrix. Below we check whether that matrix really satisfies mean 0 and standard deviation 1:
import numpy as np
formatted_data = scaler.transform(X_train)
print(np.mean(formatted_data, 0))
print(np.std(formatted_data))
Output:
[0. 0. 0.]
1.0
Note that np.mean is passed a second argument here, with value 0 (the mean of each column, returning one row). The result is completely different without it: with no axis argument, np.mean returns a single scalar, the mean over all elements.
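A quick demonstration of that difference (the array is arbitrary):

```python
import numpy as np

data = np.array([[1.0, 2.0],
                 [3.0, 4.0]])

overall = np.mean(data)        # no axis: one scalar over all elements -> 2.5
per_column = np.mean(data, 0)  # axis=0: one mean per column -> [2. 3.]
print(overall)
print(per_column)
```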
As mentioned in the description of transform above, the internal implementation is standardization. What exactly does it do to the data, and why standardize at all?
First, normalization/standardization scales (maps) the data into a range, such as [0, 1] or [-1, 1], or [0, 255] for colors in graphics processing. The benefit is that data along different dimensions falls within similar ranges of values, so gradient descent follows a simpler path (the elongated ellipse of the loss contours becomes closer to a circle), as shown in the figure:
The principle behind scaling is a matter of units. Take height and nail width: if both are measured in centimeters, the two are not on the same order of magnitude; but if height is expressed in meters, its range of values turns out to be similar to that of nail width. Scaling thus prevents one dimension from dominating the learning result.
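The effect can be seen on a small example (the height/nail-width numbers below are hypothetical):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# hypothetical sample: height in centimeters vs. nail width in centimeters
X = np.array([[160.0, 1.0],
              [175.0, 1.2],
              [182.0, 1.1],
              [158.0, 0.9]])

raw_range = X.max(axis=0) - X.min(axis=0)
print(raw_range)     # the raw ranges differ by well over an order of magnitude

X_scaled = StandardScaler().fit_transform(X)
scaled_range = X_scaled.max(axis=0) - X_scaled.min(axis=0)
print(scaled_range)  # after scaling, both features span comparable ranges
```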
Common normalization/standardization methods
1. StandardScaler (z-score standardization): this is standardization; each element is processed by the following formula:
x' = (x - μ) / σ
Standard scaling only suits scenarios where the data is approximately normally distributed; it changes the distribution of the data into a relatively standard normal distribution.
2. MinMaxScaler: this is in fact normalization. MinMaxScaler does not change the shape of the data distribution, but scales it according to a rule; the processing formula is as follows:
x' = (x - min) / (max - min)
It suits data that is fairly concentrated overall, with no outliers or very few outliers; otherwise a deviant max will make the calculation inaccurate. MinMaxScaler is also unstable: if new data is added, min and max need to be recalculated.
3. RobustScaler: a very robust algorithm; if the data has many outliers, use this method to process it (it centers by the median and scales by the interquartile range).
4. Non-linear normalization: for scenarios where the data spans very different magnitudes, it can be scaled using log, exponential, or arctangent functions.
Log function: x' = lg(x) / lg(max); arctangent function: x' = arctan(x) * 2 / π
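The behaviour of these scalers can be compared on a small sample containing one outlier (an illustrative sketch; the data is made up):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler

# one feature with a single large outlier (100.0)
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# MinMaxScaler: (x - min) / (max - min); the outlier squeezes the rest toward 0
mm = MinMaxScaler().fit_transform(X).ravel()
print(mm)

# RobustScaler: centers by the median, scales by the IQR; the outlier has little effect
rb = RobustScaler().fit_transform(X).ravel()
print(rb)

# non-linear log scaling: x' = lg(x) / lg(max) compresses the large value
lg = np.log10(X.ravel()) / np.log10(X.max())
print(lg)
```

With MinMaxScaler the four normal points all land below 0.04 while the outlier sits at 1.0; RobustScaler keeps the normal points spread between -1 and 0.5.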
Summary:
1) In classification and clustering algorithms that involve distance calculations, and in PCA dimensionality reduction, StandardScaler performs better;
2) When distance and covariance calculations are not involved, and the data does not follow a normal distribution, MinMaxScaler can be used.
Normality tests can be done with the scipy library.
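For example, the Shapiro-Wilk test in scipy.stats (a minimal sketch; the samples are randomly generated here):

```python
import numpy as np
from scipy import stats

rng = np.random.RandomState(0)
normal_sample = rng.normal(size=500)
uniform_sample = rng.uniform(size=500)

# Shapiro-Wilk test: the null hypothesis is that the sample is normally distributed
_, p_normal = stats.shapiro(normal_sample)
_, p_uniform = stats.shapiro(uniform_sample)

print(p_normal)    # typically large for a truly normal sample: normality not rejected
print(p_uniform)   # essentially zero: normality rejected
```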
What is covariance? A concept used to indicate the (linear) dependence between x and y: independent variables have zero covariance, although zero covariance alone does not guarantee independence.
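A quick look at covariance with np.cov (the samples are randomly generated for illustration):

```python
import numpy as np

rng = np.random.RandomState(42)
x = rng.rand(1000)
y_indep = rng.rand(1000)               # generated independently of x
y_dep = 2 * x + 0.1 * rng.randn(1000)  # linearly dependent on x

# np.cov returns the 2x2 covariance matrix; entry [0, 1] is cov(x, y)
c_indep = np.cov(x, y_indep)[0, 1]
c_dep = np.cov(x, y_dep)[0, 1]

print(c_indep)  # close to 0: no linear dependence detected
print(c_dep)    # clearly non-zero (roughly 2 * var(x))

# caveat: covariance only captures *linear* dependence; independence implies
# zero covariance, but zero covariance does not imply independence
```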
Reference:
https://www.cnblogs.com/bjwu/p/8977141.html
The realization of normality test: 72861387
http://scikit-learn.org/stable/modules/preprocessing.html#standardization-or-mean-removal-and-variance-scaling
About fit and transform: 51461696