Machine Learning: this article uses the analysis of red wine taste as an example to describe a Lasso regression model tuned by cross-validation.
The ordinary least squares (OLS) algorithm is commonly used in linear regression. Its core idea is to find the best-fitting function for the data by minimizing the sum of squared errors.
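As a minimal sketch of this idea, the snippet below fits OLS with NumPy's `np.linalg.lstsq` on a small made-up dataset. The data values and variable names are illustrative assumptions and are not part of the wine experiment.

```python
import numpy as np

# Made-up data generated from y = 2*x1 - 3*x2 + 1 (no noise), for illustration only
X = np.array([[1.0, 2.0], [2.0, 0.0], [3.0, 1.0], [4.0, 3.0], [5.0, 5.0]])
y = 2.0 * X[:, 0] - 3.0 * X[:, 1] + 1.0

# Append a column of ones so the last coefficient acts as the intercept
A = np.column_stack([X, np.ones(len(X))])

# lstsq returns the w that minimizes the sum of squared errors ||A w - y||^2
w, residuals, rank, _ = np.linalg.lstsq(A, y, rcond=None)
print(w)
```

Because the made-up data contains no noise, the recovered coefficients should match the generating values up to floating-point error.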
However, the most common problem with OLS is that it overfits easily: the model matches the attribute values (x) to the target values (y) in the sample dataset almost one to one. The fit looks very good on the training data, but the error on a new dataset is very large.
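The effect can be illustrated with a small sketch. It assumes a made-up dataset of five points and uses polynomial least squares (`np.polyfit`) as a stand-in for an OLS model with as many parameters as samples; none of the numbers come from the wine data.

```python
import numpy as np

# Made-up training data: the true relation is y = x, plus small fixed "noise"
x_train = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_train = x_train + np.array([0.3, -0.4, 0.5, -0.3, 0.2])

# New data that follows the same true relation y = x
x_new = np.array([0.5, 1.5, 2.5, 3.5, 5.0])
y_new = x_new.copy()

def mse(coeffs, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

# Degree 1: two parameters. Degree 4: five parameters, one per training point.
line = np.polyfit(x_train, y_train, deg=1)
wiggly = np.polyfit(x_train, y_train, deg=4)

print("train MSE:", mse(line, x_train, y_train), mse(wiggly, x_train, y_train))
print("new   MSE:", mse(line, x_new, y_new), mse(wiggly, x_new, y_new))
```

The degree-4 fit passes through every training point, so its training error is essentially zero, yet its error on the new points is larger than that of the straight line.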
At present, there are two main approaches to this problem: forward stepwise regression and penalized linear regression. They are better called two ideas than two algorithms, because each has given rise to a family of algorithms. The latter in particular includes a variety of well-known algorithms (such as ridge regression, Lasso regression, least angle regression, and Glmnet).
Basic idea of forward stepwise regression: traverse every attribute column and find the single column with the smallest mean squared error (MSE), that is, the column that performs best on its own. Then find the second column that performs best in combination with the first, and so on until all columns have been added. Along the way, the MSE can be plotted against the number of attributes to pick the desired result, or the final result can be read from the printed MSE values.
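The procedure can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the article's own implementation: the helper names `fit_mse` and `forward_stepwise` are hypothetical, the MSE is measured on a held-out validation split, and the data is synthetic.

```python
import numpy as np

def fit_mse(X_train, y_train, X_val, y_val, cols):
    """Fit OLS (with intercept) on the chosen columns; return validation MSE."""
    A_train = np.column_stack([X_train[:, cols], np.ones(len(X_train))])
    A_val = np.column_stack([X_val[:, cols], np.ones(len(X_val))])
    w, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    return float(np.mean((A_val @ w - y_val) ** 2))

def forward_stepwise(X_train, y_train, X_val, y_val):
    """Greedily add the column that gives the lowest validation MSE."""
    selected, mse_path = [], []
    remaining = list(range(X_train.shape[1]))
    while remaining:
        scores = [fit_mse(X_train, y_train, X_val, y_val, selected + [c])
                  for c in remaining]
        best = int(np.argmin(scores))
        selected.append(remaining.pop(best))
        mse_path.append(scores[best])
    return selected, mse_path

# Synthetic data: column 2 matters most, column 0 a little, column 1 not at all
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 2] + 1.0 * X[:, 0] + 0.1 * rng.normal(size=200)

order, mse_path = forward_stepwise(X[:150], y[:150], X[150:], y[150:])
print(order)     # order in which columns were added
print(mse_path)  # validation MSE after each addition
```

Plotting `mse_path` against the number of selected columns gives the curve described above; the point where it stops improving suggests how many attributes to keep.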
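Basic idea of penalized linear regression: add a penalty term to the ordinary least squares objective. If ordinary least squares is written as

$$\min_{w} \sum_{i=1}^{m} \left( y_i - x_i^{T} w \right)^2$$

then the Lasso regression objective is

$$\min_{w} \sum_{i=1}^{m} \left( y_i - x_i^{T} w \right)^2 + \alpha \lVert w \rVert_1$$

where $\alpha \lVert w \rVert_1 = \alpha (|w_1| + |w_2| + \dots + |w_n|)$. Some implementations, including scikit-learn, scale the squared-error term by a constant factor; this does not change the idea.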
Lasso uses this penalty to obtain a more parsimonious model: it shrinks some coefficients and sets others exactly to zero, so the Lasso coefficient vector is sparse.
Lasso is L1-norm regularization (one of the three extensions of OLS for variable selection, also called a shrinkage method). The L1 norm is the sum of the absolute values of the elements of a vector.
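To make the penalty concrete, the sketch below computes the L1 norm and a Lasso-style objective with NumPy. The function names and the tiny arrays are illustrative assumptions, and the objective is written as an unscaled sum of squared errors plus alpha times the L1 norm; a given library may scale the squared-error term differently.

```python
import numpy as np

def l1_norm(w):
    """Sum of the absolute values of the elements of w."""
    return float(np.sum(np.abs(w)))

def lasso_objective(X, y, w, alpha):
    """Sum of squared errors plus alpha times the L1 norm of w (unscaled form)."""
    residual = y - X @ w
    return float(residual @ residual + alpha * l1_norm(w))

# Made-up values, for illustration only
X = np.array([[1.0, 2.0, 0.0], [0.0, 1.0, 1.0]])
y = np.array([1.0, -1.0])
w = np.array([0.5, -2.0, 0.0])

print(l1_norm(w))
print(lasso_objective(X, y, w, alpha=0.1))
```

A coefficient that is exactly zero, like the third element of `w` here, contributes nothing to the penalty, which is why minimizing this objective favors sparse coefficient vectors.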
With the model described, the next question is model evaluation.
Currently, there are two main methods for evaluating model performance: a held-out test set and n-fold cross-validation.
The held-out test set approach divides the sample data into two parts: one for training the model and the other for testing it. Generally, the test set accounts for 25% to 35% of all data.
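A hold-out split along these lines could look like the sketch below. The helper `holdout_split` is hypothetical, the 30% test fraction is just one value in the range mentioned above, and the arrays are placeholders rather than the wine data.

```python
import numpy as np

def holdout_split(X, y, test_fraction=0.3, seed=0):
    """Shuffle the row indices, then reserve a fraction of rows for testing."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))
    n_test = int(round(test_fraction * len(X)))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Placeholder data: 100 samples with 2 attributes each
X = np.arange(200, dtype=float).reshape(100, 2)
y = np.arange(100, dtype=float)

X_train, X_test, y_train, y_test = holdout_split(X, y, test_fraction=0.3)
print(X_train.shape, X_test.shape)
```

Shuffling before splitting matters when the rows are stored in some meaningful order, since otherwise the test set may not resemble the training set.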
n-fold cross-validation divides the data into n disjoint subsets; in each round one subset is used as the test set and the other n-1 as the training set. Assume the data is divided into five parts, numbered 1 to 5. In the first round, part 1 is the test set and parts 2, 3, 4, and 5 are the training set; in the second round, part 2 is the test set and parts 1, 3, 4, and 5 are the training set; and so on until every part has served as the test set once.
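The rotation of test and training parts can be sketched with plain NumPy index arithmetic. The generator name `k_fold_indices` is a hypothetical helper for illustration; in practice the indices are usually shuffled first, which this sketch omits for readability.

```python
import numpy as np

def k_fold_indices(n_samples, n_folds):
    """Yield (train_idx, test_idx) pairs; each fold is the test set exactly once."""
    folds = np.array_split(np.arange(n_samples), n_folds)
    for k in range(n_folds):
        test_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        yield train_idx, test_idx

# Ten samples split into five parts, mirroring the five-part example above
splits = list(k_fold_indices(n_samples=10, n_folds=5))
for round_no, (train_idx, test_idx) in enumerate(splits, start=1):
    print(round_no, "test:", test_idx, "train:", train_idx)
```

Every sample ends up in the test set exactly once, so the averaged test error uses all of the data without ever testing on rows the model was trained on in that round.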
The core class here is sklearn.linear_model.LassoCV.
LassoCV has many parameters. We only use the parameter cv, which sets the number of folds used for cross-validation.
The relevant attributes are:
alpha_: penalty coefficient selected by cross-validation, that is, the alpha value in the formula
coef_: parameter vector (w in the formula)
mse_path_: mean squared error on each fold for each alpha value tried
alphas_: grid of alpha values tried during cross-validation
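To show how these attributes look in practice, here is a small sketch on synthetic data. The data, the choice of cv=5, and the variable names are assumptions made for illustration; the wine experiment uses its own data and cv value.

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Synthetic data: only columns 0 and 3 influence the target
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
y = 2.0 * X[:, 0] - 1.0 * X[:, 3] + 0.1 * rng.normal(size=120)

model = LassoCV(cv=5).fit(X, y)

print(model.alpha_)           # penalty coefficient chosen by cross-validation
print(model.coef_)            # one coefficient per attribute
print(model.alphas_.shape)    # grid of alpha values that were tried
print(model.mse_path_.shape)  # (number of alphas, number of folds)
```

Averaging `mse_path_` over its last axis gives one mean error per alpha value, which is the curve typically plotted to see where the selected `alpha_` falls.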
That covers the theoretical background for this test. Next we start the experiment. Data source: the red wine data is read from wine.txt, a semicolon-delimited text file.
    import os

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn import linear_model
    from sklearn.linear_model import LassoCV

    # Load the data from wine.txt in the current working directory
    base_dir = os.getcwd()
    data = np.loadtxt(os.path.join(base_dir, "wine.txt"), delimiter=";")

    # Length of the matrix: number of rows
    dataLen = len(data)
    # Width of the matrix: number of columns
    dataWid = len(data[0])

    # Mean of each column
    xMeans = []
    # Standard deviation of each column
    xSD = []
    # Standardized sample set
    xNorm = []
    # Standardized labels
    labelNorm = []

    # First pass: compute the mean and standard deviation of each column
    for j in range(dataWid):
        # Read the values of column j
        x = [data[i][j] for i in range(dataLen)]
        # Mean of the column
        mean = np.mean(x)
        xMeans.append(mean)
        # Standard deviation of the column
        sd = np.std(x)
        xSD.append(sd)

    # Second pass: standardize the sample set and the labels
    for j in range(dataLen):
        # Standardize the attributes (every column except the last)
        xn = [(data[j][i] - xMeans[i]) / xSD[i] for i in range(dataWid - 1)]
        xNorm.append(xn)
        # Standardize the label (the last column)
        ln = (data[j][dataWid - 1] - xMeans[dataWid - 1]) / xSD[dataWid - 1]
        labelNorm.append(ln)

    # fit() expects arrays, so convert the lists
    X = np.array(xNorm)
    Y = np.array(labelNorm)

    # Run cross-validation: cv=10 means 10-fold cross-validation
    wineModel = LassoCV(cv=10).fit(X, Y)

    # Print the coefficients of the optimal solution:
    # [0, -0.22773828, 0.09423888, 0, -0.02215153,
    #  0.09903605, -, 0, -0.06787363, 0.16804092, 0.3750958]
    print(wineModel.coef_)
    # Print the penalty coefficient of the optimal solution: 0.013561387701
    print(wineModel.alpha_)

    # Plotting
    plt.figure()
    # MSE curve of each fold as the alpha value changes
    plt.plot(wineModel.alphas_, wineModel.mse_path_, ':')
    # Mean MSE across folds as the alpha value changes
    plt.plot(wineModel.alphas_, wineModel.mse_path_.mean(axis=-1),
             label='Average MSE Across Folds', linewidth=2)
    # The alpha value selected by cross-validation
    plt.axvline(wineModel.alpha_, linestyle='--',
                label='CV Estimate of Best alpha')
    plt.semilogx()
    plt.legend()
    ax = plt.gca()
    ax.invert_xaxis()
    plt.xlabel('alpha')
    plt.ylabel('Mean Square Error')
    plt.axis('tight')
    plt.show()
The penalty coefficient and coefficient vector of the optimal solution are recorded in the code as comments. The generated plot is as follows: