"Python Machine learning" notes (vi)

Source: Internet
Author: User
Tags: svm

Model evaluation and parameter tuning in practice: a pipeline-based workflow

A handy tool: the Pipeline class in scikit-learn. It allows us to fit a model that contains an arbitrary number of processing steps and then use that model to make predictions on new data.

Loading the Wisconsin Breast Cancer dataset

1. Use pandas to read the dataset directly from the UCI website:

    import pandas as pd
    df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data', header=None)


2. Next, assign the 30 features of the dataset to a NumPy array X. Using the LabelEncoder class in scikit-learn, we can convert the class labels from their original string representation (M or B) to integers:

    from sklearn.preprocessing import LabelEncoder
    X = df.loc[:, 2:].values
    y = df.loc[:, 1].values
    le = LabelEncoder()
    y = le.fit_transform(y)

The transformed class labels (the diagnosis results) are stored in the array y, where malignant and benign tumors are encoded as classes 1 and 0, respectively. We can confirm this mapping by calling the transform method of the fitted LabelEncoder on two dummy class labels.
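For example, calling transform on the fitted le object with the two original string labels shows the corresponding integer codes:

    le.transform(['M', 'B'])    # returns array([1, 0])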

3. Before building the first pipeline model, divide the dataset into a training dataset (80% of the original data) and a separate test dataset (20% of the original data):

    from sklearn.cross_validation import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
Combining data transformations and an estimator in a pipeline

We want to compress the initial 30-dimensional data into a two-dimensional subspace via PCA. Instead of fitting and transforming the training and test datasets separately, we chain the StandardScaler, PCA, and LogisticRegression objects together in a pipeline:

    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline
    pipe_lr = Pipeline([('scl', StandardScaler()),
                        ('pca', PCA(n_components=2)),
                        ('clf', LogisticRegression(random_state=1))])
    pipe_lr.fit(X_train, y_train)
    print('Test accuracy: %.3f' % pipe_lr.score(X_test, y_test))


The Pipeline object takes a list of tuples as input. The first value in each tuple is a string that can be any identifier; we use it to access the individual elements in the pipeline. The second value in each tuple is a transformer or estimator from scikit-learn.

The pipeline contains scikit-learn classes for data preprocessing, with an estimator at the end. In the preceding example code, there are two preprocessing steps in the pipeline, StandardScaler and PCA, for scaling and transforming the data, and finally a logistic regression classifier as the estimator. When the fit method is executed on the pipeline pipe_lr, StandardScaler performs fit and transform on the training data, and the transformed training data is passed to the next object in the pipeline, PCA. As in the previous step, PCA performs fit and transform on the output of the preceding step and passes the processed data to the last object in the pipeline, the estimator. We should note that there is no limit to the number of intermediate steps in a pipeline. The way the pipeline works can be described as follows:
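The following is a minimal sketch of that control flow. It is only an illustration under the assumption that intermediate steps expose fit/transform and the final estimator exposes fit/predict; it is not scikit-learn's actual Pipeline implementation:

    # Illustrative sketch only, not scikit-learn's actual Pipeline implementation.
    class SimplePipeline:
        def __init__(self, steps):
            self.steps = steps  # list of (name, transformer_or_estimator) tuples

        def fit(self, X, y):
            for name, step in self.steps[:-1]:
                X = step.fit_transform(X, y)     # fit and transform each intermediate step
            self.steps[-1][1].fit(X, y)          # fit the final estimator on the transformed data
            return self

        def predict(self, X):
            for name, step in self.steps[:-1]:
                X = step.transform(X)            # only transform at prediction time
            return self.steps[-1][1].predict(X)  # final estimator makes the prediction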

Evaluating model performance using K-fold cross-validation

A key step in building a machine learning model is to evaluate the performance of the model on new data.

Common cross-validation techniques: holdout cross-validation and K-fold cross-validation.

Holdout cross-validation

Holdout cross-validation is a classic and widely used method for estimating the generalization performance of machine learning models. Using the holdout method, we split the initial dataset into a training dataset and a test dataset: the former is used for model training, the latter for performance estimation. However, in a typical machine learning application, in order to further improve the model's performance on unseen data, we also need to tune and compare different parameter settings. This process, called model selection, refers to adjusting the tuning parameters (also known as hyperparameters) of a given classification problem to find their optimal values.

A better way to use the holdout method for model selection is to split the data into three parts: a training dataset, a validation dataset, and a test dataset. The training dataset is used to fit the different models, and performance on the validation dataset serves as the criterion for model selection. The advantage of keeping a test dataset that the model has not seen during training and model selection is that we obtain a less biased estimate of its ability to generalize to new data. This is the basic idea of holdout cross-validation, as sketched below.
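A minimal sketch of such a three-way split, assuming the X and y arrays loaded earlier and an illustrative 60/20/20 ratio (the variable names here are hypothetical and separate from the X_train/X_test split used in the rest of these notes):

    from sklearn.cross_validation import train_test_split
    # Hold out 20% of the data as the final test set.
    X_rest, X_test_h, y_rest, y_test_h = train_test_split(X, y, test_size=0.2, random_state=1)
    # Split the remaining 80% into training and validation sets (60%/20% of the original data).
    X_train_h, X_val_h, y_train_h, y_val_h = train_test_split(X_rest, y_rest, test_size=0.25, random_state=1)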

One drawback of the holdout approach is that the performance estimate is sensitive to how the training data are partitioned into training and validation subsets; the estimate will vary with different samples of the data.

K-fold cross-validation

In K-fold cross-validation, we randomly divide the training dataset into k folds, of which k-1 folds are used for model training and the remaining fold is used for testing. This procedure is repeated k times, so we obtain k models and k estimates of model performance.

Based on the performance estimates obtained on these separate, distinct folds, we can compute the average performance. Compared with the holdout method, the result is less sensitive to how the data are partitioned. We generally use K-fold cross-validation for model tuning, that is, to find the hyperparameter values that yield the best generalization performance. Once satisfactory values have been found, we can retrain the model on the whole training dataset and use the independent test dataset for a final assessment of its performance.

Since K-fold cross-validation is a resampling technique without replacement, each sample point is part of a validation fold exactly once, which gives a lower-variance estimate of model performance than the holdout method.

    import numpy as np
    from sklearn.cross_validation import StratifiedKFold
    kfold = StratifiedKFold(y=y_train, n_folds=10, random_state=1)
    scores = []
    for k, (train, test) in enumerate(kfold):
        pipe_lr.fit(X_train[train], y_train[train])
        score = pipe_lr.score(X_train[test], y_train[test])
        scores.append(score)
        print('Fold: %s, class dist.: %s, acc: %.3f' % (k + 1, np.bincount(y_train[train]), score))


First, we initialize a StratifiedKFold iterator from the sklearn.cross_validation module with the class labels y_train of the training set, and set the number of folds via the n_folds parameter. As we loop over the k folds with the kfold iterator, we use the indices returned in train to fit the logistic regression pipeline built earlier. Using the pipe_lr pipeline ensures that the samples are scaled appropriately (for example, standardized) in each iteration. The test indices are then used to compute the accuracy of the model, which is stored in the scores list for computing the average accuracy and the standard deviation of the performance estimate.
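For example, the average accuracy and its standard deviation can then be computed from the collected scores:

    print('CV accuracy: %.3f +/- %.3f' % (np.mean(scores), np.std(scores)))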

Debugging algorithms with learning and validation curves

Two simple but powerful diagnostic tools can help improve the performance of a learning algorithm: the learning curve and the validation curve.

Using learning curves to diagnose bias and variance problems

If a model is too complex for a given training dataset (there are too many degrees of freedom or parameters in the model), the model tends to overfit the training data and generalizes poorly to unseen data. Often, collecting more training samples helps reduce the degree of overfitting. In practice, however, collecting more data can be costly or simply infeasible. By plotting the training and validation accuracies as functions of the training set size, we can easily tell whether the model suffers from high bias or high variance, and whether collecting more data would help. Let us discuss two common problems with models:

The left image shows a model with high bias. Its training accuracy and cross-validation accuracy are both low, which indicates that the model underfits the data. A common way to address this problem is to increase the number of parameters in the model. The model in the upper-right image suffers from high variance, indicated by the large gap between the training accuracy and the cross-validation accuracy. For this kind of overfitting problem, we can collect more training data or reduce the complexity of the model, for example by increasing the regularization parameter.

How to evaluate a model using the learning_curve function in scikit-learn:

    import matplotlib.pyplot as plt
    from sklearn.learning_curve import learning_curve
    pipe_lr = Pipeline([('scl', StandardScaler()),
                        ('clf', LogisticRegression(penalty='l2', random_state=0))])
    train_sizes, train_scores, test_scores = learning_curve(estimator=pipe_lr, X=X_train, y=y_train,
                                                             train_sizes=np.linspace(0.1, 1.0, 10),
                                                             cv=10, n_jobs=1)
    train_mean = np.mean(train_scores, axis=1)
    train_std = np.std(train_scores, axis=1)
    test_mean = np.mean(test_scores, axis=1)
    test_std = np.std(test_scores, axis=1)
    plt.plot(train_sizes, train_mean, color='blue', marker='o', markersize=5, label='training accuracy')
    plt.fill_between(train_sizes, train_mean + train_std, train_mean - train_std, alpha=0.15, color='blue')
    plt.plot(train_sizes, test_mean, color='green', linestyle='--', marker='s', markersize=5, label='validation accuracy')
    plt.fill_between(train_sizes, test_mean + test_std, test_mean - test_std, alpha=0.15, color='green')
    plt.grid()
    plt.xlabel('Number of training samples')
    plt.ylabel('Accuracy')
    plt.legend(loc='lower right')
    plt.ylim([0.8, 1.0])
    plt.show()


With the train_sizes parameter of the learning_curve function, we can control the absolute or relative number of samples used to generate the learning curve. Here, setting train_sizes=np.linspace(0.1, 1.0, 10) uses 10 evenly spaced, relative sizes of the training dataset. By default, the learning_curve function uses stratified k-fold cross-validation to compute the cross-validation accuracy; the value of k is set to 10 via the cv parameter. We then simply compute the average accuracies from the returned cross-validated training and test scores for the different training-set sizes and plot them with Matplotlib's plot function. In addition, we use the fill_between function to add the standard deviation of the average accuracy to the plot, indicating the variance of the estimate.

Diagnosing overfitting and underfitting with validation curves

The validation curve is a useful tool for improving model performance by locating problems such as overfitting or underfitting. Validation curves are related to learning curves, but instead of plotting the training and test accuracies as functions of the sample size, they plot accuracy as a function of a model parameter.

    from sklearn.learning_curve import validation_curve
    param_range = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
    train_scores, test_scores = validation_curve(estimator=pipe_lr, X=X_train, y=y_train,
                                                 param_name='clf__C',
                                                 param_range=param_range, cv=10)
    train_mean = np.mean(train_scores, axis=1)
    train_std = np.std(train_scores, axis=1)
    test_mean = np.mean(test_scores, axis=1)
    test_std = np.std(test_scores, axis=1)
    plt.plot(param_range, train_mean, color='blue', marker='o', markersize=5, label='training accuracy')
    plt.fill_between(param_range, train_mean + train_std, train_mean - train_std, alpha=0.15, color='blue')
    plt.plot(param_range, test_mean, color='green', linestyle='--', marker='s', markersize=5, label='validation accuracy')
    plt.fill_between(param_range, test_mean + test_std, test_mean - test_std, alpha=0.15, color='green')
    plt.grid()
    plt.xscale('log')
    plt.legend(loc='lower right')
    plt.xlabel('Parameter C')
    plt.ylabel('Accuracy')
    plt.ylim([0.8, 1.0])
    plt.show()


Similar to the learning_curve function, the validation_curve function by default uses stratified k-fold cross-validation to estimate model performance when a classification algorithm is used. Inside the validation_curve function, we specify the parameter we want to evaluate; here it is clf__C, the inverse regularization parameter C of the LogisticRegression classifier in the pipeline.

Using grid search to tune machine learning models

In machine learning, there are two types of parameters: those learned from the training data, and the tuning parameters, also called hyperparameters. A powerful hyperparameter optimization technique is grid search, which can further improve model performance by finding the optimal combination of hyperparameter values.

Tuning hyperparameters via grid search

The grid search approach is quite simple: it performs a brute-force exhaustive search over the different parameter value lists we specify and evaluates the effect of each combination on model performance to find the optimal combination of parameters.

    from sklearn.grid_search import GridSearchCV
    from sklearn.svm import SVC
    pipe_svc = Pipeline([('scl', StandardScaler()),
                         ('clf', SVC(random_state=1))])
    param_range = [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]
    param_grid = [{'clf__C': param_range, 'clf__kernel': ['linear']},
                  {'clf__C': param_range, 'clf__gamma': param_range, 'clf__kernel': ['rbf']}]
    gs = GridSearchCV(estimator=pipe_svc, param_grid=param_grid, scoring='accuracy', cv=10, n_jobs=-1)
    gs = gs.fit(X_train, y_train)
    print(gs.best_score_)
    print(gs.best_params_)

Using the above code, we initialize a GridSearchCV object from the sklearn.grid_search module to train and tune the support vector machine pipeline. We set the param_grid parameter of GridSearchCV to a list of dictionaries specifying the parameters to be tuned. For the linear SVM we only tune the regularization parameter C; for the RBF kernel SVM we tune both the C and gamma parameters. Note that gamma is specific to kernel SVMs. After the grid search on the training dataset has completed, the performance score of the best model is available through the best_score_ attribute, and the corresponding parameter settings through the best_params_ attribute.
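As a follow-up (a short sketch reusing the gs object and the train/test split from above), the model with the best parameter combination can be obtained via the best_estimator_ attribute, refit on the whole training set, and then evaluated on the independent test set:

    clf = gs.best_estimator_            # pipeline with the best-performing parameter settings
    clf.fit(X_train, y_train)           # refit on the whole training dataset
    print('Test accuracy: %.3f' % clf.score(X_test, y_test))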

Selecting algorithms by nested cross-validation

Grid search combined with K-fold cross-validation is an effective way to improve model performance by varying the hyperparameter values. If, however, we want to choose among different machine learning algorithms, another recommended approach is nested cross-validation; studies of error-estimation bias have found that the error estimated with nested cross-validation is almost unbiased with respect to the error obtained on an independent test set.

In the outer loop of nested cross-validation, we divide the data into training folds and test folds, while in the inner loop, used for model selection, we run K-fold cross-validation on the training folds. After model selection is complete, the test fold is used to evaluate model performance. The variant with 5 outer folds and 2 inner folds is known as 5x2 cross-validation.
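A minimal sketch of nested cross-validation, reusing the pipe_svc and param_grid objects defined above, with 2 inner folds for model selection and 5 outer folds for performance estimation:

    from sklearn.cross_validation import cross_val_score
    # Inner loop: grid search with 2-fold cross-validation for model selection.
    gs = GridSearchCV(estimator=pipe_svc, param_grid=param_grid, scoring='accuracy', cv=2, n_jobs=-1)
    # Outer loop: 5-fold cross-validation to estimate generalization performance.
    scores = cross_val_score(gs, X_train, y_train, scoring='accuracy', cv=5)
    print('CV accuracy: %.3f +/- %.3f' % (np.mean(scores), np.std(scores)))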




"Python Machine learning" notes (vi)

Related Article

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.