Model Evaluation and Parameter Tuning in Python Machine Learning

When preprocessing data we usually chain several steps, such as feature standardization and principal component analysis, and the parameters fitted in each step have to be reused on new data. scikit-learn provides the Pipeline class, which solves this problem in one place.

First, the usual way:

import pandas as pd
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import train_test_split  # sklearn.model_selection in newer versions

df = pd.read_csv('wdbc.csv')
X = df.iloc[:, 2:].values
y = df.iloc[:, 1].values
# Encode the class labels ('M'/'B') as integers
y = LabelEncoder().fit_transform(y)
# Train/test split (implied by the original snippet, which uses X_train/X_test below)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Standardization
sc = StandardScaler()
X_train_std = sc.fit_transform(X_train)
X_test_std = sc.transform(X_test)
# Principal component analysis
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_std)
X_test_pca = pca.transform(X_test_std)
# Logistic regression prediction
lr = LogisticRegression(random_state=1)
lr.fit(X_train_pca, y_train)
y_pred = lr.predict(X_test_pca)

That is: first standardize the data, then run principal component analysis, and finally make predictions with logistic regression.

Now using a Pipeline:

from sklearn.pipeline import Pipeline

pipe_lr = Pipeline([('sc', StandardScaler()),
                    ('pca', PCA(n_components=2)),
                    ('lr', LogisticRegression(random_state=1))])
pipe_lr.fit(X_train, y_train)
pipe_lr.score(X_test, y_test)

The Pipeline object receives a list of tuples as input. The first element of each tuple is a name for that step, and the second element is a scikit-learn transformer or estimator.

Every intermediate step of the pipeline must be a scikit-learn transformer, and the final step is an estimator. In our example the pipeline has two intermediate steps, a StandardScaler and a PCA, both of which are transformers, while the logistic regression classifier is the estimator.

When the pipeline pipe_lr executes its fit method, StandardScaler first runs fit and transform, and the transformed data is passed to PCA, which also runs fit and transform. Finally the data is passed to LogisticRegression, which trains an LR model.
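To make this flow concrete, here is a minimal sketch of what the two calls amount to if done by hand (illustrative only, not the actual Pipeline implementation):

sc, pca, lr = StandardScaler(), PCA(n_components=2), LogisticRegression(random_state=1)
# pipe_lr.fit(X_train, y_train): every transformer runs fit + transform, the final estimator runs fit
lr.fit(pca.fit_transform(sc.fit_transform(X_train)), y_train)
# pipe_lr.score(X_test, y_test): transformers only transform, the estimator scores
accuracy = lr.score(pca.transform(sc.transform(X_test)), y_test)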

A pipeline can contain any number of intermediate transformers.

Using a pipeline noticeably reduces the amount of code.

Now for model evaluation and tuning.

A key step in training a machine learning model is to evaluate its generalization ability. It is clearly illogical to evaluate a model's performance on the same training set we used to train it. A model that performs poorly is either too complex, causing overfitting (high variance), or too simple, causing underfitting (high bias). So what is the right way to evaluate model performance? That is what this section addresses: you will learn two cross-validation techniques, holdout cross-validation and k-fold cross-validation, for evaluating a model's generalization ability.

First, holdout cross-validation (evaluating model performance)

The holdout method simply divides the dataset into a training set and a test set: the former is used for training, the latter for evaluating performance.

However, if during model selection we keep using the test set to evaluate model performance, the test set effectively becomes part of the training data in disguise, and the "optimal" model we select is likely to be overfitted.

A better holdout approach is to divide the original dataset into three parts: a training set, a validation set, and a test set. The training set is used to train different models, the validation set is used for model selection, and the test set, being untouched during both training and model selection, remains unseen by the model and can therefore be used to evaluate its generalization ability.
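As a minimal sketch of such a three-way split (the 60/20/20 ratio and the two-step use of train_test_split are illustrative choices, not taken from the original text):

from sklearn.cross_validation import train_test_split  # sklearn.model_selection in newer versions

# First carve off 20% as the test set, then split the remainder into training and validation sets
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=1)
# Result: 60% training, 20% validation, 20% test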

Disadvantage: the holdout method is sensitive to how the data is split. If the original dataset is split badly, in terms of the proportion of samples assigned to the training, validation, and test sets, or whether the distribution of the data after splitting still matches the original distribution, then different ways of splitting may lead to different "optimal" model parameters.
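A quick, purely illustrative way to see this sensitivity is to score the same pipeline under a few different random splits and compare the results:

from sklearn.cross_validation import train_test_split

# Illustration only: the same model evaluated under different random splits gives different scores
for seed in (0, 1, 2):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)
    pipe_lr.fit(X_tr, y_tr)
    print('split seed %d -> accuracy %.3f' % (seed, pipe_lr.score(X_te, y_te)))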

Second, k-fold cross-validation (evaluating model performance)

The k-fold cross-validation procedure: in the first step we randomly divide the original data into k parts by sampling without replacement; in the second step we train the model on k-1 parts and test it on the remaining part. We then repeat the second step k times, obtaining k models and their evaluation results. (Translator's note: to reduce the error introduced by how the data is split, k-fold cross-validation is usually repeated p times with different random partitions; 10 times 10-fold cross-validation is common.)

We then take the average of the k results as the performance estimate for the parameters/model. Using k-fold cross-validation to find the optimal parameters is more stable than the holdout method. Once the optimal parameters are found, we use them to retrain the model on the full original dataset as the final model.

Because k-fold cross-validation uses sampling without replacement, each sample is used for testing exactly once, so the resulting performance estimate has lower variance.

Take 10-fold cross-validation as an example:

For 10 times 10-fold cross-validation, my understanding is that we randomly partition the data into k folds in 10 different ways; each time we train on k-1 folds and test on the remaining fold, obtaining 10 sets of models and evaluation results, and then take the average over the 10 runs as the performance estimate.
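A minimal sketch of this repeated (10 x 10-fold) scheme, assuming pipe_lr is defined as above and using the older sklearn.cross_validation API that this article uses elsewhere:

import numpy as np
from sklearn.cross_validation import StratifiedKFold  # sklearn.model_selection in newer versions

repeated_scores = []
for repeat in range(10):                                   # 10 different random partitionings
    kfold = StratifiedKFold(y=y_train, n_folds=10, shuffle=True, random_state=repeat)
    for train, test in kfold:                              # 10 folds per partitioning
        pipe_lr.fit(X_train[train], y_train[train])
        repeated_scores.append(pipe_lr.score(X_train[test], y_train[test]))
print('10x10-fold CV accuracy: %.3f +/- %.3f' % (np.mean(repeated_scores), np.std(repeated_scores)))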

import numpy as np
from sklearn.cross_validation import StratifiedKFold  # sklearn.model_selection in newer versions

pipe_lr = Pipeline([('sc', StandardScaler()),
                    ('pca', PCA(n_components=2)),
                    ('lr', LogisticRegression(random_state=1))])

kfold = StratifiedKFold(y=y_train, n_folds=10, random_state=1)
scores = []
for k, (train, test) in enumerate(kfold):
    pipe_lr.fit(X_train[train], y_train[train])
    score = pipe_lr.score(X_train[test], y_train[test])
    scores.append(score)
    print('Fold: %s, Class dist.: %s, Acc: %.3f' % (k + 1, np.bincount(y_train[train]), score))
print('CV accuracy: %.3f +/- %.3f' % (np.mean(scores), np.std(scores)))

A simpler approach

from sklearn.cross_validation import cross_val_score  # sklearn.model_selection in newer versions

pipe_lr = Pipeline([('sc', StandardScaler()),
                    ('pca', PCA(n_components=2)),
                    ('lr', LogisticRegression(random_state=1))])

scores = cross_val_score(estimator=pipe_lr, X=X_train, y=y_train, cv=10, n_jobs=1)
print('CV accuracy scores: %s' % scores)
print('CV accuracy: %.3f +/- %.3f' % (np.mean(scores), np.std(scores)))

The cv argument is k, the number of folds.

Third, learning curve (debugging the algorithm)

import numpy as np
import matplotlib.pyplot as plt
from sklearn.learning_curve import learning_curve  # sklearn.model_selection in newer versions

pipe_lr = Pipeline([('scl', StandardScaler()),
                    ('clf', LogisticRegression(penalty='l2', random_state=0))])
train_sizes, train_scores, test_scores = learning_curve(
    estimator=pipe_lr, X=X_train, y=y_train,
    train_sizes=np.linspace(0.1, 1.0, 10), cv=10, n_jobs=1)

train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)
test_mean = np.mean(test_scores, axis=1)
test_std = np.std(test_scores, axis=1)

plt.plot(train_sizes, train_mean, color='blue', marker='o', markersize=5, label='training accuracy')
plt.fill_between(train_sizes, train_mean + train_std, train_mean - train_std, alpha=0.15, color='blue')
plt.plot(train_sizes, test_mean, color='green', linestyle='--', marker='s', markersize=5, label='validation accuracy')
plt.fill_between(train_sizes, test_mean + test_std, test_mean - test_std, alpha=0.15, color='green')
plt.grid()
plt.xlabel('Number of training samples')
plt.ylabel('Accuracy')
plt.legend(loc='lower right')
plt.ylim([0.8, 1.0])
# A large gap between the training and validation curves suggests high variance (overfitting);
# low accuracy on both suggests high bias (underfitting)
plt.show()
