I. Choosing the right model: basic validation
from sklearn.datasets import load_iris                 # iris dataset
from sklearn.model_selection import train_test_split   # data-splitting module
from sklearn.neighbors import KNeighborsClassifier     # k-nearest neighbor (KNN) classification algorithm

# Load the iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=4)

# Build the model
knn = KNeighborsClassifier()

# Train the model
knn.fit(X_train, y_train)

# Print the accuracy
print(knn.score(X_test, y_test))
# 0.973684210526  (the accuracy from basic validation)
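By default train_test_split holds out 25% of the samples for testing; as a minimal sketch, the split can be controlled explicitly (test_size and random_state are standard parameters of train_test_split):

# Hold out 30% of the data instead of the default 25%
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=4)
print(X_train.shape, X_test.shape)  # (105, 4) (45, 4) for the 150-sample iris set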
II. Choosing the right model: cross-validation
The basic idea of cross-validation is to partition the original dataset into groups: one part serves as the training set and the remainder as the validation set (or test set). The classifier is first trained on the training set and then evaluated on the validation set; the resulting scores are the performance metrics of the classifier.
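To make the grouping concrete, here is a minimal sketch of what a 5-fold split does under the hood, using sklearn's KFold (cross_val_score, used below, wraps this loop for you):

from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=4)
for train_idx, test_idx in kf.split(X):
    # each round: 4/5 of the rows train the model, the remaining 1/5 validates it
    knn.fit(X[train_idx], y[train_idx])
    print(knn.score(X[test_idx], y[test_idx]))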
from sklearn.model_selection import cross_val_score  # K-fold cross-validation module

# Use the K-fold cross-validation module
scores = cross_val_score(knn, X, y, cv=5, scoring='accuracy')

# Print the accuracy of each of the 5 folds
print(scores)
# [0.96666667 1.  0.93333333 0.96666667 1.]

# Print the mean accuracy over the 5 folds
print(scores.mean())
# 0.973333333333
III. Accuracy and mean squared error
Accuracy is generally used to judge the quality of a classification model.
import matplotlib.pyplot as plt  # visualization module

# Build the set of candidate parameter values
k_range = range(1, 31)
k_scores = []

# Iterate over the parameter values, recording the mean cross-validated
# accuracy for each to see its effect on the model
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X, y, cv=10, scoring='accuracy')
    k_scores.append(scores.mean())

# Visualize the data
plt.plot(k_range, k_scores)
plt.xlabel('Value of K for KNN')
plt.ylabel('Cross-validated accuracy')
plt.show()
As the graph shows, K values from about 12 to 18 give the best accuracy. Beyond 18, accuracy begins to fall because of overfitting.
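To pick the best K programmatically instead of reading it off the plot, a one-line sketch over the scores just collected (assumes numpy is imported as np):

import numpy as np

# index of the highest mean cross-validated accuracy
best_k = k_range[int(np.argmax(k_scores))]
print(best_k)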
Mean squared error (MSE) is generally used to judge the quality of a regression model.
import matplotlib.pyplot as plt

k_range = range(1, 31)
k_scores = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    # the scorer returns negative MSE, so negate it to get the loss
    loss = -cross_val_score(knn, X, y, cv=10, scoring='neg_mean_squared_error')
    k_scores.append(loss.mean())

plt.plot(k_range, k_scores)
plt.xlabel('Value of K for KNN')
plt.ylabel('Cross-validated MSE')
plt.show()
As the graph shows, the lower the MSE the better, so K values around 13 to 18 are best.
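To see what that scorer measures, here is a minimal sketch computing the MSE of a single model directly with sklearn.metrics.mean_squared_error (reusing the train/test split from section I; n_neighbors=13 is just one value from the range the plot favors):

from sklearn.metrics import mean_squared_error

knn = KNeighborsClassifier(n_neighbors=13)
knn.fit(X_train, y_train)
# MSE is the average of the squared differences between predictions and targets
print(mean_squared_error(y_test, knn.predict(X_test)))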
IV. Examining overfitting with the learning curve
from sklearn.model_selection import learning_curve  # learning-curve module
from sklearn.datasets import load_digits            # digits dataset
from sklearn.svm import SVC                         # support vector classifier
import matplotlib.pyplot as plt                     # visualization module
import numpy as np
Load the digits dataset, which contains handwritten digits from 0 to 9. The dataset has 1797 samples, each described by 64 features corresponding to the 8x8 pixels of the handwritten image; each pixel value lies in the range 0 to 16.
digits = load_digits()
X = digits.data
y = digits.target
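A quick check confirms the shape described above:

print(X.shape)  # (1797, 64): 1797 samples, 64 pixel features
print(y[:10])   # the first ten digit labels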
Observe how the learning curve changes as the training set grows, using 10-fold cross-validation (cv=10) and mean squared error (scoring='neg_mean_squared_error') to evaluate the model. The training sizes are examined in 5 rounds, from small to large (10%, 25%, 50%, 75%, 100%):
train_sizes, train_loss, test_loss = learning_curve(
    SVC(gamma=0.001), X, y, cv=10, scoring='neg_mean_squared_error',
    train_sizes=[0.1, 0.25, 0.5, 0.75, 1])
# Mean squared error averaged over the folds in each round
# (5 rounds: 10%, 25%, 50%, 75%, 100% of the samples)
train_loss_mean = -np.mean(train_loss, axis=1)
test_loss_mean = -np.mean(test_loss, axis=1)
Visualize the results:
plt.plot(train_sizes, train_loss_mean, 'o-', color="r",
         label="Training")
plt.plot(train_sizes, test_loss_mean, 'o-', color="g",
         label="Cross-validation")
plt.xlabel("Training examples")
plt.ylabel("Loss")
plt.legend(loc="best")
plt.show()
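With gamma=0.001 the training and cross-validation curves converge as the sample size grows, which indicates good generalization. As a hedged sketch (gamma=0.01 is an illustrative value, not from the original), rerunning the same call with a larger gamma makes the signature of overfitting visible: training loss stays near zero while cross-validation loss stops improving, leaving a persistent gap between the two curves.

# Illustrative only: a larger gamma tends to overfit this dataset
train_sizes, train_loss, test_loss = learning_curve(
    SVC(gamma=0.01), X, y, cv=10, scoring='neg_mean_squared_error',
    train_sizes=[0.1, 0.25, 0.5, 0.75, 1])
# plotting these exactly as above shows the gap between the two curves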