I. Choosing the right model: basic validation
from sklearn.datasets import load_iris                 # iris dataset
from sklearn.model_selection import train_test_split   # data-splitting module
from sklearn.neighbors import KNeighborsClassifier     # k-nearest neighbor (KNN) classification algorithm

# Load the iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=4)

# Build the model
knn = KNeighborsClassifier()

# Train the model
knn.fit(X_train, y_train)

# Print the accuracy
print(knn.score(X_test, y_test))
# 0.973684210526  (the accuracy from basic validation)
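By default train_test_split holds out 25% of the samples for testing; as a minimal sketch, the split can be controlled explicitly (test_size and random_state are standard parameters of train_test_split):

# Hold out 30% of the data instead of the default 25%
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=4)
print(X_train.shape, X_test.shape)  # (105, 4) (45, 4) for the 150-sample iris set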
II. Choosing the right model: cross-validation
The basic idea of cross-validation is to partition the original dataset into groups: one part serves as the training set and the remainder as the validation set (or test set). The classifier is first trained on the training set and then evaluated on the validation set; the resulting scores are the performance metrics of the classifier.
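To make the grouping concrete, here is a minimal sketch of what a 5-fold split does under the hood, using sklearn's KFold (cross_val_score, used below, wraps this loop for you):

from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=4)
for train_idx, test_idx in kf.split(X):
    # each round: 4/5 of the rows train the model, the remaining 1/5 validates it
    knn.fit(X[train_idx], y[train_idx])
    print(knn.score(X[test_idx], y[test_idx]))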
from sklearn.model_selection import cross_val_score  # K-fold cross-validation module

# Use the K-fold cross-validation module
scores = cross_val_score(knn, X, y, cv=5, scoring='accuracy')

# Print the accuracy of each of the 5 folds
print(scores)
# [0.96666667 1.  0.93333333 0.96666667 1.]

# Print the mean accuracy over the 5 folds
print(scores.mean())
# 0.973333333333
III. Accuracy and mean squared error
Accuracy is generally used to judge the quality of a classification model.
import matplotlib.pyplot as plt  # visualization module

# Build the set of candidate parameter values
k_range = range(1, 31)
k_scores = []

# Iterate over the parameter values, recording the mean cross-validated
# accuracy for each to see its effect on the model
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X, y, cv=10, scoring='accuracy')
    k_scores.append(scores.mean())

# Visualize the data
plt.plot(k_range, k_scores)
plt.xlabel('Value of K for KNN')
plt.ylabel('Cross-validated accuracy')
plt.show()
As the graph shows, K values from about 12 to 18 give the best accuracy. Beyond 18, accuracy begins to fall because of overfitting.
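To pick the best K programmatically instead of reading it off the plot, a one-line sketch over the scores just collected (assumes numpy is imported as np):

import numpy as np

# index of the highest mean cross-validated accuracy
best_k = k_range[int(np.argmax(k_scores))]
print(best_k)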
Mean squared error (MSE) is generally used to judge the quality of a regression model.
import matplotlib.pyplot as plt

k_range = range(1, 31)
k_scores = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    # the scorer returns negative MSE, so negate it to get the loss
    loss = -cross_val_score(knn, X, y, cv=10, scoring='neg_mean_squared_error')
    k_scores.append(loss.mean())

plt.plot(k_range, k_scores)
plt.xlabel('Value of K for KNN')
plt.ylabel('Cross-validated MSE')
plt.show()
As the graph shows, the lower the MSE the better, so K values around 13 to 18 are best.
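To see what that scorer measures, here is a minimal sketch computing the MSE of a single model directly with sklearn.metrics.mean_squared_error (reusing the train/test split from section I; n_neighbors=13 is just one value from the range the plot favors):

from sklearn.metrics import mean_squared_error

knn = KNeighborsClassifier(n_neighbors=13)
knn.fit(X_train, y_train)
# MSE is the average of the squared differences between predictions and targets
print(mean_squared_error(y_test, knn.predict(X_test)))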
IV. Examining overfitting with the learning curve
from sklearn.model_selection import learning_curve  # learning-curve module
from sklearn.datasets import load_digits            # digits dataset
from sklearn.svm import SVC                         # support vector classifier
import matplotlib.pyplot as plt                     # visualization module
import numpy as np
Load the digits dataset, which contains handwritten digits from 0 to 9. The dataset has 1797 samples, each described by 64 features corresponding to the 8x8 pixels of the handwritten image; each pixel value lies in the range 0 to 16.
digits = load_digits()
X = digits.data
y = digits.target
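A quick check confirms the shape described above:

print(X.shape)  # (1797, 64): 1797 samples, 64 pixel features
print(y[:10])   # the first ten digit labels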
Observe how the learning curve changes as the training set grows, using 10-fold cross-validation (cv=10) and mean squared error (scoring='neg_mean_squared_error') to evaluate the model. The training sizes are examined in 5 rounds, from small to large (10%, 25%, 50%, 75%, 100%):
train_sizes, train_loss, test_loss = learning_curve(
    SVC(gamma=0.001), X, y, cv=10, scoring='neg_mean_squared_error',
    train_sizes=[0.1, 0.25, 0.5, 0.75, 1])
# Mean squared error averaged over the folds in each round
# (5 rounds: 10%, 25%, 50%, 75%, 100% of the samples)
train_loss_mean = -np.mean(train_loss, axis=1)
test_loss_mean = -np.mean(test_loss, axis=1)
Visualize the results:
plt.plot(train_sizes, train_loss_mean, 'o-', color="r",
         label="Training")
plt.plot(train_sizes, test_loss_mean, 'o-', color="g",
         label="Cross-validation")
plt.xlabel("Training examples")
plt.ylabel("Loss")
plt.legend(loc="best")
plt.show()
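With gamma=0.001 the training and cross-validation curves converge as the sample size grows, which indicates good generalization. As a hedged sketch (gamma=0.01 is an illustrative value, not from the original), rerunning the same call with a larger gamma makes the signature of overfitting visible: training loss stays near zero while cross-validation loss stops improving, leaving a persistent gap between the two curves.

# Illustrative only: a larger gamma tends to overfit this dataset
train_sizes, train_loss, test_loss = learning_curve(
    SVC(gamma=0.01), X, y, cv=10, scoring='neg_mean_squared_error',
    train_sizes=[0.1, 0.25, 0.5, 0.75, 1])
# plotting these exactly as above shows the gap between the two curves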