Cross-validation in sklearn
scikit-learn (sklearn) is a comprehensive and useful third-party machine learning library for Python. In this post I record how cross-validation is used in sklearn, mainly walking through the official documentation section "Cross-validation: evaluating estimator performance". If your English is up to it, I recommend reading the official documentation directly, since it covers the details thoroughly.
1. cross_val_score
cross_val_score runs cross-validation on the dataset a specified number of times and returns the evaluation score for each split.
By default the score computed at each split is the estimator's own score method (accuracy for classification, R² for regression); a different metric such as scoring='f1_macro' can be chosen instead.
The available metric names come from sklearn.metrics; you set the evaluation standard by passing the scoring parameter to cross_val_score (see the sketch after the session below);
When cv is an integer, KFold or StratifiedKFold is used by default to split the dataset (StratifiedKFold when the estimator is a classifier). KFold and StratifiedKFold are described below.
In [12]: import numpy as np
In [13]: from sklearn import svm, datasets
In [14]: iris = datasets.load_iris()
In [15]: from sklearn.model_selection import cross_val_score
In [16]: clf = svm.SVC(kernel='linear', C=1)
In [17]: scores = cross_val_score(clf, iris.data, iris.target, cv=5)
In [18]: scores
Out[18]: array([ 0.96666667,  1.        ,  0.96666667,  0.96666667,  1.        ])
In [19]: scores.mean()
Out[19]: 0.98000000000000009
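The session above uses the default metric. As a minimal self-contained sketch of the scoring parameter described earlier (the metric name 'f1_macro' comes from the official docs; everything else is standard sklearn API):

from sklearn import svm, datasets
from sklearn.model_selection import cross_val_score

iris = datasets.load_iris()
clf = svm.SVC(kernel='linear', C=1)

# Default: the estimator's own score method (accuracy for SVC).
acc = cross_val_score(clf, iris.data, iris.target, cv=5)

# Override the metric by name; the valid names are defined in sklearn.metrics.
f1 = cross_val_score(clf, iris.data, iris.target, cv=5, scoring='f1_macro')

print(acc.mean(), f1.mean())

On this balanced dataset accuracy and macro-F1 come out close, but on imbalanced data the choice of scoring matters.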
Besides the default cross-validation strategy, you can pass your own splitter as cv to control, for example, the number of splits and the train/test proportion:
In [20]: from sklearn.model_selection import ShuffleSplit
In [21]: n_samples = iris.data.shape[0]
In [22]: cv = ShuffleSplit(n_splits=3, test_size=.3, random_state=0)
In [23]: cross_val_score(clf, iris.data, iris.target, cv=cv)
Out[23]: array([ 0.97777778,  0.97777778,  1.        ])
2. cross_val_predict
cross_val_predict is very similar to cross_val_score, but it differs in what it returns: instead of scores, cross_val_predict returns the estimator's prediction for each sample (the class label for classification, the predicted value for regression), obtained while that sample was in the test fold. Comparing these predictions against the actual targets pinpoints exactly where the model goes wrong (as sketched after the session below), which matters for later model improvement, parameter optimization, and troubleshooting.
In [28]: from sklearn.model_selection import cross_val_predict
In [29]: from sklearn import metrics
In [30]: predicted = cross_val_predict(clf, iris.data, iris.target, cv=10)
In [31]: predicted
Out[31]:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
In [32]: metrics.accuracy_score(iris.target, predicted)
Out[32]: 0.96666666666666667
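Since the point of cross_val_predict is to locate prediction errors, here is a hedged sketch of doing exactly that with standard NumPy, reusing predicted and iris from the session above:

import numpy as np

# Indices where the cross-validated prediction disagrees with the truth.
wrong = np.where(predicted != iris.target)[0]
print(wrong)               # positions of the misclassified samples
print(iris.target[wrong])  # their actual classes
print(predicted[wrong])    # what the model predicted instead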
3. KFold
K-fold cross-validation divides the dataset into k parts (folds), so that every sample appears exactly once in a test set and k-1 times in a training set. The folds within any one split do not overlap, which makes this equivalent to sampling without replacement.
In [33]: from sklearn.model_selection import KFold
In [34]: X = ['a', 'b', 'c', 'd']
In [35]: kf = KFold(n_splits=2)
In [36]: for train, test in kf.split(X):
    ...:     print(train, test)
    ...:     print(np.array(X)[train], np.array(X)[test])
    ...:
[2 3] [0 1]
['c' 'd'] ['a' 'b']
[0 1] [2 3]
['a' 'b'] ['c' 'd']
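One caveat: KFold preserves the original sample order by default, so the data is not shuffled unless you ask for it. A minimal sketch using KFold's shuffle and random_state parameters:

from sklearn.model_selection import KFold

X = ['a', 'b', 'c', 'd']
kf = KFold(n_splits=2, shuffle=True, random_state=0)  # shuffle indices before splitting
for train, test in kf.split(X):
    print(train, test)  # folds are now random but reproducible via random_state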
4. LeaveOneOut
LeaveOneOut is actually a special case of KFold in which each fold contains a single sample. Because it is used so often it gets its own class, but it can be reproduced entirely with KFold, as shown below.
In [37]: from sklearn.model_selection import LeaveOneOut
In [38]: X = [1, 2, 3, 4]
In [39]: loo = LeaveOneOut()
In [40]: for train, test in loo.split(X):
    ...:     print(train, test)
    ...:
[1 2 3] [0]
[0 2 3] [1]
[0 1 3] [2]
[0 1 2] [3]

# Use KFold to implement LeaveOneOut
In [41]: kf = KFold(n_splits=len(X))
In [42]: for train, test in kf.split(X):
    ...:     print(train, test)
    ...:
[1 2 3] [0]
[0 2 3] [1]
[0 1 3] [2]
[0 1 2] [3]
5. LeavePOut
LeavePOut generalizes LeaveOneOut: it leaves p samples out as the test set and enumerates every possible combination of p test samples. For p > 1 the test sets overlap, so unlike LeaveOneOut it cannot be reproduced with KFold in a simple way.
In [44]: from sklearn.model_selection import LeavePOut
In [45]: X = np.ones(4)
In [46]: lpo = LeavePOut(p=2)
In [47]: for train, test in lpo.split(X):
    ...:     print(train, test)
    ...:
[2 3] [0 1]
[1 3] [0 2]
[1 2] [0 3]
[0 3] [1 2]
[0 2] [1 3]
[0 1] [2 3]
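A small sketch to confirm the enumeration claim above: the number of LeavePOut splits equals the number of combinations C(n, p) (math.comb needs Python 3.8+):

from math import comb
import numpy as np
from sklearn.model_selection import LeavePOut

X = np.ones(4)
lpo = LeavePOut(p=2)
print(lpo.get_n_splits(X), comb(4, 2))  # 6 6 -- every pair is left out exactly once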
6. ShuffleSplit
ShuffleSplit looks similar in usage to LeavePOut, but the two are in fact completely different. With LeavePOut, once all the splits are generated every element has appeared in some test set, i.e. sampling without replacement across splits. ShuffleSplit instead draws an independent random split each time, i.e. sampling with replacement across splits; one can only say that, given enough splits, the test sets will most likely have covered the whole dataset.
In [48]: from sklearn.model_selection import ShuffleSplit
In [49]: X = np.arange(5)
In [50]: ss = ShuffleSplit(n_splits=3, test_size=.25, random_state=0)
In [51]: for train_index, test_index in ss.split(X):
    ...:     print(train_index, test_index)
    ...:
[1 3 4] [2 0]
[1 4 3] [0 2]
[4 0 2] [1 3]
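To make the with-replacement behavior concrete, here is a hedged sketch that collects the test indices across splits; depending on random_state, some indices recur while others may never appear (in the session above, index 4 is never tested):

import numpy as np
from sklearn.model_selection import ShuffleSplit

X = np.arange(5)
ss = ShuffleSplit(n_splits=3, test_size=.25, random_state=0)

seen = []
for _, test_index in ss.split(X):
    seen.extend(test_index.tolist())

# No guarantee that every index appears, nor that any appears only once.
print(sorted(seen))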
7. StratifiedKFold
StratifiedKFold also samples the test set without replacement, but additionally preserves the class proportions: each fold contains approximately the same percentage of samples from each class as the complete dataset.
In [52]: from sklearn.model_selection import StratifiedKFold
In [53]: X = np.ones(10)
In [54]: y = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
In [55]: skf = StratifiedKFold(n_splits=3)
In [56]: for train, test in skf.split(X, y):
    ...:     print(train, test)
    ...:
[2 3 6 7 8 9] [0 1 4 5]
[0 1 3 4 5 8 9] [2 6 7]
[0 1 2 4 5 6 7] [3 8 9]
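A minimal sketch making the stratification visible: np.bincount counts the labels inside each test fold, and the 4:6 class ratio of y is roughly preserved:

import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.ones(10)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
skf = StratifiedKFold(n_splits=3)
for train, test in skf.split(X, y):
    print(np.bincount(y[test]))  # per-fold counts of class 0 and class 1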
Original: 71915259