This article introduces cross-validation and how to implement it in Python. I hope readers who need this material find it a useful reference.
Two methods of model selection: regularization (the typical method) and cross-validation. Cross-validation and its Python implementation are described here.
Cross-validation
If the available sample data is sufficient, a simple way to perform model selection is to randomly divide the dataset into three parts: a training set, a validation set, and a test set.
Training set: used to train the models
Validation set: used to select among models
Test set: used for the final evaluation of the chosen model
Among candidate models of different complexity, select the one with the smallest prediction error on the validation set. Since the validation set contains enough data, using it for model selection is valid. In the many practical applications where data is insufficient, however, a cross-validation approach can be used instead.
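As an illustration of this selection rule, here is a minimal sketch (my own addition, not from the original article): it generates synthetic data, fits polynomial regression models of increasing degree, and keeps the degree with the smallest validation error. The data, the polynomial model family, and the use of sklearn.model_selection are all assumptions made for the example.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(300, 1))            # synthetic data, for illustration only
y = np.sin(X).ravel() + rng.normal(0, 0.3, 300)

# 60% training set, 20% validation set, 20% test set
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_valid, X_test, y_valid, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

best_degree, best_mse = None, float("inf")
for degree in range(1, 10):                      # candidate models of increasing complexity
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    mse = mean_squared_error(y_valid, model.predict(X_valid))
    if mse < best_mse:                           # keep the model with the smallest validation error
        best_degree, best_mse = degree, mse

# the final evaluation of the selected model would then use X_test / y_test
print("selected polynomial degree:", best_degree)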
Basic idea: reuse the data by repeatedly splitting it into training and test sets, and perform model selection on the basis of repeated training and testing.
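One common way to realize this repeated use of the data is k-fold cross-validation. The sketch below is my own illustration (the article itself only walks through simple cross-validation); the iris dataset and logistic regression are placeholders chosen for the example.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

scores = []
for train_idx, test_idx in kf.split(X):
    # each sample is used for training in some folds and for testing in exactly one fold
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))

print("mean accuracy over 5 folds:", np.mean(scores))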
Simple cross-validation:
Randomly divide the data into two parts, a training set and a test set; typically 70% of the data is used for training and 30% for testing.
Code (dividing training set, test set):
from sklearn.cross_validation import train_test_split
# data: all sample features; labels: all target values
# X_train / y_train: training set; X_test / y_test: test set
X_train, X_test, y_train, y_test = train_test_split(data, labels, test_size=0.25, random_state=0)
# here the training set is 75% of the data and the test set is 25%
About the random_state parameter
Interpretation of the source docstring:
random_state : int, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator;
If RandomState instance, random_state is the random number generator;
If None, the random number generator is the RandomState instance used by np.random.
If you set random_state to a specific value, e.g. random_state=10, the split is the same every time, no matter how many times you run the code. If you set it to None, i.e. random_state=None, the split is different on every run.
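A quick way to check this behavior (illustrative code, not from the article; the toy arrays are placeholders):
import numpy as np
from sklearn.model_selection import train_test_split

data = np.arange(20).reshape(10, 2)
labels = np.arange(10)

a, _, _, _ = train_test_split(data, labels, test_size=0.25, random_state=10)
b, _, _, _ = train_test_split(data, labels, test_size=0.25, random_state=10)
print((a == b).all())     # True: the same seed always produces the same split

c, _, _, _ = train_test_split(data, labels, test_size=0.25, random_state=None)
d, _, _, _ = train_test_split(data, labels, test_size=0.25, random_state=None)
print((c == d).all())     # usually False: each run produces a different split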
Code (dividing training sets, validation sets, test sets):
from sklearn import cross_validation
# first divide the data into two parts: (training + validation) and the test set
train_and_valid, test = cross_validation.train_test_split(data, test_size=0.3, random_state=0)
# then divide the (training + validation) part into the training set and the validation set
train, valid = cross_validation.train_test_split(train_and_valid, test_size=0.5, random_state=0)
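For reference, here is the same three-way split written against sklearn.model_selection (the older sklearn.cross_validation module was deprecated and later removed); the placeholder array is my own, and with these proportions 100 samples end up as 35 for training, 35 for validation, and 30 for testing.
import numpy as np
from sklearn.model_selection import train_test_split

data = np.arange(200).reshape(100, 2)            # placeholder data

train_and_valid, test = train_test_split(data, test_size=0.3, random_state=0)
train, valid = train_test_split(train_and_valid, test_size=0.5, random_state=0)

print(len(train), len(valid), len(test))         # 35 35 30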