In supervised machine learning, the dataset is usually divided into two or three parts:
training set, validation set, and test set.
It is common to divide the sample into three disjoint parts: a training set, a validation set, and a test set. The training set is used to fit the model; the validation set is used to choose the network structure or to control the complexity of the model; and the test set measures the performance of the finally selected model. A typical split is 50% of the sample for training and 25% each for validation and testing, with all three parts drawn at random from the sample.
When the sample is small, the split above is not appropriate. A common practice is to hold out a small portion as the test set and apply K-fold cross-validation to the remaining N samples: shuffle the samples, divide them evenly into K parts, then in turn use K-1 parts for training and the remaining part for validation, computing the sum of squared prediction errors each time. The average of the K error sums is then used as the criterion for selecting the optimal model structure. The special case K = N is known as leave-one-out cross-validation.
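The K-fold procedure above can be sketched as follows. This is a minimal illustration using only the standard library; `k_fold_splits` is a hypothetical helper, not part of any library:

```python
import random

def k_fold_splits(n_samples, k, seed=0):
    """Shuffle the indices 0..n_samples-1 and yield (train_idx, val_idx)
    pairs, one pair per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    fold_size = n_samples // k
    for fold in range(k):
        start, stop = fold * fold_size, (fold + 1) * fold_size
        # The last fold absorbs any remainder when n_samples % k != 0.
        if fold == k - 1:
            stop = n_samples
        val_idx = indices[start:stop]
        train_idx = indices[:start] + indices[stop:]
        yield train_idx, val_idx

# Example: 10 samples, 5 folds -> each fold validates on 2 samples
# and trains on the other 8; every sample is validated exactly once.
for train_idx, val_idx in k_fold_splits(10, 5):
    assert len(val_idx) == 2 and len(train_idx) == 8
```

In the model-selection loop described in the text, one would fit the candidate model on each `train_idx` subset, accumulate the squared prediction error on the corresponding `val_idx` subset, and average over the K folds.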
The training set is used to train the model, that is, to determine its parameters, such as the weights in an ANN. The validation set is used for model selection, that is, the final tuning and determination of the model, such as the structure of an ANN. The test set purely measures how well the trained model generalizes. Of course, the test set does not guarantee the correctness of the model; it only indicates that similar data run through the model will produce similar results. In practice, however, the dataset is often divided into just two parts, a training set and a test set, and many articles do not involve a validation set at all.
Train
Training data: the model is fitted, i.e. built, from this part of the data.
This is a portion of the data for which we already know both inputs and outputs; the machine learns from it by fitting, finding the model's initial parameters. For example, in a neural network, we use the training data and the backpropagation algorithm to find the optimal weights for each neuron.
Validation
Validation data: training builds a model, but the model's performance is only demonstrated on the training data and does not necessarily carry over to other data of the same kind. So before modeling, the data is split into two parts, training data and validation data (the ratio is roughly 7:3, depending on the validation method you use). Alternatively, you may have trained several models without knowing which performs better; feeding the validation data into each model lets you compare them.
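A minimal sketch of the 7:3 split and model comparison described above, using only the standard library; `train_val_split`, `mse`, and the toy models are hypothetical names introduced for illustration:

```python
import random

def train_val_split(data, val_fraction=0.3, seed=0):
    """Shuffle the data and split it into training and validation parts."""
    data = data[:]
    random.Random(seed).shuffle(data)
    n_val = int(len(data) * val_fraction)
    return data[n_val:], data[:n_val]  # (train, validation)

def mse(model, pairs):
    """Mean squared error of a predictor over (x, y) pairs."""
    return sum((model(x) - y) ** 2 for x, y in pairs) / len(pairs)

# Toy data following y = 2x, with two candidate models to compare.
pairs = [(x, 2 * x) for x in range(10)]
train, val = train_val_split(pairs)

model_a = lambda x: 2 * x   # matches the data-generating rule
model_b = lambda x: x + 1   # does not

# The candidate with the lower validation error is preferred.
best = min([model_a, model_b], key=lambda m: mse(m, val))
assert best is model_a
```

In a real workflow the candidates would be models fitted on `train` with different hyperparameters; the comparison on `val` works the same way.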
This is a set of data for which we already know inputs and outputs; it lets the machine tune the model's hyperparameters. In a neural network, we use the validation data to find the optimal network depth (number of hidden layers) or to determine the stopping point of the backpropagation algorithm. Cross-validation, common in machine learning, repeatedly partitions the training data itself into different validation sets to train and select the model.
Test
Test data: the key difference is that the training and validation data both come from the pool used to build the model, whereas the test data must be independent of that pool, so that it checks the model's stability on data it has never seen.
The test set is the data on which the user evaluates the model, judging how good the model is according to the error (usually the difference between the predicted output and the actual output).
Why are both a validation set and a test set required?
Because the validation set is used to tune the model's parameters and select the optimal model, the model has effectively already seen its inputs and outputs, so the error measured on the validation set is a biased (optimistic) estimate of the generalization error.
The test set, by contrast, is used only to evaluate the model's performance; we never adjust or optimize the model based on it.
In traditional machine learning, the usual ratio of the three parts is training/validation/test = 50/25/25. But if the model needs little tuning and only has to fit, or if training itself already incorporates validation (as in cross-validation), a split of training/test = 70/30 is also common.
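The classic 50/25/25 partition can be sketched as a single function; `three_way_split` is a hypothetical helper written for illustration:

```python
import random

def three_way_split(data, ratios=(0.5, 0.25, 0.25), seed=0):
    """Randomly partition the data into training, validation, and test
    sets according to the given ratios (defaults to 50/25/25)."""
    data = data[:]
    random.Random(seed).shuffle(data)
    n = len(data)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    train = data[:n_train]
    val = data[n_train:n_train + n_val]
    test = data[n_train + n_val:]
    return train, val, test

# With 100 samples: 50 for training, 25 for validation, 25 for testing.
train, val, test = three_way_split(list(range(100)))
```

Passing `ratios=(0.7, 0.0, 0.3)` reproduces the 70/30 training/test split mentioned above.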
In deep learning, however, the data volume is typically very large and training a neural network requires a great deal of data, so a larger share can be allocated to training, with the validation and test shares reduced accordingly.