A data set is generally divided into three parts: the training set, the validation set, and the test set.
They are used, respectively, to train the model, tune the hyper-parameters, and test the final model.
The validation set is also called the development set, or dev set for short.
Cross-validation (hold-out cross-validation)
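As a minimal sketch of the k-fold form of cross-validation: indices are shuffled, cut into k parts, and each part serves once as the validation set while the rest form the training set. The names `k_fold_indices`, `cross_validate`, and the `fit_and_score` callback are hypothetical, chosen only for illustration, and the callback in the usage lines is a placeholder rather than a real metric.

```python
import random
from statistics import mean


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, valid_idx) pairs, one pair per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)         # random assignment to folds
    folds = [indices[i::k] for i in range(k)]    # k nearly equal parts
    for i in range(k):
        valid_idx = folds[i]
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, valid_idx


def cross_validate(n_samples, k, fit_and_score, seed=0):
    """Call fit_and_score(train_idx, valid_idx) per fold and average the scores."""
    scores = [fit_and_score(train_idx, valid_idx)
              for train_idx, valid_idx in k_fold_indices(n_samples, k, seed)]
    return mean(scores), scores


# Usage with a placeholder callback that only reports the validation fold size;
# a real callback would train a model on train_idx and score it on valid_idx.
average, per_fold = cross_validate(100, 10, lambda train_idx, valid_idx: len(valid_idx))
```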
Randomly extract part of the data from the data set to build the model, and use the rest of the data to test it. The most common form is 10-fold cross-validation: the training set is randomly divided into 10 parts, and each part in turn serves as the validation set while the remaining parts form the training set. With n parts, this gives n models and n validation results. The average of these n results is used to measure the performance of the model.
Distribution ratio
In the traditional machine learning era, when data sets were comparatively small, the usual train:validation:test ratio is 6:2:2.
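A split by ratio can be sketched as follows. This is an illustration only: `split_by_ratio` is a hypothetical helper, it assumes integer ratio parts, and it sends any rounding remainder to the test set.

```python
import random


def split_by_ratio(items, ratios=(6, 2, 2), seed=0):
    """Shuffle items, then cut them into train/validation/test by integer ratios."""
    items = list(items)
    random.Random(seed).shuffle(items)
    total = sum(ratios)
    n = len(items)
    n_train = n * ratios[0] // total
    n_valid = n * ratios[1] // total
    train = items[:n_train]
    valid = items[n_train:n_train + n_valid]
    test = items[n_train + n_valid:]   # whatever is left goes to the test set
    return train, valid, test


# Usage: 10,000 examples split 6:2:2.
train, valid, test = split_by_ratio(range(10_000), ratios=(6, 2, 2))
```

The `ratios` argument accepts other integer ratios in the same way.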
In the era of big data, this ratio is less applicable. With a data set of a million examples, even 1% of the data set aside for testing is 10,000 examples, which is enough, so more of the data can be used for training. Common ratios reach 98:1:1, or even 99.5:0.4:0.1.
Mismatched train/test distribution
In real projects, the training set sometimes comes from a different distribution than the validation and test sets.
For example, the training set consists of cat pictures crawled from the Internet, while the validation set and the test set are photos that users take with their mobile phones.
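One way to arrange the cat example in code is sketched below, under the assumption that the dev and test sets are both drawn from a single shuffled pool of phone photos. `build_splits` and its parameters are hypothetical names for illustration.

```python
import random


def build_splits(crawled, phone, n_dev, n_test, seed=0):
    """Train on crawled data; draw dev and test from the same phone-photo pool."""
    phone = list(phone)
    if n_dev + n_test > len(phone):
        raise ValueError("not enough phone photos for the requested dev/test sizes")
    random.Random(seed).shuffle(phone)   # one shuffle, so dev and test share a source
    dev = phone[:n_dev]
    test = phone[n_dev:n_dev + n_test]
    train = list(crawled)                # training data comes only from the crawl
    return train, dev, test
```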
In this case, make sure that the validation set and the test set come from the same distribution; otherwise the evaluation of the model is unreliable.
Only a train set and a dev set, without a test set
In this case, many teams refer to the dev set as the test set.