Training set, validation set, and test set

Source: Internet
Author: User

The following is reposted from http://www.cnblogs.com/xfzhang/archive/2013/05/24/3096412.html

In supervised machine learning, datasets are usually divided into two or three parts: the training set (train set), the validation set (validation set), and the test set (test set).

http://blog.sina.com.cn/s/blog_4d2f6cf201000cjx.html

It is generally necessary to divide the sample into three separate parts: the training set (train set), the validation set (validation set), and the test set (test set). The training set is used to estimate the model, the validation set is used to determine the network structure or to control the complexity of the model, and the test set measures the performance of the finally selected model. A typical division is 50% of the total sample for the training set and 25% each for the other two, with all three parts drawn randomly from the sample.
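
A minimal sketch of that division (assuming Python with NumPy; the function name and the 50/25/25 defaults are illustrative):

import numpy as np

def split_dataset(X, y, train=0.5, val=0.25, seed=0):
    """Randomly divide a sample into training, validation, and test parts."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))           # draw the parts randomly
    n_train = int(train * len(X))
    n_val = int(val * len(X))
    train_idx = idx[:n_train]               # 50% of the total sample
    val_idx = idx[n_train:n_train + n_val]  # 25% for validation
    test_idx = idx[n_train + n_val:]        # remaining 25% for testing
    return (X[train_idx], y[train_idx],
            X[val_idx], y[val_idx],
            X[test_idx], y[test_idx])
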
When the sample is small, the division above is not appropriate. The usual practice is to set aside a small portion as the test set and then apply K-fold cross-validation to the remaining N samples: shuffle the samples, divide them evenly into K parts, take K-1 parts in turn for training and the remaining part for validation, compute the sum of squared prediction errors, and finally average the K error sums to obtain the basis for selecting the optimal model structure. The special case K = N is the leave-one-out method.
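
A sketch of that K-fold procedure (assuming Python with NumPy; the fit and predict callables are hypothetical placeholders for whatever model is being selected, and setting k equal to the number of samples gives leave-one-out):

import numpy as np

def k_fold_sse(X, y, fit, predict, k=5, seed=0):
    """Average the sum of squared prediction errors over K folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))       # shuffle the samples
    folds = np.array_split(idx, k)      # divide evenly into K parts
    errors = []
    for i in range(k):
        val_idx = folds[i]              # one part for validation, in turn
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train_idx], y[train_idx])    # train on the other K-1 parts
        residual = y[val_idx] - predict(model, X[val_idx])
        errors.append(np.sum(residual ** 2))       # sum of squared prediction errors
    return np.mean(errors)              # average of the K error sums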

http://www.cppblog.com/guijie/archive/2008/07/29/57407.html

These three terms are extremely common in the field of machine learning, but many people are not particularly clear about their meanings, especially the latter two, which are often confused with each other. Ripley, B.D. (1996) gives definitions of these three terms in his classic monograph Pattern Recognition and Neural Networks:
Training set: A set of examples used for learning, that is to fit the parameters [i.e., weights] of the classifier.
Validation set: A set of examples used to tune the parameters [i.e., architecture, not weights] of a classifier, for example to choose the number of hidden units in a neural network.
Test set: A set of examples used only to assess the performance [generalization] of a fully specified classifier.
Obviously, the training set is used to train the model, i.e., to determine the model's parameters, such as the weights in an ANN; the validation set is used for model selection, i.e., to finalize and optimize the model, such as the structure of an ANN; and the test set purely measures how well the trained model generalizes. Of course, the test set does not guarantee the correctness of the model; it only says that similar data will produce similar results with this model. In practice, however, the data set is usually divided into just two parts, a training set and a test set, and most articles do not involve a validation set.
Ripley also explained why the test and validation sets should be separate:
1. The error rate estimate of the final model on validation data will be biased (smaller than the true error rate) since the validation set is used to select the final model.
2. After assessing the final model with the test set, you must not tune the model any further.

http://stats.stackexchange.com/questions/19048/what-is-the-difference-between-test-set-and-validation-set

Step 1) Training: each type of algorithm has its own parameter options (the number of layers in a neural network, the number of trees in a random forest, etc.). For each of your algorithms, you must pick one option. That's why we have a validation set.

Step 2) Validating: you now have a collection of algorithms. You must pick one algorithm. That's why we have a test set. Most people pick the algorithm that performs best on the validation set (and that's OK). But if you do not measure your top-performing algorithm's error rate on the test set, and just go with its error rate on the validation set, then you have blindly mistaken the "best possible scenario" for the "most likely scenario." That's a recipe for disaster.

Step 3) Testing: I suppose that if your algorithms did not have any parameters, then you would not need a third step. In that case, your validation step would be your test step. Perhaps Matlab does not ask you for parameters, or you have chosen not to use them, and that is the source of your confusion.
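
A minimal sketch of steps 1 and 2 (assuming Python; fit and accuracy are hypothetical callables standing in for any concrete toolkit, and the candidates are parameter options such as layer counts or tree counts):

def select_model(candidates, fit, accuracy, train_data, val_data, test_data):
    """Pick the parameter option that scores best on the validation set,
    then measure the chosen model once on the held-out test set."""
    best_model, best_acc = None, float("-inf")
    for params in candidates:
        model = fit(train_data, params)    # step 1: fit each option on the training data
        acc = accuracy(model, val_data)    # step 2: compare options on the validation data
        if acc > best_acc:
            best_model, best_acc = model, acc
    # step 3: the untouched test set gives the realistic error estimate,
    # not the validation score that was used to pick the winner
    return best_model, accuracy(best_model, test_data)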

My idea is that those options in the Neural Network Toolbox are for avoiding overfitting. In that situation the weights fit the training data only and do not capture the global trend. By having a validation set, training can continue as long as decreases in training error are accompanied by decreases in validation error; when the validation error increases while the training error keeps decreasing, that demonstrates the overfitting phenomenon.

http://blog.sciencenet.cn/blog-397960-666113.html

http://stackoverflow.com/questions/2976452/whats-is-the-difference-between-train-validation-and-test-set-in-neural-networ

For each epoch
    For each training data instance
        Propagate error through the network
        Adjust the weights
        Calculate the accuracy over training data
    For each validation data instance
        Calculate the accuracy over the validation data
    If the threshold validation accuracy is met
        Exit training
    Else
        Continue training

Once you're finished training, you run against your testing set and verify that the accuracy is sufficient.
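
The same loop in runnable form (a minimal sketch, assuming Python with NumPy and using logistic regression as a stand-in model; the threshold and learning rate are illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_with_validation(X_tr, y_tr, X_va, y_va,
                          lr=0.1, epochs=100, threshold=0.9):
    """Mirror of the pseudocode above: adjust weights on each training
    instance, then check accuracy on the validation set every epoch."""
    w = np.zeros(X_tr.shape[1])
    for epoch in range(epochs):
        for x, y in zip(X_tr, y_tr):           # for each training data instance
            err = sigmoid(x @ w) - y           # propagate the error
            w -= lr * err * x                  # adjust the weights
        val_acc = np.mean((sigmoid(X_va @ w) > 0.5) == y_va)
        if val_acc >= threshold:               # threshold validation accuracy met
            break                              # exit training
    return w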

Training set: this data set is used to adjust the weights on the neural network.

Validation set: this data set is used to minimize overfitting. You're not adjusting the weights of the network with this data set; you're just verifying that any increase in accuracy over the training data set actually yields an increase in accuracy over a data set that has not been shown to the network before, or at least that the network hasn't trained on (i.e., the validation data set). If the accuracy over the training data set increases but the accuracy over the validation data set stays the same or decreases, then you're overfitting your neural network and you should stop training.

Testing set: this data set is used only for testing the final solution, in order to confirm the actual predictive power of the network.

The validation set is used in the process of training; the testing set is not. The testing set allows you to see:

1) whether the training set was enough, and
2) whether the validation set did its job of preventing overfitting.

If you use the testing set in the process of training, then it will just be another validation set, and it won't show what happens when new data is fed into the network.

Training set: A set of examples used for learning, that is to fit the parameters [i.e., weights] of the classifier.

Validation set: A set of examples used to tune the parameters [i.e., architecture, not weights] of a classifier, for example to choose the number of hidden units in a neural network.

Test set: A set of examples used only to assess the performance [generalization] of a fully specified classifier.

The error surface is different for different sets of data from your data set (batch learning). Therefore, if you find a very good local minimum for your test set data, it may not be a very good point, and may even be a very bad point, on the surface generated by some other set of data for the same problem. Therefore you need to compute a model which not only finds a good weight configuration for the training set but is also able to predict new data (which was not in the training set) with a good error. In other words, the network should be able to generalize from the examples, so that it learns the data and does not simply remember or load the training set by overfitting the training data.

The validation data set is a set of data for the function you want to learn which you are not directly using to train the network. You are training the network with a set of data called the training data set. If you are using a gradient-based algorithm to train the network, then the error surface and the gradient at some point will completely depend on the training data set; thus the training data set is being directly used to adjust the weights. To make sure you don't overfit the network, you need to input the validation data set to the network and check whether the error is within some range. Because the validation set is not being used directly to adjust the weights of the network, a good error on the validation set (and also the test set) indicates that the network not only predicts well for the training set examples but is also expected to perform well when a new example is presented to it that was not used in the training process.

Early stopping is a way to stop training. There are different variations available, but the main outline is this: both the training and the validation set errors are monitored; the training error decreases at each iteration (backpropagation and its kin), and at first the validation error decreases too. Training is stopped at the moment the validation error starts to rise. The weight configuration at that point indicates a model which predicts the training data well, as well as data which was not seen by the network. But because the validation data actually affects the weight configuration indirectly, by selecting the weight configuration, this is where the test set comes in. This set of data was never used in the training process. Once a model is selected based on the validation set, the test set data is applied to the network model and the error for this set is found. This error is representative of the error which we can expect from absolutely new data for the same problem.
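
One common variation of that outline, sketched with hypothetical train_epoch and val_error callables (the patience counter and best-weights bookkeeping are assumptions for illustration, not something the quoted text specifies):

def early_stopping(train_epoch, val_error, max_epochs=1000, patience=5):
    """Monitor the validation error each epoch and keep the weight
    configuration that predicted the unseen validation data best."""
    best_err, best_weights, bad_epochs = float("inf"), None, 0
    for epoch in range(max_epochs):
        weights = train_epoch()        # one backprop pass over the training set
        err = val_error(weights)       # monitor the validation error
        if err < best_err:
            best_err, best_weights, bad_epochs = err, weights, 0
        else:
            bad_epochs += 1            # validation error has started to rise
            if bad_epochs >= patience:
                break                  # stop training
    return best_weights                # apply the test set to this model once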
