Data sets in machine learning

Source: Internet
Author: User
Data Set Classification

In supervised machine learning, data sets are usually divided into two or three groups: the training set, the validation set, and the test set.

The training set is used to fit the model, the validation set is used to determine the network structure or the parameters that control the complexity of the model, and the test set checks the performance of the model that is finally selected as the best.
Ripley, B.D. (1996) defines these three terms in his classic monograph Pattern Recognition and Neural Networks.

Training set:
A set of examples used for learning, that is to fit the parameters [i.e., weights] of the classifier.
Validation set:
A set of examples used to tune the parameters [i.e., architecture, not weights] of a classifier, for example to choose the number of hidden units in a neural network.
Test set:
A set of examples used only to assess the performance [generalization] of a fully specified classifier.

In other words, the training set is used to train the model, that is, to determine its parameters, such as the weights of an ANN. The validation set is used for model selection, that is, for the final tuning and choice of the model, such as the structure of an ANN. The test set serves purely to measure how well the trained model generalizes. The test set does not guarantee that the model is correct; it only indicates that similar data fed to the model will produce similar results. In practice, however, the data set is often divided into just two parts, a training set and a test set, and most articles do not mention a validation set.

Select a training set and a test set
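
The procedure this section describes (a random 50/25/25 division, and K-fold cross-validation when the sample is small) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a reference implementation: the function names `three_way_split`, `k_fold_splits` and `cross_validation_error`, the `fit` and `predict` callables, and the fixed random seed are all hypothetical choices made for this example, and each sample is assumed to be an `(input, target)` pair.

```python
import random


def three_way_split(samples, train_frac=0.5, val_frac=0.25, seed=0):
    """Shuffle the samples and cut them into training, validation and test lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains becomes the test set
    return train, val, test


def k_fold_splits(samples, k, seed=0):
    """Yield (training, validation) pairs, one per fold."""
    if not 2 <= k <= len(samples):
        raise ValueError("k must be between 2 and the number of samples")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    # Deal the shuffled samples into k parts of (nearly) equal size.
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield training, validation


def cross_validation_error(samples, k, fit, predict, seed=0):
    """Average, over the k folds, of the sum of squared prediction errors."""
    fold_errors = []
    for training, validation in k_fold_splits(samples, k, seed):
        model = fit(training)  # hypothetical user-supplied training function
        sse = sum((predict(model, x) - y) ** 2 for x, y in validation)
        fold_errors.append(sse)
    return sum(fold_errors) / len(fold_errors)


if __name__ == "__main__":
    # Toy data and a trivial "model" that always predicts the training mean.
    data = [(x, 2.0 * x) for x in range(10)]
    fit_mean = lambda rows: sum(y for _, y in rows) / len(rows)
    predict_mean = lambda model, x: model
    print(cross_validation_error(data, 5, fit_mean, predict_mean))
    # Setting k to the number of samples gives leave-one-out.
    print(cross_validation_error(data, len(data), fit_mean, predict_mean))
```

To compare candidate model structures, one would call `cross_validation_error` once per candidate with the same `k` and `seed` and keep the candidate with the smallest average error.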

One typical division assigns 50% of the total sample to the training set and 25% each to the other two sets, with all three drawn at random from the sample. When the sample is small, this division is not appropriate. The usual approach is to set aside a small part as the test set and then apply K-fold cross-validation to the remaining N samples: shuffle the samples, divide them evenly into K parts, and take turns using K-1 parts for training and the remaining part for validation, computing the sum of squared prediction errors each time. The average of the K sums of squared errors then serves as the basis for selecting the optimal model structure. In the special case K = N, this is the leave-one-out method.

Normalization of data

Data normalization is part of data preprocessing. Depending on which sigmoid function is used, the network output lies between 0 and 1 or between -1 and 1, so if the data are not normalized, the sample outputs may fall outside the range the neural network can produce. Choose a maximum value max and a minimum value min and apply the following transformation:

x = (x - min) / (max - min)

This transformation is normalization.
Note that max and min should not simply be set to the largest and smallest values observed in x. The sample is only a finite set of observations, and larger or smaller values may occur later, so max should be chosen somewhat larger than xmax and min somewhat smaller than xmin. Normalization is not always the right preprocessing step: it is a linear rescaling, so it does not make an asymmetric sample distribution more symmetric, and in such cases standardization may be a better choice. In addition, principal component analysis can be used to reduce dimensionality.
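
The two rescalings mentioned above can be sketched as follows. This is a minimal illustration, not a prescribed method: the function names and the `margin` parameter are hypothetical, and padding the bounds by a fraction of the observed spread is just one assumed way to pick max larger than xmax and min smaller than xmin. Standardization is taken here to mean subtracting the mean and dividing by the sample standard deviation.

```python
import statistics


def min_max_normalize(values, lower=None, upper=None, margin=0.1):
    """Apply x = (x - min) / (max - min) with bounds padded beyond the data."""
    observed_min, observed_max = min(values), max(values)
    spread = observed_max - observed_min
    # Assumption: widen each bound by `margin` times the observed spread
    # unless the caller supplies explicit bounds.
    if lower is None:
        lower = observed_min - margin * spread
    if upper is None:
        upper = observed_max + margin * spread
    if upper <= lower:
        raise ValueError("upper bound must exceed lower bound")
    return [(v - lower) / (upper - lower) for v in values]


def standardize(values):
    """Subtract the mean and divide by the sample standard deviation."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # needs at least two values
    if std == 0:
        raise ValueError("cannot standardize constant data")
    return [(v - mean) / std for v in values]


if __name__ == "__main__":
    x = [0.0, 5.0, 10.0]
    print(min_max_normalize(x, margin=0.25))
    print(standardize(x))
```

With a positive `margin`, every normalized value lies strictly inside the interval from 0 to 1, which leaves room for later observations that exceed the range seen in the sample.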
