[ML] Learning Theory


Up to now we have looked at how common machine learning algorithms work and how their training steps are implemented. In practice, however, applications vary widely: to do real engineering you must know how to evaluate the quality of a learned model for a specific problem, how to select and extract features sensibly, and how to tune parameters. These were also the steps I struggled with when I worked on pattern recognition: when the recognition rate was low, I was often unsure how to improve it. Should I change the model or the features, increase the number of training samples, optimize the iterative algorithm, or change the objective function? Learning theory gives us some guiding conclusions for these questions.

First, consider the bias-variance trade-off. Suppose the hypothesis set H contains k candidate models (k reflects the complexity of the model class) and the training set has m samples. Then, with probability at least 1 - delta, the bound test error <= training error + 2 * sqrt( log(2k/delta) / (2m) ) holds. The training-error term is the "bias" part: it measures how well the model fits the training samples; the larger it is, the worse the fit, i.e. the model is underfitting. The term 2 * sqrt( log(2k/delta) / (2m) ) is the "variance" part: the larger k (the more complex the model class) and the smaller m (the fewer the training samples), the larger this term and the worse the model's generalization ability, i.e. the model is overfitting.
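To get a feel for the bound, here is a small sketch in Python that evaluates the variance term 2 * sqrt( log(2k/delta) / (2m) ); the particular values of k, m and delta are made-up illustrations, not from the original note:

import math

def variance_term(k, m, delta=0.05):
    """Width of the uniform-convergence bound:
    test error <= training error + 2 * sqrt(log(2k/delta) / (2m))."""
    return 2.0 * math.sqrt(math.log(2.0 * k / delta) / (2.0 * m))

# Illustrative values: the gap grows with the model-class size k
# and shrinks as the number of training samples m increases.
for k in (10, 1000, 10**6):
    for m in (100, 10000):
        print(f"k={k:>7}, m={m:>5}: gap <= {variance_term(k, m):.3f}")

As expected, the printed gap widens as k grows and shrinks as m grows, which is exactly the trade-off described above.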

This result has a corollary: given delta and gamma, for the bound test error <= training error + 2 * gamma to hold with probability 1 - delta, the number of training samples must satisfy m >= O((1/gamma^2) * log(k/delta)). In other words, to keep the test error under control, m only needs to grow with the logarithm of the number of candidate models k. In practice, model complexity is usually not expressed through k directly. If the model has d real-valued parameters and each parameter is stored as a 64-bit double, then k is roughly 2^(64d), and the condition becomes m >= O((d/gamma^2) * log(1/delta)); that is, the required number of training samples grows roughly linearly with the number of model parameters d. This holds for a finite hypothesis set; for an infinite hypothesis set a similar conclusion is obtained with d replaced by the VC dimension of H. The VC dimension is usually on the order of the number of model parameters, but in special cases it need not depend on the dimension of the samples; SVM with a large margin is one example. Carrying out the bias-variance trade-off in practice amounts to model selection and feature selection. For model selection, the most practical approach is cross-validation: pick the model with the smallest validation error. For feature selection, you can use forward selection or backward selection to keep good features and drop bad ones, or use a filter method: compute the mutual information between each feature x_i and the label y and keep the features with the largest mutual information.
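As a concrete sketch of the filter approach combined with cross-validation-based model selection (it assumes scikit-learn is available; the synthetic dataset, the classifier and the choice of keeping the 10 top-scoring features are illustrative assumptions, not from the original note):

# Sketch only: data and the number of kept features are placeholders.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=40, n_informative=8,
                           random_state=0)

# Filter method: mutual information between each feature x_i and y.
mi = mutual_info_classif(X, y, random_state=0)
top = np.argsort(mi)[::-1][:10]          # keep the 10 most informative features

# Model selection by cross-validation: all features vs. the filtered subset.
clf = LogisticRegression(max_iter=1000)
print("all features :", cross_val_score(clf, X, y, cv=5).mean())
print("top-10 by MI :", cross_val_score(clf, X[:, top], y, cv=5).mean())

Forward or backward selection would wrap a similar cross-validation score inside a loop that adds or removes one feature at a time.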

The bias-variance trade-off seeks a balance between training error and generalization ability. Another way to strike this balance is to add regularization. Looking at machine learning from the statistical-inference perspective: learning without regularization corresponds to the frequentist approach, where the parameter theta is treated as an unknown but fixed quantity and learning means finding the theta that maximizes the likelihood of the observed X and Y; learning with regularization corresponds to the Bayesian approach, where theta is treated as a random variable with a known prior, and learning means finding the theta that maximizes the posterior probability. Concretely, adding regularization means adding a term lambda * ||theta||^2 to the objective function. For a regression problem, adding this regularization term makes the fitted curve smoother and effectively reduces overfitting.
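A minimal sketch of this idea on made-up data: adding lambda * ||theta||^2 to a least-squares objective gives the closed-form ridge solution theta = (X^T X + lambda*I)^(-1) X^T y, and the regularized coefficients come out much smaller, i.e. the fit is smoother. The data, polynomial degree and value of lambda below are illustrative assumptions:

# Sketch with made-up data: ridge regression as "least squares + lambda * ||theta||^2".
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(x.size)

# Degree-9 polynomial features: flexible enough to overfit 20 noisy points.
X = np.vander(x, N=10, increasing=True)

def ridge_fit(X, y, lam):
    """Closed-form minimizer of ||X theta - y||^2 + lam * ||theta||^2."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

theta_unreg = ridge_fit(X, y, lam=0.0)    # maximum likelihood (no prior)
theta_reg   = ridge_fit(X, y, lam=1e-3)   # MAP with a Gaussian prior on theta

# The regularized coefficients are much smaller, giving a smoother fit.
print("||theta|| without regularization:", np.linalg.norm(theta_unreg))
print("||theta|| with    regularization:", np.linalg.norm(theta_reg))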

After all this learning theory, let's return to the question raised at the beginning of this note: how do we improve a learning algorithm? First determine whether the problem is high bias or high variance. There are two ways to judge: 1. If the training error is low but the test error is much larger, it is a high variance problem; if the training error itself is large, it is a high bias problem. 2. Increase the number of training samples and watch how the two errors change: if the test error keeps decreasing toward the training error, it is a high variance problem; if both errors level off at a similarly high value, it is a high bias problem. Increasing the number of training samples or reducing the number of features helps with high variance; adding more features helps with high bias.
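The second diagnostic can be sketched as a simple learning-curve loop. The dataset, model and subset sizes below are made-up illustrations, assuming scikit-learn is available, purely to show how training and test error are tracked as m grows:

# Sketch: learning curves for the high-bias / high-variance diagnosis.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5,
                                                    random_state=0)

for m in (50, 100, 200, 400, 800):
    clf = LogisticRegression(max_iter=1000).fit(X_train[:m], y_train[:m])
    train_err = 1 - clf.score(X_train[:m], y_train[:m])
    test_err = 1 - clf.score(X_test, y_test)
    # High variance: test error keeps falling toward the training error as m grows.
    # High bias: both errors flatten out at a similarly high level.
    print(f"m={m:>4}  train error={train_err:.3f}  test error={test_err:.3f}")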

 

Original source: http://www.cnblogs.com/uchihaitachi/archive/2012/09/11/2680410.html
