California Institute of Technology Open Course: Machine Learning and Data Mining - Three Learning Principles (Lecture 17)


Course Description: This lecture covers things you should be aware of when doing machine learning: Occam's Razor, sampling bias, and data snooping.
Syllabus:
1. Occam's Razor
2. Sampling Bias
3. Data Snooping
1. Occam's Razor

Einstein is credited with the saying: "An explanation of the data should be made as simple as possible, but no simpler." Software engineering has a similar maxim: Keep It Simple, Stupid (KISS). Machine learning has its own version:

The simplest model that fits the data is also the most plausible. (Occam's Razor)

Since keeping things simple is so important, two questions follow:
1) What does "simple" mean for a model?
2) Why is simple generally better?
First, the first question: what does "simple" mean? There are two kinds of complexity in machine learning: the complexity of a single model (a hypothesis) and the complexity of the hypothesis set.

Two common measures of model complexity are the minimum description length (MDL) and the order of the model. Under MDL, the less information needed to describe a model, the simpler it is. For example, a sequence whose first term is 2 and whose i-th term is twice the (i-1)-th term (a geometric sequence) can be described compactly as a[i] = 2^i, which takes less information to write down than a model of the form a[i] = a*i^2 + b*i + c. The order of a model is its highest degree: the smaller the order, the simpler the model, so a quadratic is simpler than a cubic.

Two common measures of hypothesis-set complexity are entropy and the VC dimension. Entropy describes how dispersed the models in the hypothesis set are; the VC dimension describes the learning capacity of the hypothesis set, and a higher VC dimension means a more complex set. Note that some models look complex but are not. An SVM, for example, is determined by only a small number of support vectors, so its MDL is not large.

An example: the football-prediction story. Every week B receives an email predicting the result of next week's football match, and every prediction turns out to be correct. If you were B, would you pay for next week's prediction?

Analysis: from B's point of view, the hypothesis set appears to have size 1. What actually happens is this: A sends predictions to a large number of people. Half of them receive the prediction "home win" and the other half "home loss", so half of the predictions are necessarily correct. A then repeats the trick, sending the next week's predictions only to the half who received a correct prediction. After enough rounds there must be some lucky people who received a correct prediction every single time, and unfortunately B is one of them.

The lucky recipients see only the predictions sent to them, so they consider only the complexity of the single model used on them and conclude the accuracy is 100%. In fact the hypothesis set has size 2^n, and the predictions are completely meaningless. The lesson: we must account not only for the complexity of the model we are handed, but also for the complexity of the hypothesis set it came from. The more complex the hypothesis set, the less a good fit tells us.
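The scam is easy to simulate. Here is a minimal sketch (my own illustration, assuming n = 10 weeks of binary home-win/home-loss outcomes): A starts with 2^n recipients, each assigned a distinct prediction sequence, and each week keeps only those who received a correct prediction. Exactly one recipient always survives all n weeks, no matter how the matches turn out.

```python
# Minimal simulation of the football-prediction scam.
import random

def run_scam(n_weeks=10, seed=0):
    rng = random.Random(seed)
    n_recipients = 2 ** n_weeks
    # Recipient r's prediction for week w is bit w of r, so the
    # 2**n recipients cover every possible prediction sequence.
    survivors = list(range(n_recipients))
    for week in range(n_weeks):
        outcome = rng.randint(0, 1)  # actual match result: 0 or 1
        # Keep only the recipients whose prediction matched this week.
        survivors = [r for r in survivors if (r >> week) & 1 == outcome]
    return len(survivors)

# Always prints 1: someone necessarily saw 10 correct predictions in a row,
# even though every email was a coin-flip guess.
print(run_scam())
```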
The second question: why is simple generally better? A complex hypothesis set fits the training data more easily than a simple one (it has a higher VC dimension), so a complex model fitting the data is unsurprising and carries little evidence. When a simple model fits the training data, the fit is unlikely to be a coincidence, so it is more credible. A complex model may simply have done too much, fitting the noise along with the signal.
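A quick numerical illustration of this point (my own sketch, not from the lecture): fit the same noisy, roughly linear data with a degree-1 and a degree-10 polynomial and compare in-sample and out-of-sample error.

```python
# Simple vs. complex fit on a noisy linear target.
import numpy as np

rng = np.random.default_rng(0)
x_train = rng.uniform(-1, 1, 15)
y_train = 2 * x_train + rng.normal(0, 0.3, 15)   # linear target + noise
x_test = rng.uniform(-1, 1, 200)
y_test = 2 * x_test                               # noiseless test target

for degree in (1, 10):
    coeffs = np.polyfit(x_train, y_train, degree)
    e_in = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    e_out = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: E_in={e_in:.3f}  E_out={e_out:.3f}")

# The degree-10 fit drives E_in lower, but typically has larger E_out:
# the complex model "did too much" and fit the noise.
```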
2. Sampling Bias

If the data is sampled in a biased way, learning will produce a similarly biased outcome.

Sampling bias arises when the sampling procedure ignores the underlying distribution, so the sampled data is inconsistent with the distribution the model will actually face. The generalization of the resulting model can then be very poor, because part of the input distribution was never represented. This can often be avoided with stratified sampling (a sketch follows at the end of this section).

Besides the case above, there is a subtler situation that occurs in practice. Suppose we have trained a credit-approval model: when someone applies for a loan, the model outputs accept or reject. If the result is accept and the borrower later fails to repay, we can use that data point to correct the model. But if the result is reject, we never find out whether the rejection was the right call; there is simply no way to collect outcome data for the applicants we rejected. This, too, is a form of sampling bias. How do we fix it? The expensive fix: when the model says reject, grant the loan anyway, record the model's prediction, and later check whether the model was right.
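As a concrete illustration of the stratified-sampling remedy mentioned above (a minimal sketch using a made-up 90/10 imbalanced population), sampling each class separately preserves the population's label proportions in the sample:

```python
# Stratified sampling: draw the same fraction from each class.
import numpy as np

rng = np.random.default_rng(0)
labels = np.array([0] * 900 + [1] * 100)   # imbalanced population: 10% positive

def stratified_sample(y, frac, rng):
    """Return indices drawn separately from each class, `frac` of each."""
    idx = []
    for cls in np.unique(y):
        members = np.flatnonzero(y == cls)
        k = int(round(frac * len(members)))
        idx.extend(rng.choice(members, size=k, replace=False))
    return np.array(idx)

sample = stratified_sample(labels, frac=0.1, rng=rng)
print("population positive rate:", labels.mean())          # 0.10
print("sample positive rate:    ", labels[sample].mean())  # 0.10 by construction
```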
3. Data Snooping

If a data set has affected any step in the learning process, its ability to assess the outcome has been compromised.

Data snooping mainly shows up in two ways:
1. Looking at the data directly, then choosing a model that "fits" what you saw, and training on the same data.
2. Reusing the same data set many times, e.g., letting several different models learn from the same data.

Why does data snooping weaken generalization? In the first case, choosing the model based on the training data is itself a form of "learning": we have silently enlarged the hypothesis set, and by VC theory we then need more data to achieve the same generalization. The second case fails for the same reason: by trying many models on the same data, we have again enlarged the effective hypothesis set without noticing. (See the theory of generalization in Lecture 6.)

There are two ways to deal with this:
1. Avoid data snooping altogether.
2. If it cannot be avoided, account for it when applying generalization theory: for example, charge for the enlarged hypothesis set, or collect more data.
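One subtle way the first case can happen without anyone deliberately "looking" at the data (my own illustration of the principle, not an example from the text above) is preprocessing: if you normalize using statistics computed on the full data set before splitting, the test set has already affected a step of the learning process. A minimal sketch of the snooped versus clean protocol:

```python
# Data snooping via normalization: test statistics leak into training.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(5, 2, size=(1000, 3))
train, test = x[:800], x[800:]

# Snooped: mean/std computed on ALL the data (test set included).
mu_all, sd_all = x.mean(axis=0), x.std(axis=0)
train_snooped = (train - mu_all) / sd_all

# Clean: mean/std computed on the training set only, then reused on test.
mu_tr, sd_tr = train.mean(axis=0), train.std(axis=0)
train_clean = (train - mu_tr) / sd_tr
test_clean = (test - mu_tr) / sd_tr

print("clean train mean  :", train_clean.mean(axis=0).round(4))    # exactly 0
print("snooped train mean:", train_snooped.mean(axis=0).round(4))  # shifted by the test set
```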
