Noise and Probabilistic Target
Data in real applications is almost always noisy.
Take credit card approval as an example:
- Label error: a customer who should be marked "do not issue" is marked "issue", or two records with identical customer data carry different labels (one was issued a card, the other was not);
- Input error: the customer's data itself is wrong, for example an annual income written with one zero too few, or the gender recorded incorrectly.
================================================================================
How does the VC bound work with noisy data? Recall the experiment of drawing balls from a jar.
Previously, every data point (ball) with h(x) ≠ f(x) was painted orange, and every other one green. The proportion of orange balls in the jar is the error rate. With noise, a data point may have h(x) ≠ f(x) and still be painted green by mistake, or vice versa, which distorts our estimate of the error rate. If we know the probability of such a mistake on a single data point, then for each x the output y follows the distribution y ~ P(y|x).
P(y|x) is called the target distribution, and from it we obtain the mini-target function f(x). For example, given a target distribution:
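As a minimal sketch with made-up numbers: assume that for one fixed x the target distribution is P(y=+1|x) = 0.7 and P(y=-1|x) = 0.3. The value 0.7 and the helper names `sample_label` and `mini_target` are assumptions for illustration only. The mini-target is then the more probable label, and the remaining probability mass shows up as noise:

```python
import random

def sample_label(p_plus, rng):
    """Draw y from a two-class target distribution with P(y=+1|x) = p_plus."""
    return +1 if rng.random() < p_plus else -1

def mini_target(p_plus):
    """Most probable label under P(y|x), i.e. the mini-target f(x)."""
    return +1 if p_plus >= 0.5 else -1

rng = random.Random(0)
p_plus = 0.7  # hypothetical value of P(y=+1|x) for one fixed x

f_x = mini_target(p_plus)
labels = [sample_label(p_plus, rng) for _ in range(10_000)]

# Fraction of observed labels that disagree with the mini-target:
# an empirical estimate of the noise level at this x.
noise_rate = sum(y != f_x for y in labels) / len(labels)
print(f_x, round(noise_rate, 2))
```

With enough samples the empirical noise rate should land close to 0.3, the probability of the less likely label.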
Error Measure
There are two common ways to measure error:
- The first is the 0/1 error: a prediction counts as wrong whenever "prediction ≠ target". It is usually used for classification;
- The second is the squared error, which measures the distance between the prediction and the target. It is usually used for regression.
To illustrate:
Suppose there are three possible outputs, 1, 2, and 3, with probabilities 0.2, 0.7, and 0.1 respectively.
- Under the 0/1 error, predicting 2 is the least likely to be wrong (it errs with probability 0.3), so the mini-target is f(x) = 2;
- Under the squared error, predicting 1.9, the probability-weighted mean 0.2·1 + 0.7·2 + 0.1·3, gives the lowest expected error, so the mini-target is f(x) = 1.9.
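The two mini-targets above can be checked numerically. This is only a small sketch; the function names `expected_01_error` and `expected_squared_error` are made up for illustration:

```python
outputs = [1, 2, 3]
probs = [0.2, 0.7, 0.1]

def expected_01_error(pred):
    """Probability that the true output differs from the prediction."""
    return sum(p for y, p in zip(outputs, probs) if y != pred)

def expected_squared_error(pred):
    """Expected squared distance between the true output and the prediction."""
    return sum(p * (y - pred) ** 2 for y, p in zip(outputs, probs))

# 0/1 error: the best prediction is the most probable output.
best_01 = min(outputs, key=expected_01_error)

# Squared error: the best prediction is the probability-weighted mean.
weighted_mean = sum(p * y for y, p in zip(outputs, probs))

print(best_01, round(weighted_mean, 2))
print(round(expected_squared_error(weighted_mean), 3),
      round(expected_squared_error(2), 3))
```

The last line compares the expected squared error at 1.9 with the expected squared error at 2, showing why the squared-error mini-target is not one of the three possible outputs.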
Weighted Error (Algorithmic Error Measure)
Take fingerprint identification as an example:
The target function classifies a fingerprint as a legitimate or an illegitimate identity, and the error is a 0/1 error. There are two ways to be wrong: a false reject, where a legitimate identity is judged illegitimate, and a false accept, where an illegitimate identity is judged legitimate.
Imagine a supermarket that uses fingerprints to identify its members and gives members a discount. If a member is falsely rejected, he will probably be angry at being denied a benefit he is entitled to and may stop coming, so the supermarket loses a regular customer. If an ordinary customer is falsely accepted, the supermarket merely gives away a small discount, which is not much of a loss, and the customer may even come back more often because he got a bargain. So false reject and false accept carry different costs for the supermarket, and we need to give them different weights so that the learning algorithm's error measure is biased accordingly when it selects an approximating function.
Similarly, suppose the CIA's top-secret database is open only to authorized personnel and identity is verified by fingerprint. Now the cost of a false accept becomes very large, because it means someone without clearance gets access to state secrets. That is intolerable, so the engineers put a huge weight on false accept: a candidate function that commits a false accept during training is essentially ruled out.
A false reject, on the other hand, hardly matters :-).
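The two scenarios can be sketched as the same predictions scored under two different cost settings. The CIA false accept weight of 1000 is the one used later in this post; the supermarket false reject cost of 10 and all remaining costs of 1 are assumptions for illustration, as is the function name `weighted_error`:

```python
def weighted_error(y_true, y_pred, false_accept_cost, false_reject_cost):
    """Average cost per example; +1 = legitimate identity, -1 = illegitimate."""
    total = 0
    for y, h in zip(y_true, y_pred):
        if y == -1 and h == +1:      # illegitimate judged legitimate
            total += false_accept_cost
        elif y == +1 and h == -1:    # legitimate judged illegitimate
            total += false_reject_cost
    return total / len(y_true)

y_true = [+1, +1, -1, -1]
y_pred = [-1, +1, +1, -1]  # one false reject, one false accept

# Supermarket: false reject is the expensive mistake (cost 10 is hypothetical).
supermarket = weighted_error(y_true, y_pred,
                             false_accept_cost=1, false_reject_cost=10)

# CIA: false accept is the expensive mistake.
cia = weighted_error(y_true, y_pred,
                     false_accept_cost=1000, false_reject_cost=1)

print(supermarket, cia)
```

The same two mistakes cost very different amounts depending on the application, which is the point of choosing the error measure per application.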
Weighted Classification
We continue with the CIA authentication example to illustrate weighted classification. When the error is calculated, each mistake is now counted with its weight instead of counting as 1.
This special E_in is given its own name, E_in^w, where the superscript w stands for weighted.
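Written out, the weighted in-sample error would take roughly the following form. This is a reconstruction from the description in this section, not a formula quoted from the course: the per-example weight w_n is notation introduced here, the false accept weight 1000 comes from the CIA example, and the false reject weight 1 is an assumption.

```latex
E_{\text{in}}^{w}(h)
  = \frac{1}{N} \sum_{n=1}^{N} w_n \,[\![\, y_n \neq h(\mathbf{x}_n) \,]\!],
\qquad
w_n =
\begin{cases}
  1    & \text{if } y_n = +1 \quad \text{(a mistake here is a false reject)} \\
  1000 & \text{if } y_n = -1 \quad \text{(a mistake here is a false accept)}
\end{cases}
```

Here the double bracket equals 1 when the condition inside holds and 0 otherwise.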
Mathematicians have thought of a way to transform the data so that it reflects the effect of the weights:
For example, if the weight of a false accept (a -1 recognized as +1) is 1000, we copy every point labeled -1 in the training data 1000 times; if the approximating function errs on such a point, it is penalized 1000 times. This turns the problem into an unweighted one.
And we already know that the pocket algorithm can solve the unweighted problem.
In practice we do not actually copy any data 1000 times. When the algorithm samples points to check for errors, it only needs to visit the high-weight points with 1000 times the probability, which is equivalent to copying them. If instead the error is computed by traversing the entire data set (no sampling), there is no need to change the sampling probabilities: just add up the weights of the misclassified points and divide by N. With this, we have extended the VC bound, which also holds for many kinds of classification problems!
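A small sketch of the equivalence described above, on a toy data set invented for illustration. The hypothesis `h`, the data, and all helper names are assumptions, and a weight of 1 on +1 examples is assumed. It checks that the plain error count on the copied data equals the weighted error sum on the original data, and shows weighted sampling as a stand-in for copying:

```python
import random

W_NEG = 1000  # weight on examples labeled -1 (false accept weight from the text)

def weights(y):
    """Per-example weights: W_NEG for -1 labels, 1 for +1 labels (assumed)."""
    return [W_NEG if yi == -1 else 1 for yi in y]

def weighted_error_sum(X, y, h):
    """Sum of the weights of the misclassified examples."""
    return sum(w for xi, yi, w in zip(X, y, weights(y)) if h(xi) != yi)

def replicate(X, y):
    """Literally copy each example as many times as its weight."""
    X_rep, y_rep = [], []
    for xi, yi, w in zip(X, y, weights(y)):
        X_rep.extend([xi] * w)
        y_rep.extend([yi] * w)
    return X_rep, y_rep

def plain_error_count(X, y, h):
    """Unweighted number of misclassified examples."""
    return sum(h(xi) != yi for xi, yi in zip(X, y))

def sample_index(y, rng):
    """Pick an example with probability proportional to its weight."""
    return rng.choices(range(len(y)), weights=weights(y), k=1)[0]

def h(x):
    """Hypothetical hypothesis: threshold at zero."""
    return +1 if x > 0 else -1

X = [-2.0, -1.0, 0.5, 1.5, 3.0]
y = [-1, +1, -1, +1, +1]

err_sum = weighted_error_sum(X, y, h)        # one false reject + one false accept
X_rep, y_rep = replicate(X, y)
rep_count = plain_error_count(X_rep, y_rep, h)
e_in_w = err_sum / len(y)                    # add up the weights, divide by N

rng = random.Random(0)
draws = [sample_index(y, rng) for _ in range(2000)]
neg_draws = sum(y[i] == -1 for i in draws)
print(err_sum, rep_count, e_in_w, neg_draws)
```

Normalizing the copied data by its own size instead of N only rescales the error by a constant, so the hypothesis with the smallest error is the same either way.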
Summary
For more discussion and exchange on machine learning, please follow this blog and Sina Weibo songzi_tea.
NTU-Coursera Machine Learning: Noise and Error