**If you repost this article, please cite the source: http://www.cnblogs.com/ymingjingr/p/4271742.html**

VIII. Noise and Error

Noise and error.

8.1 Noise and Probabilistic Target

Noise and the probabilistic target function.

This section asks whether the VC bound still holds when the data contain noise (noise was first mentioned in Section 2.4). Figure 8-1 shows the learning flow with noise added.

Figure 8-1 Machine learning flowchart with noise

What is noise, and where does it come from? The example of a bank issuing credit cards illustrates three sources of noise:

- Noise in the label y, that is, a wrong label: a customer who should be issued a card is mistakenly labeled as one who should not.
- Another kind of noise in the label y: the same input is judged by different standards and gets different results. For example, two customers have identical attributes, yet one is labeled as approved and the other as rejected.
- Noise in the input x: the input information is inaccurate, for example a customer's details were entered incorrectly.

Back to the question of this section: does the VC bound still work when noise is present? A brief argument follows. Unlike the earlier chapters, the VC bound is not fully re-derived for this case; this section only sketches the idea of the proof.

The bin-of-marbles example returns. In the noise-free setting, the training inputs and the whole population follow the same probability distribution $P(x)$, and every marble, whether in the bin or in the sample, has a fixed color. Such marbles are deterministic. The color of a marble corresponds to whether the hypothesis $h(x)$ agrees with the target function $f(x)$. The label y is determined by the target function f; this way of learning is called the discriminative approach.

In practice, most data contain noise. The inputs still follow the same probability distribution, but the color of a marble is no longer fixed: picture a marble that keeps changing color, so that its color is only determined at the moment it is observed. Such marbles are probabilistic. In machine learning terms, the sample is noisy, that is, uncertain: the label y follows a probability distribution $P(y|x)$. This is called the target distribution rather than the target function, and this way of learning is called the generative approach.

Why is it called a target distribution? Take a simple example: a sample point x that satisfies Equation 8-1.

$$P(y = +1 \mid x) = 0.7, \qquad P(y = -1 \mid x) = 0.3$$ (Equation 8-1)

The label with the smaller error rate is chosen as the target (the ideal mini-target), so in this example the label +1 is chosen. What about the 30% chance that the label is -1? That is the noise. A target function f is a special case of a target distribution, with the probabilities given in Equation 8-2.

$$P(y \mid x) = \begin{cases} 1 & \text{if } y = f(x) \\ 0 & \text{if } y \neq f(x) \end{cases}$$ (Equation 8-2)

There are now two distributions, $P(x)$ and $P(y|x)$. A larger $P(x)$ means that x is more likely to be drawn as a training sample; a larger $P(y|x)$ means that the sample is more likely to belong to that class. Taken together, the goal is to predict the class as correctly as possible on the sample points that occur often.

Therefore the VC bound still applies: the noisy inputs and labels follow $P(x)$ and $P(y|x)$ respectively, which means each pair $(x, y)$ follows the joint distribution $P(x, y)$.

With this in mind, the machine learning flowchart is revised to include noise and the target distribution. The target function f becomes the target distribution $P(y|x)$, which generates the labels y of the training samples, and the labels y used at test time follow the same distribution.
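To make the target distribution concrete, here is a minimal Python sketch. It assumes the 70%/30% split from the example above; the helper names `mini_target` and `noise_level` are hypothetical and chosen only for illustration.

```python
def mini_target(p_y_given_x):
    """Return the label with the highest probability under P(y|x)."""
    return max(p_y_given_x, key=p_y_given_x.get)


def noise_level(p_y_given_x):
    """Probability that a drawn label disagrees with the mini-target."""
    return 1.0 - max(p_y_given_x.values())


# Assumed target distribution at one point x: +1 with 70%, -1 with 30%.
noisy = {+1: 0.7, -1: 0.3}

# A deterministic target function is the special case with no noise:
# all probability mass sits on the single label f(x).
deterministic = {+1: 1.0, -1: 0.0}

print(mini_target(noisy), noise_level(noisy))
print(mini_target(deterministic), noise_level(deterministic))
```

In this sketch the noisy point still has +1 as its best prediction, but even that best prediction is wrong about 30% of the time, while the deterministic point has no noise at all.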

Figure 8-2 Machine learning flowchart with noise and the target distribution

8.2 Error Measure

Error measure.

This section focuses on how the error measure affects machine learning.

The earlier chapters described how to make a machine learn, that is, how to bring the hypothesis g close to the target function f, or in other words how to make $E_{out}(g)$ as small as possible. What does the error measure used so far take into account? Mainly three things:

- Out-of-sample: averaged over all unknown samples x;
- Pointwise: evaluated on each point separately;
- Classification: evaluated as a classification result. Binary classification has only two classes, so the error is 0 when the prediction matches the target and 1 when it differs.

This classification error is also called the 0/1 error.

A pointwise error measure can be written as a function $\mathrm{err}$, so the error on the training samples and the error over the whole sample space can be expressed as Equation 8-3 and Equation 8-4, respectively.

$$E_{in}(g) = \frac{1}{N} \sum_{n=1}^{N} \mathrm{err}\big(g(x_n), f(x_n)\big)$$ (Equation 8-3)

$$E_{out}(g) = \mathop{\mathbb{E}}_{x \sim P} \, \mathrm{err}\big(g(x), f(x)\big)$$ (Equation 8-4)

For brevity, the hypothesis value $g(x)$ is written as $\tilde{y}$ and the target value $f(x)$ as $y$.

Besides the 0/1 error commonly used for classification, the squared error is commonly used for regression. It is also a pointwise error measure, as shown in Equation 8-5.

$$\mathrm{err}(\tilde{y}, y) = (\tilde{y} - y)^2$$ (Equation 8-5)
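The two pointwise error measures and the in-sample average can be sketched in a few lines of Python. The toy data and the hypothesis `g` below are made up for illustration; only the definitions of the 0/1 error and the squared error come from the text.

```python
def err_01(y_hat, y):
    """0/1 error: 0 if the prediction matches the target, 1 otherwise."""
    return 0.0 if y_hat == y else 1.0


def err_sq(y_hat, y):
    """Squared error between the prediction and the target."""
    return (y_hat - y) ** 2


def e_in(g, xs, ys, err):
    """Average of a pointwise error measure over the training samples."""
    return sum(err(g(x), y) for x, y in zip(xs, ys)) / len(xs)


def g(x):
    """Hypothetical hypothesis: predict +1 for positive inputs, else -1."""
    return 1 if x > 0 else -1


# Hypothetical training set; the second label disagrees with g.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [-1, 1, 1, 1]

print(e_in(g, xs, ys, err_01))
print(e_in(g, xs, ys, err_sq))
```

The same mistake is scored differently by the two measures: the 0/1 error counts it as one error out of four points, while the squared error charges $(−1 − 1)^2 = 4$ for it.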

With these two error measures in hand, we turn to the relationship between the error measure and learning: the error measure guides what the machine learns.

In the presence of noise, the target distribution $P(y|x)$ and the pointwise error function $\mathrm{err}$ together determine the ideal mini-target f, that is, the target function with the smallest error rate.

The previous section explained how the target distribution affects the ideal mini-target f. The following example illustrates the effect of the pointwise error function.

Suppose the target distribution assigns probabilities to three labels, $P(y=1|x)$, $P(y=2|x)$ and $P(y=3|x)$. Under the 0/1 error measure it is easy to work out the error rate of each candidate label, as shown in Figure 8-3.

Figure 8-3 Error rate of each label under the 0/1 error measure

Clearly the label y=2 has the lowest error rate, so y=2 should be chosen. The figure also contains an odd candidate label, 1.9, whose error rate under the 0/1 error measure is 1.

The error rates for the same target distribution can also be computed under other error measures. Figure 8-4 shows the error rate of each label under the squared error.

Figure 8-4 Error rate of each label under the squared error measure

Now the lowest error rate belongs to the label 1.9, the same label whose error rate was 1 under the 0/1 error measure, so y=1.9 should be chosen when the squared error is used.

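It is easy to derive that, under the two error measures, the ideal mini-target f is given by Equation 8-6 (0/1 error) and Equation 8-7 (squared error), respectively.

$$f(x) = \mathop{\arg\max}_{y \in \mathcal{Y}} P(y \mid x)$$ (Equation 8-6)

$$f(x) = \sum_{y \in \mathcal{Y}} y \cdot P(y \mid x)$$ (Equation 8-7)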
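The example can be checked numerically with a short Python sketch. The probabilities 0.2, 0.7 and 0.1 are assumed values for illustration, chosen so that label 2 is the most probable and the weighted average is 1.9, as the text describes; the function names are hypothetical.

```python
# Assumed for illustration: P(y=1|x)=0.2, P(y=2|x)=0.7, P(y=3|x)=0.1.
P = {1: 0.2, 2: 0.7, 3: 0.1}


def err_01(y_hat, y):
    return 0.0 if y_hat == y else 1.0


def err_sq(y_hat, y):
    return (y_hat - y) ** 2


def expected_err(y_hat, p, err):
    """Expected pointwise error of predicting y_hat when y follows p."""
    return sum(prob * err(y_hat, y) for y, prob in p.items())


candidates = [1, 2, 3, 1.9]
risk_01 = {c: expected_err(c, P, err_01) for c in candidates}
risk_sq = {c: expected_err(c, P, err_sq) for c in candidates}

best_01 = min(candidates, key=risk_01.get)  # best label under 0/1 error
best_sq = min(candidates, key=risk_sq.get)  # best label under squared error

mode = max(P, key=P.get)                    # most probable label
mean = sum(y * prob for y, prob in P.items())  # probability-weighted average

print(risk_01, best_01, mode)
print(risk_sq, best_sq, mean)
```

Under these assumed probabilities, the 0/1 error picks the most probable label, while the squared error picks the weighted average of the labels, even though that value is never an actual label.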

Building on the previous section, the machine learning flowchart is revised once more, as shown in Figure 8-5. An error measure module is added, which strongly influences both the algorithm and the hypothesis finally chosen.

Figure 8-5 Machine learning flow diagram with error measurement

8.3 Algorithmic Error Measure

Error measures for algorithms.

Binary classification has two types of error, as shown in Figure 8-6.

Figure 8-6 The two types of classification error

When the target function f gives +1 but the hypothesis g gives -1, the error is called a false reject. When the target function f gives -1 but the hypothesis g gives +1, the error is called a false accept.

The 0/1 error measure from the previous section simply treats these two types of error as equally costly, whereas in practice their costs differ from one application to another.

Consider two common examples. A supermarket gives gifts to members with high annual spending. A false accept means that a member who does not qualify for the gift receives one anyway; the supermarket loses only a small gift. A false reject means that a member who does qualify is refused; the loss is the supermarket's reputation, and possibly a large number of customers.

The other example is a security agency. If an employee who is authorized to see some information is refused by the system, that is a false reject, and at worst the employee complains. If an employee who is not authorized is let through, that is a false accept, and the loss could be enormous, possibly even a threat to national interests. The costs in the two cases might look like those in Figure 8-7.

Figure 8-7 a) Error costs for the supermarket giveaway b) Error costs for the security agency

These two cost tables also show that different applications call for different error measures.

When designing an algorithm, the ideal is to build the error measure from the actual cost of each type of error. The main difficulty is deciding the cost values themselves (how would numbers such as 10 and 1000 be quantified?). So in practice an alternative error measure is used when designing the algorithm, chosen by one of two principles:
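A cost table of this kind is easy to express in code. The sketch below is only illustrative: the costs (10 for a supermarket false reject, 1000 for a security false accept, 1 for the cheaper error in each case) are assumed values in the spirit of the numbers the text mentions, and the function names are hypothetical.

```python
# cost[(f, g)]: cost when the target says f and the hypothesis predicts g.
# Assumed values for illustration only.
SUPERMARKET = {(+1, +1): 0, (+1, -1): 10, (-1, +1): 1, (-1, -1): 0}
SECURITY = {(+1, +1): 0, (+1, -1): 1, (-1, +1): 1000, (-1, -1): 0}


def error_type(f, g):
    """Name the kind of error made when the target is f and the prediction is g."""
    if f == g:
        return "correct"
    return "false reject" if f == +1 else "false accept"


def total_cost(targets, predictions, cost):
    """Sum the application-specific cost over all predictions."""
    return sum(cost[(f, g)] for f, g in zip(targets, predictions))


# Hypothetical outcomes: one false reject and one false accept.
targets = [+1, +1, -1, -1]
predictions = [+1, -1, +1, -1]

print([error_type(f, g) for f, g in zip(targets, predictions)])
print(total_cost(targets, predictions, SUPERMARKET))
print(total_cost(targets, predictions, SECURITY))
```

The same two mistakes cost very different amounts in the two applications, which is why a single 0/1 count does not describe either of them well.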

- Plausible: for the 0/1 error in classification, one can assume that the noise is small relative to the whole, so it is enough to find a sufficiently small error. For the squared error, as long as the noise is assumed to follow a Gaussian distribution, minimizing the Gaussian noise amounts to minimizing the squared error. This approximate error measure is written $\widehat{\mathrm{err}}$.
- Friendly: easy to design an algorithm for. Minimizing the 0/1 error directly is an NP-hard problem, so in practice the algorithm also follows the principle above of simply looking for a smaller error rate. There are two kinds of friendly measures: those with a closed-form solution and those with a convex objective function.

Because the exact error measure is hard to know when designing the algorithm, an approximate error measure $\widehat{\mathrm{err}}$ is used instead. This is the main point of this section, and Figure 8-8 shows the machine learning flowchart with it added.

Figure 8-8 Machine learning flow diagram using approximate error measurements

8.4 Weighted Classification

Weighted classification.

The error table introduced in the previous section can be called a cost matrix, a loss matrix, or an error matrix, as shown in 8-3.

Figure 8-3 Error Matrix

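Using the error matrix above, $E_{out}(h)$ and $E_{in}(h)$ are written as Equation 8-8 and Equation 8-9, respectively.

$$E_{out}(h) = \mathop{\mathbb{E}}_{(x, y) \sim P} \left\{ \begin{array}{ll} 1 & \text{if } y = +1 \\ 1000 & \text{if } y = -1 \end{array} \right\} \cdot [\![\, y \neq h(x) \,]\!]$$ (Equation 8-8)

$$E_{in}(h) = \frac{1}{N} \sum_{n=1}^{N} \left\{ \begin{array}{ll} 1 & \text{if } y_n = +1 \\ 1000 & \text{if } y_n = -1 \end{array} \right\} \cdot [\![\, y_n \neq h(x_n) \,]\!]$$ (Equation 8-9)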

Because the VC bound holds for any algorithm, it is enough for the algorithm to make $E_{in}(h)$ as small as possible. The $E_{in}$ used here, however, differs from the unweighted one used before. To tell them apart, the weighted version is written $E_{in}^{w}(h)$, as shown in Equation 8-10.

$$E_{in}^{w}(h) = \frac{1}{N} \sum_{n=1}^{N} \left\{ \begin{array}{ll} 1 & \text{if } y_n = +1 \\ 1000 & \text{if } y_n = -1 \end{array} \right\} \cdot [\![\, y_n \neq h(x_n) \,]\!]$$ (Equation 8-10)
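The weighted in-sample error can be sketched directly in Python. The weight of 1000 on samples labeled -1 follows the text; the function name `weighted_e_in`, the toy data and the two constant hypotheses are hypothetical.

```python
def weighted_e_in(h, xs, ys, weight_neg=1000, weight_pos=1):
    """Weighted in-sample error: a mistake on a sample labeled -1
    (a false accept) costs weight_neg, a mistake on a sample labeled +1
    (a false reject) costs weight_pos."""
    total = 0
    for x, y in zip(xs, ys):
        if h(x) != y:
            total += weight_neg if y == -1 else weight_pos
    return total / len(xs)


def always_accept(x):
    return +1


def always_reject(x):
    return -1


# Hypothetical data: two samples of each class.
xs = [1, 2, 3, 4]
ys = [+1, +1, -1, -1]

print(weighted_e_in(always_accept, xs, ys))
print(weighted_e_in(always_reject, xs, ys))
```

Both hypotheses misclassify two of the four samples, so their 0/1 errors are identical, yet the weighted error strongly prefers the one that never makes a false accept.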

Suppose the data are not linearly separable (if they are, a hypothesis with zero error is guaranteed to exist), so the pocket algorithm is used. A natural guess for the weighted case is to keep most of the original algorithm and change only the replacement rule: if $w_{t+1}$ reaches a smaller $E_{in}^{w}$ than the pocketed $\hat{w}$, replace $\hat{w}$ with $w_{t+1}$.

The pocket algorithm comes with a guarantee when the error measure is the 0/1 $E_{in}$, but no such guarantee has been shown for the weighted version above. How can we obtain a pocket algorithm with the same guarantee under the weighted error, using a similar algorithm flow?

A simple way is to convert the original problem, which uses $E_{in}^{w}$, into an equivalent problem that uses the 0/1 $E_{in}$, as shown in 8-4.

Figure 8-4 a) The original problem using $E_{in}^{w}$ b) The equivalent problem using the 0/1 $E_{in}$

The equivalent problem copies every sample labeled -1 in the data set D of the original problem 1000 times, and then replaces the weighted loss matrix with the unweighted one, which is exactly the 0/1 error measure that the pocket algorithm uses. The only difference is that when such a sample is misclassified, its other 999 copies are misclassified too, so the loss is 1000 times what it was before. Careful readers will have noticed that, as a result, the algorithm is 1000 times more likely to encounter a sample labeled -1 during its search.
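The copying argument can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the course's reference implementation: the toy data set is made up, ties at a score of zero are assigned to +1, and instead of physically copying samples the search draws each sample with probability proportional to its weight, which has the same effect as the 1000-fold copying described above.

```python
import random


def predict(w, x):
    """Linear classifier; assumption: a score of exactly 0 counts as +1."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1


def weighted_errors(w, data, weights):
    """Sum of the weights of misclassified samples (N times E_in^w)."""
    return sum(c for (x, y), c in zip(data, weights) if predict(w, x) != y)


def copied_dataset(data, copies=1000):
    """Physically copy every sample labeled -1 `copies` times."""
    out = []
    for x, y in data:
        out.extend([(x, y)] * (copies if y == -1 else 1))
    return out


def weighted_pocket(data, weight_neg=1000, iterations=200, seed=0):
    rng = random.Random(seed)
    weights = [weight_neg if y == -1 else 1 for _, y in data]
    w = [0.0] * len(data[0][0])
    best_w, best_err = list(w), weighted_errors(w, data, weights)
    for _ in range(iterations):
        # "Virtual copying": heavier samples are visited more often.
        x, y = rng.choices(data, weights=weights, k=1)[0]
        if predict(w, x) != y:
            w = [wi + y * xi for wi, xi in zip(w, x)]  # PLA update
            err = weighted_errors(w, data, weights)
            if err < best_err:  # pocket replacement rule on E_in^w
                best_w, best_err = list(w), err
    return best_w, best_err


# Hypothetical data (x0 = 1 is the bias term). The third sample is a noisy
# +1 placed between the two -1 samples, so the data are not separable.
data = [((1, 2.0, 1.0), +1), ((1, 1.0, 2.0), +1), ((1, -1.5, -1.0), +1),
        ((1, -1.0, -1.5), -1), ((1, -2.0, -0.5), -1)]
weights = [1000 if y == -1 else 1 for _, y in data]

initial_err = weighted_errors([0.0, 0.0, 0.0], data, weights)
best_w, best_err = weighted_pocket(data)

# Equivalence check: weighted mistakes on D equal plain 0/1 mistakes
# counted on the copied data set.
mistakes_on_copies = sum(1 for x, y in copied_dataset(data)
                         if predict(best_w, x) != y)
print(initial_err, best_err, mistakes_on_copies)
```

Sampling by weight avoids storing the copies while keeping the pocket comparison on the weighted error, and the final check confirms that counting plain 0/1 mistakes on the copied data gives the same number as summing weights on the original data.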

Machine Learning Foundations Notes 8 -- Why Machines Can Learn (4)