Machine Learning Foundations Note 9 -- How Can Machines Learn? (1)


Reprint: please cite the source: http://www.cnblogs.com/ymingjingr/p/4271742.html

Directory
Machine Learning Foundations Note 1 -- When Can Machines Learn? (1)
Machine Learning Foundations Note 2 -- When Can Machines Learn? (2)
Machine Learning Foundations Note 3 -- When Can Machines Learn? (3) (revised version)
Machine Learning Foundations Note 4 -- When Can Machines Learn? (4)
Machine Learning Foundations Note 5 -- Why Can Machines Learn? (1)
Machine Learning Foundations Note 6 -- Why Can Machines Learn? (2)
Machine Learning Foundations Note 7 -- Why Can Machines Learn? (3)
Machine Learning Foundations Note 8 -- Why Can Machines Learn? (4)
Machine Learning Foundations Note 9 -- How Can Machines Learn? (1)
Machine Learning Foundations Note 10 -- How Can Machines Learn? (2)
Machine Learning Foundations Note 11 -- How Can Machines Learn? (3)
Machine Learning Foundations Note 12 -- How Can Machines Learn? (4)
Machine Learning Foundations Note 13 -- How Can Machines Learn Better? (1)
Machine Learning Foundations Note 14 -- How Can Machines Learn Better? (2)
Machine Learning Foundations Note 15 -- How Can Machines Learn Better? (3)
Machine Learning Foundations Note 16 -- How Can Machines Learn Better? (4)

9. Linear Regression

Linear regression.

9.1 Linear Regression Problem

Linear regression problem.

The second chapter raised the bank credit card problem, where deciding whether to issue a card leads to a binary classification problem. This chapter uses the example again: deciding how large a credit line to grant a user leads to the problem of regression, here linear regression. The biggest difference between a regression problem and a binary classification problem is the output space: binary classification outputs a binary label, either +1 or -1, while the output space of a regression problem is the whole real line, i.e. $\mathcal{Y} = \mathbb{R}$.

In the bank credit card example the input set is still the user's feature space, such as age and income, and can use the same representation as in binary classification. Because the output set changes, the hypothesis function of the regression problem differs from that of binary classification, but the idea is the same: take a weighted sum over the components of each input sample. The target function f (which, together with noise, is denoted y) is therefore represented as shown in Equation 9-1.

$y \approx \sum_{i=0}^{d} w_i x_i$

(Equation 9-1)

Accordingly, the vector representation of the hypothesis function is shown in Equation 9-2.

$h(\mathbf{x}) = \mathbf{w}^T \mathbf{x}$

(Equation 9-2)

As Equation 9-2 shows, the hypothesis function differs from that of binary classification only by the absence of the sign function $\mathrm{sign}(\cdot)$.

Figure 9-1 describes linear regression more visually.

Figure 9-1 a) linear regression with a 1-dimensional input space b) linear regression with a 2-dimensional input space

Figure 9-1a shows linear regression with a 1-dimensional input space: the circles (○) are the input sample points, the blue line is the hypothesis function, and the red segments between the circles and the blue line are the distances from the sample points to the hypothesis, called residuals. Figure 9-1b has the analogous representation in two dimensions. The core idea in designing the algorithm is to minimize the total residual.

The error measure used in regression is the squared error mentioned in the previous chapters, so the in-sample error is as shown in Equation 9-3.

$E_{in}(h) = \frac{1}{N} \sum_{n=1}^{N} \big( h(\mathbf{x}_n) - y_n \big)^2$

(Equation 9-3)

In the linear regression problem the hypothesis function h corresponds one-to-one with the weight vector $\mathbf{w}$, so Equation 9-3 is usually written in terms of the weight vector, as shown in Equation 9-4.

$E_{in}(\mathbf{w}) = \frac{1}{N} \sum_{n=1}^{N} \big( \mathbf{w}^T \mathbf{x}_n - y_n \big)^2$

(Equation 9-4)

Similarly, the out-of-sample error is shown in Equation 9-5; note that the target contains noise, so the pair $(\mathbf{x}, y)$ obeys the joint probability distribution P.

$E_{out}(\mathbf{w}) = \mathop{\mathbb{E}}_{(\mathbf{x},y) \sim P} \big( \mathbf{w}^T \mathbf{x} - y \big)^2$

(Equation 9-5)
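To make the definitions concrete, here is a minimal Python sketch of Equations 9-2 and 9-4 (the names h and e_in and the numpy representation are my own, not from the notes); X is assumed to be an N×(d+1) array whose rows are the samples with the constant coordinate $x_0 = 1$, and y the vector of real-valued outputs.

```python
import numpy as np

def h(w, x):
    # Hypothesis of Equation 9-2: the inner product w^T x.
    return w @ x

def e_in(w, X, y):
    # In-sample average squared error of Equation 9-4.
    return np.mean([(h(w, x_n) - y_n) ** 2 for x_n, y_n in zip(X, y)])
```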

The VC bound can constrain learning models in all kinds of settings, and regression-type models are of course constrained by it as well. To learn something, then, it is enough to find an $E_{in}(\mathbf{w})$ that is small enough.

9.2 Linear Regression Algorithm

Linear regression algorithm.

This section focuses on how to find the $\mathbf{w}$ that minimizes $E_{in}(\mathbf{w})$. For simplicity of expression, the summation formula is rewritten in vector and matrix form, converting Equation 9-4 into the form of Equation 9-6.

$E_{in}(\mathbf{w}) = \frac{1}{N} \sum_{n=1}^{N} \big( \mathbf{w}^T \mathbf{x}_n - y_n \big)^2 = \frac{1}{N} \sum_{n=1}^{N} \big( \mathbf{x}_n^T \mathbf{w} - y_n \big)^2$

(the vector w exchanges position with the vector x for ease of display; this is allowed because an inner product obeys the commutative law)

$= \frac{1}{N} \left\| \begin{pmatrix} \mathbf{x}_1^T \mathbf{w} - y_1 \\ \mathbf{x}_2^T \mathbf{w} - y_2 \\ \vdots \\ \mathbf{x}_N^T \mathbf{w} - y_N \end{pmatrix} \right\|^2$

(the sum of squares is converted into the squared norm of a vector)

$= \frac{1}{N} \left\| X \mathbf{w} - \mathbf{y} \right\|^2$

(re-assembled into the matrix X, the vector w, and the vector y)

(Equation 9-6)
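As a quick sanity check (my own construction, not from the notes), the summation form of Equation 9-4 and the matrix form of Equation 9-6 agree numerically on random data:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 20, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])  # rows are x_n^T, with x_0 = 1
y = rng.normal(size=N)
w = rng.normal(size=d + 1)

sum_form = sum((x_n @ w - y_n) ** 2 for x_n, y_n in zip(X, y)) / N   # Equation 9-4
matrix_form = np.linalg.norm(X @ w - y) ** 2 / N                     # Equation 9-6
assert np.isclose(sum_form, matrix_form)
```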

Now return to the original goal of finding the minimum, as shown in Equation 9-7.

$\min_{\mathbf{w}} E_{in}(\mathbf{w}) = \min_{\mathbf{w}} \frac{1}{N} \left\| X \mathbf{w} - \mathbf{y} \right\|^2$

(Equation 9-7)

To solve this problem, you first need to understand what $E_{in}(\mathbf{w})$ looks like; the one-dimensional case (d = 1) is shown in Figure 9-2.

Figure 9-2 $E_{in}(\mathbf{w})$ in one dimension

As the figure shows, $E_{in}(\mathbf{w})$ is a continuous, differentiable, convex function. Continuity and differentiability should be familiar from basic calculus; a convex function, put colloquially, is shaped like the valley in Figure 9-2 (note that what the domestic mathematics textbooks define as a concave function is what is called a convex function here, which is a bit awkward). The best $\mathbf{w}$ is at the lowest point of the valley, the black dot in the figure, which mathematically is the point where the gradient equals 0. (The gradient, as I understand it, is roughly the vector formed by the partial derivatives with respect to each component.) The gradient is expressed as shown in Equation 9-8.

$\nabla E_{in}(\mathbf{w}) = \left( \frac{\partial E_{in}(\mathbf{w})}{\partial w_0}, \frac{\partial E_{in}(\mathbf{w})}{\partial w_1}, \ldots, \frac{\partial E_{in}(\mathbf{w})}{\partial w_d} \right)^T$

(Equation 9-8)

Here $\nabla$ is the gradient symbol. What needs to be found is the vector $\mathbf{w}_{LIN}$ satisfying $\nabla E_{in}(\mathbf{w}_{LIN}) = \mathbf{0}$, where the subscript LIN stands for linear. The immediate question is how to solve for it.

Continue transforming Equation 9-6, as shown in Equation 9-9.

$E_{in}(\mathbf{w}) = \frac{1}{N} \left\| X \mathbf{w} - \mathbf{y} \right\|^2 = \frac{1}{N} \big( \mathbf{w}^T X^T X \mathbf{w} - 2 \mathbf{w}^T X^T \mathbf{y} + \mathbf{y}^T \mathbf{y} \big)$

(Equation 9-9)

Write $X^T X$ as the matrix A, $X^T \mathbf{y}$ as the vector $\mathbf{b}$, and $\mathbf{y}^T \mathbf{y}$ as the scalar c, then take the gradient. Differentiating a scalar with respect to a vector is something many people have never encountered or even heard of; at most one knows how to differentiate with respect to a scalar. The steps of taking the gradient can be understood through the comparison with the scalar-w case in Figure 9-3.

Figure 9-3 a) the gradient when w is a scalar b) the gradient when w is a vector

The beauty of linear algebra is that the two cases look so alike. The gradient can therefore be written in the form of Equation 9-10.

$\nabla E_{in}(\mathbf{w}) = \frac{2}{N} \big( X^T X \mathbf{w} - X^T \mathbf{y} \big)$

(Equation 9-10)
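Equation 9-10 can be spot-checked numerically. The sketch below (my own construction, under the same data conventions as before) compares the analytic gradient with centered finite differences of $E_{in}$:

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 50, 4
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])
y = rng.normal(size=N)
w = rng.normal(size=d + 1)

def e_in(w):
    return np.linalg.norm(X @ w - y) ** 2 / N

analytic = 2.0 / N * (X.T @ X @ w - X.T @ y)   # Equation 9-10
eps = 1e-6
numeric = np.array([
    (e_in(w + eps * np.eye(d + 1)[i]) - e_in(w - eps * np.eye(d + 1)[i])) / (2 * eps)
    for i in range(d + 1)
])
assert np.allclose(analytic, numeric, atol=1e-5)
```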

Setting the gradient of Equation 9-10 to 0 yields the minimum of $E_{in}$. So how is the best hypothesis function solved for when the input matrix X and the output vector $\mathbf{y}$ are known? The solution splits into two cases. One is the case where $X^T X$ is invertible: the solution is then very simple; setting the right-hand side of Equation 9-10 to 0 gives Equation 9-11.

$\mathbf{w}_{LIN} = (X^T X)^{-1} X^T \mathbf{y} = X^{\dagger} \mathbf{y}$

(Equation 9-11)

Here $X^{\dagger} = (X^T X)^{-1} X^T$ denotes the pseudo-inverse of the matrix X; note that the input matrix X is only in rare cases a square matrix (N = d+1). The pseudo-inverse shares many properties with the inverse of a square matrix, hence its name. Note also that $X^T X$ is invertible in most cases, because in machine learning the number of samples N is usually much larger than the dimension of the samples d plus 1, so there are enough degrees of freedom to satisfy the invertibility condition.

The other case is when $X^T X$ is not invertible. Then many $\mathbf{w}$ actually satisfy the condition; the pseudo-inverse simply has to be obtained in another way, and one of the satisfying solutions is selected.

To summarize the solution process of the linear regression algorithm: first construct the input matrix X and the output vector $\mathbf{y}$ from the known data set, as shown in Equation 9-12.

$X = \begin{pmatrix} \mathbf{x}_1^T \\ \mathbf{x}_2^T \\ \vdots \\ \mathbf{x}_N^T \end{pmatrix}, \qquad \mathbf{y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix}$

(Equation 9-12)

Next, obtain the pseudo-inverse $X^{\dagger}$ directly from the matrix of Equation 9-12.

Finally, the hypothesis function is obtained through Equation 9-11, as shown in Equation 9-13.

$g(\mathbf{x}) = \mathbf{w}_{LIN}^T \mathbf{x} = \big( X^{\dagger} \mathbf{y} \big)^T \mathbf{x}$

(Equation 9-13)
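The whole algorithm then fits in a few lines of Python. Below is a minimal sketch under my own naming (not the notes' code); np.linalg.pinv computes the pseudo-inverse via the singular value decomposition, so it also covers the case where $X^T X$ is not invertible.

```python
import numpy as np

def linear_regression(X, y):
    # Equation 9-11: w_LIN = pseudo-inverse(X) @ y.
    # X is N x (d+1) and already carries the constant column x_0 = 1.
    return np.linalg.pinv(X) @ y

# Hypothetical usage on synthetic data:
rng = np.random.default_rng(2)
N, d = 100, 2
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=N)   # noisy target, as in Equation 9-1
w_lin = linear_regression(X, y)
g = lambda x: w_lin @ x                     # the final hypothesis of Equation 9-13
```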

9.3 Generalization Issue

Generalization problems.

The questions discussed in this section are not easy to understand, and my own grasp of them is still superficial; if anything is expressed incorrectly, I welcome corrections.

The first question to answer is whether the algorithm of the previous section, which solves for the best hypothesis function in one step, counts as machine learning.

If the answer is no, the reason is simple: the solution is completed in a single step, unlike the learning methods mentioned earlier, which need many steps. In mathematics this kind of solution is called an analytical solution, or a closed-form solution. Such a solution is an exact formula: the dependent variable can be obtained for any given independent variables. It is usually contrasted with a numerical solution. So this is not a minimum reached step by step, the way the PLA algorithm mentioned earlier reaches its solution.

If the answer is yes, the reason places more emphasis on results. The direct solution is an exact solution from mathematical derivation, so the minimum $E_{in}$ is obtained, which satisfies the conditions for a solution. Moreover, computing the pseudo-inverse (by Gaussian elimination; look up Gauss and you find some 110 results named after him, and his name will keep coming up throughout these machine learning notes) is not the single step the formula suggests, but takes several looping iterations (looking at a program for the matrix pseudo-inverse, it appears to be a triple loop, which also confirms the complexity of matrix inversion that Ng mentions in his machine learning course); it is just that the iteration is encapsulated by the program and not visible. The most important criterion for judging whether machine learning has taken place is whether the learning achieves a good enough $E_{out}$!

In fact, by refining the VC bound one can also prove that it plays a very good constraining role in the linear regression problem, that is, finding a good $E_{in}$ guarantees a good $E_{out}$. The proof is not given here, because it is a very cumbersome process. Just remember that the VC bound works not only in binary classification problems but also in linear regression problems.

This section, however, uses a guarantee that is easier to prove than the VC bound to show that the analytic solution can also obtain a good $E_{out}$.

The following is a proof of why the result obtained by the analytic solution is good, for the average $E_{in}$; the treatment of $E_{out}$ is similar.

First observe the average $E_{in}$, denoted by the symbol $\overline{E_{in}}$, which can be written as shown in Equation 9-14.

$\overline{E_{in}} = \mathop{\mathbb{E}}_{\mathcal{D} \sim P^N} \big[ E_{in}(\mathbf{w}_{LIN}) \big] = \text{noise level} \cdot \left( 1 - \frac{d+1}{N} \right)$

(Equation 9-14)

$\overline{E_{in}}$ is the expectation obtained by continually drawing sample sets $\mathcal{D}$ from the whole sample space and computing the average of their $E_{in}$ values; the noise level represents the noise in the data, N is the number of samples per draw, and d+1 is the dimension of the weight vector $\mathbf{w}$.

From the previous section, $E_{in}(\mathbf{w}_{LIN})$ can be written as Equation 9-15; notice that $\hat{\mathbf{y}} = X \mathbf{w}_{LIN} = X X^{\dagger} \mathbf{y}$, and put everything in vector form.

$E_{in}(\mathbf{w}_{LIN}) = \frac{1}{N} \left\| \mathbf{y} - \hat{\mathbf{y}} \right\|^2 = \frac{1}{N} \left\| (I - X X^{\dagger}) \mathbf{y} \right\|^2$

(Equation 9-15)

Here I is the identity matrix, and $X X^{\dagger}$ can be written as the hat matrix H.

The physical meaning of the H matrix can be understood more concretely through geometry, as shown in Figure 9-4.

Figure 9-4 Geometry of the hat matrix H

The purple vector represents the actual output vector $\mathbf{y}$.

The pink area represents the span of the input matrix X, that is, the space swept out by $X \mathbf{w}$ over different weight vectors $\mathbf{w}$. By this definition, the output vector $\hat{\mathbf{y}}$ produced by the optimal weight vector of the analytic solution also falls in this space, and it is not hard to imagine that, among the N-dimensional vectors in it, $\hat{\mathbf{y}}$ is the projection of the actual output vector $\mathbf{y}$ onto the space.

The green dashed line represents the gap between the actual output and the optimal hypothesis output, written $\mathbf{y} - \hat{\mathbf{y}}$. From the figure, this gap vector is perpendicular to the span of X.

Therefore the H matrix performs a projection: the vector $\hat{\mathbf{y}}$ is the vector $\mathbf{y}$ projected through the matrix H, i.e. $\hat{\mathbf{y}} = H \mathbf{y}$; the matrix H can be understood as a series of rotation and contraction actions.

The matrix $I - H$ is likewise a linear transformation: $\mathbf{y} - \hat{\mathbf{y}} = (I - H) \mathbf{y}$ is the vector obtained from $\mathbf{y}$ through it.
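The projection reading of H can be verified directly. The following check (my own, not from the notes) confirms that H is symmetric, idempotent, has trace d+1, and leaves a residual orthogonal to the span of X:

```python
import numpy as np

rng = np.random.default_rng(3)
N, d = 30, 4
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])
H = X @ np.linalg.pinv(X)                  # the hat matrix H = X X^dagger

assert np.allclose(H, H.T)                 # symmetric
assert np.allclose(H @ H, H)               # idempotent: projecting twice changes nothing
assert np.isclose(np.trace(H), d + 1)      # dimension of the projected-onto subspace

y = rng.normal(size=N)
y_hat = H @ y                              # projection of y onto the span of X
assert np.allclose(X.T @ (y - y_hat), 0)   # the residual is perpendicular to the span
```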

Adding a little more to Figure 9-4 gives Figure 9-5.

Figure 9-5 Adding the ideal target output f(x)

Suppose the actual output vector $\mathbf{y}$ is composed of the ideal target output $f(\mathbf{x})$ plus a noise portion (the red and black dashed parts). Then $\mathbf{y} - \hat{\mathbf{y}}$ can equally be formed by transforming the noise through $I - H$, i.e. $\mathbf{y} - \hat{\mathbf{y}} = (I - H) \cdot \text{noise}$. This gives Equation 9-16.

$E_{in}(\mathbf{w}_{LIN}) = \frac{1}{N} \left\| \mathbf{y} - \hat{\mathbf{y}} \right\|^2 = \frac{1}{N} \left\| (I - H) \cdot \text{noise} \right\|^2$

(Equation 9-16)

What remains is to find the trace of $I - H$, written trace(I - H). Before solving it, one can anticipate the answer, since the error vector is just the noise after the $I - H$ transformation; the solution of trace(I - H) proceeds as in Equation 9-17. (Why the trace is used here, I do not know; I hope some expert can advise.)

$\mathrm{trace}(I - H) = \mathrm{trace}(I_N) - \mathrm{trace}(H) = N - \mathrm{trace}(X X^{\dagger})$

(by the linearity of the trace)

$\mathrm{trace}(X X^{\dagger}) = \mathrm{trace}\big( X (X^T X)^{-1} X^T \big) = \mathrm{trace}\big( (X^T X)^{-1} X^T X \big) = \mathrm{trace}(I_{d+1}) = d + 1$

(by the cyclic property of the trace, trace(AB) = trace(BA))

$\mathrm{trace}(I - H) = N - (d + 1)$

(Equation 9-17)

Finally, the physical meaning of this $I - H$ transformation: a vector $\mathbf{y}$ with N degrees of freedom is projected onto a space spanned by X of dimension d+1 (each column representing a degree of freedom, that is, one parameter of an input sample), and the remainder has at most N - (d+1) degrees of freedom.

With this, the final results for $\overline{E_{in}}$, and likewise for $\overline{E_{out}}$ in terms of the noise, can be written as Equations 9-18 and 9-19.

$\overline{E_{in}} = \text{noise level} \cdot \left( 1 - \frac{d+1}{N} \right)$

(Equation 9-18)

$\overline{E_{out}} = \text{noise level} \cdot \left( 1 + \frac{d+1}{N} \right)$

(Equation 9-19)
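Equation 9-18 lends itself to a Monte Carlo check. The sketch below (my own construction; the target weights and noise level are made up) averages $E_{in}(\mathbf{w}_{LIN})$ over many sampled data sets and compares the average with $\sigma^2 \left(1 - \frac{d+1}{N}\right)$:

```python
import numpy as np

rng = np.random.default_rng(4)
N, d, sigma, trials = 50, 3, 0.5, 2000
w_true = rng.normal(size=d + 1)

avg_e_in = 0.0
for _ in range(trials):
    X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])
    y = X @ w_true + sigma * rng.normal(size=N)      # f(x) plus noise
    w_lin = np.linalg.pinv(X) @ y
    avg_e_in += np.linalg.norm(X @ w_lin - y) ** 2 / N
avg_e_in /= trials

print(avg_e_in, sigma ** 2 * (1 - (d + 1) / N))      # the two numbers should be close
```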

The proof of Equation 9-19 is more complicated; I did not find the relevant material, so it is not proved here, only introduced. In a philosophical sense, there is a reason for the difference between the two.

Because $\overline{E_{in}}$ is optimized on the training data, there is a chance of fitting a little of the noise, so it can look better than the "ideal value"; but that part has to be paid for at test time (imagine encountering noise "completely different" from that in the training data during testing), so $\overline{E_{out}}$ may be further from the ideal value.

From the above two formulas a learning-curve graph can be drawn, shown in Figure 9-6.

Figure 9-6 The learning gap in machine learning

As N tends to infinity, both $\overline{E_{in}}$ and $\overline{E_{out}}$ approach the noise level, i.e. $\sigma^2$.

The gap between the generalization errors: $\overline{E_{out}} - \overline{E_{in}} = \frac{2(d+1)}{N} \cdot \text{noise level}$.

This shows that a good $E_{out}$ can be found in linear regression, so linear regression is learnable.

A first gripe: this section took me a full day to write, the first time since I started these notes that a single entry took a whole day (normally I can write two or three entries in a day). The material is really very abstract, many of the proofs are not given, and the proofs that are given are not explained clearly; maybe I am just slow. The proof about the trace above is what I organized myself from the hints; if there are errors, please correct me, and if you understand why the concept of the trace is used in this part, please also guide me.

9.4 Linear Regression for Binary classification

Using linear regression to do binary classification.

First compare binary linear classification with linear regression, in three respects: the output space, the hypothesis function, and the error measure, as in Figure 9-7.

Figure 9-7 Comparison of binary linear classification with linear regression

Considering the difficulty of solving each: the solution of binary classification is an NP-hard problem, which can only be approximated, while linear regression is easy to solve through the analytic solution, and the program is easy to write.

It is therefore natural to consider solving the binary classification problem via linear regression, because the output space {-1, +1} of binary classification belongs to the output space of linear regression. On a data set whose labels are +1 and -1, obtain the analytic solution through linear regression, and directly take the sign of its output as the optimal hypothesis: outputs greater than 0 are taken as +1 and outputs less than 0 as -1. But this reasoning is only intuitive; how can mathematical knowledge justify this way of doing things?

Observe the two error measures and represent them as Equations 9-20 and 9-21, respectively.

$\mathrm{err}_{0/1} = \big[\!\big[ \mathrm{sign}(\mathbf{w}^T \mathbf{x}) \neq y \big]\!\big]$

(Equation 9-20)

$\mathrm{err}_{sqr} = \big( \mathbf{w}^T \mathbf{x} - y \big)^2$

(Equation 9-21)

Observe the common feature of the two formulas: both contain the inner product $\mathbf{w}^T \mathbf{x}$. Taking $s = \mathbf{w}^T \mathbf{x}$ as the horizontal axis and the err value as the vertical axis, Figure 9-8 can be drawn.

Figure 9-8a) is the image of the two err values when y = +1, and Figure 9-8b) the image when y = -1. In both graphs the red line represents $\mathrm{err}_{0/1}$ and the blue line represents $\mathrm{err}_{sqr}$.

Figure 9-8 a) the two err values when y = +1 b) the two err values when y = -1

From the figure, the conclusion of Equation 9-22 can be drawn.

$\mathrm{err}_{0/1} \leq \mathrm{err}_{sqr}$

(Equation 9-22)
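A brute-force check (my own) of Equation 9-22: for either label, the 0/1 error never exceeds the squared error at any score $s = \mathbf{w}^T \mathbf{x}$:

```python
import numpy as np

rng = np.random.default_rng(5)
s = rng.normal(size=10000)                     # s stands for the score w^T x
for y in (+1, -1):
    err01 = (np.sign(s) != y).astype(float)    # Equation 9-20
    errsq = (s - y) ** 2                       # Equation 9-21
    assert np.all(err01 <= errsq)              # Equation 9-22
```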

Recall the upper bound on $E_{out}$ for binary classification from the seventh chapter, and combine it with the conclusion of Equation 9-22 to obtain Equation 9-23.

$E_{out}^{0/1}(\mathbf{w}) \leq E_{in}^{0/1}(\mathbf{w}) + \sqrt{\frac{8}{N} \ln \frac{4 (2N)^{d_{VC}}}{\delta}} \leq E_{in}^{sqr}(\mathbf{w}) + \sqrt{\frac{8}{N} \ln \frac{4 (2N)^{d_{VC}}}{\delta}}$

(Equation 9-23)

Thus the binary classification problem obtains a looser upper bound, but also a more efficient way of being solved.

In practical applications, the analytic solution obtained by linear regression is often used as the initial value for PLA or pocket, to speed up their solution.
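Here is a sketch of that practical tip (my own implementation of pocket-PLA; the notes give no code): run pocket starting from the regression solution instead of from the zero vector.

```python
import numpy as np

def pocket(X, y, w0, max_iter=1000, seed=0):
    # Pocket-PLA: do PLA updates, but keep ("pocket") the best weights seen so far.
    rng = np.random.default_rng(seed)
    w, best_w = w0.copy(), w0.copy()
    best_err = np.mean(np.sign(X @ w) != y)
    for _ in range(max_iter):
        wrong = np.flatnonzero(np.sign(X @ w) != y)
        if wrong.size == 0:
            break                              # linearly separable: nothing left to fix
        n = rng.choice(wrong)                  # PLA update on one random mistake
        w = w + y[n] * X[n]
        err = np.mean(np.sign(X @ w) != y)
        if err < best_err:
            best_w, best_err = w.copy(), err
    return best_w

rng = np.random.default_rng(7)
N, d = 200, 2
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])
y = np.sign(X @ np.array([0.5, 1.0, -1.0]))    # synthetic +1/-1 labels
w0 = np.linalg.pinv(X) @ y                     # regression solution as the warm start
w_pocket = pocket(X, y, w0)
```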
