1.Gradient Descent
2.Normal equation
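First, what is the normal equation? Suppose a dataset X has m samples and n features, and the hypothesis function is:

$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n$$

The feature matrix of the dataset X is expressed as:

$$X = \begin{bmatrix} 1 & x_1^{(1)} & \cdots & x_n^{(1)} \\ 1 & x_1^{(2)} & \cdots & x_n^{(2)} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_1^{(m)} & \cdots & x_n^{(m)} \end{bmatrix}$$

Here $x^{(i)}$ represents the i-th training sample, and $x_j^{(i)}$ the j-th feature of the i-th training sample. The first column of $X$ is all 1s, so the predictions for the whole dataset are $X\theta$, and we want this function to fit $Y$, that is, $X\theta = Y$. Because of this, we can find the parameters $\theta$ by a matrix operation. Readers familiar with linear algebra will know how to solve for the parameters: $\theta = X^{-1}Y$. But the premise is that the matrix $X$ has an inverse.
But only a square matrix can have an inverse (readers unfamiliar with this theorem may want to brush up on linear algebra), so we left-multiply both sides by $X^T$ to turn the equation into $X^T X \theta = X^T Y$, and therefore $\theta = (X^T X)^{-1} X^T Y$. Some readers may object that $(X^T X)^{-1}$ does not always exist. That is true, but it rarely fails to exist; how to handle that case is covered later, so don't worry about it for now. For now you just need to be clear about why $\theta = (X^T X)^{-1} X^T Y$ holds. And remember it.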
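The closed-form solution $\theta = (X^T X)^{-1} X^T Y$ can be sketched in NumPy as below. This is only a minimal illustration: the function name `normal_equation` and the toy data (generated from y = 1 + 2x) are assumptions made up for the example, and the code solves the linear system rather than forming the inverse explicitly, which is a common numerical choice and not something the text prescribes.

```python
import numpy as np

def normal_equation(X, y):
    """Return theta solving (X^T X) theta = X^T y.

    X is the design matrix whose first column is all ones.
    """
    # Solving the linear system avoids computing the inverse explicitly.
    return np.linalg.solve(X.T @ X, X.T @ y)

# Hypothetical toy data: y = 1 + 2 * x, no noise.
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])  # prepend the column of ones
y = 1.0 + 2.0 * x

theta = normal_equation(X, y)
print(theta)  # approximately [1. 2.]
```

If $X^T X$ is singular, `np.linalg.solve` raises `LinAlgError`, which corresponds to the non-invertible case discussed later.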
Having introduced the normal equation, we now know two methods for solving for the parameters: the normal equation and gradient descent. Let us compare the pros and cons of the two approaches and see which scenarios suit each.
See the following table for details:
Back to the point above that $(X^T X)^{-1}$ does not necessarily exist: such a situation is very rare. If $X^T X$ is not invertible, there are generally two causes, each with its own remedy: (1) Redundant features: some features are linearly dependent, so remove the redundant ones. (2) Too many features (for example, m < n): remove some features, or use regularization when the sample is small.
Gradient Descent for Linear Regression
Note: [At 6:15 "h(x) = -900 - 0.1x" should be "h(x) = 900 - 0.1x"]
When specifically applied to the case of linear regression, a new form of the gradient descent equation can be derived. We can substitute our actual cost function and our actual hypothesis function and modify the equation to:
$$
\begin{aligned}
&\text{repeat until convergence: } \{ \\
&\quad \theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x_i) - y_i\right) \\
&\quad \theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left(\left(h_\theta(x_i) - y_i\right) x_i\right) \\
&\}
\end{aligned}
$$
where $m$ is the size of the training set, $\theta_0$ is a constant that will be changing simultaneously with $\theta_1$, and $x_i, y_i$ are values of the given training set (data).
Note that we have separated out the two cases for $\theta_j$ into separate equations for $\theta_0$ and $\theta_1$; and that for $\theta_1$ we are multiplying $x_i$ at the end due to the derivative. The following is a derivation of $\frac{\partial}{\partial \theta_j} J(\theta)$ for a single example:
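One way to write that derivation out is sketched here. It assumes the cost for a single example $(x, y)$ is $J(\theta) = \frac{1}{2}\left(h_\theta(x) - y\right)^2$ with $h_\theta(x) = \sum_k \theta_k x_k$ and $x_0 = 1$; this form of the cost is an assumption, since the text does not state it explicitly.

```latex
\begin{aligned}
\frac{\partial}{\partial \theta_j} J(\theta)
  &= \frac{\partial}{\partial \theta_j} \frac{1}{2}\left(h_\theta(x) - y\right)^2 \\
  &= 2 \cdot \frac{1}{2}\left(h_\theta(x) - y\right) \cdot
     \frac{\partial}{\partial \theta_j}\left(h_\theta(x) - y\right) \\
  &= \left(h_\theta(x) - y\right) \cdot
     \frac{\partial}{\partial \theta_j}\left(\sum_{k} \theta_k x_k - y\right) \\
  &= \left(h_\theta(x) - y\right) x_j
\end{aligned}
```

For $j = 0$ the factor $x_0 = 1$ disappears, which gives the $\theta_0$ update; for $j = 1$ the factor is the input value itself, which is why the $\theta_1$ update carries an extra $x_i$. Averaging over the $m$ training examples gives the $\frac{1}{m}\sum$ in both updates.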
The point of all this is that if we start with a guess for our hypothesis and then repeatedly apply these gradient descent equations, our hypothesis will become more and more accurate.
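The repeated updates can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the function name `batch_gradient_descent`, the learning rate 0.1, the iteration count, and the toy data (generated from y = 1 + 2x) are all assumptions chosen for the example.

```python
import numpy as np

def batch_gradient_descent(x, y, alpha=0.1, num_iters=5000):
    """Fit h(x) = theta0 + theta1 * x using the two update rules."""
    m = len(x)
    theta0, theta1 = 0.0, 0.0  # initial guess for the hypothesis
    for _ in range(num_iters):
        error = (theta0 + theta1 * x) - y   # h(x_i) - y_i for every example
        grad0 = error.sum() / m             # (1/m) * sum of errors
        grad1 = (error * x).sum() / m       # (1/m) * sum of errors times x_i
        # Simultaneous update: both gradients were computed from the old values.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1

# Hypothetical toy data: y = 1 + 2 * x, no noise.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x

theta0, theta1 = batch_gradient_descent(x, y)
print(theta0, theta1)  # close to 1.0 and 2.0
```

Both gradients are computed before either parameter is changed, which is what "changing simultaneously" refers to.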
So, this is simply gradient descent on the original cost function J. This method looks at every example in the entire training set on every step, and is called batch gradient descent. Note that, while gradient descent can be susceptible to local minima in general, the optimization problem we have posed here for linear regression has only one global, and no other local, optima; thus gradient descent always converges (assuming the learning rate α is not too large) to the global minimum. Indeed, J is a convex quadratic function. Here's an example of gradient descent as it's run to minimize a quadratic function.
The ellipses shown above are the contours of a quadratic function. Also shown is the trajectory taken by gradient descent, which was initialized at (48,30). The x's in the figure (joined by straight lines) mark the successive values of θ that gradient descent went through as it converged to its minimum.
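A trajectory of this kind can be reproduced with a short sketch. The quadratic used here, f(θ) = ½ θᵀAθ with the matrix `A` below, is a hypothetical stand-in (the function behind the figure is not given in the text), and the step size 0.1 and the 200 iterations are likewise assumptions; only the starting point (48, 30) comes from the description above.

```python
import numpy as np

# Hypothetical convex quadratic f(theta) = 0.5 * theta^T A theta.
# A is symmetric positive definite, so the only minimum is at the origin.
A = np.array([[1.0, 0.5],
              [0.5, 2.0]])

def f(theta):
    return 0.5 * theta @ A @ theta

def grad(theta):
    return A @ theta  # gradient of 0.5 * theta^T A theta for symmetric A

theta = np.array([48.0, 30.0])  # starting point from the text
alpha = 0.1                     # assumed learning rate
trajectory = [theta.copy()]
for _ in range(200):
    theta = theta - alpha * grad(theta)
    trajectory.append(theta.copy())

values = [f(t) for t in trajectory]
print(trajectory[0], trajectory[-1])  # starts at (48, 30), ends near (0, 0)
```

The entries of `trajectory` play the role of the x's in the figure, and `values` shows the cost shrinking from step to step for this choice of learning rate.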
Machine learning Week 1