Machine learning-2-linear regression
First of all, our teacher's lectures are really poor.
The slides only list formulas without explaining what they actually mean or do.
Regression
What is regression
First, regression is a type of supervised learning. A regression problem tries to predict a continuous output, in contrast to a classification problem, which predicts a discrete output.
For example:
- Predicting a rate
- Predicting height
- ...
Regression model
Components:
- Feature \(x\)
- Predicted value \(y\)
- Training set \((x_i, y_i)\)
- Learning algorithm
- Regression function \(f\)
Linear regression:
\[f(x) = \omega_0 + \sum_{i=1}^{m}\omega_i x_i\]
Vectorized form (add \(x_0 = 1\) to represent the intercept term):
\[f(x) = w^T x\]
Generalized form (when the basis functions are not restricted to polynomials):
\[y(x, w) = \sum_{j=0}^{m-1}\omega_j \phi_j(x) = w^T\phi(x)\]
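As a quick illustration (the function and variable names below are my own, not from the lecture), the vectorized model with an intercept column can be evaluated in NumPy like this:

```python
import numpy as np

def predict(X, w):
    """Evaluate f(x) = w^T x for every row of X.

    X: (n, m) feature matrix without the intercept column.
    w: (m + 1,) weight vector, w[0] being the intercept.
    """
    X_aug = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend the x_0 = 1 column
    return X_aug @ w
```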
The nature of the problem
Breaking it down:
- Define the objective function
- Use the training set (real data)
- Minimize the difference between the predicted values \(f(x_i)\) and the true outputs \(y_i\)
- Determine the parameters \(w\) of the model
Objective function (cost function):
\[J(w) = \frac{1}{2}\sum_{i=1}^{n}(f(x_i)-y_i)^2\]
The task is then to find the \(w\) that minimizes \(J(w)\).
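A minimal sketch of this cost function, assuming the feature matrix already contains the \(x_0 = 1\) column (names are illustrative, not from the notes):

```python
import numpy as np

def cost(X_aug, y, w):
    """J(w) = 1/2 * sum_i (f(x_i) - y_i)^2, with X_aug including the x_0 = 1 column."""
    residuals = X_aug @ w - y
    return 0.5 * np.sum(residuals ** 2)
```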
Solving the regression
Gradient Descent method
Strategy:
- Randomly assign an initial value to \(w\)
- Change the values of \(w_i\) so that \(J(w)\) keeps getting smaller
- Descend in the direction opposite to the gradient
The gradient is a vector pointing in the direction along which the directional derivative of the function at a point is largest; that is, the function changes fastest along that direction, and the rate of change there is greatest.
As an example:
When climbing a mountain, walking perpendicular to the contour lines gives the steepest path.
The update rule:
\[\omega_j^{t} = \omega_j^{t-1} - \alpha\frac{\partial}{\partial\omega_j}J(w)\]
\[\frac{\partial}{\partial\omega_j}J(w) = \sum_{i=1}^{n}(f(x_i)-y_i)\cdot x_{i,j}\]
All \(w_j\) are updated simultaneously; \(\alpha\) is the learning rate (update step size).
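A rough sketch of batch gradient descent using this update rule; the learning rate, iteration count, and zero initialization below are arbitrary illustrative choices (the notes only say to pick a random initial value):

```python
import numpy as np

def gradient_descent(X_aug, y, alpha=0.01, n_iters=1000):
    """Batch gradient descent; X_aug already contains the x_0 = 1 column."""
    w = np.zeros(X_aug.shape[1])          # initial value for w (zeros here for simplicity)
    for _ in range(n_iters):
        grad = X_aug.T @ (X_aug @ w - y)  # sum_i (f(x_i) - y_i) * x_{i,j} for every j
        w = w - alpha * grad              # update all w_j simultaneously
    return w
```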
Some variants:
- Batch gradient descent
    - Each update uses all of the data
    - Iterations are slow with large samples
- Stochastic gradient descent (see the sketch after this list)
    - Each update uses only one sample
    - Faster iterations; more effective with large samples; also known as online learning
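For comparison, a sketch of the stochastic variant, where each update uses a single sample visited in shuffled order; the names and defaults are again my own illustrative choices:

```python
import numpy as np

def sgd(X_aug, y, alpha=0.01, n_epochs=10, seed=0):
    """Stochastic gradient descent: one sample per update."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X_aug.shape[1])
    for _ in range(n_epochs):
        for i in rng.permutation(len(y)):
            grad_i = (X_aug[i] @ w - y[i]) * X_aug[i]  # gradient from a single sample
            w = w - alpha * grad_i
    return w
```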
Supplement:
- Andrew Ng's "Machine Learning" course study notes (3): multivariate linear regression and polynomial regression
Normal equations
In matrix form:
\[J(w) = \frac{1}{2}\sum_{i=1}^{n}(f(x_i)-y_i)^2 = \frac{1}{2}(Xw-y)^T(Xw-y)\]
Take the derivative and set it to 0:
\[\frac{\partial}{\partial w}J(w) = X^T(Xw-y) = 0\]
Solving gives:
\[w = (X^TX)^{-1}X^Ty\]
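A sketch of the normal-equation solution in NumPy; `np.linalg.solve` is used instead of an explicit matrix inverse, which is a common numerical choice of mine rather than something from the notes:

```python
import numpy as np

def normal_equation(X_aug, y):
    """Solve (X^T X) w = X^T y, i.e. w = (X^T X)^{-1} X^T y."""
    return np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y)
```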
Which one is better?
| Gradient Descent | Normal Equations |
| --- | --- |
| Requires choosing a learning rate | Not needed |
| Takes many iterations | Solved in one computation |
| \(O(kn^2)\) | \(O(n^3)\) |
| Performs well when \(n\) is large | Very slow when \(n\) is large |
| Data needs to be normalized | Not needed |
Conclusion:
When the sample size is small, use the normal equations; when the sample size is large, use gradient descent.
Supplementary links
Matrix vector derivatives for machine learning