Machine Learning: Multivariate Linear Regression
What is multivariate linear regression?
In regression analysis, if there are two or more independent variables, the model is called multivariate (multiple) linear regression. Suppose we want to predict the price of a house; the factors that affect the price may include its area, number of bedrooms, number of floors, and age, and we use x1, x2, x3, and x4 to represent these four features. Here we assume that the house price is linearly related to these four features and use hθ(x) to predict it:
\[h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \theta_4 x_4\]
Here θ denotes the parameters of the linear regression model. We represent x and θ as column vectors:
\[\theta = (\theta_0, \theta_1, \theta_2, \theta_3, \theta_4)^T\]
\[x = (1, x_1, x_2, x_3, x_4)^T\]
Then we have
\[h_\theta(x) = \theta^T \cdot x\]
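As a quick illustration (a minimal sketch with NumPy; the parameter and feature values below are made up), the prediction is just a dot product:

import numpy as np

# hypothetical parameter values and the features of one house (illustrative only)
theta = np.array([80.0, 0.5, 10.0, 5.0, -2.0])   # theta_0, ..., theta_4
x = np.array([1.0, 120.0, 3.0, 2.0, 10.0])       # 1, area, bedrooms, floors, age
price = theta.dot(x)                             # h_theta(x) = theta^T . x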
How to implement multivariate linear regression?
For data with n features and m samples, we can use an m × (n + 1) matrix X: each row of the matrix (that is, x^(i)) is one data sample, each column corresponds to one feature, and x_j^(i) denotes the j-th feature of the i-th sample (with x_0^(i) = 1):
\[X = \begin{pmatrix}
1 & x_1^{(1)} & x_2^{(1)} & \cdots & x_n^{(1)} \\
1 & x_1^{(2)} & x_2^{(2)} & \cdots & x_n^{(2)} \\
1 & x_1^{(3)} & x_2^{(3)} & \cdots & x_n^{(3)} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & x_1^{(m)} & x_2^{(m)} & \cdots & x_n^{(m)}
\end{pmatrix} = \begin{pmatrix}
{x^{(1)}}^T \\
{x^{(2)}}^T \\
{x^{(3)}}^T \\
\vdots \\
{x^{(m)}}^T
\end{pmatrix}\]
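As a sketch of how such a design matrix might be built with NumPy (the raw feature values below are made up for illustration), we simply prepend a column of ones so that x_0^(i) = 1:

import numpy as np

# hypothetical raw features: area, bedrooms, floors, age (one row per house)
features = np.array([[120.0, 3, 2, 10],
                     [ 90.0, 2, 1,  5],
                     [150.0, 4, 2, 20],
                     [ 60.0, 1, 1,  2],
                     [200.0, 5, 3, 15]])
m = features.shape[0]
X = np.hstack([np.ones((m, 1)), features])   # m x (n + 1) design matrix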
The actual values of the samples (for house price prediction, the actual house prices) are represented by y:
\[y = (y_1, y_2, \ldots, y_m)^T\]
The linear regression parameter θ is:
\[\theta = (\theta_0, \theta_1, \ldots, \theta_n)^T\]
Next we define the cost function J(θ):
\[J(\theta) = \frac{1}{2m}\sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 = \frac{1}{2m}\sum_{i=1}^{m} \left( \theta^T \cdot x^{(i)} - y^{(i)} \right)^2 = \frac{1}{2m}\sum_{i=1}^{m} \left( {x^{(i)}}^T \cdot \theta - y^{(i)} \right)^2\]
We can also write J(θ) in matrix form, which gives a more concise expression:
\[J(\theta) = \frac{1}{2m} (X\theta - y)^T (X\theta - y)\]
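Translated directly into NumPy (a sketch that assumes X, theta, and y are arrays of compatible shapes, as above), the matrix form of the cost function could be written as:

def cost(X, theta, y):
    # J(theta) = 1/(2m) * (X.theta - y)^T (X.theta - y)
    m = X.shape[0]
    residual = X.dot(theta) - y
    return residual.dot(residual) / (2 * m)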
Now we only need to minimize the cost function J(θ) to obtain the optimal parameters θ. So how can we minimize J(θ)? There are two methods: one is gradient descent, the other is the normal equation.
Once the sample data X is fixed, J(θ) is a function of θ, and θ is an (n + 1)-dimensional column vector, so J(θ) is actually a function of n + 1 variables. As you may remember from calculus, to find the extreme value of a multivariate function we set the partial derivative with respect to each unknown to 0, solve for the unknowns, and substitute them back into the function to obtain the extreme value. Similarly, to find the extreme value of J(θ) we first take the partial derivatives of J(θ) with respect to θ, that is, with respect to θ0, θ1, θ2, ..., θn, and set them to 0.
The partial derivative of J(θ) with respect to θj is:
\[\frac{\partial J}{\partial \theta_j} = \frac{1}{m}\sum_{i=1}^{m} \left( {x^{(i)}}^T \cdot \theta - y^{(i)} \right) x_j^{(i)} = \frac{1}{m} \begin{pmatrix} x_j^{(1)} & x_j^{(2)} & \cdots & x_j^{(m)} \end{pmatrix} \cdot \left( X\theta - y \right)\]
Then the partial derivative of J(θ) with respect to the whole vector θ is:
\[\frac{\partial J}{\partial \theta} = \begin{pmatrix}
\frac{\partial J}{\partial \theta_0} \\
\frac{\partial J}{\partial \theta_1} \\
\vdots \\
\frac{\partial J}{\partial \theta_n}
\end{pmatrix} = \frac{1}{m} \begin{pmatrix}
1 & 1 & \cdots & 1 \\
x_1^{(1)} & x_1^{(2)} & \cdots & x_1^{(m)} \\
\vdots & & & \vdots \\
x_n^{(1)} & x_n^{(2)} & \cdots & x_n^{(m)}
\end{pmatrix} (X\theta - y) = \frac{1}{m} X^T (X\theta - y)\]
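One way to convince yourself of this result is to compare the analytic gradient (1/m)·X^T(Xθ − y) with a numerical finite-difference approximation; the sketch below assumes the cost helper defined earlier and is not part of the original derivation:

def numerical_gradient(X, theta, y, eps=1e-6):
    # approximate each dJ/dtheta_j by a central difference
    grad = np.zeros_like(theta, dtype=float)
    for j in range(len(theta)):
        theta_plus, theta_minus = theta.copy(), theta.copy()
        theta_plus[j] += eps
        theta_minus[j] -= eps
        grad[j] = (cost(X, theta_plus, y) - cost(X, theta_minus, y)) / (2 * eps)
    return grad

# analytic gradient for comparison: (1/m) * X^T (X.theta - y)
# analytic = X.T.dot(X.dot(theta) - y) / X.shape[0]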
Normal Equation method
Now that we have the expression for the partial derivative of J(θ), we just need to set it equal to 0 to find the θ we want, that is:
\[\begin{gathered}
\frac{\partial J}{\partial \theta} = \frac{1}{m} X^T (X\theta - y) = 0 \\
X\theta - y = 0 \\
X\theta = y \\
X^T X \theta = X^T y \\
\theta = (X^T X)^{-1} X^T y
\end{gathered}\]
Note: in the fourth step we multiply both sides on the left by X^T so that the left-hand side contains the square matrix X^T X, because only a square matrix can have an inverse.
In Python, you only need two lines to implement this method:
from numpy.linalg import pinv
# closed-form solution: theta = (X^T X)^{-1} X^T y (pinv handles the case where X^T X is singular)
theta = pinv(X.T.dot(X)).dot(X.T).dot(y)
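For example (continuing with the hypothetical X from above and some made-up house prices), fitting and predicting would look like:

y = np.array([250.0, 180.0, 320.0, 110.0, 400.0])   # hypothetical prices for the 5 sample houses
theta = pinv(X.T.dot(X)).dot(X.T).dot(y)            # closed-form fit
predictions = X.dot(theta)                          # predicted prices for the training samples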
Gradient Descent method
Although the normal equation method is easy to implement, when faced with hundreds of thousands or even millions of data samples, inverting such a large matrix takes a lot of time; in that case you should consider the gradient descent method. The idea of gradient descent is to approach the minimum gradually: through many iterations, the value of θ is updated repeatedly until the cost function J(θ) converges to its minimum. The iteration formula is as follows:
\[\theta = \theta - \alpha \frac{\partial J}{\partial \theta}\]
We already know how to compute the partial derivative of J(θ) with respect to θ, so we just keep iterating until the partial derivative is equal to 0 or small enough (for example, less than 1e-5); the θ we end up with is the value we want.
Below I use Python to find θ by gradient descent:
import numpy as np


def partial_derivative(X, theta, y):
    # gradient of J(theta): (1/m) * X^T (X.theta - y)
    derivative = X.T.dot(X.dot(theta) - y) / X.shape[0]
    return derivative


def gradient_descent(X, y, alpha=0.1):
    # initialize theta with one entry per column of X (n + 1 parameters)
    theta = np.ones(X.shape[1], dtype=float)
    partial_derivative_of_J = partial_derivative(X, theta, y)
    # iterate until every component of the gradient is small enough
    while np.any(np.abs(partial_derivative_of_J) > 1e-5):
        theta = theta - alpha * partial_derivative_of_J
        partial_derivative_of_J = partial_derivative(X, theta, y)
    return theta
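A quick way to check the function is to run it on synthetic data whose true relationship is known (a sketch; the true parameters below are arbitrary):

# synthetic check: y = 4 + 3 * x1, so gradient_descent should recover roughly [4, 3]
X_demo = np.hstack([np.ones((100, 1)), np.random.rand(100, 1)])
y_demo = X_demo.dot(np.array([4.0, 3.0]))
print(gradient_descent(X_demo, y_demo))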