Contents of this chapter:
========================================================== ================
7.1 Introduction
Linear regression is the most basic model in statistics and machine learning. In scientific research, classic models are in fact the most widely used, and the linear model is one of them.
========================================================== ================
7.2 Model specification
Figure 7.1 shows the results of linear regression with different basis functions.
Basis function expansion (abbreviated bfe below) replaces the input x with a vector of basis functions of x.
For example, the bfe can be [1 x] or [1 sin(x)].
Why is a model that contains x^2 still called linear? Because linearity refers to the parameters w: the model is linear in w even when it is nonlinear in x.
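To make the "linear in w" point concrete, here is a minimal sketch using NumPy. The data, the helper names expand and fit, and the three basis lists are assumptions made for illustration; they do not come from the book.

```python
import numpy as np

def expand(x, basis_functions):
    """Stack one column per basis function to form the design matrix."""
    return np.column_stack([f(x) for f in basis_functions])

# Hypothetical, noise-free data generated from y = 1 + 2*x^2.
x = np.linspace(-1.0, 1.0, 20)
y = 1.0 + 2.0 * x**2

linear_basis = [np.ones_like, lambda t: t]                     # [1 x]
sine_basis = [np.ones_like, np.sin]                            # [1 sin(x)]
quadratic_basis = [np.ones_like, lambda t: t, lambda t: t**2]  # [1 x x^2]

def fit(basis_functions):
    """Ordinary least squares on the expanded inputs."""
    Phi = expand(x, basis_functions)
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return Phi, w

Phi_lin, w_lin = fit(linear_basis)
Phi_sin, w_sin = fit(sine_basis)
Phi_quad, w_quad = fit(quadratic_basis)

# In every case the prediction is Phi @ w: a linear function of w,
# even though the columns of Phi are nonlinear functions of x.
print("weights for [1 x]:       ", w_lin)
print("weights for [1 sin(x)]:  ", w_sin)
print("weights for [1 x x^2]:   ", w_quad)
```

Only the design matrix changes between the three fits; the solver is the same linear least squares routine each time.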
========================================================== ================
7.3 Maximum likelihood estimation (least squares)
In statistics, given a probability model for the data (a Gaussian model in linear regression), maximum likelihood estimation (MLE) is a common method for estimating the parameters (w in linear regression). Under the Gaussian model assumption, maximum likelihood estimation is equivalent to least squares.
7.3.1 Derivation of the MLE
In the following derivation, pay attention to the shapes of x and w. By convention, every vector is a column vector. If a book writes vectors as row vectors, all I can say is: throw it away.
Detailed derivation:
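A standard version of this derivation can be sketched as follows. It assumes that the noise variance \sigma^2 is fixed, that X is the N x D matrix whose i-th row is x_i^T, and that X^T X is invertible; these symbols are assumptions of the sketch.

```latex
\begin{aligned}
p(y_i \mid x_i, w, \sigma^2) &= \mathcal{N}(y_i \mid w^\top x_i, \sigma^2) \\
\ell(w) &= \sum_{i=1}^{N} \log p(y_i \mid x_i, w, \sigma^2)
         = -\frac{1}{2\sigma^2} \sum_{i=1}^{N} (y_i - w^\top x_i)^2
           - \frac{N}{2} \log(2\pi\sigma^2) \\
\mathrm{NLL}(w) &= -\ell(w)
         = \frac{1}{2\sigma^2} \lVert y - Xw \rVert_2^2 + \text{const} \\
\nabla_w \mathrm{NLL}(w) &= \frac{1}{\sigma^2} \left( X^\top X w - X^\top y \right) = 0 \\
\hat{w} &= (X^\top X)^{-1} X^\top y
\end{aligned}
```

Since \sigma^2 only scales the sum of squared residuals, maximizing the likelihood over w is the same as minimizing the least squares cost.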
7.3.2 Geometric interpretation
Once w has been found, the geometric interpretation of y_hat = Xw (that is, y_hat_i = w'x_i for each sample) is that y_hat is the orthogonal projection of y onto the space spanned by the columns of X.
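The projection property can be checked numerically. The sketch below uses synthetic data (an assumption, not data from the book) and verifies that the residual y - y_hat is orthogonal to every column of X.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: a bias column plus one feature.
X = np.column_stack([np.ones(30), rng.normal(size=30)])
y = 2.0 + 3.0 * X[:, 1] + rng.normal(scale=0.5, size=30)

w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ w_hat
residual = y - y_hat

# Orthogonal projection: the residual is orthogonal to each column of X,
# so both entries printed here are close to zero.
print(X.T @ residual)

# The same y_hat obtained from the projection matrix P = X (X^T X)^{-1} X^T.
P = X @ np.linalg.solve(X.T @ X, X.T)
print(np.allclose(P @ y, y_hat))
```

P is idempotent (P @ P equals P up to rounding), which is the algebraic signature of a projection.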
7.3.3 Convexity
Only when a function is convex can we guarantee that every local minimum is also a global minimum.
In calculus, a point where the first derivative of a continuously differentiable function equals zero is not necessarily a minimum.
In optimization courses (optimization, operations research, optimization methods, and similar), we generally assume that the function is convex before studying how to optimize it.
If you are learning machine learning, I also recommend finding a book on optimization and reading about linear programming, unconstrained optimization, and constrained optimization.
Of these, unconstrained optimization methods such as Newton's method, gradient descent, and BFGS are the ones used most in ML.
A regularization constraint can be viewed as the penalty function method or the Lagrange multiplier method from constrained optimization.
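As a small illustration of why convexity matters, the sketch below runs plain gradient descent on the least squares cost from two different starting points and compares both results with the closed-form solution. The data, the step-size rule, and the function names are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data from y = 1 - 2x plus a little noise.
X = np.column_stack([np.ones(50), rng.normal(size=50)])
y = X @ np.array([1.0, -2.0]) + rng.normal(scale=0.1, size=50)
n = len(y)

def grad(w):
    """Gradient of the cost 0.5 * ||Xw - y||^2 / n."""
    return X.T @ (X @ w - y) / n

# The Hessian X^T X / n is positive semidefinite, so the cost is convex.
hessian = X.T @ X / n
eigenvalues = np.linalg.eigvalsh(hessian)
step = 1.0 / eigenvalues.max()   # a safe constant step size

def gradient_descent(w0, n_steps=2000):
    w = np.array(w0, dtype=float)
    for _ in range(n_steps):
        w = w - step * grad(w)
    return w

w_a = gradient_descent([0.0, 0.0])
w_b = gradient_descent([10.0, -10.0])
w_closed = np.linalg.solve(X.T @ X, X.T @ y)

print(w_a, w_b, w_closed)
```

Both runs end at the same point as the normal equations, which is what convexity promises: there is no other local minimum to get stuck in.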
========================================================== ================
7.4 Robust linear regression
<In Chinese, "robust" is usually rendered by a transliteration, which is in common use.>
This section can be read as follows: the least squares cost function uses the squared Euclidean distance (yi - yi_hat)^2, and robustness is improved either by modifying it or by choosing a different distance to define the cost function. <I will summarize the different distances in a blog post in the near future.>
Under the Gaussian model, an outlier has a large squared distance from the expected value, so it has a strong and harmful influence on parameter estimation. Choosing a different distance measure alleviates this. "Heavy tails" means that the probability p assigned to an outlier is not too small, so the outlier is penalized less under maximum likelihood.
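One way to see the effect of changing the distance is to compare ordinary least squares with a fit under the Huber loss, which is quadratic for small residuals and linear for large ones. The sketch below is an assumption-laden illustration: the Huber loss, the threshold delta, the iteratively reweighted least squares (IRLS) solver, and the synthetic outliers are all choices made for this example, not the method of the section.

```python
import numpy as np

def huber_irls(X, y, delta=1.0, n_iter=100):
    """Minimize the Huber loss by iteratively reweighted least squares."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)    # start from ordinary least squares
    for _ in range(n_iter):
        r = y - X @ w
        abs_r = np.maximum(np.abs(r), 1e-12)     # guard against division by zero
        # Weight 1 inside the quadratic zone, delta/|r| in the linear zone.
        weights = np.where(abs_r <= delta, 1.0, delta / abs_r)
        sw = np.sqrt(weights)
        w, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return w

rng = np.random.default_rng(3)

# Hypothetical data from y = 1 + 2x, with the last four targets corrupted.
x = np.linspace(0.0, 10.0, 40)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=x.size)
y[-4:] += 40.0

X = np.column_stack([np.ones_like(x), x])
w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
w_huber = huber_irls(X, y)

print("OLS   [intercept, slope]:", w_ols)
print("Huber [intercept, slope]:", w_huber)
```

The outliers pull the least squares slope far from 2, while the Huber fit stays close to it, because each outlier's influence is capped once its residual exceeds delta.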
========================================================== ================
7.5 Ridge regression
When the data are noisy or the model is too complex, overfitting occurs.
In the figure, the blue line is the fitted model; the pink line is clearly a better fit.
In numerical approximation courses, the oscillation caused by fitting data with a high-degree polynomial is called Runge's phenomenon.
To solve this problem, you can add a regularization term (constraint term). The following discussion is based on the Gaussian model, where the MLE reduces to the least squares form.
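In its usual form, the regularized least squares objective and its minimizer can be written as below. The notation is an assumption of this sketch: X is the N x D design matrix, \lambda >= 0 is the regularization strength, and every component of w is penalized.

```latex
J(w) = \lVert y - Xw \rVert_2^2 + \lambda \lVert w \rVert_2^2,
\qquad
\hat{w}_{\text{ridge}} = (\lambda I_D + X^\top X)^{-1} X^\top y
```

Setting \lambda = 0 recovers the ordinary least squares solution, provided X^T X is invertible.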
As the figure shows, after a regularization term is added, the larger lambda is, the smoother the fitted function.
In most cases, adding a regularization term is an effective way to deal with overfitting.
One disadvantage of adding a regularization term is that a very important feature (whose coefficient wi should be large) may be suppressed, so that the final wi is smaller than it should be.
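The shrinking effect of lambda can be seen in a small experiment. The sketch below fits a degree-9 polynomial with the closed-form ridge estimate for several values of lambda; the data, the polynomial degree, and the lambda values are assumptions, and for simplicity the intercept is penalized along with the other coefficients.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge estimate: (lam * I + X^T X)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(lam * np.eye(d) + X.T @ X, X.T @ y)

rng = np.random.default_rng(2)

# Hypothetical noisy samples of a smooth function.
x = np.linspace(-1.0, 1.0, 21)
y = np.sin(3.0 * x) + rng.normal(scale=0.3, size=x.size)

# Degree-9 polynomial features: columns 1, x, x^2, ..., x^9.
X = np.vander(x, N=10, increasing=True)

lambdas = [1e-3, 1e-1, 10.0]
norms = [np.linalg.norm(ridge_fit(X, y, lam)) for lam in lambdas]

for lam, norm in zip(lambdas, norms):
    print(f"lambda={lam:g}  ||w||={norm:.3f}")
```

The coefficient norm falls as lambda grows, which is what smooths the fitted curve. It also illustrates the disadvantage noted above: every coefficient is pulled toward zero, including ones that ought to be large.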
========================================================== ================
7.6 Bayesian linear regression
To be added later.