Supervised Learning issues:
1. Linear Regression Model:
Applies when the independent variable x and the dependent variable y have a linear relationship.
2. Generalized Linear Model:
The generalized linear model is a generalization of the linear regression model, introduced to overcome its shortcomings. One shortcoming of the linear model is that a change in one region of the input space affects all other regions; this can be addressed by dividing the input space into several regions and fitting a different polynomial function in each region.
The independent variable can be discrete or continuous. A discrete variable can be a 0-1 variable, or a categorical variable taking several values.
Compared with the linear regression model, the following generalizations are available:
Depending on the data, different models can be freely selected. The familiar logit model, for example, is a generalized linear model that uses the logit link function and whose random error term follows a binomial distribution.
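As an illustration of a logit-link model, here is a minimal sketch using the statsmodels library on synthetic 0-1 data; the library choice, variable names, and data are assumptions for demonstration, not part of the original text.

```python
# Sketch: a generalized linear model with a logit link (logistic regression)
# fitted on synthetic 0-1 data. Names and data are illustrative assumptions.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=200)                      # one continuous predictor
p = 1.0 / (1.0 + np.exp(-(0.5 + 2.0 * x)))    # true success probability
y = rng.binomial(1, p)                        # 0-1 response

X = sm.add_constant(x)                        # add an intercept column
# Binomial family with the (default) logit link; the error term is binomial
model = sm.GLM(y, X, family=sm.families.Binomial())
result = model.fit()
print(result.params)                          # estimated intercept and slope
```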
Polynomial Curve Fitting
Why use the squared error rather than the absolute value of the difference? Take a look at the following derivation:
When we look for a model to fit the data, deviations are unavoidable. For a well-fitted model, the overall deviation should follow a normal distribution. By Bayes' theorem: P(h|D) = P(D|h)·P(h)/P(D), i.e. P(h|D) ∝ P(D|h)·P(h) (∝ means "proportional to").
Combining this with the normal distribution above, the probability of generating the point (x_i, y_i), whose actual ordinate is y_i, is P(d_i|h) ∝ exp(−(Δy_i)²). Each data point's deviation is independent, so the probabilities multiply: the probability of generating all N points is exp[−(Δy_1)²] · exp[−(Δy_2)²] · exp[−(Δy_3)²] · … = exp{−[(Δy_1)² + (Δy_2)² + (Δy_3)² + …]}. Maximizing this probability is therefore equivalent to minimizing (Δy_1)² + (Δy_2)² + (Δy_3)² + …, which is exactly least squares.
This works because the probability density function of the normal distribution is an exponential, a power of Euler's number e. Not every model has a globally optimal solution; some have only local optima, and some (such as NP-complete problems) have no tractable solution at all. The absolute-value form cannot be converted into such a tractable optimization problem, so it does not give a convenient way to obtain proper parameter estimates.
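A small numerical sketch of the argument above, under assumed synthetic data and a single slope parameter: the slope that maximizes the Gaussian likelihood of the deviations is the same one that minimizes the sum of squared deviations.

```python
# Sketch: maximizing the Gaussian likelihood of the residuals is equivalent
# to minimizing the sum of squared residuals. Data and grid are assumptions.
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 50)
y = 3.0 * x + rng.normal(scale=0.1, size=x.size)   # y = 3x plus Gaussian noise

thetas = np.linspace(0, 6, 601)                     # candidate slopes
sse = np.array([np.sum((y - t * x) ** 2) for t in thetas])
log_lik = -sse                                      # log prod exp(-(dy)^2) = -sum (dy)^2

print(thetas[np.argmin(sse)])      # slope that minimizes the squared error
print(thetas[np.argmax(log_lik)])  # slope that maximizes the likelihood -> same value
```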
The cost function J(θ) = (1/2) Σᵢ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾)² is half the sum of the squared differences between the estimate for x⁽ⁱ⁾ and the true value y⁽ⁱ⁾; the 1/2 in front is there for convenience, so that the coefficient cancels when taking the derivative. There are many ways to adjust θ so that J(θ) reaches its minimum:
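For concreteness, a minimal sketch of this cost function, assuming a NumPy design matrix X of shape (m, n) and a target vector y of length m:

```python
# Sketch: the squared-error cost J(theta) = 1/2 * sum_i (theta^T x_i - y_i)^2.
import numpy as np

def cost(theta, X, y):
    """X: (m, n) design matrix, y: (m,) targets, theta: (n,) parameters."""
    residuals = X @ theta - y
    return 0.5 * np.sum(residuals ** 2)   # the 1/2 cancels when differentiating
```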
Gradient descent procedure: take the partial derivative of J(θ) with respect to θ:
The update rule is θᵢ := θᵢ − α · ∂J(θ)/∂θᵢ, which moves θᵢ in the direction that decreases the gradient most. Here θᵢ on the right-hand side is the value before the update, the subtracted term is the decrease along the gradient direction, and α is the step size, i.e. how far to move in the direction of gradient descent at each step.
For a vector θ, each component θᵢ has its own gradient direction; combining them gives a descent direction for θ as a whole, and by repeatedly moving in the steepest-descent direction we reach a minimum point, whether local or global.
The gradient descent method proceeds as follows (a sketch in code follows this list):
1) First assign an initial value to x; it can be random, or x can be set to the zero vector.
2) Change the value of x so that f(x) decreases in the direction of gradient descent.
3) Repeat step 2) until the change in f(x) between two iterations is small enough, for example 0.00000001, i.e. until the computed f(x) essentially stops changing; at that point f(x) has reached a local minimum.
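A minimal sketch of this three-step procedure on a generic differentiable function f(x), using a numerical gradient; the example function, step size α, and tolerance are assumptions for illustration.

```python
# Sketch: the generic gradient-descent loop described in steps 1)-3).
import numpy as np

def numerical_grad(f, x, eps=1e-6):
    """Central-difference estimate of the gradient of f at x."""
    g = np.zeros_like(x)
    for i in range(x.size):
        d = np.zeros_like(x)
        d[i] = eps
        g[i] = (f(x + d) - f(x - d)) / (2 * eps)
    return g

def gradient_descent(f, x0, alpha=0.1, tol=1e-8, max_iter=10000):
    x = np.asarray(x0, dtype=float)           # step 1: initial (random or zero) x
    prev = f(x)
    for _ in range(max_iter):
        x = x - alpha * numerical_grad(f, x)  # step 2: move against the gradient
        cur = f(x)
        if abs(prev - cur) < tol:             # step 3: stop when f barely changes
            break
        prev = cur
    return x

# Example: minimize a bowl-shaped function with its minimum at (1, -2).
print(gradient_descent(lambda v: (v[0] - 1) ** 2 + (v[1] + 2) ** 2, [0.0, 0.0]))
```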
The convergence rate slows down near the minimum. A line search may cause problems, and the iterates may fall into a "zigzag" pattern. The algorithm is also strongly affected by the choice of the initial point and may converge to a local minimum.
1. Batch gradient descent (BGD) is solved as follows (see the sketch after this list):
(1) Take the partial derivative of J(θ) with respect to each θⱼ to obtain the gradient component corresponding to each θⱼ.
(2) Since we want to minimize the risk function, update each θⱼ along the negative gradient direction of that parameter.
(3) From the formula above it can be seen that this yields the globally optimal solution, but every iteration step uses all of the data in the training set; if m is large, the iteration speed of this method is easy to imagine. That is why another method, stochastic gradient descent, is introduced.
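A minimal sketch of batch gradient descent for linear regression along the lines of steps (1)-(3); the toy data, learning rate, and iteration count are assumptions.

```python
# Sketch: batch gradient descent -- each theta update uses the whole training set.
import numpy as np

def batch_gradient_descent(X, y, alpha=0.01, n_iter=1000):
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iter):
        grad = X.T @ (X @ theta - y)   # (1) partial derivative of J(theta) w.r.t. each theta_j
        theta -= alpha * grad / m      # (2) move along the negative gradient direction
    return theta                       # (3) global optimum for this convex problem

# Toy data: y = 2 + 3*x, with a column of ones for the intercept (an assumption).
x = np.linspace(0, 1, 100)
X = np.column_stack([np.ones_like(x), x])
y = 2 + 3 * x
print(batch_gradient_descent(X, y, alpha=0.5, n_iter=5000))   # approx [2, 3]
```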
2. Stochastic gradient descent (SGD) is solved as follows (see the sketch after this list):
(1) The risk function above can be rewritten so that the loss function corresponds to a single sample of the training set, whereas the batch gradient descent above corresponds to all training samples:
(2) For the loss function of each sample, take the partial derivative with respect to θ to obtain the corresponding gradient, and update θ.
(3) Stochastic gradient descent iterates through the samples and updates θ once per sample. If the sample size is very large (for example, hundreds of thousands), then perhaps after only tens of thousands or even thousands of samples θ has already been iterated to the optimal solution; by contrast, batch gradient descent above needs all hundred thousand training samples for a single iteration, one iteration is still unlikely to reach the optimum, and ten iterations require traversing the training set ten times. However, one problem with SGD is that it is noisier than BGD, so not every iteration moves toward the overall optimum.
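A corresponding sketch of stochastic gradient descent on the same assumed toy setup: θ is updated once per sample, so one pass over the data already performs m updates.

```python
# Sketch: stochastic gradient descent -- theta is updated once per sample.
import numpy as np

def stochastic_gradient_descent(X, y, alpha=0.05, n_epochs=50, seed=0):
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_epochs):
        for i in rng.permutation(m):               # visit samples in random order
            grad_i = (X[i] @ theta - y[i]) * X[i]  # gradient of the single-sample loss
            theta -= alpha * grad_i                # noisy step, not always toward the optimum
    return theta

x = np.linspace(0, 1, 100)
X = np.column_stack([np.ones_like(x), x])
y = 2 + 3 * x
print(stochastic_gradient_descent(X, y))           # roughly [2, 3]
```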
For the linear regression problem above, the optimization problem over θ is unimodal, i.e. (as in the figure above) there is only a single peak, so the final result of gradient descent is the global optimal solution. For multimodal problems, however, because there are multiple peaks, the final result of gradient descent may be only a local optimum.
A commonly used error metric is the root mean square error (RMSE), the square root of the mean squared deviation between the predictions and the true values:
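A tiny sketch of the RMSE computation, with assumed prediction and target arrays:

```python
# Sketch: root mean square error between predictions and true targets.
import numpy as np

def rmse(y_pred, y_true):
    return np.sqrt(np.mean((y_pred - y_true) ** 2))

print(rmse(np.array([2.5, 0.0, 2.0]), np.array([3.0, -0.5, 2.0])))  # about 0.408
```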
There are, broadly speaking, three ways to deal with over-fitting at present: increase the training data set; adopt the Bayesian method that this book promotes as a cure-all; add regularization. Regularization controls the over-fitting phenomenon.
Without regularization, the parameter vector w is very sensitive to changes in the training data: it tries to capture every change (to minimize the error), and it does so by changing its own magnitude arbitrarily, regardless of the scale of the training data (the weight vector w drifts so that the output matches the target variable). This is the defect of the model and the culprit behind over-fitting. Once the regularization term is introduced, if w tries to grow and drift in order to fit the target exactly, |w| becomes very large, the penalty dominates, and the over-close fit to the target variable fails.
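To make the penalty concrete, here is a sketch of L2 (ridge) regularization added to the least-squares objective; the closed-form solution, the λ value, and the synthetic data are assumptions chosen for illustration, not the book's exact formulation. The penalty λ‖w‖² grows with the size of the weights, so the fitted vector is shrunk instead of drifting arbitrarily.

```python
# Sketch: L2-regularized least squares (ridge). The penalty lam * ||w||^2
# discourages the weight vector from growing arbitrarily large to chase the targets.
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Closed-form ridge solution: w = (X^T X + lam * I)^(-1) X^T y."""
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 8))
y = X @ rng.normal(size=8) + rng.normal(scale=0.1, size=30)

w_unreg = ridge_fit(X, y, lam=0.0)
w_reg = ridge_fit(X, y, lam=5.0)
print(np.linalg.norm(w_unreg), np.linalg.norm(w_reg))   # the regularized norm is smaller
```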
Ordinary Least Squares
Fits a linear model with coefficients w to minimize the residual sum of squares between the observed responses in the dataset and the responses predicted by the linear approximation. Mathematically it solves a problem of the form: min_w ||Xw − y||₂².
However, coefficient estimates for Ordinary Least Squares rely on the independence of the model terms.
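This description matches scikit-learn's LinearRegression; a brief usage sketch on assumed toy data:

```python
# Sketch: ordinary least squares with scikit-learn's LinearRegression.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])          # y = 1 + 2*x

reg = LinearRegression().fit(X, y)
print(reg.intercept_, reg.coef_)            # approx 1.0 and [2.0]
```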
Ordinary Least Squares Complexity