Concept learning for linear regression, logistic regression, and related regression methods

Source: Internet
Author: User
Tags: svm

Conditions/Prerequisites for regression problems:

1) Collected data.

2) A hypothesized model: a function containing unknown parameters, which can be estimated by learning from the data. The model is then used to predict/classify new data.

1. Linear regression

Assume that the relationship between the features and the result is linear, i.e., each feature enters only to the first power. This is an assumption about the collected data.
Each component of the collected data can be viewed as a feature, and each feature corresponds to an unknown parameter. This yields a linear model function; in vector form:

h_θ(x) = θᵀx = θ₀x₀ + θ₁x₁ + ... + θₙxₙ

This is an estimation problem: given some data, how do we find the unknown parameters inside and give an optimal solution? Solving the linear matrix equation directly may not be possible; data sets that admit a unique exact solution are very rare.

In most cases the system of equations has no exact solution. Therefore we take a step back and transform the parameter-solving problem into a minimum-error problem, seeking the closest solution: a relaxed solution.

To find the best approximate solution, the intuitive approach is to minimize an error expression. The model is still a linear one with unknown parameters; given a pile of observations, we want the model with the smallest error on the data, i.e., the sum of squared differences between the model outputs and the data is minimal:

J(θ) = (1/2) Σᵢ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾)²

This is the origin of the loss function. Next come the methods for minimizing this function: least squares and gradient descent.


http://zh.wikipedia.org/wiki/%E7%BA%BF%E6%80%A7%E6%96%B9%E7%A8%8B%E7%BB%84

Least squares

This is the direct analytic solution, but it requires X to have full column rank, so that XᵀX is invertible.
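A minimal sketch of that closed-form (normal equation) solution, assuming NumPy; the synthetic data and names are illustrative, not from the original article:

```python
import numpy as np

# Synthetic data: y = 1 + 2*x plus Gaussian noise (illustrative only).
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.uniform(0, 10, 100)])  # bias column + one feature
y = X @ np.array([1.0, 2.0]) + rng.normal(0, 0.5, 100)

# Normal equation: theta = (X^T X)^{-1} X^T y.
# Requires X^T X to be invertible, i.e., X has full column rank.
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)  # approximately [1.0, 2.0]
```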

Gradient descent method

Gradient descent comes in batch and incremental (stochastic) variants. In essence, it amounts to taking the partial derivatives, choosing a step size (learning rate), updating the parameters, and iterating until convergence. This algorithm is just an ordinary method from optimization theory; studying it together with optimization principles makes it easy to understand.
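A hedged sketch of the two variants for the squared loss above, assuming NumPy; the function names and hyperparameters are illustrative:

```python
import numpy as np

def batch_gradient_descent(X, y, lr=0.01, iters=1000):
    """Batch gradient descent: each update uses all samples."""
    theta = np.zeros(X.shape[1])
    m = len(y)
    for _ in range(iters):
        grad = X.T @ (X @ theta - y) / m  # partial derivatives of the squared loss
        theta -= lr * grad                # step size = learning rate
    return theta

def stochastic_gradient_descent(X, y, lr=0.01, epochs=50):
    """Incremental/stochastic gradient descent: each update uses one sample."""
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in np.random.permutation(len(y)):
            grad = (X[i] @ theta - y[i]) * X[i]
            theta -= lr * grad
    return theta
```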

2. Logistic regression

What are the relations, similarities, and differences between logistic regression and linear regression?

The model of logistic regression is a nonlinear model: the sigmoid function, also called the logistic function. But it is essentially a linear regression model, because apart from the sigmoid mapping, the remaining steps of the algorithm are those of linear regression. One can say that logistic regression is underpinned by the theory of linear regression.

However, a linear model by itself cannot express the nonlinear form of the sigmoid, and the sigmoid handles 0/1 classification problems easily.

Its derivation also parallels the maximum likelihood estimation of linear regression: form the likelihood as a continuous product (the distribution here can be Bernoulli, or another form such as Poisson), take the derivative, and obtain the loss function.

The logistic function:

g(z) = 1 / (1 + e⁻ᶻ),  h_θ(x) = g(θᵀx)

which expresses the form of a 0/1 classification.
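A minimal sketch of the logistic function and a gradient-descent fit, assuming NumPy; the names and hyperparameters are illustrative, not from the original article:

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^{-z}), squashing values into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, iters=1000):
    """Logistic regression via gradient descent on the log loss.
    Note the update has the same form as linear regression; only the
    hypothesis h = sigmoid(X @ theta) differs."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        h = sigmoid(X @ theta)
        theta -= lr * X.T @ (h - y) / len(y)
    return theta

def predict(X, theta):
    """0/1 classification: threshold the probability at 0.5."""
    return (sigmoid(X @ theta) >= 0.5).astype(int)
```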

Application examples:

Is an email spam?

Is a tumor malignant (cancer diagnosis)?

Is a transaction financial fraud?

3. Generalized linear regression

Linear regression uses the Gaussian distribution as its error-analysis model, while logistic regression uses the Bernoulli distribution to analyze the error.

The Gaussian, Bernoulli, beta, and Dirichlet distributions all belong to the exponential family of distributions.

In generalized linear regression, the conditional probability distribution p(y|x) of y given x is an exponential-family distribution.

Through the derivation of maximum likelihood estimation, the error-analysis model (error-minimization model) of generalized linear regression can be derived.

Softmax regression is an example of generalized linear regression.

It is a supervised-learning regression for multi-class problems (logistic regression solves the two-class problem): for example, classifying the handwritten digits 0-9, where y has 10 possible values.

This distribution over the possible classes is in the exponential family, and all the class probabilities sum to 1. The output for an input x can be expressed as:

p(y = j | x; θ) = exp(θⱼᵀx) / Σₗ exp(θₗᵀx), summing over the k classes.

The parameters are k vectors θ₁, ..., θₖ, and the cost function is:

J(θ) = −(1/m) Σᵢ Σⱼ 1{y⁽ⁱ⁾ = j} · log( exp(θⱼᵀx⁽ⁱ⁾) / Σₗ exp(θₗᵀx⁽ⁱ⁾) )

This is a generalization of the cost function of logistic regression.

For softmax, however, there is no closed-form solution (it would amount to solving higher-order polynomial equations), so gradient descent or L-BFGS is still used.

When k = 2, softmax degenerates into logistic regression, which also shows that softmax regression is the generalization of logistic regression.
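A hedged NumPy sketch of the softmax probabilities and cost; the names are illustrative, and the max-subtraction is a standard numerical-stability trick, not from the original article:

```python
import numpy as np

def softmax(scores):
    """Class probabilities: exp(theta_j^T x), normalized so each row sums to 1."""
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)

def softmax_cost(Theta, X, y):
    """Negative log-likelihood, the generalization of the logistic loss.
    Theta: (n_features, k); y: integer labels in {0, ..., k-1}."""
    p = softmax(X @ Theta)
    return -np.mean(np.log(p[np.arange(len(y)), y]))
```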

The relations among linear regression, logistic regression, and softmax regression deserve repeated reflection; the more you think them over, the deeper the understanding goes.

4. Fitting: fitting a model/function

From the measured data, an assumed model/function is estimated. How do we judge whether the fit is appropriate? Fits can be divided into the following three categories:

Proper fit

Under-fitting

Over-fitting

An article (see the references below) has diagrams that illustrate this very well:

(Figures omitted: examples of under-fitting, a proper fit, and over-fitting.)

How to solve the problem of overfitting?

Where does the problem come from? The model is too complex, with too many parameters and too many features.

Methods: 1) Reduce the number of features, either by manual selection or with a model-selection/feature-selection algorithm.

http://www.cnblogs.com/heaad/archive/2011/01/02/1924088.html (a review of feature selection algorithms)

2) Regularization: keep all the features, but reduce the magnitude of the parameter values. The advantage of regularization is that when there are many features, each feature retains an appropriately small influence factor.

5. Probabilistic interpretation: why is the sum of squares used as the error function in linear regression?

Assume that the error between the model output and the measured value follows a Gaussian (normal) distribution with zero mean. This assumption is reliable and conforms to the general statistical behavior of measurement error.

The conditional probability of y given x:

p(y⁽ⁱ⁾ | x⁽ⁱ⁾; θ) = (1/(√(2π)σ)) · exp( −(y⁽ⁱ⁾ − θᵀx⁽ⁱ⁾)² / (2σ²) )

The model closest to the measured data is the one that makes this probability product largest. The probability product is the continuous product of the probability densities, which forms the maximum likelihood estimate. Carrying out the maximization, the result obtained is exactly the minimum sum-of-squares formula.
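A sketch of that derivation in standard notation, under the zero-mean Gaussian assumption above:

```latex
% Likelihood under the Gaussian error assumption:
L(\theta) = \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma}
            \exp\!\Big(-\frac{(y^{(i)}-\theta^{T}x^{(i)})^{2}}{2\sigma^{2}}\Big)
% Taking the log turns the product into a sum:
\ell(\theta) = m\log\frac{1}{\sqrt{2\pi}\,\sigma}
             - \frac{1}{2\sigma^{2}}\sum_{i=1}^{m}\big(y^{(i)}-\theta^{T}x^{(i)}\big)^{2}
% Maximizing \ell(\theta) is therefore equivalent to minimizing:
\frac{1}{2}\sum_{i=1}^{m}\big(y^{(i)}-\theta^{T}x^{(i)}\big)^{2}
```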

6. The relation between parameter estimation and the data

It is a fitting relationship: the parameters are estimated precisely so that the model fits the observed data.

7. Error function / cost function / loss function:

The sum-of-squares form in linear regression is usually derived, via the maximum likelihood function, from the probability product of the model's conditional probabilities.

In statistics, the loss function generally has the following types:

1) 0-1 loss function

L(Y, f(X)) = 1 if Y ≠ f(X); 0 if Y = f(X)

2) Square loss function

L(Y, f(X)) = (Y − f(X))²

3) Absolute loss function

L(Y, f(X)) = |Y − f(X)|

4) Logarithmic loss function

L(Y, P(Y|X)) = −log P(Y|X)

The smaller the loss, the better the model fits; and a loss function that is convex facilitates the convergence of the computation.

Linear regression uses the square loss function; logistic regression uses the logarithmic loss function. These are stated here as results, without derivation.
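As a small illustrative sketch (assuming NumPy; the function names are mine, not the article's), the four losses:

```python
import numpy as np

def zero_one_loss(y, f):   # 1 when the prediction is wrong, 0 when right
    return np.where(y != f, 1.0, 0.0)

def square_loss(y, f):     # used by linear regression
    return (y - f) ** 2

def absolute_loss(y, f):
    return np.abs(y - f)

def log_loss(p):           # used by logistic regression; p = P(Y | X)
    return -np.log(p)
```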

8. Regularization:

To prevent over-fitting (overly complex models), a penalty term on the feature parameters is added to the loss function. This is regularization. For example, the regularized loss function of linear regression:

J(θ) = (1/2m) [ Σᵢ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾)² + λ Σⱼ θⱼ² ]

λ is the penalty factor.

Regularization is a typical model-treatment method; it is also the structural risk minimization strategy: on top of the empirical risk (the sum of squared errors), a penalty/regularization term is added.

The closed-form solution of linear regression also changes from

θ = (XᵀX)⁻¹Xᵀy

to

θ = (XᵀX + λI)⁻¹Xᵀy

With the added λI term, the matrix in parentheses is invertible even when the number of samples is smaller than the number of features.
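A minimal NumPy sketch of that regularized normal equation (the function name is illustrative):

```python
import numpy as np

def ridge_solution(X, y, lam):
    """Regularized normal equation: theta = (X^T X + lam*I)^{-1} X^T y.
    The lam*I term makes the matrix invertible even when the number of
    samples is smaller than the number of features."""
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)
```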

Regularization of logistic regression:

J(θ) = −(1/m) Σᵢ [ y⁽ⁱ⁾ log h_θ(x⁽ⁱ⁾) + (1 − y⁽ⁱ⁾) log(1 − h_θ(x⁽ⁱ⁾)) ] + (λ/2m) Σⱼ θⱼ²

From the viewpoint of Bayesian estimation, the regularization term corresponds to the prior probability of the model: a complex model is assigned a smaller prior probability, and a simple model a larger one. Several concepts are involved here.

What is structural risk minimization? What is a prior probability? How is the simplicity of a model related to its prior probability?

Empirical risk, expected risk, empirical loss, structural risk:

Expected risk (true risk): with the model function fixed, the average loss over the whole data distribution, i.e., the "average" level of error. The expected risk depends on the loss function and on the probability distribution of the data.

From the samples alone, the expected risk cannot be computed.

Therefore the empirical risk is used to estimate the expected risk, and the learning algorithm is designed to minimize it. This is empirical risk minimization (ERM), and the empirical risk is evaluated using the loss function on the training data.

For classification problems, the empirical risk is the error rate on the training sample.

For function approximation (fitting) problems, the empirical risk is the squared training error.

For the probability density estimation problem, ERM is exactly the maximum likelihood estimation method.

Minimum empirical risk does not necessarily mean minimum expected risk; there is no theoretical guarantee of that. Only when the sample size is infinitely large does the empirical risk approach the expected risk.

How is this problem addressed? Statistical learning theory (SLT) and the support vector machine (SVM) are aimed precisely at it:

learning a better model under the condition of finite samples.

Because the sample is finite, the empirical risk Remp[f] cannot approximate the expected risk R[f] well. Therefore, statistical learning theory gives a relation between the two: R[f] ≤ Remp[f] + ε

The right-hand expression is the structural risk, an upper bound on the expected risk. Here ε = g(h/n) is the confidence interval: an increasing function of the VC dimension h and a decreasing function of the sample size n.

The definition of VC dimension is described in detail in the SVM/SLT literature. Since ε depends on h and n, and we want the expected risk to be small, we care only about its upper bound, i.e., we minimize Remp[f] + ε. So we need to choose appropriate h and n. This is structural risk minimization (SRM).

SVM is an approximate implementation of SRM; the concepts behind SVM are a whole other basketful, so we stop here.

Physical meaning of the L1 norm and L2 norm:

A norm maps an object to a nonnegative real number and satisfies non-negativity, homogeneity, and the triangle inequality; it is a function carrying the concept of "length".

Why can the L1 norm yield sparse solutions?

In compressed sensing theory, the solving/reconstruction step solves an L1-norm-regularized least squares problem, whose solution is precisely the sparse solution of the underdetermined linear system.
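One hedged way to see the sparsity effect is soft thresholding, the proximal operator of the L1 norm, which arises when solving L1-regularized least squares coordinate-wise; this sketch is illustrative, not from the original article:

```python
import numpy as np

def soft_threshold(theta, lam):
    """Proximal operator of lam * ||theta||_1: components whose magnitude
    is below lam are set exactly to zero -- the source of sparsity."""
    return np.sign(theta) * np.maximum(np.abs(theta) - lam, 0.0)

print(soft_threshold(np.array([0.3, -1.5, 0.05]), 0.5))  # [ 0. -1.  0.]
```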

Why can the L2 norm yield the maximum-margin solution?

The L2 norm measures energy; it serves as the unit of measure for the reconstruction error.

Some of the above concepts need to be supplemented.

9. Minimum description length criterion:

That is, given a set of instance data, when storing it, we use a model to encode and compress it. The length of the model plus the length of the compressed data is the total description length of that data. The minimum description length criterion selects the model with the smallest total description length.

An important feature of the minimum description length (MDL) criterion is that it avoids over-fitting.

For example, when using Bayesian networks to compress data, the length of the model itself increases with model complexity, while the description length of the data set decreases as model complexity increases. Therefore, the MDL of Bayesian networks always seeks a balance between model accuracy and model complexity. When the model becomes too complex, the minimum description length criterion takes effect and limits the complexity.

10. Occam's Razor principle:

If there are two principles that both explain the observed facts, you should use the simpler one, until more evidence comes along.

Everything should be made as simple as possible, but not simpler.

11. Convex relaxation technique:

Transforming a combinatorial optimization problem into a convex optimization problem, whose extreme point is easy to find. Related topics: the derivation of convex cost functions, and the maximum likelihood estimation method.

12. Newton's method for solving maximum likelihood estimation

Preconditions: the iteration uses derivatives, so the likelihood function must be differentiable, indeed twice differentiable.

Iteration formula:

θ := θ − ℓ′(θ) / ℓ″(θ)

In vector form:

θ := θ − H⁻¹ ∇_θ ℓ(θ)

where H is the n×n Hessian matrix.

Characteristics: Newton's method converges quickly near an extreme point, but may fail to converge when far from it. (How is this derived?)

This is the opposite of the convergence characteristics of gradient descent.
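A hedged NumPy sketch of Newton's method applied to the logistic regression likelihood; the names are illustrative, and it assumes X has full column rank so the Hessian is invertible:

```python
import numpy as np

def newton_logistic(X, y, iters=10):
    """Newton's method for the logistic regression MLE:
    theta := theta - H^{-1} grad, where H is the n x n Hessian
    of the negative log-likelihood. Converges in few iterations
    near the optimum."""
    n = X.shape[1]
    theta = np.zeros(n)
    for _ in range(iters):
        h = 1.0 / (1.0 + np.exp(-X @ theta))
        grad = X.T @ (h - y)
        H = X.T @ np.diag(h * (1 - h)) @ X  # Hessian (positive semidefinite)
        theta -= np.linalg.solve(H, grad)
    return theta
```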

Linearity and nonlinearity:

Linear: a first-degree function. Nonlinear: the output is not proportional to the input; not a first-degree function.

Limitation of linear models: the XOR problem, which is linearly inseparable, of the form:

x 0
0 x

(the two classes, x and 0, occupy opposite diagonal corners, and no single straight line separates them).

Linearly separable means that a mere linear function can classify the data; in two dimensions, a linear function is a straight line.

Linearly independent: individual features or components that cannot be represented as a linear combination of the other components or features.

The physical meaning of the kernel function:

It maps the data to a high dimension, making it linearly separable there. What counts as high-dimensional? A one-dimensional feature x, converted to (x, x², x³), becomes a three-dimensional feature whose components are linearly independent. A one-dimensional data set that is linearly inseparable can become linearly separable in the higher dimension.
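A tiny illustrative sketch of such a feature map, assuming NumPy; the data are made up for the demonstration:

```python
import numpy as np

def poly_map(x):
    """Map a 1-D feature x to (x, x^2, x^3): three linearly independent
    components, as in the example above."""
    return np.column_stack([x, x**2, x**3])

# Illustrative 1-D data that no single threshold on x can separate:
# class 1 sits in the middle, class 0 on both sides.
x = np.array([-2.0, -1.5, -0.5, 0.0, 0.5, 1.5, 2.0])
y = np.array([0, 0, 1, 1, 1, 0, 0])

# In the mapped space, the linear rule x^2 < 1 separates the classes:
print((poly_map(x)[:, 1] < 1.0).astype(int))  # matches y
```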

Logistic regression is still essentially linear regression; why is it treated as a separate category?

Because it involves a nonlinear mapping relation and typically handles binary 0/1 problems; it is an extension of linear regression and is widely used, so it stands as a category of its own.

Moreover, if linear regression (with its squared loss) is applied directly to fit 0/1 classification data, many local minima arise: the loss surface is non-convex. The loss function of linear regression on its own data is convex, i.e., its minimum extreme point is the global minimum. The model and the data do not match.

If the logistic regression loss function (the log loss) is used instead, the loss function becomes convex.

Polynomial and spline function fitting

In polynomial fitting, the model takes a polynomial form. With a spline function, the model is not only continuous, but also has continuous derivatives at the boundaries (knots). Benefit: the curve is smooth and avoids oscillation near the boundaries (the Runge phenomenon).
http://baike.baidu.com/view/301735.htm

Here are a few concepts that need to be understood slowly:

Unstructured predictive models

Structured predictive models

What is a structured problem?

The relationship among the AdaBoost, SVM, and LR algorithms.

The three algorithms correspond respectively to the exponential loss, the hinge loss, and the log loss; there is no essential difference among them. Each uses a convex upper bound to replace the 0-1 loss, i.e., the convex relaxation technique: from combinatorial optimization to convex optimization. A convex function makes it easier to compute the extreme point.
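A hedged sketch comparing the three surrogate losses as functions of the margin m = y·f(x), assuming NumPy and labels in {−1, +1}; the base-2 scaling of the log loss (so that it upper-bounds the 0-1 loss) is a common convention, not stated in the article:

```python
import numpy as np

def losses(margin):
    """Three convex upper bounds on the 0-1 loss, as functions of the
    margin m = y * f(x), with y in {-1, +1}."""
    return {
        "0-1":         (margin <= 0).astype(float),
        "exponential": np.exp(-margin),                        # AdaBoost
        "hinge":       np.maximum(0.0, 1.0 - margin),          # SVM
        "log":         np.log1p(np.exp(-margin)) / np.log(2),  # logistic regression
    }

for name, val in losses(np.array([-1.0, 0.0, 1.0, 2.0])).items():
    print(name, val)
```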

What is the relation between regularization and Bayesian parameter estimation?

Some reference articles:

http://www.guzili.com/?p=45150

http://52opencourse.com/133/coursera%E5%85%AC%E5%BC%80%E8%AF%BE%E7%AC%94%E8%AE%B0-%E6%96%AF%E5%9D%A6%E7%A6%8F%E5%A4%A7%E5%AD%A6%E6%9C%BA%E5%99%A8%E5%AD%A6%E4%B9%A0%E7%AC%AC%E4%B8%83%E8%AF%BE-%E6%AD%A3%E5%88%99%E5%8C%96-regularization

http://www.cnblogs.com/jerrylead/archive/2011/03/05/1971867.html
