Mathematics in Machine Learning (2) - Linear Regression and the Bias-Variance Tradeoff


Copyright:

This article is owned by leftnoteasy and published at http://leftnoteasy.cnblogs.com. If it is reproduced, please indicate the source. Using this article for commercial purposes without the author's consent may lead to legal liability. If you have any questions, please contact the author at wheeleast@gmail.com.

Preface:

It has been almost half a month since my last article. Over that time I have kept exploring machine learning and accumulated some experience, which I will write up gradually. Writing articles is a good way to deepen your understanding of a subject. When reading a book, it is easy to skim past formulas and knowledge points without grasping their specific meaning. Writing an article, especially a technical one, forces you to understand those specific meanings and even to come up with vivid examples; to write one, you often need to re-examine things you thought you already understood.

Machine learning is not a purely technical subject; I discussed this with my department head during an outing. Machine learning is definitely not a collection of isolated algorithms, and reading a machine-learning book the way you would read an introduction to algorithms is a poor approach. A few themes run through every book: data distributions, maximum likelihood (and the various methods for finding extrema, which are more mathematical), the bias-variance tradeoff, and knowledge about feature selection, model selection, and mixture models. Like bricks and mortar, these pieces build up the algorithms of machine learning. To learn the algorithms well, you must settle down and sort out this foundation; only then can you truly understand and implement them.

Today's topic is linear regression, and we will also touch on the bias-variance tradeoff.

Definition of linear regression:

The previous article was also about regression, but it focused on the concept of gradients; this one focuses on regression itself and on bias and variance.

The simplest definition of regression: given a point set D, fit it with a function so that the error between the points and the fitted function is minimized.

Given a point set (x, y), we fit it with a function. The blue points are the points in the set, and the red curve is the fitted function. The first figure shows the simplest model, the linear function y = f(x) = ax + b.

The second figure shows a quadratic curve, with the function y = f(x) = ax^2 + b.

The third figure is a more complex curve; I don't know exactly what function it is.

The fourth figure can be regarded as a polynomial of degree n = m - 1, where m is the number of points in the set. There is a theorem that for any m given points (with distinct x-values), a polynomial of degree m - 1 can pass through all of them exactly.
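The four figures can be sketched numerically. The following is a minimal illustration (the sine curve and noise level are my own choice of example data, not from the original figures): fitting polynomials of increasing degree to m points, where the degree m - 1 fit interpolates the points exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 5  # number of points
x = np.linspace(0, 1, m)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, m)  # noisy samples of a curve

# Fit polynomials of increasing degree and report the worst training residual.
# A degree (m-1) polynomial passes through all m points exactly.
for degree in (1, 2, m - 1):
    coeffs = np.polyfit(x, y, degree)
    residual = np.max(np.abs(y - np.polyval(coeffs, x)))
    print(f"degree {degree}: max residual {residual:.2e}")
```

As the theorem states, the residual for degree m - 1 drops to (numerical) zero, while lower-degree fits leave visible error.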

Real linear regression considers not only how well the curve fits the point set, but also how simple the model is; this will be discussed in depth in the bias-variance section later in this article. You can also refer to my earlier article, "Bayesian, probability distribution, and machine learning", which also discussed model complexity.

Linear regression does not mean the model is a linear function of x; it means the model is linear in the parameters w.

(For convenience, I will not put arrows over vectors from here on.)

x0, x1, ... represent different dimensions. For example, as mentioned in the previous article, the price of a house is determined by factors such as its area, number of rooms, and orientation. More generally, we use a generalized linear function:

y(x, w) = \sum_j w_j \phi_j(x)

w_j is a coefficient, and w is the vector composed of these coefficients; they weight the influence of the different phi_j(x) on the regression function. For house prices, for example, the weight on the orientation should be smaller than the weight on the area. phi(x) makes this a generalized linear model: it can be replaced by different basis functions, and is not necessarily phi(x) = x.
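The "linear in w, nonlinear in x" point can be made concrete. Below is a small sketch using Gaussian basis functions (the function name, centers, and width are illustrative choices of mine, not from the original): the fitted curve is a nonlinear function of x, yet the model is still linear in the coefficient vector w.

```python
import numpy as np

def gaussian_basis(x, centers, width=0.2):
    """Illustrative basis: phi_j(x) = exp(-(x - mu_j)^2 / (2 * width^2))."""
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * width ** 2))

x = np.linspace(0, 1, 50)
centers = np.linspace(0, 1, 5)
Phi = gaussian_basis(x, centers)           # design matrix, one column per phi_j
w = np.array([0.5, -1.0, 2.0, 0.3, -0.7])  # coefficient vector
y = Phi @ w                                # y(x, w) = sum_j w_j * phi_j(x)

# Nonlinear in x, but linear in w: scaling w scales the output.
print(np.allclose(Phi @ (2 * w), 2 * y))
```

This linearity in w is what makes the least-squares solution in the next section a simple linear-algebra problem, regardless of which basis functions are used.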

 

The Least Squares Method and Maximum Likelihood:

I will focus here on one point of understanding. Least squares is the simplest method in linear regression. Its derivation assumes that the error between the regression function's estimate and the actual value follows a Gaussian distribution. Let y(x, w) be the estimate of the regression function given the coefficient vector w, t the actual value, and epsilon the error; then:

t = y(x, w) + \epsilon

This is a simple conditional probability expression: the probability of the true value t given x, w, and beta. Because epsilon follows a Gaussian distribution, the probability of the actual value given the estimate is also Gaussian:

p(t \mid x, w, \beta) = \mathcal{N}(t \mid y(x, w), \beta^{-1})

In "Bayesian, probability distribution, and machine learning" there was much discussion of the impact of the assumed distribution. Returning to this topic: because least squares makes this assumption, if the errors between the estimated function y(x, w) and the true values t are not Gaussian, or are even far from it, the resulting model will be wrong. Given a new point x' the estimate y' may then be far from the actual value t.
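Under the Gaussian-noise assumption, maximizing the likelihood is equivalent to minimizing squared error, so the maximum-likelihood coefficients are the least-squares solution w = (Phi^T Phi)^{-1} Phi^T t. A minimal sketch (the true coefficients, basis, and noise level are made-up example values):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.uniform(0, 1, n)
true_w = np.array([1.0, -2.0, 0.5])
Phi = np.vander(x, 3, increasing=True)    # design matrix with basis [1, x, x^2]
t = Phi @ true_w + rng.normal(0, 0.1, n)  # Gaussian noise -> least squares = MLE

# Maximum-likelihood solution under Gaussian noise is the least-squares fit.
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)
print(w_ml)
```

With enough data and noise that really is Gaussian, w_ml recovers the true coefficients closely; when the noise is far from Gaussian (e.g. heavy-tailed), this same estimator can be badly off, which is exactly the weakness discussed below.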

Probability distributions are both lovable and hateful. When we can accurately characterize the distribution of the data, we can build a very accurate model to predict it. In most real application scenarios, however, the data distribution is unknown, and it is hard to represent the true distribution with one distribution, or even a mixture of several. For example, given 100 million web pages, the word frequency distribution cannot be matched to any standard distribution (such as a Gaussian mixture). In such cases we can only estimate occurrence probabilities directly: for instance, a word that appears on half of the pages has probability 0.5. If an algorithm assumes a particular distribution, it may not perform well in real applications. The least squares method is powerless against such complex problems.

 

The Bias-Variance Trade-off:

Bias and variance are statistical concepts. When I first joined the company, I felt terrible hearing these two words constantly. First, we must be clear that variance is a property of multiple models compared with one another, not of a single model. For a single model, for example:

an estimation function with fixed, given coefficients has no variance of f(x) to speak of. Bias can be defined on a single dataset or across multiple datasets, depending on the specific definition.

In general, to talk about variance and bias we take the same dataset, draw several different sub-datasets from it by a sound sampling method, and train a model on each; then we can speak of the variance and bias of these models. Both generally change with the complexity of the model, just like the four small figures at the beginning of this article. When we blindly pursue a precise fit, the models trained on different sub-datasets of the same data may differ greatly from one another. That is large variance, while their bias is small, as shown below:

The blue and green points represent different sub-datasets sampled from the same data. We fit each with a degree-n curve; the two curves (blue and dark green) differ greatly even though they come from the same underlying dataset. This is the large variance caused by model complexity. The more complex the model, the smaller the bias and the larger the variance; the simpler the model, the larger the bias and the smaller the variance. They change as follows:
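The effect described above can be reproduced numerically. This sketch (the sine curve, noise level, and polynomial degrees are my own illustrative choices) fits many independently sampled datasets from the same distribution and measures how much the fitted curves disagree with one another, i.e. the variance:

```python
import numpy as np

rng = np.random.default_rng(2)
x_grid = np.linspace(0, 1, 101)

def fit_on_subsample(degree):
    # Draw a fresh noisy dataset from the same underlying curve, then fit it.
    x = rng.uniform(0, 1, 15)
    t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 15)
    return np.polyval(np.polyfit(x, t, degree), x_grid)

def variance_of_fits(degree, n_datasets=50):
    fits = np.array([fit_on_subsample(degree) for _ in range(n_datasets)])
    return fits.var(axis=0).mean()  # average pointwise variance across models

print(variance_of_fits(1), variance_of_fits(9))
```

The high-degree fits disagree far more across sub-datasets than the linear fits do, which is exactly the large variance the figure illustrates.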

The point where the sum of variance and squared bias reaches its minimum gives our best model complexity.

A very down-to-earth example: our country's blind pursuit of GDP. GDP is like the model's bias; the country wants the gap between current GDP and target GDP to be as small as possible. However, many complicated methods are used, such as reselling land and forced demolitions. This increases the complexity of the model and increases the variance (the income distribution of residents: the poor get poorer - those pushed out of the city and those who move in but cannot afford a house - and the rich get richer - those who resell land and those who sell houses). In fact, the original model need not be so complex; balancing residents' income distribution with the country's development is the best model.

Finally, bias and variance in mathematical language:

E[L] is the expected loss and h(x) is the conditional average of the true value, h(x) = \int t\, p(t \mid x)\, dt. With squared loss:

E[L] = \int \{y(x) - h(x)\}^2 p(x)\, dx + \iint \{h(x) - t\}^2 p(x, t)\, dx\, dt

The first term depends on y, the model's estimation function; this part reflects the differences brought by the choice of different estimation functions (models). The second term is independent of y and can be regarded as the intrinsic noise.

For the first term of the formula above, taking the expectation over datasets D, we can convert it into the following form:

E_D[\{y(x; D) - h(x)\}^2] = \{E_D[y(x; D)] - h(x)\}^2 + E_D[\{y(x; D) - E_D[y(x; D)]\}^2]

This derivation is from PRML section 1.5.5. The first term is the squared bias and the second is the variance. We can conclude: expected loss = bias^2 + variance + intrinsic noise.

Also from PRML:

This is a curve-fitting problem in which different datasets from the same distribution are fitted repeatedly. The left column illustrates the variance, the right column the bias, and the green curve is the true function. ln(lambda) indicates the model complexity: the smaller the value, the higher the complexity. In the first row, when the complexity is very low (everyone is very poor), the variance is small, but the bias is large (the country is also very poor). In the last row, when the complexity is high, the different fitted functions differ greatly from one another (the gap between rich and poor), but the bias is very small (the country is very rich).

Preview:

Next we will discuss some issues of linear classification, so stay tuned :)
