Machine Learning-Stanford: Learning Note 3-under-fitting and over-fitting concepts

Source: Internet
Author: User

The concept of under-fitting and over-fitting

Outline of this lecture:

1. Locally weighted regression: a variant of linear regression

2. Probabilistic interpretation: another way to justify linear regression

3. Logistic regression: a classification algorithm built on 2

4. Perceptron algorithm: an extension of 3, covered briefly

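Review:

$(x^{(i)}, y^{(i)})$ denotes the i-th training sample.

With parameter vector $\theta$, the output for input $x$ is:

$$h_\theta(x) = \sum_{j=0}^{n} \theta_j x_j = \theta^T x$$

where n is the number of features and $x_0 = 1$.

The cost function J is defined as:

$$J(\theta) = \frac{1}{2} \sum_{i=1}^{m} \left( y^{(i)} - h_\theta(x^{(i)}) \right)^2$$

where m is the number of training samples.

Solving the normal equations gives the closed-form solution:

$$\theta = (X^T X)^{-1} X^T y$$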
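
As a minimal sketch of the review above, the snippet below evaluates the cost function J and solves the normal equations for the special case of one feature plus an intercept, where $X^T X$ is a 2x2 matrix that can be inverted by hand. The function names and the toy data are illustrative assumptions, not part of the lecture.

```python
def cost(theta0, theta1, xs, ys):
    """J(theta) = 1/2 * sum of squared residuals for h(x) = theta0 + theta1 * x."""
    return 0.5 * sum((y - (theta0 + theta1 * x)) ** 2 for x, y in zip(xs, ys))


def normal_equation_1d(xs, ys):
    """Solve (X^T X) theta = X^T y where each row of X is [1, x]."""
    m = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # X^T X = [[m, sx], [sx, sxx]] and X^T y = [sy, sxy]; invert the 2x2 matrix.
    det = m * sxx - sx * sx
    if det == 0:
        raise ValueError("X^T X is singular: all x values are identical")
    theta0 = (sxx * sy - sx * sxy) / det
    theta1 = (m * sxy - sx * sy) / det
    return theta0, theta1


# Toy data (hypothetical) lying exactly on y = 1 + 2x.
xs = [1.0, 2.0, 3.0]
ys = [3.0, 5.0, 7.0]
theta0, theta1 = normal_equation_1d(xs, ys)
print(theta0, theta1, cost(theta0, theta1, xs, ys))
```

With more than one feature the same equations apply, but a linear-algebra library would normally be used to solve them.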

1. Over-fitting and under-fitting

In general, the choice of features you hand to the learning algorithm has a significant impact on how well the algorithm performs.

Example: in the previous lecture, the size of the house was denoted by $x_1$. Linear regression fits a straight line on a plot with house size on the horizontal axis and price on the vertical axis. The fitted equation is:

$$h_\theta(x) = \theta_0 + \theta_1 x_1$$

If the feature set is instead defined so that $x_1$ is the size of the house and $x_2$ is the square of the size, the same algorithm fits a quadratic function, which appears as a parabola on the plot:

$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2, \quad x_2 = x_1^2$$

Continuing in this way, if the training set has 7 data points, you can fit a polynomial of degree up to 6 and obtain a curve that passes exactly through every data point. But this model is too complex: the fit reflects only the idiosyncrasies of the given data, not the general relationship between house size and price. At the other extreme, the straight-line fit may fail to capture structure that is present in the training set.

So, for a supervised learning model, too small a feature set makes the model too simple, and too large a feature set makes the model too complex.

When the feature set is too small, the result is called under-fitting (underfitting).

When the feature set is too large, the result is called over-fitting (overfitting).
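
The contrast between the two extremes can be sketched on 7 made-up data points. The straight line is an ordinary least-squares fit. For the degree-6 model the snippet uses the Lagrange interpolating polynomial, which is an assumption of convenience: with 7 distinct points, the least-squares degree-6 polynomial is exactly the unique polynomial through all of them. The data values and function names are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares straight line; returns a predictor function."""
    m = len(xs)
    mean_x = sum(xs) / m
    mean_y = sum(ys) / m
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def fit_interpolant(xs, ys):
    """Degree-(m-1) polynomial through all m points (Lagrange form)."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            basis = 1.0
            for j, xj in enumerate(xs):
                if j != i:
                    basis *= (x - xj) / (xi - xj)
            total += yi * basis
        return total
    return predict


def squared_error(predict, xs, ys):
    """Sum of squared residuals of a predictor on a data set."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys))


# Hypothetical data: house size (1000 sq ft) and price (1000 USD), 7 points.
sizes = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
prices = [200.0, 260.0, 330.0, 350.0, 410.0, 480.0, 500.0]

line = fit_line(sizes, prices)
poly6 = fit_interpolant(sizes, prices)

print("training error, line:    ", squared_error(line, sizes, prices))
print("training error, degree 6:", squared_error(poly6, sizes, prices))
# Outside the training range the two models can disagree sharply.
print("prediction at 4.5, line:    ", line(4.5))
print("prediction at 4.5, degree 6:", poly6(4.5))
```

The degree-6 model has zero training error by construction, which is exactly why training error alone cannot tell you whether a model generalizes.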

Ways to address this kind of problem:

1) Feature selection algorithms: a class of automatic algorithms that choose which features to use in such regression problems

2) Non-parametric learning algorithms: these reduce the need to choose features carefully, and lead us to locally weighted regression

Parametric learning algorithm

Definition: a parametric learning algorithm has a fixed number of parameters that are fit to the data. Call this fixed set of parameters $\theta$. Linear regression is an example of a parametric learning algorithm.

Non-parametric learning algorithm

Definition: an algorithm in which the number of parameters grows with m, the training set size. Usually the number of parameters grows linearly with m. In other words, the amount of information the algorithm must keep grows linearly with the training set: the algorithm needs the entire training set to make predictions, even after learning.

2. Locally weighted regression

A specific non-parametric learning algorithm, also known as LOESS.

Algorithm idea:

Suppose you want to evaluate your hypothesis $h(x)$ at a certain query point $x$.

For linear regression, the steps are:

1) Fit $\theta$ to minimize $\sum_i \left( y^{(i)} - \theta^T x^{(i)} \right)^2$

2) Return $\theta^T x$

For locally weighted regression, when predicting at a query point $x$:

1) Look at the data set and consider only the data points in a fixed neighborhood around $x$

2) Apply linear regression to the points in that neighborhood to fit a straight line

3) Return the value of this fitted line at $x$ as the algorithm's output

In mathematical terms:

1) Fit $\theta$ to minimize $\sum_i w^{(i)} \left( y^{(i)} - \theta^T x^{(i)} \right)^2$

2) The $w^{(i)}$ are weights. There are many possible choices, for example:

$$w^{(i)} = \exp\left( -\frac{(x^{(i)} - x)^2}{2\tau^2} \right)$$

- The closer $x^{(i)}$ is to $x$, the closer $w^{(i)}$ is to 1; the farther $x^{(i)}$ is from $x$, the closer $w^{(i)}$ is to 0. Intuitively, nearby points get large weights and distant points get small weights.

- This decay function is a fairly standard choice. Although its curve is bell-shaped, it is not a Gaussian distribution: the weights are not random variables.

- $\tau$ is called the bandwidth parameter. It controls how quickly the weights fall off with distance. The smaller $\tau$ is, the narrower the bell and the faster $w^{(i)}$ decays; the larger $\tau$ is, the slower it decays.

3) Return $\theta^T x$

Summary: with locally weighted regression, every prediction requires fitting a new straight line. If you do this for each point along the x-axis, the predictions trace out a nonlinear curve through the data set.

* A drawback of locally weighted regression:

Every prediction refits the model against the whole training set, so prediction becomes expensive when the training set is very large. There are ways to make locally weighted regression more efficient on large data sets; see Andrew Moore's work on KD-trees for details.
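
The procedure in this section can be sketched for a single feature plus an intercept, where the weighted normal equations reduce to a 2x2 system. This is a minimal illustration under stated assumptions: the function name `lwr_predict`, the bell-shaped weight with bandwidth `tau`, and the sample data are all hypothetical, and no KD-tree acceleration is attempted.

```python
import math


def lwr_predict(x_query, xs, ys, tau):
    """Locally weighted linear regression for one feature plus an intercept."""
    # Bell-shaped weights: near 1 for x_i close to the query, near 0 far away.
    ws = [math.exp(-((xi - x_query) ** 2) / (2.0 * tau ** 2)) for xi in xs]

    # Weighted normal equations for h(x) = theta0 + theta1 * x.
    sw = sum(ws)
    swx = sum(w * x for w, x in zip(ws, xs))
    swxx = sum(w * x * x for w, x in zip(ws, xs))
    swy = sum(w * y for w, y in zip(ws, ys))
    swxy = sum(w * x * y for w, x, y in zip(ws, xs, ys))
    det = sw * swxx - swx * swx
    if det == 0:
        raise ValueError("degenerate weights; try a larger tau")
    theta0 = (swxx * swy - swx * swxy) / det
    theta1 = (sw * swxy - swx * swy) / det
    return theta0 + theta1 * x_query


# Hypothetical nonlinear data: y = x^2 sampled on [0, 5].
xs = [i * 0.1 for i in range(51)]
ys = [x * x for x in xs]

# Each query refits a line around that point; together they trace a curve.
for q in (1.0, 2.0, 3.0):
    print(q, lwr_predict(q, xs, ys, tau=0.3))
```

Note that the whole training set is passed in on every call, which is the non-parametric behavior described above: nothing is learned once and reused.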

3. Probabilistic interpretation

The question the probabilistic interpretation answers:

In linear regression, why choose least squares as the criterion for fitting the parameters, that is, why minimize the sum of squared differences between the predicted and true values of y?

We give a set of assumptions under which least squares is a natural choice. This set of assumptions is not unique; there are many other ways to justify it.

(1) Assumption 1:

Assume the output is a linear function of the input plus an error term:

$$y^{(i)} = \theta^T x^{(i)} + \varepsilon^{(i)}$$

Here $\varepsilon^{(i)}$ is the error term. It can be understood as capturing unmodeled effects (for example, features that affect the output but that we have not included) or as random noise.

Assume $\varepsilon^{(i)}$ follows a Gaussian (normal) distribution with mean 0 and variance $\sigma^2$:

$$\varepsilon^{(i)} \sim \mathcal{N}(0, \sigma^2)$$

The probability density function of this Gaussian is:

$$p(\varepsilon^{(i)}) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(\varepsilon^{(i)})^2}{2\sigma^2} \right)$$

Combining the two formulas above gives:

$$p(y^{(i)} \mid x^{(i)}; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2} \right)$$

That is, given the features and the parameters, the output is a Gaussian random variable:

$$y^{(i)} \mid x^{(i)}; \theta \sim \mathcal{N}(\theta^T x^{(i)}, \sigma^2)$$

* Why choose a Gaussian distribution?

1) It is easy to handle mathematically.

2) For most problems, if you fit a linear regression model and then look at the distribution of the errors, it usually turns out to be approximately Gaussian.

3) Central limit theorem: the sum of many independent random variables tends toward a Gaussian distribution. If the error is caused by many independent factors, the sum of their effects is close to Gaussian.

Note: $\theta$ is not a random variable. It is the value we are trying to estimate: a constant whose value we do not know. That is why the conditional density is written with a semicolon. The semicolon should be read as "parameterized by", so $p(y^{(i)} \mid x^{(i)}; \theta)$ reads "the probability of $y^{(i)}$ given $x^{(i)}$, parameterized by $\theta$".

Assume the error terms $\varepsilon^{(i)}$ are IID (independently and identically distributed).

That is, the error terms are independent of each other and all follow the same Gaussian distribution, with the same mean and variance.

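(2) Assumption 2:

Define the likelihood of $\theta$ as the probability of the outputs $y^{(i)}$ given the inputs $x^{(i)}$, parameterized by $\theta$:

$$L(\theta) = p(\vec{y} \mid X; \theta)$$

Since the error terms are independent, this can be written as a product over the training samples:

$$L(\theta) = \prod_{i=1}^{m} p(y^{(i)} \mid x^{(i)}; \theta) = \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2} \right)$$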

(3) Assumption 3:

Maximum likelihood estimation: choose $\theta$ to maximize the likelihood $L(\theta)$, that is, to make the observed data as probable as possible.

The log-likelihood is defined as:

$$\ell(\theta) = \log L(\theta) = m \log \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{\sigma^2} \cdot \frac{1}{2} \sum_{i=1}^{m} \left( y^{(i)} - \theta^T x^{(i)} \right)^2$$

This is a sum of two terms, and the first does not depend on $\theta$. So maximizing the likelihood is the same as minimizing the sum in the second term, namely:

$$\frac{1}{2} \sum_{i=1}^{m} \left( y^{(i)} - \theta^T x^{(i)} \right)^2$$

This is exactly the cost function $J(\theta)$ from before. In other words, fitting the parameters by least squares amounts to assuming that the error terms are IID Gaussian and then choosing the parameters by maximum likelihood.

Note: the variance $\sigma^2$ of the Gaussian has no effect on the final result. Because $\sigma^2$ is always positive, the factor $1/\sigma^2$ only rescales the sum, so the maximizing $\theta$ is the same whatever value $\sigma^2$ takes. This property will come up again in the next lecture.
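
The equivalence can be checked numerically. The sketch below is a simplified illustration, assuming a one-parameter model $y = \theta x + \varepsilon$ with no intercept, made-up data, and a coarse grid search in place of a real optimizer. It compares the $\theta$ that minimizes the least-squares cost with the $\theta$ that maximizes the Gaussian log-likelihood for two different values of $\sigma$.

```python
import math


def log_likelihood(theta, sigma, xs, ys):
    """Gaussian log-likelihood for the model y = theta * x + noise."""
    return sum(
        -math.log(math.sqrt(2.0 * math.pi) * sigma)
        - (y - theta * x) ** 2 / (2.0 * sigma ** 2)
        for x, y in zip(xs, ys)
    )


def cost(theta, xs, ys):
    """Least-squares cost J(theta) for the same model."""
    return 0.5 * sum((y - theta * x) ** 2 for x, y in zip(xs, ys))


# Hypothetical data, roughly y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

# Candidate values of theta on a grid from 1.00 to 3.00.
grid = [k / 100.0 for k in range(100, 301)]
indices = range(len(grid))

best_ls = min(indices, key=lambda k: cost(grid[k], xs, ys))
best_ml_small = max(indices, key=lambda k: log_likelihood(grid[k], 0.5, xs, ys))
best_ml_large = max(indices, key=lambda k: log_likelihood(grid[k], 5.0, xs, ys))

# All three searches should select the same grid point.
print(grid[best_ls], grid[best_ml_small], grid[best_ml_large])
```

The log-likelihood values themselves differ between the two choices of $\sigma$; only the location of the maximum is shared.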

4. Logistic regression

This is the first classification algorithm we will study. In the regression problems so far, the variable y to be predicted was continuous. In classification, y is discrete; here y takes only the two values {0, 1}.

In general, linear regression works poorly for this kind of binary classification problem. For example, suppose y = 0 when x <= 3 and y = 1 when x > 3. If the samples with x > 3 dominate the data and extend far to the right, the slope of the fitted line becomes smaller, and the point where the line crosses y = 0.5 (the decision threshold) can move well above 3, which causes prediction errors.
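
The threshold shift can be reproduced with a small made-up data set. This is only a sketch: the data points, the single far-away sample at x = 30, and the helper names are assumptions chosen for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares line; returns (intercept, slope)."""
    m = len(xs)
    mean_x = sum(xs) / m
    mean_y = sum(ys) / m
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return mean_y - slope * mean_x, slope


def threshold(xs, ys):
    """The x value where the fitted line crosses y = 0.5."""
    intercept, slope = fit_line(xs, ys)
    return (0.5 - intercept) / slope


# Labels follow the rule from the text: y = 0 for x <= 3, y = 1 for x > 3.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
print("threshold on balanced data:", threshold(xs, ys))

# Add one more positive sample far to the right; the rule has not changed.
xs_far = xs + [30.0]
ys_far = ys + [1.0]
print("threshold with a far sample:", threshold(xs_far, ys_far))
```

On the second data set the crossing point lies beyond x = 4, so the sample at x = 4 falls on the wrong side of the threshold even though its label is consistent with the rule.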

Since y takes values in {0, 1}, first change the form of the hypothesis so that its value always lies in [0, 1]:

$$h_\theta(x) \in [0, 1]$$

So we choose the following function:

$$h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$$

where

$$g(z) = \frac{1}{1 + e^{-z}}$$

The function g is generally called the logistic function (or sigmoid function). Its graph is an S-shaped curve:

When z is very negative, g(z) tends to 0; when z is very large, g(z) tends to 1; at z = 0, g(z) = 0.5.

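Probabilistic interpretation of the hypothesis:

Assume that, given x and parameterized by $\theta$, the probabilities of y = 1 and y = 0 are:

$$P(y = 1 \mid x; \theta) = h_\theta(x)$$

$$P(y = 0 \mid x; \theta) = 1 - h_\theta(x)$$

These can be written compactly as:

$$p(y \mid x; \theta) = h_\theta(x)^{y} \left( 1 - h_\theta(x) \right)^{1 - y}$$

The likelihood of the parameters:

$$L(\theta) = \prod_{i=1}^{m} h_\theta(x^{(i)})^{y^{(i)}} \left( 1 - h_\theta(x^{(i)}) \right)^{1 - y^{(i)}}$$

Taking the log gives the log-likelihood:

$$\ell(\theta) = \sum_{i=1}^{m} y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left( 1 - h_\theta(x^{(i)}) \right)$$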

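To maximize the likelihood, we use a gradient method as in linear regression, taking the partial derivatives of the log-likelihood with respect to $\theta$:

$$\theta := \theta + \alpha \nabla_\theta \ell(\theta)$$

Because we are seeking a maximum, this is gradient ascent rather than descent.

Expanding the partial derivative for a single training sample (x, y), using $g'(z) = g(z)(1 - g(z))$:

$$\frac{\partial}{\partial \theta_j} \ell(\theta) = \left( y - h_\theta(x) \right) x_j$$

Then the update rule is:

$$\theta_j := \theta_j + \alpha \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)}$$

This is a stochastic gradient ascent rule, analogous to the stochastic gradient descent rule from the previous lecture. In form it matches the linear regression update, except that we add the gradient (ascent) rather than subtract it (descent), and $h_\theta$ is now the logistic function. So despite the similar appearance, it is essentially a different learning algorithm from linear regression.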
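
A minimal sketch of this update rule for one feature plus an intercept follows. The learning rate, the number of passes, the toy data, and all function names are assumptions made for illustration; the log-likelihood is computed through a helper `log_sigmoid` purely to avoid taking the log of 0 when a prediction saturates.

```python
import math


def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^{-z}), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def log_sigmoid(z):
    """log g(z), computed without overflow or log(0)."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def predict(theta, x):
    """h_theta(x) for one feature; x_0 = 1 is the intercept term."""
    return sigmoid(theta[0] + theta[1] * x)


def log_likelihood(theta, xs, ys):
    """Sum of y*log(h) + (1-y)*log(1-h), using log(1 - g(z)) = log g(-z)."""
    total = 0.0
    for x, y in zip(xs, ys):
        z = theta[0] + theta[1] * x
        total += y * log_sigmoid(z) + (1 - y) * log_sigmoid(-z)
    return total


def train(xs, ys, alpha=0.1, epochs=5000):
    """Stochastic gradient ascent: one update per training sample."""
    theta = [0.0, 0.0]
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            error = y - predict(theta, x)    # (y - h_theta(x))
            theta[0] += alpha * error        # x_0 = 1
            theta[1] += alpha * error * x
    return theta


# Hypothetical data following y = 0 for x <= 3 and y = 1 for x > 3.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [0, 0, 0, 1, 1, 1]

theta = train(xs, ys)
print("theta:", theta)
print("log-likelihood before:", log_likelihood([0.0, 0.0], xs, ys))
print("log-likelihood after: ", log_likelihood(theta, xs, ys))
print("predictions:", [round(predict(theta, x), 3) for x in xs])
```

Because this toy data is perfectly separable, the parameters keep growing slowly the longer training runs; a fixed number of passes is used here simply to stop somewhere.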

5. Perceptron algorithm

In logistic regression, g(z) produces a real number between 0 and 1. What if we want g(z) to output only 0 or 1?

The perceptron algorithm defines g(z) as a threshold function:

$$g(z) = \begin{cases} 1 & \text{if } z \ge 0 \\ 0 & \text{if } z < 0 \end{cases}$$

With $h_\theta(x) = g(\theta^T x)$, the learning rule has the same form as gradient ascent for logistic regression:

$$\theta_j := \theta_j + \alpha \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)}$$

Although it looks similar to the previous learning algorithms, the perceptron is a very simple algorithm: its output is thresholded and can only be 0 or 1, which makes it simpler than logistic regression. When we later turn to learning theory, it will serve as a basic building block.
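
The perceptron rule can be sketched in a few lines. As before, this assumes one feature plus an intercept and a small made-up, linearly separable data set; the stopping condition (a full pass with no mistakes) and the function names are illustrative choices.

```python
def step(z):
    """Threshold function g(z): 1 if z >= 0, otherwise 0."""
    return 1 if z >= 0 else 0


def predict(theta, x):
    """h_theta(x) = g(theta0 + theta1 * x); the output is exactly 0 or 1."""
    return step(theta[0] + theta[1] * x)


def train_perceptron(xs, ys, alpha=1.0, max_epochs=10000):
    """Apply the perceptron update until a full pass makes no mistakes."""
    theta = [0.0, 0.0]
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(xs, ys):
            error = y - predict(theta, x)    # 0 if correct, +1 or -1 if wrong
            if error != 0:
                theta[0] += alpha * error    # x_0 = 1
                theta[1] += alpha * error * x
                mistakes += 1
        if mistakes == 0:
            break
    return theta


# Hypothetical separable data: y = 0 for x <= 3, y = 1 for x > 3.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [0, 0, 0, 1, 1, 1]

theta = train_perceptron(xs, ys)
print("theta:", theta)
print("predictions:", [predict(theta, x) for x in xs])
```

Since the error term is always -1, 0, or +1, parameters change only on misclassified samples, which is what makes the algorithm simpler than logistic regression.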
