1. Multiple features (multidimensional features)
In the single-variable linear regression (linear regression with one variable) discussed earlier, we had only one feature (variable): the house area x. We wanted to use this feature to predict the price of a house, and the resulting hypothesis is drawn as the blue line:
Now think about it: if we know not only the area of the house (the feature, or variable, used to predict the price) but also the number of bedrooms, the number of floors, and the age of the house, this gives us much more information with which to predict the price.
That is, the hypothesis that supports multiple variables is:

hθ(x) = θ0 + θ1x1 + θ2x2 + ... + θnxn

This formula contains n + 1 parameters and n features (variables). To simplify it, we introduce x0 = 1, and the formula becomes:

hθ(x) = θ0x0 + θ1x1 + θ2x2 + ... + θnxn

In vector form this can be written simply as hθ(x) = θᵀx.
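As a minimal sketch of the vectorized hypothesis (assuming NumPy; the numbers and variable names below are purely illustrative):

```python
import numpy as np

# Parameters theta0..theta4 and one example x with x0 = 1 prepended.
theta = np.array([80.0, 0.1, 50.0, 30.0, -2.0])   # [theta0, theta1, ..., theta4]
x = np.array([1.0, 2104.0, 5.0, 1.0, 45.0])       # [x0 = 1, size, bedrooms, floors, age]

# h_theta(x) = theta^T x
prediction = theta @ x
print(prediction)
```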
2. Gradient Descent for multiple variables (multi-variable gradient descent)
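As a rough sketch under standard assumptions (not code from the original article), the multivariate update rule θj := θj - α · (1/m) · Σi (hθ(x⁽ⁱ⁾) - y⁽ⁱ⁾) · xj⁽ⁱ⁾ can be implemented in vectorized form with NumPy; the feature matrix X is assumed to already contain the x0 = 1 column:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, num_iters=1000):
    """Batch gradient descent for multivariate linear regression.

    X: (m, n + 1) feature matrix whose first column is all ones (x0 = 1).
    y: (m,) vector of target values.
    """
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)
    for _ in range(num_iters):
        predictions = X @ theta                    # h_theta(x) for every example
        gradient = (X.T @ (predictions - y)) / m   # partial derivatives of J(theta)
        theta -= alpha * gradient                  # simultaneous update of every theta_j
    return theta
```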
3. Gradient Descent in practice: feature scaling
Next we introduce some practical techniques for gradient descent. The first is feature scaling.
If you have a machine learning problem with multiple features, and you can ensure that the values of those features lie in similar ranges, then gradient descent will converge faster.
Specifically, suppose you have a problem with two features: x1 is the area of the house, ranging from 0 to 2000, and x2 is the number of bedrooms, ranging from 1 to 5. If you plot the contours of the cost function J(θ):
The contour plot looks like the one on the left.
J(θ) is a function of θ0, θ1, and θ2. Ignoring θ0 for the moment, assume the function depends only on θ1 and θ2. If the value range of x1 is far larger than that of x2, the contours of the cost function J(θ) take on a very skewed, elliptical shape; the 2000-to-5 ratio makes the ellipses extremely thin.
So this is a tall, thin elliptical contour map. If you run gradient descent on this cost function, it may take a long time to reach the minimum: the iterates can oscillate back and forth and only converge to the global minimum after a long time.
In fact, if the contours were drawn even more exaggerated, thinner and longer as on the far left, the situation could be worse: gradient descent might be even slower, oscillating back and forth for a long time before finding its way to the global minimum.
In this case, an effective remedy is feature scaling.
For example, define feature x1 as the house area divided by 2000, and x2 as the number of bedrooms divided by 5. The contours of the cost function J(θ) then become far less skewed and look much more circular.
If you run gradient descent on such a cost function, the algorithm will find a much more direct path to the global minimum, instead of following a convoluted, much longer trajectory.
So, by scaling the features so that their values occupy similar ranges (in this example both x1 and x2 end up between 0 and 1), the gradient descent algorithm converges faster.
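A minimal sketch of this kind of scaling (assuming NumPy; the values are illustrative):

```python
import numpy as np

# Hypothetical raw features: house size in [0, 2000], bedrooms in [1, 5].
size = np.array([2104.0, 1416.0, 1534.0, 852.0])
bedrooms = np.array([5.0, 3.0, 3.0, 2.0])

# Divide each feature by (roughly) its maximum value so that both
# features end up on the order of 0 to 1.
size_scaled = size / 2000.0
bedrooms_scaled = bedrooms / 5.0
```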
In general, feature scaling aims to constrain each feature to roughly the range -1 to +1. (Feature x0 is always equal to 1, so it is already in this range.) For the other features we may need to divide by different numbers to bring them into a similar range. The exact endpoints -1 and +1 are not important.
If you have a feature x1 with values between 0 and 3, that is fine.
If another feature ranges from -2 to +0.5, that is also fine; it is close enough to the -1 to +1 range.
However, if a feature ranges from -100 to +100, that range is very different from -1 to +1, and such a poorly scaled feature is likely to cause trouble.
Similarly, if a feature lies in a very, very small range, for example between -0.0001 and +0.0001, that is also far smaller than -1 to +1, and we again consider it poorly scaled.
So the acceptable range can be somewhat larger or smaller than -1 to +1, as long as it is not extremely large or extremely small. Different people use different rules of thumb, but a common one is: if a feature lies within -3 to +3, the range is acceptable; if it is much larger than that, you should probably rescale. Likewise, ranges such as -1/3 to +1/3, 0 to 1/3, or -1/3 to 0 are acceptable, but if a feature's range is much smaller than that, you should consider scaling it up.
In general, you do not need to worry about getting the features into exactly the same range. As long as they are reasonably close, gradient descent will work fine.
In addition to scaling features by dividing by the maximum value, we sometimes perform a step called mean normalization.
That is, if you have a feature xi, you replace it with xi - μi so that the feature has an average value of 0 (where μi is the mean of all the values of xi in the training set).
Obviously this step does not apply to x0, because x0 is always equal to 1 and cannot have a mean of 0.
For other features it does apply. For example, if the house size ranges from 0 to 2000 and the average house size is 1000, you can replace x1 with (x1 - μ1) / 2000, that is, subtract the mean and then divide by 2000.
Similarly, if houses have up to five bedrooms and the average house has two bedrooms, you can normalize the second feature x2 with the same kind of formula.
In both cases, the new features x1 and x2 end up roughly in the range -0.5 to +0.5.
This is not exactly true, of course; for example x2 may actually exceed 0.5 (e.g. (5 - 2)/5 = 0.6), but it is close.
The more general rule is that you can use the following formula:
(x1 - μ1) / s1
to replace the original feature x1.
Here μ1 is the mean of feature x1 over the training set, and s1 is the range of the feature (the maximum value minus the minimum value). s1 could also be set to the standard deviation of the feature, but using maximum minus minimum works fine in practice.
Similarly, for the second feature x2, you can replace the original feature with its value minus the mean, divided by the range (again, maximum minus minimum).
This type of formula will bring your features roughly into the following range (maybe not exactly, but approximately):
-0.5 < xi < 0.5
By the way, careful readers may notice that if we use the maximum minus the minimum to represent the range, the 5 above should really be a 4 (with a maximum of 5 and a minimum of 1, the range is 4).
In practice, though, the resulting values are very similar. You only need to get the features into roughly comparable ranges; feature scaling does not have to be exact, it just has to make gradient descent run faster.
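A minimal sketch of mean normalization (assuming NumPy; the range here is taken as max minus min, per the rule above):

```python
import numpy as np

def mean_normalize(X):
    """Replace each feature column with (x - mean) / (max - min).

    X: (m, n) matrix of raw features, WITHOUT the x0 = 1 column.
    Returns the normalized features together with the statistics
    needed to apply the same transformation to new examples later.
    """
    mu = X.mean(axis=0)
    feature_range = X.max(axis=0) - X.min(axis=0)
    X_norm = (X - mu) / feature_range
    return X_norm, mu, feature_range
```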
Summary
By using feature scaling, you can make gradient descent converge faster and reduce the number of iterations it needs.
Next we will introduce another technique to make gradient descent work better in practice.
4. Gradient Descent in practice: learning rate
Next, we will focus on learning rate α.
Specifically, consider the update rule of the gradient descent algorithm. Here we discuss how to debug gradient descent (that is, how to determine whether it is working correctly) and how to choose the learning rate α.
How can we determine whether gradient descent works normally?
What the gradient descent algorithm does is find a value of θ that minimizes the cost function J(θ).
A common practice is to plot the value of the cost function J(θ) while the gradient descent algorithm runs; the x axis shows the number of iterations.
You may get a curve like the one shown.
Each point on the curve means the following: after some number of iterations of gradient descent, whatever value of θ we have at that point, we evaluate the cost function J(θ) at that θ; the vertical height of the point is the value of J(θ) computed after that many iterations.
So this curve shows the value of the cost function J(θ) over the course of the gradient descent iterations.
If gradient descent is working properly, J(θ) should decrease after every iteration. The curve is useful because it tells you, for example, that beyond a certain number of iterations J(θ) is hardly dropping any more; the curve looks flat.
That is, by that point the gradient descent algorithm has essentially converged, because the cost function is no longer decreasing.
Therefore, observing this curve can help you determine whether the gradient descent algorithm has converged.
By the way, the number of iterations gradient descent needs can vary greatly from problem to problem.
For one problem gradient descent may need only 30 iterations to converge; for another it may need 3,000; for yet another it may need far more.
In fact, it is difficult to tell in advance how many iterations gradient descent will need to converge, which is why we usually plot this kind of curve: the value of the cost function as the number of iterations increases.
Generally, we look at this curve to judge whether the gradient descent algorithm has converged.
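A minimal sketch of producing such a plot (assuming NumPy and Matplotlib; compute_cost and gradient_descent_with_history are hypothetical helper names, not functions from the article):

```python
import numpy as np
import matplotlib.pyplot as plt

def compute_cost(X, y, theta):
    """Squared-error cost J(theta) = (1 / 2m) * sum((X theta - y)^2)."""
    m = len(y)
    residuals = X @ theta - y
    return (residuals @ residuals) / (2 * m)

def gradient_descent_with_history(X, y, alpha=0.01, num_iters=400):
    """Run gradient descent and record J(theta) after every iteration."""
    m, n = X.shape
    theta = np.zeros(n)
    J_history = []
    for _ in range(num_iters):
        theta -= alpha * (X.T @ (X @ theta - y)) / m
        J_history.append(compute_cost(X, y, theta))
    return theta, J_history

# Given a feature matrix X (with the x0 = 1 column) and targets y:
# theta, J_history = gradient_descent_with_history(X, y)
# plt.plot(J_history)
# plt.xlabel("number of iterations")
# plt.ylabel("J(theta)")
# plt.show()
```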
In addition, you can run an automatic convergence test, that is, an algorithmic check that tells you whether gradient descent has converged.
A typical automatic convergence test declares convergence if the decrease of the cost function J(θ) in one iteration is smaller than some very small value ε.
For example, you might choose ε = 1e-3. However, choosing an appropriate threshold ε is quite difficult.
Therefore, to check whether gradient descent has converged, we usually look at the plot of the cost function against the number of iterations rather than relying on an automatic convergence test.
In addition, this kind of plot can warn you early when the algorithm is not working properly.
Specifically, if the plot of the cost function J(θ) against the number of iterations looks like the one in the upper left, with J(θ) actually rising, this is a clear sign that gradient descent is not working correctly.
Such a graph usually means that you should use a smaller learning rate α.
Similarly, you may see a J(θ) curve like the one in the lower left: it drops, then rises, then drops, and then rises again.
The usual fix for this behavior is to choose a smaller value of α.
We will not prove it here, but for the linear regression we are discussing it is easy to show mathematically that, as long as the learning rate is small enough, the cost function J(θ) decreases on every iteration. So if the cost function is not decreasing, the learning rate is probably too large, and you should try a smaller one.
Of course, you do not want the learning rate to be too small either; if it is, gradient descent can become very slow.
Summary:
If the learning rate α is too small, convergence is slow. If α is too large, the cost function J(θ) may fail to decrease on every iteration and may even fail to converge. (In some cases an overly large α can also cause slow convergence, but more often you will simply see J(θ) failing to decrease after each iteration.)
To debug all of these situations, plotting J(θ) against the number of iterations usually helps you figure out what is going on.
Specifically, when running gradient descent, we usually try a series of α values, such as:
..., 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, ...
Then, for each of these α values, plot J(θ) against the number of iterations, and pick the α that makes J(θ) decrease quickly.
In practice, when choosing a learning rate for gradient descent, we try a series of α values spaced roughly by factors of 3 until we find one value that is too small and another that is too large, and then pick the largest workable α, or a value slightly smaller than it. Doing this usually gives a good learning rate.
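A minimal sketch of such a sweep (it reuses the hypothetical gradient_descent_with_history helper sketched above; the synthetic data here merely stands in for a real training set):

```python
import numpy as np
import matplotlib.pyplot as plt

# Small synthetic dataset (x0 = 1 column plus one feature), purely for illustration.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), np.linspace(0.0, 1.0, 50)])
y = 2.0 + 3.0 * X[:, 1] + 0.1 * rng.standard_normal(50)

# Try learning rates spaced roughly by factors of 3 and compare the curves.
for alpha in [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1.0]:
    _, J_history = gradient_descent_with_history(X, y, alpha=alpha, num_iters=100)
    plt.plot(J_history, label=f"alpha = {alpha}")

plt.xlabel("number of iterations")
plt.ylabel("J(theta)")
plt.legend()
plt.show()
```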
5. Features and polynomial regression
Now that you know about linear regression with multiple variables, this section looks at how to choose features, and how learning algorithms built on well-chosen features can be very effective. We also introduce polynomial regression, which lets you use the machinery of linear regression to fit very complicated functions, even non-linear ones.
Take housing price prediction as an example. Suppose you have two features: the width of the lot facing the street and its depth.
In fact, when using linear regression, you do not have to use the given x1 and x2 directly as features; you can create new features yourself.
If I want to predict the price of a house, what really matters may be the size of the land. So I may create a new feature x that is the product of the street-facing width and the depth, which gives the land area, and then let the hypothesis h use just that single feature.
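A minimal sketch of creating this new feature (assuming NumPy; the numbers and names are illustrative):

```python
import numpy as np

frontage = np.array([60.0, 40.0, 80.0])   # width of the lot facing the street
depth = np.array([100.0, 120.0, 90.0])    # depth of the lot

# Define a single new feature: the land area.
area = frontage * depth

# The hypothesis then uses only this feature, plus the intercept term x0 = 1.
X = np.column_stack([np.ones(len(area)), area])
```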
Sometimes, by defining new features, you can get a better model.
A concept closely related to feature selection is polynomial regression.
If you have a housing price dataset, there are several models you could fit to it.
A straight line does not seem to fit the data well. You could choose a quadratic model instead, but a quadratic does not seem right either, because a quadratic curve eventually comes back down, and we do not expect house prices to start falling beyond a certain size.
So we might choose a different polynomial model, for example a cubic function; a cubic fits this dataset better because it does not come back down at the end.
So how should we fit the model with our data?
We only need to make a very simple modification to the multivariate linear regression algorithm to better fit the data.
According to our previous assumptions, we know how to fit such a model, which can be:
hθ(x) = θ0 + θ1x1 + θ2x2 + θ3x3
where x1, x2, and x3 represent different features.
So how do we turn this into a cubic model of the data?
You can set the first feature x1 to be the house area, the second feature x2 to be the square of the house area, and the third feature x3 to be the cube of the house area.
With the three features defined this way, applying plain linear regression fits the model, and the result is a cubic function fitted to the data.
In other words, by defining these three features we have converted the cubic model into a linear regression model. Note that if we use a polynomial regression model in this way, feature scaling is necessary before running gradient descent.
It is worth emphasizing that if you choose features as described above, feature scaling becomes even more important, because the ranges of the three features differ enormously.
Therefore, if you use the gradient descent method, it is very important to normalize feature values so that their ranges can be comparable.
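A minimal sketch of building and scaling these polynomial features (assuming NumPy; it reuses the hypothetical mean_normalize helper sketched in the feature-scaling section):

```python
import numpy as np

size = np.array([2104.0, 1416.0, 1534.0, 852.0])   # house area

# Polynomial features: x1 = size, x2 = size^2, x3 = size^3.
X_poly = np.column_stack([size, size ** 2, size ** 3])

# The columns now span wildly different ranges (~10^3, ~10^6, ~10^9),
# so normalize them before running gradient descent.
X_norm, mu, feature_range = mean_normalize(X_poly)

# Finally prepend the x0 = 1 column for the intercept term.
X = np.column_stack([np.ones(len(size)), X_norm])
```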
Let's look at one more example of how you might choose the features you actually want to use.
We mentioned earlier that a quadratic model is not ideal here: a quadratic can fit the data reasonably well, but it eventually comes back down, which is not what we want.
However, besides building a cubic model, there are other ways to choose the features, as shown below:
Summary
This section discussed polynomial regression, that is, how to fit a polynomial, such as a quadratic or cubic function, to your data.
We also discussed the freedom you have in choosing features: for example, instead of using the street-facing width and depth of the lot, we multiplied them together to obtain the land area. With so many possible features, how should you decide which ones to use?
In later articles, we will discuss algorithms that can automatically choose which features to use, so that an algorithm, given your data, can decide on its own whether to use a quadratic function, a cubic function, or something else.
Before learning about those algorithms, though, you should know that the choice of features is up to you, and that by designing different features you can fit far more complex functions to your data than just a straight line.
6. Normal Equation
Up to now we have been using the gradient descent algorithm, but for some linear regression problems the normal equation is a better solution.
The normal equation method solves the equations ∂J(θ)/∂θj = 0 (for every j) analytically
to find the parameter values that minimize the cost function. For linear regression the solution is θ = (XᵀX)⁻¹Xᵀy.
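A minimal sketch (assuming NumPy; X is the design matrix including the x0 = 1 column and y is the vector of targets):

```python
import numpy as np

def normal_equation(X, y):
    """Closed-form linear regression solution theta = (X^T X)^(-1) X^T y."""
    # Use the pseudo-inverse rather than a plain inverse so the computation
    # still behaves sensibly when X^T X is singular or nearly singular.
    return np.linalg.pinv(X.T @ X) @ X.T @ y
```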