Gradient Descent in Practice: Feature Scaling
Make sure the features are on a similar scale.
The smaller and more similar the ranges of the features, the faster gradient descent converges.
Dividing by the range
Dividing each feature by its range (feature / range) brings every feature roughly into the interval [-1, 1].
The first exercise below works through an example.
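A minimal numpy sketch of this scaling; the feature values are made up for illustration:

    import numpy as np

    x = np.array([2104.0, 1416.0, 1534.0, 852.0])  # hypothetical house sizes
    x_scaled = x / (x.max() - x.min())             # spread of the values is now at most 1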
Mean Normalization
Shifts each value to be close to 0 (it is not applied to x0, since x0 is always 1): x1 := (x1 - μ1) / s1, where
μ1 is the average value of x1 over the training set;
s1 is the range of x1. For example, if the number of bedrooms lies in [0, 5], the range is 5 - 0 = 5.
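Continuing the same sketch, mean normalization in numpy (same made-up values):

    import numpy as np

    x = np.array([2104.0, 1416.0, 1534.0, 852.0])  # hypothetical house sizes
    mu = x.mean()              # mu1: average of x1 over the training set
    s = x.max() - x.min()      # s1: the range, max - min
    x_norm = (x - mu) / s      # the values are now centered near 0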
Making sure gradient descent works correctly
In a correct run, J(θ) decreases as the number of iterations increases, and after a certain number of iterations the J(θ) curve flattens out. You can judge from the plot when to stop, or stop when the change in J(θ) in a single iteration is less than some small threshold ε.
If the curve rises instead
The value of α is too large and should be reduced.
If α is small enough, gradient descent converges, though it can be slow.
If α is too large, J(θ) may not decrease on every iteration and may fail to converge at all.
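A sketch of batch gradient descent that records J(θ) at every iteration and stops once the per-iteration decrease drops below ε; the function name and the default values of alpha, epsilon, and max_iters are illustrative choices, not from the course:

    import numpy as np

    def gradient_descent(X, y, alpha=0.01, epsilon=1e-6, max_iters=10000):
        m, n = X.shape
        theta = np.zeros(n)
        J_history = []
        for _ in range(max_iters):
            error = X @ theta - y
            J = (error @ error) / (2 * m)         # cost J(theta)
            # Converged when J decreases by less than epsilon.
            # If J increases instead, alpha is too large.
            if J_history and J_history[-1] - J < epsilon:
                J_history.append(J)
                break
            J_history.append(J)
            theta -= (alpha / m) * (X.T @ error)  # one gradient step
        return theta, J_history

Plotting J_history against the iteration number reproduces the curve described above.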
Features and polynomial regression
You can construct new features instead of using the existing ones as-is. For example, if a house has two properties, frontage (length) and depth (width), we can create a new feature, area = frontage × depth. The hypothesis then becomes hθ(x) = θ0 + θ1x, with x = area. Fitting a polynomial curve such as hθ(x) = θ0 + θ1x + θ2x^2 does not work well either, because the curve eventually bends back, which does not match the actual data (the larger the area, the higher the total price). So adjust the model to hθ(x) = θ0 + θ1x + θ2√x, which keeps increasing as x grows.
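A sketch of building these features in numpy; the frontage and depth values are made up:

    import numpy as np

    frontage = np.array([50.0, 40.0, 30.0])
    depth = np.array([30.0, 25.0, 20.0])
    area = frontage * depth                  # the new custom feature

    # Design matrix for h(x) = theta0 + theta1*x + theta2*sqrt(x), with x = area
    X = np.column_stack([np.ones_like(area), area, np.sqrt(area)])

Note that area and sqrt(area) live on very different scales, so feature scaling matters even more once polynomial terms are involved.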
Normal equation
Gradient descent approaches the minimum gradually as the number of iterations increases.
The normal equation computes θ directly instead: the minimum is where the derivative is 0, so set the partial derivatives of J(θ) with respect to θ0 through θn to 0 and solve the resulting equations. The solution is θ = (X^T X)^(-1) X^T y.
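In numpy the normal equation is a one-liner; solving the linear system is preferable to forming the inverse explicitly, and X and y are assumed to be the design matrix and target vector:

    import numpy as np

    def normal_equation(X, y):
        # Solve (X^T X) theta = X^T y, i.e. theta = (X^T X)^(-1) X^T y
        return np.linalg.solve(X.T @ X, X.T @ y)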
For matrix concepts, see Machine Learning - Week 1.
When to use Gradient descent or the Normal equation
When n (the number of features) is large, the normal equation is slow, because computing (X^T X)^(-1) is O(n^3).
When n is small, the normal equation is faster, because it solves for θ directly and needs no iterations and no feature scaling.
What if X^T X is non-invertible?
1. Redundant features (features that are not linearly independent).
e.g. x1 = size in feet^2; x2 = size in m^2.
2. Too many features (e.g. m ≤ n).
For example, m = 10 and n = 100 means you have only 10 training examples but 100 features; obviously the data is not enough to constrain all the features.
You can delete some features (keeping only the relevant ones) or use regularization.
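In practice the pseudo-inverse also sidesteps the problem: it returns a usable θ even when X^T X is singular. A sketch:

    import numpy as np

    def normal_equation_pinv(X, y):
        # np.linalg.pinv works even when X^T X is non-invertible,
        # e.g. with redundant features or m <= n
        return np.linalg.pinv(X.T @ X) @ (X.T @ y)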
Exercises
1.
I don't know how to use both methods at the same time; are the two methods meant to be applied in sequence?
Using division by the range:
Range = max - min = 8836 - 4761 = 4075
Dividing the vector by the range gives [1.9438, 1.2721, 2.1683, 1.1683].
Then apply mean normalization to this result:
avg = 1.6382
Range = 2.1683 - 1.1683 = 1
x2(4) = (1.1683 - 1.6382) / 1 = -0.4699, which rounded to two decimal places is -0.47.
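Checking the arithmetic in numpy; the first two values of x2 are recovered from the scaled vector above times the range (1.9438 × 4075 ≈ 7921, 1.2721 × 4075 ≈ 5184):

    import numpy as np

    x2 = np.array([7921.0, 5184.0, 8836.0, 4761.0])
    v = x2 / (x2.max() - x2.min())                 # divide by the range 4075
    v_norm = (v - v.mean()) / (v.max() - v.min())  # then mean-normalize
    print(round(v_norm[3], 2))                     # -0.47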
5.
As mentioned above, "the smaller and more similar the ranges of the features, the faster gradient descent converges." (More than one choice can be selected.)