Supervised learning: classification and regression
Unsupervised learning: clustering and non-clustering
1. Differences between classification and clustering:
Classification (categorization) labels each object according to a predefined criterion and then groups the objects by those labels.
Clustering works without predefined labels: it uses some form of cluster analysis to discover why objects group together.
2. Differences between regression and classification:
When the target variable we are trying to predict is continuous, as in the housing example, we call the learning problem a regression problem. When y can only take a small number of discrete values (for example, given the living area, predicting whether a dwelling is a house or an apartment), we call it a classification problem.
Single feature:
Cost function:
The factor of 1/2 is added for ease of calculation later: it cancels the 2 produced when differentiating the squared error.
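As a minimal numpy sketch of this squared-error cost (the toy data values below are made up for illustration):

```python
import numpy as np

def compute_cost(X, y, theta):
    """Squared-error cost J(theta) = 1/(2m) * sum((h(x) - y)^2),
    with h(x) = X @ theta; the 1/2 cancels when differentiating."""
    m = len(y)
    residuals = X @ theta - y
    return residuals @ residuals / (2 * m)

# toy data: X already contains the x0 = 1 bias column
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 2.0, 3.0])
print(compute_cost(X, y, np.array([0.0, 1.0])))  # perfect fit -> 0.0
```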
Gradient Descent Method:
The learning rate must be chosen carefully: if it is too large, gradient descent may overshoot and diverge; if it is too small, convergence is very slow.
Taking linear regression fitting as an example, the gradient descent process is as follows:
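A minimal sketch of batch gradient descent for linear regression (learning rate, iteration count, and data are illustrative choices, not prescribed by the notes):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, iters=1000):
    """Batch gradient descent for linear regression.
    Update rule: theta := theta - alpha * (1/m) * X^T (X theta - y)."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        gradient = X.T @ (X @ theta - y) / m
        theta -= alpha * gradient
    return theta

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])  # bias column + one feature
y = np.array([2.0, 3.0, 4.0])                        # y = 1 + x
theta = gradient_descent(X, y)
print(theta)  # approximately [1.0, 1.0]
```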
Multiple features
hypothesis function (assuming functions)
x0 = 1: a 0th attribute with value 1 is added to every sample; it is set this way to make the matrix product of θᵀ and x convenient.
(Think of θ as weights balancing the contribution of each feature.)
(Note: Statistical Learning Methods by Li Hang uses the opposite subscript/superscript convention.)
Gradient Descent method:
Feature Scaling:
Scaling the features can speed up gradient descent and make it converge more reliably.
μi denotes the mean of the i-th attribute (feature) over all samples.
si denotes the range of the i-th attribute over all samples (maximum minus minimum), or alternatively its standard deviation.
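A minimal sketch of mean normalization using the range as si (the housing-style numbers are made up):

```python
import numpy as np

def scale_features(X):
    """Mean normalization: x_i := (x_i - mu_i) / s_i, where mu_i is the
    column mean and s_i the column range (max - min); the standard
    deviation is an equally valid choice for s_i."""
    mu = X.mean(axis=0)
    s = X.max(axis=0) - X.min(axis=0)
    return (X - mu) / s, mu, s

# made-up data: [size, number of rooms] -- very different scales
X = np.array([[2104.0, 3.0], [1600.0, 3.0], [2400.0, 4.0]])
X_scaled, mu, s = scale_features(X)
print(X_scaled.mean(axis=0))  # each scaled column now has mean ~0
```

Note that the bias column x0 = 1 must be excluded from scaling, since its range is zero.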
Polynomial Regression
When a linear fit does not model the data well, polynomial regression can be considered: adding quadratic or cubic terms, or even square-root terms.
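The key observation is that polynomial regression is still linear regression once the powers of x are treated as extra features. A sketch with synthetic quadratic data (the coefficients are invented for the example):

```python
import numpy as np

# treat x and x^2 as two separate features of an ordinary linear
# regression -- this is all polynomial regression is
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 + 3.0 * x + 0.5 * x**2          # synthetic quadratic data

X = np.column_stack([np.ones_like(x), x, x**2])   # columns [1, x, x^2]
theta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta)  # recovers approximately [2.0, 3.0, 0.5]
```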
Normal Equation Method
The origin of Normal equations
Suppose we have m samples and the feature vector has dimension n. The sample set is {(x^(1), y^(1)), (x^(2), y^(2)), ..., (x^(m), y^(m))}, where each sample is x^(i) = {x1^(i), x2^(i), ..., xn^(i)}. Then h(θ) = θ0 + θ1x1 + θ2x2 + ... + θnxn, and
if we want h(θ) = y, we have
Xθ = y
Let us recall two concepts, the identity matrix and the matrix inverse, and their properties.
(1) The identity matrix E
AE = EA = A
(2) The inverse A⁻¹ of a matrix A
Requirement: A must be a square matrix
Property: AA⁻¹ = A⁻¹A = E
Now look again at the equation Xθ = y.
To solve for θ, we need a few transformations:
Step 1: Turn the matrix on the left of θ into a square matrix. This is achieved by left-multiplying both sides by Xᵀ:
XᵀXθ = Xᵀy
Step 2: Turn the part on the left of θ into an identity matrix, so that it disappears:
(XᵀX)⁻¹(XᵀX)θ = (XᵀX)⁻¹Xᵀy
Step 3: Since (XᵀX)⁻¹(XᵀX) = E, the equation becomes
Eθ = (XᵀX)⁻¹Xᵀy
E can be dropped, giving
θ = (XᵀX)⁻¹Xᵀy
This is what we call the normal equation.
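The derivation above can be sketched directly in numpy (the toy data is invented; a linear solve is used rather than forming the inverse explicitly, which is numerically safer but computes the same θ):

```python
import numpy as np

def normal_equation(X, y):
    """theta = (X^T X)^(-1) X^T y, computed via a linear solve
    instead of an explicit matrix inverse."""
    return np.linalg.solve(X.T @ X, X.T @ y)

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])  # bias column included
y = np.array([2.0, 3.0, 4.0])                        # y = 1 + x
print(normal_equation(X, y))  # [1.0, 1.0]
```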
Normal equation VS Gradient descent
The normal equation, like gradient descent, can be used to compute the weight vector θ. Compared with gradient descent, it has both advantages and disadvantages.
Advantage:
The normal equation does not care about the scale of the features in X. For example, suppose the feature vector is x = {x1, x2}, where x1 ranges over 1~2000 and x2 over 1~4: a 500-fold difference in range. With gradient descent, this makes the contours of the cost function very narrow, elongated ellipses, so the descent becomes difficult and may even fail to converge (after multiplying the derivative by the step size, an update may overshoot the ellipse). With the normal equation method there is no such concern, because it is a purely algebraic matrix computation.
Disadvantage:
Compared with gradient descent, the normal equation requires heavy matrix operations, in particular inverting XᵀX. When the matrix is large, the computational cost and the memory requirements grow substantially.
Under what circumstances is XᵀX in the normal equation non-invertible, and how do we deal with it?
(1) The feature dimension is too large relative to the number of samples (e.g., m <= n)
Workarounds: ① use regularization,
or ② delete some of the feature dimensions
(2) Redundant features (i.e., linearly dependent features)
For example, x1 = size in feet²
x2 = size in m²
Since 1 m ≈ 3.28 feet, x1 ≈ 3.28² · x2, so x1 and x2 are linearly dependent (x1 and x2 are redundant).
Workaround: find the redundant feature dimensions and delete them.
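The redundant-feature case can be demonstrated with the feet²/m² example above (the sizes and prices are made up). When XᵀX is singular, a plain inverse fails, but the Moore-Penrose pseudo-inverse, which plays the role of (XᵀX)⁻¹Xᵀ, still yields a least-squares θ:

```python
import numpy as np

x2 = np.array([50.0, 80.0, 120.0])   # hypothetical sizes in m^2
x1 = 3.28**2 * x2                    # the same sizes in feet^2 -> redundant
X = np.column_stack([np.ones_like(x1), x1, x2])
y = 3.0 * x2                         # made-up target values

# X has rank 2, so X^T X is singular and np.linalg.inv would raise an
# error; the pseudo-inverse still returns a least-squares solution
theta = np.linalg.pinv(X) @ y        # plays the role of (X^T X)^-1 X^T y
print(np.allclose(X @ theta, y))     # True: the fit itself is unharmed
```

Deleting one of the two columns, as the workaround suggests, restores an invertible XᵀX and gives the same predictions.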
Comparison of gradient descent method and normal equation method:
Machine Learning--day1