Machine Learning--day1


Supervised learning: Classification and regression

Unsupervised learning: Clustering and non-clustering

1. Differences between classification and clustering:

Classification (categorization) labels each object according to some predefined criterion and then sorts objects into classes by those labels; the set of labels is known in advance.

Clustering, by contrast, starts with no labels at all and uses some form of cluster analysis to discover why objects group together.

2. Differences between regression and classification:

When the target variable we are trying to predict is continuous, as in our housing example, we call the learning problem a regression problem. When y can take on only a small number of discrete values (for example, given the living area, predicting whether a dwelling is a house or an apartment), we call it a classification problem.

Single feature:

Cost function:

J(θ0, θ1) = (1/2m) Σ_{i=1..m} (h_θ(x^(i)) − y^(i))², where h_θ(x) = θ0 + θ1x.

The factor 1/2 is added for ease of calculation later: it cancels the 2 produced when differentiating J.
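A minimal numpy sketch of this cost function (the function name compute_cost and its interface are illustrative, not from the original notes):

```python
import numpy as np

def compute_cost(theta0, theta1, x, y):
    """Squared-error cost J(theta0, theta1) for univariate linear regression."""
    m = len(y)                              # number of training samples
    h = theta0 + theta1 * x                 # hypothesis h(x) for every sample
    return np.sum((h - y) ** 2) / (2 * m)   # the 1/2 cancels when differentiating
```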

Gradient Descent Method:

The learning rate α must be chosen carefully: it cannot be too large (the updates may overshoot the minimum and diverge) or too small (convergence becomes very slow).

Taking linear regression fitting as an example, the gradient descent process is: repeat until convergence, θj := θj − α · ∂J(θ0, θ1)/∂θj, i.e. θ0 := θ0 − (α/m) Σi (h_θ(x^(i)) − y^(i)) and θ1 := θ1 − (α/m) Σi (h_θ(x^(i)) − y^(i)) x^(i), updating both parameters simultaneously.
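A runnable sketch of this loop, assuming a fixed iteration count as the stopping rule (the defaults for alpha and iters are illustrative):

```python
import numpy as np

def gradient_descent(x, y, alpha=0.01, iters=1500):
    """Fit h(x) = theta0 + theta1 * x by batch gradient descent."""
    m = len(y)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iters):
        h = theta0 + theta1 * x            # predictions with current parameters
        grad0 = np.sum(h - y) / m          # dJ/d(theta0)
        grad1 = np.sum((h - y) * x) / m    # dJ/d(theta1)
        theta0 -= alpha * grad0            # simultaneous update: both gradients
        theta1 -= alpha * grad1            # were computed before either change
    return theta0, theta1
```

If the cost J increases from one iteration to the next, alpha is too large; if J decreases but only very slowly, alpha is too small.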

Multiple features

Hypothesis function: h_θ(x) = θᵀx = θ0x0 + θ1x1 + … + θnxn

x0 = 1: the 0th feature of every sample is fixed at 1; it is introduced so that the matrix operation θᵀx works out conveniently.

(Try to think of θ as the weights that balance the contribution of each feature.)

(In Li Hang's Statistical Learning Methods, the meaning of subscripts and superscripts is the opposite of the convention used here.)

Gradient Descent method:

The update rule generalizes to n features: θj := θj − (α/m) Σ_{i=1..m} (h_θ(x^(i)) − y^(i)) xj^(i), updating all θj (j = 0, …, n) simultaneously; in vectorized form, θ := θ − (α/m) Xᵀ(Xθ − y).
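A vectorized sketch of the same update for n features (numpy; assumes the x0 = 1 column has already been prepended to X):

```python
import numpy as np

def gradient_descent_multi(X, y, alpha=0.01, iters=1500):
    """Batch gradient descent; X has shape (m, n+1), first column all ones."""
    m = len(y)
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        theta = theta - (alpha / m) * (X.T @ (X @ theta - y))
    return theta

# Building X from raw features X_raw of shape (m, n):
# X = np.column_stack([np.ones(len(X_raw)), X_raw])
```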

Feature Scaling:

It can speed up and stabilize gradient descent. Each feature is rescaled as xi := (xi − μi) / si, where:

μi denotes the mean of the i-th feature over all samples;

si denotes the range of the i-th feature over all samples (maximum minus minimum), or alternatively its standard deviation.
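These two definitions translate directly into a mean-normalization helper (a sketch; the standard-deviation variant is used, with the range variant shown in a comment):

```python
import numpy as np

def feature_scale(X):
    """Mean-normalize each column of X: (xi - mu_i) / s_i."""
    mu = X.mean(axis=0)                    # mu_i: per-feature mean
    s = X.std(axis=0)                      # s_i: the standard deviation ...
    # s = X.max(axis=0) - X.min(axis=0)    # ... or the range (max - min)
    return (X - mu) / s, mu, s             # keep mu, s to scale new inputs identically
```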

Polynomial Regression:

When a straight line does not fit the data well, polynomial regression can be considered: quadratic or cubic terms, or even a square-root term.
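One common way to realize this is to add powered (or square-root) copies of a feature as extra columns and then fit an ordinary linear model on them; a sketch with an illustrative helper name:

```python
import numpy as np

def polynomial_features(x, degree=3):
    """Expand one feature x into columns [x, x^2, ..., x^degree]."""
    return np.column_stack([x ** d for d in range(1, degree + 1)])

# A square-root term works the same way:
# X = np.column_stack([x, np.sqrt(x)])
# Note: x, x^2, x^3 have very different ranges, so feature scaling matters here.
```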

Normal Equation Method

The origin of the normal equation

Suppose we have m samples, each with an n-dimensional feature vector, so the training set is {(x^(1), y^(1)), (x^(2), y^(2)), ..., (x^(m), y^(m))}, where each sample is x^(i) = {x1^(i), x2^(i), ..., xn^(i)}. With h(θ) = θ0 + θ1x1 + θ2x2 + ... + θnxn for every sample, we want h(θ) = y to hold for all samples at once, which in matrix form is

Xθ = y

Let's recall two concepts, the identity matrix and the inverse of a matrix, and see what their properties are.

(1) Identity matrix E

AE = EA = A

(2) Inverse A⁻¹ of a matrix

Requirement: A must be a square matrix

Property: AA⁻¹ = A⁻¹A = E

Now look again at the equation Xθ = y.

To solve for θ, we need a few transformations:

Step 1: First turn the matrix to the left of θ into a square matrix. This is achieved by multiplying both sides by Xᵀ:

XᵀXθ = Xᵀy

Step 2: Turn the matrix to the left of θ into the identity matrix, so that it vanishes:

(XᵀX)⁻¹(XᵀX)θ = (XᵀX)⁻¹Xᵀy

Step 3: Because (XᵀX)⁻¹(XᵀX) = E, the equation becomes

Eθ = (XᵀX)⁻¹Xᵀy

E can be dropped, giving

θ = (XᵀX)⁻¹Xᵀy

This is the normal equation we set out to derive.
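The closed form maps directly onto numpy; np.linalg.solve is used instead of forming the inverse explicitly, which is the numerically safer equivalent (a sketch):

```python
import numpy as np

def normal_equation(X, y):
    """Solve (X^T X) theta = X^T y, i.e. theta = (X^T X)^(-1) X^T y."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Usage: X is (m, n+1) with a leading column of ones, y has shape (m,)
# theta = normal_equation(X, y)
```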

Normal Equation vs. Gradient Descent

Like gradient descent, the normal equation can be used to compute the weight vector θ, but compared with gradient descent it has both advantages and disadvantages.

Advantage:

The normal equation is indifferent to the scale of the features in X. For example, take a feature vector x = {x1, x2} where x1 ranges over 1~2000 and x2 over 1~4: the ranges differ by a factor of 500. With gradient descent, such a difference makes the contours of the cost function very narrow and elongated, so descent becomes difficult or even impossible (after multiplying by the step size, an update can shoot out of the elongated contour). With the normal equation method, there is no such worry, because it is a purely algebraic matrix computation.

Disadvantage:

Compared with gradient descent, the normal equation requires a large amount of matrix computation, especially the matrix inversion. When the matrix is large, the computational cost (roughly O(n³) for inverting XᵀX) and the memory requirements grow sharply.

Under what circumstances is XᵀX in the normal equation non-invertible, and how do we deal with it?

(1) When the feature dimension is too large relative to the number of samples (e.g., m ≤ n)

Workaround: ① use regularization (a sketch follows below), or ② delete some of the feature dimensions.
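Option ① can be realized with L2 (ridge) regularization, which adds λ > 0 to the diagonal of XᵀX and thereby makes it invertible; a sketch (the function name and default λ are illustrative):

```python
import numpy as np

def regularized_normal_equation(X, y, lam=1.0):
    """theta = (X^T X + lam * L)^(-1) X^T y, where L is the identity
    matrix with its (0, 0) entry zeroed so the bias theta0 is not penalized."""
    L = np.eye(X.shape[1])
    L[0, 0] = 0.0                          # do not regularize the x0 = 1 bias term
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)
```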

(2) Redundant features (i.e., linearly dependent features)

For example, x1 = size in feet², x2 = size in m². Since 1 m ≈ 3.28 feet, x1 ≈ 3.28² · x2, so x1 and x2 are linearly dependent (there is redundancy between x1 and x2).

Workaround: Find redundant feature dimensions and delete them.
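Redundant (linearly dependent) columns can be detected by comparing the rank of X with its number of columns, since rank deficiency is exactly what makes XᵀX singular (a small sketch):

```python
import numpy as np

def has_redundant_features(X):
    """True if the columns of X are linearly dependent (X^T X is singular)."""
    return np.linalg.matrix_rank(X) < X.shape[1]
```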

Comparison of the gradient descent method and the normal equation method:

Gradient descent: the learning rate α must be chosen; many iterations are needed; it works well even when the number of features n is large.

Normal equation: no α and no iterations are needed; it must compute (XᵀX)⁻¹, which costs roughly O(n³), so it becomes very slow when n is large.
