Watermelon Book Chapter 3: Linear Models


Reading notes on Zhou Zhihua's "Machine Learning".

I am writing these notes as I read, so they are informal. If there is any copyright issue, please contact me at [email protected] and I will delete them immediately.

3.1 Basic Form

Given an example described by d attributes, x = (x_1; x_2; ...; x_d), where x_i is the value of x on the i-th attribute, a linear model tries to learn a function that makes predictions through a linear combination of the attributes:

f(x) = w_1 x_1 + w_2 x_2 + ... + w_d x_d + b, or in vector form f(x) = w^T x + b,

where w = (w_1; w_2; ...; w_d).

Because w directly expresses the importance of each attribute in the prediction, linear models are highly interpretable.
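As a minimal sketch of the basic form (the weights, bias, and example below are made up for illustration):

```python
import numpy as np

# A minimal sketch of the basic linear form f(x) = w^T x + b.
# The weights and the example are made-up values, not from the book.
w = np.array([0.2, 0.5, 0.3])   # importance of each attribute
b = 1.0                         # bias term
x = np.array([3.0, 1.0, 2.0])   # one example with d = 3 attributes

f_x = w @ x + b                 # linear combination of the attributes
print(f_x)                      # 2.7
```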

3.2 Linear Regression (linear regression) (this section is mostly formula concepts, so everything is quoted from the book ~)

Given a dataset D = {(x_1, y_1), (x_2, y_2), ..., (x_m, y_m)}, where x_i = (x_i1; x_i2; ...; x_id). For simplicity, first consider the case of a single attribute, so that x_i is a scalar and we try to learn f(x_i) = w x_i + b such that f(x_i) ≈ y_i.

How do we determine w and b? The mean squared error is the most commonly used performance measure in regression tasks, so we try to minimize it:

(w*, b*) = argmin_{(w,b)} Σ_{i=1..m} ( f(x_i) - y_i )² = argmin_{(w,b)} Σ_{i=1..m} ( y_i - w x_i - b )².

The mean squared error has a very good geometric meaning: it corresponds to the commonly used Euclidean distance. The method that solves for the model by minimizing the mean squared error is called the "least squares method" (least square method). In linear regression, least squares tries to find a straight line that minimizes the sum of the Euclidean distances from all samples to the line.
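A tiny sketch of the objective being minimized (the data and candidate lines below are made up):

```python
import numpy as np

# The least-squares objective: sum of squared errors between the
# predictions f(x_i) = w*x_i + b and the labels y_i. Values are made up.
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.1, 5.9])

def squared_error(w, b):
    return np.sum((y - (w * x + b)) ** 2)

print(squared_error(2.0, 0.0))   # a good line gives a small error
print(squared_error(0.5, 1.0))   # a poor line gives a much larger one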

E_(w,b) is a convex function of w and b, and the optimal w and b are obtained where the derivatives with respect to w and b are both zero. (A function f defined on an interval [a, b] is convex if f((x_1 + x_2)/2) <= (f(x_1) + f(x_2))/2 for any two points x_1, x_2 in the interval; functions with a U-shaped curve, such as f(x) = x², are typically convex. For a function on the real numbers, convexity can be judged by the second derivative: if the second derivative is non-negative on the interval, the function is convex; if it is positive everywhere on the interval, the function is strictly convex.)

The process of solving for the w and b that minimize E_(w,b) is called least-squares "parameter estimation" (parameter estimation) of the linear regression model. Taking the derivatives of E_(w,b) with respect to w and b gives

∂E/∂w = 2 ( w Σ_{i=1..m} x_i² - Σ_{i=1..m} (y_i - b) x_i ),
∂E/∂b = 2 ( m b - Σ_{i=1..m} (y_i - w x_i) ).

Setting both equations to zero yields the closed-form (closed-form) solution for the optimal w and b:

w = Σ_{i=1..m} y_i (x_i - x̄) / ( Σ_{i=1..m} x_i² - (1/m)( Σ_{i=1..m} x_i )² ),
b = (1/m) Σ_{i=1..m} ( y_i - w x_i ),

where x̄ = (1/m) Σ_{i=1..m} x_i is the mean of x.
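As a quick check of the closed form above, a minimal sketch with made-up data:

```python
import numpy as np

# Closed-form least squares for the univariate model f(x) = w*x + b,
# following the derivative-zero conditions above. Data are made up.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
m = len(x)

x_bar = x.mean()
w = np.sum(y * (x - x_bar)) / (np.sum(x**2) - np.sum(x)**2 / m)
b = np.mean(y - w * x)
print(w, b)   # roughly w ≈ 1.96, b ≈ 0.14 for this data
```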

A more general case: given a dataset D = {(x_1, y_1), (x_2, y_2), ..., (x_m, y_m)}, where x_i = (x_i1; x_i2; ...; x_id), each sample is described by d attributes, and we try to learn f(x_i) = w^T x_i + b such that f(x_i) ≈ y_i.

This is called "multivariate linear regression" (multivariate linear regression). Similarly, least squares can be used to estimate w and b. Absorb b into the weight vector as ŵ = (w; b), and represent the dataset D as an m × (d+1) matrix X in which each row corresponds to one example: the first d elements of the row are the example's d attribute values, and the last element is always set to 1, that is:

X = ( x_11 x_12 ... x_1d 1 ; x_21 x_22 ... x_2d 1 ; ... ; x_m1 x_m2 ... x_md 1 ).

Writing the labels in vector form as y = (y_1; y_2; ...; y_m), the least-squares problem becomes

ŵ* = argmin_ŵ (y - X ŵ)^T (y - X ŵ),

and when X^T X is full-rank, setting the derivative with respect to ŵ to zero gives the closed-form solution

ŵ* = (X^T X)^{-1} X^T y.
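A minimal sketch of the multivariate closed form (the data and true parameters are made up; in practice np.linalg.lstsq is numerically safer than forming an explicit inverse):

```python
import numpy as np

# Multivariate least squares via the normal equations (X^T X) w_hat = X^T y,
# equivalent to w_hat = (X^T X)^{-1} X^T y. Data are made up.
rng = np.random.default_rng(0)
X_raw = rng.normal(size=(100, 3))                 # m = 100 examples, d = 3
true_w, true_b = np.array([1.5, -2.0, 0.5]), 0.3
y = X_raw @ true_w + true_b + 0.01 * rng.normal(size=100)

X = np.hstack([X_raw, np.ones((100, 1))])         # append the constant-1 column
w_hat = np.linalg.solve(X.T @ X, X.T @ y)         # solve instead of inverting
print(w_hat)                                      # last entry approximates b
```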

If we let the prediction of the linear regression model y = w^T x + b approximate the logarithm of the output label instead, that is,

ln y = w^T x + b,

we obtain "log-linear regression" (log-linear regression).

More generally, consider a monotone differentiable function g(·), and let

y = g^{-1}( w^T x + b ).

The resulting model is called a "generalized linear model" (generalized linear model), where the function g(·) is called the "link function"; log-linear regression is the special case of the generalized linear model with g(·) = ln(·).
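A minimal sketch of log-linear regression as a GLM special case, under the assumption that the targets are positive so the log is defined (data and parameters made up):

```python
import numpy as np

# Log-linear regression: approximate ln(y) with a linear model, i.e.
# fit least squares against log-transformed targets, then invert the link.
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 2.0, size=50)
y = np.exp(1.2 * x + 0.5) * np.exp(0.05 * rng.normal(size=50))

X = np.column_stack([x, np.ones_like(x)])
w, b = np.linalg.lstsq(X, np.log(y), rcond=None)[0]   # regress ln(y) on x
print(w, b)                         # should recover roughly 1.2 and 0.5
y_pred = np.exp(w * x + b)          # invert the link: y = e^{w x + b}
```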

3.3 Log-Odds Regression (logistic regression)

What about classification tasks? We only need to find a monotone differentiable function that links the true label y of the classification task with the prediction of the linear regression model.

For binary classification, y ∈ {0, 1}, while the prediction z = w^T x + b produced by the linear regression model is a real value. The ideal link is the "unit step function": predict 1 if z > 0, predict 0 if z < 0, and decide arbitrarily at the threshold z = 0.

But the unit step function is discontinuous, so it cannot be used directly as the link function. The log-odds function (logistic function) is a kind of "sigmoid function": it is monotone and differentiable, converts z into a y value near 0 or 1, and its output changes very steeply near z = 0:

y = 1 / (1 + e^(-z)).

Substituting it into the generalized linear model formula gives

y = 1 / (1 + e^(-(w^T x + b))),

which can be rewritten as

ln( y / (1 - y) ) = w^T x + b.

If y is regarded as the probability that sample x is a positive example, then 1 - y is the probability that it is a negative example, and the ratio y / (1 - y) is called the "odds", reflecting the relative likelihood of x being positive; its logarithm ln( y / (1 - y) ) is the "log odds" (logit). The model thus uses the prediction of a linear regression model to approximate the log odds of the true label, and the corresponding model is called "log-odds regression" (logistic regression). Despite the name, it is a classification learning method. Its advantages:

1. It models the class probability directly, without any prior assumption about the data distribution, which avoids the problems caused by an inaccurately assumed distribution.
2. It predicts not only the "class" but also an approximate probability, which is useful for many tasks that need probabilities to assist decision-making.
3. Its objective function is a convex function that is differentiable to any order, with good mathematical properties, so many existing numerical optimization algorithms can be applied directly to find the optimal solution.

The values of w and b can be estimated by the maximum likelihood method (maximum likelihood method), maximizing the log-likelihood

ℓ(w, b) = Σ_{i=1..m} ln p( y_i | x_i; w, b ).

That is, the larger the probability of each sample belonging to its true label, the better. The resulting optimization problem can then be solved by classical numerical methods such as gradient descent (gradient descent method) or Newton's method (Newton method).
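A minimal sketch of fitting log-odds regression by gradient descent on the negative log-likelihood (the data, learning rate, and iteration count are all made up):

```python
import numpy as np

# Log-odds (logistic) regression fitted by gradient descent on the mean
# negative log-likelihood. Synthetic data; beta stacks (w; b).
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
y = (X[:, 0] - 0.5 * X[:, 1] > 0).astype(float)   # synthetic binary labels

Xh = np.hstack([X, np.ones((200, 1))])            # x_hat = (x; 1)
beta = np.zeros(3)
lr = 0.1
for _ in range(1000):
    p = 1.0 / (1.0 + np.exp(-Xh @ beta))          # sigmoid of the linear score
    grad = Xh.T @ (p - y) / len(y)                # gradient of the mean NLL
    beta -= lr * grad

print(beta)                                       # w is beta[:2], b is beta[2]
```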

3.4 Linear Discriminant Analysis

Linear Discriminant Analysis (Linear Discriminant Analysis, LDA) is a classical linear learning method. It was first proposed by Fisher in 1936 for binary classification problems, so it is also called "Fisher discriminant analysis".

LDA's idea: given a training set, try to project the samples onto a straight line so that the projections of same-class samples are as close as possible while the projections of different-class samples are as far apart as possible. When classifying a new sample, project it onto the same line and determine its class from the location of its projection point.

Given a dataset D = {(x_1, y_1), (x_2, y_2), ..., (x_m, y_m)} with y_i ∈ {0, 1}, let X_i, μ_i, and Σ_i denote, respectively, the sample set, the mean vector, and the covariance matrix of class i ∈ {0, 1}.

If the data are projected onto a line w, the projections of the two class centers onto the line are w^T μ_0 and w^T μ_1.

If all sample points are projected onto the line, the covariances of the two classes of projected samples are w^T Σ_0 w and w^T Σ_1 w.

To make the projections of same-class samples as close as possible, we make the covariance of the same-class projection points as small as possible, i.e., minimize w^T Σ_0 w + w^T Σ_1 w; to make the projections of different-class samples as far apart as possible, we make the distance between the class centers as large as possible, i.e., maximize || w^T μ_0 - w^T μ_1 ||².

Considering both at the same time, the objective to maximize is

J = || w^T μ_0 - w^T μ_1 ||² / ( w^T Σ_0 w + w^T Σ_1 w ).

For the subsequent solution, see the textbook; the aim here is mainly to explain the idea.
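A minimal sketch of the textbook's closed-form solution, w ∝ S_w^{-1}(μ_0 - μ_1) with within-class scatter S_w = Σ_0 + Σ_1 (the data below are made up):

```python
import numpy as np

# Two-class LDA direction via the closed form w ∝ S_w^{-1} (mu_0 - mu_1).
rng = np.random.default_rng(3)
X0 = rng.normal(loc=[0.0, 0.0], size=(50, 2))     # class 0 samples
X1 = rng.normal(loc=[2.0, 1.0], size=(50, 2))     # class 1 samples

mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
S_w = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)
w = np.linalg.solve(S_w, mu0 - mu1)               # projection direction
print(w / np.linalg.norm(w))

# Classify a new sample by which projected class center it lands closer to.
x_new = np.array([1.5, 0.8])
proj = w @ x_new
label = 0 if abs(proj - w @ mu0) < abs(proj - w @ mu1) else 1
print(label)
```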

If W is instead a projection matrix, multiclass LDA projects the samples into an (N-1)-dimensional space, where N is the number of classes. Since N-1 is usually much smaller than the data's original number of attributes, this projection reduces the dimensionality of the sample points while exploiting class information, so LDA is often regarded as a classical supervised dimensionality-reduction technique.

3.5 Multiclass Learning

The basic strategy is decomposition: split the multiclass task into several binary classification tasks.

One vs. One (OvO): pair the N classes two by two, producing N(N-1)/2 binary classification tasks. In the test phase, the new sample is submitted to all classifiers, yielding N(N-1)/2 classification results, and the most frequently predicted class is taken as the final result.

One vs. Rest (OvR): each time, the samples of one class are taken as positive examples and all other samples as negative examples, training N classifiers. During testing, if exactly one classifier predicts positive, the corresponding class label is the final classification result; if several do, the most confident one is usually chosen.

OvR trains N classifiers while OvO trains N(N-1)/2, so OvO's storage and test-time overheads are usually larger than OvR's. However, each OvR classifier is trained on all the training data, whereas each OvO classifier only uses the samples of two classes, so when there are many classes, OvO's training overhead is usually smaller than OvR's. Predictive performance depends on the specific data distribution; in most cases the two are comparable.
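A minimal OvO voting sketch on top of an arbitrary binary trainer; `train_binary` below is a hypothetical stand-in (a crude nearest-class-mean rule, for demonstration only), and the data are made up:

```python
import numpy as np
from itertools import combinations

def train_binary(X_pos, X_neg):
    """Return a scoring function: score > 0 means the positive class."""
    mu_p, mu_n = X_pos.mean(axis=0), X_neg.mean(axis=0)
    w = mu_p - mu_n                          # nearest-class-mean direction
    t = w @ (mu_p + mu_n) / 2                # midpoint threshold
    return lambda x: w @ x - t

def ovo_predict(classifiers, x, n_classes):
    votes = np.zeros(n_classes, dtype=int)
    for (i, j), clf in classifiers.items():
        votes[i if clf(x) > 0 else j] += 1   # each pairwise classifier votes
    return int(votes.argmax())               # the most-voted class wins

rng = np.random.default_rng(4)
centers = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
X = np.vstack([rng.normal(c, 1.0, size=(30, 2)) for c in centers])
y = np.repeat([0, 1, 2], 30)

# N(N-1)/2 = 3 pairwise classifiers, each trained only on two classes' data
classifiers = {(i, j): train_binary(X[y == i], X[y == j])
               for i, j in combinations(range(3), 2)}
print(ovo_predict(classifiers, np.array([2.8, 0.2]), 3))  # likely class 1
```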

Many vs. Many (MvM): each time, several classes are taken as positive examples and several other classes as negative examples. The most common MvM technique is Error Correcting Output Codes (ECOC); the error-correcting output code gives a certain tolerance and correction ability against errors of individual classifiers.

3.6 Class Imbalance

"Class imbalance" (class-imbalance) refers to the situation in classification tasks where the numbers of training samples of different classes differ greatly.

Rescaling strategies: undersampling, oversampling, and threshold-moving. Threshold-moving adjusts the predicted odds by the observed class ratio: with m⁺ positive and m⁻ negative training examples, predict positive when y' / (1 - y') = ( y / (1 - y) ) × ( m⁻ / m⁺ ) > 1.
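A minimal sketch of threshold-moving; the probabilities below are made-up stand-ins for some classifier's outputs:

```python
import numpy as np

# Threshold-moving (rescaling): adjust a classifier's output odds by the
# observed class ratio m^-/m^+ before thresholding at 1.
def rescaled_positive(y_prob, m_pos, m_neg):
    odds = y_prob / (1.0 - y_prob)
    adjusted = odds * (m_neg / m_pos)   # y'/(1-y') = y/(1-y) * m^-/m^+
    return adjusted > 1.0               # predict positive if adjusted odds > 1

y_prob = np.array([0.05, 0.2, 0.6])     # made-up classifier outputs
print(rescaled_positive(y_prob, m_pos=100, m_neg=900))  # [False  True  True]
```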

