Having previously studied linear classification, linear regression, and logistic regression, this post summarizes them, focusing on the derivation of the cross-entropy loss function and the gradient descent method.

I. Overview
Let's start with a figure from teacher Lin Hsuan-Tien's lecture handout.
It shows the differences among PLA, linear regression, and logistic regression.
The error function changes from the 0/1 error, to the mean squared error, to the cross-entropy error.

1.1 PLA/Pocket
PLA works on linearly separable data for binary classification with the 0/1 error. Initialize the weights, then iterate: whenever a point is misclassified, correct the weights with $w_{t+1} = w_t + y_{n(t)}x_{n(t)}$, and stop once no point is misclassified.
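To make the update rule concrete, here is a minimal NumPy sketch of PLA; the input matrix, labels, and iteration cap are assumptions for illustration, not code from the handout.

```python
import numpy as np

def pla(X, y, max_iters=1000):
    """Perceptron Learning Algorithm for linearly separable data.

    X: (N, d) inputs with a bias column of ones already included.
    y: (N,) labels in {-1, +1}.
    """
    w = np.zeros(X.shape[1])                  # initialize the weights
    for _ in range(max_iters):
        mistakes = np.where(np.sign(X @ w) != y)[0]
        if mistakes.size == 0:                # no misclassified point left
            break
        n = mistakes[0]                       # pick one misclassified point
        w = w + y[n] * X[n]                   # w_{t+1} = w_t + y_n * x_n
    return w
```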
Later, to handle data that is not linearly separable, the pocket algorithm was introduced: instead of looking for weights with zero classification error, it records during the iterations how many points each weight vector misclassifies, and after enough iterations takes the best weights seen so far as the final result.
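A pocket variant can be sketched the same way; the only real change from PLA is remembering the best weights seen so far, and the random choice of which misclassified point to correct is an assumption of this sketch.

```python
import numpy as np

def pocket(X, y, max_iters=1000, seed=0):
    """Pocket algorithm: PLA-style updates on non-separable data,
    while keeping ("pocketing") the weights with the fewest mistakes."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    best_w, best_mistakes = w.copy(), len(y) + 1
    for _ in range(max_iters):
        mistakes = np.where(np.sign(X @ w) != y)[0]
        if mistakes.size < best_mistakes:     # better than the pocketed weights
            best_w, best_mistakes = w.copy(), mistakes.size
        if mistakes.size == 0:
            break
        n = rng.choice(mistakes)              # correct a random misclassified point
        w = w + y[n] * X[n]
    return best_w
```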
1.2 Linear regression

Linear regression can be used for problems such as predicting a bank card credit limit or predicting housing prices. It uses the mean squared error.
I won't go into much detail here for now.
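Still, as a quick illustration before moving on, here is a minimal least-squares fit and its mean squared error; the toy data and the use of np.linalg.lstsq are assumptions made only for this sketch.

```python
import numpy as np

# Toy data for illustration only: one feature plus a bias column of ones.
X = np.column_stack([np.ones(4), np.array([1.0, 2.0, 3.0, 4.0])])
y = np.array([1.1, 2.0, 2.9, 4.2])

# Least-squares weights minimize the mean squared error.
w, *_ = np.linalg.lstsq(X, y, rcond=None)
mse = np.mean((X @ w - y) ** 2)
print("weights:", w, "MSE:", mse)
```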
Let's move straight to logistic regression.

II. Logistic Regression

2.1 Basic Introduction
When predicting whether heart disease will recur, we cannot give a definite yes-or-no answer; we can only say with what probability it will recur. However, our training data only records whether it recurred or not, while what we want the model to output is a probability.
This is where the logistic function comes in: it maps a score to a number between 0 and 1, which represents the probability.
Logistic function: $f(x) = \frac{1}{1+e^{-x}}$
Thus, feeding the linear score $w^Tx$ into the logistic function gives the hypothesis function: $h(x) = \frac{1}{1+e^{-w^Tx}}$
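A minimal numerical sketch of the logistic function and this hypothesis; the example weights and input below are arbitrary assumptions.

```python
import numpy as np

def logistic(s):
    """Logistic function: squashes any real score into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-s))

def h(x, w):
    """Hypothesis: probability of the positive class for input x."""
    return logistic(np.dot(w, x))

w = np.array([0.5, -1.0, 2.0])   # assumed example weights (first entry is the bias)
x = np.array([1.0, 0.3, 0.8])    # assumed example input with a leading 1 for the bias
print(h(x, w))                   # prints a probability between 0 and 1
```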
So how do we optimize this hypothesis function, and what kind of error function should we use? This is where the cross-entropy loss function comes in.

2.2 Derivation of the cross-entropy loss function
Suppose we have a data set $D = \{(x_1, \circ), (x_2, \times), \dots, (x_N, \times)\}$.
The probability of our target function generating this data set is:
$P(D) = P(x_1, \circ)\,P(x_2, \times)\cdots P(x_N, \times)$
In the formula, ○ denotes the positive class and × denotes the negative class.
Since, for a known point $x_1$, the probability of producing ○ is exactly our target function, we have $P(\circ|x_1) = f(x_1)$, and likewise the probability of producing × is $1 - f(x)$.
We also use the formula for conditional probability:
$P(B|A) = \frac{P(AB)}{P(A)}$, which rearranges to $P(AB) = P(A)\,P(B|A)$.
Therefore, the probability of generating the data set $D$ can be expressed as:
$P(D) = P(x_1)f(x_1)\cdot P(x_2)(1-f(x_2))\cdots P(x_N)(1-f(x_N))$
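To check the formula numerically, the sketch below evaluates this product on a toy data set, standing in for the unknown $f$ with a logistic hypothesis; the data, labels, and weights are assumptions for illustration, and the $P(x_n)$ factors are left out because they do not depend on the hypothesis.

```python
import numpy as np

def logistic(s):
    return 1.0 / (1.0 + np.exp(-s))

# Assumed toy data: x_1 is a positive (o) example, x_2 and x_3 are negative (x).
X = np.array([[1.0, 0.5], [1.0, -1.2], [1.0, 0.1]])   # bias column included
y = np.array([+1, -1, -1])                             # o -> +1, x -> -1
w = np.array([0.2, 1.0])                               # assumed candidate weights

# Approximate the unknown f(x_n) with the hypothesis h(x_n) = logistic(w.x_n).
# Only the f / (1 - f) part of P(D) is evaluated; the P(x_n) factors are
# the same for every candidate hypothesis, so they are omitted here.
f = logistic(X @ w)
likelihood = np.prod(np.where(y == 1, f, 1.0 - f))
print(likelihood)
```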