This chapter is a first contact with optimization algorithms: it introduces several of them and uses them to train a nonlinear function for classification.
Suppose we have some data points and we fit them with a straight line (the best-fit line); this fitting process is called regression.
Logistic regression builds a regression formula from the existing data and uses it to classify points, that is, to find a classification boundary line.
The word "regression" here comes from "best fit": we are looking for the best-fit parameters. Training the classifier therefore means finding those best-fit parameters with an optimization algorithm (gradient ascent, and an improved stochastic gradient ascent).
5.1 Classification based on logistic regression and sigmoid function
Logistic regression. Advantages: low computational cost, easy to understand and implement. Disadvantages: prone to underfitting, so classification accuracy may not be high. Applicable data types: numeric and nominal.
The sigmoid function: g(z) = 1 / (1 + e^(-z)); the hypothesis can also be written as h_θ(x) = g(θ^T x).
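As a quick sketch, the sigmoid in NumPy (the input values are chosen only to show the squashing behaviour):

import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^(-z)): maps any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0))                                  # 0.5 at z = 0
print(sigmoid(np.array([-6.0, -1.0, 1.0, 6.0])))   # approaches 0 on the left, 1 on the right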
To implement the logistic regression classifier, we multiply each feature by a regression coefficient, add up all of these products, and substitute the sum into the sigmoid function to get a value between 0 and 1.
The label y can then be assigned: predict y = 1 when h_θ(x) ≥ 0.5 and y = 0 otherwise.
The set of points where θ^T x = 0, i.e. θ0 + θ1*x1 + θ2*x2 = 0, is called the decision boundary.
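A minimal sketch of the resulting rule, thresholding h_θ(x) at 0.5 (the coefficients here are invented purely for illustration):

import numpy as np

def classify(x, theta):
    # h_theta(x) = g(theta^T x); predict 1 when h >= 0.5, i.e. when theta^T x >= 0
    h = 1.0 / (1.0 + np.exp(-np.dot(theta, x)))
    return 1 if h >= 0.5 else 0

theta = np.array([-3.0, 1.0, 1.0])                  # decision boundary: x1 + x2 = 3
print(classify(np.array([1.0, 2.0, 2.0]), theta))   # 1: the point lies above the boundary
print(classify(np.array([1.0, 0.5, 0.5]), theta))   # 0: the point lies below the boundary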
Cost function:
The cost function of linear regression is based on least squares: it minimizes the sum of squared differences between the observed and estimated values, J(θ) = (1/2m) Σ_i (h_θ(x^(i)) - y^(i))^2.
For logistic regression, however, we cannot simply minimize this squared difference between observations and estimates: with the sigmoid hypothesis, J(θ) becomes a non-convex function with many local extrema, so gradient iteration is not guaranteed to reach the final parameters (from the Andrew Ng video). So here we redefine the per-sample cost: Cost(h_θ(x), y) = -log(h_θ(x)) if y = 1, and -log(1 - h_θ(x)) if y = 0.
Looking at the curves of these two functions: when y = 1 and the estimate h is close to 1, or when y = 0 and the estimate h is close to 0, the prediction is accurate and the cost is 0; when the prediction is completely wrong, the cost goes to infinity. This makes the definition of the function reasonable.
The two cases above can be written as a single expression: Cost(h_θ(x), y) = -y log(h_θ(x)) - (1 - y) log(1 - h_θ(x)).
This gives the overall loss function J(θ) = -(1/m) Σ_i [ y^(i) log h_θ(x^(i)) + (1 - y^(i)) log(1 - h_θ(x^(i))) ].
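As a sanity check, a small NumPy sketch of this loss on made-up predictions and labels:

import numpy as np

def logistic_cost(h, y):
    # J = -(1/m) * sum( y*log(h) + (1-y)*log(1-h) )
    m = len(y)
    return -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h)) / m

h = np.array([0.9, 0.2, 0.7])   # predicted probabilities h_theta(x)
y = np.array([1.0, 0.0, 1.0])   # true labels
print(logistic_cost(h, y))      # small cost, since the predictions match the labels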
5.2 Determination of optimal regression coefficients based on optimization method
Denote the input to the sigmoid function by z, with z = w0*x0 + w1*x1 + ... + wn*xn. In vector notation, z = w^T x: the corresponding elements of the two numeric vectors are multiplied and then all summed to obtain the value z.
Here the vector x is the classifier's input data, and the vector w is the set of best-fit parameters (coefficients) we want to find.
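In NumPy that multiply-and-sum is just a dot product; a tiny check with made-up values:

import numpy as np

w = np.array([4.0, -1.0, 0.5])
x = np.array([1.0, 2.0, 3.0])
print(np.sum(w * x))   # multiply elementwise, then sum: 4 - 2 + 1.5 = 3.5
print(np.dot(w, x))    # the same value, written as z = w^T x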
5.2.1 Gradient ascent method
Gradient ascent method: to find the maximum of a function, the best way is to explore along the function's gradient direction. Writing the gradient as ∇, the gradient of a function f(x, y) is ∇f(x, y) = ( ∂f/∂x , ∂f/∂y ).
This gradient means moving by ∂f/∂x along the x direction and by ∂f/∂y along the y direction. The function f(x, y) must be defined and differentiable at the point being evaluated.
From the J(θ) above we obtain the gradient ascent update for each parameter θ_j: θ_j := θ_j + α (y^(i) - h_θ(x^(i))) x_j^(i) for a single sample.
With the summation over all m training samples included, this becomes θ_j := θ_j + α Σ_i (y^(i) - h_θ(x^(i))) x_j^(i).
For stochastic gradient ascent, only one sample is used for each parameter update, i.e. the single-sample form above.
These are the formulas behind the code that follows.
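Before the full listing in Section 5.2.2, here is a rough NumPy sketch of the two update rules (batch and single-sample); alpha and the toy data are illustrative only:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def batch_update(w, X, y, alpha=0.001):
    # w := w + alpha * X^T (y - h): the matrix product performs the sum over all samples
    h = sigmoid(X.dot(w))
    return w + alpha * X.T.dot(y - h)

def stochastic_update(w, x_i, y_i, alpha=0.01):
    # one sample at a time: w := w + alpha * (y_i - h) * x_i, no summation
    h = sigmoid(np.dot(w, x_i))
    return w + alpha * (y_i - h) * x_i

X = np.array([[1.0, 2.0], [1.0, -1.0]])   # two toy samples, two features (x0 = 1)
y = np.array([1.0, 0.0])
w = np.zeros(2)
w = batch_update(w, X, y)                 # one pass over the whole toy set
w = stochastic_update(w, X[0], y[0])      # one update from a single sample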
For example, data = [1,2,3; 4,5,6; 7,8,9; 10,11,12] is a data set with 4 sample points and 3 features, and the labels are [1,0,0,0].
Then for j = 0 the gradient component is the dot product of the first column, [1, 4, 7, 10], with the vector of differences between labels and predictions. (This is my own working of the formula.)
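To make that concrete, the same toy example in NumPy, checking that the j = 0 component of the gradient equals the first column dotted with the error vector (the zero initial weights are an arbitrary choice):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]], dtype=float)
labels = np.array([1, 0, 0, 0], dtype=float)
w = np.zeros(3)

error = labels - sigmoid(data.dot(w))   # label minus prediction, for all 4 samples
grad = data.T.dot(error)                # full gradient, one entry per feature
print(grad[0])                          # the j = 0 component
print(np.dot(data[:, 0], error))        # same number: [1, 4, 7, 10] dotted with the error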
Why use the above function as the cost function?
Andrew Ng's explanation is that minimizing the sum of squared differences between estimated and observed values gives a non-convex function of θ, whereas the cost function above, judging from its curve, satisfies the convexity requirement.
Here is another explanation: maximum likelihood estimation.
We know that h_θ(x) ≥ 0.5 (written h below) corresponds to y = 1, and h < 0.5 corresponds to y = 0. So we can treat h as the probability that y = 1. When y = 0, however, h (which is below 0.5) cannot serve as the probability of y = 0, because the idea of maximum likelihood is to make the observed data as probable as possible and a value below 0.5 is too small; instead we use 1 - h as the probability of y = 0. We then only need to maximize the joint probability of the whole data set.
This joint probability can be written, per sample, as P(y | x; θ) = h_θ(x)^y (1 - h_θ(x))^(1-y), and the likelihood is the product of this over all samples.
Taking the logarithm gives the log-likelihood, which is consistent with the cost function given above (up to the sign and the 1/m factor).
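Written out (a standard derivation, matching the cost function above rather than anything specific to this post), the per-sample probability, the likelihood, and the log-likelihood are:

P(y \mid x; \theta) = h_\theta(x)^{y} \bigl(1 - h_\theta(x)\bigr)^{1-y}
L(\theta) = \prod_{i=1}^{m} h_\theta(x^{(i)})^{y^{(i)}} \bigl(1 - h_\theta(x^{(i)})\bigr)^{1 - y^{(i)}}
\ell(\theta) = \log L(\theta) = \sum_{i=1}^{m} \Bigl[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log \bigl(1 - h_\theta(x^{(i)})\bigr) \Bigr]

Maximizing ℓ(θ) is the same as minimizing J(θ) = -ℓ(θ)/m, which is why gradient ascent on the log-likelihood agrees with the cost function defined earlier.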
Figure 5-2 The gradient ascent algorithm will re-estimate the direction of movement after reaching each point
As Figure 5-2 shows, the gradient ascent algorithm moves one step along the gradient direction at each point; the gradient operator always points in the direction of fastest increase of the function value. This specifies only the direction of movement, not its magnitude. That magnitude is called the step size, denoted alpha (α). In vector notation, the iterative formula of the gradient ascent algorithm is w := w + α∇_w f(w).
The formula is iterated until a stopping condition is reached, such as the number of iterations hitting a specified value or the algorithm coming within an allowable error.
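A toy illustration of this iteration (not the book's code): gradient ascent on f(w) = -(w - 2)^2, whose maximum is at w = 2; the step size and stopping rule here are arbitrary choices:

def grad_f(w):
    # gradient of f(w) = -(w - 2)^2
    return -2.0 * (w - 2.0)

w = 0.0
alpha = 0.1
for _ in range(500):            # stop after a fixed number of iterations...
    step = alpha * grad_f(w)
    if abs(step) < 1e-8:        # ...or when the update is negligibly small
        break
    w += step
print(w)                        # approaches 2.0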
5.2.2 Training algorithm: using gradient ascent to find the best parameters
Training samples: 100 sample points, each with two numeric features, x1 and x2.
# coding: utf-8
from numpy import *

def loadDataSet():
    # convenience function: open the file and read it line by line
    dataMat = []; labelMat = []
    fr = open('testSet.txt')
    for line in fr.readlines():
        lineArr = line.strip().split()
        dataMat.append([1.0, float(lineArr[0]), float(lineArr[1])])  # set the x0 value to 1.0 for ease of calculation
        labelMat.append(int(lineArr[2]))
    return dataMat, labelMat

def sigmoid(inX):
    return 1.0 / (1 + exp(-inX))

def gradAscent(dataMatIn, classLabels):
    # gradient ascent; dataMatIn: 2-D NumPy array, a 100*3 matrix; classLabels: class labels, a 1*100 row vector
    dataMatrix = mat(dataMatIn)                  # feature matrix
    labelMat = mat(classLabels).transpose()      # class label matrix: 100*1 column vector
    m, n = shape(dataMatrix)
    alpha = 0.001                                # step size toward the target
    maxCycles = 500                              # number of iterations
    weights = ones((n, 1))                       # n*1 column vector: 3 rows, 1 column
    for k in range(maxCycles):
        h = sigmoid(dataMatrix * weights)        # (100*3)*(3*1) = 100*1; this single matrix product involves 300 multiplications
        error = (labelMat - h)                   # difference between the real class and the predicted class
        weights = weights + alpha * dataMatrix.transpose() * error  # w := w + α∇_w f(w)
    return weights
Note the second-to-last line of code:
weights = weights + alpha * dataMatrix.transpose() * error    # w := w + α∇_w f(w)
It is the vectorized form of the batch update derived above: multiplying the transposed data matrix by the error vector performs the summation over all samples in a single matrix product.
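Assuming a testSet.txt file with the 100 training samples sits in the working directory, the listing can be exercised as follows (the exact coefficients depend on the data file):

dataArr, labelMat = loadDataSet()
weights = gradAscent(dataArr, labelMat)
print(weights)    # a 3*1 matrix of fitted coefficients w0, w1, w2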
5.2.3 Analyzing data: Drawing decision boundaries
We have now solved for a set of regression coefficients that determine the dividing line between the different classes of data. How do we draw that dividing line so the optimization process is easy to understand?
# 5-2: plot the data set and the logistic regression best-fit line
def plotBestFit(weights):
    import matplotlib.pyplot as plt
    dataMat, labelMat = loadDataSet()
    dataArr = array(dataMat)
    n = shape(dataArr)[0]                        # n = 100
    xcord1 = []; ycord1 = []
    xcord2 = []; ycord2 = []
    for i in range(n):
        if int(labelMat[i]) == 1:
            xcord1.append(dataArr[i, 1]); ycord1.append(dataArr[i, 2])
        else:
            xcord2.append(dataArr[i, 1]); ycord2.append(dataArr[i, 2])
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.scatter(xcord1, ycord1, s=30, c='red', marker='s')
    ax.scatter(xcord2, ycord2, s=30, c='green')
    x = arange(-3.0, 3.0, 0.1)
    y = (-weights[0] - weights[1] * x) / weights[2]   # best-fit line: solve w0 + w1*x1 + w2*x2 = 0 for x2
    ax.plot(x, y)
    plt.xlabel('X1'); plt.ylabel('X2')
    plt.show()
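Since gradAscent returns a NumPy matrix, one common way to call the plotting routine is to convert the weights to a plain array first:

dataArr, labelMat = loadDataSet()
weights = gradAscent(dataArr, labelMat)
plotBestFit(weights.getA())   # getA() turns the 3*1 matrix into an array so weights[0], [1], [2] index cleanly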
The classification result is quite good. Although the example is simple and the data set small, this method still requires a lot of computation (300 multiplications each time the whole data set is processed).
So the next section will make a little improvement to the algorithm so that it can be used on other real data.
Note: Section 5.1 refers to the post linked below.
Source: "Small village head", http://blog.csdn.net/lu597203933, "5 Logistic regression (I)".