Original: http://blog.csdn.net/dongtingzhizi/article/details/15962797
Logistic Regression Summary
Author: 洞庭之子 (Dongting Zhizi)
Weibo: 洞庭之子-Bing
(November 2013)
PDF: http://download.csdn.net/detail/lewsn2008/6547463
1. Introduction
After watching the logistic regression lectures in Andrew Ng's Stanford machine learning open course and then reading the logistic regression chapter of Machine Learning in Action, I wrote up these study notes as a summary.
My first impression is that Machine Learning in Action presents each algorithm's principles together with working source code, which makes it very hands-on and deepens understanding of the algorithm. Its weakness is that the principles are covered rather roughly and many details are left out. Readers without the necessary background (myself included) may therefore get confused in places and need to consult other material. The book is better suited to readers who already have some background.
This article covers three topics:
(1) The basic principle of logistic regression (Section 2);
(2) The detailed procedure of logistic regression: choosing the prediction function, deriving the cost function and J(θ), finding the minimum of J(θ) by gradient descent, and vectorizing the gradient descent process (Section 3);
(3) An analysis of the implementation given in Machine Learning in Action, answering the questions I had when reading the book's logistic regression chapter. Readers without background may be confused by that chapter: the code in the book is short, but it is hard to connect with the theory the book presents. Many questions arise, for example: gradient descent is normally used to find the minimum of a loss function, so why is gradient ascent used here? The book says it uses gradient ascent, so where in the code is the gradient actually computed? These questions are answered in Sections 3 and 4.
Works cited or consulted are listed in the References at the end. This article reflects only my personal understanding; corrections of any errors or omissions are welcome. Now let's get down to business.
2. Fundamentals
The principles of logistic regression are similar to those of linear regression. As I understand it, the process can be described simply as follows:
(1) Find a suitable prediction function (called the hypothesis in Andrew Ng's course), usually written as the function h. This is the classification function we are looking for, used to predict the outcome for input data. This step requires some understanding or analysis of the data, so that the approximate form of the prediction function, for example linear or nonlinear, is known or can be guessed.
(2) Construct a cost function (loss function) that measures the deviation between the predicted output (h) and the category of the training data (y). It can be the difference between the two (h − y) or take another form. Taking the "loss" over all training data into account, the costs are summed or averaged and denoted J(θ), which represents the deviation of the predictions from the actual categories over all training data.
(3) Clearly, the smaller the value of J(θ), the more accurate the prediction function h, so this step finds the minimum of J(θ). There are various methods for minimizing a function; the logistic regression implementation discussed here uses gradient descent.
3. Detailed procedure
3.1 Constructing the prediction function
Although its name contains "regression", logistic regression is actually a classification method, used for two-class problems (that is, the output has only two possible values). Following the steps in Section 2, we first need a prediction function (h). Its output must represent two values (one per category), so the logistic function (also called the sigmoid function) is used:
$$ g(z) = \frac{1}{1 + e^{-z}} \tag{1} $$
Its graph is an S-shaped curve whose values lie between 0 and 1 (Figure 1).
Figure 1
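As a small illustration of the logistic function described above, here is a minimal pure-Python sketch. The function name `sigmoid` and the sign-based branching (used only to avoid overflow in `math.exp`) are my own choices and are not taken from the course or the book.

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    # Branch on the sign of z so math.exp never receives a large positive argument.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

# The output always lies in [0, 1] and equals 0.5 at z = 0.
for z in (-5.0, 0.0, 5.0):
    print(z, sigmoid(z))
```

The values rise monotonically from near 0 to near 1, which is the S shape shown in Figure 1.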
Next we need to determine the type of boundary that separates the data. Of the two data distributions in Figure 2 and Figure 3, Figure 2 clearly calls for a linear boundary and Figure 3 for a nonlinear one. Only the linear case is discussed below.
Figure 2
Figure 3
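For a linear boundary, the boundary has the form:
$$ \theta_0 + \theta_1 x_1 + \cdots + \theta_n x_n = \sum_{i=0}^{n} \theta_i x_i = \theta^T x \quad (x_0 = 1) \tag{2} $$
The prediction function is then constructed as:
$$ h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}} \tag{3} $$
The value of hθ(x) has a special meaning: it is the probability that the result is 1. For an input x, the probabilities that the classification result is category 1 and category 0 are therefore, respectively:
$$ P(y = 1 \mid x; \theta) = h_\theta(x), \qquad P(y = 0 \mid x; \theta) = 1 - h_\theta(x) \tag{4} $$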
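The prediction function and the two class probabilities can be sketched as follows. The parameter values, the sample point, and the 0.5 decision threshold are illustrative assumptions of mine, not values from the source.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def h(theta, x):
    """h_theta(x) = g(theta^T x); x must include the constant feature x_0 = 1."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)

theta = [-3.0, 1.0, 1.0]   # hypothetical parameters: linear boundary x1 + x2 = 3
x = [1.0, 2.0, 2.0]        # x_0 = 1, x_1 = 2, x_2 = 2

p1 = h(theta, x)           # P(y = 1 | x; theta)
p0 = 1.0 - p1              # P(y = 0 | x; theta)
label = 1 if p1 >= 0.5 else 0   # assumed decision rule: pick the likelier class
print(p1, p0, label)
```

A point lying exactly on the boundary (θᵀx = 0) gets probability 0.5 for both categories.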
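3.2 Constructing the cost function
Andrew Ng gives the Cost function (5) and the J(θ) function (6) directly in the course without a detailed derivation, only noting that they are a reasonable measure of how well h predicts:
$$ \mathrm{Cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases} \tag{5} $$
$$ J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \mathrm{Cost}(h_\theta(x^{(i)}), y^{(i)}) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right] \tag{6} $$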
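In fact, the Cost function and J(θ) are derived from maximum likelihood estimation. The derivation is explained in detail below. The two probabilities in (4) can be combined into a single expression:
$$ P(y \mid x; \theta) = (h_\theta(x))^{y} (1 - h_\theta(x))^{1 - y} \tag{7} $$
The likelihood function is:
$$ L(\theta) = \prod_{i=1}^{m} P(y^{(i)} \mid x^{(i)}; \theta) = \prod_{i=1}^{m} (h_\theta(x^{(i)}))^{y^{(i)}} (1 - h_\theta(x^{(i)}))^{1 - y^{(i)}} \tag{8} $$
The log-likelihood function is:
$$ l(\theta) = \log L(\theta) = \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right] \tag{9} $$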
Maximum likelihood estimation seeks the θ that maximizes l(θ). This could be solved directly with gradient ascent, and the resulting θ would be the optimal parameter. In Andrew Ng's course, however, J(θ) is taken as in (6), namely:
$$ J(\theta) = -\frac{1}{m} l(\theta) \tag{10} $$
Because of the negative factor −1/m, the θ that minimizes J(θ) is the optimal parameter we want.
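The relationship between the log-likelihood l(θ) and J(θ) can be checked numerically. The sketch below is only an illustration: the toy data set and the two parameter vectors are made up, and the helper names `log_likelihood` and `cost` are my own.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def h(theta, x):
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))

def log_likelihood(theta, X, y):
    """l(theta): sum over samples of y*log(h) + (1 - y)*log(1 - h)."""
    return sum(yi * math.log(h(theta, xi)) + (1 - yi) * math.log(1 - h(theta, xi))
               for xi, yi in zip(X, y))

def cost(theta, X, y):
    """J(theta) = -(1/m) * l(theta)."""
    return -log_likelihood(theta, X, y) / len(X)

# Toy data, made up for illustration; the first column is x_0 = 1.
X = [[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.0]]
y = [0, 0, 1, 1]

print(cost([0.0, 0.0], X, y))    # every h is 0.5, so J equals log 2
print(cost([-4.5, 2.0], X, y))   # a better-fitting theta gives a smaller J
```

Maximizing l(θ) and minimizing J(θ) select the same θ, since the two differ only by the constant negative factor.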
3.3 Finding the minimum of J(θ) by gradient descent
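The minimum of J(θ) can be found with gradient descent. According to the gradient descent method, θ is updated as:
$$ \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta), \qquad j = 0, \ldots, n \tag{11} $$
Here α is the learning step. The partial derivative is computed as follows:
$$ \begin{aligned} \frac{\partial}{\partial \theta_j} J(\theta) &= -\frac{1}{m} \sum_{i=1}^{m} \left( y^{(i)} \frac{1}{h_\theta(x^{(i)})} - (1 - y^{(i)}) \frac{1}{1 - h_\theta(x^{(i)})} \right) \frac{\partial}{\partial \theta_j} h_\theta(x^{(i)}) \\ &= -\frac{1}{m} \sum_{i=1}^{m} \left( y^{(i)} \frac{1}{g(\theta^T x^{(i)})} - (1 - y^{(i)}) \frac{1}{1 - g(\theta^T x^{(i)})} \right) g(\theta^T x^{(i)}) \left(1 - g(\theta^T x^{(i)})\right) x_j^{(i)} \\ &= -\frac{1}{m} \sum_{i=1}^{m} \left( y^{(i)} \left(1 - g(\theta^T x^{(i)})\right) - (1 - y^{(i)})\, g(\theta^T x^{(i)}) \right) x_j^{(i)} \\ &= -\frac{1}{m} \sum_{i=1}^{m} \left( y^{(i)} - g(\theta^T x^{(i)}) \right) x_j^{(i)} \\ &= \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} \end{aligned} \tag{12} $$
The derivation uses the following identity for the derivative of the sigmoid function:
$$ g'(z) = g(z)\,(1 - g(z)), \qquad \text{so} \qquad \frac{\partial}{\partial \theta_j} g(\theta^T x) = g(\theta^T x) \left(1 - g(\theta^T x)\right) x_j \tag{13} $$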
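The update in (11) can therefore be written as:
$$ \theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}, \qquad j = 0, \ldots, n \tag{14} $$
Since α is itself a constant, the factor 1/m is usually omitted (absorbed into α), so the final θ update is:
$$ \theta_j := \theta_j - \alpha \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}, \qquad j = 0, \ldots, n \tag{15} $$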
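The final update rule can be sketched as a plain loop over the parameters. This is a minimal illustration under my own assumptions: the toy data, the step α = 0.1, the 200 iterations, and the function name `gradient_descent` are all hypothetical choices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def h(theta, x):
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))

def cost(theta, X, y):
    m = len(X)
    return -sum(yi * math.log(h(theta, xi)) + (1 - yi) * math.log(1 - h(theta, xi))
                for xi, yi in zip(X, y)) / m

def gradient_descent(X, y, alpha=0.1, iterations=200):
    theta = [0.0] * len(X[0])
    for _ in range(iterations):
        # Errors h_theta(x^(i)) - y^(i), all computed with the current theta.
        errors = [h(theta, xi) - yi for xi, yi in zip(X, y)]
        # Simultaneous update of every theta_j, with 1/m absorbed into alpha.
        theta = [theta[j] - alpha * sum(e * xi[j] for e, xi in zip(errors, X))
                 for j in range(len(theta))]
    return theta

# Toy data, made up for illustration; the first column is x_0 = 1.
X = [[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.0]]
y = [0, 0, 1, 1]
theta = gradient_descent(X, y)
print(theta, cost(theta, X, y))
```

The errors are computed once per iteration before any θ_j changes, so all components are updated from the same θ.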
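In addition, as mentioned in Section 3.2, the same θ can be obtained by maximizing l(θ). Applying gradient ascent to (9) gives:
$$ \theta_j := \theta_j + \alpha \frac{\partial}{\partial \theta_j} l(\theta) = \theta_j + \alpha \sum_{i=1}^{m} \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)} = \theta_j - \alpha \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} \tag{16} $$
This is the same update as (14) once the constant 1/m is absorbed into α, that is, exactly (15). Gradient ascent on l(θ) and gradient descent on J(θ) are therefore completely equivalent, which is why Machine Learning in Action can use gradient ascent.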
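The equivalence of the two methods can be checked with one step of each from the same starting point. The sketch below uses made-up data and hypothetical function names (`descent_step`, `ascent_step`); it only illustrates that the two formulas produce the same θ.

```python
import math

def h(theta, x):
    return 1.0 / (1.0 + math.exp(-sum(t * xi for t, xi in zip(theta, x))))

def descent_step(theta, X, y, alpha):
    # Gradient descent on J: theta_j := theta_j - alpha * sum_i (h - y) * x_j
    return [theta[j] - alpha * sum((h(theta, xi) - yi) * xi[j] for xi, yi in zip(X, y))
            for j in range(len(theta))]

def ascent_step(theta, X, y, alpha):
    # Gradient ascent on l: theta_j := theta_j + alpha * sum_i (y - h) * x_j
    return [theta[j] + alpha * sum((yi - h(theta, xi)) * xi[j] for xi, yi in zip(X, y))
            for j in range(len(theta))]

# Toy data and starting point, made up for illustration.
X = [[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.0]]
y = [0, 0, 1, 1]
theta0 = [0.2, -0.1]

down = descent_step(theta0, X, y, 0.1)
up = ascent_step(theta0, X, y, 0.1)
print(all(math.isclose(a, b) for a, b in zip(down, up)))
```

The only difference between the two steps is where the minus sign sits: in front of α or inside the error term.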
3.4 Vectorizing the gradient descent process
Andrew Ng's course mentions vectorizing the θ update only in passing, without a detailed explanation.
Machine Learning in Action does not even explain the cost function or the gradient, so it naturally says nothing about vectorization. Its code, however, is vectorized: on line 32 of the code in Figure 4, the update of weights (that is, θ) takes a single line that operates directly on matrices and vectors, with no for loop. The code is analyzed in detail in Section 4.
Reference [3] also mentions vectorization, but only briefly, and simply states its vectorized result, which still contains a sum Σ(...) over the training samples. Regardless of whether that update formula is correct, the Σ(...) is a summation that obviously requires a for loop over the m samples, so the vectorization is incomplete, unlike the code in Machine Learning in Action, which updates θ in a single statement.
Below is my understanding of the vectorization used in the code of Machine Learning in Action.
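By convention, the training data are written in matrix form as follows. Each row of x is one training sample, and each column holds the values of one feature:
$$ x = \begin{bmatrix} x^{(1)} \\ x^{(2)} \\ \vdots \\ x^{(m)} \end{bmatrix} = \begin{bmatrix} x_0^{(1)} & x_1^{(1)} & \cdots & x_n^{(1)} \\ x_0^{(2)} & x_1^{(2)} & \cdots & x_n^{(2)} \\ \vdots & \vdots & & \vdots \\ x_0^{(m)} & x_1^{(m)} & \cdots & x_n^{(m)} \end{bmatrix}, \qquad y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{bmatrix} \tag{18} $$
The parameter θ to be found is written as the column vector:
$$ \theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_n \end{bmatrix} \tag{19} $$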
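First compute x·θ and denote it A:
$$ A = x \cdot \theta = \begin{bmatrix} \theta_0 x_0^{(1)} + \theta_1 x_1^{(1)} + \cdots + \theta_n x_n^{(1)} \\ \theta_0 x_0^{(2)} + \theta_1 x_1^{(2)} + \cdots + \theta_n x_n^{(2)} \\ \vdots \\ \theta_0 x_0^{(m)} + \theta_1 x_1^{(m)} + \cdots + \theta_n x_n^{(m)} \end{bmatrix} \tag{20} $$
Then compute hθ(x) − y and denote it E:
$$ E = h_\theta(x) - y = \begin{bmatrix} g(A^{(1)}) - y^{(1)} \\ g(A^{(2)}) - y^{(2)} \\ \vdots \\ g(A^{(m)}) - y^{(m)} \end{bmatrix} = \begin{bmatrix} e^{(1)} \\ e^{(2)} \\ \vdots \\ e^{(m)} \end{bmatrix} = g(A) - y \tag{21} $$
The argument A of g(A) is a column vector, so the implementation of g must accept a column vector and return a column vector (applying the sigmoid element by element). hθ(x) − y can then be obtained in one step as g(A) − y.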
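Now look again at the θ update (15) for j = 0:
$$ \theta_0 := \theta_0 - \alpha \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_0^{(i)} = \theta_0 - \alpha \sum_{i=1}^{m} e^{(i)} x_0^{(i)} = \theta_0 - \alpha \left( x_0^{(1)}, x_0^{(2)}, \ldots, x_0^{(m)} \right) \cdot E \tag{22} $$
The update for a general θj can be written in the same way:
$$ \theta_j := \theta_j - \alpha \left( x_j^{(1)}, x_j^{(2)}, \ldots, x_j^{(m)} \right) \cdot E \tag{23} $$
Stacking these for j = 0, ..., n gives:
$$ \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_n \end{bmatrix} := \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_n \end{bmatrix} - \alpha \begin{bmatrix} x_0^{(1)} & x_0^{(2)} & \cdots & x_0^{(m)} \\ x_1^{(1)} & x_1^{(2)} & \cdots & x_1^{(m)} \\ \vdots & \vdots & & \vdots \\ x_n^{(1)} & x_n^{(2)} & \cdots & x_n^{(m)} \end{bmatrix} \cdot E = \theta - \alpha \cdot x^{T} \cdot E \tag{24} $$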
In summary, the vectorized θ update consists of the following steps:
(1) Compute A = x·θ;
(2) Compute E = g(A) − y;
(3) Update θ := θ − α·x'·E, where x' denotes the transpose of the matrix x.
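These steps can also be combined into a single expression:
$$ \theta := \theta - \alpha \cdot x^{T} \cdot \left( g(x \cdot \theta) - y \right) \tag{25} $$
As mentioned earlier, the factor 1/m can be omitted (absorbed into α).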
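The three vectorized steps can be compared against the element-wise sums they replace. The following NumPy sketch is an illustration under my own assumptions: the data are made up, plain `ndarray` objects are used, and the function names `update_loop` and `update_vectorized` are hypothetical.

```python
import numpy as np

def g(a):
    """Element-wise sigmoid; accepts a column vector and returns a column vector."""
    return 1.0 / (1.0 + np.exp(-a))

def update_loop(x, y, theta, alpha):
    """One update of every theta_j using explicit sums over the m samples."""
    m, n = x.shape
    new_theta = theta.copy()
    for j in range(n):
        s = 0.0
        for i in range(m):
            h_i = g(x[i] @ theta[:, 0])          # h_theta(x^(i)), a scalar
            s += (h_i - y[i, 0]) * x[i, j]
        new_theta[j, 0] = theta[j, 0] - alpha * s
    return new_theta

def update_vectorized(x, y, theta, alpha):
    A = x @ theta                     # step (1): m x 1
    E = g(A) - y                      # step (2): m x 1
    return theta - alpha * (x.T @ E)  # step (3): n x 1

# Toy data, made up for illustration; the first column is x_0 = 1.
x = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.0]])
y = np.array([[0.0], [0.0], [1.0], [1.0]])
theta = np.array([[0.1], [-0.2]])

same = np.allclose(update_loop(x, y, theta, 0.1), update_vectorized(x, y, theta, 0.1))
print(same)
```

Both functions return the same n×1 vector; the vectorized form simply lets the matrix product x'·E perform all the sums at once.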
4. Code Analysis
Figure 4 shows part of the implementation code given in Machine Learning in Action.
Figure 4
The sigmoid function is the g(z) function described earlier. Its parameter inX can be a vector because the program uses Python's NumPy.
The gradAscent function implements gradient ascent. Its parameters dataMatIn and classLabels are the training data. Lines 23 and 24 convert the training data to NumPy matrices and, at the same time, transpose the row vector classLabels into the column vector labelMat; dataMatrix and labelMat are then the x and y of (18). alpha is the learning step and maxCycles is the number of iterations. weights is an n-dimensional column vector (n being the number of columns of x), which is the θ of (19).
The for loop starting on line 29 repeats the θ update maxCycles times, one update per iteration. Compared with the vectorized update steps summarized at the end of Section 3.4, line 30 corresponds to computing A = x·θ and g(A), line 31 to E = g(A) − y, and line 32 to θ := θ − α·x'·E. These three lines are therefore exactly the vectorized θ update.
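Since the listing in Figure 4 is not reproduced here, the following is a sketch in the spirit of the code described above, not the book's verbatim listing. The identifiers follow the names used in the text; the use of plain NumPy arrays, the initial value of weights, the default values of alpha and maxCycles, and the toy data are my own assumptions.

```python
import numpy as np

def sigmoid(inX):
    # Works element-wise, so inX may be a scalar or a column vector.
    return 1.0 / (1.0 + np.exp(-inX))

def gradAscent(dataMatIn, classLabels, alpha=0.001, maxCycles=500):
    dataMatrix = np.asarray(dataMatIn, dtype=float)                  # x: m x n
    labelMat = np.asarray(classLabels, dtype=float).reshape(-1, 1)   # y: m x 1 column
    m, n = dataMatrix.shape
    weights = np.ones((n, 1))                # theta; starting at ones is an assumption
    for _ in range(maxCycles):
        h = sigmoid(dataMatrix @ weights)    # A = x.theta, then g(A)
        error = h - labelMat                 # E = g(A) - y
        weights = weights - alpha * (dataMatrix.T @ error)  # theta := theta - alpha.x'.E
    return weights

# Toy data, made up for illustration; the first column is x_0 = 1.
data = [[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.0]]
labels = [0, 0, 1, 1]
w = gradAscent(data, labels, alpha=0.1, maxCycles=500)
print(w.ravel())
```

Writing the update as weights + alpha * x' * (y − h) instead gives the same result, which is the gradient ascent form of the same step.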
In summary, the analysis above shows that although the code is only a dozen or so lines long, it hides many details that are hard to understand without the relevant background. After reading this article in full, it should no longer be a problem! ^_^
References
[1] Peter Harrington (USA), Machine Learning in Action
[2] Stanford Machine Learning open course (https://www.coursera.org/course/ml)
[3] http://blog.csdn.net/abcjennifer/article/details/7716281
[4] http://www.cnblogs.com/tornadomeet/p/3395593.html
[5] http://blog.csdn.net/moodytong/article/details/9731283
[6] http://blog.csdn.net/jackie_zhu/article/details/8895270