(i) Understanding the logistic regression (LR) classifier
First of all, although it is named "regression", logistic regression is actually a classification method, mainly used for binary (two-class) classification problems. It uses the logistic function (also called the sigmoid function), whose independent variable ranges over (-inf, +inf) and whose value ranges over (0, 1). The function form is:

σ(z) = 1 / (1 + e^(-z))
Since the domain of the sigmoid function is (-inf, +inf) and its range is (0, 1), the most basic LR classifier is suited to classifying targets into two categories (class 0, class 1). The sigmoid function has a nice "S" shape, as shown in the figure:
The LR classifier (Logistic Regression Classifier) is designed to learn a 0/1 classification model from the training data features: a linear combination of the sample features, z = θ0·x0 + θ1·x1 + ... + θn·xn, is taken as the argument, and the logistic function maps it into (0, 1). Solving the LR classifier therefore means solving for a set of weights θ0, θ1, ..., θn (x0 is a dummy variable that is constant; in actual projects it is often set to x0 = 1.0. Regardless of the meaning of the constant term, it is best to keep it). Substituting into the logistic function gives the prediction function:

hθ(x) = σ(θᵀx) = 1 / (1 + e^(-θᵀx))
The value of this function represents the probability that the result is 1, that is, the probability that the feature vector x belongs to class y = 1. Therefore, the probabilities that an input x is classified into category 1 or category 0 are:

P(y = 1 | x; θ) = hθ(x)
P(y = 0 | x; θ) = 1 - hθ(x)
When we want to determine which class a new feature vector belongs to, we compute a z value according to the following formula:

z = θ0·x0 + θ1·x1 + ... + θn·xn

(x1, x2, ..., xn are the features of one sample, of dimension n), and then find σ(z): if it is greater than 0.5 the sample belongs to class y = 1, and otherwise to class y = 0. (Note: this still assumes the statistical samples are evenly distributed between the two classes, which is why the threshold is 0.5.) How is this set of weights of the LR classifier computed? This involves the concept of maximum likelihood estimation (MLE) and an optimization algorithm; the most commonly used optimization algorithm is gradient ascent (descent).
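The decision rule above can be sketched in a few lines of Python (the weight values and the sample here are arbitrary, chosen only to illustrate the rule, not learned from data):

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: maps (-inf, +inf) into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def classify(x, weights):
    """Classify a feature vector x (with x0 = 1.0 prepended for the
    constant term): class 1 if sigmoid(theta . x) > 0.5, else class 0."""
    prob = sigmoid(np.dot(x, weights))
    return 1 if prob > 0.5 else 0

# Hypothetical weights and sample, just to show the rule in action
weights = np.array([1.0, -0.5, 0.3])
sample = np.array([1.0, 2.0, 4.0])   # x0 = 1.0, x1 = 2.0, x2 = 4.0
label = classify(sample, weights)    # z = 1.2, sigmoid(1.2) > 0.5 -> class 1
```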
Logistic regression can also be used for multi-class classification, but the binary form is more commonly used and easier to interpret, so the most common form in practice is binary logistic regression. The LR classifier is suitable for numeric and nominal data types. Its advantages are low computational cost and being easy to understand and implement; its disadvantage is that it is prone to underfitting, so the classification accuracy may not be high.
(ii) Mathematical derivation of logistic regression
1, Gradient descent method to solve logistic regression
First of all, understanding the following mathematical derivation requires a number of derivative formulas; you can refer to a table of derivatives and integrals of common elementary functions.
Assume there are m observation samples, with observed values y1, y2, ..., ym. Let p(yi = 1 | xi; θ) = hθ(xi) be the probability of obtaining yi = 1 under the given conditions; then the conditional probability of obtaining yi = 0 under the same conditions is p(yi = 0 | xi; θ) = 1 - hθ(xi). Thus, the probability of obtaining a single observed value yi is

P(yi | xi; θ) = hθ(xi)^yi · (1 - hθ(xi))^(1 - yi)

(this formula simply combines the two formulas above into one).
Because each observation is independent, their joint distribution can be expressed as the product of the marginal distributions:

L(θ) = ∏(i=1..m) hθ(xi)^yi · (1 - hθ(xi))^(1 - yi)

(m denotes the number of statistical samples). The expression above is called the likelihood function of the m observations. Our goal is to find the parameter estimates that maximize this likelihood function; that is, the key of maximum likelihood estimation is to find the parameters θ such that L(θ) attains its maximum.
Taking the logarithm of the above function:

l(θ) = log L(θ) = Σ(i=1..m) [ yi·log hθ(xi) + (1 - yi)·log(1 - hθ(xi)) ]
Maximum likelihood estimation seeks the θ that maximizes l(θ), which can be solved by the gradient ascent method; the θ obtained is the required best parameter. In Andrew Ng's course, J(θ) is taken as

J(θ) = -(1/m)·l(θ)

and the θ that minimizes J(θ) is the required best parameter, obtained by the gradient descent method. The initial value of θ can be all 1.0, and the update process is:

θj := θj - α·∂J(θ)/∂θj

(j indexes the j-th attribute of a sample, n in total; α is the step size, i.e. the amount moved each update, freely specified).
Therefore, after computing the partial derivative, the update process of θ (which can have an initial value of all 1.0) can be written as:

θj := θj - α·(1/m)·Σ(i=1..m) (hθ(x^(i)) - y^(i))·xj^(i)

(i indexes the i-th statistical sample, xj^(i) is the j-th attribute of that sample; α is the step size).
This formula is iterated until a stopping condition is reached (for example, the number of iterations reaches a specified value, or the algorithm falls within an allowable error range).
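As a sketch, the batch update above can be written with NumPy as follows (the function name and the toy data are mine, for illustration only; θ is initialized to all 1.0 as described):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_descent(x, y, alpha=0.1, max_iter=1000):
    """Minimize J(theta) = -(1/m) l(theta) by batch gradient descent.
    x: m x (n+1) sample matrix (first column is the constant 1.0);
    y: length-m vector of 0/1 labels."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float).reshape(-1, 1)
    m, n = x.shape
    theta = np.ones((n, 1))                    # initial value: all 1.0
    for _ in range(max_iter):
        error = sigmoid(x @ theta) - y         # h_theta(x) - y for every sample
        theta = theta - (alpha / m) * (x.T @ error)
    return theta

# Toy separable data: the sign of the second feature decides the class
x = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0, 0, 1, 1])
theta = grad_descent(x, y)
preds = (sigmoid(x @ theta) > 0.5).astype(int).ravel()
```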
2, Vectorized solution
Vectorization uses matrix calculations instead of for loops, simplifying the calculation process and improving efficiency. In the update formula above, Σ(...) is a summation, which obviously needs a for loop over all m samples, so the update is not yet fully vectorized. The vectorization process is described below:
By convention, the matrix form of the training data is as follows: each row of x is a training sample, and each column a different feature value; y is a column vector of labels and θ a column vector of weights:
The parameter A of g(A) is a column vector, so the g function must be implemented to accept a column vector as a parameter and return a column vector. It follows from the above that hθ(x) - y can be computed in one step as g(A) - y.
The θ update process can then be changed to:

θ := θ - α·xᵀ·E

In summary, the vectorized θ update consists of the following steps:

(1) Compute A = x·θ
(2) Compute E = g(A) - y
(3) Compute θ := θ - α·xᵀ·E (α is the step size)
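A minimal sketch of these three steps as one update per call (names are mine; the 1/m factor from the earlier derivation is absorbed into the step size α here):

```python
import numpy as np

def g(a):
    """Sigmoid; accepts a column vector and returns a column vector."""
    return 1.0 / (1.0 + np.exp(-a))

def update_theta(x, y, theta, alpha):
    """One fully vectorized update of theta, following the three steps."""
    a = x @ theta                     # (1) A = x . theta
    e = g(a) - y                      # (2) E = g(A) - y
    return theta - alpha * (x.T @ e)  # (3) theta := theta - alpha * x' * E
```

The matrix product xᵀ·E does exactly the work of the for loop over the m samples, in one call.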
3, Algorithm optimization: the stochastic gradient method
The gradient ascent (descent) algorithm needs to traverse the entire data set each time the regression coefficients are updated, which is fine when dealing with around a hundred samples, but if there are billions of samples and thousands of features, the computational complexity of the method is too high. An improved method is to update the regression coefficients with only one sample point at a time, which is called the stochastic gradient algorithm. Because the classifier can be updated incrementally when a new sample arrives, it can complete the parameter update as new data comes in, without rereading the whole data set for a batch operation, so the stochastic gradient algorithm is an online learning algorithm. (As opposed to online learning, processing all the data at once is referred to as "batch" processing.) The stochastic gradient algorithm achieves an effect comparable to the gradient algorithm, but with higher computational efficiency.
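A sketch of the stochastic variant (one sample per update; the decaying step size and the random sample order are common refinements, not requirements of the basic algorithm):

```python
import random
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def stoc_grad_ascent(x, y, num_iter=200):
    """Stochastic gradient ascent: each update touches only ONE sample,
    so a newly arriving sample can be absorbed incrementally (online)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    m, n = x.shape
    theta = np.ones(n)
    for it in range(num_iter):
        indices = list(range(m))
        for k in range(m):
            alpha = 4.0 / (1.0 + it + k) + 0.01     # decaying step size
            i = indices.pop(random.randrange(len(indices)))
            h = sigmoid(np.dot(x[i], theta))        # scalar prediction
            theta = theta + alpha * (y[i] - h) * x[i]
    return theta
```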
(iii) Python implementation of logistic regression algorithm
In the previous section, J(θ) = -(1/m)·l(θ) from Andrew Ng's course was minimized by gradient descent to illustrate the logistic regression process. The Python implementation here instead works on l(θ) directly, maximizing it by the gradient ascent method or the stochastic gradient ascent method; the Lrtrain object implements both solving processes.
The LR classifier learning package contains three modules: lr.py, object_json.py, and test.py. The lr module implements the LR classifier through the object logisticRegres, and supports the two solution methods gradAscent ('grad') and randomGradAscent ('randomGrad') (option one: classifierArray stores only one classification solution, though of course you could also define two classifierArrays to support both solutions at once).
The test module uses the LR classifier to predict mortality from hernia (colic) symptoms. There is a problem with this data: 30% of the values are missing, and they are replaced with the special value 0, because 0 does not affect the weight update of the LR classifier.
Partial loss of sample feature values in training data is a tricky issue, and many works are devoted to solving it, because it is too wasteful to throw the data away directly, and the cost of re-acquiring it is expensive. Some optional methods for handling missing data include:
-Fill missing values with the mean of the feature's available values;
-Fill missing values with a special value, such as -1;
-Ignore samples with missing values;
-Fill missing values using the mean value of similar samples;
-Use additional machine learning algorithms to predict missing values.
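As an illustration of the first option, a minimal mean-imputation helper (the function name and the tiny data set are mine; np.nan stands in for the missing-value marker, whereas the package above uses the special value 0):

```python
import numpy as np

def fill_with_feature_mean(data):
    """Replace missing entries (marked np.nan) with the mean of the
    available values in the same feature column."""
    x = np.array(data, dtype=float)
    for j in range(x.shape[1]):
        col = x[:, j]
        missing = np.isnan(col)
        if missing.any() and (~missing).any():
            col[missing] = col[~missing].mean()  # in-place: col is a view
    return x

data = [[1.0, np.nan],
        [3.0, 4.0],
        [np.nan, 8.0]]
filled = fill_with_feature_mean(data)
```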
The LR classifier algorithm learning package:
Machine learning Logistic Regression
(iv) logistic regression application
Main uses of logistic regression:
Finding risk factors: looking for the risk factors of a disease, etc.;
Prediction: according to the model, predicting the probability that a disease or a certain condition will occur under different values of the independent variables;
Discrimination: in fact similar to prediction, also judging from the model the probability that someone has a disease or belongs to a certain condition, i.e. seeing how likely this person is to have the disease.
Logistic regression is mainly used in epidemiology; a common scenario is to explore the risk factors of a disease and predict the probability of its occurrence from those factors. For example, to explore the risk factors of gastric cancer, you can choose two groups of people, one with gastric cancer and one without; the two groups will necessarily differ in signs and lifestyles. Here the dependent variable is whether the person has gastric cancer, i.e. "yes" or "no", and the independent variables can include many factors, such as age, sex, dietary habits, and Helicobacter pylori infection. The independent variables can be either continuous or categorical.
Reference: Solving the maximum of logistic regression by gradient ascent
Machine Learning Classic algorithm and Python implementation---logistic regression (LR) classifier