Logistic regression model of DS&ML_ classification algorithm notes

Other related articles

DS&ML_ related analysis notes
Support vector machine (SVM) model of DS&ML_ classification algorithm notes
Random forest, gradient boosting tree, and XGBoost models of DS&ML_ classification algorithm notes
K-nearest neighbor and k-d tree model of DS&ML_ classification algorithm notes
PCA (principal component analysis) model of DS&ML_ dimensionality reduction algorithm notes
Naive Bayes model of DS&ML_ classification algorithm notes
K-means model of DS&ML_ clustering algorithm notes
Decision tree model of DS&ML_ classification algorithm notes

Summary and collation of relevant knowledge points of the logistic regression model

Briefly describe the idea of logistic regression: what is logistic regression?

Although logistic regression is called regression, it is a classification machine learning algorithm. The principle is to fit the data to a predictive logistic function: the value of the logistic function represents the probability that the label is 1, and the probability of 0 is (1 − the predicted value), thereby predicting the probability that a certain event occurs. In practice the parameters are usually fitted by gradient descent.

Specific algorithm

For a binary classification problem, the output of a linear regression model is not confined to the interval [0, 1], so a function g is introduced and the logistic regression hypothesis is written as h_θ(x) = g(θᵀx), where θᵀx is the linear regression expression and θ is the parameter vector.

And this function g is the well-known sigmoid function, which is also used in neural networks.

g(z) = 1 / (1 + e^(−z))

Combining the two above, we get the mathematical expression of the logistic regression model:

h_θ(x) = 1 / (1 + e^(−θᵀx)), where θ is the parameter vector.
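As a quick sketch, the hypothesis function can be written in R as follows (the function names are my own, not from the original):

    # Sigmoid function g(z) = 1 / (1 + exp(-z))
    sigmoid <- function(z) 1 / (1 + exp(-z))

    # Logistic regression hypothesis h_theta(x) = g(theta' x)
    # X: n-by-p feature matrix (including an intercept column), theta: length-p vector
    hypothesis <- function(theta, X) sigmoid(X %*% theta)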

Decision boundary

For the sigmoid function, g(z) ≥ 0.5 exactly when z ≥ 0. So h_θ(x) = g(θᵀx) ≥ 0.5 means θᵀx ≥ 0, in which case we estimate y = 1; conversely, we predict y = 0 when θᵀx < 0.

So we can regard θᵀx = 0 as a decision boundary: when θᵀx is greater than or less than 0, the logistic regression model predicts the two different classification results.

For example, h_θ(x) = g(θ0 + θ1*x1 + θ2*x2).

If θ0, θ1, θ2 are −3, 1, 1 respectively, then y = 1 when −3 + x1 + x2 ≥ 0, so x1 + x2 = 3 is a decision boundary. (The original post shows this boundary as a figure.)
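A tiny numeric check of this example (my own illustration, using the hypothesis sketched above):

    sigmoid <- function(z) 1 / (1 + exp(-z))
    theta <- c(-3, 1, 1)                 # theta0, theta1, theta2

    # One point on each side of the boundary x1 + x2 = 3
    X <- rbind(c(1, 1, 1),               # x1 + x2 = 2 < 3 -> predict y = 0
               c(1, 2, 2))               # x1 + x2 = 4 > 3 -> predict y = 1
    p <- sigmoid(X %*% theta)
    ifelse(p >= 0.5, 1, 0)               # predicted classes: 0, 1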

The above is just a linear decision boundary; when h_θ(x) is more complex, we can get a nonlinear decision boundary, for example:

h_θ(x) = g(θ0 + θ1*x1 + θ2*x2 + θ3*x1^2 + θ4*x2^2)

Here, with suitable parameters (e.g. θ0 = −1, θ1 = θ2 = 0, θ3 = θ4 = 1), y = 1 when x1^2 + x2^2 ≥ 1, and the decision boundary is a circle. (The original post shows this as a figure.)
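A sketch of this circular boundary (the parameter values here are the conventional textbook choice, not given in the original):

    sigmoid <- function(z) 1 / (1 + exp(-z))
    theta <- c(-1, 0, 0, 1, 1)           # theta0 .. theta4

    # Feature map (1, x1, x2, x1^2, x2^2) gives the boundary x1^2 + x2^2 = 1
    h <- function(x1, x2) sigmoid(sum(theta * c(1, x1, x2, x1^2, x2^2)))
    h(0.5, 0.5)   # inside the circle  -> below 0.5, predict y = 0
    h(1.5, 0.0)   # outside the circle -> above 0.5, predict y = 1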

Theoretically, as long as our h_θ(x) is designed reasonably enough, or to say it precisely, as long as the θᵀx inside g(θᵀx) is sufficiently complex, we can fit different decision boundaries for different cases to separate the different sample points.

Cost function: solving how to choose the right parameters θ so that θᵀx = 0 is a good decision boundary

The cost function measures the difference between the results we estimate under a given set of parameters and the actual results.

If we simply take the linear regression cost, cost(h_θ(x), y) = (1/2) * (h_θ(x) − y)^2, and substitute the h_θ(x) of logistic regression, the cost function becomes "non-convex", which means the function has many local minima. (The original post illustrates this with a figure.)

We want our cost function to be a bowl-shaped convex function, so that the local minimum our algorithm finds is guaranteed to be the global minimum. Therefore, using the cost function of linear regression directly is not feasible for logistic regression; we need another kind of cost function that guarantees the cost of logistic regression is convex.

The loss functions commonly used in statistical learning are as follows (from Li Hang, "Statistical Learning Methods"):

1. 0-1 loss function:
    L(Y, f(X)) = 1 if Y ≠ f(X); 0 if Y = f(X)
2. Quadratic (squared) loss function:
    L(Y, f(X)) = (Y − f(X))^2
3. Absolute loss function:
    L(Y, f(X)) = |Y − f(X)|
4. Logarithmic loss function (log-likelihood loss function):
    L(Y, P(Y|X)) = −log P(Y|X)

The smaller the loss function, the better the model.

Based on the description above, we select the log-likelihood loss as the cost function of logistic regression:

cost(h_θ(x), y) = −log(h_θ(x)) if y = 1; −log(1 − h_θ(x)) if y = 0

Intuitively, if the class is y = 1 and the logistic regression outputs h_θ(x) = 1, then cost = 0; that is, when the predicted value exactly equals the true value the penalty is 0 and does not affect the result.

However, if the class is y = 1 but the logistic regression outputs h_θ(x) → 0, then cost → ∞, so this learning algorithm is given a very large cost penalty.

The same is true for y = 0: if the class is y = 0 but the logistic regression outputs h_θ(x) → 1, then cost → ∞, again giving the learning algorithm a very large cost penalty.

Since the class y can only equal 0 or 1, the cost function can be written compactly as J(θ) = (1/m) * Σ cost(h_θ(x_i), y_i), i = 1…m, which by the maximum likelihood derivation simplifies to J(θ) = −(1/m) * Σ [ y_i * log(h_θ(x_i)) + (1 − y_i) * log(1 − h_θ(x_i)) ]. The next task is to minimize the cost function, since the smaller the loss, the better the model. J(θ) is a function of the variable θ, so we need to find the θ that makes J(θ) smallest. Here we can use gradient descent. Intuitively, gradient descent takes an initial value on the bowl-shaped convex function J(θ) and moves it step by step toward the lowest point; when it reaches the bottom, J(θ) is minimized and that θ is the most appropriate.
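A minimal sketch of this cost in R (the variable names are mine; sigmoid as defined earlier):

    sigmoid <- function(z) 1 / (1 + exp(-z))

    # Cross-entropy cost J(theta) for logistic regression
    # X: n-by-p design matrix, y: vector of 0/1 labels
    cost <- function(theta, X, y) {
      h <- sigmoid(X %*% theta)
      -mean(y * log(h) + (1 - y) * log(1 - h))
    }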

Understood mathematically: to find the minimum point, we move in the direction of steepest descent (the direction of the partial derivatives), take a small step of size α, look again for the direction of steepest descent at the new position, and move in that direction, repeating until we reach the lowest point.

Algorithm description:

S1: Compute the gradient of the loss function at the current position; for J(θ) with parameter vector θ, the gradient expression is ∂J(θ)/∂θ.
S2: Multiply the gradient by the step size α to obtain the descent distance from the current position, i.e. α * ∂J(θ)/∂θ, which determines both the direction and the length of the next step.
S3: Update the θ vector: at each update, θ = θ − α * ∂J(θ)/∂θ. Repeat from S1 until convergence.
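Putting S1-S3 together, a sketch of batch gradient descent in R (assuming X carries an intercept column; the names are my own):

    sigmoid <- function(z) 1 / (1 + exp(-z))

    # Batch gradient descent for logistic regression
    # alpha: step size, iters: number of iterations
    gradient_descent <- function(X, y, alpha = 0.1, iters = 5000) {
      theta <- rep(0, ncol(X))
      for (i in seq_len(iters)) {
        h <- sigmoid(X %*% theta)            # current predictions
        grad <- t(X) %*% (h - y) / nrow(X)   # S1: gradient of J(theta)
        theta <- theta - alpha * grad        # S2 + S3: step and update
      }
      as.vector(theta)
    }

For example, gradient_descent(cbind(1, x1, x2), y) would fit the two-feature model discussed above.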

Selection of the step size α

If α is too small, gradient descent may converge slowly; if it is too large, the descent may "overshoot" the minimum point, fail to converge, and even diverge. As to whether the step size should be decreased dynamically, the answer is no: when the function approaches a local minimum, the gradient itself becomes small, so gradient descent automatically takes smaller steps, and there is no need to reduce the learning rate over time.

Besides gradient descent, optimization algorithms include Newton's method, quasi-Newton methods, and so on.

Like gradient descent, Newton's method and quasi-Newton methods are iterative solvers, but gradient descent iterates on the gradient, while the Newton/quasi-Newton methods solve with the inverse (or pseudo-inverse) of the second-order Hessian matrix. The Newton/quasi-Newton methods converge faster and do not require manually choosing a step size, but each iteration takes longer than a gradient descent step.

Advantages

Fast, and suitable for binary classification problems. Simple and easy to understand: the weight of each feature can be read directly (that is, the effect of the corresponding independent variable on the dependent variable). The model can easily be updated to absorb new data.

Disadvantages

The dependent variable must be a categorical variable, not a continuous one; adaptability to data and scenarios is limited, less than that of decision tree algorithms.

The difference between LR and the Naive Bayes algorithm

Variable type
LR places no requirements on variable types, but it is common to discretize continuous variables first.
Discrete features are easy to add and remove, which makes model iteration fast; sparse vectors compute quickly, and the results are convenient to store and easy to scale. Discretized features are also robust to abnormal data: for example, if a feature is "age > 30 is 1, otherwise 0", then without discretization an abnormal record such as "age 300" would cause a large disturbance to the model. Discretization also simplifies the logistic regression model and reduces the risk of overfitting. Naive Bayes, by contrast, requires variables to be as discrete as possible, because computing conditional probabilities for continuous variables is troublesome.
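The age example as a short sketch in R (the data frame and its values are hypothetical):

    # Hypothetical data: one abnormal record ("age 300") among normal ages
    df <- data.frame(age = c(22, 35, 41, 300))

    # Discretize: age > 30 becomes 1, otherwise 0; the outlier now
    # contributes the same feature value as any other age over 30
    df$age_gt_30 <- as.integer(df$age > 30)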
Naive Bayes rests on the class-conditional independence assumption, while LR's restrictions are much looser. If the data satisfy the conditional independence assumption, LR can still achieve very good results; and when the data do not satisfy it, LR can still adjust its parameters to maximize the model's fit to the data distribution, thus training an optimal model on the existing data set.

Model training mode
Naive Bayes uses conditional probabilities: it is a probabilistic generative model that models the joint probability, estimating the true distribution of the entire sample space and obtaining the conditional probability distribution by looking for regularities in the data. LR uses gradient descent: it is a probabilistic discriminative model that estimates the conditional probability distribution directly from the data; given test input, the model outputs the label directly, distinguishing categories by the differences between classes of data rather than by modeling each class (a distinction due to Andrew Ng).

Data set size
When the data set is small, choose Naive Bayes: it is a generative model, and with a prior it can fit the data better. When the data set is large, choose LR: it is a discriminative model, target-driven, which does not model the joint probability but predicts the output directly from the training data, so with enough data it can achieve better results.

Sensitivity to extreme and missing values
Logistic regression is sensitive to extreme values and missing values, because all samples interact with each other in the final model. Naive Bayes is insensitive to missing and extreme values, because it is a probabilistic generative model whose results come from an estimate of the overall distribution.
Naive Bayes presents a 0/1 result, while LR presents a logistic curve (a probability).

The difference between LR and the decision tree algorithm

Algorithmic logic
Logistic regression classifies according to whether the argument of the logistic function is greater than or less than 0, that is, according to the decision boundary; the decision tree makes a split on each feature.

Sensitivity to extreme and missing values
Logistic regression is sensitive to extreme values and missing values, because all samples interact with each other in the final model. The decision tree is not: the decision tree algorithm processes numerical variables in segments, with the consequence that it is insensitive to the information contained in extreme values; in particular, when extreme values really do contain important information, that information will likely not be reflected in the model.

Nonlinear vs. linear relations
The decision tree suits nonlinear relations better, while logistic regression suits linear relations. Logistic regression can only find linear partitions (the input feature x is linear in the logit, unless x is mapped into higher dimensions), while the decision tree can find nonlinear partitions. Logistic regression is better than the decision tree at analyzing the overall structure of the data, while the decision tree is better than logistic regression at analyzing local structure.

Accumulation of some fragmentary knowledge points

R's built-in stats package provides glm(formula, data, family = binomial) to implement logistic regression.
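A usage sketch (the data frame df and the columns y, x1, x2 are hypothetical):

    # Fit a logistic regression with the built-in stats package
    fit <- glm(y ~ x1 + x2, data = df, family = binomial)
    summary(fit)                      # coefficients on the log-odds scale
    predict(fit, type = "response")   # predicted probabilities h_theta(x)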

References

[1] http://52opencourse.com/125/coursera — Coursera public course notes, Stanford machine learning, lesson 6: Logistic Regression
[2] https://blog.csdn.net/han_xiaoyang/article/details/49123419
[3] https://www.cnblogs.com/pinard/p/5970503.html

(Copyright©http://blog.csdn.net/s_gy_zetrov. All rights Reserved)
