QQ exchange group: 127591054
Jackchiang, QQ: 595696297. Everyone is welcome to get in touch and discuss.
Author's background: I graduated in July 2017. At the end of 2016 I interned as a DBA, then changed direction midway; I want data mining to be my long-term career, i.e. my career plan is firmly positioned on data mining. I would rather be doing data analysis, but there was no way in, because the bar for getting into mining is really, really high. Right after graduating I quit my telecom job and went on one of those leave-right-now trips, alone... to Yunnan, with only 200 yuan on me (before anyone asks, it was a cheap Fliggy group tour). Yunnan is beautiful. I won't post pictures.
Coming back and starting to look for a job was really hard. Hard. Really hard; basically nobody wanted me. That went on for a whole month, and it was a rough month: the credit card was 3,000 in the red, and so on. I calmed down and thought about why I had chosen this path, and why nobody wanted me.
I figured it came down to this:
1. Mining needs a lot of theoretical knowledge. I had learned it and forgotten it, which is as good as zero.
2. I had no hands-on analysis experience, so in front of an interviewer I was a complete blank.
3. If I were the interviewer and saw someone like me, I wouldn't hire him either. As one interviewer put it: "Your interest sounds sincere, but you haven't actually put in the effort."
I thought about that sentence for a long time, and stopped chasing mining jobs directly.
It was also the reason why, after searching for so long, I never even got a phone interview. Then by chance I came across ETL, which in plain terms is cleaning data and handing it on for mining and analysis. Fine: start one level upstream and work my way into mining or machine learning slowly. That was my thinking a few months ago.
Maybe it was luck. A company I had turned down earlier came back, asked me in, and took me on. My lead there is great, but... she is really fierce with me. Really fierce. She feels like my old homeroom teacher. She's a Sichuan girl, 29 years old, and not bad-looking either...
She asked about my plans and told me to do ETL first; at my current level, I should work my way into mining slowly. Fair enough. She introduced me to the manager who runs the mining work at the company, and he guides my study. From the very start he made it clear that the thing to learn is math, and math is what I'm learning now. He assigned me a task: understand "logistic regression" within two weeks, studying outside working hours, because my ETL project keeps me busy on site. So I wrote up the notes below and walked him through them, and then he told me to keep studying math. That is the background of how I fell into the data mining / machine learning pit...
I moved this article over from my Word notes, which was a pain; if you want to read the Word version you can download it from the link below.
http://download.csdn.net/download/jack__chiang/9970290
This first article is very shallow: excerpts from all over, plus my own understanding, implemented in Python. There is still plenty to be gained from it.
I hope it helps someone, and I hope that by the end of 2017 I can officially get into the mining pit. Wish me luck.

1. Concept
Logistic regression is abbreviated LR. It is the most widely used automatic classification algorithm in the Internet domain: from a stand-alone spam-filtering program to an online advertising system that needs hundreds of machines, the algorithmic backbone is LR. (A note of caution: if you search Baidu for "LR algorithm" you will also find explanations of LR analysis, a bottom-up syntax-analysis technique for context-free grammars used in compilers; that is an unrelated topic that merely shares the abbreviation.)

2. An example
In everyday work and study we constantly run into decision problems: is this email spam or not, is this user interested in a certain product or not, should this house be bought or not. When we want a machine to learn to answer such questions, the first step is to build a program, usually called a classifier, that makes the decision. The input of this program is a set of features describing the case to be decided, and the output is the decision result. Take spam classification as an example: each email is a case to be decided, and the features are pieces of information from the email that we think might be relevant, such as the sender, the length of the message, the time it was sent, keywords in the body, the punctuation, whether there are multiple recipients, and so on. Given these features, the spam classifier decides whether the email is spam. As for how we obtain this classification program, the usual approach is some machine learning algorithm. It is called learning because such algorithms usually need a set of already labelled samples, say 100 emails each marked as spam or not, from which the algorithm automatically produces a classifier for the problem. And logistic regression is the most commonly used classification algorithm in machine learning.
The LR model is simple in principle, has a ready-made tool library (Liblinear), is easy to get started with, and works well in practice.
LR can fairly be called the most commonly used and most influential classification algorithm on the Internet; it is the basic algorithm behind the CTR models of almost all advertising and recommendation systems.
LR is also the basic unit of the currently hot field of deep learning, so a solid grasp of LR also helps with learning deep learning.

3. The three stages of learning logistic regression
1. Preparation for understanding the LR model
1) First, review maximum likelihood estimation, learned at university and long since forgotten.
Concept: maximum likelihood estimation uses the sample values (x1, x2, ..., xi, ..., xn) to estimate the parameters (θ1, θ2, ..., θi, ..., θn) of the model assumed to have generated the sample. It is a method of parameter estimation.
Here "likelihood" is short for the likelihood function, which can be written L(x1, x2, ..., xn; θ1, θ2, ..., θn). In practice, once a sample has been collected one generally assumes a model that fits it; the number of parameters is then fixed and their values are what we want to find. For example, given height data for a class, it is natural to assume that height follows a normal distribution N(μ, σ²); the task is then to find the specific values of these two parameters.
So, how do we obtain the specific parameter values of the model from the existing sample data?
Maximum likelihood estimation says: the parameter values θ1, θ2, ..., θn that make L(x1, x2, ..., xn; θ1, θ2, ..., θn) as large as possible are the right estimates. In the height example (assuming the height samples are independent and identically distributed), this means that when L(x1, x2, ..., xn; μ, σ) is largest, the estimate (μ, σ) is the most plausible one.
The general steps of maximum likelihood estimation:
(1) write down the likelihood function;
(2) take the logarithm of the likelihood function and simplify;
(3) differentiate with respect to the parameters;
(4) set the derivative to zero; the parameter values that solve this give the maximum of the likelihood function.
The following example illustrates the idea and method of maximum likelihood estimation.
Suppose a bag contains black and white balls. Let p be the probability of drawing a white ball at random from the bag; we want to estimate the value of p.
According to the problem, define the population X as
X = 1 if a white ball is drawn, and X = 0 if a black ball is drawn.
Then X follows the 0-1 distribution B(1, p), where P(X=1) = p and P(X=0) = 1 − p.
To estimate p, we draw a ball with replacement 10 times. The result of the i-th draw can be represented by the random variable
Xi = 1 if a white ball is drawn, and Xi = 0 if a black ball is drawn, for i = 1, 2, ..., 10.
Then X1, X2, ..., X10 is a sample from the population X. Suppose the 10 draws give the observed values (x1, x2, ..., x10) = (1, 0, 1, 0, 0, 0, 1, 0, 0, 0); then the likelihood function is
L(p) = P(X1=1, X2=0, X3=1, X4=0, X5=0, X6=0, X7=1, X8=0, X9=0, X10=0) = p^3 (1 − p)^7.
That is, L(p) = p^3 (1 − p)^7 is the probability that the observed values (1, 0, 1, 0, 0, 0, 1, 0, 0, 0) occur in these 10 draws.
The idea of the maximum likelihood estimation method
A random experiment has many possible outcomes. If a particular outcome occurs in a single trial, then, by the principle that small-probability events rarely happen, we naturally believe this outcome has a comparatively large probability; it should be among the most probable of all outcomes. Therefore p should be estimated by choosing the value p̂ that makes the observed outcome most probable, i.e. L(p̂) is the maximum of L(p). The maximum point p̂ of L(p) is obtained by solving the equation d ln L(p)/dp = 0.
Here ln L(p) = 3 ln p + 7 ln(1 − p), so d ln L(p)/dp = 3/p − 7/(1 − p) = 0 gives p̂ = 3/10 = 0.3, and L(0.3) = max L(p). It is therefore appropriate to take p̂ = 0.3 as the estimate of the probability of drawing a white ball at random.
In general, for a discrete population, the maximum likelihood estimator θ̂ is obtained in the same way: it is the solution that maximizes the likelihood of the observed sample.
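As a quick numerical sanity check of the ball example (my own illustration, not from the original notes), the sketch below scans p over a grid and confirms that L(p) = p^3 (1 − p)^7 peaks at p = 0.3:

```python
# numerically confirm that the likelihood of the ball example peaks at p = 0.3
def likelihood(p):
    return p ** 3 * (1 - p) ** 7

grid = [i / 1000 for i in range(1, 1000)]   # candidate values of p in (0, 1)
p_hat = max(grid, key=likelihood)
print(p_hat)  # 0.3 (up to the grid resolution)
```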
2) Supplement: directional derivatives and the gradient
In many problems we need to know not only the rate of change of a function along the coordinate axes (i.e. the partial derivatives) but also its rate of change along some other specified direction. This is the directional derivative discussed next.
Definition 1: let the three-variable function f be defined in a neighborhood U(P0) of the point P0 = (x0, y0, z0), let l be a ray starting from P0, let P = (x, y, z) be any point on l lying inside U(P0), and let ρ denote the distance between P and P0. If the limit
lim (ρ→0+) [f(P) − f(P0)] / ρ
exists, that limit is called the directional derivative of f at P0 along the direction l, written ∂f/∂l at P0.
The relationship between the directional derivative and the partial derivatives along the coordinate axes is given by a standard theorem. (Cutting every formula out of the Word document is a pain, so I pasted screenshots there rather than retyping them.)
Put simply, the directional derivative is the sum of the partial derivative with respect to each coordinate variable multiplied by the direction cosine of that coordinate direction at the point: ∂f/∂l = fx cos α + fy cos β + fz cos γ.
Supplement: direction cosines
Definition: in analytic geometry, the three direction cosines of a vector are the cosines of the angles between the vector and the three coordinate axes.
3) The gradient descent algorithm
Concept: gradient descent is an optimization algorithm, also commonly called the steepest descent method. Steepest descent is one of the simplest and oldest methods for unconstrained optimization; although it is rarely used on its own any more, many effective algorithms are improvements or modifications built on it. Steepest descent uses the negative gradient direction as the search direction; the closer it gets to the target value, the smaller the step and the slower the progress.
Stochastic gradient descent
Gradient descent is the process of minimizing a function by following the gradient of the cost function.
This requires knowing the form of the cost function and its derivative, so that from any given point we can compute the gradient and move in the downhill direction, e.g. down the slope, until we reach a minimum.
It is commonly used in machine learning and artificial intelligence to iteratively approximate a minimum-error model.
First be clear that, for a multivariate function, the gradient direction is the direction in which the function value increases fastest. Specialized to a function of one variable, the gradient direction is simply along the tangent of the curve, in the direction in which the function increases. For functions of two or more variables, the gradient vector consists of the partial derivatives of f with respect to each variable; its direction is the gradient direction and its magnitude is the size of the gradient.
Now suppose we want to find the minimum of a function using gradient descent, as illustrated in the figure:
The basic idea of gradient descent is quite simple. Suppose we want the minimum of a function f. First choose an initial point; then produce the next point by moving from the current point along the gradient line, here in the direction opposite to the gradient (because we seek a minimum; to seek a maximum we would move along the gradient direction). The iteration of gradient descent is x(k+1) = x(k) − α ∇f(x(k)), where α is the step size.
Under normal circumstances, a zero gradient vector means we have reached an extreme point, where the magnitude of the gradient is 0. So when using gradient descent to optimize a solution, the termination condition of the algorithm is that the magnitude of the gradient vector is close to 0; a very small constant can be set as the threshold.
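To make the iteration concrete, here is a minimal one-dimensional sketch (my own illustration, assuming f(x) = (x − 3)² purely for demonstration) that stops when the gradient magnitude falls below a small threshold:

```python
# minimize f(x) = (x - 3)^2 by gradient descent
def grad(x):
    return 2 * (x - 3)            # derivative of (x - 3)^2

x = 0.0                           # initial point
alpha = 0.1                       # step size (learning rate)
while abs(grad(x)) > 1e-8:        # terminate when the gradient magnitude is near 0
    x = x - alpha * grad(x)       # x_{k+1} = x_k - alpha * f'(x_k)
print(x)                          # converges to 3.0
```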
In machine learning, we can use a technique that evaluates and updates the coefficients after every iteration; this technique, called stochastic gradient descent, minimizes the training error of the model.

2. The problem
In actual work we may run into problems such as:
1) predicting whether a user will click on a specific product;
2) determining a user's gender;
3) predicting whether a user will buy a given category;
4) determining whether a comment is positive or negative.
All of these can be regarded as classification problems; more precisely, as binary (two-class) classification problems.

3. The logistic model
Before introducing the logistic regression model, we first introduce the sigmoid function, whose mathematical form is g(z) = 1 / (1 + e^(−z)).
The corresponding function curve is shown in the following figure:
From the figure you can see that the sigmoid function is an S-shaped curve whose values lie in (0, 1), and that as the input moves away from 0 the function value quickly approaches 0 or 1. This property lets us interpret the output in a probabilistic way (the extension below briefly discusses why using this function for a probability model is reasonable).
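A tiny sketch of the function (my own illustration) makes the saturation behaviour easy to see:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

for z in [-6, -2, 0, 2, 6]:
    print(z, round(sigmoid(z), 4))
# near 0 for very negative z, exactly 0.5 at z = 0, near 1 for very positive z
```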
Decision function
A machine learning model essentially restricts the decision function to a certain set of hypotheses; this set of restrictions determines the model's hypothesis space. Naturally we also hope the restrictions are simple and reasonable. The hypothesis made by the logistic regression model is
P(y=1 | x; θ) = g(θᵀx),
where g(·) is the sigmoid function mentioned above, and the corresponding decision function is
ŷ = 1 if P(y=1 | x) > 0.5, otherwise ŷ = 0.
Choosing 0.5 as the threshold is just the common default; in practice a different threshold can be chosen for the specific situation: if precision on predicted positives matters more, choose a larger threshold; if recall of positives matters more, choose a smaller threshold.
Once the mathematical form of the model is fixed, what remains is how to solve for the model's parameters. A method commonly used in statistics is maximum likelihood estimation: find a set of parameters under which the likelihood (probability) of our observed data is as large as possible. In the logistic regression model the likelihood can be written as shown below.
In the logistic regression model, maximizing the likelihood function and minimizing the log loss function are in fact equivalent. This optimization problem has many solution methods; gradient descent is used here as the example. Gradient descent, also called steepest descent, is an iterative method: at each step it chooses a direction in which to adjust the parameters of the objective function, gradually approaching the optimal value.
The gradient of the loss function is computed as shown below.
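Since the formulas themselves are not reproduced as text here, these are the standard forms for reference, writing hθ(x) = g(θᵀx) for the model output and (x⁽ⁱ⁾, y⁽ⁱ⁾), i = 1..N, for the training examples:

```latex
% likelihood of the training data under the logistic regression model
L(\theta) = \prod_{i=1}^{N} h_\theta\bigl(x^{(i)}\bigr)^{y^{(i)}} \bigl(1 - h_\theta(x^{(i)})\bigr)^{1 - y^{(i)}}

% equivalent log loss (negative log-likelihood) to be minimized
J(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \Bigl[ y^{(i)} \log h_\theta(x^{(i)}) + \bigl(1 - y^{(i)}\bigr) \log\bigl(1 - h_\theta(x^{(i)})\bigr) \Bigr]

% gradient of the log loss and the gradient-descent update with step size \alpha
\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{N} \sum_{i=1}^{N} \bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)\, x_j^{(i)},
\qquad
\theta_j \leftarrow \theta_j - \alpha \frac{\partial J(\theta)}{\partial \theta_j}
```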
Choosing a sufficiently small step along the negative gradient direction guarantees that the loss function decreases. Moreover, the loss function of logistic regression is convex (and strictly convex once the regularization term is added), so the local optimum we find is also the global optimum. The commonly used convex optimization methods can also be applied to this problem, e.g. conjugate gradient, Newton's method, L-BFGS, and so on.
Classification boundary
Now that we know how to solve for the parameters, let's look at the final form of the model. From the sigmoid function it is easy to see that y = 1 when θᵀx > 0 and y = 0 otherwise. θᵀx = 0 is the classification plane implied by the model (in a high-dimensional space, a hyperplane). So logistic regression is essentially a linear model; but this does not mean that only linearly separable data can be handled by LR. In fact, we can map the low-dimensional space into a higher-dimensional one via feature transformations, and data that is not linearly separable in the low-dimensional space has a higher chance of becoming linearly separable in the high-dimensional space. The comparison of the following two figures shows a linear classification boundary and a non-linear classification boundary (obtained through feature mapping).
Regularization
When the model has too many parameters it easily overfits. We then need a way to control model complexity. The typical approach is to add a regularization term to the optimization objective, penalizing overly large parameters to prevent overfitting:
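For example, with an L2 penalty the regularized objective takes the standard form below (λ controls the penalty strength; an L1 penalty λ‖θ‖₁ is the other common choice):

```latex
J(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \Bigl[ y^{(i)} \log h_\theta(x^{(i)}) + \bigl(1 - y^{(i)}\bigr) \log\bigl(1 - h_\theta(x^{(i)})\bigr) \Bigr] + \lambda \lVert \theta \rVert_2^2
```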
Logistic regression is a discriminative model: it models the conditional probability P(y|x) directly and does not care about the underlying data distribution P(x, y). By contrast, Gaussian Naive Bayes is a generative model: it models the joint distribution of the data and then computes the posterior probability of a sample via Bayes' formula, namely P(y|x) = P(x|y) P(y) / P(x).
Multi-class classification (Softmax)
If y takes values not in {0, 1} but in one of K categories, the problem becomes multi-class classification. There are two ways to handle this type of problem. One is to train a binary classifier for each category (one-vs-all), which is appropriate when the K classes are not mutually exclusive, for example which categories a user might buy from. If the K classes are mutually exclusive, i.e. y = i means y cannot take any other value, for example the user's age group, then softmax regression is more appropriate. Softmax regression is the direct generalization of logistic regression to multiple classes, and the corresponding model can also be called multinomial logistic regression. The model is built on the softmax function and has the following form:
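In standard notation (one parameter vector θ_k per class) the model is:

```latex
P(y = k \mid x; \theta) = \frac{\exp(\theta_k^{\mathsf T} x)}{\sum_{j=1}^{K} \exp(\theta_j^{\mathsf T} x)}, \qquad k = 1, \dots, K
```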
Similarly, we can solve the problem by gradient descent or other higher-order methods, which we will not repeat here.
Application
The first part of this article mentioned several problems encountered in practice. Here, predicting a user's purchase preference over categories is used as an example to describe how Meituan uses logistic regression to solve problems at work. This problem can be converted into predicting whether a user will buy a certain category within a certain future time window; marking "will buy" as 1 and "will not buy" as 0 turns it into a binary classification problem. We use features built from the user's browsing, purchasing and other historical behaviour on Meituan; see the table below.
The time span of the extracted features is 30 days, and the label window is 2 days. The generated training data is on the order of 70 million rows (users with any behaviour on Meituan within the month). We aggregate similar small categories and end up with 18 fairly typical category groups. If a user buys within a category group within the given time window, it is a positive example. Using this training data, a binary classification model is trained for each category group with the Spark implementation of the LR algorithm; with the number of iterations set to 100, training the models takes about 40 minutes in total, an average of about 2 minutes per model, and the AUC on the test set is mostly above 0.8. The trained models are saved and used to predict the purchase probability on each category group, and the predictions are used in recommendation and other scenarios.
Because the ratio of positive to negative examples differs across categories, and for some categories it is very imbalanced, we also tried different sampling methods; the ultimate goal is to improve the order (conversion) rate and other online metrics. After some parameter tuning, the category-preference features brought an order-rate lift of more than 1% for recommendation and ranking.
In addition, because the LR model is simple, efficient and easy to implement, it provides a good baseline for subsequent model optimization; we also use the LR model in the ranking service.
Summary
The mathematical model and solution of logistic regression are relatively concise and simple to implement. Through feature discretization and other mappings, logistic regression can also handle non-linear problems, making it a very powerful classifier. Therefore, in practical applications, when we can obtain many low-level features, we can consider using logistic regression to solve our problem.
Points worth digesting: the gradient descent method, maximum likelihood estimation, convex functions, the differences between the optimization methods mentioned above, and the log loss function.

4. Implementing logistic regression with gradient descent in Python
The logistic regression algorithm is named after its core function, the logistic function. The expression of logistic regression is an equation very much like linear regression: the input values (X) are combined linearly with weights (coefficients) to predict an output value (y).
The example below uses the Pima Indians diabetes dataset.
1) How to make predictions with a logistic regression model.
The key difference from linear regression is that the output value being modelled is binary, 0 or 1, rather than a continuous numeric value; as in linear regression, the input value x predicts the output value y through a linear combination of weights.
For a single input variable the prediction takes the form ŷ = 1 / (1 + e^(−(b0 + b1·x1))), where e is the base of the natural logarithm (Euler's number), ŷ is the predicted value, b0 is the bias or intercept term, and b1 is the coefficient of the single input variable x1.
The predicted ŷ is a real number between 0 and 1; it needs to be rounded to an integer value and thereby mapped to a predicted class value.
Each column in the input data has an associated coefficient b (a constant real value), and these coefficients are what training has to learn. The final model, stored in memory or in a file, is really just the coefficients in the equation (the betas, or b values).
The coefficients of the logistic regression algorithm must be estimated from the training set.
Stochastic gradient descent
Gradient descent is the process of minimizing a function by following the gradient of the cost function.
This requires knowing the form of the cost function and its derivative, so that from any given point the gradient can be computed and we can move in that direction, for example down the slope, until the minimum value is reached.
In machine learning, we can use a technique that evaluates and updates the coefficients after every iteration, called stochastic gradient descent, which minimizes the training error of the model.
Each training sample is fed into the model one at a time; the model makes a prediction for the sample, the error is calculated, and the model is updated so as to reduce the error on the next prediction.
This procedure can find the set of coefficients with the smallest training error. At each iteration, the coefficient b is updated using the equation shown below, where b is the coefficient (weight) being optimized, learning_rate is a learning rate that must be chosen (for example 0.01), (y − ŷ) is the model's prediction error attributable to the weights, ŷ is the prediction made with the current coefficients, and x is the input value.
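Written out, a sketch of the update this style of tutorial typically uses (the ŷ(1 − ŷ) factor comes from the derivative of the sigmoid):

```latex
b \leftarrow b + \text{learning\_rate} \times (y - \hat{y}) \times \hat{y} \times (1 - \hat{y}) \times x
```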
Pima Indians Diabetes Data Set
The Pima Indians dataset involves predicting, from basic medical details, whether a Pima Indian will develop diabetes within 5 years.
It is a binary classification problem, in which the prediction is either 0 (no diabetes) or 1 (diabetes).
It contains 768 rows and 9 columns. All values are numeric and contain floating-point values (float). The following example shows the structure of the first few lines of data.
Data address: https://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes
The first step is to develop a function that can make predictions.
This prediction function is needed both when evaluating candidate coefficient values during stochastic gradient descent and, once the final model is determined, when making predictions on the test set.
Below is a function named predict() that, given a set of coefficients, predicts an output value for each row.
The first coefficient is always the intercept, also called the bias or b0, because it stands alone and is not multiplied by any input value.
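Since the code itself is not reproduced as text in these notes, the following is a minimal sketch of such a predict() function; it assumes each row's last column is the class label and the remaining columns are the inputs:

```python
from math import exp

def predict(row, coefficients):
    # start from the intercept / bias term b0
    yhat = coefficients[0]
    # add coefficient * input for every input column (the last column is the label)
    for i in range(len(row) - 1):
        yhat += coefficients[i + 1] * row[i]
    # squash the linear combination through the logistic (sigmoid) function
    return 1.0 / (1.0 + exp(-yhat))
```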
We can use a small amount of made-up data to test this function.
A scatter plot of this small dataset shows its distribution, with different colours representing the two classes.
First use this small dataset to test the predict() function, as in the sketch below.
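The numbers below are made up purely for demonstration (they are not necessarily the values from the original notes); any small two-feature, two-class dataset with hand-picked coefficients would do:

```python
# small made-up dataset: [x1, x2, class]
dataset = [
    [2.78, 2.55, 0], [1.47, 2.36, 0], [3.40, 4.40, 0], [1.39, 1.85, 0], [3.06, 3.01, 0],
    [7.63, 2.76, 1], [5.33, 2.09, 1], [6.92, 1.77, 1], [8.68, -0.24, 1], [7.67, 3.51, 1],
]
coefficients = [-0.41, 0.85, -1.10]  # hand-picked illustrative values: [b0, b1, b2]
for row in dataset:
    yhat = predict(row, coefficients)
    print("Expected=%d, Predicted=%.3f [%d]" % (row[-1], yhat, round(yhat)))
```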
Running this, the predicted probabilities come out reasonably close to the expected output values (y), and rounding them yields the correct class for every row.
Now we are ready to implement the stochastic gradient descent algorithm to optimize the coefficient value.
2) How to estimate the coefficients with stochastic gradient descent.
We can estimate the coefficients for the training set using stochastic gradient descent.
Stochastic gradient descent requires two parameters:
Learning rate (learning_rate): limits the amount by which each coefficient is corrected each time it is updated.
Number of epochs (n_epoch): the number of passes over the training data made while updating the coefficients.
The function contains three nested loops:
1) a loop over each epoch;
2) within an epoch, a loop over each row of the training set;
3) for each row, a loop that updates each coefficient.
Thus, in every epoch we update every coefficient for every row of the training data. The size of each coefficient update is based on the model's training error, computed as the difference between the expected output value (the true label) and the prediction made with the current estimated coefficients.
Each input attribute has its own coefficient, and these are updated consistently throughout the iterations, for example: b1 = b1 + learning_rate × (y − ŷ) × ŷ × (1 − ŷ) × x1.
The special coefficient at the start of the list, also known as the intercept, is updated in a similar way, except that it has no associated input value: b0 = b0 + learning_rate × (y − ŷ) × ŷ × (1 − ŷ).
Now we can put it all together. Below is a function named coefficients_sgd() that computes coefficient values for a training set using stochastic gradient descent.
In addition, in each epoch we accumulate the sum of squared errors (SSE, a positive value) so that a progress line can be printed for each pass of the outer loop, as in the sketch below.
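As with predict(), the original code is not reproduced as text here, so this is a minimal sketch of what such a coefficients_sgd() function looks like, following the update rule described above:

```python
def coefficients_sgd(train, l_rate, n_epoch):
    # one coefficient per input column plus the intercept b0, all starting at 0
    coef = [0.0 for _ in range(len(train[0]))]
    for epoch in range(n_epoch):
        sum_error = 0.0
        for row in train:
            yhat = predict(row, coef)
            error = row[-1] - yhat
            sum_error += error ** 2
            # update the intercept (it has no associated input value)
            coef[0] = coef[0] + l_rate * error * yhat * (1.0 - yhat)
            # update one coefficient per input feature
            for i in range(len(row) - 1):
                coef[i + 1] = coef[i + 1] + l_rate * error * yhat * (1.0 - yhat) * row[i]
        print(">epoch=%d, lrate=%.3f, error=%.3f" % (epoch, l_rate, sum_error))
    return coef
```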
We can test this function with the small data above.
We use a relatively large learning rate of 0.3 and train for 100 epochs, i.e. the coefficients are updated over the whole dataset 100 times, for example:
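A possible call, reusing the toy dataset defined earlier:

```python
l_rate = 0.3
n_epoch = 100
coef = coefficients_sgd(dataset, l_rate, n_epoch)
print(coef)
```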
Running the code prints, for each epoch, that epoch's sum of squared errors, and finally prints the set of coefficients determined by the last iteration.
You can see that the error keeps dropping even in the final epoch. We could probably train for longer (more epochs) or increase the amount the coefficients are updated in each epoch (a higher learning rate).
Now, let's apply the algorithm to the actual dataset.
3) How to apply logistic regression to a real prediction problem.
In this section we will train a logistic regression model on the diabetes dataset using stochastic gradient descent.
The example assumes that a CSV copy of the dataset is in the current working directory under the file name pima-indians-diabetes.csv.
The dataset is loaded first, the string values are converted to numbers, and each column is normalized to values between 0 and 1. This is done with the helper functions load_csv() and str_column_to_float() to load and prepare the dataset, and dataset_minmax() and normalize_dataset() to normalize it.
We will use k-fold cross-validation to estimate how well the learned model performs on unseen data. This means we will build and evaluate k models and use the mean of their performance as the evaluation score of the model. Classification accuracy is used to evaluate each model. These steps are provided by the helper functions cross_validation_split(), accuracy_metric(), and evaluate_algorithm().
We will use the predict() and coefficients_sgd() functions created above, together with a new logistic_regression() function, to train the model; a sketch follows below.
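A minimal sketch of what that logistic_regression() wrapper might look like (an illustration under the naming assumptions above, not the original code):

```python
def logistic_regression(train, test, l_rate, n_epoch):
    # fit coefficients on the training fold, then predict class labels for the test fold
    coef = coefficients_sgd(train, l_rate, n_epoch)
    predictions = []
    for row in test:
        yhat = predict(row, coef)
        predictions.append(round(yhat))
    return predictions

# the assumed evaluate_algorithm() helper would then be called along these lines:
# scores = evaluate_algorithm(dataset, logistic_regression, n_folds=5, l_rate=0.1, n_epoch=100)
```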
The k for k-fold cross-validation is 5, so each fold is evaluated on 768/5 = 153.6, i.e. just over 150 records. A learning rate of 0.1 and 100 training epochs were chosen by experimentation.
You can try other settings to see if the model evaluates to a better score than mine.
Running this sample code prints the score of each of the 5 cross-validation folds, and finally prints the mean classification accuracy.
As you can see, the accuracy of the algorithm is about 77%; if we used the Zero Rule algorithm of always predicting the majority class, the baseline would be about 65%, so this algorithm's accuracy is higher than the baseline.