Next, we discuss the naive Bayes model, linear regression, multiple regression, and logistic regression.
4. Naive Bayes model
The lookup table model is simple and effective, but it has a problem: as the number of inputs increases, the number of training samples in each cell drops rapidly. If there are 2 dimensions and each dimension has 10 distinct values, 100 cells are required; with 3 dimensions, 1,000 cells; with 4 dimensions, 10,000. This exponential growth becomes a significant bottleneck even for traditional data mining.
The naive Bayes model provides a way around this when the goal is to predict a probability. The basic idea: each input variable by itself carries part of the information needed for prediction. For example, suppose the target is the probability that a customer cancels, and the explanatory variables are market, acquisition channel, initial credit rating, rate plan, phone number type, handset model, and customer age. Each of these variables has some predictive power, and each can be divided into ranges based on significant differences in the cancellation rate.
A simple way to understand it: conditional probability is the probability of A given B, and likewise the probability of B given A.

Bayes' rule says that the probability of A given B equals the probability of B given A, multiplied by the ratio of the overall probability of A to the overall probability of B:

P(A|B) = P(B|A) × P(A) / P(B)

Suppose A means a customer stops renewing and B means the customer uses a BlackBerry phone. Then the probability that a BlackBerry user stops renewing equals the proportion of BlackBerry users among customers who stopped renewing, multiplied by the ratio of the overall stop-renewal rate to the overall proportion of BlackBerry users.
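To make this concrete, here is a minimal sketch in Python; all the numbers are made up for illustration only:

import numpy as np  # not required here, but used in later sketches

# Bayes' rule with hypothetical numbers.
p_churn = 0.10            # P(A): overall probability of stopping renewal
p_blackberry = 0.20       # P(B): overall share of BlackBerry users
p_bb_given_churn = 0.30   # P(B|A): BlackBerry share among churned customers

# P(A|B) = P(B|A) * P(A) / P(B)
p_churn_given_bb = p_bb_given_churn * p_churn / p_blackberry
print(p_churn_given_bb)   # 0.15: BlackBerry users stop renewing more often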
4.1 Probability, odds, and likelihood
· Probability: a number between 0 and 1 indicating how likely a particular outcome is. One way to estimate the probability of an outcome is to calculate its percentage in the sample data.
· Odds: the ratio of the probability that a particular outcome occurs to the probability that it does not. If the probability of an event is 0.2, the probability of it not occurring is 0.8, so the odds are 0.2/0.8 = 1/4. Odds range from 0 to infinity.
· Likelihood: the ratio of two related conditional probabilities, that is, the ratio of the probability of A occurring given B to the probability of A occurring given not-B.
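The sketch below, with assumed values, shows how these three quantities relate:

# Probability, odds, and likelihood with hypothetical values.
p = 0.2
odds = p / (1 - p)            # 0.2 / 0.8 = 0.25
print(odds)

p_a_given_b = 0.30            # assumed P(A|B)
p_a_given_not_b = 0.10        # assumed P(A|not B)
likelihood = p_a_given_b / p_a_given_not_b
print(likelihood)             # 3.0: B makes A three times as likely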
4.2 The naive Bayes calculation
For any number of attributes, the naive Bayes formula relates the probability of the target event to the overall probability and to the likelihood contributed by each attribute. Return to the example of predicting customer loss from market, acquisition channel, initial credit rating, rate plan, phone number type, handset model, and customer age. Suppose we care about the BlackBerry users mentioned above in a particular state market. The estimate combines: 1. the overall probability of stopping renewal; 2. the likelihood associated with being a BlackBerry user; 3. the likelihood associated with being in that state market.

The model is called "naive" because multiplying the likelihoods together rests on the assumption that the input variables are independent of one another. In this case, it assumes that the likelihood of using a BlackBerry phone is independent of the market (and that the likelihood of living in the state is independent of the handset type). In practice, such independence is rare.

The most attractive feature of the naive Bayes model: if some input values are missing from the observation to be scored, the corresponding likelihoods can simply be dropped from the calculation. This means the model can include inputs, such as customer age, that are not available for every record but are useful when known. Given the likelihoods of the different inputs with respect to stopping renewal, the naive Bayes formula can calculate the probability of stopping renewal, and the company is most interested in such customers.
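A minimal scoring sketch, with invented likelihood ratios, shows both the multiplication and how a missing input is handled:

# Naive Bayes scoring sketch (all numbers hypothetical).
# Each input contributes a likelihood ratio P(x | churn) / P(x | no churn).
def churn_probability(overall_p, likelihood_ratios):
    odds = overall_p / (1 - overall_p)    # overall odds of stopping renewal
    for lr in likelihood_ratios:
        odds *= lr                        # "naive" independence assumption
    return odds / (1 + odds)              # convert odds back to probability

# A BlackBerry user (ratio 3.0) in a given state market (ratio 1.5):
print(churn_probability(0.10, [3.0, 1.5]))   # ~0.33
# If the handset model is missing, simply drop its likelihood ratio:
print(churn_probability(0.10, [1.5]))        # ~0.14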
4.3 Comparison of the naive Bayes and lookup table models
For probability targets, the naive Bayes model is closely related to the lookup table model. The main difference lies in how the dimensions are used. In the lookup table model, all dimensions together define the cells, and the target probability is then calculated for each cell; the lookup table model can therefore capture interactions between variables. In naive Bayes, the likelihood is calculated for each dimension separately, and the likelihoods are then combined to compute the target probability. Combining the likelihoods rests on the assumption that each dimension's effect on the target is independent. A lookup table, on the other hand, cannot describe combinations of attribute values that rarely or never appear in the training data: cells for such low-frequency combinations are simply empty.

The naive Bayes model can score any combination of inputs, even one never seen before, but to do so it must assume that the effects of the inputs are independent of one another. The lookup table model makes no such assumption, so when there is enough data to support trustworthy estimates, the lookup table model may be the better choice.
5. Linear Regression
Regression modeling is another predictive modeling technique. Linear regression can even be done in Excel. Regression models can become very complex; here we discuss only the best-fit line. Both the input and the target variable must be numeric, and the regression equation describes an arithmetic relationship between the two. The "best" relationship is the one that minimizes the sum of squared vertical distances from the data points to the fitted line.
5.1 The best-fit curve
As an example, consider the relationship between time and the daily page views of an article published by a popular blogger. Hollow circles are used as the plotting symbol in such a graph because they show clearly where points cluster; for instance, shortly after publication, readers' attention is densely concentrated. This kind of scatter plot is particularly useful when different customer groups are compared in different colors.

As time passes, the article's daily page views fall lower and lower and then level off; the drawn curve looks like a hyperbola, with time increasing along the X axis and daily page views on the Y axis. We can model how this blogger's traffic changes over time and, assuming the quality of the posts stays constant, predict the approximate page views of a post. There may be considerable error and imprecision here, but the fitted curve lets us see the trend intuitively, and if the curve is accurate enough we can even derive a functional expression for it.
If labels are used as the data point markers without obscuring the fit, a scatter plot can convey even more information. The defining property of the best-fit curve: among all possible curves, the best-fit curve is the one that minimizes the sum of squared vertical distances from the observed points to the curve. A scatter plot shows the distance between each point and the curve.

The best-fit curve is defined by the method of least squares, which is related to the Euclidean distance formula. Gauss proposed the idea of measuring total distance by the sum of squared distances, which makes the coefficients of the best-fit curve easy to compute.
We speak here of linear regression, and the regression model is indeed a linear equation; what we have described above is a fitted curve rather than a strict regression line. In reality, purely linear relationships are rare, and curve fitting is more common.
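As a concrete sketch, the following code fits a least-squares line to invented page-view data, fitting against the logarithm of time to capture the hyperbola-like decay described above:

import numpy as np

# Invented data: daily page views of a post declining over time.
days = np.array([1, 2, 3, 5, 8, 13, 21, 34], dtype=float)
views = np.array([900.0, 620, 480, 300, 180, 110, 70, 50])

# np.polyfit minimizes the sum of squared vertical distances,
# exactly the least-squares criterion described above.
a, b = np.polyfit(np.log(days), views, deg=1)
predicted = a * np.log(days) + b
print(a, b)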
5.2 Goodness of fit
For a given dataset, a best-fit curve can always be found. But when there are many candidate curves, which one is best? Here we introduce the "residual", the difference between the model's predicted value and the actual value. There is also a standard measure, described below, of how well a curve fits the observed data.
1. Residual
(Figure: residuals of a height and weight model.)
An unbiased model should miss high about as often, and by as much, as it misses low. The most important thing to look for in a residual plot is whether positive and negative residuals are equally likely across the whole range of the fitted curve. We can also see which sample points lie above the curve and which lie below it.

The estimates produced by a good model should be close to the actual values, so the residuals should be concentrated near the zero axis. We may also see some isolated points far from the curve; such outliers may be due to manual recording errors.
In statistics, an error term should be included in the regression equation. The equation of the best-fit line is:

Y = aX + b

However, this is not a complete model. Statisticians write the model equation with an error term ε, because X cannot explain Y perfectly; the error term represents the part of Y that the model cannot explain:

Y = aX + b + ε
2. R² (R-squared)
For the best-fit curve, the R² value always lies between 0 and 1. A value close to 1 indicates that the curve captures the relationship between input and target well; a value close to 0 indicates that the best-fit curve performs poorly. Within the range 0 to 1, the larger the value, the stronger the relationship.

R² measures how much better the model's estimates are than simply predicting the mean. The definition is simple, though the computation is more involved. R² compares the best-fit curve with the horizontal line at the mean of Y: R² equals 1 minus the ratio of two errors, where the numerator is the sum of squared residuals of the best-fit curve and the denominator is the sum of squared residuals of the horizontal line. R² thus measures the degree to which the best-fit curve improves on the mean.
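The computation, exactly as just described, is a few lines of code (the data are hypothetical):

import numpy as np

def r_squared(actual, predicted):
    ss_curve = np.sum((actual - predicted) ** 2)        # curve residuals
    ss_mean = np.sum((actual - np.mean(actual)) ** 2)   # mean-line residuals
    return 1 - ss_curve / ss_mean

# Hypothetical actual values and model estimates.
y = np.array([2.0, 4.1, 5.9, 8.2])
y_hat = np.array([2.1, 4.0, 6.0, 8.0])
print(r_squared(y, y_hat))   # close to 1: a good fit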
R² also says something about stability: whether different samples from the same dataset would produce similar models. When R² is low, different samples may behave very differently, and adding a small number of observations may greatly change the model's coefficients. When R² is high, adding a few observations changes little.
5.3 Global Effect
A regression equation discovers global patterns in the data; that is, the equation's coefficients should hold over the entire range of the input variables. Regression models are therefore good at capturing patterns that are true everywhere, not local patterns.

For example, consider the risk posed by car insurance buyers of different ages. Young drivers are high risk; as drivers gain experience, the risk gradually decreases; for elderly drivers, the risk rises again. This makes age difficult to use directly as a regression input: there is no global pattern, because the effect of age differs across age groups.

To a large extent, handling this depends on the modeler's familiarity with the domain, for example by deriving separate input variables for different value ranges. The regression equation itself does not find local patterns.
6. Multiple Regression
The example that introduced linear regression used a single input, elapsed time, to explain the change in daily traffic. When a regression model has multiple inputs, it is called multiple regression.
6.1 The equation
The general form of the linear regression model (without the error term) is Y = a0 + a1X1 + a2X2 + a3X3 + ... + anXn. This equation extends the best-fit line equation by adding more variables and choosing a coefficient for each of them.

Although more dimensions are introduced, the geometric interpretation of linear regression extends naturally to multiple regression: the line becomes a plane and then a hyperplane. It remains easy to consider each independent variable's contribution to the estimate of the dependent variable: the size and direction of the contribution are given by the variable's coefficient.
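A sketch with NumPy's least-squares solver shows a two-input model (the data are invented):

import numpy as np

# Invented inputs X1, X2 and target y.
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 5.0]])
y = np.array([5.1, 4.9, 10.2, 10.1, 14.8])

# Add a column of ones for the intercept a0, then solve for [a0, a1, a2].
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
print(coef)   # the sign and size of a1, a2 show each input's contribution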
6.2 Target variable range
A regression equation can generate any value: if there is no limit on X, there is no limit on Y. The line Y = aX + b illustrates this: as X ranges from negative infinity to positive infinity, so does Y. But targets are often not like this; the range of many target variables is bounded, or not even continuous. In these cases, statisticians introduce a link function that maps the Y value produced by the regression equation into a suitable range for the target. When the target follows a known distribution, the link function can be chosen so that the generated values have the same mean as the target and a similar distribution. Even when the exact distribution is unknown, a link function can still map the estimates into an appropriate range for the target.

We will introduce logistic regression below, where a link function maps the infinite range of linear regression into the interval from 0 to 1, which can be interpreted as a probability estimate.
6.3 Other considerations when using multiple regression
When a regression model has multiple input variables, problems can arise that do not occur with a single input:

· Ideally, all inputs should be linearly independent.
· Inputs may interact with one another.
· Adding a new input can change the coefficients of all the original inputs.
1. Linear Independence
As with naive Bayes, the inputs of a multiple regression model should be linearly independent: changing one input value should not affect the other input values. In practice, true independence is difficult to achieve. As a rule, do not include independent variables that are closely correlated with one another. If such variables are included, the model often gives one input a large positive coefficient and the other a large negative coefficient; the two variables essentially cancel each other out, so the values of these two coefficients say nothing real about either variable.
2. Interaction
Even when two variables are completely independent, their effects on the target may be related. The appeal of ice cream may depend on its price and on the weather, especially how hot the day is. These variables can be considered independent (the price of ice cream certainly does not depend on the temperature; the temperature might affect the price, but assume here it does not). Although the variables are independent of each other, the effect of price on the target may be moderated by temperature: when the weather is hot, people care little about the price of ice cream; when the weather is cold, only genuinely good quality and a low price will attract buyers.

Similarly, the effect of a price change on the take-up rate may vary with the distance from the city center. This is an example of an interaction.

When interactions are important, new variables can be added to capture them: the product of the standardized values of the interacting variables, as sketched below.
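A minimal sketch of constructing such an interaction term (the data are invented):

import numpy as np

# Hypothetical ice-cream data: price and temperature.
price = np.array([1.0, 1.5, 2.0, 2.5, 3.0])
temp = np.array([15.0, 30.0, 20.0, 35.0, 25.0])

def standardize(x):
    return (x - x.mean()) / x.std()

# The interaction term is the product of the standardized variables.
interaction = standardize(price) * standardize(temp)
# The model inputs become: price, temperature, and the interaction column.
X = np.column_stack([price, temp, interaction])
print(X)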
3. Adding a variable can change the coefficients of the original variables in the model
A natural model development approach starts with a simple model with a single input and gradually adds complexity by adding variables. If all input variables were completely independent, adding or deleting a variable would not change the coefficients of the other variables in the model. However, input variables are almost never completely independent, so adding a variable that is correlated with another may change the sizes of the coefficients, or even flip the signs, of other variables already in the model.
6.4 Variable selection for multiple regression
When a multiple regression model has a very large number of inputs, its performance suffers. Selecting the right input variables is the most important part of any modeling effort. This is where "domain knowledge" comes in: the first thing to consider is prior knowledge about the problem and the methods people have already used to solve it. Domain knowledge often supplies good candidate predictors for an initial model.

After a list of candidate variables has been created using domain knowledge and common sense, the software used to build the regression model can usually help select the best variables for the model. Common methods:
1. Forward Selection
Forward selection starts from a set of candidate input variables; some or all of them may appear in the final model. The first step builds a separate regression model for each input variable: with n candidate inputs, the first step considers n different one-variable regression models. The variable whose model earns the best test score is selected as the first variable of the forward selection model.

One way to pick the best model is to choose the one with the highest R² value. Another is the F-test from statistics. A third: the best model is the one with the smallest error on the validation set. This last approach looks most like data mining, because it uses a validation set and makes no assumptions about the inputs or the target.

After the first variable is selected, the process repeats. In the second round, each remaining variable is combined with the variable selected in the first step, creating n-1 regression models with two input variables. The best of these models becomes the basis of the next iteration, which tests multiple regression models with three variables. The process continues until some stopping condition is met: a maximum number of selected variables, or a threshold such that adding more variables no longer improves the model by more than the threshold.
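A sketch of forward selection scored on a validation set; the fit and error helpers below are stand-ins written with NumPy's least squares, and in practice any regression library would do:

import numpy as np

def fit(X, y):
    # Least squares with an intercept column.
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def error(coef, X, y):
    # Mean squared error on a dataset.
    A = np.column_stack([np.ones(len(X)), X])
    return np.mean((A @ coef - y) ** 2)

def forward_select(X_tr, y_tr, X_va, y_va, max_vars):
    selected, remaining = [], list(range(X_tr.shape[1]))
    while remaining and len(selected) < max_vars:
        # Try each remaining variable together with those already chosen.
        scores = {j: error(fit(X_tr[:, selected + [j]], y_tr),
                           X_va[:, selected + [j]], y_va)
                  for j in remaining}
        best = min(scores, key=scores.get)   # smallest validation error
        selected.append(best)
        remaining.remove(best)
    return selected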
2. Stepwise selection

Stepwise selection is very similar to forward selection, with one small difference: in addition to adding a variable at each step, it may also delete a variable, since a variable that entered the model early may stop being useful once later variables are combined with it.
3. Backward Elimination
Backward elimination starts by building a multiple regression model with all n input variables, then uses statistical tests to eliminate the poor variables and refits the model. The process continues until some stopping condition is met, such as reaching the desired minimum number of variables.
7. Logistic Regression Analysis
The linear regression model has a particular form. This form can be used with any number of input variables, but it is clearest with a single input. The regression formula is a linear equation, and one property of a straight line is that it extends infinitely in both directions: except for a line parallel to the X axis, a regression line has no maximum or minimum value. These properties make the linear regression model suitable for estimating continuous variables with a wide range of values.

The same properties make the linear regression model suitable for modeling unbounded, continuous targets, and unsuitable for modeling binary outcomes such as yes/no or good/bad. Because binary problems are extremely common, we have the logistic regression model.
7.1 Modeling a binary output

Modeling a binary output does not at first look like an estimation task: there are two classes, and the task assigns each record to one of them, which is classification. However, the task can be restated as "what is the probability that a record belongs to one of the classes?" Because a probability is a number, the problem becomes an estimation task.
1. Using linear regression to estimate a probability
Consider a model of whether a newspaper customer will pay, based on how long they have subscribed. The call center contacts users, and some customers agree to sign a subscription agreement. Soon after signing, they begin to receive the newspaper, and then a bill. Some new subscribers never take the time to pay the first bill, reducing the newspaper's revenue. After a period of time, customers who have not paid are cut off from the paper. Some new customers who do pay may become good long-term customers; but the more weeks that pass without payment, the less likely the customer is to ever pay.

Let the target value be 0 for a customer who never paid and 1 for a customer who did pay. A best-fit line can be found, but there is no guarantee it is a good model: past a certain day, the estimated probability exceeds 1, and the estimate grows without limit as the duration increases. This is the nature of a straight line, and here it is an obvious drawback: a line has no maximum or minimum value.
2. Bending the regression line

Clearly, a straight line is not the right shape for a probability estimate. Logistic regression solves this problem by bending the regression line into a more suitable shape: we need a value confined to the range from 0 to 1. This is the logistic function.
7.2 The logistic function
The naive Bayes model multiplies a string of likelihoods to estimate odds, then converts the odds into a probability. A similar trick converts linear regression into logistic regression.

Step 1: convert the probability P into the odds, P / (1 - P). Odds and probability express the same thing; the difference is that probability ranges from 0 to 1, while odds range from 0 to positive infinity. Taking the logarithm of the odds then produces a function ranging from negative infinity to positive infinity.

With this transformation, the probability has been converted into a continuous function ranging from negative infinity to positive infinity, exactly the kind of target that suits linear regression. A regression equation is then built using the log odds as the target variable.
The regression equation is ln(P / (1 - P)) = a0 + a1X1 + ... + anXn. Solving for P gives:

P = 1 / (1 + e^-(a0 + a1X1 + ... + anXn))

This is the logistic function. The logistic function has a characteristic S-shaped curve; the model parameters shift it left or right and stretch or compress it. It has a pleasant property: near the origin, roughly between -1 and 1, the curve is close to a straight line, while further out it flattens and stays between 0 and 1. This is a curve very well suited to probabilities.
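In code, the logistic function is one line; the sketch below evaluates it at a few points to show the S shape:

import numpy as np

def logistic(x):
    # Maps any real value into the interval (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(logistic(x))   # ~[0.007, 0.269, 0.5, 0.731, 0.993]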
Although logistic regression looks more "curved" than a straight line, it still finds only global patterns, not local ones; each variable's effect is captured by a single coefficient. The ordinary least squares method used to find the best-fit line in linear regression does not apply to logistic regression; instead, the model is fitted by the method of maximum likelihood.
Fitting a model by maximum likelihood
Fitting any kind of parametric model means using the data to find the best parameters, those for which the values predicted by the model agree as closely as possible with the actual values. This is the inverse of scoring: in scoring, the parameter values are given, and the model produces the best estimate for a particular input.

Ideally, we would like a function that, given candidate parameter values and a set of observations, returns the probability that those parameter values are correct. Although no such function exists in general, there is a useful relationship: given the parameters, the probability of the observed target values in the training set is proportional to the likelihood of those parameters. Likelihood is a relative measure of uncertainty, unlike probability, which is absolute; but likelihood is enough to compare candidate parameters. The actual formula of the likelihood function depends on assumptions about the data, which in turn depend on the particular parametric model in use.

For logistic regression, the likelihood function has a unique optimum, so numerical optimization can be used to maximize the likelihood with a guarantee that the maximum will be found.
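A sketch of maximum likelihood fitting for a one-input logistic regression, minimizing the negative log-likelihood with SciPy's general-purpose optimizer (the data are invented):

import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(params, X, y):
    a0, a1 = params
    z = np.clip(a0 + a1 * X, -30, 30)       # clip for numerical stability
    p = 1.0 / (1.0 + np.exp(-z))
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Hypothetical data: weeks since signup (X) and whether the bill was paid.
X = np.array([1.0, 2.0, 3.0, 10.0, 20.0, 30.0, 40.0, 60.0])
y = np.array([1, 1, 0, 1, 0, 1, 0, 0])

result = minimize(neg_log_likelihood, x0=[0.0, 0.0], args=(X, y))
print(result.x)   # fitted coefficients [a0, a1]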
Summary:
The ideas of probability and statistics are the foundation of all data mining techniques. Given a statistical description of an ideal target, similarity can be used to score candidates by measuring the distance between a candidate and the prototype or ideal. Euclidean distance is a common similarity measure, but many other measures are possible.
The lookup table model uses a different similarity measure: all observations that fall into the same cell of the table receive the same score, based on the characteristics of the training data assigned to that cell. There are many ways to define the cells of a lookup table model, but the simplest is to divide the range of each input into a small number of equal-sized groups, such as three or five.

The lookup table has a problem: as the number of inputs increases, the number of training samples in each cell decreases sharply. With many inputs, it is difficult to have enough data for a confident estimate. One solution is to merge several small cells into one large cell.
The naive Bayes solution is to use all the training data for each dimension separately, and then combine the contributions of the dimensions. The word "naive" refers to the assumption that the input variables are independent of one another, an assumption that often does not hold in reality.
The most common statistical models are regression models. The linear regression model combines all the inputs by multiplying each input by a coefficient and summing. Fitting a regression model means finding the coefficients that minimize the sum of squared estimation errors. Linear regression has many strengths but does not apply to every situation; in particular, it is not suitable for probability estimation. The logistic regression model fits the observed data with an S-shaped function instead of a straight line; it produces estimates confined to the range 0 to 1, making it suitable for representing probabilities.
All regression models discover global patterns, that is, patterns that hold across all values of the input variables. In practice, many patterns are local. As for what a local pattern is, we will discuss decision trees next, which are very powerful at finding local patterns between the inputs and the target.