Logistic regression model predicts stock ups and downs

Source: Internet
Author: User

Http://www.cnblogs.com/lafengdatascientist/p/5567038.html


Logistic regression is a classifier. Its basic idea can be summarized as follows: for a binary (0/1) classification problem, an observation is assigned to class 1 if P(y=1|x) > 0.5 and to class 0 if P(y=1|x) < 0.5.

I. Overview of the model

1. Sigmoid function

The sigmoid function is defined as:

$$y = \frac{1}{1 + e^{-x}}$$

Its graph is an S-shaped curve that rises from 0 to 1. The line x = 0 (drawn in red in the original figure) divides the sigmoid curve into two parts: when x < 0, y < 0.5; when x > 0, y > 0.5.
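The behaviour of the sigmoid function on either side of x = 0 can be checked numerically. The following is a minimal sketch in base R; the name `sigmoid` and the sample points are chosen here for illustration and are not from the original article.

```r
# Sigmoid (logistic) function: maps any real number into the interval (0, 1)
sigmoid <- function(x) 1 / (1 + exp(-x))

x <- c(-5, -1, 0, 1, 5)
y <- sigmoid(x)
print(round(y, 4))   # 0.0067 0.2689 0.5000 0.7311 0.9933

# The curve crosses 0.5 exactly at x = 0
stopifnot(sigmoid(0) == 0.5)

# Base R provides the same function as plogis()
stopifnot(isTRUE(all.equal(y, plogis(x))))
```

Negative inputs give values below 0.5 and positive inputs give values above 0.5, which is what makes 0.5 a natural classification threshold.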

In practical classification problems, the response variable usually has to be classified on the basis of several predictor variables. The sigmoid function is therefore combined with a multivariate linear function to obtain logistic regression.

2. Logistic model

Substituting a multivariate linear function for x in the sigmoid function gives the logistic model:

$$P(x) = \frac{1}{1 + e^{-\theta x}}$$

where $\theta x = \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n$ is a multivariate linear model.

The formula above can be rewritten as:

$$\frac{P(x)}{1 - P(x)} = e^{\theta x}$$

The left side of this formula is called the odds. As P(x) approaches 0, the odds approach 0; as P(x) approaches 1, the odds approach ∞.

Taking the logarithm of both sides gives:

$$\log \frac{P(x)}{1 - P(x)} = \theta x$$

The left side of this formula is called the log-odds, or logit, and it is a linear function of the predictors.
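The relationship between the probability, the odds, and the log-odds can be verified with a small numerical example. This is only a sketch: the coefficient vector `theta` and the observation `x` below are hypothetical values picked for illustration.

```r
# Hypothetical coefficients and one observation, chosen only for illustration
theta <- c(0.8, -0.5)
x     <- c(1.5, 2.0)

eta  <- sum(theta * x)        # linear predictor theta x
p    <- 1 / (1 + exp(-eta))   # P(x) from the logistic model
odds <- p / (1 - p)           # odds

# The odds equal exp(theta x), so the log-odds equal theta x
stopifnot(isTRUE(all.equal(odds, exp(eta))))
stopifnot(isTRUE(all.equal(log(odds), eta)))

# qlogis() is base R's logit function
stopifnot(isTRUE(all.equal(qlogis(p), eta)))
```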

Although the model could be fitted by least squares, maximum likelihood has better statistical properties, so logistic regression is usually fitted by maximum likelihood. The fitting procedure is skipped here; the rest of this article only shows how to apply logistic regression in R.
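For readers who want a rough picture of what maximum likelihood fitting involves, the sketch below minimises the negative log-likelihood directly with `optim()` and compares the result with `glm()`. The data are simulated for this example (they are not the Smarket data), and the model includes an intercept because `glm()` adds one by default.

```r
set.seed(1)

# Simulated data (not from the article): one predictor, known coefficients
n <- 500
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.2 * x))

# Negative log-likelihood of the logistic model for coefficients beta
neg_loglik <- function(beta) {
  eta <- beta[1] + beta[2] * x
  # Per observation: log(1 + exp(eta)) - y * eta
  sum(log1p(exp(eta)) - y * eta)
}

# Minimise the negative log-likelihood numerically
opt <- optim(c(0, 0), neg_loglik, method = "BFGS")

# glm() maximises the same likelihood (by iteratively reweighted least squares)
fit <- glm(y ~ x, family = binomial)

print(rbind(optim = opt$par, glm = unname(coef(fit))))
```

The two rows of the printed matrix should agree to several decimal places, since both approaches target the same likelihood.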

II. Logistic regression application

1. Data set

The example uses the Smarket data set from the ISLR package. Let's take a look at the structure of the data set:

    > summary(Smarket)
          Year           Lag1                Lag2          
     Min.   :2001   Min.   :-4.922000   Min.   :-4.922000  
     1st Qu.:2002   1st Qu.:-0.639500   1st Qu.:-0.639500  
     Median :2003   Median : 0.039000   Median : 0.039000  
     Mean   :2003   Mean   : 0.003834   Mean   : 0.003919  
     3rd Qu.:2004   3rd Qu.: 0.596750   3rd Qu.: 0.596750  
     Max.   :2005   Max.   : 5.733000   Max.   : 5.733000  
          Lag3                Lag4                Lag5         
     Min.   :-4.922000   Min.   :-4.922000   Min.   :-4.92200  
     1st Qu.:-0.640000   1st Qu.:-0.640000   1st Qu.:-0.64000  
     Median : 0.038500   Median : 0.038500   Median : 0.03850  
     Mean   : 0.001716   Mean   : 0.001636   Mean   : 0.00561  
     3rd Qu.: 0.596750   3rd Qu.: 0.596750   3rd Qu.: 0.59700  
     Max.   : 5.733000   Max.   : 5.733000   Max.   : 5.73300  
         Volume           Today           Direction 
     Min.   :0.3561   Min.   :-4.922000   Down:602  
     1st Qu.:1.2574   1st Qu.:-0.639500   Up  :648  
     Median :1.4229   Median : 0.038500             
     Mean   :1.4783   Mean   : 0.003138             
     3rd Qu.:1.6417   3rd Qu.: 0.596750             
     Max.   :3.1525   Max.   : 5.733000             

  

Smarket contains stock returns for 1,250 trading days from 2001 to 2005. Year is the year of the observation, Lag1 through Lag5 are the returns on each of the previous five trading days, Today is the return on the current day, and Direction is the market direction, either Up (rise) or Down (fall).

First, look at the correlation coefficients between the variables:

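    library(ISLR)      # provides the Smarket data set
    library(corrplot)
    # Upper triangle as circles, lower triangle as numbers; column 9 (Direction) is not numeric
    corrplot(corr = cor(Smarket[,-9]), order = "AOE", type = "upper", tl.pos = "d")
    corrplot(corr = cor(Smarket[,-9]), add = TRUE, type = "lower", method = "number",
             order = "AOE", diag = FALSE, tl.pos = "n", cl.pos = "n")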

  

The correlation between Volume and Year is relatively large, which indicates that trading volume increased over the years; the correlations among the other variables are all small. It is common sense that historical stock data is only weakly correlated with future data, so it is difficult to predict the stock market accurately with supervised learning methods. But since this is a tutorial on applying the algorithm, let's try it anyway.

2. Train and test the logistic regression model

Logistic regression is a generalized linear model, so it is fitted with the glm() function, with the argument family=binomial added.

    > attach(Smarket)
    # Data before 2005 is the training set; data from 2005 is the test set
    > train = Year<2005
    # Fit the logistic regression model on the training set
    > glm.fit=glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume,
    +             data=Smarket,family=binomial, subset=train)
    # Predict on the test set with the fitted model; type="response" returns probabilities only
    > glm.probs=predict(glm.fit,newdata=Smarket[!train,],type="response")
    # Classify each day as Up or Down according to its probability
    > glm.pred=ifelse(glm.probs >0.5,"Up","Down")
    # Actual market direction in 2005
    > Direction.2005=Smarket$Direction[!train]
    # Compare the predicted values with the actual values
    > table(glm.pred,Direction.2005)
            Direction.2005
    glm.pred Down Up
        Down   77 97
        Up     34 44
    # Compute the prediction accuracy
    > mean(glm.pred==Direction.2005)
    [1] 0.4801587

The prediction accuracy is only 0.48, which is worse than random guessing. Let's try to adjust the model.
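The accuracy figure can be recomputed by hand from the confusion matrix printed above. The sketch below re-enters the four reported counts in a matrix (so it runs without the ISLR package) and divides the correct predictions on the diagonal by the total number of test days.

```r
# Confusion matrix as reported above (rows: predicted, columns: actual)
conf <- matrix(c(77, 34, 97, 44), nrow = 2,
               dimnames = list(glm.pred = c("Down", "Up"),
                               Direction.2005 = c("Down", "Up")))

# Correct predictions sit on the diagonal
accuracy <- sum(diag(conf)) / sum(conf)
print(accuracy)   # 0.4801587, i.e. (77 + 44) / 252
```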

    # Check the model summary
    > summary(glm.fit)

    Call:
    glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
        Volume, family = binomial, data = Smarket, subset = train)

    Deviance Residuals: 
       Min      1Q  Median      3Q     Max  
    -1.302  -1.190   1.079   1.160   1.350  

    Coefficients:
                 Estimate Std. Error z value Pr(>|z|)
    (Intercept)  0.191213   0.333690   0.573    0.567
    Lag1        -0.054178   0.051785  -1.046    0.295
    Lag2        -0.045805   0.051797  -0.884    0.377
    Lag3         0.007200   0.051644   0.139    0.889
    Lag4         0.006441   0.051706   0.125    0.901
    Lag5        -0.004223   0.051138  -0.083    0.934
    Volume      -0.116257   0.239618  -0.485    0.628

    (Dispersion parameter for binomial family taken to be 1)

        Null deviance: 1383.3  on 997  degrees of freedom
    Residual deviance: 1381.1  on 991  degrees of freedom
    AIC: 1395.1

    Number of Fisher Scoring iterations: 3

  

All variables have large p-values, so none of them is significant. As mentioned in the earlier linear regression section, the smaller the AIC, the better the model; the AIC here is still fairly large.

Adding predictors that are unrelated to the response increases the test error rate (such predictors increase the model's variance without a corresponding reduction in bias), so removing them may improve the model.

The p-values of Lag1 and Lag2 in the model above are noticeably smaller than those of the other variables, so the model is refitted using only these two variables.

    > glm.fit=glm(Direction~Lag1+Lag2,
    +             data=Smarket,family=binomial, subset=train)
    > glm.probs=predict(glm.fit,newdata=Smarket[!train,],type="response")
    > glm.pred=ifelse(glm.probs >0.5,"Up","Down")
    > table(glm.pred,Direction.2005)
            Direction.2005
    glm.pred Down  Up
        Down   35  35
        Up     76 106
    > mean(glm.pred==Direction.2005)
    [1] 0.5595238
    > 106/(76+106)
    [1] 0.5824176

  

The overall accuracy of the model reaches 56%, which finally shows the statistical model predicting better than random guessing (albeit only slightly). According to the confusion matrix, when the logistic regression model predicts a fall it is right 50% of the time, and when it predicts a rise it is right 58% of the time. (In the matrix, the rows are the predicted values and the columns are the actual values.)
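The per-class figures quoted above come from reading the confusion matrix row by row. As a sketch, the counts reported for the two-variable model can be re-entered and the calculations repeated in base R; the variable names `per_class`, `overall`, and `share_up` are introduced here for illustration.

```r
# Confusion matrix of the Lag1 + Lag2 model (rows: predicted, columns: actual)
conf <- matrix(c(35, 76, 35, 106), nrow = 2,
               dimnames = list(glm.pred = c("Down", "Up"),
                               Direction.2005 = c("Down", "Up")))

# Share of correct predictions within each predicted class (each row)
per_class <- diag(conf) / rowSums(conf)
print(round(per_class, 4))   # Down: 0.5, Up: about 0.5824

# Overall accuracy across all 252 test days
overall <- sum(diag(conf)) / sum(conf)
print(overall)               # about 0.5595

# Share of days in the test set that were actually Up
share_up <- unname(colSums(conf)["Up"] / sum(conf))
print(share_up)
```

The last value is worth comparing with the overall accuracy, since it is the hit rate a rule that always predicts Up would achieve on the same test set.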

Use this model to make predictions for two new observations:

    > predict(glm.fit,newdata =data.frame(Lag1=c(1.2,1.5),Lag2=c(1.1,-0.8)),type="response")
            1         2 
    0.4791462 0.4960939 

For the two points (Lag1, Lag2) = (1.2, 1.1) and (1.5, -0.8), both predicted probabilities of a rise are below 0.5, so the model predicts that the market will fall. Note that, unlike linear regression, the predictions of a logistic regression do not come with a confidence interval (or prediction interval), so adding the interval argument has no effect.
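If an interval around the predicted probability is still wanted, one common workaround (not part of the original article) is to ask predict() for standard errors on the link scale with se.fit=TRUE, build an approximate normal interval there, and map its endpoints back through the inverse logit. The sketch below uses simulated data as a stand-in for Smarket so that it runs without the ISLR package; the coefficients used in the simulation are arbitrary.

```r
set.seed(42)

# Simulated stand-in for the training data (NOT the Smarket data)
n   <- 300
dat <- data.frame(Lag1 = rnorm(n), Lag2 = rnorm(n))
up  <- runif(n) < plogis(0.1 - 0.3 * dat$Lag1)
dat$Direction <- factor(ifelse(up, "Up", "Down"), levels = c("Down", "Up"))

fit <- glm(Direction ~ Lag1 + Lag2, data = dat, family = binomial)
new <- data.frame(Lag1 = c(1.2, 1.5), Lag2 = c(1.1, -0.8))

# Predict on the link (log-odds) scale and request standard errors
p <- predict(fit, newdata = new, type = "link", se.fit = TRUE)

# Approximate 95% interval on the link scale, mapped back to probabilities
ci <- data.frame(
  prob  = plogis(p$fit),
  lower = plogis(p$fit - 1.96 * p$se.fit),
  upper = plogis(p$fit + 1.96 * p$se.fit)
)
print(ci)
```

Building the interval on the link scale keeps both endpoints inside (0, 1), which an interval built directly on the probability scale would not guarantee.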
