Http://www.cnblogs.com/lafengdatascientist/p/5567038.html
Logistic regression model predicts stock ups and downs
Logistic regression is a classifier. The basic idea can be summarized as follows: for a binary (0/1) classification problem, if P(y=1|x) > 0.5 the observation is classified as class 1; if P(y=1|x) < 0.5, it is classified as class 0.
I. Overview of the model

1. The sigmoid function
The sigmoid function

    σ(x) = 1 / (1 + e^(−x))

maps any real input into (0, 1); its graph is an S-shaped curve (the original figure is omitted here). The vertical red line x = 0 divides the sigmoid curve into two parts: when x < 0, y < 0.5; when x > 0, y > 0.5.
In practical classification problems, the response variable usually depends on several predictor variables, so the sigmoid function must be combined with a multivariate linear function to be applied in logistic regression.
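The threshold behavior described above can be checked directly in R; a minimal sketch (the function name `sigmoid` is just an illustrative choice):

```r
# A minimal sketch of the sigmoid function and its threshold at x = 0
sigmoid <- function(x) 1 / (1 + exp(-x))

# At x = 0 the curve crosses y = 0.5; negative inputs map below 0.5,
# positive inputs map above it, which is what makes 0.5 a natural cutoff
sigmoid(0)    # 0.5
sigmoid(-2)   # below 0.5
sigmoid(2)    # above 0.5

# Plot the curve with the dividing line x = 0 (the red line in the figure)
curve(sigmoid, from = -6, to = 6, ylab = "sigmoid(x)")
abline(v = 0, col = "red")
```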
2. Logistic model
The logistic model is

    p(x) = 1 / (1 + e^(−θx)),

where θx = θ1x1 + θ2x2 + …… + θnxn is a multivariate linear function.
The equation above can be rearranged as:

    p(x) / (1 − p(x)) = e^(θx)

The left-hand side of this formula is called the odds. When p(x) is close to 0, the odds are near 0; when p(x) is close to 1, the odds approach ∞.
Taking the logarithm of both sides gives:

    log( p(x) / (1 − p(x)) ) = θx

The left-hand side of this formula is called the log-odds, or logit, and the model becomes linear in the predictors.
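The odds and log-odds identities can be verified numerically in R, whose `plogis()` and `qlogis()` functions are exactly the sigmoid and the logit:

```r
# Numerical check of the odds / log-odds identities, using R's built-in
# plogis() (the sigmoid) and qlogis() (the logit)
p <- 0.8
odds <- p / (1 - p)        # the odds: 4
log(odds)                  # the log-odds; identical to qlogis(p)
plogis(qlogis(p))          # applying the sigmoid to the logit recovers p
```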
Logistic regression is usually fitted by the maximum likelihood method, which has better statistical properties than a least-squares fit. The fitting process is skipped here; the rest of this post only shows how to apply the logistic regression algorithm in R.
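As a rough illustration of the maximum-likelihood fit just mentioned, the following sketch minimizes the negative log-likelihood directly with `optim()` on synthetic data and compares the result with `glm()`. The data and coefficient values here are made up for illustration only:

```r
# Illustrative sketch: fitting a logistic model by maximizing the likelihood
# directly, then checking the result against glm(). Data are synthetic.
set.seed(1)
n <- 5000
x <- rnorm(n)
p <- 1 / (1 + exp(-(0.5 + 2 * x)))   # true intercept 0.5, true slope 2
y <- rbinom(n, 1, p)

# Negative log-likelihood of the logistic model
negll <- function(theta) {
  eta <- theta[1] + theta[2] * x
  -sum(y * eta - log(1 + exp(eta)))
}

ml  <- optim(c(0, 0), negll)$par              # direct numerical maximization
fit <- coef(glm(y ~ x, family = binomial))    # R's built-in fitter

# Both routes recover roughly the same coefficients
round(ml, 2)
round(unname(fit), 2)
```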
II. Applying logistic regression

1. The data set
We use the `Smarket` data set from the `ISLR` package. Let's take a look at the structure of the data set:
```r
> summary(Smarket)
      Year           Lag1                Lag2
 Min.   :2001   Min.   :-4.922000   Min.   :-4.922000
 1st Qu.:2002   1st Qu.:-0.639500   1st Qu.:-0.639500
 Median :2003   Median : 0.039000   Median : 0.039000
 Mean   :2003   Mean   : 0.003834   Mean   : 0.003919
 3rd Qu.:2004   3rd Qu.: 0.596750   3rd Qu.: 0.596750
 Max.   :2005   Max.   : 5.733000   Max.   : 5.733000
      Lag3                Lag4                Lag5
 Min.   :-4.922000   Min.   :-4.922000   Min.   :-4.92200
 1st Qu.:-0.640000   1st Qu.:-0.640000   1st Qu.:-0.64000
 Median : 0.038500   Median : 0.038500   Median : 0.03850
 Mean   : 0.001716   Mean   : 0.001636   Mean   : 0.00561
 3rd Qu.: 0.596750   3rd Qu.: 0.596750   3rd Qu.: 0.59700
 Max.   : 5.733000   Max.   : 5.733000   Max.   : 5.73300
     Volume           Today           Direction
 Min.   :0.3561   Min.   :-4.922000   Down:602
 1st Qu.:1.2574   1st Qu.:-0.639500   Up  :648
 Median :1.4229   Median : 0.038500
 Mean   :1.4783   Mean   : 0.003138
 3rd Qu.:1.6417   3rd Qu.: 0.596750
 Max.   :3.1525   Max.   : 5.733000
```
`Smarket` contains 1250 daily stock-market returns from 2001 through 2005. `Year` is the year of the observation, `Lag1` through `Lag5` are the returns for the previous five trading days, `Today` is today's return, and `Direction` is the market direction, either `Up` or `Down`.
First look at the correlation coefficients of each variable:
```r
library(corrplot)
corrplot(corr = cor(Smarket[,-9]), order = "AOE", type = "upper", tl.pos = "d")
corrplot(corr = cor(Smarket[,-9]), add = TRUE, type = "lower", method = "number",
         order = "AOE", diag = FALSE, tl.pos = "n", cl.pos = "n")
```
It can be seen that `Volume` and `Year` have a relatively large correlation coefficient, indicating that trading volume grows with the years, while the correlations among the other variables are negligible. This matches common sense: a stock's historical data carry very little information about its future, so it is difficult to predict the future market accurately with supervised learning. But as an application tutorial for the algorithm, let's try anyway.
2. Train and test the logistic regression model
The logistic regression model is a generalized linear model, so it is fitted with the `glm()` function, with the additional argument `family = binomial`.
```r
> attach(Smarket)
> # data before 2005 as the training set; 2005 data as the test set
> train = Year < 2005
> # fit a logistic regression model on the training set
> glm.fit = glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
+               data = Smarket, family = binomial, subset = train)
> # predict on the test set; type = "response" returns only the probabilities
> glm.probs = predict(glm.fit, newdata = Smarket[!train,], type = "response")
> # classify as Up or Down according to the probability
> glm.pred = ifelse(glm.probs > 0.5, "Up", "Down")
> # the actual movements in 2005
> Direction.2005 = Smarket$Direction[!train]
> # compare predictions with actual values
> table(glm.pred, Direction.2005)
        Direction.2005
glm.pred Down Up
    Down   77 97
    Up     34 44
> # prediction accuracy
> mean(glm.pred == Direction.2005)
[1] 0.4801587
```
The prediction accuracy is only 0.48, worse than blind guessing. Let's try to adjust the model.
```r
> # check the model summary
> summary(glm.fit)

Call:
glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
    Volume, family = binomial, data = Smarket, subset = train)

Deviance Residuals:
   Min      1Q  Median      3Q     Max
-1.302  -1.190   1.079   1.160   1.350

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  0.191213   0.333690   0.573    0.567
Lag1        -0.054178   0.051785  -1.046    0.295
Lag2        -0.045805   0.051797  -0.884    0.377
Lag3         0.007200   0.051644   0.139    0.889
Lag4         0.006441   0.051706   0.125    0.901
Lag5        -0.004223   0.051138  -0.083    0.934
Volume      -0.116257   0.239618  -0.485    0.628

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1383.3  on 997  degrees of freedom
Residual deviance: 1381.1  on 991  degrees of freedom
AIC: 1395.1

Number of Fisher Scoring iterations: 3
```
You can see that all of the variables have large p-values; none is significant. As mentioned in the earlier section on linear regression, a smaller AIC indicates a better model, and the AIC here (1395.1) is still fairly large.
Adding a predictor that is independent of the response tends to increase the test error rate (such a predictor increases the model's variance without a corresponding reduction in bias), so removing such predictors may improve the model.
The p-values of `Lag1` and `Lag2` in the model above are noticeably smaller than those of the other variables, so we retrain the model using only these two.
```r
> glm.fit = glm(Direction ~ Lag1 + Lag2,
+               data = Smarket, family = binomial, subset = train)
> glm.probs = predict(glm.fit, newdata = Smarket[!train,], type = "response")
> glm.pred = ifelse(glm.probs > 0.5, "Up", "Down")
> table(glm.pred, Direction.2005)
        Direction.2005
glm.pred Down  Up
    Down   35  35
    Up     76 106
> mean(glm.pred == Direction.2005)
[1] 0.5595238
> 106/(76+106)
[1] 0.5824176
```
The overall accuracy of the model reaches 56%, which finally shows the statistical model predicting better than guessing (albeit only a little). From the confusion matrix (rows are predicted values, columns are actual values), when the model predicts a fall it is right 50% of the time (35/70), and when it predicts a rise it is right 58% of the time (106/182).
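The per-class numbers above can be computed directly from the confusion matrix; a small sketch, re-entering the matrix by hand:

```r
# Re-enter the confusion matrix from above: rows = predicted, columns = actual
cm <- matrix(c(35, 76, 35, 106), nrow = 2,
             dimnames = list(pred = c("Down", "Up"), actual = c("Down", "Up")))

overall  <- sum(diag(cm)) / sum(cm)                  # overall accuracy
acc_down <- cm["Down", "Down"] / sum(cm["Down", ])   # accuracy when predicting Down
acc_up   <- cm["Up", "Up"]     / sum(cm["Up", ])     # accuracy when predicting Up

round(c(overall, acc_down, acc_up), 3)   # 0.560 0.500 0.582
```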
Now use this model to predict two new observations:
```r
> predict(glm.fit, newdata = data.frame(Lag1 = c(1.2, 1.5), Lag2 = c(1.1, -0.8)),
+         type = "response")
        1         2
0.4791462 0.4960939
```
It can be seen that for the two points (Lag1, Lag2) = (1.2, 1.1) and (1.5, −0.8), both predicted probabilities are below 0.5, so the model predicts that the stock will fall in both cases. Note that, unlike linear regression, logistic regression predictions do not come with confidence (or prediction) intervals, so adding the `interval` argument to `predict()` has no effect.
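Although there is no `interval` option, one common workaround (not from the original post) is to request standard errors on the link (log-odds) scale with `se.fit = TRUE` and map a normal-approximation band back through the sigmoid. A sketch on synthetic stand-in data, since the idea does not depend on the `Smarket` set:

```r
# Sketch: an approximate 95% confidence band for the predicted probability,
# built from standard errors on the link scale. Data here are synthetic
# stand-ins with the same column names as the Smarket example.
set.seed(42)
d <- data.frame(Lag1 = rnorm(200), Lag2 = rnorm(200))
d$Direction <- rbinom(200, 1, plogis(-0.1 * d$Lag1 - 0.1 * d$Lag2))

fit <- glm(Direction ~ Lag1 + Lag2, data = d, family = binomial)

new <- data.frame(Lag1 = c(1.2, 1.5), Lag2 = c(1.1, -0.8))
pr  <- predict(fit, newdata = new, type = "link", se.fit = TRUE)

# Normal-approximation band on the link scale, mapped back via the sigmoid
lower <- plogis(pr$fit - 1.96 * pr$se.fit)
upper <- plogis(pr$fit + 1.96 * pr$se.fit)
cbind(prob = plogis(pr$fit), lower, upper)
```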