Http://www.cnblogs.com/lafengdatascientist/p/5567038.html
Logistic regression model predicts stock ups and downs
Logistic regression is a classifier. The basic idea can be summarized as follows: for a binary (0/1) classification problem, if P(y=1|x) > 0.5 the observation is classified as class 1; if P(y=1|x) < 0.5, it is classified as class 0.
I. Overview of the model

1. The sigmoid function
The sigmoid function

    σ(x) = 1 / (1 + e^(−x))

maps any real input into (0, 1); its graph is an S-shaped curve (the original figure is omitted here). The vertical red line x = 0 divides the sigmoid curve into two parts: when x < 0, y < 0.5; when x > 0, y > 0.5.
In practical classification problems, the response variable usually depends on several predictor variables, so the sigmoid function must be combined with a multivariate linear function to be applied in logistic regression.
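The threshold behavior described above can be checked directly in R; a minimal sketch (the function name `sigmoid` is just an illustrative choice):

```r
# A minimal sketch of the sigmoid function and its threshold at x = 0
sigmoid <- function(x) 1 / (1 + exp(-x))

# At x = 0 the curve crosses y = 0.5; negative inputs map below 0.5,
# positive inputs map above it, which is what makes 0.5 a natural cutoff
sigmoid(0)    # 0.5
sigmoid(-2)   # below 0.5
sigmoid(2)    # above 0.5

# Plot the curve with the dividing line x = 0 (the red line in the figure)
curve(sigmoid, from = -6, to = 6, ylab = "sigmoid(x)")
abline(v = 0, col = "red")
```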
2. Logistic model
The logistic model is

    p(x) = 1 / (1 + e^(−θx)),

where θx = θ1x1 + θ2x2 + …… + θnxn is a multivariate linear function.
The equation above can be rearranged as:

    p(x) / (1 − p(x)) = e^(θx)

The left-hand side of this formula is called the odds. When p(x) is close to 0, the odds are near 0; when p(x) is close to 1, the odds approach ∞.
Taking the logarithm of both sides gives:

    log( p(x) / (1 − p(x)) ) = θx

The left-hand side of this formula is called the log-odds, or logit, and the model becomes linear in the predictors.
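The odds and log-odds identities can be verified numerically in R, whose `plogis()` and `qlogis()` functions are exactly the sigmoid and the logit:

```r
# Numerical check of the odds / log-odds identities, using R's built-in
# plogis() (the sigmoid) and qlogis() (the logit)
p <- 0.8
odds <- p / (1 - p)        # the odds: 4
log(odds)                  # the log-odds; identical to qlogis(p)
plogis(qlogis(p))          # applying the sigmoid to the logit recovers p
```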
Logistic regression is usually fitted by the maximum likelihood method, which has better statistical properties than a least-squares fit. The fitting process is skipped here; the rest of this post only shows how to apply the logistic regression algorithm in R.
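As a rough illustration of the maximum-likelihood fit just mentioned, the following sketch minimizes the negative log-likelihood directly with `optim()` on synthetic data and compares the result with `glm()`. The data and coefficient values here are made up for illustration only:

```r
# Illustrative sketch: fitting a logistic model by maximizing the likelihood
# directly, then checking the result against glm(). Data are synthetic.
set.seed(1)
n <- 5000
x <- rnorm(n)
p <- 1 / (1 + exp(-(0.5 + 2 * x)))   # true intercept 0.5, true slope 2
y <- rbinom(n, 1, p)

# Negative log-likelihood of the logistic model
negll <- function(theta) {
  eta <- theta[1] + theta[2] * x
  -sum(y * eta - log(1 + exp(eta)))
}

ml  <- optim(c(0, 0), negll)$par              # direct numerical maximization
fit <- coef(glm(y ~ x, family = binomial))    # R's built-in fitter

# Both routes recover roughly the same coefficients
round(ml, 2)
round(unname(fit), 2)
```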
II. Applying logistic regression

1. The data set
We use the `Smarket` data set from the `ISLR` package. Let's take a look at the structure of the data set:
```r
> summary(Smarket)
      Year           Lag1                Lag2
 Min.   :2001   Min.   :-4.922000   Min.   :-4.922000
 1st Qu.:2002   1st Qu.:-0.639500   1st Qu.:-0.639500
 Median :2003   Median : 0.039000   Median : 0.039000
 Mean   :2003   Mean   : 0.003834   Mean   : 0.003919
 3rd Qu.:2004   3rd Qu.: 0.596750   3rd Qu.: 0.596750
 Max.   :2005   Max.   : 5.733000   Max.   : 5.733000
      Lag3                Lag4                Lag5
 Min.   :-4.922000   Min.   :-4.922000   Min.   :-4.92200
 1st Qu.:-0.640000   1st Qu.:-0.640000   1st Qu.:-0.64000
 Median : 0.038500   Median : 0.038500   Median : 0.03850
 Mean   : 0.001716   Mean   : 0.001636   Mean   : 0.00561
 3rd Qu.: 0.596750   3rd Qu.: 0.596750   3rd Qu.: 0.59700
 Max.   : 5.733000   Max.   : 5.733000   Max.   : 5.73300
     Volume           Today           Direction
 Min.   :0.3561   Min.   :-4.922000   Down:602
 1st Qu.:1.2574   1st Qu.:-0.639500   Up  :648
 Median :1.4229   Median : 0.038500
 Mean   :1.4783   Mean   : 0.003138
 3rd Qu.:1.6417   3rd Qu.: 0.596750
 Max.   :3.1525   Max.   : 5.733000
```
`Smarket` contains 1250 daily stock-market returns from 2001 through 2005. `Year` is the year of the observation, `Lag1` through `Lag5` are the returns for the previous five trading days, `Today` is today's return, and `Direction` is the market direction, either `Up` or `Down`.
First look at the correlation coefficients of each variable:
```r
library(corrplot)
corrplot(corr = cor(Smarket[,-9]), order = "AOE", type = "upper", tl.pos = "d")
corrplot(corr = cor(Smarket[,-9]), add = TRUE, type = "lower", method = "number",
         order = "AOE", diag = FALSE, tl.pos = "n", cl.pos = "n")
```
It can be seen that `Volume` and `Year` have a relatively large correlation coefficient, indicating that trading volume grows with the years, while the correlations among the other variables are negligible. This matches common sense: a stock's historical data carry very little information about its future, so it is difficult to predict the future market accurately with supervised learning. But as an application tutorial for the algorithm, let's try anyway.
2. Train and test the logistic regression model
The logistic regression model is a generalized linear model, so it is fitted with the `glm()` function, with the additional argument `family = binomial`.
```r
> attach(Smarket)
> # data before 2005 as the training set; 2005 data as the test set
> train = Year < 2005
> # fit a logistic regression model on the training set
> glm.fit = glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
+               data = Smarket, family = binomial, subset = train)
> # predict on the test set; type = "response" returns only the probabilities
> glm.probs = predict(glm.fit, newdata = Smarket[!train,], type = "response")
> # classify as Up or Down according to the probability
> glm.pred = ifelse(glm.probs > 0.5, "Up", "Down")
> # the actual movements in 2005
> Direction.2005 = Smarket$Direction[!train]
> # compare predictions with actual values
> table(glm.pred, Direction.2005)
        Direction.2005
glm.pred Down Up
    Down   77 97
    Up     34 44
> # prediction accuracy
> mean(glm.pred == Direction.2005)
[1] 0.4801587
```
The prediction accuracy is only 0.48, worse than blind guessing. Let's try to adjust the model.
```r
> # check the model summary
> summary(glm.fit)

Call:
glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
    Volume, family = binomial, data = Smarket, subset = train)

Deviance Residuals:
   Min      1Q  Median      3Q     Max
-1.302  -1.190   1.079   1.160   1.350

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  0.191213   0.333690   0.573    0.567
Lag1        -0.054178   0.051785  -1.046    0.295
Lag2        -0.045805   0.051797  -0.884    0.377
Lag3         0.007200   0.051644   0.139    0.889
Lag4         0.006441   0.051706   0.125    0.901
Lag5        -0.004223   0.051138  -0.083    0.934
Volume      -0.116257   0.239618  -0.485    0.628

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1383.3  on 997  degrees of freedom
Residual deviance: 1381.1  on 991  degrees of freedom
AIC: 1395.1

Number of Fisher Scoring iterations: 3
```
You can see that all of the variables have large p-values; none is significant. As mentioned in the earlier section on linear regression, a smaller AIC indicates a better model, and the AIC here (1395.1) is still fairly large.
Adding a predictor that is independent of the response tends to increase the test error rate (such a predictor increases the model's variance without a corresponding reduction in bias), so removing such predictors may improve the model.
The p-values of `Lag1` and `Lag2` in the model above are noticeably smaller than those of the other variables, so we retrain the model using only these two.
```r
> glm.fit = glm(Direction ~ Lag1 + Lag2,
+               data = Smarket, family = binomial, subset = train)
> glm.probs = predict(glm.fit, newdata = Smarket[!train,], type = "response")
> glm.pred = ifelse(glm.probs > 0.5, "Up", "Down")
> table(glm.pred, Direction.2005)
        Direction.2005
glm.pred Down  Up
    Down   35  35
    Up     76 106
> mean(glm.pred == Direction.2005)
[1] 0.5595238
> 106/(76+106)
[1] 0.5824176
```
The overall accuracy of the model reaches 56%, which finally shows the statistical model predicting better than guessing (albeit only a little). From the confusion matrix (rows are predicted values, columns are actual values), when the model predicts a fall it is right 50% of the time (35/70), and when it predicts a rise it is right 58% of the time (106/182).
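The per-class numbers above can be computed directly from the confusion matrix; a small sketch, re-entering the matrix by hand:

```r
# Re-enter the confusion matrix from above: rows = predicted, columns = actual
cm <- matrix(c(35, 76, 35, 106), nrow = 2,
             dimnames = list(pred = c("Down", "Up"), actual = c("Down", "Up")))

overall  <- sum(diag(cm)) / sum(cm)                  # overall accuracy
acc_down <- cm["Down", "Down"] / sum(cm["Down", ])   # accuracy when predicting Down
acc_up   <- cm["Up", "Up"]     / sum(cm["Up", ])     # accuracy when predicting Up

round(c(overall, acc_down, acc_up), 3)   # 0.560 0.500 0.582
```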
Now use this model to predict two new observations:
```r
> predict(glm.fit, newdata = data.frame(Lag1 = c(1.2, 1.5), Lag2 = c(1.1, -0.8)),
+         type = "response")
        1         2
0.4791462 0.4960939
```
It can be seen that for the two points (Lag1, Lag2) = (1.2, 1.1) and (1.5, −0.8), both predicted probabilities are below 0.5, so the model predicts that the stock will fall in both cases. Note that, unlike linear regression, logistic regression predictions do not come with confidence (or prediction) intervals, so adding the `interval` argument to `predict()` has no effect.
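Although there is no `interval` option, one common workaround (not from the original post) is to request standard errors on the link (log-odds) scale with `se.fit = TRUE` and map a normal-approximation band back through the sigmoid. A sketch on synthetic stand-in data, since the idea does not depend on the `Smarket` set:

```r
# Sketch: an approximate 95% confidence band for the predicted probability,
# built from standard errors on the link scale. Data here are synthetic
# stand-ins with the same column names as the Smarket example.
set.seed(42)
d <- data.frame(Lag1 = rnorm(200), Lag2 = rnorm(200))
d$Direction <- rbinom(200, 1, plogis(-0.1 * d$Lag1 - 0.1 * d$Lag2))

fit <- glm(Direction ~ Lag1 + Lag2, data = d, family = binomial)

new <- data.frame(Lag1 = c(1.2, 1.5), Lag2 = c(1.1, -0.8))
pr  <- predict(fit, newdata = new, type = "link", se.fit = TRUE)

# Normal-approximation band on the link scale, mapped back via the sigmoid
lower <- plogis(pr$fit - 1.96 * pr$se.fit)
upper <- plogis(pr$fit + 1.96 * pr$se.fit)
cbind(prob = plogis(pr$fit), lower, upper)
```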