The ASA's US presidential election prediction contest
In this election year, the American Statistical Association (ASA) has put together a student contest to predict the exact vote percentage of the winner of the 2016 presidential election. For details, see:
http://thisisstatistics.org/electionprediction2016/
Get Data
A lot of polling data is publicly available online. Data for the presidential election can be found on the following website:
http://projects.fivethirtyeight.com/2016-election-forecast/national-polls/
Other good data sources are:
http://www.realclearpolitics.com/epolls/latest_polls/
http://elections.huffingtonpost.com/pollster/2016-general-election-trump-vs-clinton
http://www.gallup.com/products/170987/gallup-analytics.aspx
Note that the data is updated daily, so the results you see when you run the code may differ from those shown in this article.
Because the original data is a JSON file, R reads it in as a list of lists.
The original code is on GitHub: https://github.com/hardin47/prediction2016/blob/master/predblog.Rmd
## Load the required packages
require(XML)
require(dplyr)
require(tidyr)
require(readr)
require(mosaic)
require(RCurl)
require(ggplot2)
require(lubridate)
require(RJSONIO)

## Pull the data
url = "http://projects.fivethirtyeight.com/2016-election-forecast/national-polls/"
doc <- htmlParse(url, useInternalNodes = TRUE)

# Scrape the page content: find the script node that holds the model data
sc = xpathSApply(doc, "//script[contains(., 'race.model')]",
                 function(x) c(xmlValue(x), xmlAttrs(x)[["href"]]))

# Cut the JSON object out of the JavaScript source
jsobj = gsub(".*race.stateData = (.*);race.pathPrefix.*", "\\1", sc)

data = fromJSON(jsobj)
allpolls <- data$polls

# Unlist the whole thing: pad each poll to the same length, then stack the rows
indx <- sapply(allpolls, length)
pollsdf <- as.data.frame(do.call(rbind, lapply(allpolls, `length<-`, max(indx))))

## Data cleaning
# Unlist the weights
pollswt <- as.data.frame(t(as.data.frame(do.call(cbind, lapply(pollsdf$weight, data.frame,
                                                               stringsAsFactors = FALSE)))))
names(pollswt) <- c("wtpolls", "wtplus", "wtnow")
row.names(pollswt) <- NULL

pollsdf <- cbind(pollsdf, pollswt)

# Unlist the voting answers
indxv <- sapply(pollsdf$votingAnswers, length)
pollsvot <- as.data.frame(do.call(rbind, lapply(pollsdf$votingAnswers,
                                                `length<-`, max(indxv))))
pollsvot1 <- rbind(as.data.frame(do.call(rbind, lapply(pollsvot$V1, data.frame,
                                                       stringsAsFactors = FALSE))))
pollsvot2 <- rbind(as.data.frame(do.call(rbind, lapply(pollsvot$V2, data.frame,
                                                       stringsAsFactors = FALSE))))

# extract_numeric() is deprecated in newer tidyr; readr::parse_number() replaces it
pollsvot1 <- cbind(polltype = rownames(pollsvot1), pollsvot1,
                   polltypeA = gsub('[0-9]+', '', rownames(pollsvot1)),
                   polltype1 = extract_numeric(rownames(pollsvot1)))
pollsvot1$polltype1 <- ifelse(is.na(pollsvot1$polltype1), 1, pollsvot1$polltype1 + 1)

pollsvot2 <- cbind(polltype = rownames(pollsvot2), pollsvot2,
                   polltypeA = gsub('[0-9]+', '', rownames(pollsvot2)),
                   polltype1 = extract_numeric(rownames(pollsvot2)))
pollsvot2$polltype1 <- ifelse(is.na(pollsvot2$polltype1), 1, pollsvot2$polltype1 + 1)

pollsdf <- pollsdf %>%
  mutate(population = unlist(population),
         sampleSize = as.numeric(unlist(sampleSize)),
         pollster = unlist(pollster),
         startDate = ymd(unlist(startDate)),
         endDate = ymd(unlist(endDate)),
         pollsterRating = unlist(pollsterRating)) %>%
  select(population, sampleSize, pollster, startDate, endDate, pollsterRating,
         wtpolls, wtplus, wtnow)

# Repeat each poll row so it lines up with the stacked voting answers
allpolldata <- cbind(rbind(pollsdf[rep(seq_len(nrow(pollsdf)), each = 3), ],
                           pollsdf[rep(seq_len(nrow(pollsdf)), each = 3), ]),
                     rbind(pollsvot1, pollsvot2))

allpolldata <- allpolldata %>%
  arrange(polltype1, choice)
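The `length<-` and do.call(rbind, ...) idiom used above is compact but not obvious. The following is a minimal sketch of what it does, run on a made-up list of lists; the pollsters, values, and the name toy_polls are hypothetical and are not the 538 data. It uses base R only.

```r
# Toy stand-in for the parsed JSON: a list of polls, each itself a list.
# All names and values here are made up for illustration.
toy_polls <- list(
  list(pollster = "A", sampleSize = 900, population = "lv"),
  list(pollster = "B", sampleSize = 1200)   # 'population' is missing
)

# Length of each inner list
indx <- sapply(toy_polls, length)

# Pad every inner list to the longest length (missing slots become NULL),
# then stack the padded lists row-wise into a list-matrix and convert it.
toy_df <- as.data.frame(do.call(rbind, lapply(toy_polls, `length<-`, max(indx))))

# Each column is still a list-column, so unlist() is needed
# to get ordinary atomic vectors for the complete fields.
toy_df$pollster <- unlist(toy_df$pollster)
toy_df$sampleSize <- as.numeric(unlist(toy_df$sampleSize))

str(toy_df)
```

Padding first is what lets rbind line the fields up; unlist() on a column that contains a padded NULL would return a shorter vector, so it is applied here only to the fields every poll has.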
View the full poll data: allpolldata
Quick visualization
Before predicting each candidate's share of the vote in the 2016 U.S. presidential election, it is worth taking a quick look at the data. The cleaned data set is plotted with the ggplot2 package. The plot uses the "now" polls ending after August 1, 2016, with endDate on the x axis, adj_pct on the y axis, color set by choice (one color each for Clinton and Trump), and point size set by wtnow:
## Quick visualization
ggplot(subset(allpolldata, ((polltypeA == "now") & (endDate > ymd("2016-08-01")))),
       aes(y = adj_pct, x = endDate, color = choice)) +
  geom_line() +
  geom_point(aes(size = wtnow)) +
  labs(title = "Vote percentage by date and poll weight\n",
       y = "Percent Vote if Election Today", x = "Poll Date",
       color = "Candidate", size = "538 Poll\nWeight")
Quick analysis
To estimate each candidate's vote percentage from the current polls, each poll is weighted using the 538 weight (wtnow), the sample size (sampleSize), and the number of days since the poll closed (dayssince). The weights are calculated as follows:
wt = wtnow * sqrt(sampleSize) / dayssince
Using these weights, I calculate the weighted average of the predicted vote percentages and its standard error (SE). The SE formula is taken from Cochran (1977).
## Quick analysis
# References
# code found at http://stats.stackexchange.com/questions/25895/computing-standard-error-in-weighted-mean-estimation
# cited from http://www.cs.tufts.edu/~nr/cs257/archive/donald-gatz/weighted-standard-error.pdf
# Donald F. Gatz and Luther Smith, "The standard error of a weighted mean concentration-I. Bootstrapping vs other methods"

weighted.var.se <- function(x, w, na.rm = FALSE)
# Computes the variance of a weighted mean following the Cochran 1977 definition
{
  if (na.rm) { w <- w[i <- !is.na(x)]; x <- x[i] }
  n = length(w)
  xWbar = weighted.mean(x, w, na.rm = na.rm)
  wbar = mean(w)
  out = n / ((n - 1) * sum(w)^2) *
    (sum((w * x - wbar * xWbar)^2)
     - 2 * xWbar * sum((w - wbar) * (w * x - wbar * xWbar))
     + xWbar^2 * sum((w - wbar)^2))
  return(out)
}

# Compute the cumulative mean and the cumulative weighted mean
allpolldata2 <- allpolldata %>%
  filter(wtnow > 0) %>%
  filter(polltypeA == "now") %>%
  mutate(dayssince = as.numeric(today() - endDate)) %>%
  mutate(wt = wtnow * sqrt(sampleSize) / dayssince) %>%
  mutate(votewt = wt * pct) %>%
  group_by(choice) %>%
  arrange(choice, -dayssince) %>%
  mutate(cum.mean.wt = cumsum(votewt) / cumsum(wt)) %>%
  mutate(cum.mean = cummean(pct))

View(allpolldata2)
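To make the weighting step concrete, here is a small worked sketch in base R. The three polls in toy are invented for illustration (they are not 538 data), and the column names simply mirror the ones used in the pipeline above.

```r
# Hypothetical polls for a single candidate, ordered oldest to newest.
toy <- data.frame(
  pct        = c(40, 44, 42),      # reported vote percentage
  wtnow      = c(1.0, 0.5, 2.0),   # stand-in for the 538 weight
  sampleSize = c(400, 900, 1600),  # poll sample size
  dayssince  = c(10, 5, 2)         # days since the poll closed
)

# Same weight formula as in the pipeline:
# larger and more recent polls count for more.
toy$wt <- toy$wtnow * sqrt(toy$sampleSize) / toy$dayssince

# Running weighted mean: cumulative weighted votes over cumulative weight
toy$cum.mean.wt <- cumsum(toy$wt * toy$pct) / cumsum(toy$wt)

# Running unweighted mean for comparison (base R equivalent of dplyr::cummean)
toy$cum.mean <- cumsum(toy$pct) / seq_along(toy$pct)

print(toy)
```

In this toy data the most recent poll gets a weight of 40 against 2 and 3 for the older ones, so the weighted running mean ends close to that poll's 42%. Note that a poll that closed today would have dayssince equal to 0 and an infinite weight, so such rows would need separate handling.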
Visualize cumulative average and weighted average values
## Plot the cumulative mean and the cumulative weighted mean
# Cumulative mean
ggplot(subset(allpolldata2, (endDate > ymd("2016-01-01"))),
       aes(y = cum.mean, x = endDate, color = choice)) +
  geom_line() +
  geom_point(aes(size = wt)) +
  labs(title = "Cumulative Mean Vote Percentage\n",
       y = "Cumulative Percent Vote if Election Today", x = "Poll Date",
       color = "Candidate", size = "Calculated Weight")

# Cumulative weighted mean
ggplot(subset(allpolldata2, (endDate > ymd("2016-01-01"))),
       aes(y = cum.mean.wt, x = endDate, color = choice)) +
  geom_line() +
  geom_point(aes(size = wt)) +
  labs(title = "Cumulative Weighted Mean Vote Percentage\n",
       y = "Cumulative Weighted Percent Vote if Election Today", x = "Poll Date",
       color = "Candidate", size = "Calculated Weight")
Vote percentage Forecast
In addition, the weighted mean and its standard error (Cochran (1977)) can be calculated for each candidate. Using this formula, we can predict the final vote percentage of the main candidates!
pollsummary <- allpolldata2 %>%
  select(choice, pct, wt, votewt, sampleSize, dayssince) %>%
  group_by(choice) %>%
  summarise(mean.vote = weighted.mean(pct, wt, na.rm = TRUE),
            std.vote = sqrt(weighted.var.se(pct, wt, na.rm = TRUE)))

pollsummary
## # A tibble: 2 x 3
##    choice mean.vote  std.vote
##     <chr>     <dbl>     <dbl>
## 1 Clinton  43.48713 0.5073771
## 2   Trump  38.95760 1.0717574
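One way to sanity-check the standard error formula is to note that, with equal weights, it should reduce to the familiar sd(x) / sqrt(n). The sketch below repeats the function so that it runs on its own and checks this on made-up percentages; pct, eq_w, and uneq_w are hypothetical values, not poll data.

```r
# Variance of a weighted mean (Cochran 1977 form), repeated here so the
# example is self-contained.
weighted.var.se <- function(x, w, na.rm = FALSE) {
  if (na.rm) { keep <- !is.na(x); w <- w[keep]; x <- x[keep] }
  n <- length(w)
  xWbar <- weighted.mean(x, w)
  wbar <- mean(w)
  n / ((n - 1) * sum(w)^2) *
    (sum((w * x - wbar * xWbar)^2)
     - 2 * xWbar * sum((w - wbar) * (w * x - wbar * xWbar))
     + xWbar^2 * sum((w - wbar)^2))
}

pct <- c(40, 44, 42, 45)        # made-up poll percentages

# Equal weights: the result should match the classic standard error of the mean
eq_w <- rep(1, length(pct))
se_equal <- sqrt(weighted.var.se(pct, eq_w))
se_classic <- sd(pct) / sqrt(length(pct))

# Unequal (made-up) weights: estimate and its standard error
uneq_w <- c(2, 3, 40, 1)
est <- weighted.mean(pct, uneq_w)
se_uneq <- sqrt(weighted.var.se(pct, uneq_w))

print(c(se_equal = se_equal, se_classic = se_classic))
print(c(estimate = est, se = se_uneq))
```

The three sums in the formula are the expansion of sum(w^2 * (x - xWbar)^2), which is why the result is never negative and why the cross terms vanish when all weights are equal.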
Clearly, the main candidates are Clinton and Trump. Clinton's weighted average vote percentage is higher than Trump's (43.5% versus 39.0%), and her standard error is smaller (0.51 versus 1.07), which suggests that her support is more stable. The eventual winner is therefore probably Clinton, but the larger variability in Trump's numbers does not rule out a Trump victory. Trump's vote share was seen to reach as high as 51%.
Original link: https://www.r-statistics.com/2016/08/presidential-election-predictions-2016/
This article link: http://www.cnblogs.com/homewch/p/5811945.html
Projections for the 2016 presidential election