A time series is a sequence of data points ordered in time, usually sampled at equal time intervals. If the intervals are not equal, each data point is typically labeled with its timestamp.
The following example uses the airline passenger data set, which is commonly used in time series work: the number of passengers per month over 11 years, in thousands.
If you want to try another data set, you can visit here: https://datamarket.com/data/list/?q=provider:tsdl
It is clear from the plot that the airline passenger data is very regular.
Time series data mining mainly includes decomposition (analyzing the components of the data, such as trend and seasonality), forecasting (predicting future values), classification (extracting features from ordered sequences and classifying them), clustering (grouping similar series), and so on.
This article focuses on forecasting: given historical data, how to accurately predict future values.
Start with a simple approach. Given a time series, what is the simplest idea to predict the next value?
(1) Mean (average): the predicted future value is simply the average of all historical values.
(2) Exponential smoothing (exponentially decaying weights): when averaging, each historical point can be given a different weight, and the most natural choice is to give more recent points larger weights: ŷ_{T+1} = α·y_T + α(1−α)·y_{T−1} + α(1−α)²·y_{T−2} + ⋯
Or, in a more convenient recursive notation, with a hat over a variable denoting an estimated value: ŷ_{T+1} = α·y_T + (1−α)·ŷ_T.
(3) Seasonal naive (snaive): assuming the period of the data is known, use the value at the corresponding time in the previous period as the prediction for the same time in the next period.
(4) Drift: the value of the last point plus the average trend of the data, i.e. the forecast continues the line from the first observation to the last one. (A short R sketch of these four baselines follows.)
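As a rough illustration, here is a minimal sketch of these four baselines using the forecast package, with R's built-in AirPassengers series standing in for the data (the variable names and the 12-month horizon are illustrative choices, not from the original article):

library(forecast)

pt <- AirPassengers                     # built-in monthly airline passenger series
h  <- 12                                # forecast one year ahead

f_mean   <- meanf(pt, h=h)              # (1) mean of all historical values
f_ses    <- ses(pt, h=h, alpha=0.2)     # (2) simple exponential smoothing
f_snaive <- snaive(pt, h=h)             # (3) seasonal naive: repeat the last period
f_drift  <- rwf(pt, h=h, drift=TRUE)    # (4) last value plus the average trend

plot(f_snaive)                          # inspect, e.g., the seasonal-naive forecast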
Having introduced the simplest methods, let's turn to the two most powerful classical algorithms for time series: Holt-Winters and ARIMA. The methods above are all special cases of these two.
(5) Holt-Winters: third-order (triple) exponential smoothing
The idea of Holt-Winters is to decompose the data into three components: the average level, the trend, and the seasonality. In R, the simple function stl() can decompose the original data:
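For instance, a one-line decomposition sketch (again using the built-in AirPassengers series as a stand-in):

# Seasonal Decomposition of Time Series by Loess:
# splits the series into seasonal, trend and remainder components.
plot(stl(AirPassengers, s.window="periodic"))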
The first-order Holt-Winters method assumes the data is stationary and is just ordinary exponential smoothing. The second-order method assumes the data contains a trend, which can be additive (a linear trend) or multiplicative (a nonlinear trend); the two differ only slightly in the formulas. The third-order method adds a seasonal component on top of the second-order assumptions, and the seasonal component can likewise be additive or multiplicative. For example, if every February is 1,000 higher than an ordinary month, the seasonality is additive; if every February is 120% of the usual level, it is multiplicative.
R has a Holt-Winters implementation, so we can try it out. I use the data from the first ten years to predict the last year's data. Performance is measured with RMSE, though you can of course use other metrics:
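A minimal sketch of that experiment might look like the following; rmse() is assumed to come from the Metrics package, and the built-in AirPassengers series (12 years, 1949-1960) stands in for the article's data, so the split here is all but the last year versus the last year:

library(forecast)
library(Metrics)                                # provides rmse(actual, predicted)

pt    <- AirPassengers
train <- window(pt, start=1949, end=1959+11/12) # all but the last year
test  <- window(pt, start=1960)                 # the last year

pred_hw <- hw(train, h=12, seasonal="multiplicative")  # triple exponential smoothing
rmse(test, pred_hw$mean)

plot(pred_hw)
lines(test, col="red")                          # overlay the actual values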
The forecast results are as follows:
The results are quite good.
(6) ARIMA: AutoRegressive Integrated Moving Average
ARIMA is a combination of two models, AR and MA. The (ARMA) formula is as follows:
X_t = c + ε_t + Σ_{i=1..p} φ_i·X_{t−i} + Σ_{i=1..q} θ_i·ε_{t−i}
where ε_t is white noise with mean 0 and c is a constant. The first sum, Σ φ_i·X_{t−i}, is the autoregressive (AR) part; the second sum, Σ θ_i·ε_{t−i}, is the moving average (MA) part. AR is effectively an infinite impulse response (IIR) filter and MA a finite impulse response (FIR) filter, each driven by white noise.
The I in ARIMA stands for integrated (differencing). ARIMA(p, d, q) denotes a p-order AR term, d-fold differencing, and a q-order MA term. Why difference the data? The premise of ARIMA is that the data is stationary, i.e. its statistical properties (mean, variance, correlation, and so on) do not vary with the time window. Mathematically, the joint distribution is unchanged by a shift in time:
F(x_{t1}, …, x_{tk}) = F(x_{t1+τ}, …, x_{tk+τ}) for every k and every shift τ.
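To make the white-noise view concrete, a stationary ARMA process can be simulated directly in R (the coefficients below are arbitrary illustrative values, not taken from the article):

set.seed(42)
# ARMA(1,1): white noise passed through an AR filter (infinite impulse response)
# plus an MA filter (finite impulse response).
x <- arima.sim(model=list(ar=0.7, ma=0.3), n=200)
plot(x)    # a stationary-looking series
acf(x)     # the autocorrelation decays quickly for a stationary ARMA process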
Of course, data often does not meet this requirement, such as the airline passenger data here. There are several ways to transform the original data to make it stationary:
(1) Differencing, i.e. the "integrated" part. For example, the first-order difference subtracts the previous value from each value in the original sequence; the second-order difference differences the first-order differences again. This is the most commonly recommended approach (see the short example after this list).
(2) Roughly fit the original data with some function and model the residuals with ARIMA. For example, first fit the trend of the airline passenger data with a straight line, so the data becomes the offset of each point from that line, and then fit these offsets with ARIMA.
(3) Take the logarithm or the square root of the original data. This is very effective when the variance is not constant.
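A short sketch of transformations (1) and (3) on the passenger data (AirPassengers as a stand-in; the choice to combine a log, a first difference, and a seasonal difference is illustrative):

lp  <- log(AirPassengers)    # (3) the log stabilises the growing variance
d1  <- diff(lp)              # (1) the first-order difference removes the trend
d12 <- diff(d1, lag=12)      # an additional seasonal difference removes the yearly cycle
plot(d12)                    # the result looks much more stationary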
How can we tell whether the data is stationary? Two very common tools are used here: the ACF (autocorrelation function) and the PACF (partial autocorrelation function). For non-stationary data, the ACF plot does not decay to 0, or decays to 0 only slowly. Below are three ACF plots, corresponding to the original data, the first-order differenced data, and the first-order differenced data with the seasonal component removed:
Once stationarity is ensured, the values of p and q are determined next. These two values again depend on the ACF and PACF:
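In R these plots come from acf() and pacf(); here is a sketch covering the three cases mentioned above (AirPassengers as a stand-in for the article's data):

pt <- AirPassengers
acf(pt)                        # original data: ACF decays very slowly (non-stationary)
acf(diff(pt))                  # after a first-order difference
acf(diff(diff(pt), lag=12))    # after an additional seasonal difference
pacf(diff(diff(pt), lag=12))   # PACF of the (roughly) stationary series, used to pick p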
After determining p and q, you can call the arima() function in R. It is also worth mentioning two very powerful functions in R: ets() and auto.arima(). The user does not need to do anything; these two functions automatically pick the most appropriate model for the data.
The performance of each algorithm in R is as follows:
The code is as follows:
library(forecast)
library(Metrics)   # assumed source of the rmse() helper used below

passenger <- read.csv('Passenger.csv', header=F, sep=' ')
p  <- unlist(passenger)
pt <- ts(p, frequency=12, start=2001)
plot(pt)

# train on the first ten years, test on the last year
train <- window(pt, start=2001, end=2010+11/12)
test  <- window(pt, start=2011)

# (1) mean
pred_meanf <- meanf(train, h=12)
rmse(test, pred_meanf$mean)            # 226.2657

# naive: repeat the last observed value
pred_naive <- naive(train, h=12)
rmse(pred_naive$mean, test)            # 102.9765

# (3) seasonal naive
pred_snaive <- snaive(train, h=12)
rmse(pred_snaive$mean, test)

# (4) drift
pred_rwf <- rwf(train, h=12, drift=T)
rmse(pred_rwf$mean, test)

# (2) simple exponential smoothing
pred_ses <- ses(train, h=12, initial='simple', alpha=0.2)
rmse(pred_ses$mean, test)

# second-order Holt (the source also fixes beta at a specific value, unreadable here)
pred_holt <- holt(train, h=12, damped=F, initial='simple')
rmse(pred_holt$mean, test)

# (5) Holt-Winters with multiplicative seasonality
pred_hw <- hw(train, h=12, seasonal='multiplicative')
rmse(pred_hw$mean, test)

# automatic exponential smoothing model selection
fit <- ets(train)
accuracy(forecast(fit, h=12), test)

# STL decomposition plus forecast
pred_stlf <- stlf(train)
rmse(pred_stlf$mean, test)
plot(stl(train, s.window="periodic"))  # Seasonal Decomposition of Time Series by Loess

# (6) automatic ARIMA
fit <- auto.arima(train)
accuracy(forecast(fit, h=12), test)

# manually specified seasonal ARIMA
ma <- arima(train, order=c(0,1,3), seasonal=list(order=c(0,1,3), period=12))
p  <- predict(ma, 12)
accuracy(p$pred, test)

# Ljung-Box test on the residuals (lag chosen illustratively; value in the source is unreadable)
bt <- Box.test(ma$residuals, lag=20, type="Ljung-Box", fitdf=2)