Time Series Complete Tutorial (R)

Brief introduction

In business applications, time is often the most important factor, and using it well can improve the success rate of a forecast. Yet most companies struggle to keep pace with it. With the development of technology, however, there are now many effective methods that let us look into the future. Don't worry, this article does not discuss time machines; everything discussed here is very practical.
This article discusses methods of forecasting. One class of forecasts is tied to time, and the approach for handling time-related data is called a time series model. This model can uncover hidden information in time-related data to support decisions.
The time series model is very useful when we deal with time-ordered data. Most companies analyze next year's sales, site traffic, competitive position and more on the basis of time series data. Yet many analysts do not understand the field of time series analysis.
So, if you do not understand the time series model, this article will introduce you to the processing steps of the time series model and its related techniques.
This article contains content such as the following:
Directory
* 1. Introduction to the time series model
* 2. Using R to explore time series data
* 3. Introduction to the ARMA time series model
* 4. The framework and application of the ARIMA time series model

Let's get started.

1. Introduction to the time series model

This section covers stationary series, random walks, the Rho coefficient, and the Dickey-Fuller test of stationarity. If you don't know these concepts yet, don't worry: each of them is introduced in detail below, and I bet you'll like the introduction.

Stationary series

There are three criteria for judging whether a series is stationary:
1. The mean is a constant independent of time t. The figure below (left) satisfies this condition, while the figure below (right) clearly has a time-dependent mean.

2. The variance is a constant independent of time t. This property is called homoscedasticity. The figure below shows what is and what is not homoscedastic. (Note the varying spread of the distribution on the right-hand side.)

3. The covariance depends only on the time interval k, not on the time t itself. In the figure below (right), notice that the spread of the curve narrows as time increases. The covariance of the red series is therefore not constant over time.
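In symbols, using the same X(t) notation as the rest of this article, these three (weak stationarity) conditions can be written as:

```latex
\begin{aligned}
E[X(t)] &= \mu &&\text{(mean independent of } t\text{)} \\
\mathrm{Var}[X(t)] &= \sigma^2 &&\text{(variance independent of } t\text{)} \\
\mathrm{Cov}[X(t),\, X(t+k)] &= \gamma(k) &&\text{(depends only on the lag } k\text{)}
\end{aligned}
```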
Why should we care about stationary time series?

You cannot build a time series model unless your time series is stationary. In many cases the stationarity condition is not satisfied, so the first thing to do is make the series stationary, and then try to predict it with a stochastic model. There are many ways to stationarize data, such as removing the long-term trend and differencing.

Random walk

This is the most basic concept in time series, and you may well understand it already. However, many people in the industry still regard a random walk as a stationary series. In this section I will use some mathematical tools to clarify the concept. Let's look at an example first.
Example: Imagine a girl moving at random on a giant chessboard. Here, her next position depends only on her previous position.

(Source: http://scifun.chem.wisc.edu/WOP/RandomWalk.html)

Now imagine that you are in a closed room and cannot see the girl, but you want to predict her position at different times. How accurately can you predict it? Of course, your predictions become more and more uncertain as time goes on. At t = 0 you know exactly where the girl is. At the next moment she moves to one of the 8 adjacent squares, so the probability of a correct prediction drops to 1/8. Continuing on, let us now formulate this sequence:

X(t) = X(t-1) + Er(t)

Here Er(t) represents the random disturbance at this point in time: the randomness the girl brings at every step.

Now, if we recurse over all the time points, we finally get the following equation:

X(t) = X(0) + Sum(Er(1), Er(2), Er(3), ..., Er(t))

Now let's try to verify the stationarity assumption for the random walk:
1. Is the mean constant?

E[X(t)] = E[X(0)] + Sum(E[Er(1)], E[Er(2)], E[Er(3)], ..., E[Er(t)])

We know that the expected value of each random disturbance is 0. Therefore: E[X(t)] = E[X(0)] = constant.
2. Is the variance constant?

Var[X(t)] = Var[X(0)] + Sum(Var[Er(1)], Var[Er(2)], Var[Er(3)], ..., Var[Er(t)])
Var[X(t)] = t * Var(Er) = time dependent

Therefore we infer that the random walk is not a stationary process, because its variance varies with time. And if we examine the covariance, we see that it too depends on time. Now let's look at something more interesting.
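The time-varying variance can also be seen numerically. Below is a minimal R sketch (the seed, the number of walks and the number of steps are arbitrary choices) that simulates many random walks and compares the cross-sectional variance at an early and a late time point:

```r
set.seed(42)
n_walks <- 2000   # number of simulated walks
n_steps <- 100    # length of each walk

# each column is one random walk: X(t) = X(t-1) + Er(t), with X(0) = 0
walks <- apply(matrix(rnorm(n_walks * n_steps), n_steps, n_walks), 2, cumsum)

# the variance across walks grows roughly linearly with t,
# matching Var[X(t)] = t * Var(Er)
var_early <- var(walks[10, ])    # variance at t = 10
var_late  <- var(walks[100, ])   # variance at t = 100
print(c(var_early, var_late))
```

With 2000 simulated walks, the variance at t = 100 comes out close to ten times the variance at t = 10, as the derivation predicts.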

We already know that a random walk is a non-stationary process. Let's introduce a new coefficient into the equation and see whether we can derive a criterion for checking stationarity.
Rho coefficient

X(t) = Rho * X(t-1) + Er(t)

Now we vary Rho and see whether we can make the series stationary. Here we just eyeball the plots rather than run a formal stationarity test.
Let's start with Rho = 0, a perfectly stationary series. Here is a plot of the time series:

Increasing the value of Rho to 0.5, we get the following figure:

You may notice that the cycles have become broader, but essentially there does not seem to be any serious violation of the stationarity assumption. Now let's take the more extreme case Rho = 0.9:

We still see that the series returns to zero from extreme values after some interval; this series does not obviously violate stationarity either. Now let's look at the random walk, Rho = 1:
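The four settings discussed here (Rho = 0, 0.5, 0.9, 1) can be regenerated with a short R sketch (seed and length are arbitrary choices), reusing the same disturbances so that only Rho changes between panels:

```r
set.seed(1)
n  <- 200
er <- rnorm(n)   # shared disturbances Er(t)

# simulate X(t) = Rho * X(t-1) + Er(t) for a given Rho, starting from 0
simulate_rho <- function(rho) {
  x <- numeric(n)
  for (t in 2:n) x[t] <- rho * x[t - 1] + er[t]
  x
}

par(mfrow = c(2, 2))
for (rho in c(0, 0.5, 0.9, 1)) {
  plot(simulate_rho(rho), type = "l", ylab = "X(t)",
       main = paste("Rho =", rho))
}
```

Because the disturbances are shared, the widening cycles as Rho approaches 1 come purely from the coefficient, not from luck of the draw.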

This clearly violates the stationarity conditions. What makes Rho = 1 so special, so that this case fails the stationarity test? Let's look for the mathematical reason.
Taking expectations of the formula X(t) = Rho * X(t-1) + Er(t) gives:

E[X(t)] = Rho * E[X(t-1)]

This formula is very meaningful. The next X (at time point t) is pulled toward Rho times the previous value of X.
For example, if X(t-1) = 1, then E[X(t)] = 0.5 (for Rho = 0.5). Now, whenever X moves away from zero, the next step is expected to pull it back toward zero; the only thing that can push it further out is the error term. But when Rho becomes 1, nothing pulls X back toward zero anymore.

Dickey-Fuller test of stationarity

The last thing to learn in this section is the Dickey-Fuller test. In statistics, the Dickey-Fuller test checks whether an autoregressive model has a unit root. Based on the Rho coefficient above, we rearrange the formula into the Dickey-Fuller form:

X(t) = Rho * X(t-1) + Er(t)
=>  X(t) - X(t-1) = (Rho - 1) * X(t-1) + Er(t)

We test whether Rho - 1 is significantly different from 0. If the null hypothesis (Rho - 1 = 0, i.e. a unit root) is rejected, we have a stationary time series.
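As a sanity check of the test itself, here is a small R sketch (assuming the tseries package is installed; seed and lengths are arbitrary) applying adf.test to a clearly stationary series and to a random walk:

```r
library(tseries)   # provides adf.test(); install.packages("tseries") if missing
set.seed(7)

stationary_series <- rnorm(300)           # white noise: no unit root
random_walk       <- cumsum(rnorm(300))   # Rho = 1: has a unit root

# a small p-value rejects the null of a unit root, i.e. suggests stationarity
p_stat <- adf.test(stationary_series)$p.value
p_walk <- adf.test(random_walk)$p.value
print(c(p_stat, p_walk))
```

For white noise the test rejects the unit-root null decisively, while for the random walk the p-value is typically large, so the null cannot be rejected.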
Testing for stationarity, and converting a series into a stationary one, are the most important parts of time series modeling. So keep all the concepts mentioned in this section in mind before moving on to the next one.
Let's look at an example of a time series.

2. Using R to explore time series

In this section we will learn how to handle time series in R. Here we are only exploring the series; we will not build a time series model yet.
The data used in this section is R's built-in dataset AirPassengers: the monthly number of international airline passengers from 1949 to 1960.

The following code loads the dataset and lets us take a first look at it.

> data(AirPassengers)
# load the dataset
> class(AirPassengers)
[1] "ts"
# AirPassengers is a time series ("ts") object
> start(AirPassengers)
[1] 1949 1
# the series starts in January 1949
> end(AirPassengers)
[1] 1960 12
# the series ends in December 1960
> frequency(AirPassengers)
[1] 12
# the frequency of the series is 12 months per year
> summary(AirPassengers)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  104.0   180.0   265.5   280.3   360.5   622.0
# the passenger counts are distributed across this range
> plot(AirPassengers)
# plot the time series
> abline(reg = lm(AirPassengers ~ time(AirPassengers)))
# fit a trend line

> cycle(AirPassengers)
     Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
1949   1   2   3   4   5   6   7   8   9  10  11  12
1950   1   2   3   4   5   6   7   8   9  10  11  12
1951   1   2   3   4   5   6   7   8   9  10  11  12
1952   1   2   3   4   5   6   7   8   9  10  11  12
1953   1   2   3   4   5   6   7   8   9  10  11  12
1954   1   2   3   4   5   6   7   8   9  10  11  12
1955   1   2   3   4   5   6   7   8   9  10  11  12
1956   1   2   3   4   5   6   7   8   9  10  11  12
1957   1   2   3   4   5   6   7   8   9  10  11  12
1958   1   2   3   4   5   6   7   8   9  10  11  12
1959   1   2   3   4   5   6   7   8   9  10  11  12
1960   1   2   3   4   5   6   7   8   9  10  11  12
# print the month index (cycle) of each observation
> plot(aggregate(AirPassengers, FUN = mean))
# plot the yearly mean to see the trend
> boxplot(AirPassengers ~ cycle(AirPassengers))
# box plot of passengers by month to see the seasonality

Important inferences: the year-on-year trend shows that the number of travellers increases every year. The mean and variance in July and August are much higher than in the other months. The mean of each month differs, but within a month the variance is small. We can therefore see a strong seasonality, with a cycle of 12 months or less.

Visualizing the data is the most important part of building a time series model; without it, you will not know whether the series is stationary. As in this example, we already know many details of the model before fitting anything.
Next we will build some time series models, discuss their characteristics, and make predictions.

3. The ARMA time series model

ARMA is short for the autoregressive moving average model, which is widely used for time series. In an ARMA model, AR stands for autoregression and MA for moving average. If these terms sound complicated, don't worry: the next few minutes will briefly explain these concepts.
We will now get a feel for the characteristics of these models. Before you begin, remember: AR and MA models cannot be applied to non-stationary series.
If you obtain a non-stationary series in practice, the first thing to do is turn it into a stationary series (through differencing or transformation), and then choose from the available time series models.
First, this article introduces the two models separately (AR and MA); then we examine their characteristics.

Autoregressive (AR) time series model

Let's understand the AR model through the following example:
Suppose a country's current GDP, X(t), depends on last year's GDP, X(t-1). The hypothesis is that this year's GDP depends on last year's GDP plus this year's newly opened factories and services, but that it depends largely on last year's GDP.
The formula for GDP is then:

X(t) = Alpha * X(t-1) + Er(t)       (1)

This is the AR(1) formula. Formula (1) says that the next point depends on the previous point. Alpha is a coefficient chosen to minimize the error; note that X(t-1) in turn depends on X(t-2).
For example, let X(t) be the amount of juice sold in a city on a given day. In winter, very few vendors sell juice. One day the temperature suddenly rises, and the demand for juice soars to 1000. A few days later, the temperature drops again. But since people got used to drinking juice on the hot days, 50% of them still drink juice on the colder days. Over the following days the proportion drops to 25% (50% of 50%), and then gradually to a very small number. The following figure illustrates the inertia of an AR series:

Moving average (MA) time series model
Moving Average time series model

Next, another example for the moving average model.
Suppose a company produces a certain kind of bag. As the market is competitive, sales of the bag start from zero. One day the company runs an experiment: it designs and manufactures a different bag that cannot be bought anywhere else. Suppose the total market demand is 1000 such bags. One day, demand for this bag is particularly high and the inventory almost runs out; at the end of the day, 100 bags are still unsold. We call this shortfall the error at that time point. Over the next few days, a few more customers still buy this bag. A simple formula describing this scenario is:

X(t) = Beta * Er(t-1) + Er(t)

If we plot this, it looks like the figure below:

Notice the difference between the MA and AR models: in the MA model the noise/shock dies out quickly, while in the AR model it has a long-lasting effect.

The difference between the AR and MA models

The main difference between the AR and MA models lies in the correlation between values of the series at different time points.
The MA model expresses the current value as a linear combination of the random disturbances (forecast errors) of past periods; for lags n > q, the correlation between X(t) and X(t-n) is always 0. The AR model, in contrast, explains the forecast target purely through the historical observations of the series itself; because its variables do not depend on external regressors, it avoids the variable-selection and multicollinearity difficulties of ordinary regression forecasting. In the AR model, the correlation between X(t) and X(t-n) decays gradually as n grows. This difference should be exploited when identifying a model.

Using ACF and PACF plots

Once we have a stationary time series, we must answer two crucial questions:
Q1: Is this an AR or an MA process?
Q2: What order of AR or MA process do we need?

To answer these two questions we use two statistics for the series X(t): the lag-k sample autocorrelation coefficient (ACF) and the lag-k sample partial autocorrelation coefficient (PACF). The formulas are omitted here.
ACF and PACF of the AR model:
It can be shown by calculation that:
- The ACF of an AR process tails off: whatever the lag k, the ACF value is related to the autocorrelations at lags 1 through p.
- The PACF of an AR process cuts off: PACF = 0 for lags k > p.

The blue lines in the figure above mark the threshold for values significantly different from 0. Clearly, the PACF plot above cuts off at the second lag, which means this is an AR(2) process.
ACF and PACF of the MA model:
- The ACF of an MA process cuts off: ACF = 0 for lags k > q.
- The PACF of an MA process tails off: whatever the lag k, the PACF value is related to the autocorrelations at lags 1 through q.

Clearly, the ACF above cuts off at the second lag, so this can be regarded as an MA(2) process.
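These identification rules can be verified on simulated data. The R sketch below (the orders and coefficients are arbitrary choices) generates an AR(2) and an MA(2) series with arima.sim and draws their ACF/PACF plots:

```r
set.seed(123)

# AR(2): the PACF should cut off after lag 2, the ACF should tail off
ar2 <- arima.sim(model = list(ar = c(0.6, 0.3)), n = 500)

# MA(2): the ACF should cut off after lag 2, the PACF should tail off
ma2 <- arima.sim(model = list(ma = c(0.6, 0.3)), n = 500)

par(mfrow = c(2, 2))
acf(ar2,  main = "AR(2): ACF tails off")
pacf(ar2, main = "AR(2): PACF cuts off at lag 2")
acf(ma2,  main = "MA(2): ACF cuts off at lag 2")
pacf(ma2, main = "MA(2): PACF tails off")
```

Running this and comparing the four panels is a quick way to internalize which plot cuts off for which model.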
So far, this article has introduced how to recognize the type of a stationary series from its ACF and PACF plots. Now I'll introduce the overall framework of a time series model, and then discuss a practical application.

4. The framework and application of the ARIMA time series model

So far we have covered the basic concepts of time series models, explored time series with R, and introduced the ARMA model. Now let's organize these scattered pieces and do something really interesting.

Framework

The framework below shows, step by step, how to do a time series analysis.

The first three steps were discussed in the sections above; still, here is a brief recap.

Step 1: Visualize the time series

Before building any kind of time series model, it is important to analyze its trends. The details we are interested in include the trends, cycles, seasonality and random behavior in the series. Section 2 of this article covered this.

Step 2: Stationarize the series

Once we know the patterns, trends and cycles, we can check whether the series is stationary. The Dickey-Fuller test is a very popular test; it was introduced in section 1. But that's not the end of it: what do we do if we find the series is non-stationary?
There are three commonly used techniques for making a time series stationary.
1. Detrending: here we simply remove the trend component from the series. For example, if the equation of my time series is:

X(t) = (mean + trend * t) + Er(t)

then I simply drop the trend * t part of the formula above and build the model X(t) = mean + Er(t).
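As a concrete illustration of detrending (a sketch with a made-up series; the coefficients are arbitrary), a linear trend can be estimated with lm() and subtracted:

```r
set.seed(5)
tt <- 1:120
x  <- 10 + 0.5 * tt + rnorm(120)    # X(t) = mean + trend * t + Er(t)

trend_fit <- lm(x ~ tt)             # estimate the trend component
detrended <- residuals(trend_fit)   # X(t) with the trend removed

# the detrended series fluctuates around a constant mean,
# so the X(t) = mean + Er(t) model now applies
plot(detrended, type = "l")
```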
2. Differencing: this technique is commonly used to remove non-stationarity. Here we build the model on the differences of the series rather than on the series itself. For example:

X(t) - X(t-1) = ARMA(p, q)

This differencing is the "I" (integrated) part of ARIMA. We now have three parameters:

p: the AR order
d: the differencing order (I)
q: the MA order
3. Seasonality: seasonality can be included directly in the ARIMA model. We'll see this in the application section below.

Step 3: Find the optimal parameters

The parameters p and q can be found using the ACF and PACF plots. Beyond that, if both the autocorrelation (ACF) and the partial autocorrelation (PACF) decay only gradually, it indicates that the series still needs to be stationarized, and we introduce the d parameter.

Step 4: Fit the ARIMA model

Having found these parameters, we can now fit the ARIMA model. The values found in the previous step may only be approximate estimates, so we need to explore more combinations of (p, d, q). The model parameters with the smallest BIC and AIC are what we want. We can also try some seasonal components, if we noticed anything seasonal in the ACF/PACF plots.

Step 5: Forecast

At this stage we have an ARIMA model and can make predictions. We can also visualize the trend and cross-validate the forecasts.

Application of the time series model

Here we reuse the earlier example and make predictions with this time series. We recommend observing the data before proceeding to the next step. Where do we start?

The figure below plots the number of passengers over the years. Look at this plot before reading on.

Here are my observations:
1. There is a trend: the number of passengers increases year over year.
2. The series appears seasonal, with a cycle of no more than 12 months.
3. The variance of the data increases year by year.
We need to solve two problems before testing for stationarity. First, we need to stabilize the variance; here we take the logarithm of the series. Second, we need to remove the trend of the series; we do this by differencing. Now let's test the stationarity of the resulting series.

> library(tseries)
# if adf.test is not found, install the package first:
# install.packages("tseries"), then library(tseries)
> adf.test(diff(log(AirPassengers)), alternative = "stationary", k = 0)

    data:  diff(log(AirPassengers))
    Dickey-Fuller = -9.6003, Lag order = 0,
    p-value = 0.01
    alternative hypothesis: stationary

We can see that this series is stationary enough for any time series model.
The next step is to find the right parameters for the ARIMA model. We have already seen that d is 1: we need one difference to make the series stationary. Now we draw the correlation plots. Here is the ACF plot of the series.

> acf(log(AirPassengers))
# ACF plot

What can we see from the plot above?

It is clear that the ACF decays very slowly, which means the series is not stationary. As discussed before, we intend to run the regression on the differences of the log series, rather than on the log series itself. Let's look at the ACF and PACF curves of the differenced series.

> acf(diff(log(AirPassengers)))
> pacf(diff(log(AirPassengers)))


Clearly the ACF cuts off after the first lag, so we know the value of p should be 0, and the value of q should be 1 or 2. After several iterations, we found that AIC and BIC are smallest when (p, d, q) is (0, 1, 1).
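The "several iterations" can be scripted. Below is a sketch (the candidate set is an arbitrary choice) that compares a few (p, d, q) orders by AIC while keeping the seasonal part fixed:

```r
data(AirPassengers)

# candidate (p, d, q) orders to compare
candidates <- list(c(0, 1, 1), c(0, 1, 2), c(1, 1, 1), c(1, 1, 2))

aics <- sapply(candidates, function(ord) {
  AIC(arima(log(AirPassengers), order = ord,
            seasonal = list(order = c(0, 1, 1), period = 12)))
})
names(aics) <- sapply(candidates, paste, collapse = ",")
print(sort(aics))   # smaller AIC is better
```

The same loop can be repeated with BIC() to double-check the choice.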

> fit <- arima(log(AirPassengers), c(0, 1, 1), seasonal = list(order = c(0, 1, 1), period = 12))
> pred <- predict(fit, n.ahead = 10*12)
> ts.plot(AirPassengers, 2.718^pred$pred, log = "y", lty = c(1, 3))

Postscript

I took part in a time series competition when my understanding of time series stopped at Markov models. Since Markov models are not well suited to pure time series problems, I searched the internet for material on time series and stumbled upon this article. Its logic is clear, its expression plain, and its text easy to understand, so I decided to translate it. There are many things in the middle I did not understand perfectly; if anything is wrong, please correct me.

Resources

A Complete Tutorial on Time Series Modeling in R
Time series
Chapter 8: Time series analysis
