Time Series Analysis Algorithms in R (Detailed)

Source: Internet
Author: User


https://www.analyticsvidhya.com/blog/2015/12/complete-tutorial-time-series-modeling/

http://www.cnblogs.com/ECJTUACM-873284962/p/6917031.html

Introduction

In business, time is often the most important factor affecting success, yet most companies struggle to keep pace with it. Fortunately, there are now many effective methods for forecasting the future. Don't worry: this article will not discuss time machines, only very practical techniques.
This article discusses methods for forecasting. One class of prediction problems is time-dependent, and the approach to modeling such time-dependent data is called a time series model. These models can uncover hidden information in time-indexed data to support decision-making.
The time series model is useful whenever we deal with sequentially ordered data. Most companies rely on time series data to analyze next year's sales, website traffic, competitive position, and more. Yet many analysts do not understand the field of time series analysis.
So, if you are not familiar with time series models, this article will walk you through the modeling process step by step, along with the related techniques.
This article covers the following:
Directory
* 1. Introduction to time series models
* 2. Exploring time series data with R
* 3. Introduction to the ARMA time series model
* 4. The framework and application of the ARIMA time series model

Let's get started.

1. Introduction to time series models

Let's begin. This section covers stationary series, random walks, the rho coefficient, and the Dickey-Fuller test of stationarity. If you are not familiar with these concepts, don't worry: each one is explained in detail below, and I bet you will enjoy the presentation.

Stationary Sequence

There are three criteria for judging whether a series is stationary:
1. The mean is a constant, independent of time t. The series on the left satisfies this condition of a stationary series; the series on the right is clearly time-dependent.

2. The variance is a constant, independent of time t. This property is called homoscedasticity. The figure shows a homoscedastic series and one that is not. (Note the differing spread on the right-hand side.)

3. The covariance between X(t) and X(t+k) depends only on the lag k, not on the time t. In the figure on the right, the curve bunches closer and closer together as time increases, so the covariance of the red series is not constant.

Why should we care about stationary time series?

You cannot build a time series model unless your time series is stationary. In many cases the stationarity condition is not satisfied, so the first thing to do is make the series stationary, and then try to model it with stochastic models. There are several ways to do this, such as removing long-term trends and differencing.

Random Walk

This is one of the most basic concepts in time series. You may understand it well already, but many practitioners still mistake a random walk for a stationary series. In this section I will use some mathematics to clarify the concept. Let's look at an example first.
Example: imagine a girl moving randomly on a giant chessboard. Here, her next position depends only on her current position.

(Source: http://scifun.chem.wisc.edu/WOP/RandomWalk.html)

Now imagine you are in a closed room and cannot see the girl, but you want to predict her location at different times. How well can you predict it? Naturally, your predictions get worse over time. At time t=0 you know exactly where the girl is. At the next step she can move to any of the 8 adjacent squares, so the probability of predicting her position correctly drops to 1/8. Continuing this reasoning, we can formulate the sequence:

X(t) = X(t-1) + Er(t)

Here Er(t) is the random disturbance at time t: the randomness the girl introduces at each point in time.

Now, if we apply this recursion across all time points, we finally get the following equation:

X(t) = X(0) + Sum(Er(1), Er(2), Er(3), ..., Er(t))

Now, let's try to verify the stationarity assumptions for the random walk:
1. Is the mean constant?

E[X(t)] = E[X(0)] + Sum(E[Er(1)], E[Er(2)], E[Er(3)], ..., E[Er(t)])

We know that the expected value of each random disturbance term is 0. Therefore: E[X(t)] = E[X(0)] = constant.
2. Is the variance constant?

Var[X(t)] = Var[X(0)] + Sum(Var[Er(1)], Var[Er(2)], Var[Er(3)], ..., Var[Er(t)])

Var[X(t)] = t * Var(Error), which is time dependent.

Therefore, we conclude that a random walk is not a stationary process, because its variance changes with time. If we examine the covariance, we also find that it depends on time.
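The derivation above is easy to confirm by simulation. Below is a small illustrative sketch (in Python with NumPy, an addition to this writeup rather than part of the original R tutorial) that generates many independent random walks and checks that the mean stays near X(0) while the variance grows roughly as t * Var(Error):

```python
import numpy as np

rng = np.random.default_rng(0)
n_paths, n_steps = 20000, 100
sigma = 1.0  # standard deviation of the disturbance Er(t)

# X(t) = X(t-1) + Er(t), with X(0) = 0, simulated for many independent paths
steps = rng.normal(0.0, sigma, size=(n_paths, n_steps))
walks = np.cumsum(steps, axis=1)

var_t10 = float(walks[:, 9].var())    # empirical Var[X(10)]
var_t100 = float(walks[:, 99].var())  # empirical Var[X(100)]

# Variance grows linearly with t: near 10 at t=10, near 100 at t=100
print(round(var_t10, 1), round(var_t100, 1))
```

The empirical mean at every t stays near 0 (the starting value), while the variance keeps growing, which is exactly the non-stationarity derived above.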

Let's look at something more interesting.

We already know that a random walk is a non-stationary process. Let's introduce a new coefficient into the equation and see whether we can make the series stationary.
Rho coefficient

X(t) = Rho * X(t-1) + Er(t)

Now we will vary rho and see whether we can make the series stationary. Here we are only eyeballing the plots, not conducting a formal stationarity test.
Let's start with a perfectly stationary series, rho = 0. Here is a plot of the series:

Increasing rho to 0.5, we get the following:

You may notice that the cycles are getting longer, but there does not seem to be any serious violation of the stationarity assumption. Now let's take the more extreme case, rho = 0.9:

We still see that the series returns from extreme values toward zero after some interval of time. It does not obviously violate stationarity. Now, let's look at rho = 1, the random walk:

This clearly violates the stationarity conditions. What makes rho = 1 so special that the series fails the stationarity test? Let's find the mathematical reason.
Taking the expectation of the formula X(t) = Rho * X(t-1) + Er(t):

E[X(t)] = Rho * E[X(t-1)]

This formula is very meaningful. The next value X (at time t) is pulled toward Rho times the value of the previous X.
For example, if X(t-1) = 1 and rho = 0.5, then E[X(t)] = 0.5. Whenever the series moves away from zero, the expected next step pulls it part of the way back toward zero; only the error term can push the expectation away. But when Rho becomes 1, there is no pull back toward zero at all.
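The effect of rho can also be seen empirically. Here is a Python sketch (an illustration added for this writeup; the coefficient values 0.0, 0.5, 0.9, and 1.0 mirror the plots discussed above). For |rho| < 1 the variance settles near sigma^2 / (1 - rho^2), while for rho = 1 it keeps growing with t:

```python
import numpy as np

rng = np.random.default_rng(1)

def ar1(rho, n=5000, sigma=1.0):
    """Simulate X(t) = rho * X(t-1) + Er(t), starting from X(0) = 0."""
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = rho * x[t - 1] + rng.normal(0.0, sigma)
    return x

for rho in (0.0, 0.5, 0.9, 1.0):
    x = ar1(rho)
    # For |rho| < 1 the variance settles near sigma^2 / (1 - rho^2);
    # for rho = 1 (the random walk) it keeps growing with t instead.
    print(rho, round(float(x[2500:].var()), 1))
```

For rho = 0.9 the variance is still finite (about 1 / (1 - 0.81), roughly 5), but for rho = 1 the second half of the path has a far larger spread than any stationary case.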

Dickey-Fuller Test of Stationarity

The last concept to study here is the Dickey-Fuller test. In statistics, the Dickey-Fuller test checks whether an autoregressive model has a unit root. A small adjustment to the rho equation above converts the formula into the Dickey-Fuller test:

X(t) = Rho * X(t-1) + Er(t)

=> X(t) - X(t-1) = (Rho - 1) * X(t-1) + Er(t)

We test whether Rho - 1 is significantly different from 0. If the null hypothesis (a unit root, Rho - 1 = 0) is rejected, we have a stationary time series.
Stationarity testing, and transforming a series into a stationary one, are the most important parts of time series modeling. It is therefore worth remembering all the concepts in this section before moving on to the next one.
Let's take a look at the time series examples.

2. Using R to explore time series

In this section we will learn how to process time series in R. Here we only explore the series; we do not yet build a model.
The data used in this section is R's built-in dataset AirPassengers: the monthly number of international airline passengers from 1949 to 1960.

Loading the dataset

The following code loads the dataset and lets us take a first look at the data.

> data(AirPassengers)
> class(AirPassengers)
[1] "ts"
# Check the class of AirPassengers: it is time series ("ts") data
> start(AirPassengers)
[1] 1949 1
# The time at which the AirPassengers data starts
> end(AirPassengers)
[1] 1960 12
# The time at which the AirPassengers data ends
> frequency(AirPassengers)
[1] 12
# The frequency of the time series: 12 months per year
> summary(AirPassengers)
Min. 1st Qu. Median Mean 3rd Qu. Max.
104.0 180.0 265.5 280.3 360.5 622.0

Viewing the data in detail

# Plot the number of passengers over time
> plot(AirPassengers)
# Draw the time series
> abline(reg = lm(AirPassengers ~ time(AirPassengers)))
# Fit and draw a trend line

A few more useful operations:

> cycle(AirPassengers)
# Print the cycle (month) of each observation

     Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
1949   1   2   3   4   5   6   7   8   9  10  11  12
1950   1   2   3   4   5   6   7   8   9  10  11  12
1951   1   2   3   4   5   6   7   8   9  10  11  12
1952   1   2   3   4   5   6   7   8   9  10  11  12
1953   1   2   3   4   5   6   7   8   9  10  11  12
1954   1   2   3   4   5   6   7   8   9  10  11  12
1955   1   2   3   4   5   6   7   8   9  10  11  12
1956   1   2   3   4   5   6   7   8   9  10  11  12
1957   1   2   3   4   5   6   7   8   9  10  11  12
1958   1   2   3   4   5   6   7   8   9  10  11  12
1959   1   2   3   4   5   6   7   8   9  10  11  12
1960   1   2   3   4   5   6   7   8   9  10  11  12

> plot(aggregate(AirPassengers, FUN = mean))
# Aggregate the cycles to display a year-on-year trend
> boxplot(AirPassengers ~ cycle(AirPassengers))
# A box plot across months gives us a sense of the seasonal effect

Important Inferences

    1. The yearly trend shows that the number of passengers is increasing every year.
    2. July and August have a much higher mean and variance than the other months.
    3. The mean differs from month to month, but the month-to-month variance is small. Hence there is strong seasonality, with a period of 12 months or less.

Viewing and exploring the data is the most important part of building a time series model: without this step you will not know whether the series is stationary. As in this example, we already know many details of the series.
Next we will build some time series models, examine their characteristics, and also make predictions.

3. The ARMA time series model

ARMA stands for the autoregressive moving average model, which is widely used for time series. In ARMA, AR represents autoregression and MA represents the moving average. If these terms sound complicated, don't worry: these concepts will be explained in a few minutes' time.
We will now get a feel for the characteristics of these models. Before starting, remember that AR and MA models apply only to stationary series.
In practice you may get a non-stationary series; the first thing to do is to turn it into a stationary one (through differencing or transformation), and then choose a time series model.
First, this article introduces the two models (AR and MA) separately. Let's look at their characteristics.

Autoregressive Time Series Model

Let's understand the AR model through the following example:
Suppose a country's current GDP, X(t), depends on last year's GDP, X(t-1). The assumption is that this year's GDP depends on last year's gross domestic product plus the new factories and services opened this year, but largely on last year's GDP.
So the formula for GDP is:

X(t) = Alpha * X(t-1) + Error(t)

This equation is the AR(1) formula. It says that the next point depends on the previous point plus a disturbance. Alpha is a coefficient chosen to minimize the prediction error. Note that X(t-1) in turn depends on X(t-2), and so on.

For example, let X(t) be the amount of juice sold in a city on a given day. In winter, few suppliers stock juice. One day the temperature suddenly rises and the demand for juice soars to 1000 units. A few days later the temperature drops again. People drink juice on hot days, but suppose 50% of them still drink it in the cooler weather the next day. In the following days the proportion drops to 25% (50% of 50%), and then gradually to a very small number. This illustrates the inertia of an AR series:

Moving Average time series model

Now another example, this time for moving averages.
Suppose a company produces a certain kind of bag. In a competitive market, sales of the bag start from zero. One day the company experiments with a new design, which does not sell steadily at every point in time. Assume the total market demand is 1000 such bags. One day demand was especially high and the inventory nearly ran out; by the end of the day, 100 more bags were ordered than were in stock. We call this overshoot the error at that time point. A few customers still come back for the bag over the next few days. A simple formula describing this scenario is:

X(t) = Beta * Error(t-1) + Error(t)

If we plot this series, it looks like:

Notice the difference between the MA and AR models? In the MA model, the effect of a noise shock dies out quickly; in the AR model, a shock has a long-lasting effect.
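This contrast can be made concrete with a tiny Python sketch (illustrative only; the coefficient 0.5 is an arbitrary choice): feed a single unit shock into an AR(1) and an MA(1) process and watch how long it persists.

```python
import numpy as np

n = 20
shock = np.zeros(n)
shock[1] = 1.0  # a single unit shock at t = 1, no other noise

# AR(1): X(t) = 0.5 * X(t-1) + Er(t) -- the shock decays geometrically
ar = np.zeros(n)
for t in range(1, n):
    ar[t] = 0.5 * ar[t - 1] + shock[t]

# MA(1): X(t) = 0.5 * Er(t-1) + Er(t) -- the shock is gone after one step
ma = shock + 0.5 * np.concatenate(([0.0], shock[:-1]))

print(np.round(ar[1:6], 3))  # decays: 1, 0.5, 0.25, 0.125, ...
print(np.round(ma[1:6], 3))  # vanishes: 1, 0.5, 0, 0, 0
```

The AR response never reaches exactly zero (like the juice drinkers above), while the MA response is exactly zero after one lag (like the one-off overshoot in bag orders).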

The difference between the AR model and the MA model

The main difference between the AR and MA models is the correlation between values of the series at different time points.
The MA model expresses the current value as a linear combination of the random disturbances (prediction errors) of the past few periods. For any lag n greater than the MA order, the correlation between X(t) and X(t-n) is always 0. The AR model, in contrast, captures the influence of the relevant factors on the prediction target through the series' own historical observations, which avoids the difficulties of choosing independent variables and of multicollinearity that arise in ordinary regression forecasting. In the AR model, the correlation between X(t) and X(t-n) shrinks gradually as the lag grows. This difference should be exploited.

Using ACF and PACF plots

Once we have a stationary time series, we must answer the two most important questions:
Q1: Is this an AR or an MA process?
Q2: What order of AR or MA process do we need?

To answer these two questions we use two statistics:
the lag-k sample autocorrelation function (ACF) of the time series X(t), and the lag-k sample partial autocorrelation function (PACF). The formulas are omitted here.
ACF and PACF of the AR model:
It can be shown by calculation that:
- The ACF of an AR(p) process tails off: no matter the lag k, the ACF remains related to the autocorrelations at lags 1 through p.
- The PACF of an AR(p) process cuts off: PACF = 0 for lags k > p.

The blue lines mark values significantly different from 0. The PACF plot above clearly cuts off after the second lag, which means this is an AR(2) process.
ACF and PACF of the MA model:
- The ACF of an MA(q) process cuts off: ACF = 0 for lags k > q.
- The PACF of an MA(q) process tails off: no matter the lag k, the PACF remains related to the autocorrelations at lags 1 through q.

Clearly, the ACF plot above cuts off after the second lag, so this is an MA(2) process.
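To make the cut-off behavior tangible, here is a Python sketch (illustrative, with an assumed MA(1) coefficient of 0.5) that computes the sample ACF by hand for a simulated MA(1) series. The lag-1 autocorrelation should sit near theta / (1 + theta^2) = 0.4, and all higher lags near zero:

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_acf(x, max_lag):
    """Sample autocorrelation r(k) = c(k) / c(0),
    where c(k) is the lag-k sample autocovariance."""
    x = x - x.mean()
    c0 = (x @ x) / len(x)
    return np.array([(x[:-k] @ x[k:]) / len(x) / c0
                     for k in range(1, max_lag + 1)])

# MA(1): X(t) = Er(t) + 0.5 * Er(t-1) -- the ACF should cut off after lag 1
e = rng.normal(size=100000)
x = e[1:] + 0.5 * e[:-1]

acf = sample_acf(x, 4)
print(np.round(acf, 2))  # lag 1 near 0.4, lags 2-4 near 0
```

An AR series fed through the same function would instead show autocorrelations decaying gradually over many lags, which is the tailing-off pattern described above.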
So far, this article has shown how to use ACF and PACF plots to identify the type of a stationary series. Now I'll describe the overall framework of a time series model, and then discuss its practical application.

4. The framework and application of the ARIMA time series model

In the earlier sections we quickly introduced the basic concepts of time series models, used R to explore a time series, and met the ARMA model. Now let's pull these scattered pieces together and do something very interesting.

Framework

The framework shows, stage by stage, how to "do a time series analysis":

The first three steps were discussed in the previous sections. Nonetheless, here is a brief recap:

Step one: Time series visualization

Before building any type of time series model, it is critical to analyze its trends. The details we are interested in include trend, periodicity, seasonality, and random behavior in the series. This was covered in part 2 of this article.

Step two: Stationarize the series

Once we know the patterns, trends, and cycles, we can check whether the series is stationary. The Dickey-Fuller test is a very popular test, and part 1 of this article introduced it. And it doesn't end there: what if the series turns out to be non-stationary?
Here are the three most commonly used techniques for making a time series stationary.
1. Detrending: here we simply remove the trend component from the time series. For example, if the equation of my time series is:

X(t) = (mean + trend * t) + Error

then I simply drop the trend * t part and build the model X(t) = mean + Error.
2. Differencing: this technique is commonly used to remove non-stationarity. Here we build the model on the differenced series rather than on the original series. For example:

X(t) - X(t-1) = ARMA(p, q)

This differencing is the "I" (integrated) part of ARIMA. We now have 3 parameters:

p: the AR order

d: the differencing (I) order

q: the MA order

3. Seasonality: seasonality can be incorporated directly into the ARIMA model. We will discuss this in the application section below.
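The first two techniques can be sketched in a few lines of Python (an illustration with made-up numbers; the mean 10 and trend 0.5 are arbitrary assumptions, not values from this article's data):

```python
import numpy as np

rng = np.random.default_rng(4)
t = np.arange(200)

# A series with mean 10 and a linear trend of 0.5 per step, plus noise:
# X(t) = (mean + trend * t) + Error
x = 10 + 0.5 * t + rng.normal(0.0, 1.0, size=len(t))

# Technique 1 -- detrending: fit the linear trend and subtract it
slope, intercept = np.polyfit(t, x, 1)
detrended = x - (intercept + slope * t)

# Technique 2 -- differencing: model X(t) - X(t-1) instead of X(t)
differenced = np.diff(x)

print(round(float(slope), 2))               # close to the true trend, 0.5
print(round(float(detrended.mean()), 2))    # near 0 after detrending
print(round(float(differenced.mean()), 2))  # near 0.5, the trend per step
```

Both transformed series are stationary in the mean; the differenced series still carries the trend as a constant offset, which is why differencing removes a linear trend so effectively.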

Step three: Find the optimal parameters

The parameters p and q can be found using the ACF and PACF plots. In addition, if both the autocorrelation (ACF) and partial autocorrelation (PACF) coefficients decay only gradually, this indicates that the series still needs to be made stationary, and the d parameter is introduced.

Step four: Build the ARIMA model

Having found these parameters, we can now fit the ARIMA model. The values found in the previous step may be only rough estimates, so we need to explore more (p, d, q) combinations; the combination with the smallest BIC and AIC is the one we want. We can also try seasonal components if we notice seasonality in the ACF/PACF plots.
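The "try several orders and keep the one with the smallest information criterion" step can be sketched as follows (a Python illustration on simulated data, using BIC computed from a simple least-squares AR fit rather than a full ARIMA routine):

```python
import numpy as np

rng = np.random.default_rng(5)

# Simulate an AR(2) process: X(t) = 0.6 * X(t-1) - 0.3 * X(t-2) + Er(t)
n = 2000
x = np.zeros(n)
e = rng.normal(size=n)
for t in range(2, n):
    x[t] = 0.6 * x[t - 1] - 0.3 * x[t - 2] + e[t]

def ar_bic(x, p):
    """Fit AR(p) by least squares and return the BIC:
    n * log(RSS / n) + log(n) * p."""
    y = x[p:]
    X = np.column_stack([x[p - i:-i] for i in range(1, p + 1)])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(((y - X @ beta) ** 2).sum())
    return float(len(y) * np.log(rss / len(y)) + np.log(len(y)) * p)

bics = {p: round(ar_bic(x, p), 1) for p in (1, 2, 3)}
print(bics)  # the true order, 2, should score lowest (or nearly so)
```

The same idea extends to a grid over (p, d, q): fit each candidate, record its AIC/BIC, and keep the minimizer, which is exactly what the R fit at the end of this article settles on for the AirPassengers data.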

Step five: Forecast

At this point we have the ARIMA model and can make predictions. We can also visualize the forecast trend and cross-validate it.

Application of the time series model

Here we return to the earlier AirPassengers example and use this series to make predictions. We recommend observing the data before proceeding to the next step.

Where do we start?

Below is the plot of the number of passengers over the years. Study this plot before reading on.

Here are my observations:
1. The number of passengers increases every year.
2. The series appears to be seasonal, with each cycle no longer than 12 months.
3. The variance of the data increases every year.
We need to address two issues before testing for stationarity. First, we need to stabilize the variance, so we take the logarithm of the series. Second, we need to remove the trend, which we do by differencing the series. Now let's test the stationarity of the resulting series.

adf.test(diff(log(AirPassengers)), alternative = "stationary", k = 0)
# If this function is missing, install the package with install.packages("tseries")
# and load it with library(tseries)

Augmented Dickey-Fuller Test

data: diff(log(AirPassengers))
Dickey-Fuller = -9.6003, Lag order = 0,
p-value = 0.01
alternative hypothesis: stationary

We can see that the resulting series is stationary enough to build a time series model on.
The next step is to find the right parameters for the ARIMA model. We already know that d is 1: one difference is needed to make the series stationary. Next we draw the correlation plots. Here is the ACF plot of the series.

# ACF plot

acf(log(AirPassengers))

What do you see in the plot above?

Clearly the ACF decays very slowly, which means the passenger series is not stationary. As discussed earlier, we intend to regress on the differenced log series rather than on the log series itself. Let's look at the ACF and PACF curves of the differenced series.

acf(diff(log(AirPassengers)))

pacf(diff(log(AirPassengers)))

Clearly the ACF cuts off after the first lag, so we know p should be 0 and q should be 1 or 2. After several iterations, we find that (p, d, q) = (0, 1, 1) gives the smallest AIC and BIC.

(fit <- arima(log(AirPassengers), c(0, 1, 1), seasonal = list(order = c(0, 1, 1), period = 12)))
pred <- predict(fit, n.ahead = 10 * 12)
ts.plot(AirPassengers, 2.718^pred$pred, log = "y", lty = c(1, 3))

Resources

A Complete Tutorial on Time Series Modeling in R
Time series
Chapter 8: Analysis of Time Series

