Data Mining with the R Language (Part 7)


Time Series and Data Mining

I. Experiment Description

1. Environment Login

The system logs in automatically without a password. The system user name is shiyanlou, and the password is shiyanlou.

2. Introduction to the Environment

This lab uses an Ubuntu Linux environment with a desktop. The following tools will be used in the experiment:

1. LX Terminal (lxterminal): a Linux command-line terminal; opening it starts a bash session in which Linux commands can be used
2. GVim: a handy editor; for basic usage, refer to the course "Vim Editor"
3. R: enter 'R' at the command line to start the interactive environment; the code below runs in that environment
4. Data: enter the following commands in the terminal:

# Download the data
wget http://labfile.oss.aliyuncs.com/courses/360/synthetic_control.tar.gz
# Unzip the data into the current folder
tar zxvf synthetic_control.tar.gz
3. Using the Environment

Enter the code and files required for the experiment in the R interactive environment, and run the required commands in the LX Terminal (lxterminal).

After completing the experiment, you can click "Experiment" at the top of the desktop to save your results and share them to Weibo, showing friends your learning progress. Shiyanlou also provides a back-end system that verifies that you have genuinely completed the experiment.

The experiment records page can be viewed from the My Home page; it lists each experiment with its notes, as well as the effective learning time for each experiment (the time spent operating the experiment desktop; if there is no activity, the system records it as idle time). These records serve as authentic proof of your study.

II. Course Introduction

1. Time series data in R
2. Decomposing a time series into trend, seasonal, and irregular components
3. Building an ARIMA model in R and using it to predict future data
4. Introducing dynamic time warping (DTW), then performing hierarchical clustering of time series using both the DTW distance and the Euclidean distance
5. Three examples of time series classification: one using the raw data, one using data after the discrete wavelet transform (DWT), and one using k-NN classification

III. Course Content

1. Time Series Data in R

The class ts represents data sampled at equally spaced points in time. When the parameter frequency=7, the sampling frequency is weekly; similarly, frequency values of 12 and 4 generate monthly and quarterly data, respectively. The specific implementation is as follows:

# Generate the integers 1-30; frequency=12 means monthly data, starting from March 2011
> a <- ts(1:30, frequency=12, start=c(2011,3))
> print(a)
# Display the internal structure of the time series
> str(a)
# Output the attributes of the time series
> attributes(a)

The results of the implementation are as follows:

2. Time Series Decomposition

Time series decomposition separates a time series into trend, seasonal, cyclical, and irregular components. The trend component represents the long-term trend; the seasonal component captures seasonal fluctuations; the cyclical component captures periodic fluctuations; and the irregular component is the residual.

Here is an example of time series decomposition using the dataset AirPassengers, the Box & Jenkins international airline passenger data from 1949 to 1960, which contains 144 observations.

> plot(AirPassengers)
# Preprocess the data into a monthly time series
> apts <- ts(AirPassengers, frequency=12)
# Decompose the time series with the function decompose()
> f <- decompose(apts)
# Seasonal figures
> f$figure
> plot(f$figure, type="b", xaxt="n", xlab="")
# Get the month names in the current locale
> monthNames <- months(ISOdate(2011, 1:12, 1))
# Label the x axis with the month names
# side=1 selects the x axis, at=1:12 places the ticks, las=2 rotates the labels
> axis(1, at=1:12, labels=monthNames, las=2)
> plot(f)

The results are as follows:

In the figure above, the 'observed' panel shows the original time series; the second panel shows the upward trend in the data; the third, seasonal panel shows that the data are affected by seasonal factors; and the last panel shows the series after the trend and seasonal components have been removed.

Exercise: What other R packages and functions can decompose a time series? Try those functions and compare their decomposition results.
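
For instance, the function stl() in the base 'stats' package performs a seasonal-trend decomposition using Loess. A minimal sketch (not part of the original lab; the s.window setting is our choice):

# Seasonal-trend decomposition of AirPassengers using Loess
> fit <- stl(AirPassengers, s.window="periodic")
# Plot the seasonal, trend, and remainder components
> plot(fit)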

3. Time Series Prediction

Time series prediction forecasts future events based on historical data. A typical example is predicting a stock's opening price from its historical data. Two classical models are used in time series prediction: the autoregressive moving average model (ARMA) and the autoregressive integrated moving average model (ARIMA).

Below, an ARIMA model is fitted to univariate time series data and then used for prediction.

# The parameter order is (p,d,q); p=1 means one autoregressive term; the list gives the seasonal part of the model
> fit <- arima(AirPassengers, order=c(1,0,0), list(order=c(2,1,0), period=12))
# Forecast the next 24 months
> fore <- predict(fit, n.ahead=24)
# Upper and lower bounds (U, L) of the 95% confidence interval
> U <- fore$pred + 2*fore$se
> L <- fore$pred - 2*fore$se
# col=c(1,2,4,4): line colours black, red, blue, blue
# lty=c(1,1,2,2): the first two lines are solid, the last two dashed
> ts.plot(AirPassengers, fore$pred, U, L, col=c(1,2,4,4), lty=c(1,1,2,2))
> legend("topleft", c("Actual", "Forecast", "Error Bounds (95% Confidence)"),
+        col=c(1,2,4), lty=c(1,1,2))

The forecast results are as follows:
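
As an aside, the 'forecast' package (an assumption here; it is not used in the original lab) can select the ARIMA orders automatically. A minimal sketch:

> library(forecast)
# Automatically select the (p,d,q)(P,D,Q) orders by AICc
> fit2 <- auto.arima(AirPassengers)
# Plot a 24-month forecast with prediction intervals
> plot(forecast(fit2, h=24))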

4. Time Series Clustering

Time series clustering groups time series data by density and distance, so that the time series within one class are similar to each other. There are many metrics for measuring distance and density, such as the Euclidean distance, the Manhattan distance, the maximum norm, the Hamming distance, the angle between two vectors (inner product), and the dynamic time warping (DTW) distance.
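
Most of these metrics are available in R through the function dist(). A minimal sketch comparing three of them on two toy series (the series here are illustrative only):

> x <- sin(seq(0, 2*pi, len=100))
> y <- cos(seq(0, 2*pi, len=100))
# dist() computes distances between the rows of a matrix
> m <- rbind(x, y)
> dist(m, method="euclidean")
> dist(m, method="manhattan")
# The maximum norm
> dist(m, method="maximum")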

4.1 Dynamic Time Warping Distance

Dynamic time warping finds the optimal alignment between two time series; the algorithm is implemented in the package 'dtw'. In that package, the function dtw(x, y, ...) computes the dynamic time warp and finds the optimal alignment between the time series x and y.

Code implementation:

> library(dtw)
# Generate 100 evenly spaced points over the range 0 to 2*pi
> idx <- seq(0, 2*pi, len=100)
> a <- sin(idx) + runif(100)/10
> b <- cos(idx)
> align <- dtw(a, b, step=asymmetricP1, keep=TRUE)
> dtwPlotTwoWay(align)

The optimal alignment between a and b is shown below:

4.2 Synthetic Control Chart Time Series

The synthetic control chart time series dataset 'synthetic_control.data' is stored in the current working directory '/home/shiyanlou'. It contains 600 synthetic control chart series, each a time series of 60 observations. The series fall into 6 classes:
- 1-100: Normal;
- 101-200: Cyclic;
- 201-300: Upward trend;
- 301-400: Downward trend;
- 401-500: Upward shift;
- 501-600: Downward shift.

First, the data is preprocessed:

> sc <- read.table("synthetic_control.data", header=FALSE, sep="")
# Show the first sample of each class
> idx <- c(1, 101, 201, 301, 401, 501)
> sample1 <- t(sc[idx,])

The synthetic control chart sample series are plotted as follows:

4.3 Hierarchical Clustering with Euclidean Distance

First, 10 samples are randomly selected from each class of the synthetic control chart data above:

> set.seed(6218)
> n <- 10
> s <- sample(1:100, n)
> idx <- c(s, 100+s, 200+s, 300+s, 400+s, 500+s)
> sample2 <- sc[idx,]
> observedLabels <- rep(1:6, each=n)
# Hierarchical clustering with Euclidean distance
> hc <- hclust(dist(sample2), method="average")
> plot(hc, labels=observedLabels, main="")
# Cut the dendrogram into 6 clusters
> rect.hclust(hc, k=6)
> memb <- cutree(hc, k=6)
> table(observedLabels, memb)

The clustering results are compared with the actual classification:

Figure 4.1

In Figure 4.1, cluster 1 is entirely correct; cluster 2 is less clean, with 1 observation assigned to cluster 1, 2 to cluster 3, and 1 to cluster 4. The Upward trend (class 3) and Upward shift (class 5) series are not well separated, and likewise the Downward trend (class 4) and Downward shift (class 6) series are not well distinguished.

4.4 Hierarchical Clustering with DTW Distance

The implementation code is as follows:

> library(dtw)
> distMatrix <- dist(sample2, method="DTW")
> hc <- hclust(distMatrix, method="average")
> plot(hc, labels=observedLabels, main="")
> rect.hclust(hc, k=6)
> memb <- cutree(hc, k=6)
> table(observedLabels, memb)

The clustering results are as follows:

Figure 4.2

Comparing Figures 4.1 and 4.2, the latter clusters noticeably better: for measuring the similarity of time series, the DTW distance outperforms the Euclidean distance.

5. Time Series Classification

Time series classification builds a classification model from time series that have already been labeled, and then uses the model to predict the classes of unlabeled time series. Extracting new features from the time series can improve the performance of the classification model. Feature extraction methods include singular value decomposition (SVD), the discrete Fourier transform (DFT), the discrete wavelet transform (DWT), and piecewise aggregate approximation (PAA).
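
As an illustration of one of these methods, the discrete Fourier transform of a series can be computed with base R's fft(). A minimal sketch (the choice of 10 coefficients is arbitrary, and sc is the synthetic control data loaded earlier):

# Take one series from the synthetic control data
> x <- as.numeric(sc[1,])
# Discrete Fourier transform via the fast Fourier transform
> coefs <- fft(x)
# Use the amplitudes of the first 10 coefficients as features
> features <- Mod(coefs[1:10])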

5.1 Classification with Raw Data

We use the function ctree() from the package 'party' to classify the raw time series data. The implementation code is as follows:

# Add class labels to the raw dataset
> classId <- rep(as.character(1:6), each=100)
> newSc <- data.frame(cbind(classId, sc))
> library(party)
# The control parameters limit the shape and size of the decision tree
> ct <- ctree(classId ~ ., data=newSc,
+            controls = ctree_control(minsplit=30, minbucket=10, maxdepth=5))
> pClassId <- predict(ct)
> table(classId, pClassId)
# Classification accuracy
> (sum(classId == pClassId)) / nrow(sc)
> plot(ct, ip_args=list(pval=FALSE), ep_args=list(digits=0))

Output Decision Tree:

5.2 Classification with Extracted Features

Next, we use the DWT (discrete wavelet transform) to extract features from the time series and build a classification model. The wavelet transform can handle data at various frequencies, which makes it adaptive.

An example of extracting DWT coefficients is shown below. The discrete wavelet transform is implemented in R by the package 'wavelets'. The function dwt() in that package computes the discrete wavelet coefficients; its three main parameters x, filter, and boundary specify, respectively, the univariate or multivariate time series, the wavelet filter to use, and the boundary method for the decomposition. The function returns the wavelet coefficients and the scaling coefficients. The original time series can be recovered via the inverse discrete wavelet transform, function idwt().

> library(wavelets)
> wtData <- NULL
# Loop over all time series
> for (i in 1:nrow(sc)) {
+   a <- t(sc[i,])
+   wt <- dwt(X=a, filter="haar", boundary="periodic")
+   wtData <- rbind(wtData, unlist(c(wt@W, wt@V[[wt@level]])))
+ }
> wtData <- as.data.frame(wtData)
> wtSc <- data.frame(cbind(classId, wtData))
# Build a decision tree on the DWT coefficients; the control parameters limit the tree's shape and size
> ct <- ctree(classId ~ ., data=wtSc,
+            controls = ctree_control(minsplit=30, minbucket=10, maxdepth=5))
> pClassId <- predict(ct)
# Compare the true classes with the predictions
> table(classId, pClassId)
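
Since idwt() was mentioned above, here is a quick check (not part of the original lab) that the transform is invertible for a single series:

# Transform one series and reconstruct it with the inverse DWT
> a <- t(sc[1,])
> wt <- dwt(X=a, filter="haar", boundary="periodic")
> a2 <- idwt(wt)
# The reconstruction should match the original up to numerical error
> max(abs(a - a2))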

5.3 k-NN Classification

The k-NN algorithm can also be used to classify time series. The algorithm proceeds as follows:
- find the k nearest neighbours of the new object;
- assign the new object to the class that occurs most often among those k neighbours.

However, this direct nearest neighbour search has time complexity O(n^2), where n is the size of the data. An efficient index is therefore needed when working with large datasets. The package 'RANN' provides a fast nearest neighbour search that reduces the complexity to O(n log n).
The following implements the k-NN algorithm without an index:

> k <- 20
# Create a new time series by adding noise to series 501 (each series has 60 observations)
> newTS <- sc[501,] + runif(60)*15
# Compute DTW distances between the new series and the original dataset
> distances <- dist(newTS, sc, method="DTW")
# Sort the distances in ascending order
> s <- sort(as.vector(distances), index.return=TRUE)
# s$ix[1:k] gives the indices of the k nearest neighbours; tabulate their classes
> table(classId[s$ix[1:k]])

The output results are as follows:

The output table shows that 19 of the 20 nearest neighbours belong to class 6, so the new time series is assigned to the sixth class.
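
For comparison, here is a minimal sketch of an indexed search with nn2() from the 'RANN' package mentioned above. Note that nn2() uses the Euclidean distance rather than DTW, so its neighbours may differ from those found above:

> library(RANN)
# Fast k-nearest-neighbour search backed by a kd-tree (Euclidean distance)
> nn <- nn2(data=sc, query=newTS, k=20)
# Classes of the 20 nearest neighbours
> table(classId[nn$nn.idx])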

Exercise: Having studied these time series classification methods, consider the advantages and disadvantages of each.

For more data mining materials, please visit Shiyanlou.
