Tutorials | Kaggle Web Traffic Forecasting First-Place Solution: A Detailed Time Series Forecast, from Model to Code


https://mp.weixin.qq.com/s/JwRXBNmXBaQM2GK6BDRqMw

Selected from GitHub

Artur Suilin

Compiled by Machine Heart

Contributors: Shiyuan, Wall's, Huang

Recently, Artur Suilin released a detailed write-up of the first-place solution to the Kaggle Web Traffic Time Series Forecasting competition. He not only open-sourced the full implementation, but also explained the model and lessons learned in depth. Machine Heart provides a brief overview of the model and insights; for more detail and the full code, see the GitHub project.

GitHub Project Address: https://github.com/Arturus/kaggle-web-traffic

Below we briefly describe how Artur Suilin used GRUs to win the web traffic time series forecasting competition.

The forecast has two main sources of information:

    1. Local features. When we see a trend, we expect it to continue (autoregressive model); when we see a traffic spike, we know it will gradually decay (moving-average model); when we see increased holiday traffic, we expect similar increases on upcoming holidays (seasonal model).

    2. Global features. Looking at the autocorrelation function plot, we notice strong year-over-year and quarter-over-quarter autocorrelation.

I decided to use an RNN seq2seq model for prediction, for the following reasons:

    1. An RNN can be seen as a natural extension of ARIMA models, but far more flexible and expressive.

    2. RNNs are non-parametric, which greatly simplifies learning. Imagine fitting different ARIMA parameters for 145K time series.

    3. Any exogenous feature (numeric or category, time dependent, or sequence dependent) can be easily injected into the model.

    4. Seq2seq is a natural fit for this task: we predict the next value conditioned on the joint probability of previous values, including previous predictions. Using previous predictions keeps the model stable: errors accumulate at every step, and a single extreme prediction could ruin the quality of all subsequent steps.

    5. Deep learning is all the hype right now.

Feature Engineering

RNNs are powerful enough to discover and learn features on their own. The model's feature list is as follows:

    • Pageviews: the raw values are transformed by log1p() to obtain an approximately normal, rather than skewed, distribution of values.

    • Agent, country, site: These features are extracted from the Web page URL and then one-hot encoded.

    • Day of Week: captures the weekly seasonal effect.

    • Year-to-year autocorrelation, Quarter-to-quarter autocorrelation: Capturing the seasonal effects of each year and quarter.

    • Page popularity: high-traffic and low-traffic pages have different traffic-change patterns. This feature (the median of pageviews) helps capture traffic scale, which the pageviews feature itself loses because each pageviews series is independently normalized to zero mean and unit variance.

    • Lagged pageviews: described in detail below.

Feature preprocessing

All features (including one-hot encoded ones) are normalized to zero mean and unit variance. Each pageviews series is normalized independently.
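As a concrete sketch of this per-series preprocessing, here is the log1p transform plus independent standardization in NumPy. The function names and the round-trip helper are illustrative, not from the original repository:

```python
import numpy as np

def normalize_series(pageviews):
    """Log-transform a raw pageviews series, then standardize it
    independently (zero mean, unit variance), as described above."""
    x = np.log1p(pageviews.astype(np.float64))
    mean, std = x.mean(), x.std()
    return (x - mean) / std, mean, std  # keep the stats to invert later

def denormalize(x, mean, std):
    """Invert the transform to recover pageview counts."""
    return np.expm1(x * std + mean)
```

Keeping the per-series mean and standard deviation around is what makes it possible to map the model's normalized outputs back to actual pageview counts.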

Time-independent features (autocorrelations, country, and so on) are "stretched" to the length of the time series, i.e. repeated for each day using the tf.tile() command.
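A minimal NumPy sketch of this "stretching" (the original uses TensorFlow's tf.tile(); np.tile is the direct analogue, and the feature value here is made up):

```python
import numpy as np

# A scalar feature that does not vary over time (e.g. a series'
# year-to-year autocorrelation) is repeated for every day so it can be
# concatenated with the per-day features along the time axis.
autocorr = np.array([0.73])        # hypothetical value, one per series
series_len = 5
stretched = np.tile(autocorr, series_len)   # shape (series_len,)
```

After stretching, every time step carries the same copy of the static feature alongside its day-specific inputs.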

The model is trained on random fixed length samples from the initial timing. For example, if the initial time series is 600 days and we train with a 200-day sample, we can randomly select the sample to start with in the first 400 days.

This sampling works as an effective data augmentation mechanism: at each step the training code randomly chooses a new starting point for every time series, generating an effectively infinite amount of almost non-repeating data.
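A sketch of that sampling step for a single series, assuming a NumPy random generator (the function name is illustrative, not from the original code):

```python
import numpy as np

def sample_training_window(series, window_len, rng):
    """Pick a random fixed-length training sample from one series.
    With a 600-day series and 200-day windows, the start index can
    fall anywhere in the first 400 days."""
    max_start = len(series) - window_len
    start = rng.integers(0, max_start + 1)   # inclusive of max_start
    return series[start:start + window_len]
```

Calling this once per series per training step is what produces the near-unlimited stream of distinct samples described above.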

Core techniques of the model

The model consists of two parts, encoder and decoder.

The encoder is a cuDNN GRU. cuDNN is roughly 5 to 10 times faster than TensorFlow's RNNCells, at the cost of being inconvenient to use and poorly documented.

The decoder is a TensorFlow GRUBlockCell wrapped in tf.while_loop(). The code inside the loop body takes the prediction from the previous step and appends it to the input features of the current time step.
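The decoder loop can be sketched in plain Python as follows. This is a stand-in, not the real GRUBlockCell/tf.while_loop code: `cell` and `project` are toy placeholders, but the feedback of the previous prediction into the next step's input is exactly the mechanism described above:

```python
import numpy as np

def decode(cell, project, h, first_input, exog, horizon):
    """Autoregressive decoding: at each step the previous prediction is
    concatenated with that step's exogenous features and fed to the cell.
    `cell`, `project`, and the shapes here are illustrative stand-ins."""
    prev_y = first_input
    outputs = []
    for t in range(horizon):
        x = np.concatenate([[prev_y], exog[t]])  # feed back last prediction
        h = cell(x, h)                           # recurrent state update
        prev_y = project(h)                      # scalar prediction for step t
        outputs.append(prev_y)
    return np.array(outputs)
```

In the real model the loop is expressed with tf.while_loop() so it runs inside the TensorFlow graph rather than in Python.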

Working with long time series

LSTM/GRU works very well on relatively short sequences (100-300 items). On longer sequences LSTM/GRU still works, but it gradually forgets information from earlier time steps. The Kaggle competition has time series of up to 700+ days, so we needed some way to "strengthen" the GRU's memory.

Our first approach was to use some form of attention. Attention can carry useful information from the distant past to the current RNN cell. For our problem, the simplest and most efficient option is a fixed-weight sliding-window attention. There are two important points in the distant past (given the long-term seasonality): 1 year ago and 1 quarter ago.

We can take the encoder outputs at current_day - 365 and current_day - 90, feed them through a fully connected layer to reduce dimensionality, and add the result to the decoder's input features. Simple as it is, this solution greatly reduces prediction error.

We then averaged the important points with their nearest neighbors to reduce noise and compensate for uneven intervals (leap years, months of different lengths): attn_365 = 0.25 * day_364 + 0.5 * day_365 + 0.25 * day_366.

But then we realized that (0.25, 0.5, 0.25) is just a 1-D convolution kernel of length 3, and we could instead automatically learn larger kernels to detect important points in the past.
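The step from hand-set weights to a convolution kernel can be illustrated in NumPy; `encoder_out` and the helper function are hypothetical stand-ins for the real encoder outputs:

```python
import numpy as np

# Toy encoder outputs, one value per day. The hand-set weights
# 0.25/0.5/0.25 around a lagged day act as a length-3 1-D convolution
# kernel; a learned kernel generalizes this idea.
encoder_out = np.arange(400, dtype=float)

def lagged_attention(seq, lag, kernel=(0.25, 0.5, 0.25)):
    """Weighted average of the neighbourhood `lag` steps back, which
    smooths noise and compensates for leap years / uneven months."""
    center = len(seq) - 1 - lag
    window = seq[center - 1:center + 2]        # [lag+1, lag, lag-1] days back
    return float(np.dot(np.asarray(kernel), window))
```

With a learnable kernel replacing the fixed `(0.25, 0.5, 0.25)` tuple, the model itself can decide how widely to look around each seasonal lag.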

Finally, we built a very large attention mechanism that looks at a "fingerprint" of each time series (the fingerprint is produced by a smaller convolutional network), decides which points to attend to, and generates the weights for the larger convolution kernel. This larger kernel, applied to the decoder outputs, produces an attention feature for every predicted day. Although this method was not used in the end, the attention mechanism remains in the code, and readers can find it in the model code.

Note that we did not use the classical attention schemes (Bahdanau or Luong attention), because classical attention must be recomputed from scratch over all historical data points at every prediction step, which is too expensive for long time series (about two years). Our scheme instead convolves over all data points once, using the same attention weights for every prediction step (which is also a drawback), and is much faster to compute.

Unsatisfied with the complexity of the attention mechanism, we tried removing attention entirely and feeding the important data points from one year ago, six months ago, and one quarter ago as additional features to the encoder and decoder. The result was surprising: this was even slightly better in prediction quality than the attention-based model. Our best public score was achieved using only lagged data points, with no attention mechanism at all.

Another important advantage of lagged data points: the model can use a much shorter encoder without worrying about losing past information, because that information is now explicitly contained in the features. With this approach, an encoder of just 60-90 days gives entirely acceptable results, whereas previously 300-400 days were needed for the same performance. A shorter encoder also means faster training and less information loss.

Loss and regularization

SMAPE (the competition's target loss function) cannot be used directly because of its unstable behavior around zero (the loss is a step function when the true value is zero, and undefined when the predicted value is also zero).

I used a smoothed SMAPE variant that is well-behaved over all real numbers:

    epsilon = 0.1
    summ = tf.maximum(tf.abs(true) + tf.abs(predicted) + epsilon, 0.5 + epsilon)
    smape = tf.abs(predicted - true) / summ * 2.0

Another option is MAE loss on log1p(data). It is smooth, and its training objective is very close to SMAPE.

Final predictions are rounded to the nearest integer, and negative predictions are clipped to zero.
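For readers without TensorFlow at hand, here is a NumPy rendering of the smoothed SMAPE above, together with the final rounding/clipping step (the function names are mine):

```python
import numpy as np

def smooth_smape(true, predicted, epsilon=0.1):
    """Smoothed SMAPE from the snippet above: the epsilon floor keeps
    the denominator away from zero, so the loss is well-behaved around
    zero-valued targets."""
    summ = np.maximum(np.abs(true) + np.abs(predicted) + epsilon, 0.5 + epsilon)
    return np.abs(predicted - true) / summ * 2.0

def finalize(predicted):
    """Round to the nearest integer and clip negatives to zero."""
    return np.maximum(np.rint(predicted), 0)
```

Note that at true = predicted = 0 this smoothed variant returns 0 rather than being undefined, which is exactly the pathology of raw SMAPE it is designed to avoid.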

I tried regularizing RNN activations as in the paper "Regularizing RNNs by Stabilizing Activations", because the internal weights of the cuDNN GRU cannot be directly regularized (or I could not find the right way to do it). Stability loss did not work, but activation loss gave a slight improvement with small loss weights (1e-06 to 1e-05).

Training and validation

I used the COCOB optimizer (see the paper "Training Deep Networks without Learning Rates Through Coin Betting") with gradient clipping. COCOB tries to estimate the optimal learning rate at every training step, so I did not have to tune the learning rate at all. It also converges much faster than traditional momentum-based optimizers, especially on the first epoch, letting me stop unsuccessful experiments early.

There are two ways to split a time series into training and validating datasets:

    1. Walk-forward split. This is not really a split: we train and validate on the full dataset but over different time spans. The validation span is shifted forward by one prediction interval relative to the training span.

    2. Side-by-side split. This is the traditional split in mainstream machine learning: the dataset is divided into two independent parts, one for training and one for validation.

I tried both. Walk-forward is better for this task because it directly relates to the competition goal: predicting future values from historical values. But this split consumes the data points at the end of the time series, making it harder to train a model that accurately predicts the future.

Concretely: suppose we have 300 days of historical data and need to predict the next 100 days. If we choose the walk-forward split, we must use the first 100 days for actual training, the next 100 days as prediction targets during training (running the decoder and computing losses), the next 100 days for validation, and the final 100 days for actually predicting future values. So we can effectively train on only 1/3 of the data points, and the last training data point is 200 days away from the first prediction data point. That gap is too large: prediction quality drops roughly exponentially as we move away from the training data (uncertainty grows). A model trained with a 100-day gap would predict with noticeably better quality.
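The index arithmetic of this walk-forward layout can be sketched as follows; the range names are mine, purely for illustration:

```python
def walk_forward_ranges(n_days, horizon):
    """Day-index ranges (start, end) for the walk-forward layout above:
    encoder training input, decoder training targets, validation, and
    the real future prediction window, each `horizon` days long."""
    assert n_days >= 3 * horizon
    train_enc = (0, horizon)                   # actual training data
    train_dec = (horizon, 2 * horizon)         # decoder targets during training
    valid = (2 * horizon, 3 * horizon)         # validation window
    future = (n_days, n_days + horizon)        # beyond the observed data
    return train_enc, train_dec, valid, future
```

With 300 days of history and a 100-day horizon, this makes the 200-day gap between the end of the training input and the start of the real prediction window explicit.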

The side-by-side split is more economical, since it does not consume data points at the end of the series. But for our data, model performance on the validation set turned out to be strongly correlated with performance on the training set, and almost uncorrelated with actual future performance. In other words, the side-by-side split is essentially useless for this problem: it merely duplicates the loss already observed on the training data.

I used the validation set (with the walk-forward split) only for model tuning. The final model for predicting future values was trained blindly, with no validation set at all.

Reducing model variance

The model inevitably has high variance due to the very noisy input data. Frankly, I was surprised that the RNN learned anything at all from such noisy data.

The same model trained on different seeds behaves differently, and on "unlucky" seeds it sometimes diverges outright. During training, performance also fluctuates strongly from step to step. I couldn't count on pure luck to win the competition, so I decided to take action to reduce variance.

    1. I don't know which training step is best for predicting the future (validation results on past data are only weakly correlated with results on future data), so I couldn't use early stopping. But I knew the approximate region where the model is (probably) trained well enough but has (probably) not yet started to overfit. I bounded this region to roughly 10500-11500 iterations and saved 10 checkpoints, one every 100 steps, within it.

    2. Similarly, I trained 3 models on different seeds and saved checkpoints from each, giving 30 checkpoints in total.

    3. A well-known way to reduce variance and improve performance is ASGD (averaged SGD). It is simple and well supported in TensorFlow: maintain a moving average of the network weights during training and use the averaged weights, rather than the raw ones, at inference.
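A minimal sketch of such weight averaging. The original relies on TensorFlow's built-in support; this plain-NumPy exponential moving average is just one common variant of the idea, with a made-up decay value:

```python
import numpy as np

def update_ema(avg_weights, weights, decay=0.99):
    """One step of the weight moving average kept alongside training.
    At inference, the averaged weights replace the raw ones."""
    return [decay * a + (1.0 - decay) * w
            for a, w in zip(avg_weights, weights)]
```

Calling this after every optimizer step keeps a smoothed copy of the parameters that is much less sensitive to step-to-step noise than the raw weights.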

Combining the three methods worked well (averaging predictions over 30 checkpoints, using the averaged model weights at each checkpoint). The SMAPE error on the leaderboard (i.e. on future data) was roughly the same as on the historical-data validation.

In theory, the first two methods could also be viewed as ensembling, but I used them mainly to reduce variance.

Hyperparameter tuning

Many model parameters (number of layers, layer depths, activation functions, dropout coefficients, etc.) can (and should) be tuned for better performance. Manual tuning is tedious and time-consuming, so I decided to automate it and used SMAC3 to search the hyperparameters. Some advantages of SMAC3:

    • Support for conditional parameters (for example, jointly tuning the number of layers and the per-layer dropout; dropout on the second layer is tuned only if n_layers > 1).

    • Explicit handling of model variance. SMAC trains several instances of each configuration on different seeds and compares configurations only when their instances were trained on the same seeds. A configuration wins if it outperforms another on all matching seeds.

Contrary to my expectations, the hyperparameter search did not find a well-defined global minimum. All the best models performed roughly the same, with different parameters. Perhaps the RNN is too expressive for this task, and the best score depends more on the data's signal-to-noise ratio than on the model architecture. In any case, the best parameter settings can be found in the hparams.py file.

Original link: https://github.com/Arturus/kaggle-web-traffic/blob/master/how_it_works.md
