Continuing from the previous article.

Training

1) Validation
We used stratified sampling to set aside 10% of the annotated data as a validation set. Because the dataset is small, our evaluation on this validation set is affected by noise, so we also cross-checked our validation results against other models' scores on the leaderboard.
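As an illustration only, here is a minimal sketch of the stratified 90/10 split described above, assuming scikit-learn; the array names are placeholders rather than our actual data.

```python
# Minimal sketch of a stratified 90/10 train/validation split.
import numpy as np
from sklearn.model_selection import train_test_split

images = np.random.rand(1000, 64, 64)          # placeholder image data
labels = np.random.randint(0, 10, size=1000)   # placeholder class labels

# stratify=labels keeps the class proportions the same in both splits
X_train, X_val, y_train, y_val = train_test_split(
    images, labels, test_size=0.1, stratify=labels, random_state=0)
```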
2) Training algorithm

All models were trained with the SGD optimization algorithm with Nesterov momentum, with the momentum coefficient set to 0.9. Most models take 24-48 hours to converge.
For most models, about 215,000 gradient-descent steps are enough. The learning rate is adjusted with a schedule similar to Krizhevsky et al.: two reductions by a factor of 10, at step 180,000 and step 205,000 respectively. We generally use 0.003 as the initial learning rate.
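The following is a minimal sketch of these optimizer settings, written with PyTorch as an assumed framework (not necessarily the one used in the competition); the tiny model, the random data, and the example weight-decay value are placeholders. The commented-out weight decay refers to the point discussed under Regularization below.

```python
# Sketch: SGD with Nesterov momentum 0.9, initial learning rate 0.003,
# divided by 10 at steps 180,000 and 205,000, for 215,000 steps total.
import torch
import torch.nn as nn

model = nn.Linear(10, 10)  # placeholder network

optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.003,             # initial learning rate
    momentum=0.9,         # momentum coefficient
    nesterov=True,        # Nesterov momentum
    # weight_decay=1e-5,  # a small weight decay (see Regularization below)
)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[180_000, 205_000], gamma=0.1)

for step in range(215_000):
    x = torch.randn(32, 10)            # placeholder mini-batch
    y = torch.randint(0, 10, (32,))
    loss = nn.functional.cross_entropy(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()                   # the schedule counts gradient steps
```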
We also tried Adam (the version from the first draft of the paper) as an alternative to Nesterov momentum. Adam roughly doubled the speed of convergence, but the final performance was worse than with Nesterov momentum, so we gave it up.
3) Initialization

We used a variant of the orthogonal initialization strategy proposed in "Exact solutions to the nonlinear dynamics of learning in deep linear neural networks". This allowed us to increase the depth of the network as needed without running into convergence problems.
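Below is a rough sketch of orthogonal initialization applied to a small network, using PyTorch's built-in initializer as a stand-in for the exact variant we used; the network itself is a placeholder.

```python
# Sketch: initialize all weight matrices / conv kernels with (semi-)orthogonal
# matrices (Saxe et al.); biases start at zero.
import torch.nn as nn

def init_orthogonal(module):
    if isinstance(module, (nn.Linear, nn.Conv2d)):
        nn.init.orthogonal_(module.weight)
        if module.bias is not None:
            nn.init.zeros_(module.bias)

net = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
    nn.Flatten(), nn.Linear(32 * 8 * 8, 10),
)
net.apply(init_orthogonal)  # applies the initializer to every submodule
```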
4) Regularization

Most models use dropout, with the dropout probability at the fully connected layers set to 0.5. In some models we also applied dropout to the convolutional layers.
We also tried Gaussian dropout (multiplicative Gaussian noise instead of multiplicative Bernoulli noise), but found its performance to be similar to traditional dropout.
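Here is an illustrative sketch, assuming PyTorch, of Gaussian dropout alongside ordinary dropout with p = 0.5 on a fully connected layer; the layer sizes are placeholders.

```python
# Sketch: Gaussian dropout multiplies activations by mean-1 Gaussian noise
# whose variance p/(1-p) matches Bernoulli dropout with drop probability p.
import torch
import torch.nn as nn

class GaussianDropout(nn.Module):
    def __init__(self, p=0.5):
        super().__init__()
        self.sigma = (p / (1.0 - p)) ** 0.5   # std of the multiplicative noise

    def forward(self, x):
        if not self.training:
            return x                          # no noise (and no rescaling) at test time
        noise = torch.randn_like(x) * self.sigma + 1.0
        return x * noise

# Traditional dropout with p = 0.5 on a fully connected layer, as above:
classifier = nn.Sequential(
    nn.Linear(256, 256), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(256, 10),
)
```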
Near the end of the competition we found that a small amount of weight decay was very helpful for stabilizing the training of large networks (not just as a regularizer). Networks containing large fully connected layers easily diverge when trained without weight decay. The problem can also be overcome by lowering the learning rate appropriately, but that would slow training down.
Unsupervised and semi-supervised approaches

1) Unsupervised pre-training

Because the test set is large, we used it as a training set for unsupervised learning and pre-trained our networks on it, using convolutional autoencoders (CAEs) to pre-train the convolutional layers.
Consistent with what is reported in the literature, we found that pre-training acts as a good regularizer (it increases the training loss but improves the validation score). However, when we evaluated on the validation set with augmentation, the results were unsatisfactory; this will be discussed further below.
Pre-training would also have allowed us to scale up our models further, but because pre-training the network takes roughly 1.5 times as long, we did not use it in the final models.
During the pre-training experiments, in order to learn good features, we preferred max-pooling and unpooling layers, which give a sparse representation of the features. There were two reasons we did not try a denoising autoencoder: first, according to Masci et al.'s experimental results, max-pooling and unpooling provide good filters and better performance than a denoising autoencoder, and combining these two layer types with the rest of the network afterwards is very easy; second, a network built as a denoising autoencoder would be much slower.
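As a rough sketch (assuming PyTorch, with placeholder layer sizes), a convolutional autoencoder with max-pooling in the encoder and unpooling from the saved indices in the decoder might look like this:

```python
# Sketch: CAE with max-pooling / unpooling, trained with a reconstruction loss.
import torch
import torch.nn as nn

class ConvAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Conv2d(1, 16, 3, padding=1)
        self.pool = nn.MaxPool2d(2, return_indices=True)  # keep indices for unpooling
        self.unpool = nn.MaxUnpool2d(2)
        self.dec = nn.ConvTranspose2d(16, 1, 3, padding=1)

    def forward(self, x):
        h = torch.relu(self.enc(x))
        h, idx = self.pool(h)      # sparse representation via max-pooling
        h = self.unpool(h, idx)    # route activations back to their positions
        return self.dec(h)         # reconstruct the input

model = ConvAutoencoder()
x = torch.randn(8, 1, 32, 32)                  # placeholder unlabeled batch
loss = nn.functional.mse_loss(model(x), x)     # reconstruction loss
```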
Here are a few different strategies we compared when pre-training the network. Greedy layer-wise training vs. training the full deconvolutional stack jointly (similar to end-to-end): we found that joint training gives better performance, but sometimes the network first needs to be initialized with greedy layer-wise training before joint training works. Tied weights vs. untied weights: constraining the weights of a convolution layer and the corresponding deconvolution layer to be transposes of each other makes the autoencoder faster to train, so we only experimented with tied weights.
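And a minimal sketch of the tied-weights idea, where the decoder reuses the encoder's kernel through a transposed convolution so both directions share one parameter tensor (again an assumed PyTorch illustration, not our actual code):

```python
# Sketch: convolution and "deconvolution" sharing one weight tensor.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TiedConvAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Conv2d(1, 16, 3, padding=1)

    def forward(self, x):
        h = torch.relu(self.enc(x))
        # Decoder: transposed convolution with the *same* weight tensor
        return F.conv_transpose2d(h, self.enc.weight, padding=1)

model = TiedConvAutoencoder()
x = torch.randn(8, 1, 32, 32)   # placeholder batch
recon = model(x)                # reconstruction with the same spatial size
```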
2) Fine-tuning

We also experimented with a number of different fine-tuning strategies, and we found that if the same supervised-learning settings were kept, randomly initialized networks and pre-trained networks did not differ much in supervised performance. A possible reason: before the weights of the randomly initialized dense layers reach a reasonable range, the convolutional layers lose much of the information (the features) learned during the pre-training phase.
We found two ways to overcome this problem. The first is to keep the pre-trained layers fixed for a while and train only the (randomly initialized) dense layers; since only a few layers are trained, backpropagation is very fast. The second is to halve the learning rate in the convolutional layers, which lets the randomly initialized dense layers adapt to the pre-trained convolutional layers faster; until the dense layers reach a better weight range, the convolutional weights change very little during supervised training.
The performance difference between the two methods is small.
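A sketch of the two tricks, assuming PyTorch and a hypothetical split of the model into a pre-trained convolutional part (features) and a randomly initialized dense part (classifier):

```python
# Sketch: (1) temporarily freeze the pre-trained layers, or
#         (2) halve their learning rate relative to the dense layers.
import torch
import torch.nn as nn

features = nn.Sequential(                 # pre-trained convolutional part
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
classifier = nn.Sequential(               # randomly initialized dense part
    nn.Flatten(), nn.Linear(16 * 16 * 16, 10))

# Strategy 1: train only the dense layers for a while (fast backprop).
for p in features.parameters():
    p.requires_grad = False
opt_dense_only = torch.optim.SGD(classifier.parameters(),
                                 lr=0.003, momentum=0.9, nesterov=True)

# Strategy 2 (alternative): train everything, but give the convolutional
# layers half the learning rate so the dense layers adapt first.
for p in features.parameters():
    p.requires_grad = True
opt_half_lr = torch.optim.SGD([
    {"params": features.parameters(), "lr": 0.0015},   # half of 0.003
    {"params": classifier.parameters(), "lr": 0.003},
], momentum=0.9, nesterov=True)
```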
3) Pseudo-labeling

Another way to extract information from the test set is to combine pseudo-labeling with knowledge distillation ("Distilling the Knowledge in a Neural Network"). The performance of models trained with pseudo-labeling exceeded our expectations, so we studied this method in detail.
The pseudo-labeling approach adds the test set to the training set to obtain a larger dataset. The labels attached to the test data (the pseudo-labels) are the predictions of a previously trained model. Doing this allows us to train larger networks, because pseudo-labeling has a regularizing effect.
We experimented with both hard targets (one-hot encodings) and soft targets (predicted probabilities), but soon decided to use soft targets because they gave better performance.
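For illustration, generating soft pseudo-labels might look like the following sketch (assuming PyTorch; the model and data are placeholders):

```python
# Sketch: a previously trained model predicts class probabilities for the
# unlabeled test images; those probability vectors become the soft targets.
import torch
import torch.nn as nn

trained_model = nn.Sequential(            # placeholder for an already-trained model
    nn.Flatten(), nn.Linear(32 * 32, 10), nn.Softmax(dim=1))
test_images = torch.randn(100, 1, 32, 32) # placeholder unlabeled test data

with torch.no_grad():
    soft_pseudo_labels = trained_model(test_images)   # shape (100, num_classes)

# Hard (one-hot) pseudo-labels would instead be soft_pseudo_labels.argmax(dim=1).
```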
Another important detail is the relative weighting of the original training data and the pseudo-labeled test data in the resulting dataset. Most of the time we used mini-batches consisting of 33% pseudo-labeled data and 67% original training data.
The proportion of pseudo-labeled data can also be raised to 67%, which regularizes the model even more strongly; in that case we even had to weaken or remove dropout, otherwise the model would underfit.
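A sketch of assembling such mixed mini-batches, with placeholder tensors standing in for the real datasets:

```python
# Sketch: mini-batches that are ~33% pseudo-labeled test data and ~67%
# original training data (for a batch of 128: 42 pseudo-labeled, 86 labeled).
import torch

batch_size = 128
n_pseudo = batch_size // 3             # ~33% pseudo-labeled examples
n_labeled = batch_size - n_pseudo      # ~67% original training examples

train_images = torch.randn(5000, 1, 32, 32)                    # placeholders
train_targets = torch.eye(10)[torch.randint(0, 10, (5000,))]   # one-hot targets
test_images = torch.randn(2000, 1, 32, 32)
soft_pseudo_labels = torch.softmax(torch.randn(2000, 10), dim=1)

idx_l = torch.randint(0, len(train_images), (n_labeled,))
idx_p = torch.randint(0, len(test_images), (n_pseudo,))
batch_x = torch.cat([train_images[idx_l], test_images[idx_p]])
batch_y = torch.cat([train_targets[idx_l], soft_pseudo_labels[idx_p]])
# batch_x / batch_y can now be fed to the usual training step, with a
# cross-entropy loss computed against the (soft) target distributions.
```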
The difference between our approach and knowledge distillation is that we use the test set, rather than the training set, to transfer the learned information. Another difference is that knowledge distillation is meant to let a smaller, faster model match the performance of a large model, whereas pseudo-labeling lets us build larger networks that achieve better performance.
We believe pseudo-labeling improves performance because of the larger dataset (training plus test set) together with data augmentation and test-time augmentation (discussed next). Once the pseudo-labeled data is added to the training data, the training set effectively covers the characteristics of all the data (test plus train), so the model can absorb more of that information during training and reach better performance.
Pseudo-labeling gave its biggest performance boost at the very beginning (about 0.015). Later, once we were experimenting with bagged models, it still gave improvements of 0.003-0.009.