Kaggle Competition official website: https://www.kaggle.com/c/the-nature-conservancy-fisheries-monitoring
Read reference: http://wh1te.me/index.php/2017/02/24/kaggle-ncfm-contest/
Related courses: http://course.fast.ai/index.html
1. Introduction to the NCFM Image Classification Task
To protect and monitor the marine environment and its ecological balance, The Nature Conservancy invited participants from the Kaggle [1] community to develop a machine learning algorithm that automatically classifies and identifies fish species in pictures taken by cameras on pelagic fishing vessels, for example different kinds of tuna and sharks. The Nature Conservancy provided 3777 annotated images as the training set. The images are divided into 8 categories: 7 are different species of marine fish, and the remaining one contains images without fish. Each image belongs to exactly one of the 8 categories.
Figure 1 shows a few sample pictures from the data set. In some images, the fish to be identified occupies only a small part of the whole picture, which makes identification very challenging. To measure the effectiveness of an algorithm, an additional 1000 images were provided as a test set; contestants needed to design an image recognition algorithm that identifies, as accurately as possible, which of the 8 categories each of the 1000 test images belongs to. The Kaggle platform provides a leaderboard for each competition: the more accurate the recognition, the higher the competitor's ranking.
Figure 1. NCFM Image Classification Contest
2. Problem Analysis and Solution Ideas
2.1 Convolutional Neural Networks (ConvNets)
From the problem description, we can see that the NCFM competition is a typical "single-label image classification" problem: given a picture, the system must predict which of the predefined categories the image belongs to. In computer vision, the core technical framework for such problems is currently deep learning, and for image data specifically, the convolutional neural network (ConvNet) architecture. (For an introduction to convolutional neural networks and their algorithms, there is a video tutorial on the CNN convolution computing layer; this blog also has a post: CNN notes.)
In general, a convolutional neural network is a special neural network structure that learns image features automatically through convolution operations and selects the useful visual features so as to maximize the accuracy of image classification.
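To make the convolution operation concrete, here is a minimal sketch in plain Python. The toy image and the hand-picked edge-detecting kernel are assumptions for illustration only; in a real convolutional network the kernel values are learned during training rather than chosen by hand.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and return the
    feature map. This is the cross-correlation form commonly used in convnets."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Weighted sum of the image patch under the kernel.
            total = 0
            for di in range(kh):
                for dj in range(kw):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        feature_map.append(row)
    return feature_map


# Toy 4x4 grayscale image: dark left half, bright right half.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

# Hand-picked kernel that responds to a dark-to-bright vertical edge.
kernel = [
    [-1, 1],
    [-1, 1],
]

feature_map = conv2d(image, kernel)
for row in feature_map:
    print(row)
```

The feature map is nonzero only in the middle column, where the patch straddles the boundary between the dark and bright halves; this is the sense in which a convolution kernel "extracts" a visual feature.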
Figure 2. convolutional Neural Network Architecture
Figure 2 shows a simple convolutional neural network for recognizing cats and dogs. The bottom (and largest) block of points represents the input layer of the network, which usually reads in the image as the network's input data. The topmost block of points is the output layer, which predicts and outputs the category of the input image. Because only cats and dogs are distinguished here, the output layer has only 2 neurons. The layers between the input and output layers are called hidden layers; the figure has 3 of them. As mentioned earlier, the hidden layers in image classification work by convolution operations, so these hidden layers are also called convolutional layers.
Therefore, the input layer, the convolutional layers, the output layer, and their corresponding parameters together constitute a typical convolutional neural network. Of course, the networks used in practice are more complex than this example. Since the 2012 ImageNet competition, new network structures have appeared almost every year; widely recognized ones include AlexNet, VGG-Net, GoogLeNet, Inception v2-v4 [8, 9], ResNet, and so on.
2.2 An Effective Network Training Technique: Fine-Tuning
We do not need to construct a deep network from scratch by experimenting with one parameter at a time, because many published papers have already done this verification for us; we only need to stand on the shoulders of our predecessors and choose a suitable network structure. Another important reason to choose a recognized network structure is that almost all of these networks come with weights pre-trained on the large-scale ImageNet dataset. This matters a great deal: we have only a few thousand training samples, while a deep network has a very large number of parameters, so the number of training pictures is far smaller than the parameter search space. If we simply initialize a deep network randomly and train it on these few thousand images, overfitting occurs very easily.
Overfitting means that the deep network has seen only a small number of samples and therefore develops a narrow view: it recognizes only this small set of pictures and loses the ability to generalize, so it cannot recognize similar pictures it has not seen. To address this problem, we generally use network parameters that were trained on millions of images or more as the initialization. A network with such parameters has already "seen" a large number of pictures, so its generalization ability is greatly improved and the visual features it extracts are more robust and effective.
We can then continue training with the more than 3,000 labeled pictures of marine fish. Note that, to avoid overshooting the optimal solution, the pace of training (more precisely, the learning rate) should be slow; this is why the training strategy is called fine-tuning.
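Why a slow learning rate matters when starting from good weights can be illustrated with a toy example. The sketch below is not a neural network: it assumes a one-parameter quadratic loss with its optimum at 3 and a hypothetical "pre-trained" starting value of 2.5, purely to show the overshooting effect.

```python
def gradient_descent(start, learning_rate, steps=20):
    """Minimize the toy loss (w - 3)^2 with plain gradient descent."""
    w = start
    for _ in range(steps):
        gradient = 2 * (w - 3)  # derivative of (w - 3)^2
        w -= learning_rate * gradient
    return w


# Hypothetical pre-trained weight: already close to the optimum w = 3.
pretrained_w = 2.5

careful = gradient_descent(pretrained_w, learning_rate=0.1)
hasty = gradient_descent(pretrained_w, learning_rate=1.1)

print("small learning rate ends at", careful)
print("large learning rate ends at", hasty)
```

With the small learning rate the parameter settles near the optimum; with the large one every step jumps past it, and the parameter ends up farther away than where it started.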
When we fine-tune a pre-trained network with our own labeled data, some experience is worth borrowing. Take Figure 3 as an example, and assume that our network is similar to AlexNet, a 7-layer structure (not counting the output layer FC8) in which the first 5 layers are convolutional layers and the last 2 (FC6 and FC7) are fully connected layers.
(1) We first fine-tune the final Softmax classifier layer. Suppose the original network classifies 1000 kinds of objects (as in the ImageNet task) and our data has only 10 category labels; then we change the number of neurons in our output layer (FC8) to 10. We learn the weight matrix between layers FC7 and FC8 with a very small learning rate and keep the weights of all previous layers fixed;
(2) Once the network tends to converge, we expand the scope of fine-tuning to the two fully connected layers, that is, the weights between FC6 and FC7 and between FC7 and FC8, while keeping the weights of all convolutional layers before FC6 fixed;
(3) We extend the fine-tuning to the last convolutional layer, C5;
(4) We extend the range of fine-tuning to more convolutional layers. In practice, however, the features extracted by the earlier convolutional layers are considered more basic and universal, while the later convolutional layers and the fully connected layers are more specific to the data set, so sometimes we do not fine-tune the first few convolutional layers.
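The staged schedule above can be sketched as a small lookup of which layers are trained and which are frozen at each step. This is only an illustration under stated assumptions: the layer names follow the AlexNet-like example, each layer is assumed to own its incoming weight matrix, the helper names are hypothetical, and starting step (4) at C3 is an arbitrary choice because the text only says "more convolutional layers".

```python
# Layer names follow the AlexNet-like example (C = convolutional, FC = fully
# connected). Assumption: each layer owns its incoming weights, so "the weights
# between FC7 and FC8" belong to FC8.
LAYERS = ["C1", "C2", "C3", "C4", "C5", "FC6", "FC7", "FC8"]

# First layer that is unfrozen at each step. Step 4 starting at C3 is an
# illustrative assumption, not something the text specifies.
FIRST_TRAINABLE = {1: "FC8", 2: "FC7", 3: "C5", 4: "C3"}


def trainable_layers(stage):
    """Layers whose weights are updated at the given fine-tuning step."""
    start = LAYERS.index(FIRST_TRAINABLE[stage])
    return LAYERS[start:]


def frozen_layers(stage):
    """Layers whose pre-trained weights stay fixed at the given step."""
    start = LAYERS.index(FIRST_TRAINABLE[stage])
    return LAYERS[:start]


for stage in (1, 2, 3, 4):
    print(stage, "train:", trainable_layers(stage), "freeze:", frozen_layers(stage))
```

In a deep learning framework, the same plan would typically be applied by marking the frozen layers as non-trainable before each round of training.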
Figure 3. Basic steps of network fine-tuning
3. Algorithm Implementation and Analysis
An open-source implementation is already available on the NCFM competition forum for everyone's reference (https://www.kaggle.com/c/the-nature-conservancy-fisheries-monitoring/discussion/26202).
The logical structure of the model training file train.py is analyzed here:
(1) Import the relevant modules and set parameters (Figure 4);
(2) Build the Inception_v3 deep convolutional network, initialize it with parameters already trained on the large-scale ImageNet image dataset, and define callback functions that save the model that performs best on the validation set during training (Figure 5);
(3) Load the training images using data augmentation (Figure 6). Data augmentation is a common technique for controlling overfitting, and the idea is simple: flipping a picture horizontally, cropping its corners, or making its tones dimmer or brighter does not change its category;
(4) Train the Inception_v3 network model (Figure 7).
Figure 4. Import and parameter settings
Figure 5. Build Inception_v3 network and load pre-trained parameters
Figure 6. Load training and validation picture sets using data augmentation
Figure 7. Model Training
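The data augmentation idea described in this section (flip, crop, brighten or darken, label unchanged) can be sketched in plain Python on a tiny grayscale image stored as nested lists. This is not the code from train.py; the function names, the crop size, and the brightness range are assumptions chosen for illustration.

```python
import random


def flip_horizontal(image):
    """Mirror every row of the image left to right."""
    return [row[::-1] for row in image]


def random_crop(image, crop_h, crop_w, rng):
    """Cut a crop_h x crop_w window at a random position."""
    top = rng.randint(0, len(image) - crop_h)
    left = rng.randint(0, len(image[0]) - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]


def adjust_brightness(image, factor):
    """Scale pixel values and clamp them to the 0..255 range."""
    return [[min(255, max(0, int(round(p * factor)))) for p in row] for row in image]


def augment(image, label, rng):
    """Return a randomly transformed copy of the image with the SAME label."""
    out = image
    if rng.random() < 0.5:
        out = flip_horizontal(out)
    # Assumed crop size: one pixel smaller than the original in each direction.
    out = random_crop(out, len(image) - 1, len(image[0]) - 1, rng)
    # Assumed brightness range: 80% to 120% of the original.
    out = adjust_brightness(out, rng.uniform(0.8, 1.2))
    return out, label


rng = random.Random(0)
image = [
    [10, 20, 30, 40],
    [50, 60, 70, 80],
    [90, 100, 110, 120],
    [130, 140, 150, 160],
]
augmented, label = augment(image, "tuna", rng)
print(label, augmented)
```

Each call produces a slightly different picture of the same fish, so the network sees more variety than the raw training set contains.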
4. Some tips for improving your rankings
Once the model is trained, we use it to predict the categories of the test images; the code in predict.py in the forum predicts the fish categories and generates the submission file. Here we share two simple and effective tricks that are common in machine learning and image recognition competitions. Both are based on averaging and voting, and the principle behind them can be summed up in one sentence: the eyes of the masses are discerning.
Tip 1: One model, averaged over multiple test samples
With this technique, once a model is trained, we use data augmentation similar to that used in training to generate multiple variants of each test image and feed them all to the trained network for prediction; we then take the category with the most votes as the final result. This idea is implemented in predict_average_augmentation.py in the GitHub repository, and its effect is obvious.
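A minimal sketch of this voting scheme follows. It is not the code from predict_average_augmentation.py: the stand-in classifier (which decides by mean brightness), the class names, and the list of augmentations are all assumptions made so the example runs without a trained network.

```python
from collections import Counter


def flip_horizontal(image):
    """Mirror every row of the image left to right."""
    return [row[::-1] for row in image]


def shift_brightness(image, delta):
    """Add delta to every pixel and clamp to the 0..255 range."""
    return [[min(255, max(0, p + delta)) for p in row] for row in image]


def toy_classifier(image):
    """Hypothetical stand-in for a trained network: decides by mean brightness."""
    pixels = [p for row in image for p in row]
    return "fish" if sum(pixels) / len(pixels) > 100 else "no_fish"


def predict_with_voting(predict_fn, image, variants):
    """Predict every augmented variant and return the most common category."""
    votes = Counter(predict_fn(make_variant(image)) for make_variant in variants)
    return votes.most_common(1)[0][0]


# Assumed augmentations applied at test time.
variants = [
    lambda img: img,                         # original image
    flip_horizontal,                         # mirrored
    lambda img: shift_brightness(img, 30),   # brighter
    lambda img: shift_brightness(img, -30),  # darker
]

image = [[100, 120], [110, 110]]
print(predict_with_voting(toy_classifier, image, variants))
```

Here the darkened variant is misclassified on its own, but the other three variants outvote it, which is the effect the tip relies on.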
Tip 2: Train multiple models with cross-validation
Recall that we said we would divide the more than 3,000 images into a training set and a validation set. There are many ways to make this division. A common one is to shuffle the pictures and split them evenly into K parts. This gives K (training set, validation set) combinations: each time, one part serves as the validation set and the remaining K-1 parts as the training set. We can thus train K models in total; each test picture is fed to all K models, and the category with the most votes is chosen as the final prediction. This is called K-fold cross-validation. Figure 9 shows a data partitioning scheme for 5-fold cross-validation.
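The partitioning and the final vote can be sketched as follows. The helper names, the fixed random seed, and the ten placeholder image ids are assumptions for illustration; training the K models themselves is left out.

```python
import random
from collections import Counter


def k_fold_splits(items, k, seed=0):
    """Shuffle the items, cut them into k parts, and return k
    (training set, validation set) pairs."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        validation = folds[i]
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        splits.append((training, validation))
    return splits


def majority_vote(predictions):
    """Return the category predicted by the largest number of models."""
    return Counter(predictions).most_common(1)[0][0]


# Ten placeholder image ids stand in for the real training images.
image_ids = list(range(10))
splits = k_fold_splits(image_ids, k=5)
for training, validation in splits:
    print("train:", sorted(training), "validate:", sorted(validation))

# Suppose five models, one per split, predicted these categories for one test image.
print(majority_vote(["tuna", "shark", "tuna", "tuna", "shark"]))
```

Every image appears in exactly one validation set, so each of the K models is validated on data it never trained on.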