Everyone seems excited about the new Capsule Network (CapsNet) architecture, and I am no exception: I could not resist using a capsule network to build a roadside traffic sign recognition system. This article walks through that process and also explains some of the basic concepts behind capsule networks.
The project is developed in TensorFlow and is based on the paper "Dynamic Routing Between Capsules" by Sara Sabour, Nicholas Frosst, and Geoffrey E. Hinton; the code is available on GitHub.

What's Wrong with Convolutional Neural Networks?
Part of the problem with convolutional neural networks (CNNs) lies in how they generalize what they perceive in an image: a trained image recognition network may misclassify a rotated version of an image it otherwise recognizes. This is why data augmentation and average/max pooling are commonly used during training. Pooling creates a new layer by keeping only a summary of each small region of neurons in the previous layer (for max pooling, the strongest activation in the region). This effectively reduces the computational cost of the layers above, and it also makes the network less sensitive to the original position of a feature. The simplification rests on the assumption that the exact position of a feature has little effect on recognizing the target.
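To make the pooling step concrete, here is a minimal NumPy sketch of 2x2 max pooling (an illustration of the idea only, not the project's code):

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2: each output value keeps only the
    strongest activation in its window, discarding its exact position."""
    h, w = x.shape
    x = x[:h - h % 2, :w - w % 2]                       # crop to an even size
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

# A feature detected at (0, 0) or at (1, 1) produces the same pooled output,
# which is exactly the positional information loss discussed above:
a = np.zeros((4, 4)); a[0, 0] = 1.0
b = np.zeros((4, 4)); b[1, 1] = 1.0
```

The last two arrays pool to identical outputs, illustrating why a pooled network tolerates small shifts but also cannot tell them apart.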
Like a CNN, a higher-level capsule can cover a larger area of the image, but unlike max pooling, we do not discard the exact position of the target object within that area.
This allows the model to keep its output stable under subtle changes in the image while still tracking how features move. Invariance means the output of the network stays the same regardless of the position or orientation of the detected features; capsules instead aim for equivariance, where the model understands the rotation and displacement of features in the image and adjusts its representation accordingly, producing the appropriate output. This is not possible with pooling alone, and it is what motivated this new architecture.

Capsule Network
The capsule network gives the model the ability to understand changes in the image, so it can better generalize from what it perceives. To understand how this architecture works, you first need to grasp the concept of a capsule.
A capsule is a group of neurons whose activation vectors represent the instantiation parameters of a particular type of entity, such as an object or an object part.
We are used to talking about deep learning in terms of depth, while the capsule network introduces a nested concept: a new dimension inside the depth. Instead of simply stacking more layers to deepen the network, the capsule network adds structure inside each layer. This sounds abstract, but on closer inspection it is not so complicated. In the paper, the core of the method is divided into two parts: base (primary) capsules and digit capsules. In our case, the latter are renamed traffic sign capsules.

Base Capsules
This layer builds on a classic convolution: a new convolution layer with n * C filters, where n is the number of capsule filters and C is the size of each capsule. This produces n * C feature maps of size (T, T). In the figure above, the values belonging to each capsule are shown in red in the newly created maps. The TensorFlow code is as follows:
conv = tf.contrib.layers.conv2d(input_layer, n * C, kernel, stride, padding="VALID")
# shape: (?, T, T, n * C)
Now that we have created the convolution operation, we can rearrange its output to form the capsules:
capsules = tf.reshape(conv, shape=(-1, T * T * n, C, 1))
# shape: (?, T*T*n, C, 1)
# conv[0][0][0][:C] <=> capsules[0]
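The correspondence stated in the last comment can be checked with a small NumPy sketch (toy sizes T=2, n=3, C=4, not the project's real dimensions):

```python
import numpy as np

T, n, C = 2, 3, 4                               # toy sizes for illustration
conv = np.random.randn(1, T, T, n * C)          # simulated conv output, batch of 1
capsules = conv.reshape(-1, T * T * n, C, 1)    # group every C channels into one capsule
# shape: (1, T*T*n, C, 1) -> 12 capsules of size 4;
# the first C channel values of the conv output form the first capsule
```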
We thus get T * T * n capsules of size C (1152 capsules in this project). Note that the first C values of the convolution output (see the comment in the code) are equal to the values of the first capsule. Finally, the original paper introduces a new nonlinear function that is applied to each capsule individually. This function is called squashing and looks like this:

v_j = (||s_j||^2 / (1 + ||s_j||^2)) * (s_j / ||s_j||)

where s_j is the total input of capsule j and v_j is its output.
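In NumPy, the squashing function can be sketched as follows (an illustration, not the project's TensorFlow implementation; the eps term is my addition for numerical stability):

```python
import numpy as np

def squash(s, eps=1e-8):
    """Squash a capsule vector: short vectors shrink toward length 0,
    long vectors are rescaled to a length just below 1. Direction is preserved."""
    sq_norm = np.sum(s ** 2)
    scale = sq_norm / (1.0 + sq_norm)       # in [0, 1), approaches 1 for long vectors
    return scale * s / np.sqrt(sq_norm + eps)
```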
The nonlinear squashing function therefore ensures that short vectors are shrunk to almost zero length, while long vectors are compressed to a length just below 1.

Traffic Sign Capsules
In this project, this layer consists of 43 capsules, each representing a specific traffic sign. To obtain the model's prediction, we select the capsule with the greatest length. Before that, however, a transformation is needed from the 1152 capsules of the previous layer. This is done by the routing procedure, whose effect is to decide which capsules from the previous layer should contribute to which capsules of the output layer. In other words, for each capsule the network asks: "Hey, is this capsule useful for predicting this class?"
With the iterative routing process, each active capsule selects a capsule in the upper layer as its parent node in the tree.
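For intuition, here is a minimal NumPy sketch of this routing-by-agreement loop over precomputed prediction vectors u_hat (toy shapes and simplified logic; the project's actual implementation lives in fully_connected_caps_layer):

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def route(u_hat, iterations=3):
    """Routing by agreement. u_hat has shape (num_in, num_out, dim):
    the prediction of each input capsule for each output capsule."""
    num_in, num_out, _ = u_hat.shape
    b = np.zeros((num_in, num_out))                           # routing logits
    for _ in range(iterations):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)  # coupling coefficients
        s = np.einsum('io,iod->od', c, u_hat)                 # weighted sum per output capsule
        v = squash(s)                                         # output capsules (num_out, dim)
        b += np.einsum('iod,od->io', u_hat, v)                # reward predictions that agree
    return v, c
```

Each iteration strengthens the coupling between an input capsule and the output capsule its prediction agrees with, which is the "parent selection" described above.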
With routing, feature selection is no longer as blunt as pooling. I will not detail the exact routing formulas here; they are described in the paper, and the implementation code for this project is on my GitHub. I am still working on making the algorithm more scalable. For the traffic sign capsules and the routing, my implementation tries to follow the mathematical formulas in the paper.

Image Reconstruction
This approach helps push the network to treat the capsule vectors as descriptions of actual objects: each image is encoded into capsules before being reconstructed. It also turns out to be an effective regularizer.
We use an additional reconstruction loss to encourage the digit capsules to encode the instantiation parameters of the input digit.
The implementation code for this part is also in the project's GitHub. In the reconstruction code, the image is upscaled using convolutions and nearest-neighbor interpolation. I could not just stack a few simple layers, because the image to reconstruct has 3 output channels. Although this implementation works fairly well on the MNIST data, I still have some doubts about its effectiveness at larger scale, but that is only my personal view.
The final loss of the model is therefore the combination of two losses:

Margin loss: based on the model's actual prediction, computed from the lengths of the output capsules.
Reconstruction loss: the decoder loss, the mean squared difference between the original and reconstructed images.

Model Architecture
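The margin loss mentioned above can be sketched in NumPy as follows (m+ = 0.9, m- = 0.1, lambda = 0.5 are the values from the paper; this is an illustration, not the project's TensorFlow code):

```python
import numpy as np

def margin_loss(lengths, targets, m_plus=0.9, m_minus=0.1, lam=0.5):
    """lengths: output capsule lengths, one per class; targets: one-hot label vector.
    The true class's capsule is pushed above m_plus, all others below m_minus."""
    present = targets * np.maximum(0.0, m_plus - lengths) ** 2
    absent = lam * (1.0 - targets) * np.maximum(0.0, lengths - m_minus) ** 2
    return float(np.sum(present + absent))
```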
Because I am working with a dataset different from the one in the original paper, the model architecture needed a few tweaks.
The first convolution uses 256 filters, a kernel of size 9 (valid padding), ReLU activation, and a dropout keep probability of 0.7.
The base capsule layer uses 16 filters with a kernel of size 5 and capsules of size 16, yielding 256 feature maps of size (10, 10), i.e. 1600 capsules of 16 values each. The last layer (the traffic sign capsules) consists of 43 capsules (one per class) of size 32.
The construction code for the above structure is as follows:
def _build_main_network(self, images, conv_2_dropout):
    """
    This method is used to create both convolutions and the capsnet on top.
    **input:
        *images: image placeholder
        *conv_2_dropout: dropout value placeholder
    **return:
        *caps1: output of first capsule layer
        *caps2: output of second capsule layer
    """
    # First block:
    # Layer 1: convolution
    shape = (self.h.conv_1_size, self.h.conv_1_size, 3, self.h.conv_1_nb)
    conv1 = self._create_conv(self.tf_images, shape, relu=True,
                              max_pooling=False, padding='VALID')
    # Layer 2: convolution
    shape = (self.h.conv_2_size, self.h.conv_2_size, self.h.conv_1_nb, self.h.conv_2_nb)
    conv2 = self._create_conv(conv1, shape, relu=True,
                              max_pooling=False, padding='VALID')
    conv2 = tf.nn.dropout(conv2, keep_prob=conv_2_dropout)
    # Create the first capsules layer
    caps1 = conv_caps_layer(
        input_layer=conv2,
        capsules_size=self.h.caps_1_vec_len,
        nb_filters=self.h.caps_1_nb_filter,
        kernel=self.h.caps_1_size)
    # Create the second capsules layer used to predict the output
    caps2 = fully_connected_caps_layer(
        input_layer=caps1,
        capsules_size=self.h.caps_2_vec_len,
        nb_capsules=self.nb_labels,
        iterations=self.h.routing_steps)
    return caps1, caps2
Training
During training I used the Keras ImageDataGenerator for data augmentation.
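ImageDataGenerator applies random transforms such as shifts, rotations, and zooms on the fly. As an illustration of the idea only (a hypothetical helper, not the Keras code), here is a random-shift augmenter in NumPy:

```python
import numpy as np

def random_shift(img, max_shift=2, rng=None):
    """Translate an image by up to max_shift pixels in each direction,
    padding with zeros -- a simplified stand-in for one ImageDataGenerator transform."""
    rng = rng if rng is not None else np.random.default_rng()
    dx, dy = rng.integers(-max_shift, max_shift + 1, size=2)
    out = np.zeros_like(img)
    h, w = img.shape[:2]
    src = img[max(0, -dy):h - max(0, dy), max(0, -dx):w - max(0, dx)]
    out[max(0, dy):h - max(0, -dy), max(0, dx):w - max(0, -dx)] = src
    return out
```

Augmentation like this exposes the network to displaced copies of each sign, which helps even an architecture that already models displacement.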
Results (accuracy): training: 99%, validation: 98%, test: 97%
This does not match the best results of classic convolutional neural networks. However, given that I spent most of my time implementing the capsule network rather than on hyperparameter tuning and image preprocessing, 97% is a good result for me. I am still trying to improve it.

Classification Examples
If you like this article, please follow my Toutiao account: The Brain in the New Vat.
Original: Understand and apply CapsNet on traffic sign classification