This article covers Hinton's second capsule network paper, Matrix capsules with EM routing, authored by Geoffrey Hinton, Sara Sabour and Nicholas Frosst. We cover matrix capsules and apply EM (expectation-maximization) routing to classify images with different viewpoints. For those who want to understand the implementation details, the second half of the article covers an implementation of matrix capsules and EM routing using TensorFlow.

CNN challenges
In our previous capsule article, we covered the challenges of CNNs in exploring spatial relationships and discussed how capsule networks may address those shortcomings. Let's recap some important challenges for CNNs in classifying the same class of images in different viewpoints, for example classifying a face correctly even with different orientations.
Conceptually, the CNN trains neurons to handle different feature orientations (0°, 20°, -20°) with a top-level face detection neuron.
To solve the problem, we add more convolution layers and feature maps. Nevertheless, this approach tends to memorize the dataset rather than generalize. It requires a large volume of training data to cover different variants and to avoid overfitting. The MNIST dataset contains 55,000 training samples, i.e. 5,500 samples per digit. However, it is unlikely that children need so many samples to learn to read digits. Our existing deep learning models, including CNNs, are inefficient in utilizing training data.

Adversaries
A CNN is also vulnerable to adversaries that simply move, rotate or resize individual features.
We can also add tiny, unnoticeable changes to fool a deep network. The image on the left below is correctly classified by a CNN as a panda. By selectively adding the small changes shown in the middle picture to the panda, the CNN suddenly mis-classifies the resulting image on the right as a gibbon.
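As a toy illustration of the idea (not the actual panda attack, which perturbs a full image against a deep network), we can sketch a gradient-sign perturbation on a hypothetical linear classifier; the weights, input, and step size here are made up for illustration:

```python
import numpy as np

# Toy linear "classifier": score > 0 means class A, score <= 0 means class B.
# The weights and input are made up for illustration.
w = np.array([1.0, -2.0, 0.5])
x = np.array([2.0, 0.5, 1.0])

score = w @ x                      # 1.5 > 0: classified as class A

# Gradient-sign perturbation: nudge every input dimension a small amount
# against the gradient of the score (for a linear model, the gradient
# with respect to x is just w).
eps = 0.8
x_adv = x - eps * np.sign(w)

adv_score = w @ x_adv              # the small per-feature change flips the decision
print(score > 0, adv_score > 0)
```

Even though no single feature changed much, the accumulated effect across dimensions flips the classification, which is the essence of this family of attacks.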
(Image source OpenAI)

Capsule
A capsule captures the likelihood of a feature as well as its variants. So a capsule does not only detect a feature; it is also trained to learn and detect its variants.
For example, the same network layer can detect a face rotated clockwise.
Equivariance is the detection of objects that can transform into each other. Intuitively, the capsule network detects that the face is rotated right 20° (equivariance) rather than realizing that the face matches a variant that is rotated 20°. By forcing the model to learn the feature variants inside a capsule, we may extrapolate possible variants more effectively with less training data. In a CNN, the final label is viewpoint-invariant, i.e. the top neuron detects a face but loses the information about the angle of rotation. With equivariance, variant information like the angle of rotation is kept inside the capsule. Maintaining such spatial information helps us to avoid adversaries.

Matrix capsule
A matrix capsule captures the activation (likelihood), similar to that of a neuron, but it also captures a 4×4 pose matrix. In computer graphics, a pose matrix defines the translation and the rotation of an object, which is equivalent to a change of the viewpoint of the object.
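As a rough sketch of what a pose matrix encodes (the specific angle and translation below are made up for illustration), a 4×4 matrix in homogeneous coordinates packs a rotation and a translation together:

```python
import numpy as np

# A 4x4 pose matrix in homogeneous coordinates: the upper-left 3x3 block
# is a rotation (here 20° about the z-axis) and the last column holds a
# translation. The specific numbers are made up for illustration.
theta = np.deg2rad(20)
pose = np.array([
    [np.cos(theta), -np.sin(theta), 0.0, 1.0],
    [np.sin(theta),  np.cos(theta), 0.0, 2.0],
    [0.0,            0.0,           1.0, 0.0],
    [0.0,            0.0,           0.0, 1.0],
])

# Applying the pose to a point moves it to the new viewpoint.
p = np.array([1.0, 0.0, 0.0, 1.0])   # point (1, 0, 0) in homogeneous form
print(pose @ p)
```

A capsule's pose matrix is learned rather than hand-built like this one, but the geometric interpretation is the same.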
(Source from the Matrix capsules with EM routing paper)
For example, the images in the second row below represent the same objects in the first row from different viewpoints. In matrix capsules, we train the model to capture the pose information (orientation, azimuth, etc.). Of course, just like other deep learning methods, this is our intention and it is never guaranteed.
(Source from the Matrix capsules with EM routing paper)
The objective of EM (expectation-maximization) routing is to group capsules to form a part-whole relationship using a clustering technique (EM). In machine learning, we use EM clustering to cluster datapoints into Gaussian distributions. For example, we cluster the datapoints below into two clusters modeled by two Gaussian distributions, G1 = N(μ1, σ1²) and G2 = N(μ2, σ2²). Then we represent each datapoint by its corresponding Gaussian distribution.
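A minimal EM loop for a two-component, one-dimensional Gaussian mixture might look like the following sketch; the synthetic data and initial guesses are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two 1-D clusters, which we model as G1 = N(mu1, sigma1^2) and G2 = N(mu2, sigma2^2).
data = np.concatenate([rng.normal(-2, 0.5, 200), rng.normal(3, 1.0, 200)])

# Initial guesses for the means, standard deviations, and mixing weights.
mu = np.array([-1.0, 1.0])
sigma = np.array([1.0, 1.0])
pi = np.array([0.5, 0.5])

def gaussian(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

for _ in range(50):
    # E-step: responsibility of each Gaussian for each datapoint.
    r = pi * gaussian(data[:, None], mu, sigma)
    r /= r.sum(axis=1, keepdims=True)
    # M-step: re-estimate the parameters from the responsibilities.
    n = r.sum(axis=0)
    mu = (r * data[:, None]).sum(axis=0) / n
    sigma = np.sqrt((r * (data[:, None] - mu) ** 2).sum(axis=0) / n)
    pi = n / len(data)

print(mu)   # the recovered means approach the true cluster centers -2 and 3
```

EM routing applies the same alternate-and-refine idea, but the "datapoints" being clustered are the votes from lower-level capsules.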
In the face detection example, each of the mouth, eyes and nose detection capsules in the lower layer makes votes on the pose matrices of its possible parent capsules. Each vote is a predicted value for a parent capsule's pose matrix, and it is computed by multiplying the capsule's own pose matrix M with a transformation matrix W that we learn from the training data:

V = MW
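A sketch of the vote computation, with random matrices standing in for a trained capsule's pose M and a learned transformation W:

```python
import numpy as np

rng = np.random.default_rng(0)

# Pose matrix M of a lower-level capsule (say, the nose capsule) and a
# transformation matrix W that would be learned during training; both are
# random here purely for illustration.
M = rng.normal(size=(4, 4))
W = rng.normal(size=(4, 4))

# The vote is this capsule's prediction for the parent's pose: V = MW.
V = M @ W
print(V.shape)
```

Each lower-level capsule keeps a separate W per potential parent, so the nose capsule can make different predictions for, say, a face capsule versus some other parent.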
We apply EM routing to group capsules into a parent capsule at runtime:
i.e., if the nose, mouth and eyes capsules all vote for a similar pose matrix value, we cluster them together to form a parent capsule: the face capsule.
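One way to see why agreement matters: votes that cluster tightly have low variance across capsules. The following sketch, with made-up vote vectors, uses mean per-dimension variance as a crude stand-in for the agreement measure that EM routing computes:

```python
import numpy as np

# Three made-up votes (flattened 4x4 poses -> 16-dim vectors) from the
# nose, mouth and eyes capsules. In the first set the votes nearly agree;
# in the second set they are far apart.
agreeing = np.stack([np.ones(16) + 0.01 * i for i in range(3)])
disagreeing = np.stack([np.full(16, 2.0 * i) for i in range(3)])

def spread(votes):
    # Mean per-dimension variance across capsules: a small value means the
    # votes form a tight cluster (strong agreement).
    return votes.var(axis=0).mean()

print(spread(agreeing), spread(disagreeing))
```

In the actual algorithm, a tight cluster of votes translates into a high activation for the parent capsule, while scattered votes leave it inactive.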
A higher-level feature (a face) is detected by looking for agreement between the votes from the capsules one layer below. We use EM routing to cluster capsules whose corresponding votes are in close proximity.

Gaussian mixture model & expectation-maximization (EM)
We'll take a short break to understand EM. A Gaussian mixture model clusters datapoints into a mixture of Gaussian distributions, each described by a mean μ and a standard deviation σ.