After some thought, I don't believe that the pooling operation is responsible for the translation-invariance property in CNNs. I believe that invariance (at least to translation) is due to the convolution filters (not specifically the pooling) and due to the fully-connected layer.
For instance, let's use Fig. 1 as a reference:
The blue volume represents the input image, while the green and yellow volumes represent the activation volumes output by layer 1 and layer 2 (see CS231n Convolutional Neural Networks for Visual Recognition if you are not familiar with these volumes). At the end, there is a fully-connected layer that is connected to all activation points of the yellow volume.
These volumes are built using a convolution plus a pooling operation. The pooling operation reduces the height and width of these volumes, while the increasing number of filters in each layer increases the volume depth.
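The shape bookkeeping above can be sketched with a toy NumPy implementation. The sizes here are made up for illustration (a 32×32 RGB input, 8 filters of 3×3, and 2×2 pooling); the point is only that convolution plus pooling shrinks height and width while the filter count sets the depth:

```python
import numpy as np

def conv2d(image, filters):
    """Valid 2-D cross-correlation of an (H, W, C_in) image with
    (n_filters, k, k, C_in) filters -> (H-k+1, W-k+1, n_filters)."""
    h, w, _ = image.shape
    n, k, _, _ = filters.shape
    out = np.zeros((h - k + 1, w - k + 1, n))
    for f in range(n):
        for i in range(h - k + 1):
            for j in range(w - k + 1):
                out[i, j, f] = np.sum(image[i:i + k, j:j + k, :] * filters[f])
    return out

def max_pool(volume, size=2):
    """Non-overlapping max pooling over the spatial dimensions."""
    h, w, c = volume.shape
    return (volume[:h - h % size, :w - w % size, :]
            .reshape(h // size, size, w // size, size, c)
            .max(axis=(1, 3)))

rng = np.random.default_rng(0)
image = rng.standard_normal((32, 32, 3))                  # the "blue" input volume
layer1 = max_pool(conv2d(image, rng.standard_normal((8, 3, 3, 3))))
print(layer1.shape)  # (15, 15, 8): smaller height/width, greater depth
```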
For the sake of the argument, let's suppose that we have very "ludic" filters, as shown in Fig. 2:
- The first layer filters (which would generate the green volume) detect eyes, noses and other basic shapes (in real CNNs, first-layer filters would match lines and very basic textures);
- The second layer filters (which would generate the yellow volume) detect faces, legs and other objects that are aggregations of the first-layer filters. Again, this is just an example: real-life convolution filters may detect objects that have no meaning to humans.
Now suppose that there is a face at one of the corners of the image (represented by a red and a magenta point). The eyes are detected by the first filter, and therefore produce activations in the first slice of the green volume. The same happens for the nose, except that it is detected by the second filter and appears in the second slice. Next, the face filter finds that there are eyes and a nose next to each other, and it generates an activation in the yellow volume (within the same region as the face in the input image). Finally, the fully-connected layer detects that there is a face (and maybe a leg and an arm detected by other filters) and outputs that it has detected a human body.
Now suppose that the face has moved to another corner of the image, as shown in Fig. 3:
The same number of activations occurs in this example; however, they occur in a different region of the green and yellow volumes. Therefore, any activation in the first slice of the yellow volume means that a face is detected, independently of the face's location. The fully-connected layer is then responsible for "translating" a face detection into a human-body detection. In both examples, an activation is received at one of the fully-connected neurons. However, in each example, the activation path inside the FC layer is different, meaning that correct learning at the FC layer is essential to ensure the invariance property.
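The claim that the same activations appear, just in a different region, is the translation *equivariance* of convolution, and it is easy to check numerically. Below, a random 3×3 kernel stands in for the hypothetical "face" filter and a random 5×5 patch stands in for the face; both are made up for the sketch. Placing the patch at two different corners produces the same activation pattern, shifted by the same offset:

```python
import numpy as np

def conv2d_single(image, kernel):
    """Valid 2-D cross-correlation of a 2-D image with a k x k kernel."""
    h, w = image.shape
    k = kernel.shape[0]
    out = np.zeros((h - k + 1, w - k + 1))
    for i in range(h - k + 1):
        for j in range(w - k + 1):
            out[i, j] = np.sum(image[i:i + k, j:j + k] * kernel)
    return out

rng = np.random.default_rng(1)
kernel = rng.standard_normal((3, 3))   # stand-in for the "face" filter
face = rng.standard_normal((5, 5))     # stand-in for the face patch

top_left = np.zeros((12, 12))
top_left[0:5, 0:5] = face              # face at one corner
bottom_right = np.zeros((12, 12))
bottom_right[7:12, 7:12] = face        # same face at the opposite corner

a = conv2d_single(top_left, kernel)
b = conv2d_single(bottom_right, kernel)
# Where the kernel fits fully inside the face, the responses are
# identical, just shifted by (7, 7) along with the face itself:
print(np.allclose(a[0:3, 0:3], b[7:10, 7:10]))  # True
```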
It must be noticed that the pooling operation only "compresses" the activation volumes; if there were no pooling in this example, an activation in the first slice of the yellow volume would still mean a face.
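To make the "compression" point concrete, here is a minimal sketch (sizes are again made up). Max pooling shrinks an 8×8 activation map to 4×4, and as a side effect it also absorbs shifts smaller than the pooling window, since a single activation moved by one pixel can land in the same pooled cell:

```python
import numpy as np

def max_pool2d(volume, size=2):
    """Non-overlapping 2-D max pooling of a 2-D activation map."""
    h, w = volume.shape
    return (volume[:h - h % size, :w - w % size]
            .reshape(h // size, size, w // size, size)
            .max(axis=(1, 3)))

act = np.zeros((8, 8))
act[2, 2] = 1.0            # an activation: "face detected here"
shifted = np.zeros((8, 8))
shifted[3, 3] = 1.0        # same activation, shifted by one pixel

print(max_pool2d(act).shape)                               # (4, 4)
print(np.array_equal(max_pool2d(act), max_pool2d(shifted)))  # True
```

So pooling contributes some local shift tolerance, but the activation-means-face argument above holds with or without it.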
In conclusion, what makes a CNN invariant to object translation is the architecture of the neural network: the convolution filters and the fully-connected layer. Additionally, I believe that if a CNN is trained showing faces only at one corner, then during the learning process the fully-connected layer may become insensitive to faces at the other corners.
Source
https://www.quora.com/How-is-a-convolutional-neural-network-able-to-learn-invariant-features/answer/Jean-Da-Rolt