Why use convolution?
Traditional neural networks such as the multilayer perceptron (MLP) usually take a feature vector as input, which means features must be designed by hand and their values assembled into a vector. Decades of experience show that hand-crafted features are hard to get right: sometimes there are too many, sometimes too few, and sometimes the chosen features do not work at all (the truly useful features are hidden in a vast unknown space). This is one reason why, for a long time, neural networks were outperformed by methods such as SVM.
One might ask: since every feature is ultimately extracted from the image, why not feed the whole image to the network as the feature, so that no information is lost? Setting aside how much redundant information an image contains, there is a more immediate problem...
Take a 1000*1000 image. If the whole image is flattened into a vector, its length is 1,000,000 (10^6). If the hidden layer has as many neurons as the input, that is also 10^6, so the input-to-hidden connections alone need 10^12 parameters; no ordinary machine can train such a network. We therefore need to reduce the number of parameters while still feeding the entire image as input (since humans cannot reliably find good features by hand). This is where convolution comes in. Let us see what convolution does.
CNN (convolutional neural network) layer hierarchy
A CNN consists of five kinds of layers:
- Input layer
- Convolution layer
- Activation layer
- Pooling Layer
- Fully connected (FC) layer
1. Input layer
As in traditional neural networks / machine learning, the input needs preprocessing. Common preprocessing steps for the input layer are listed below; a small sketch of the first two follows the list:
- Mean subtraction (zero-centering)
- Normalization
- PCA/SVD dimensionality reduction, etc.
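As a concrete illustration, here is a minimal NumPy sketch of the first two steps (mean subtraction and normalization). The function name and data shapes are assumptions for the example, not part of any particular library.

```python
import numpy as np

def preprocess(X):
    """Zero-center and normalize a batch of flattened images, shape (num_samples, num_features)."""
    X = X.astype(np.float64)
    X -= X.mean(axis=0)            # mean subtraction: zero-center each feature
    X /= (X.std(axis=0) + 1e-8)    # normalization: roughly unit variance per feature
    return X

X = np.random.rand(4, 6)           # 4 tiny "images" of 6 pixels each
print(preprocess(X).mean(axis=0))  # close to 0 for every feature
```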
2. Convolution layer
Local perception: when the human brain recognizes a picture, it does not take in the whole image at once; it first perceives each local feature of the image and then combines the local results at a higher level to obtain global information. (Detailed later.)
3. Activation (excitation) layer
The so-called activation is simply a nonlinear mapping applied to the output of the convolution layer.
If no activation function is used (which is equivalent to using f(x) = x), then each layer's output is a linear function of the previous layer's input. It is easy to see that no matter how many layers the network has, the output is still a linear combination of the input, so hidden layers add nothing; this is just the original perceptron.
Commonly used activation functions include:
- sigmoid function
- Tanh function
- ReLU
- Leaky ReLU
- ELU
- Maxout
Practical advice for the activation layer: try ReLU first, because it converges quickly, although it may not always help. If ReLU fails, try Leaky ReLU or Maxout, which usually resolves the problem. Tanh tends to work well for text and audio processing. A small sketch of these functions follows.
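For reference, here is a small NumPy sketch of several of the activations listed above; these are illustrative definitions, not a library API.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)           # small slope instead of 0 for negative inputs

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))         # negatives clamped to 0
print(leaky_relu(x))   # negatives scaled by alpha rather than clamped
```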
4. Pooling layer
Pooling, also called subsampling or downsampling, is mainly used to reduce feature dimensionality, compress the data and the number of parameters, reduce overfitting, and improve the model's fault tolerance. The main variants are:
- Max pooling
- Average pooling
Through the pooling layer, an original 4*4 feature map is compressed into 2*2, which reduces the feature dimension, as in the sketch below.
The pooled feature map is harder to interpret by eye, but that does not matter; the machine can still recognize it.
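Here is a minimal NumPy sketch of 2x2 max pooling compressing a 4*4 feature map to 2*2, as described above; average pooling would simply replace max with mean.

```python
import numpy as np

def max_pool_2x2(fmap):
    """2x2 max pooling with stride 2 on a single 2-D feature map (assumes even dimensions)."""
    h, w = fmap.shape
    return fmap.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fmap = np.arange(16).reshape(4, 4)   # a 4*4 feature map
print(max_pool_2x2(fmap))            # 2*2 result: [[ 5  7] [13 15]]
```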
5. Output layer (fully connected layer)
After several rounds of convolution + activation + pooling, we finally reach the output layer, where the high-quality features the model has learned are fed into a fully connected layer. Before the fully connected layer, if the number of neurons is too large and the model's capacity too high, overfitting may occur. Dropout can be introduced to randomly drop some neurons during training to address this. Local response normalization (LRN), data augmentation, and other operations can also be used to increase robustness.
The fully connected layer itself can be understood as a simple multi-class neural network (for example, a BP network) whose final output is obtained through the softmax function. At this point the whole model has been described.
All neurons in the two adjacent layers are connected by weights; the fully connected layer usually sits at the tail end of the convolutional neural network, and its neurons are connected in the same way as in a traditional neural network. A small sketch follows.
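To make the idea concrete, here is a minimal sketch of a fully connected layer followed by softmax; `features`, `W`, and `b` are hypothetical stand-ins for the flattened conv features and the learned parameters.

```python
import numpy as np

def fc_softmax(features, W, b):
    """Fully connected layer followed by softmax: every input is weighted into every output neuron."""
    logits = features @ W + b
    exp = np.exp(logits - logits.max())       # subtract the max for numerical stability
    return exp / exp.sum()                    # class probabilities

features = np.random.rand(120)                # e.g. a 120-dim feature vector from the last conv stage
W = np.random.randn(120, 10) * 0.01           # 10 output classes (hypothetical sizes)
b = np.zeros(10)
print(fc_softmax(features, W, b).sum())       # 1.0: a valid probability distribution
```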
A detailed look at the convolutional and pooling layers of a CNN
A convolutional neural network (CNN) is a feedforward neural network whose artificial neurons respond to surrounding units within their receptive fields, which makes it well suited to large-scale image processing. A CNN consists of convolutional layers and pooling layers.
Convolutional neural networks are a biologically inspired variant of MLPs (multilayer perceptrons). They are organized in a hierarchy of layers, and each layer works in a different way and serves a different function. A good CNN tutorial is available here: http://cs231n.github.io/convolutional-networks/. In this article we describe in detail how a CNN computes and how data flows through it.
The convolutional neural network is a kind of artificial neural network that has become a research hotspot in speech analysis and image recognition. Its weight-sharing structure makes it more similar to a biological neural network, lowering the complexity of the network model and reducing the number of weights. This advantage is especially clear when the network input is a multidimensional image: the image can be fed directly into the network, avoiding the complicated feature extraction and data reconstruction of traditional recognition algorithms. A convolutional network is a multilayer perceptron specially designed to recognize two-dimensional shapes, and it is highly invariant to translation, scaling, tilting, and other deformations.
CNNs were influenced by the earlier time-delay neural network (TDNN), which reduces learning complexity by sharing weights along the time dimension and is suited to speech and time-series processing.
CNNs were the first learning algorithm to truly succeed at training a multi-layer network. They use spatial relationships to reduce the number of parameters to be learned, improving on the training performance of the general feedforward BP algorithm. As a deep learning architecture, CNNs were proposed to minimize the preprocessing of data. In a CNN, a small patch of the image (the local receptive field) serves as the input to the lowest layer of the hierarchy; the information is then passed through successive layers, each of which applies digital filters to obtain the most salient features of the observed data. This approach captures salient features that are invariant to translation, scaling, and rotation, because the local receptive field lets neurons or processing units access the most basic features, such as oriented edges or corner points.
(1) History of convolutional neural networks
In 1962, Hubel and Wiesel proposed the concept of the receptive field through their studies of cells in the cat's visual cortex. In 1984, the Japanese scholar Fukushima proposed the neocognitron based on the receptive field concept; it can be regarded as the first implementation of a convolutional neural network and the first application of the receptive field concept in artificial neural networks. The neocognitron decomposes a visual pattern into many sub-patterns (features) that are then processed in hierarchically connected feature planes; it attempts to model the visual system so that recognition still works even when the object is shifted or slightly deformed.
Typically the neocognitron contains two kinds of neurons: S-cells, which perform feature extraction, and C-cells, which tolerate deformation. S-cells involve two important parameters, the receptive field and a threshold: the former determines the number of input connections, while the latter controls how strongly the cell responds to its feature sub-pattern. Many researchers have worked to improve the neocognitron. In the traditional neocognitron, the visual blur contributed by a C-cell within the photosensitive region of each S-cell is normally distributed. If the blur produced at the edge of the photosensitive region is larger than at the center, the S-cell will tolerate the larger deformation caused by this non-normal blur. What we want is for the difference between the effect of a training pattern and that of a deformed stimulus pattern to become larger at the edge of the receptive field than at its center. To produce this kind of non-normal blur effectively, an improved neocognitron with a double C-cell layer was proposed.
Van Ooyen and Niehuis introduced a new parameter to improve the neocognitron's discriminative ability. This parameter acts as an inhibitory signal that suppresses a neuron's excitation by repeatedly presented features. Most neural networks store training information in their weights; according to the Hebbian learning rule, the more often a particular feature is trained, the more easily it is detected later. Some researchers have also combined evolutionary computation with the neocognitron, weakening the training on repeatedly excited features so that the network attends to different features, which helps improve its discrimination. All of this is the development history of the neocognitron; the convolutional neural network can be regarded as a generalized form of the neocognitron, and the neocognitron as a special case of the convolutional neural network.
(2) Network structure of convolutional neural networks
Let us first introduce some terms encountered in the convolutional layer:
• Depth (the number of filters, i.e. the number of output feature maps)
• Stride (how far the window slides at each step)
• Padding (zero-padding)
What is the padding value? Suppose we have a 5*5 picture (one number per pixel) and we slide a 2*2 window with a stride of 2; we find that 1 pixel is left over that the window cannot cover. What do we do?
We add a layer of padding to the original matrix, making it a 6*6 matrix, so that the window covers all the pixels exactly. That is the effect of the padding value, as the sketch below shows.
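A tiny NumPy sketch of that padding step (zero-padding one extra row and column, matching the 5*5 to 6*6 example above):

```python
import numpy as np

img = np.arange(25).reshape(5, 5)        # the 5*5 picture, one number per pixel
padded = np.pad(img, ((0, 1), (0, 1)))   # zero-pad one extra row and one extra column -> 6*6
print(padded.shape)                      # (6, 6): a 2*2 window with stride 2 now covers everything
```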
A convolutional neural network is a multilayer neural network in which each layer is made up of several two-dimensional planes, and each plane is made up of many independent neurons.
When the CNN runs, the data is repeatedly convolved and transformed as it flows through the network.
Figure 1: Conceptual demonstration of convolutional neural networks
The input image is convolved with three trained filters and an optional bias; after convolution, three feature maps appear in the C1 layer. Then each group of four pixels in a feature map is summed, weighted, and biased, and passed through a sigmoid function to produce the three feature maps of the S2 layer. These maps are filtered again to obtain the C3 layer, and the same hierarchy produces S4 from C3 just as S2 was produced from C1. Finally, these pixel values are rasterized and concatenated into a vector that is fed into a traditional neural network, producing the output.
In general, the C layers are feature extraction layers: each neuron's input is connected to a local receptive field in the previous layer, and the local feature is extracted; once a local feature is extracted, its positional relationship to other features is also determined. The S layers are feature mapping layers: each computational layer of the network consists of multiple feature maps, each feature map is a plane, and all neurons in the plane share the same weights. The feature mapping structure uses the sigmoid function, which has a small influence-function kernel, as the activation function of the convolutional network, which gives the feature maps shift invariance.
In addition, because neurons on the same mapping plane share weights, the number of free parameters is reduced, lowering the complexity of parameter selection. Each feature extraction layer (C layer) in the convolutional neural network is followed by a computational layer (S layer) that performs local averaging and a second feature extraction; this characteristic two-stage feature extraction structure gives the network greater tolerance to distortion in the input samples during recognition.
(3) Parameter reduction and weight sharing
This is arguably the highlight of CNNs: the number of parameters the network needs to train is reduced by local receptive fields and weight sharing. What does that actually mean?
Convolutional neural networks have two tricks for reducing the number of parameters. The first is the local receptive field. It is generally believed that human perception of the outside world goes from local to global, and that in an image the spatial correlation between nearby pixels is stronger than between distant pixels. Therefore, each neuron does not need to perceive the whole image; it only needs to perceive a local region, and the local information is combined at higher levels to obtain global information. This idea of partial connectivity is also inspired by the structure of the biological visual system: neurons in the visual cortex receive information locally (they respond only to stimuli in specific regions).
Local perception
Left: if we have a 1000x1000-pixel image and 1 million hidden neurons, fully connecting them (every hidden neuron connected to every pixel) gives 1000x1000x1000000 = 10^12 connections, i.e., 10^12 weight parameters. However, the spatial correlation in images is local. Just as a person perceives an external scene through a local receptive field, each neuron does not need to perceive the whole image; it only perceives a local region, and at higher levels the neurons that perceive different local regions are combined to obtain global information. In this way we reduce the number of connections, i.e., the number of weights the network needs to train. Right: if the local receptive field is 10x10, each hidden neuron only needs to connect to a 10x10 local patch, so the 1 million hidden neurons have only 100 million connections, i.e., 10^8 parameters: four orders of magnitude fewer than before, so training is far less laborious. But that still feels like a lot; is there anything more we can do?
We know each hidden neuron connects to a 10x10 image patch, i.e., each neuron has 10x10 = 100 connection weights. What if all neurons share the same 100 parameters? That would mean every neuron convolves the image with the same convolution kernel. How many parameters are left then? Only 100! No matter how many neurons the hidden layer has, the connection between the two layers needs only 100 parameters. This is weight sharing, and it is the main selling point of convolutional neural networks. You may ask: is this reliable? Why does it work? That is something to study further; the sketch below shows the idea.
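The following NumPy sketch shows weight sharing in action: a single 10x10 kernel (100 weights) is slid over the whole image, so every output neuron reuses the same 100 parameters. This is an illustration of the idea, not an optimized implementation.

```python
import numpy as np

def convolve2d(image, kernel, stride=1):
    """Slide one shared kernel over the image; the layer's only parameters are kernel.size weights."""
    kh, kw = kernel.shape
    h, w = image.shape
    out_h = (h - kh) // stride + 1
    out_w = (w - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)   # the same weights are reused at every position
    return out

image = np.random.rand(16, 16)
kernel = np.random.rand(10, 10)              # 10x10 = 100 shared weights, as in the text
print(convolve2d(image, kernel).shape)       # (7, 7) feature map
```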
You might then object: if features are extracted this way, haven't we extracted only one feature? Exactly, so we need to extract many features. If one filter, i.e., one convolution kernel, extracts one kind of feature from the image (for example, edges in a certain direction), then to extract different features we simply add more filters. Suppose we add 100 filters, each with different parameters, representing different features of the input image, such as different edges. Each filter convolved over the image produces a map showing a different aspect of the image, which we call a feature map. So 100 convolution kernels give 100 feature maps, and these 100 feature maps form one layer of neurons. How many parameters does this layer have? 100 kernels x 100 shared parameters per kernel = 10,000 parameters. (See right: different colors indicate different filters.)
One remaining question. We said that the number of parameters in the hidden layer is independent of the number of hidden neurons and depends only on the filter size and the number of filters. So how is the number of hidden neurons determined? It is determined by the size of the original image (the input), the filter size, and the filter's sliding stride. For example, if the image is 1000x1000 pixels and the filter is 10x10, and the filters do not overlap, i.e., the stride is 10, then the number of hidden neurons is (1000x1000)/(10x10) = 100x100. If the stride is 8, the kernels overlap by two pixels, and the count changes accordingly (the idea is what matters). Note that this is the neuron count for a single filter, i.e., a single feature map; with 100 feature maps it is 100 times as many. Thus the larger the image, the bigger the gap between the number of neurons and the number of weights that need to be trained.
Note that the discussion above ignores each neuron's bias. With the bias, the number of weights per filter increases by 1, and the bias is likewise shared by all neurons using the same filter.
In short, the core idea of convolutional networks is to combine three structural ideas: local receptive fields, weight sharing (weight replication), and temporal or spatial subsampling, in order to obtain some degree of invariance to shift, scale, and deformation.
(4) A typical example
A typical convolutional network used for digit recognition is LeNet-5 (see the published results and paper). Most American banks once used it to recognize handwritten digits on cheques; one can imagine the accuracy needed to reach that level of commercial deployment. It remains, after all, a much-discussed meeting point of academia and industry.
Let's also use this example to illustrate the following.
LeNet-5 has 7 layers (not counting the input), each of which contains trainable parameters (connection weights). The input image is 32*32, larger than the largest character in the MNIST database (a widely used handwriting database). The reason is that potential salient features such as stroke endpoints or corner points should be able to appear at the center of the receptive field of the highest-level feature detectors.
To be clear: each layer has multiple feature maps, each feature map extracts one feature of the input through a convolution filter, and each feature map contains multiple neurons.
The C1 layer is a convolution layer. (Why convolution? An important property of convolution is that it can strengthen the original signal's features while reducing noise.) It consists of 6 feature maps. Each neuron in a feature map is connected to a 5*5 neighborhood of the input. The feature map size is 28*28, which keeps the input connections from falling outside the boundary (helpful for the BP feedback computation, avoiding gradient loss; a personal observation). C1 has 156 trainable parameters (each filter has 5*5 = 25 weights plus one bias, and there are 6 filters, for a total of (5*5+1)*6 = 156 parameters) and 156*(28*28) = 122,304 connections.
The S2 layer is a subsampling layer. (Why subsample? Using the principle of local correlation in images, subsampling reduces the amount of data to process while preserving useful information.) It has 6 feature maps of size 14*14. Each unit in a feature map is connected to a 2*2 neighborhood of the corresponding feature map in C1. The 4 inputs of each S2 unit are summed, multiplied by a trainable coefficient, added to a trainable bias, and passed through the sigmoid function. The trainable coefficient and bias control the nonlinearity of the sigmoid: if the coefficient is small, the operation is approximately linear and the subsampling simply blurs the image; if the coefficient is large, then depending on the bias the subsampling behaves like a noisy "or" or a noisy "and". The 2*2 receptive fields do not overlap, so each feature map in S2 is 1/4 the size of the corresponding map in C1 (1/2 in each of rows and columns). The S2 layer has 12 trainable parameters (6*(1+1) = 12) and 5,880 connections (14*14*(2*2+1)*6 = 5,880).
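A quick arithmetic check of the C1 and S2 figures quoted above:

```python
c1_params      = (5 * 5 + 1) * 6              # 156 trainable parameters
c1_connections = c1_params * 28 * 28          # 122,304 connections
s2_params      = (1 + 1) * 6                  # 12: one coefficient and one bias per feature map
s2_connections = (2 * 2 + 1) * 14 * 14 * 6    # 5,880 connections
print(c1_params, c1_connections, s2_params, s2_connections)   # 156 122304 12 5880
```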
Figure: the convolution and subsampling process. Convolution: a trainable filter fx convolves an input image (the input image in the first stage, a convolutional feature map in later stages), a bias bx is added, and the result is the convolution layer Cx. Subsampling: the four pixels of each neighborhood are summed into one pixel, weighted by a trainable scalar Wx+1, a bias bx+1 is added, and the result is passed through a sigmoid activation function, producing a feature map Sx+1 that is roughly four times smaller.
So the mapping from one plane to the next can be regarded as a convolution operation, and the S layers can be regarded as blur filters performing a second feature extraction. The spatial resolution decreases from hidden layer to hidden layer while the number of planes per layer increases, which allows more kinds of feature information to be detected.
C3 is also a convolution layer. It convolves S2 with 5x5 kernels, so each resulting feature map has only 10x10 neurons, but there are 16 different kernels and hence 16 feature maps. One thing to note is that each feature map in C3 is connected to all 6, or only some, of the feature maps in S2, meaning each C3 feature map is a different combination of the feature maps extracted by the previous layer (this is not the only possible scheme). (Note the combination here: just like the human visual system described earlier, lower-level structures compose higher-level, more abstract structures, e.g., edges compose shapes or parts of an object.)
As just said, each feature map in C3 is composed of all or some of the 6 feature maps in S2. Why not connect every feature map in S2 to every feature map in C3? There are two reasons. First, an incomplete connection scheme keeps the number of connections within a reasonable range. Second, and more importantly, it breaks the symmetry of the network: since different feature maps receive different inputs, they are forced to extract different (hopefully complementary) features.
For example, one possible scheme is: the first 6 feature maps of C3 take as input subsets of 3 adjacent feature maps of S2; the next 6 take subsets of 4 adjacent feature maps; the next 3 take subsets of 4 non-adjacent feature maps; and the last one takes all the feature maps of S2 as input. This gives the C3 layer 1,516 trainable parameters and 151,600 connections, as the check below confirms.
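A quick check of the C3 figures, using the connection scheme just described:

```python
# 6 maps take 3 adjacent S2 maps, 6 take 4 adjacent, 3 take 4 non-adjacent, 1 takes all 6.
c3_params = 6 * (3 * 5 * 5 + 1) + 6 * (4 * 5 * 5 + 1) + 3 * (4 * 5 * 5 + 1) + 1 * (6 * 5 * 5 + 1)
c3_connections = c3_params * 10 * 10           # each C3 feature map is 10*10
print(c3_params, c3_connections)               # 1516 151600
```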
The S4 layer is a subsampling layer consisting of 16 feature maps of size 5*5. Each unit in a feature map is connected to a 2*2 neighborhood of the corresponding feature map in C3, in the same way C1 connects to S2. The S4 layer has 32 trainable parameters (one coefficient and one bias per feature map) and 2,000 connections.
The C5 layer is a convolution layer with 120 feature maps. Each unit is connected to a 5*5 neighborhood of all 16 feature maps in S4. Because the S4 feature maps are also 5*5 (the same size as the filter), the C5 feature maps are 1*1, which amounts to a full connection between S4 and C5. C5 is still labeled a convolution layer rather than a fully connected layer because, if the LeNet-5 input were larger with everything else unchanged, the feature maps would be larger than 1*1. The C5 layer has 48,120 trainable connections.
The F6 layer has 84 units (this number comes from the design of the output layer) and is fully connected to C5, giving 10,164 trainable parameters. Like a classical neural network, F6 computes the dot product between its input vector and its weight vector, adds a bias, and passes the result through a sigmoid to produce the state of unit i.
Finally, the output layer consists of Euclidean radial basis function (RBF) units, one per class, each with 84 inputs. In other words, each output RBF unit computes the Euclidean distance between its input vector and its parameter vector: the farther the input is from the parameter vector, the larger the RBF output. An RBF output can be interpreted as a penalty measuring how well the input pattern matches the model of the class associated with that RBF. In probabilistic terms, the RBF output can be understood as the negative log-likelihood of a Gaussian distribution in the configuration space of the F6 layer. Given an input pattern, the loss function should drive the F6 configuration close enough to the RBF parameter vector of the pattern's desired class. The parameters of these units are chosen by hand and kept fixed (at least initially). The components of these parameter vectors are set to -1 or +1. Although they could be chosen at random with equal probabilities of +1 and -1, or could form an error-correcting code, they were instead designed as stylized 7*12 (i.e., 84-pixel) bitmaps of the corresponding character class. Such a representation is not particularly useful for recognizing isolated digits, but it is useful for recognizing strings of characters drawn from the full printable ASCII set.
Another reason for using this distributed code rather than the more common "1-of-N" code for the output is that a non-distributed code performs poorly when the number of classes is large: most of the time, most outputs of a non-distributed code must be 0, which is difficult to achieve with sigmoid units. Yet another reason is that the classifier is used not only to recognize characters but also to reject non-characters. Distributed-code RBFs are better suited to this goal because, unlike sigmoids, they are activated within a well-constrained region of the input space, outside of which atypical patterns are more likely to fall.
The RBF parameter vectors play the role of target vectors for the F6 layer. Note that their components are +1 or -1, which lies well within the range of the F6 sigmoid and thus keeps the sigmoid from saturating. In fact, +1 and -1 are the points of maximum curvature of the sigmoid, which lets the F6 units operate in their maximally nonlinear range. Saturation of the sigmoid must be avoided because it leads to slow convergence and an ill-conditioned loss function.
(5) Training process
The mainstream use of neural networks for pattern recognition is supervised learning; unsupervised networks are used more for clustering analysis. In supervised pattern recognition, the class of every sample is known, so samples are no longer distributed in space according to their natural tendency; instead, one looks for an appropriate partition of the space, or a classification boundary, based on the spatial distribution of same-class samples and the degree of separation between samples of different classes, so that samples of different classes lie in different regions. This requires a long and complex learning process that continually adjusts the classification boundaries partitioning the sample space so that as few samples as possible fall into regions belonging to a different class.
A convolutional network is essentially an input-to-output mapping. It can learn a large number of mappings between inputs and outputs without requiring any precise mathematical expression relating them; as long as the network is trained on known patterns, it acquires the ability to map between input-output pairs. Convolutional networks are trained with a teacher, so the training set consists of vector pairs of the form (input vector, ideal output vector). All of these vectors should be actual "running" results of the system the network is meant to emulate, and they can be collected from the real operating system. Before training starts, all weights should be initialized with different small random numbers. "Small" ensures that the network does not saturate because of overly large weights, which would cause training to fail; "different" ensures that the network can learn at all. In fact, if the weight matrix is initialized with identical values, the network is unable to learn.
The training algorithm is similar to the traditional BP algorithm. It consists of 4 steps divided into two phases:
The first stage, the forward propagation phase:
a) Take a sample (Xp, Yp) from the sample set and feed Xp into the network;
b) Compute the corresponding actual output Op.
In this phase, information is transformed step by step from the input layer to the output layer. This is also the process the network executes during normal operation after training is complete. In essence, the input is multiplied by each layer's weight matrix in turn to produce the final output:
Op = Fn( ... F2( F1( Xp W(1) ) W(2) ) ... W(n) )
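A minimal sketch of this layer-by-layer composition, assuming (for illustration) that every F is a sigmoid and that each layer is a simple matrix multiplication:

```python
import numpy as np

def forward(x, weights):
    """Op = Fn(...F2(F1(Xp W(1)) W(2))... W(n)), with sigmoid assumed as each F."""
    out = x
    for W in weights:
        out = 1.0 / (1.0 + np.exp(-(out @ W)))   # multiply by this layer's weight matrix, then squash
    return out

x = np.random.rand(8)                                                    # a sample Xp
weights = [np.random.randn(8, 6), np.random.randn(6, 4), np.random.randn(4, 2)]
print(forward(x, weights))                                               # the actual output Op
```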
The second phase, backward propagation:
a) Compute the difference between the actual output Op and the corresponding ideal output Yp;
b) Propagate the error backward and adjust the weight matrices so as to minimize the error.
(6) Advantages of convolutional neural networks
Convolutional neural networks are mainly used to recognize two-dimensional patterns that are invariant to shift, scaling, and other forms of distortion. Because the CNN's feature detection layers learn from training data, explicit feature extraction is avoided; the features are learned implicitly from the data. Moreover, because the neurons on the same feature map share weights, the network can learn in parallel, which is another major advantage of convolutional networks over networks in which all neurons are connected to each other. With its special structure of locally shared weights, the convolutional neural network has unique advantages in speech recognition and image processing. Its layout is closer to that of a real biological neural network; weight sharing reduces the network's complexity; and, in particular, images given as multidimensional input vectors can be fed directly into the network, which avoids the complexity of data reconstruction during feature extraction and classification.
Mainstream classifiers are almost always based on statistical features, which means certain features must be extracted before classification can take place. However, explicit feature extraction is not easy and is not always reliable in some applications. Convolutional neural networks avoid explicit feature extraction and learn implicitly from the training data. This clearly distinguishes them from other neural-network classifiers: through structural reorganization and weight reduction, the feature extraction step is fused into the multilayer perceptron. They can process grayscale images directly and can therefore be used directly for image-based classification.
Compared with general neural networks, convolutional networks have the following advantages in image processing:
- a) The topological structure of the input image matches the network topology well;
- b) Feature extraction and pattern classification are carried out simultaneously and are both produced during training;
- c) Weight sharing reduces the number of training parameters, making the network structure simpler and more adaptable.
(7) Summary
The close relationship between these layers and spatial information makes CNNs well suited to image processing and understanding, and they perform well at automatically extracting salient features from images. In some work, a Gabor filter has been used as an initial preprocessing step to simulate the human visual system's response to visual stimuli. In most current work, researchers apply CNNs to a variety of machine-learning problems, including face recognition, document analysis, and language detection. To find coherence between consecutive frames of a video, CNNs are currently trained with a temporal coherence objective, but this is not specific to CNNs.
How do I choose the size of the convolution kernel? The bigger the better or the smaller the better?
The answer is small and deep. A single small kernel by itself is not enough; the model's performance improves only by stacking many small kernels.
- A CNN's convolution kernel corresponds to a receptive field, so each neuron does not need to perceive the whole image; each neuron perceives only a local region, and at higher levels the neurons that perceive different regions are combined to obtain global information. One benefit of this is a reduction in the number of training parameters.
VGG often stacks several identical 3x3 kernels, and this design with multiple small kernels is very effective. Two 3x3 convolution layers are equivalent (in receptive field) to one 5x5 convolution layer: each output pixel is related to a surrounding 5x5 region, so the receptive field is 5x5. Similarly, three stacked 3x3 convolution layers are equivalent to one 7x7 convolution layer. Moreover, three 3x3 layers have fewer parameters than one 7x7 layer, only (3x3x3)/(7x7) ≈ 55% as many. Most importantly, three 3x3 layers apply more nonlinear transformations than one 7x7 layer (the former can be followed by three ReLUs, the latter by only one).
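The parameter comparison is easy to verify. Assuming C input channels and C output channels for every layer (biases ignored):

```python
C = 64                                   # an assumed channel count
stacked_3x3 = 3 * (3 * 3 * C * C)        # three stacked 3x3 conv layers
single_7x7  = 7 * 7 * C * C              # one 7x7 conv layer with the same receptive field
print(stacked_3x3 / single_7x7)          # 27/49 ≈ 0.55, the ~55% quoted above
```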
Calculating feature-map size after convolution and pooling
The size after convolution
W: input matrix width, H: input matrix height, F: kernel width/height, P: padding (number of zeros padded), N: number of kernels, S: stride
width: output matrix width after convolution; height: output matrix height after convolution
width = (W - F + 2P)/S + 1
height = (H - F + 2P)/S + 1
For conv2d() / max_pool() with padding='same' (and stride 1), width = W and height = H; with padding='valid', P = 0.
Output volume size: (width, height, N)
The size after pooling
width = (W - F)/S + 1
height = (H - F)/S + 1
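A small sketch implementing both sets of formulas (integer division stands in for the usual floor):

```python
def conv_output_size(w, h, f, p, s):
    """Feature-map size after convolution: (W - F + 2P)/S + 1 in each dimension."""
    return (w - f + 2 * p) // s + 1, (h - f + 2 * p) // s + 1

def pool_output_size(w, h, f, s):
    """Feature-map size after pooling: (W - F)/S + 1 in each dimension."""
    return (w - f) // s + 1, (h - f) // s + 1

print(conv_output_size(32, 32, 5, 0, 1))   # (28, 28), e.g. LeNet-5's C1
print(pool_output_size(28, 28, 2, 2))      # (14, 14), e.g. LeNet-5's S2
```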
Boundary padding: plain convolution has two problems:
1. The image keeps shrinking with each convolution;
2. Boundary information is lost, i.e., pixels at the corners and edges of the image contribute to fewer outputs. Hence padding is needed.
The convolution kernel size is usually odd
On the one hand, an odd size makes the padding for 'same' convolution symmetric, with the same number of zeros added on the left and on the right:
n + 2p - f + 1 = n
p = (f - 1)/2
On the other hand, an odd-sized filter has a central pixel, which makes it easy to describe the filter's position.
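A one-liner confirming that p = (f - 1)/2 is a whole number only for odd filter sizes:

```python
for f in (2, 3, 4, 5, 7):
    print(f, (f - 1) / 2)   # even sizes give fractional padding (0.5, 1.5); odd sizes give whole numbers
```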
Two padding modes: "SAME" / "VALID"
"VALID" Discards only the columns (or columns that cannot be scanned at the bottom) that cannot be scanned to the right.
"Same" tries to add padding to the left and right, but if the number of columns added is odd, add the extra to the left (that is, when you keep the even numbers, the right and left padding are connected, the number is 1 more than the side/the top, or the bottom side is the same as in the vertical direction).
This article is my own study notes, written with reference to other bloggers' notes, and only records my own learning process. If anything here infringes on your rights, please contact me. Thank you!