This article mainly describes the convolution and pooling techniques used, together with linear decoders, on large images. For details, see http://deeplearning.stanford.edu/wiki/index.php/UFLDL_Tutorial.
Linear decoders:
The output layer of a sparse autoencoder satisfies the following formula:

a^(3) = f(z^(3)) = f(W^(2) a^(2) + b^(2))

From the formula we can see that the output a^(3) is the output of the activation function f. In an ordinary sparse autoencoder, f is usually the sigmoid function, whose range is (0, 1), so a^(3) also lies between 0 and 1. In addition, the output layer of the sparse model should reproduce the input, that is, a^(3) ≈ x. It follows that x must also lie between 0 and 1; in other words, the data fed into the network must first be scaled into [0, 1]. This condition is met in some domains, for example the MNIST digit recognition experiment described earlier. In other domains, however, such as data that has been PCA-whitened, the values do not necessarily fall between 0 and 1. The linear decoder was introduced for this case. A linear decoder keeps the sigmoid function in the hidden layer but uses a linear activation function in the output layer, most simply the identity function. The output layer then satisfies:

a^(3) = z^(3) = W^(2) a^(2) + b^(2)
With this change, when using the BP algorithm to compute the gradient, only the output-layer error term needs to be modified. Since the identity function has derivative f'(z^(3)) = 1, it becomes:

δ^(3) = -(y - a^(3))
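The two output-layer choices above can be compared in a minimal numpy sketch. This is illustrative only, not the tutorial's code: the layer sizes are tiny and the weights are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical toy sizes: 4-dim input, 3 hidden units.
x = rng.normal(size=4)             # input need NOT lie in (0, 1) for a linear decoder
W1, b1 = rng.normal(size=(3, 4)), np.zeros(3)
W2, b2 = rng.normal(size=(4, 3)), np.zeros(4)

a2 = sigmoid(W1 @ x + b1)          # hidden layer keeps the sigmoid, so a2 is in (0, 1)
a3 = W2 @ a2 + b2                  # linear output layer: a3 = z3, unbounded

# Output-layer error term for backprop: with f(z) = z we have f'(z) = 1,
# so delta3 = -(y - a3) * f'(z3) simplifies to -(y - a3), with target y = x.
delta3 = -(x - a3)
```

With a sigmoid output layer, a3 would instead be sigmoid(W2 @ a2 + b2) and delta3 would carry the extra factor a3 * (1 - a3).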
Convolution:
Before learning about convolution, let us first see why we move from fully connected networks to locally connected ones. In a fully connected network, if the input is a large image, say 96*96, and the hidden layer needs to learn 100 features, then every input pixel is connected to every hidden node, and we must learn 96*96*100, roughly 10^6, parameters. Training with the BP algorithm then becomes very slow.
Locally connected networks were therefore developed: each hidden node is connected only to a contiguous region of the input. One motivation is biological: neurons in the visual cortex of the human brain respond only to local regions at particular locations. In neural networks, local connectivity is implemented by convolution. Its theoretical basis is that natural images are stationary, that is, the statistics of one part of the image are similar to those of other parts, so features learned from one part also apply to the rest of the image.
The following describes how convolution is carried out. Given a large image x_large of size R*C, we first randomly sample small patches of size a*b from it and learn features on these patches (for example with a sparse autoencoder) using k hidden nodes. Each learned feature is then convolved over the full image, sliding the a*b window one pixel at a time so that successive windows overlap. The total number of convolved feature values is:

k * (R - a + 1) * (C - b + 1)
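The procedure above can be sketched as a direct (unoptimized) valid convolution in numpy. This is a minimal illustration with random weights standing in for learned features; the function name and sizes are my own, not from the tutorial.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def convolve_features(image, W, b):
    # image: R x C; W: k x a x b learned patch weights; b: k biases.
    # Returns k feature maps of size (R-a+1) x (C-b+1): a "valid" convolution
    # with stride 1, so successive windows overlap.
    R, C = image.shape
    k, a, bb = W.shape
    out = np.empty((k, R - a + 1, C - bb + 1))
    for i in range(R - a + 1):
        for j in range(C - bb + 1):
            patch = image[i:i + a, j:j + bb]
            # Dot each of the k filters with the current patch, then apply
            # the same sigmoid used when the features were learned.
            out[:, i, j] = sigmoid(np.tensordot(W, patch, axes=([1, 2], [0, 1])) + b)
    return out

# The article's sizes: 96x96 image, 100 features, 8x8 patches (random stand-ins here).
maps = convolve_features(np.random.rand(96, 96), np.random.randn(100, 8, 8), np.zeros(100))
print(maps.shape)  # (100, 89, 89)
```

Note that maps contains 100 * 89 * 89 = 792,100 values, the output dimension discussed in the pooling section below.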
Pooling:
Although convolution greatly reduces the number of parameters to train (for the 96*96 image with 100 features, using 8*8 patches cuts the parameter count from about 10^6 down to 8*8*100 = 6400, on the order of 10^3), it introduces another problem: the output dimension becomes very large. The original fully connected network produced only a 100-dimensional output, while the convolved output has 89*89*100 = 792,100 dimensions. This large increase makes the classifier hard to design, and the pooling method emerged to deal with it.
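The counts quoted above can be checked with a few lines of arithmetic (using the article's example sizes: a 96*96 image, 8*8 patches, 100 features):

```python
# Parameter and output-dimension counts for the article's example.
fully_connected_params = 96 * 96 * 100      # every pixel connected to every hidden unit
conv_params = 8 * 8 * 100                   # one 8x8 filter per feature
conv_output_dim = (96 - 8 + 1) ** 2 * 100   # 89 x 89 positions per feature map

print(fully_connected_params)  # 921600, on the order of 10^6
print(conv_params)             # 6400
print(conv_output_dim)         # 792100
```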
Why does pooling work? First, convolution relies on the stationarity of images: the statistics of different parts of an image are the same. So when convolution is computed over some part of the image, the resulting vector is a feature of that part, and because of stationarity, computing a summary statistic of this feature vector should give similar results across different image blocks. This statistical summarization of the convolution output is called pooling, and this is why pooling is effective. Common pooling methods are max pooling and average pooling. The pooled features also acquire a degree of translation invariance (the reason for this is not explored here).
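Both pooling methods amount to summarizing disjoint p*p regions of a convolved feature map. A minimal sketch (my own helper, not the tutorial's code):

```python
import numpy as np

def pool(feature_map, p, mode="mean"):
    # Summarize disjoint p x p regions of one convolved feature map,
    # using either the mean (average pooling) or the max (max pooling).
    n = feature_map.shape[0] // p
    blocks = feature_map[:n * p, :n * p].reshape(n, p, n, p)
    return blocks.max(axis=(1, 3)) if mode == "max" else blocks.mean(axis=(1, 3))

fmap = np.arange(16, dtype=float).reshape(4, 4)
print(pool(fmap, 2, "max"))   # [[ 5.  7.] [13. 15.]]
print(pool(fmap, 2, "mean"))  # [[ 2.5  4.5] [10.5 12.5]]
```

Applied to the 89*89 feature maps from the convolution step, even p = 4 shrinks each map by a factor of 16, which is what makes the subsequent classifier tractable.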
From the above, we can summarize roughly as follows: convolution is designed to tame the computational complexity of unsupervised feature extraction, while pooling serves the supervised classifier stage by reducing the output dimension and hence the number of parameters the system must train (speaking loosely, we use unsupervised methods to extract the target's features and supervised methods to train the classifier).
References:
http://deeplearning.stanford.edu/wiki/index.php/UFLDL_Tutorial