Deep Learning Paper Notes (6): Multi-Stage Architecture Analysis


zouxy09@qq.com

http://blog.csdn.net/zouxy09

 

I read papers from time to time, but I find that I slowly forget them afterwards; when I come back to one later, it feels as if I had never read it. So I want to summarize the useful knowledge points from the papers I read. On one hand this deepens my own understanding, and on the other hand it makes future review easier. I also share the notes on my blog so everyone can discuss them. Because my background is limited, some of my understanding of the paper may be wrong; I hope you won't hesitate to point out mistakes. Thank you.

 

The paper discussed in this article is:

Kevin Jarrett, Koray Kavukcuoglu, Marc'Aurelio Ranzato, and Yann LeCun, "What is the Best Multi-Stage Architecture for Object Recognition?", in Proc. International Conference on Computer Vision (ICCV '09), 2009.

The authors provide a simplified version of the MATLAB code, which implements a two-stage object recognition system with random convolutional filters and a linear logistic regression classifier.

The following is my understanding of some of the knowledge points in the paper:

 

What is the best multi-stage architecture for Object Recognition?

What kind of multi-stage architecture is best for object recognition? In most current object recognition systems, the feature extraction stage generally consists of a filter bank, a non-linear transformation, and some type of feature pooling layer. Most systems use a single stage of feature extraction in which the filters are hard-wired (manually designed, with parameters that cannot be adjusted), or two stages in which one or both filter banks are learned through supervised or unsupervised learning.

The paper focuses on three questions:

1) How do the non-linearities that follow the filter bank affect recognition accuracy?

2) Are filter banks learned through supervised or unsupervised learning better than random filter banks or manually designed filters?

3) Compared with a single stage of feature extraction, what advantages does a second stage of feature extraction bring?

The paper shows that:

1) Using non-linear operators that include rectification and local contrast normalization greatly improves object recognition accuracy.

2) Two stages of feature extraction yield higher accuracy than a single stage.

3) Surprisingly, a two-stage system with randomly initialized filter banks can reach about 63% recognition rate on Caltech-101, provided a suitable non-linearity and pooling layer are included.

4) After supervised fine-tuning, the system reaches state-of-the-art performance on the NORB dataset. In addition, unsupervised pre-training coupled with supervised fine-tuning achieves even better accuracy (above 63%) on Caltech-101. On the unprocessed MNIST dataset, it achieves an error rate of 0.53%, the lowest known at the time.

 

I. Overview

In the past few years, many good feature descriptors for object recognition have emerged. In many approaches, the input image is divided into dense patches arranged in a regular grid, features are extracted from each patch, and the patch features are then combined in some way into the feature of the whole image. In summary, a large fraction of these systems share the same feature extraction pipeline: the input passes through a filter bank (usually a bank of oriented edge detectors), then through a non-linear operation (quantization, winner-take-all, sparsification, normalization, and/or point-wise saturation), and finally through a pooling operation (combining the values of neighbors in real space or feature space with a max, average, or histogramming operator) that provides robustness and a degree of invariance. For example, the familiar SIFT descriptor applies oriented edge detectors to each small patch, uses a winner-take-all operator to keep only the dominant orientation, and finally computes histograms of local orientations over larger patches, pooling them into a sparse vector.

A single-stage system extracts features as above, such as SIFT or HOG, and feeds them directly into a supervised classifier, forming an object recognition system. Other models use two or more stages of feature extraction followed by a supervised classifier, forming a more complex recognition system. The essential differences between these systems lie in: the number of feature extraction stages (one or more), the non-linear operators used after the filter banks, how the filter banks are obtained (manual design, unsupervised learning, or supervised learning), and the choice of top-level classifier (a linear classifier or something more complex).

A common choice of filter bank is Gabor wavelets; others choose simple orientation-detection filter banks, i.e., gradient operators, as in SIFT and HOG. There are also methods that learn the filter banks directly from training data through unsupervised learning; when trained on natural images, the learned filters resemble Gabor-like edge detectors. One advantage of feature learning is that it can be applied at every stage. From prior knowledge we feel that first-stage features should be edge detectors, but what should second-stage features be? We have no such prior, so it is difficult to hand-design a good second-stage feature extractor; second and higher stages of features must therefore be learned by the system itself. Many methods are available for this, including supervised, unsupervised, and combinations of the two.

At first glance, on a small training database like Caltech-101 (which requires recognizing 101 object categories but provides very little labeled training data per category), training a complete system with purely supervised learning seems naive and hopeless, because the number of model parameters is much larger than the number of training samples. Many people therefore believe that only carefully learned or manually selected filter banks can give good recognition performance, and only then worry about the choice of non-linear operators. In fact, the paper argues that these beliefs are all wrong.

 

II. Model Architecture

This section describes how to build a hierarchical feature extraction and classification system by stacking one or more feature extraction stages. Each stage includes a filter bank layer, a non-linear transformation layer, and a pooling layer; the pooling layer combines (averages or takes the maximum of) the filter responses in a local neighborhood, achieving invariance to small deformations.

1. Filter bank layer - F_CSG:

F_CSG consists of three parts: a bank of convolution filters (C), a sigmoid/tanh non-linear transformation (S), and a trainable gain coefficient (G). Together they correspond to the operation y_j = g_j · tanh( Σ_i k_ij ∗ x_i ), where x_i is the i-th input feature map, k_ij is a convolution kernel, and g_j is the gain of output map y_j.
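As a concrete illustration, here is a minimal NumPy/SciPy sketch of a single-input-map F_CSG layer. This is a hedged sketch, not the authors' MATLAB code; the random 9x9 filters, the fixed gains, and the function name fcsg_layer are my own illustrative assumptions.

```python
import numpy as np
from scipy.signal import convolve2d

def fcsg_layer(x, filters, gains):
    """One F_CSG layer: convolve (C), squash with tanh (S), scale by a gain (G)."""
    maps = []
    for k, g in zip(filters, gains):
        response = convolve2d(x, k, mode="valid")  # C: convolution with kernel k_ij
        maps.append(g * np.tanh(response))         # S: tanh, then G: gain g_j
    return np.stack(maps)                          # (n_filters, H', W')

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 32))         # toy single-channel input image
filters = rng.standard_normal((4, 9, 9))  # 4 random 9x9 kernels
gains = np.ones(4)                        # trainable gains g_j, held fixed here
print(fcsg_layer(x, filters, gains).shape)  # (4, 24, 24)
```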

2. Rectification layer - R_abs:

This layer simply takes the point-wise absolute value of each feature map. (Tanh outputs can be negative, but a negative value has no direct interpretation in an image; for convolution, outputs of the same absolute value carry the same practical meaning: the larger the absolute value, the more the patch resembles the filter.) Besides the absolute-value operator, the authors also tried other non-linear operators, with similar results.
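A one-line sketch of R_abs, showing how two responses of opposite polarity collapse to the same magnitude:

```python
import numpy as np

def rabs_layer(feature_maps):
    return np.abs(feature_maps)  # point-wise absolute value

# An edge and its contrast-reversed version respond equally strongly:
print(rabs_layer(np.array([-0.8, 0.8])))  # [0.8 0.8]
```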

3. Local contrast normalization layer - N:

This module performs local subtractive and divisive normalization. It enforces local competition between adjacent features within a feature map, and also between features at the same spatial location across different feature maps. The subtractive normalization at a given position subtracts from that position's value the weighted sum of the pixels in its neighborhood, with weights that decrease with distance from the position, e.g., given by a Gaussian weighting window. The divisive normalization then computes, at each spatial location, the weighted standard deviation σ over the same neighborhood across all feature maps, and divides the value at that position in each feature map by max(c, σ), where c is the mean of σ over the whole map. In other words, the denominator is the weighted standard deviation of all features in the same spatial neighborhood, so the whole layer is just a local mean-and-variance normalization of the features. It is inspired by models in computational neuroscience. (For more information, see Section IV of this article.)
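Below is a rough NumPy/SciPy sketch of this layer that follows the description above; the Gaussian width sigma and the use of scipy.ndimage.gaussian_filter are my own assumptions, not the paper's exact implementation.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def local_contrast_norm(maps, sigma=2.0):
    """maps: (n_maps, H, W). Subtractive then divisive normalization over a
    Gaussian spatial neighborhood, pooled across all feature maps."""
    # Subtractive step: remove the Gaussian-weighted local mean taken over
    # the spatial neighborhood of ALL feature maps.
    local_mean = gaussian_filter(maps, sigma=(0, sigma, sigma)).mean(axis=0)
    v = maps - local_mean
    # Divisive step: the denominator is the Gaussian-weighted standard
    # deviation over the same neighborhood of all maps; max(c, sd) keeps
    # low-variance (flat) regions from being amplified.
    local_sd = np.sqrt(gaussian_filter(v ** 2, sigma=(0, sigma, sigma)).mean(axis=0))
    c = local_sd.mean()  # mean standard deviation over the whole map
    return v / np.maximum(c, local_sd)
```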

4. Average pooling and subsampling layer - P_A:

The role of this layer is to make the extracted features robust to small deformations, similar to the role of complex cells in models of visual perception. Each value of the subsampled output is the average of all values inside the corresponding pooling window.
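A minimal sketch of non-overlapping average pooling with subsampling (the window size k is an assumption):

```python
import numpy as np

def avg_pool(fmap, k=2):
    """Average each non-overlapping k x k window, subsampling by a factor of k."""
    H, W = fmap.shape
    H, W = H - H % k, W - W % k          # crop so the windows tile exactly
    blocks = fmap[:H, :W].reshape(H // k, k, W // k, k)
    return blocks.mean(axis=(1, 3))      # one output value per window
```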

5. Max pooling and subsampling layer - P_M:

Any symmetric pooling operation can be used to give the extracted features translation invariance. Max pooling is like average pooling, except that the maximum replaces the average. In general, the pooling windows do not overlap.
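Max pooling is the same sketch with max in place of mean:

```python
import numpy as np

def max_pool(fmap, k=2):
    """Take the maximum of each non-overlapping k x k window."""
    H, W = fmap.shape
    H, W = H - H % k, W - W % k
    blocks = fmap[:H, :W].reshape(H // k, k, W // k, k)
    return blocks.max(axis=(1, 3))
```

A full stage is then the composition F_CSG → R_abs → N → P_A (or P_M), and the paper's architectures stack one or two such stages before the final classifier.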

 

III. Experiments and Conclusions

The paper runs many experiments to compare the performance of different model architectures (different combinations of the layers above). The experimental results are not listed here; see the original paper. Here we just answer the three questions raised earlier:

1) How do the non-linearities that follow the filter bank affect recognition accuracy?

The experimental conclusion is that a simple rectifying non-linearity improves recognition performance, probably for two reasons: a) the polarity of a feature (the sign of its value) is irrelevant to object recognition; b) with average pooling, the rectification layer prevents cancellation between neighboring filter outputs of opposite sign; without rectification, the averaging of oscillating responses would transmit little besides noise (a toy illustration follows below). In addition, the local normalization layer also improves performance, because it makes supervised learning converge faster, probably since all variables end up with similar variance (the same advantage offered by whitening and other decorrelation methods).
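A toy numeric illustration of reason b), with made-up response values: inside one pooling window, responses of opposite polarity cancel under average pooling unless they are rectified first.

```python
import numpy as np

window = np.array([0.9, -0.9, 0.8, -0.8])  # hypothetical filter responses in one window
print(window.mean())          # 0.0  -> the signal averages away without rectification
print(np.abs(window).mean())  # 0.85 -> rectification preserves the response energy
```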

2) Are filter banks learned through supervised or unsupervised learning better than random filter banks or manually designed filters?

The experimental results are very surprising: a two-stage system with random filter banks reaches a recognition rate of 62.9% on Caltech-101, although it is less impressive on NORB; this random-filter success may only occur when the number of training samples is small. In addition, unsupervised pre-training followed by supervised fine-tuning gives the best performance, slightly better than training the whole system with supervised learning alone.

3) Compared with a single stage of feature extraction, what advantages does a second stage of feature extraction bring?

Experiments show that two stages are better than one. The two-stage system's performance is comparable to the best single-stage system, SIFT features with a PMK-SVM classifier; perhaps the pyramid match kernel itself hides a second stage of feature extraction.

 

IV. About Local Contrast Normalization

Here I say a little more about local contrast normalization (local subtractive and divisive normalization). My understanding: natural images have both low-order and high-order statistical structure. The low-order (e.g., second-order) statistics are roughly Gaussian, but the high-order statistics are non-Gaussian. In an image, spatially adjacent pixels are strongly correlated. PCA, because it operates on the covariance matrix, can remove the second-order correlations of the input image, but not the higher-order correlations. It has been shown that dividing by an implicit variable can remove higher-order correlations: view the pixel value of an image x as a random variable obtained by multiplying two independent random variables, a second-order component and a higher-order component. The correlations in the second-order component can be removed by PCA, and the higher-order component (which is implicit and must be estimated, e.g., with MAP, maximum a posteriori estimation) can then be divided out of x directly.

The operation used in some papers is as follows:

For each pixel of the input image, compute the mean of its neighborhood (e.g., a 3x3 window) and subtract that mean from the pixel; then divide by the Euclidean norm of the 9-dimensional vector obtained by flattening the neighborhood window, but only when that norm is greater than 1. This constraint ensures that the normalization can only attenuate responses (division by a value greater than 1) and never amplify them (division by a value smaller than 1 would increase the response). In addition, when computing the mean and the norm, some papers weight by distance, so that pixels farther from the window center have less influence, for example with a Gaussian weighting window (since the correlation between spatially adjacent pixels decreases as their distance increases). A sketch of this operation follows below.
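Here is a sketch of this per-pixel operation under one reading of the description (a 3x3 box window, with the norm taken over the mean-subtracted neighborhood; both choices are assumptions on my part):

```python
import numpy as np
from scipy.ndimage import uniform_filter

def patch_normalize(img, k=3):
    """Subtract the k x k neighborhood mean, then divide by the Euclidean norm
    of the mean-subtracted neighborhood vector, but only where that norm
    exceeds 1 (responses are attenuated, never amplified)."""
    mean = uniform_filter(img, size=k)       # local neighborhood mean
    centered = img - mean
    # uniform_filter gives the window mean of squares; * k*k recovers the sum,
    # whose square root is the norm of the k*k-dimensional neighborhood vector.
    norm = np.sqrt(uniform_filter(centered ** 2, size=k) * k * k)
    return centered / np.maximum(norm, 1.0)  # divide only where norm > 1
```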

In fact, I am still unclear about many of the details here, so the above is not necessarily correct and is for reference only. I hope others can offer corrections. Thank you.

For more information about local contrast normalization, see the following two papers:

S. Lyu et al., "Nonlinear Image Representation Using Divisive Normalization."

N. Pinto et al., "Why is Real-World Visual Object Recognition Hard?"

 
