Reading papers_16 (learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis)


  

  Preface

This article concerns feature learning, also known as deep learning, which has been a hot topic recently. Its appeal is that it can learn features from images and videos without supervision (and likewise in other fields, such as speech and language processing), so the features need not be designed by hand. Hand-designed features such as SIFT, SURF, and HOG took a long time to develop and apply only to 2D images; to use them on video, they must be extended to 3D, as in HOG3D and 3D SURF, and this extension again demands considerable skill and time.

Hand-designed features have further drawbacks. First, a feature engineered by hand tends to perform well only on certain databases, with no guarantee on others. Second, when the data changes to another modality, such as Kinect depth images, video, or multi-structured data, previously designed features degrade; for non-RGB data they no longer fit and must be designed all over again, which is not sustainable. Third, manual feature design is slow and cannot exploit big data for feature extraction.

Deep learning, by contrast, generally uses a layered learning structure, and it has a theoretical basis in the human brain: the visual cortex also processes information in layers, with the lower visual areas more sensitive to low-level features. In sum, feature learning is driven by many application needs and supported by biological evidence, so it is destined to play a role in AI. Some experiments show that learned features can outperform all hand-designed ones; the ISA model in this paper is one such example.

 

  ISA model in Natural Image  

Let's first look at the ISA model on two-dimensional natural images, as shown below:

  

The first layer in the figure is the input image patch, i.e., a two-dimensional patch flattened into a one-dimensional vector. The second layer is the first (learned) layer of ISA: the weights W between layers 1 and 2 are what we need to learn, while the weights between layer 2 and layer 3 (the output layer) are fixed and need no learning. In the figure, each of the three red output circles pools two adjacent green circles from layer 2, so each subspace here has size 2. W is learned by solving the optimization problem written below the figure, under the constraint that W is an orthogonal matrix.
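The forward pass described above can be sketched as follows. This is a minimal illustration, not the paper's code: the patch size, the subspace size of 2, and the random orthonormal stand-in for the trained W are all assumptions.

```python
import numpy as np

def isa_response(x, W, group_size=2):
    """ISA forward pass: square the linear responses (layer 2),
    then pool each subspace with a fixed square-root sum (layer 3)."""
    u = W @ x                                    # learned linear responses
    u2 = u ** 2                                  # fixed squaring nonlinearity
    pooled = u2.reshape(-1, group_size).sum(axis=1)
    return np.sqrt(pooled)                       # one value per subspace

# toy usage: 16 units on flattened 4x4 patches, random orthonormal W
rng = np.random.default_rng(0)
W, _ = np.linalg.qr(rng.normal(size=(16, 16)))
x = rng.normal(size=16)
s = isa_response(x, W, group_size=2)
print(s.shape)  # one pooled response per subspace
```

With subspaces of size 2, 16 first-layer units yield 8 pooled outputs.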

If each row of the weight matrix W is reshaped back into a two-dimensional patch, these small patch images are the filters learned at the first ISA output layer, as shown in:

The accompanying explanation notes that the learned patches have two properties. First, they are sensitive to edges; that is, they act as edge-detection operators. Second, two adjacent patches (those in the same subspace) have similar features, for example both detecting horizontal edges or both detecting vertical edges.

In addition, these learned features are insensitive (invariant) to translation of the input image, but selective to its frequency and rotation; the model can therefore distinguish inputs with different frequencies and rotation angles. The curves for these three properties are shown below:

 
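Curves like these can be measured by sliding a stimulus across a unit's receptive field and recording the pooled response at each shift. The sketch below uses a random orthonormal W purely so that the code runs; with trained filters the translation curve would be approximately flat, while the frequency and rotation curves would be peaked.

```python
import numpy as np

def pooled_response(patch, W, unit=0, group_size=2):
    """Response of one ISA subspace unit to a flattened 2D patch."""
    u = W @ patch.ravel()
    g = u.reshape(-1, group_size)[unit]          # the unit's subspace
    return float(np.sqrt((g ** 2).sum()))

rng = np.random.default_rng(3)
W, _ = np.linalg.qr(rng.normal(size=(64, 64)))   # stand-in 8x8 filters
canvas = rng.normal(size=(8, 16))                # stimulus wider than the patch
# translation curve: response at each horizontal shift of the 8x8 window
curve = [pooled_response(canvas[:, s:s + 8], W) for s in range(9)]
print(len(curve))
```

The frequency and rotation curves are obtained the same way, varying the stimulus frequency or angle instead of the shift.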

  ISA model in natural video

The above ISA model is for two-dimensional images, and extending it directly to three-dimensional video is difficult: a video is a cube of very high dimensionality. If we flattened a whole video into a one-dimensional vector and fed it to the first ISA layer, the input dimension — and hence the number of columns of W — would be very large. Training W then requires computing quantities such as its eigenvectors, with time complexity proportional to the cube of the input dimension, so learning would be far too slow.

To solve this, the author proceeds as follows. Instead of inputting a whole video at once, a small rectangular patch is cropped from the video cube, flattened into a vector, and fed to the ISA model; the dimension of the ISA input is thereby greatly reduced, and PCA is additionally used to reduce it further. After an ISA model has been trained on such video patches, it is copied to the adjacent locations, giving multiple identical ISA models; their outputs are concatenated and used directly as the input of a second, larger ISA model (again after PCA whitening), which is trained in the same way until convergence. The output of the second ISA model is the final feature vector, and the rows of the second layer's W again correspond to video patches, as discussed later. This two-layer ISA model is the stacked convolutional ISA described in the paper. Its structure is as follows:

Of course, once the parameters W of both ISA models are trained, we can extract feature vectors from new test videos as follows:

We can see that the author concatenates the outputs of the two ISA layers as the final feature vector, since this improves recognition accuracy.
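The whole extraction pipeline — layer-1 ISA applied convolutionally to sub-blocks, PCA whitening, layer-2 ISA, and concatenation of both layers' outputs — can be sketched as below. All sizes, and both "trained" weight matrices, are illustrative assumptions; the paper's actual patch sizes and PCA dimensions differ.

```python
import numpy as np

def isa_response(x, W, group_size=2):
    u2 = (W @ x) ** 2
    return np.sqrt(u2.reshape(-1, group_size).sum(axis=1))

def pca_whiten(X, k):
    """Project rows of X onto the top-k principal components and whiten."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return (Xc @ Vt[:k].T) / (S[:k] / np.sqrt(len(X)) + 1e-8)

rng = np.random.default_rng(1)
n_blocks = 100
# pretend each larger video block yields 4 overlapping sub-blocks,
# each flattened to a 64-dim (already whitened) vector
sub = rng.normal(size=(n_blocks, 4, 64))

W1, _ = np.linalg.qr(rng.normal(size=(64, 64)))   # stand-in layer-1 weights
layer1 = np.array([[isa_response(v, W1) for v in block] for block in sub])
layer1 = layer1.reshape(n_blocks, -1)             # concat the 4 sub-block outputs

Z = pca_whiten(layer1, k=32)                      # PCA whitening before layer 2
W2, _ = np.linalg.qr(rng.normal(size=(32, 32)))   # stand-in layer-2 weights
layer2 = np.array([isa_response(z, W2) for z in Z])

# final descriptor: both layers' outputs concatenated
final = np.concatenate([layer1, layer2], axis=1)
print(final.shape)
```

At test time the same W1, W2, and PCA projection learned during training would be reused rather than recomputed.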

Two further points are worth mentioning. 1. The ISA parameters are trained with batch projected gradient descent; the optimization procedure itself is not detailed in the paper, and I have not studied it in depth, but it should be a classic algorithm. 2. Not all of the red-circle outputs of the first ISA layer are used: a small output value indicates a weak response and should be removed. The author therefore sets a threshold on these values, chosen by cross-validation in the experiments.
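A single projected-gradient step, under the usual ISA formulation (minimize the summed pooled activations of whitened data subject to W·Wᵀ = I), can be sketched as follows. The learning rate, iteration count, and toy data are assumptions, not the paper's settings.

```python
import numpy as np

def symmetric_orthonormalize(W):
    """Project back onto the constraint set: W <- (W W^T)^(-1/2) W."""
    U, S, Vt = np.linalg.svd(W @ W.T)
    inv_sqrt = U @ np.diag(1.0 / np.sqrt(S)) @ Vt
    return inv_sqrt @ W

def isa_loss_grad(W, X, group_size=2, eps=1e-8):
    """Loss = sum over patches and subspaces of sqrt(sum of squared responses)."""
    U = W @ X                                              # (n_units, n_patches)
    pooled = np.sqrt((U ** 2).reshape(-1, group_size, X.shape[1]).sum(axis=1) + eps)
    # d loss / d u_j = u_j / pooled (broadcast back over each subspace)
    scale = np.repeat(1.0 / pooled, group_size, axis=0)
    return pooled.sum(), (U * scale) @ X.T

rng = np.random.default_rng(2)
X = rng.normal(size=(16, 500))                             # whitened training patches
W = symmetric_orthonormalize(rng.normal(size=(16, 16)))
for _ in range(50):
    loss, grad = isa_loss_grad(W, X)
    W = symmetric_orthonormalize(W - 0.01 * grad)          # step, then re-project
```

Each iteration takes a plain gradient step and then re-orthonormalizes W, which is what makes the method "projected" gradient descent.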

 

  Analysis of learned features

The later part of the paper analyzes the patches learned by the two ISA layers. Those learned by the first layer are as follows:

These features resemble the filters learned from two-dimensional images. Each group contains 10 patches, indicating that the temporal dimension of the input video patch is 10 frames. The features learned from video are likewise insensitive to translation, but selective to changes of scale, velocity, and rotation.

The curve tests for these features are shown below:

The author also plots the distribution of rotation angles and velocities of the video data that most strongly activate the learned units, as shown in:

It can be seen that the learned low-level features are concentrated in the horizontal and vertical directions.

The features learned by the higher layer are more complex but harder to interpret, as shown in:

 

  Lab results

The author tests on four well-known and challenging databases: UCF, Hollywood2, YouTube, and KTH. The results show that the proposed algorithm outperforms the previous best results of other algorithms, which is remarkable. The accuracies are as follows:

 

  Summary

Feature learning is very powerful and should remain a hot research topic; it will play an increasingly important role across the various AI fields in the future.

 

  Appendix

Download the lab report PPT.

 

  References

http://ai.stanford.edu/~quocle/

 

 

 

 
