Action Recognition with Fisher Vectors (IDT source codes)


Original URL:

http://www.bo-yang.net/2014/04/30/fisher-vector-in-action-recognition


This is a summary of doing human action recognition using Fisher vectors with (improved) dense trajectory features (DTF, http://lear.inrialpes.fr/~wang/improved_trajectories) and STIP features (http://crcv.ucf.edu/ICCV13-Action-Workshop/download.html) on the UCF101 dataset (http://crcv.ucf.edu/data/UCF101.php). In the STIP features, the low-level visual features HOG and HOF are integrated. The (improved) DTF employs more feature channels (trajectory shape, HOG, HOF and MBHx/MBHy) with higher dimensionality.

You can find my Matlab code on my GitHub: the DTF + Fisher vector code, the STIP + Fisher vector code, and the dense trajectory feature code.

For some details of DTF, please refer to my previous post.

Pipeline

The pipeline of integrating DTF/STIP features and Fisher vectors is shown in Figure 1. The first step is subsampling a fixed number of STIP/DTF features from each video clip in the training list; these subsampled features are used to do PCA and to train Gaussian Mixture Models (GMMs).
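Roughly, this subsampling step looks like the sketch below; load_video_features() is only a placeholder name for loading one clip's STIP/DTF descriptors, and the per-clip count is an illustrative value, not the one used in the experiments.

    % Sketch of the subsampling step. load_video_features() is a hypothetical
    % helper returning a D x N matrix of descriptors for one clip; the per-clip
    % sample count is a placeholder, not the original setting.
    num_per_clip = 1000;                              % placeholder value
    sampled = [];
    for i = 1:numel(train_list)
        feats = load_video_features(train_list{i});   % D x N descriptors of one training clip
        idx = randperm(size(feats, 2), min(num_per_clip, size(feats, 2)));
        sampled = [sampled, feats(:, idx)];           %#ok<AGROW> pool for PCA and GMM training
    end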

After getting the PCA coefficients and GMM parameters, the UCF101 video clips are processed action by action. For each action, first load all training videos of that action (positive videos), then randomly load the same number of video clips from the other actions (negative videos). All loaded features are multiplied by the saved PCA coefficients to reduce their dimensions and rotate the matrices. A Fisher vector is then computed for each loaded video clip. Finally, a binary SVM model is trained with both the positive and negative Fisher vectors.

When dealing with the test videos, a similar process is adopted. The only difference is that the Fisher vectors are used for SVM classification, based on the SVM model trained with the training videos.

[Figure 1: pipeline of DTF/STIP feature extraction, PCA/GMM training, Fisher vector encoding and SVM classification]
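A rough Matlab sketch of this per-action procedure is shown below; load_clip_lists() and encode_clip() are placeholder names for the loading and Fisher-vector steps, not functions from the actual code, and encode_clip() is assumed to return one row vector per clip.

    % Sketch of the per-action training loop described above.
    for a = 1:num_actions
        [pos_clips, other_clips] = load_clip_lists(a, train_list);   % this action vs. the rest
        neg_clips = other_clips(randperm(numel(other_clips), numel(pos_clips)));  % same number of negatives
        fv_pos = cell2mat(cellfun(@encode_clip, pos_clips(:), 'UniformOutput', false));  % one FV per row
        fv_neg = cell2mat(cellfun(@encode_clip, neg_clips(:), 'UniformOutput', false));
        y = [ones(numel(pos_clips), 1); -ones(numel(neg_clips), 1)];
        % train one binary SVM per action (LIBSVM options are given in the SVM section below)
        models{a} = svmtrain(y, sparse([fv_pos; fv_neg]), '-t 0 -s 0 -q -c 100 -b 1');
    end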

To make full use of the STIP or DTF features, the individual feature channels (HOG, HOF, MBH, etc.) are treated separately, and they are only combined (by simple concatenation) after computing the Fisher vectors, right before linear SVM classification.
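As a sketch, this late combination of channels looks like the following; the struct fields and the helper encode_channel() are placeholders, not names from the actual code.

    % One Fisher vector per feature channel, concatenated before the linear SVM.
    channels = {'traj', 'hog', 'hof', 'mbhx', 'mbhy'};   % DTF channels; STIP uses HOG/HOF only
    fv_all = [];
    for c = 1:numel(channels)
        ch = channels{c};
        fv_c = encode_channel(feats.(ch), pca_coeff.(ch), gmm.(ch));  % per-channel Fisher vector
        fv_all = [fv_all; fv_c(:)];                      %#ok<AGROW> simple concatenation
    end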

Pre-processing

STIP Features

The official STIP features are stored per class, which means that the STIP information of all video clips in a class is mixed together in one file. To extract the STIP features for each video, I wrote a script, mk_stip_data, to separate the STIP features for each video clip. All of the following operations are performed on each video clip separately.

DTF Features

Since the DTF features are "dense" (which means a lot of data), it took me a long time to extract the (improved) DTF features of the UCF101 clips with the default parameters on a modern Linux desktop (I ran the extraction with multiple threads in parallel). The installation of the DTF tools was also a tricky task.

To save space, all the DTF features were compressed using the script gzip_dtf_files. For UCF101, the features take about 500GB after compression, and the required space would roughly double without compression. If you don't want to save the DTF features, you can call the DTF tools from Matlab and discard the extracted features once the Fisher vectors are computed.
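For example, extraction and compression can be combined in one pass from Matlab. The sketch below assumes the improved-trajectories binary (DenseTrackStab) writes its features to stdout, which may differ from your build, so check the tool's documentation for the exact usage; the file paths are only examples.

    % Rough sketch: run the DTF extractor on one clip and gzip its output directly.
    video = 'UCF101/v_ApplyEyeMakeup_g08_c01.avi';        % example clip path
    out   = 'features/v_ApplyEyeMakeup_g08_c01.txt.gz';   % compressed feature file
    cmd   = sprintf('./DenseTrackStab %s | gzip > %s', video, out);
    status = system(cmd);                                 % extractor + gzip in one pass
    assert(status == 0, 'DTF extraction failed');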

Fisher Vector

The Fisher Vector (FV) representation of visual features is an extension of the popular bag-of-visual-words (BOV) approach [1]. Both of them are based on an intermediate representation, the visual vocabulary, built in the low-level feature space. A probability density function (in most cases a Gaussian Mixture Model) is used to model the visual vocabulary, and we can compute the gradient of the log-likelihood with respect to the parameters of the model to represent an image or video. The Fisher vector is the concatenation of these partial derivatives and describes in which direction the parameters of the model should be modified to best fit the data. This representation has the advantage of giving similar or even better classification performance than BOV obtained with supervised visual vocabularies.
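Concretely, for a set of descriptors X = {x_1, ..., x_N} and a GMM with weights w_k, means mu_k and diagonal covariances sigma_k^2, the standard formulation (following [1] and [4]; the notation here is mine, and the operations are element-wise over the D descriptor dimensions) is:

    % Soft assignment of descriptor x_n to Gaussian k:
    \gamma_n(k) = \frac{w_k \, \mathcal{N}(x_n; \mu_k, \sigma_k^2)}{\sum_{j=1}^{K} w_j \, \mathcal{N}(x_n; \mu_j, \sigma_j^2)}
    % Gradients with respect to the means and standard deviations:
    \mathcal{G}^{X}_{\mu_k} = \frac{1}{N\sqrt{w_k}} \sum_{n=1}^{N} \gamma_n(k)\, \frac{x_n - \mu_k}{\sigma_k},
    \qquad
    \mathcal{G}^{X}_{\sigma_k} = \frac{1}{N\sqrt{2 w_k}} \sum_{n=1}^{N} \gamma_n(k) \left[ \frac{(x_n - \mu_k)^2}{\sigma_k^2} - 1 \right]
    % The Fisher vector is the concatenation of all these gradients, of total dimension 2KD.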

Following is the algorithm for computing Fisher vectors from the features (I actually implemented this algorithm in Matlab; if you are interested, please refer here):

[Algorithm: computing the Fisher vector of a video clip from its DTF/STIP features]
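A stripped-down sketch of encoding one clip is shown below. It calls VLFeat's vl_fisher directly; my code goes through the VGG wrapper, so this is an equivalent route rather than a copy of that code. load_video_features() is a placeholder, and pca_coeff, means, covariances and priors come from the training step described below.

    feats = load_video_features(clip_path);                 % D x N descriptor matrix for one clip
    feats_pca = pca_coeff' * feats;                         % project to the reduced dimension
    fv = vl_fisher(feats_pca, means, covariances, priors);  % 2*K*d Fisher vector for this clip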

During the subsampling of STIP features, I randomly chose a fixed number of HOG and HOF features from each training video clip. For some videos, if the total number of features was less than that target, I used all of their features. All of the subsampled features are square-rooted after L1 normalization.
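In Matlab this normalization is one line per step; the sketch continues from the subsampling sketch above and assumes the descriptors are non-negative histograms.

    % L1-normalize each descriptor (column), then take the element-wise square root.
    sampled = bsxfun(@rdivide, sampled, sum(abs(sampled), 1) + eps);  % L1 normalization per column
    sampled = sqrt(sampled);                                          % square-root (power) step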

After that, the dimensionality of the subsampled features was reduced to half of the original dimension by PCA. At this step, the PCA coefficients were recorded, since they are needed later. The GMMs were then trained on the half-sized features, and the GMM parameters (i.e. weights, means and covariances) were stored for the following steps. In my program, the GMM code implemented by the Oxford Visual Geometry Group (VGG) is used, which eventually calls VLFeat. In my code, a fixed number of Gaussians was used.
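Below is a sketch of this training step using Matlab's pca() and VLFeat's vl_gmm() directly; my code uses the VGG wrapper instead, and the number of Gaussians here is only a placeholder, not the value used in the experiments.

    [coeff, ~] = pca(sampled');            % rows = observations, columns = variables
    d = floor(size(sampled, 1) / 2);       % keep half of the original dimensions
    pca_coeff = coeff(:, 1:d);             % D x d projection matrix, saved for later use
    reduced = pca_coeff' * sampled;        % d x N half-sized training features
    K = 256;                               % placeholder number of Gaussians
    [means, covariances, priors] = vl_gmm(reduced, K);  % GMM parameters to store
    save('pca_gmm.mat', 'pca_coeff', 'means', 'covariances', 'priors');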

When computing the Gaussians, the value Inf is sometimes returned. For the Inf entries, a very large number (in my code, 1e30) is assigned instead to make the subsequent computation smoother. Before the L2 and power normalization, the unexpected NaN entries are replaced by a large number (in my implementation, 123456).
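As a sketch, the clean-up and normalization of a Fisher vector fv then looks like the following, using the replacement values mentioned above.

    fv(isinf(fv)) = 1e30;                  % replace Inf entries with a very large number
    fv(isnan(fv)) = 123456;                % replace NaN entries before normalization
    fv = sign(fv) .* sqrt(abs(fv));        % power (signed square-root) normalization
    fv = fv / (norm(fv) + eps);            % L2 normalization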

SVM Classification

Binary SVM classification (LIBSVM) is used in my implementation. For each action, positive video clips are labeled as 1, while negative videos are labeled as -1 during training and testing. In my code, the SVM cost was set to 100. The options for SVM training are:

-t 0 -s 0 -q -c 100 -b 1
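With the LIBSVM Matlab interface [2], training and prediction then look roughly like this; X_train/X_test holding one Fisher vector per row and y_train/y_test holding the +1/-1 labels are assumed.

    % Linear C-SVC with cost 100 and probability outputs, as in the options above.
    model = svmtrain(y_train, sparse(X_train), '-t 0 -s 0 -q -c 100 -b 1');
    [pred, acc, prob] = svmpredict(y_test, sparse(X_test), model, '-b 1');  % per-clip predictions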

Results

The action recognition accuracy averaged over all 101 actions was 77.95% when using the above pipeline with STIP features. The confusion matrix is shown in Figure 3.

For the first set of actions I tested, the mean accuracy with DTF features is 90.6%, while with STIP features it is only 84.32%. The mean accuracy on the whole UCF101 dataset (train/test list 1) was around 85% using DTF features, about 8% higher than using BOV representations (an internal test). For comparison, the best result I got with an ISA neural network on UCF101 was only 58%, in November 2013.

Conclusion

It is obvious that Fisher vectors can lead to better results than bag-of-visual-words in action recognition. Compared to other low-level visual features, DTF features have more advantages for action recognition. However, in the long run I still believe in deep learning methods: when deep neural networks can be trained with millions of videos [5], they will learn more information from scratch and achieve state-of-the-art accuracy.

References

1. Gabriela Csurka and Florent Perronnin. Fisher Vectors: Beyond Bag-of-Visual-Words Image Representations. Communications in Computer and Information Science, Volume 229, pp. 28-42.
2. Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A Library for Support Vector Machines. ACM Transactions on Intelligent Systems and Technology, 2(3):27:1-27:27, May 2011.
3. Heng Wang and Cordelia Schmid. Action Recognition with Improved Trajectories. In ICCV 2013 - IEEE International Conference on Computer Vision, Sydney, Australia, December 2013. IEEE.
4. Jorge Sanchez, Florent Perronnin, Thomas Mensink, and Jakob Verbeek. Image Classification with the Fisher Vector: Theory and Practice. International Journal of Computer Vision, 105(3):222-245, December 2013.
5. Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. Large-scale Video Classification with Convolutional Neural Networks. In CVPR, 2014.
