Spatio-temporal feature extraction and representation for RGB-D human action recognition

Source: Internet
Author: User
Tags: svm

The paper proposes a novel and effective framework that largely improves the performance of human action recognition using both RGB videos and depth maps. The key contribution is the sparse coding-based temporal pyramid matching approach (ScTPM) for feature representation. Due to the pyramid structure and the sparse representation of the extracted features, temporal information is well kept and the approximation error is reduced. In addition, a novel center-symmetric motion local ternary pattern (CS-Mltp) descriptor is proposed to capture spatio-temporal features from RGB videos at low computational cost. Using the ScTPM-represented 3D joint features and the CS-Mltp features, both feature-level fusion and classifier-level fusion are explored, which further improves the recognition accuracy.

From the feature extraction perspective, a new local spatio-temporal feature descriptor, the center-symmetric motion local ternary pattern (CS-Mltp), is proposed to describe gradient-like characteristics of RGB sequences in both the spatial and temporal directions.

From the feature representation perspective, our contribution lies in the design of a temporal pyramid matching approach based on sparse coding of the extracted features to represent the temporal patterns, referred to as sparse coding temporal pyramid matching (ScTPM).
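As a rough illustration of the temporal pyramid idea, the sketch below pools per-frame sparse codes over progressively finer temporal segments and concatenates the results. The level layout (1, 2 and 4 segments), the use of max pooling on absolute values, and the function name `temporal_pyramid_pool` are assumptions made for this example; the notes do not state the paper's exact settings.

```python
# Illustrative sketch: max pooling of per-frame sparse codes over a temporal
# pyramid. The level layout (1, 2, 4 segments) and max pooling are assumptions.

def temporal_pyramid_pool(codes, levels=(1, 2, 4)):
    """codes: list of T per-frame code vectors, each of length K.
    Returns one vector of length K * sum(levels)."""
    num_frames, k = len(codes), len(codes[0])
    pooled = []
    for segments in levels:
        for s in range(segments):
            start = s * num_frames // segments
            end = (s + 1) * num_frames // segments
            chunk = codes[start:end]
            if chunk:
                # Max of absolute values per dictionary atom within the segment.
                pooled.extend(max(abs(c[j]) for c in chunk) for j in range(k))
            else:
                # A segment can be empty when the sequence is very short.
                pooled.extend([0.0] * k)
    return pooled
```

Because the output length depends only on the dictionary size and the pyramid levels, sequences with different numbers of frames map to vectors of the same length, which is one way to sidestep explicit temporal alignment.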

From the classification perspective, we evaluate both feature-level and classifier-level fusion of the two sources, based on a fast and simple linear SVM classifier.

 

In the field of video-based human action recognition, the common approach is to design local spatio-temporal feature extraction and representation algorithms, which mainly involve three steps: (1) local interest point detection, such as the spatio-temporal interest points (STIP) [15] and the cuboid detector [4]; (2) local feature description, such as the histogram of oriented gradients (HOG) [3] and the histogram of optical flow (HOF) [8]; (3) feature quantization and representation, such as K-means and bag-of-words (BoW).

 

The proposed CS-Mltp descriptor has advantages in several aspects: first, it can be easily combined with any detector for action recognition, since we adopt a 16-bin coding scheme; second, it encodes both shape and motion characteristics to ensure high performance with stability.
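To make the descriptor idea more concrete, here is a minimal sketch of a center-symmetric ternary pattern computed on a frame difference. It is not the paper's exact definition. The assumptions are: motion is approximated by a plain absolute frame difference, four center-symmetric pixel pairs in a 3x3 window are compared against a made-up threshold `t`, and the ternary result is split into an "upper" and a "lower" 4-bit pattern, which is one possible way to end up with 16-bin histograms.

```python
# Illustrative sketch only: the exact CS-Mltp definition is in the paper.
# Assumptions: frames are 2D lists of grayscale values, motion is a plain
# frame difference, and the threshold t is a hypothetical parameter.

# Four center-symmetric neighbor pairs in a 3x3 window, as (dy, dx) offsets.
PAIRS = [((-1, -1), (1, 1)), ((-1, 0), (1, 0)), ((-1, 1), (1, -1)), ((0, 1), (0, -1))]


def frame_difference(prev, curr):
    """Absolute per-pixel difference between two equally sized frames."""
    return [[abs(c - p) for p, c in zip(prow, crow)] for prow, crow in zip(prev, curr)]


def cs_ternary_codes(img, y, x, t):
    """Return (upper, lower) 4-bit codes for the pixel at (y, x)."""
    upper = lower = 0
    for bit, ((dy1, dx1), (dy2, dx2)) in enumerate(PAIRS):
        d = img[y + dy1][x + dx1] - img[y + dy2][x + dx2]
        if d > t:        # ternary value +1 goes to the upper pattern
            upper |= 1 << bit
        elif d < -t:     # ternary value -1 goes to the lower pattern
            lower |= 1 << bit
    return upper, lower


def cs_ternary_histograms(prev, curr, t=5):
    """Two 16-bin histograms (upper, lower) over all interior pixels."""
    motion = frame_difference(prev, curr)
    h, w = len(motion), len(motion[0])
    hist_upper, hist_lower = [0] * 16, [0] * 16
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            u, l = cs_ternary_codes(motion, y, x, t)
            hist_upper[u] += 1
            hist_lower[l] += 1
    return hist_upper, hist_lower
```

Comparing only symmetric pairs instead of all eight neighbors against the center keeps the code short (4 bits per pattern), which is consistent with the low computational cost the notes mention.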

 

Depth features

Since 3D joint features are frame-based, the different number of frames in each sequence requires algorithms to provide solutions for "temporal alignment". Most existing algorithms solve this problem through temporal modeling, which models the temporal evolution of different actions. For example, the HMM is widely used to model temporal evolution [19, 32, 5]. The conditional random field (CRF) [6] predicts the motion patterns in the manifold subspace. Dynamic time warping (DTW) [22] tries to compute the optimal alignment of motion templates composed of 3D joints. However, the noisy joint positions extracted by the skeleton tracker [26] may undermine the performance of these models, and the limited number of training samples makes these algorithms prone to overfitting.

Features and Classifier Fusion

Unlike previous work based on HMMs, our algorithm is based on the proposed ScTPM, and we explore two kinds of fusion schemes. Feature-level fusion concatenates the histograms generated from the two sources to form a longer histogram representation as the input to the classifier, and classifier-level fusion combines the classification results on both sources to generate the final result.

Spatio-Temporal Feature Extraction

Spatio-Temporal Feature Representation

In sparse coding, the dictionary V is learned in the training phase from a large number of features collected from the training samples, by iteratively optimizing Eqs. (5) and (6). In the coding phase, the sparse codes are obtained by optimizing Eq. (5) given the learned V.
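The coding phase can be sketched as follows. Since Eqs. (5) and (6) are not reproduced in these notes, the example assumes the common L1-regularized least-squares objective and solves it with ISTA (iterative soft-thresholding), which is a standard solver and not necessarily the one used in the paper. The function names and the values of `lam`, `step` and `n_iter` are hypothetical.

```python
# Illustrative sketch of the coding phase only: ISTA for
#   minimize over u:  0.5 * ||x - u V||^2 + lam * ||u||_1
# with a fixed dictionary V (K atoms as rows, each of dimension D).
# The objective form, lam, step and n_iter are assumptions, not the paper's.

def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (proximal operator of the L1 norm)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def sparse_code(x, V, lam=0.1, step=0.1, n_iter=200):
    """Return the sparse code u of feature x under dictionary V.
    step should not exceed 1 / (largest eigenvalue of V V^T)."""
    k, d = len(V), len(x)
    u = [0.0] * k
    for _ in range(n_iter):
        # Reconstruction residual r = u V - x.
        r = [sum(u[i] * V[i][j] for i in range(k)) - x[j] for j in range(d)]
        # Gradient of the quadratic term with respect to u is r V^T.
        grad = [sum(r[j] * V[i][j] for j in range(d)) for i in range(k)]
        u = [soft_threshold(u[i] - step * grad[i], step * lam) for i in range(k)]
    return u
```

With an orthonormal dictionary the result reduces to soft-thresholding the input, so small coefficients are driven exactly to zero; this is what makes the codes sparse.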

Feature Representation

Classification

For the multi-class case, the linear SVM is equivalent to learning L linear functions, one per action class.
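A minimal sketch of how a set of per-class linear functions could be used at test time: each class has a weight vector and a bias, and the predicted label is the class with the highest score. The argmax decision rule and the names `linear_scores` and `predict` are assumptions for illustration; training the weights is not shown.

```python
# Illustrative sketch: prediction with one linear function per class.
# The weights and biases are assumed to come from an already trained SVM.

def linear_scores(x, weights, biases):
    """Score of feature vector x under each class's linear function."""
    return [sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
            for w, b in zip(weights, biases)]


def predict(x, weights, biases):
    """Index of the class whose linear function gives the highest score."""
    scores = linear_scores(x, weights, biases)
    return max(range(len(scores)), key=scores.__getitem__)
```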

To perform RGB-D human action recognition, we fuse the features of depth maps and color images at two levels: (1) a feature-level fusion, where the histograms generated from the two sources are simply concatenated to form a longer histogram representation as the input to the classifier, and (2) a classifier-level fusion, where the classifiers for the two sources are trained separately and classifier combination is performed subsequently to generate the final result.

Classifier Fusion

Different fusion rules (multiplying or adding the classifier outputs) give different fusion results.
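The two fusion levels and the effect of the combination rule can be sketched as below. This assumes each classifier outputs one non-negative score per class (for example, normalized confidences); the function names and the "sum"/"product" rule labels are hypothetical and not taken from the paper.

```python
# Illustrative sketch of the two fusion levels. Per-class scores are assumed
# to be non-negative so that the product rule is meaningful.

def feature_level_fusion(depth_hist, rgb_hist):
    """Concatenate the two histograms into one longer representation."""
    return list(depth_hist) + list(rgb_hist)


def classifier_level_fusion(depth_scores, rgb_scores, rule="sum"):
    """Combine per-class scores of two separately trained classifiers.
    Returns (fused_scores, predicted_class_index)."""
    if rule == "sum":
        fused = [d + r for d, r in zip(depth_scores, rgb_scores)]
    elif rule == "product":
        fused = [d * r for d, r in zip(depth_scores, rgb_scores)]
    else:
        raise ValueError("rule must be 'sum' or 'product'")
    label = max(range(len(fused)), key=fused.__getitem__)
    return fused, label
```

For instance, with depth scores [0.9, 0.5] and RGB scores [0.1, 0.4], the sum rule picks class 0 while the product rule picks class 1, since the product penalizes a class that either classifier scores very low.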

 

Summary:

The amount of work in this paper is quite large. The idea itself is actually easy to come up with, but no one had carried it out this well before.

Features: 3D joint features and CS-Mltp (it describes motion information because it uses the differences between the preceding and following frames).

I am still not very clear on the cuboid detector.
