The paper proposes a novel and effective framework that largely improves the performance of human action recognition using both RGB videos and depth maps. The key contribution is the sparse coding-based temporal pyramid matching approach (ScTPM) for feature representation. Thanks to the pyramid structure and the sparse representation of the extracted features, temporal information is well preserved and approximation error is reduced. In addition, a novel center-symmetric motion local ternary pattern (CS-Mltp) descriptor is proposed to capture spatio-temporal features from RGB videos at low computational cost. Using the ScTPM-represented 3D joint features and the CS-Mltp features, both feature-level fusion and classifier-level fusion are explored, which further improves recognition accuracy.
From the feature extraction perspective, a new local spatio-temporal feature descriptor, the center-symmetric motion local ternary pattern (CS-Mltp), is proposed to describe gradient-like characteristics of RGB sequences in both the spatial and temporal directions.
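To make the idea of a center-symmetric ternary motion code more concrete, here is a minimal sketch. It is not the paper's exact definition: the 3x3 neighbourhood, the choice of comparing a pixel in the previous frame with its center-symmetric partner in the next frame, the threshold value, and the function names are all my own assumptions. The only things taken from these notes are the ternary coding, the use of frame differences, and the 16-bin output.

```python
# Hypothetical sketch of a center-symmetric ternary motion code.
# The neighbourhood, pairing, and threshold are assumptions, not the paper's definition.

# Four center-symmetric offset pairs (dy, dx) in a 3x3 neighbourhood.
PAIRS = [((-1, -1), (1, 1)), ((-1, 0), (1, 0)), ((-1, 1), (1, -1)), ((0, 1), (0, -1))]


def cs_ternary_codes(prev_frame, next_frame, y, x, threshold):
    """Return (upper, lower) 4-bit codes for the pixel at (y, x).

    Each pair contributes +1, 0, or -1 depending on the thresholded
    difference; the +1 cases set a bit in `upper`, the -1 cases in `lower`.
    """
    upper = 0
    lower = 0
    for bit, ((dy1, dx1), (dy2, dx2)) in enumerate(PAIRS):
        diff = prev_frame[y + dy1][x + dx1] - next_frame[y + dy2][x + dx2]
        if diff > threshold:
            upper |= 1 << bit
        elif diff < -threshold:
            lower |= 1 << bit
    return upper, lower


def cs_ternary_histograms(prev_frame, next_frame, threshold=5):
    """Accumulate two 16-bin histograms over all interior pixels."""
    hist_upper = [0] * 16
    hist_lower = [0] * 16
    height, width = len(prev_frame), len(prev_frame[0])
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            upper, lower = cs_ternary_codes(prev_frame, next_frame, y, x, threshold)
            hist_upper[upper] += 1
            hist_lower[lower] += 1
    return hist_upper, hist_lower
```

With four pairs and a split into "upper" and "lower" binary patterns, each pattern takes one of 2^4 = 16 values, which is one plausible reading of the 16-bin coding scheme mentioned in these notes.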
From the feature representation perspective, the contribution lies in the design of a temporal pyramid matching approach based on sparse coding of the extracted features to represent the temporal patterns, referred to as sparse coding temporal pyramid matching (ScTPM).
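A rough sketch of the temporal pyramid part, assuming the per-frame sparse codes are already available. The pyramid levels (1, 2, 4 segments) and the use of max pooling over absolute code values are my assumptions, borrowed from the usual spatial pyramid setup; the notes do not state which pooling the paper uses.

```python
def temporal_pyramid_pool(codes, levels=(1, 2, 4)):
    """Pool per-frame sparse codes over a temporal pyramid.

    codes:  list of per-frame code vectors (lists of floats), in temporal order.
    levels: number of equal temporal segments at each pyramid level (assumed).
    Returns one vector: the max-pooled codes of every segment, concatenated.
    """
    n_frames = len(codes)
    dim = len(codes[0])
    pooled = []
    for n_segments in levels:
        for s in range(n_segments):
            start = s * n_frames // n_segments
            end = (s + 1) * n_frames // n_segments
            segment = codes[start:end]
            for k in range(dim):
                # Max pooling over absolute values; an empty segment pools to 0.
                pooled.append(max((abs(c[k]) for c in segment), default=0.0))
    return pooled


if __name__ == "__main__":
    frame_codes = [[1.0, 0.0], [0.0, -2.0], [3.0, 0.0], [0.0, 0.5]]
    print(temporal_pyramid_pool(frame_codes, levels=(1, 2)))
```

Because every sequence is pooled into the same number of segments, sequences with different frame counts end up with vectors of the same length, which is what lets a plain linear classifier be used afterwards.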
From the classification perspective, the paper evaluates both feature-level and classifier-level fusion of the two sources, based on a fast and simple linear SVM classifier.
In the field of video-based human action recognition, the common approach is to design local spatio-temporal feature extraction and representation algorithms, which mainly involve three steps: (1) local interest point detection, such as the spatio-temporal interest points (STIP) [15] and the cuboid detector [4]; (2) local feature description, such as the histogram of oriented gradients (HOG) [3] and the histogram of optical flow (HOF) [8]; (3) feature quantization and representation, such as K-means and bag-of-words (BoW).
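Step (3) of this pipeline can be sketched as follows. The codebook is assumed to be given already (in practice it would come from K-means on training descriptors), and the function names are my own.

```python
def nearest_codeword(descriptor, codebook):
    """Index of the codeword with the smallest squared Euclidean distance."""
    best_index = 0
    best_dist = float("inf")
    for i, word in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(descriptor, word))
        if dist < best_dist:
            best_index, best_dist = i, dist
    return best_index


def bow_histogram(descriptors, codebook):
    """Hard-assign each local descriptor to a codeword and return a normalized histogram."""
    hist = [0.0] * len(codebook)
    for descriptor in descriptors:
        hist[nearest_codeword(descriptor, codebook)] += 1.0
    total = sum(hist)
    return [h / total for h in hist] if total else hist


if __name__ == "__main__":
    codebook = [[0.0, 0.0], [10.0, 10.0]]  # assumed: two K-means centers
    descriptors = [[1.0, 1.0], [9.0, 9.0], [8.0, 10.0], [0.0, 2.0]]
    print(bow_histogram(descriptors, codebook))
```

Note that this hard assignment discards the temporal order of the descriptors, which is the gap a temporal pyramid is meant to fill.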
The proposed CS-Mltp descriptor has advantages in several aspects: first, it can easily be combined with any detector for action recognition, since it adopts a 16-bin coding scheme; second, it encodes both shape and motion characteristics, which ensures high and stable performance.
Depth features
Since 3D joint features are frame-based, the different number of frames in each sequence requires algorithms to provide a solution for "temporal alignment". Most existing algorithms solve this problem through temporal modeling, which models the temporal evolution of different actions. For example, the HMM is widely used to model temporal evolution [19, 32, 5]. The conditional random field (CRF) [6] predicts the motion patterns in the manifold subspace. Dynamic temporal warping (DTW) [22] tries to compute the optimal alignment of motion templates composed of 3D joints. However, the noisy joint positions extracted by the skeleton tracker [26] may undermine the performance of these models, and the limited number of training samples makes these algorithms prone to overfitting.
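For reference, the temporal alignment problem that DTW addresses can be sketched with the textbook dynamic-programming recurrence. This is the generic algorithm, not the specific variant of [22]; the Euclidean frame distance is my assumption.

```python
def dtw_distance(seq_a, seq_b):
    """Classic dynamic time warping cost between two sequences of feature vectors."""
    inf = float("inf")
    n, m = len(seq_a), len(seq_b)
    # cost[i][j] = best alignment cost of the first i frames of seq_a
    # with the first j frames of seq_b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            frame_dist = sum((a - b) ** 2 for a, b in zip(seq_a[i - 1], seq_b[j - 1])) ** 0.5
            cost[i][j] = frame_dist + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


if __name__ == "__main__":
    slow = [[0.0], [1.0], [1.0], [2.0]]
    fast = [[0.0], [1.0], [2.0]]
    print(dtw_distance(slow, fast))  # same motion at a different speed
```

Every frame distance is computed directly on the joint positions, so noise from the skeleton tracker feeds straight into the alignment cost, which is the weakness pointed out above.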
Features and Classifier Fusion
Unlike the work in the literature that is based on the HMM, the algorithm here is based on the proposed ScTPM, and two kinds of fusion schemes are investigated. Feature-level fusion concatenates the histograms generated from the two sources to form a longer histogram representation as the input to the classifier, and classifier-level fusion combines the classification results on both sources to generate the final result.
Spatio-Temporal Feature Extraction
Spatio-Temporal Feature Representation
In sparse coding, the dictionary V is learned in the training phase from a large number of features collected from the training samples, by iteratively optimizing Eqs. (5) and (6). In the coding phase, the sparse codes are obtained by optimizing Eq. (5) given the learned V.
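The coding phase can be illustrated with a small sketch. Eqs. (5) and (6) are not reproduced in these notes, so I am assuming the standard L1-regularized least-squares objective, 0.5 * ||x - sum_k u_k v_k||^2 + lam * ||u||_1, and solving it with plain ISTA (iterative soft-thresholding). The solver choice, the parameter values, and the names are my assumptions, not the paper's.

```python
def soft_threshold(value, t):
    """Shrink a scalar towards zero by t (proximal operator of the L1 norm)."""
    if value > t:
        return value - t
    if value < -t:
        return value + t
    return 0.0


def sparse_code(x, dictionary, lam=0.1, n_iter=200):
    """Sparse code of x for a fixed dictionary (list of atoms), via ISTA.

    Minimizes 0.5 * ||x - sum_k u[k] * dictionary[k]||^2 + lam * sum_k |u[k]|.
    """
    n_atoms = len(dictionary)
    # Safe step size: 1 / squared Frobenius norm, an upper bound on the Lipschitz constant.
    step = 1.0 / sum(v * v for atom in dictionary for v in atom)
    u = [0.0] * n_atoms
    for _ in range(n_iter):
        recon = [sum(u[k] * dictionary[k][d] for k in range(n_atoms)) for d in range(len(x))]
        residual = [xd - rd for xd, rd in zip(x, recon)]
        u = [
            soft_threshold(
                u[k] + step * sum(a * r for a, r in zip(dictionary[k], residual)),
                step * lam,
            )
            for k in range(n_atoms)
        ]
    return u


if __name__ == "__main__":
    atoms = [[1.0, 0.0], [0.0, 1.0]]  # toy dictionary, assumed already learned
    print(sparse_code([2.0, 0.0], atoms, lam=0.1))
```

In the training phase the same coding step would alternate with an update of the dictionary itself; that part is left out here.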
Feature Representation
Classification
For multi-class classification, the linear SVM is equivalent to learning L linear functions, where L is the number of classes.
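At prediction time, learning one linear function per class typically means scoring the input with each function and picking the largest score. A minimal sketch, assuming one weight vector and one bias per class (the weights below are made up for illustration, not learned):

```python
def predict_class(x, weights, biases):
    """Score x with one linear function per class and return (best_class, scores)."""
    scores = [
        sum(w_d * x_d for w_d, x_d in zip(w, x)) + b
        for w, b in zip(weights, biases)
    ]
    best_class = max(range(len(scores)), key=scores.__getitem__)
    return best_class, scores


if __name__ == "__main__":
    weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # hypothetical, 3 classes
    biases = [0.0, 0.0, 0.0]
    label, scores = predict_class([0.2, 0.9], weights, biases)
    print(label, scores)
```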
To perform RGB-D human action recognition, the features of the depth maps and the color images are fused at two levels: (1) feature-level fusion, where the histograms generated from the two sources are simply concatenated to form a longer histogram representation as the input to the classifier, and (2) classifier-level fusion, where the classifiers for the two sources are trained separately and classifier combination is performed afterwards to generate the final result.
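Feature-level fusion is simple enough to show in a few lines. The concatenation is what the text describes; the L2 normalization of each source before concatenating is my own addition, so that neither source dominates just because of its scale.

```python
def l2_normalize(vec):
    """Scale a vector to unit L2 norm (left unchanged if it is all zeros)."""
    norm = sum(v * v for v in vec) ** 0.5
    return [v / norm for v in vec] if norm else list(vec)


def fuse_features(depth_hist, rgb_hist, normalize=True):
    """Concatenate the two per-source representations into one longer vector."""
    if normalize:  # assumption: not stated in the notes
        depth_hist, rgb_hist = l2_normalize(depth_hist), l2_normalize(rgb_hist)
    return list(depth_hist) + list(rgb_hist)


if __name__ == "__main__":
    print(fuse_features([3.0, 4.0], [0.0, 2.0]))
```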
Classifier Fusion
Different fusion methods (multiplying or adding the classifier outputs) give different fusion results.
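A toy example of why the combination rule matters. It assumes each classifier's outputs have already been mapped to non-negative per-class scores such as probabilities (raw SVM margins would need that conversion first); the numbers are made up.

```python
def fuse_scores(scores_depth, scores_rgb, rule="sum"):
    """Combine per-class scores from two classifiers and return the winning class index."""
    if rule == "sum":
        fused = [a + b for a, b in zip(scores_depth, scores_rgb)]
    elif rule == "product":
        fused = [a * b for a, b in zip(scores_depth, scores_rgb)]
    else:
        raise ValueError("rule must be 'sum' or 'product'")
    return max(range(len(fused)), key=fused.__getitem__)


if __name__ == "__main__":
    depth = [0.8, 0.2, 0.0]   # hypothetical depth-classifier probabilities
    rgb = [0.05, 0.45, 0.5]   # hypothetical RGB-classifier probabilities
    print(fuse_scores(depth, rgb, "sum"), fuse_scores(depth, rgb, "product"))
```

The sum rule lets one very confident classifier carry the decision, while the product rule penalizes any class that either classifier considers unlikely, so the two rules can pick different classes for the same inputs.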
Summary:
This paper covers a lot of work. The idea itself is fairly easy to come up with, but no one had carried it out this well before.
Features: 3D joint features and CS-Mltp (which describes motion information, since it uses the frame difference between the preceding and following frames).
I am still not too clear on the cuboid detector.
Spatio-temporal feature extraction and representation for RGB-D human action recognition