Summary of feature extraction for behavioral recognition



Human behavior recognition currently remains at the stage of action recognition, and action recognition can be regarded as the combination of feature extraction and classifier design. The feature extraction step is very challenging because of occlusion, dynamic backgrounds, moving cameras, viewpoint changes, and illumination changes. This paper surveys the feature extraction methods used in current behavior recognition, divides them into global features and local features, and introduces the advantages and disadvantages of each.

Keywords: behavior recognition; feature extraction; global feature; local feature

1. Introduction

Human behavior recognition is currently a hot topic in computer vision research. Its goal is to automatically analyze the ongoing behavior in an unknown video or image sequence. Simple behavior recognition is action classification: given a video, the task is just to classify it correctly into one of several known action categories. A more complex recognition task arises when a video contains not one action category but several, and the system must automatically identify both the type of each action and its start time. The ultimate goal of behavior recognition is to analyze who is in the video, where they are, when they act, and what they are doing, the so-called "W4 system".

The following four subsections give a preliminary introduction to behavior recognition.

1.1 Application background of behavior recognition

The applications of human behavior recognition are very extensive, mainly in intelligent video surveillance, patient monitoring systems, human-computer interaction, virtual reality, smart homes, intelligent security, and athlete training. In addition, it has broad application prospects and potential economic and social value in content-based video retrieval and intelligent image compression, and many behavior recognition methods have already been put to use.

1.2 History of behavioral recognition research

Research on behavior recognition and analysis can be traced back to Johansson's work in 1975 [1], which proposed a 12-point human body model; this point-model way of describing behavior plays an important role in later behavior description algorithms based on human body structure. The history of behavior recognition research since then can be divided into three stages: the preliminary research stage of behavior analysis in the 1970s, the gradual development stage in the 1990s, and the rapid development stage of recent years. From the six well-known surveys of behavior recognition in [2]~[7], it can be seen that the number of researchers studying behavior recognition is increasing, the number of papers is soaring, and many important algorithms and ideas have been produced.

1.3 Classification system of behavior recognition methods

There are many methodological systems for the visual analysis and recognition of human motion. Forsyth et al. [8] focus on recovering human pose and motion information from video sequences, which is a regression problem, while human behavior recognition is a classification problem; the two problems have much in common, for example many of the features extracted and their descriptions are shared. Turaga et al. [5] divide human behavior recognition into three levels, namely movement recognition, action recognition, and activity recognition, corresponding to low-level, mid-level, and high-level vision respectively. Gavrila [9] studies human behavior separately with 2D and 3D methods.

Within the taxonomy of behavior recognition methodology, a new division has recently emerged [7]: Aggarwal divides the study of human behavior into two categories, one realized at a single level and the other realized with a hierarchical system. Single-layer realizations consist of two kinds, space-time features and sequence features, while hierarchical systems are divided into three kinds: statistical methods, syntactic analysis methods, and description-based methods. Figure 1 shows Aggarwal's hierarchical chart of the behavior recognition methodology system.


Figure 1 Hierarchical structure of behavior recognition methods

This classification system is fairly complete and reflects current research progress. According to Turaga's three-level theory, current behavior recognition basically still stays at the second stage, namely action recognition. The actions recognized are simpler than real-life behavior, so we recognize these behaviors by simply classifying them correctly. Such a behavior recognition system is divided into two parts, behavioral feature extraction and classifier design: a certain feature is extracted from the training data, a classification model is trained in a supervised or unsupervised manner, the same feature is extracted from new data and fed to the model, and the classification result is obtained. Based on this idea, this paper gives a fairly comprehensive introduction to feature extraction for behavior recognition.
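As a concrete illustration of this extract-then-classify pipeline, the following minimal Python sketch uses a toy clip-level feature (statistics of frame differences) and a nearest-centroid classifier. Both the feature and the classifier here are illustrative stand-ins chosen for brevity, not any specific method from the literature.

```python
import numpy as np

def extract_feature(clip):
    """Toy clip-level descriptor: mean and std of absolute
    frame-to-frame differences over a (T, H, W) clip."""
    diffs = np.abs(np.diff(clip.astype(float), axis=0))
    return np.array([diffs.mean(), diffs.std()])

class NearestCentroid:
    """Minimal supervised classifier: one centroid per action class;
    new samples are assigned to the nearest centroid."""
    def fit(self, feats, labels):
        self.classes = sorted(set(labels))
        self.centroids = np.array(
            [np.mean([f for f, l in zip(feats, labels) if l == c], axis=0)
             for c in self.classes])
        return self

    def predict(self, feats):
        dists = np.linalg.norm(
            np.asarray(feats)[:, None] - self.centroids[None], axis=2)
        return [self.classes[i] for i in dists.argmin(axis=1)]

# Training data: "still" clips vs. "moving" clips (synthetic).
rng = np.random.default_rng(0)
still = [np.ones((8, 16, 16)) for _ in range(3)]
moving = [rng.random((8, 16, 16)) for _ in range(3)]
X = [extract_feature(c) for c in still + moving]
y = ["still"] * 3 + ["moving"] * 3
model = NearestCentroid().fit(X, y)
print(model.predict([extract_feature(rng.random((8, 16, 16)))]))  # → ['moving']
```

Real systems differ only in scale: the descriptor is one of the global or local features surveyed below, and the classifier is typically an SVM or a probabilistic model.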

1.4 Difficulties in behavior recognition research

Behavior recognition has made great progress, with certain breakthroughs at the low, middle, and high levels, but its algorithms are still immature: no algorithm is suitable for all behavior classification tasks, and serious problems remain to be solved at all three visual levels. The difficulties of the research are mainly reflected in the following aspects:

1.4.1 Large intra-class variation within action classes

For most actions, even the same action has different forms of expression. For example, walking can occur against different backgrounds, walking speed can range from slow to fast, and step length varies from long to short. Other movements show similar variation, especially non-cyclical movements: walking while crossing the road, for instance, is significantly different from the usual cyclical walking gait. Thus there are many types of action, and each type has many variants, which causes a great deal of trouble for behavior recognition research.

1.4.2 Influence of the environment and background

The influence of the environment, background, and similar factors is the biggest difficulty in every field of computer vision. Viewpoints vary widely: the same movement observed from different viewpoints yields different two-dimensional images. Interactions between people, and between people and the background, make the feature extraction that precedes classification difficult. At present, to address the problems of multiple viewpoints and occlusion, multi-camera fusion via 3-dimensional reconstruction has been proposed. Other influencing factors include dynamically changing and cluttered backgrounds, changes in environmental illumination, and low-resolution images and video.

1.4.3 The impact of temporal variation

It is well known that human behavior is inseparable from the time factor. Moreover, the videos we capture may differ in format, and their playback speeds vary from slow to fast, which requires the proposed system to be insensitive to video playback rate.

1.4.4 Acquisition and labeling of data

Since the behavior recognition problem is treated as a classification problem, a great deal of data is needed to train the classification model. These data are video data, in which the location and timing of each action are uncertain; the different manifestations of the same action and the differences between different actions must also be considered, that is, the diversity and comprehensiveness of the data. The collection process is no small amount of work, but some open databases are already available on the Internet for experimentation; these will be introduced in Section 3 of this article.

In addition, labeling video data manually is difficult. Some scholars have therefore proposed methods of automatic labeling, such as using web image search engines [10], using video subtitles [11], and matching against film description texts [12][13][14].

1.4.5 Understanding of high-level vision

As mentioned above, current research on behavior recognition is still at the level of action recognition. The behaviors it handles can be divided into two categories. One is a restricted category of simple, regular actions, such as walking, running, waving, bending, and jumping. The other is specific behavior in specific scenarios [15]~[19], such as detecting the abnormal behavior of terrorists, or someone suddenly leaving after dropping a package. In such scenarios the description of behavior is strictly restricted, and the description generally uses motion or trajectory. Neither of these two lines of behavior recognition research is mature; many problems have been encountered, and both are far from the requirements of high-level behavior recognition. Therefore, the understanding, representation, and recognition of high-level vision remains a huge problem.

2. Feature extraction for behavior recognition

In this section, we focus on how to extract features from an image sequence. This paper divides the features used in behavior recognition into two categories: global features and local features.

A global feature treats the object of interest as a whole, a top-down research approach. In this case, the person in the video must be located first, which can be done either with background subtraction or with a target tracking algorithm. The target is then encoded in some way, forming the global feature. Global features are effective because they contain a great deal of information about the human body. However, they rely too heavily on low-level visual processing, such as accurate background subtraction and human localization and tracking, and these processing steps are themselves difficult problems in computer vision. In addition, global features are sensitive to noise, viewpoint changes, occlusion, and so on.
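A minimal sketch of the background subtraction step this top-down approach depends on, assuming a static camera and a precomputed background image (real systems use adaptive background models, but thresholding against a fixed model shows the idea):

```python
import numpy as np

def silhouette(frame, background, thresh=25):
    """Binary foreground mask: pixels whose intensity differs from the
    background model by more than `thresh` are marked as foreground."""
    diff = np.abs(frame.astype(int) - background.astype(int))
    return (diff > thresh).astype(np.uint8)

background = np.full((4, 4), 100, dtype=np.uint8)
frame = background.copy()
frame[1:3, 1:3] = 200          # a bright "person" enters the scene
print(silhouette(frame, background).sum())  # → 4 foreground pixels
```

The resulting mask is the raw material for the silhouette- and contour-based global features described in Section 2.1.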

Local feature extraction collects relatively independent image blocks from the human body, a bottom-up research approach. The general practice is to detect some spatio-temporal interest points in the video, extract the image blocks around these points, and finally combine the blocks to describe a particular action. The advantage of local features is that they do not depend on low-level human segmentation and tracking and are not very sensitive to noise and occlusion. However, a sufficient number of interest points related to the action category must be extracted, so a lot of preprocessing is required.

2.1 Global Feature Extraction

A global feature describes the detected object of interest, the whole human body, and is generally obtained through background subtraction or tracking; the information commonly used includes the body's edges, silhouette contours, and optical flow. These features are sensitive to noise, partial occlusion, and viewpoint changes. Two-dimensional and three-dimensional features are introduced below.

2.1.1 Two-dimensional global feature extraction

Davis et al. [20] were among the earliest to use contours to describe the body's motion information, saving the corresponding action information in two templates, the MEI and the MHI, and then recognizing actions with a Mahalanobis distance classifier. The MEI is a motion energy image, which indicates where motion has occurred; the MHI is a motion history image, which in addition to the spatial location of the motion also captures its temporal order. Both features are obtained from background subtraction masks. Figure 2 shows the MHIs of three actions: sitting, waving, and crouching.

Figure 2 The MHIs corresponding to the 3 actions
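A minimal numpy sketch of the MHI update rule, following the general recipe of Bobick and Davis with illustrative values for the duration parameter and the decay step:

```python
import numpy as np

def update_mhi(mhi, motion_mask, tau=255, decay=32):
    """One MHI update step: pixels moving in the current frame are set
    to the maximum value tau; all other pixels decay toward zero, so
    brighter pixels encode more recent motion."""
    out = np.where(motion_mask > 0, tau,
                   np.maximum(mhi.astype(int) - decay, 0))
    return out.astype(np.uint8)

mhi = np.zeros((4, 4), dtype=np.uint8)
mask1 = np.zeros((4, 4), dtype=np.uint8); mask1[0, 0] = 1
mask2 = np.zeros((4, 4), dtype=np.uint8); mask2[1, 1] = 1
mhi = update_mhi(mhi, mask1)      # motion at (0, 0)
mhi = update_mhi(mhi, mask2)      # later motion at (1, 1)
mei = (mhi > 0).astype(np.uint8)  # the MEI is simply where any motion occurred
print(mhi[1, 1] > mhi[0, 0])      # → True: more recent motion is brighter
```

Because older motion decays while new motion saturates, a single MHI frame encodes the temporal order of an action, which is exactly what makes it usable as a global template.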

To make further use of silhouette information, Wang et al. [21] apply the R transform to the human silhouette. Hsuan-Shen [22] extracts the contours of the human body and describes them with a star-shaped skeleton, formed by lines extending from the body's centroid to the hands, feet, and head on the contour, using the angles between these lines and a baseline as the description. Wang [23] uses both silhouette and contour information to describe the action, with two templates: the contour-based mean motion shape (MMS) and the motion-foreground-based average motion energy (AME). To match newly extracted features against the saved contour and silhouette templates, Weinland [24] uses Euclidean distance to measure similarity, and later uses the chamfer distance instead [25], thereby eliminating the background subtraction preprocessing step.

In addition to contour and silhouette information, the body's motion information is often used, for example pixel-level background differences and optical flow. When background differencing does not work well, the optical flow method can often be adopted, but this tends to introduce motion noise; Efros [26] computes the optical flow only around the center of the human figure, which reduces the effect of noise to some extent.
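A rough sketch of estimating flow at a single point, in the spirit of computing flow only near the person's center. This is a textbook single-window Lucas-Kanade least-squares solve, not Efros's actual implementation:

```python
import numpy as np

def flow_at_point(prev, curr, y, x, win=7):
    """Least-squares Lucas-Kanade estimate of the optical flow (vx, vy)
    inside a small window centred on (y, x)."""
    h = win // 2
    prev = prev.astype(float)
    Iy, Ix = np.gradient(prev)          # spatial image gradients
    It = curr.astype(float) - prev      # temporal gradient
    sl = (slice(y - h, y + h + 1), slice(x - h, x + h + 1))
    A = np.stack([Ix[sl].ravel(), Iy[sl].ravel()], axis=1)
    b = -It[sl].ravel()
    (vx, vy), *_ = np.linalg.lstsq(A, b, rcond=None)
    return vx, vy

ramp = np.tile(np.arange(20.0), (20, 1))  # brightness increases along x
shifted = ramp - 1.0                      # the pattern moved 1 px to the right
vx, vy = flow_at_point(ramp, shifted, 10, 10)
print(round(float(vx), 3))  # → 1.0
```

Restricting the solve to one small window is what keeps the estimate cheap and relatively robust to the motion noise that dense flow fields suffer from.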

2.1.2 Three-dimensional global feature extraction

In three-dimensional space, a 3D space-time volume (STV) can be obtained from the data in a given video; computing the STV requires accurate localization, target alignment, and sometimes background subtraction. Blank et al. [27][28] were the first to obtain the STV from the silhouette information of a video sequence, as shown in Figure 3. Local space-time saliency points and their orientation features are then derived via the Poisson equation, and the global feature is obtained by weighting these local features. To handle the different durations of different actions, Achard [29] uses a series of STVs for each video, each covering only part of the temporal dimension.

Another approach is to extract local descriptors from the STV; this is covered in the section on local feature extraction, and here we consider the STV feature as a global feature. Batra [30] stores the silhouette of the STV and samples it with small 3D binary space blocks. Yilmaz [31] extracts differential-geometric features of the STV surface, such as its maximum and minimum points. There are also scholars, such as Ke [32], who combine the silhouette STV with optical flow information as a global feature for behavior recognition.

Figure 3 STVs of 3 actions: jumping, walking, and running

2.2 Local Feature Extraction

Local feature extraction for human behavior recognition refers to extracting points or blocks of interest on the human body. It therefore requires no accurate human localization and tracking, and local features are not very sensitive to changes in human appearance, viewpoint changes, or partial occlusion. For this reason, classifiers built on local features are common in behavior recognition. The following introduces local feature point detection and local feature point description in turn.

2.2.1 Detection of local feature points

Local feature points in behavior recognition are points in the time and space of a video, and their detection occurs at abrupt changes in the video's motion, because the points produced at motion discontinuities contain most of the information relevant to analyzing human behavior. Consequently, these feature points are difficult to detect when the human body moves linearly or uniformly.

Laptev [33] extended the Harris corner detector to a 3D Harris detector, one of the family of space-time interest points (STIP). The pixel values in the neighborhood of these spatio-temporal feature points change significantly in both time and space. In this algorithm, the scale of the neighborhood block adapts to both the temporal and spatial dimensions. Such space-time feature points are shown in Figure 4.


Figure 4 Temporal and spatial feature point detection diagram

Dollár [34] pointed out a disadvantage of the above method, namely that it detects too few stable interest points, so he filters separately in the spatial and temporal dimensions, applying Gabor filtering in time; in this way, the number of detected interest points changes with the size of the local spatio-temporal neighborhood. Similarly, Rapantzikos [35] applies the discrete wavelet transform in each of the 3 dimensions and selects points salient in time and space via the low-pass and high-pass filtering responses of each dimension. Later, to integrate color and motion information, Rapantzikos [36] added color and motion information to the saliency computation.
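The core of the Dollár detector is a response function built from a quadrature pair of 1-D temporal Gabor filters; the sketch below applies that temporal part with numpy. The spatial Gaussian smoothing and local-maximum selection of the full detector are omitted, and the parameter values are illustrative:

```python
import numpy as np

def gabor_response(clip, tau=1.5, omega=0.25):
    """Squared response of an even/odd temporal Gabor pair at every
    pixel of a (T, H, W) clip; periodic intensity changes score high."""
    t = np.arange(-8, 9)
    envelope = np.exp(-t**2 / (2 * tau**2))
    h_ev = np.cos(2 * np.pi * omega * t) * envelope
    h_od = np.sin(2 * np.pi * omega * t) * envelope

    def filt(h):
        # convolve each pixel's time series with the 1-D filter
        return np.apply_along_axis(
            lambda s: np.convolve(s, h, mode='same'), 0, clip.astype(float))

    return filt(h_ev) ** 2 + filt(h_od) ** 2

clip = np.zeros((32, 4, 4))
clip[:, 1, 1] = np.sin(2 * np.pi * 0.25 * np.arange(32))  # flickering pixel
R = gabor_response(clip)
print(R[:, 1, 1].max() > R[:, 0, 0].max())  # → True
```

Interest points are then taken as local maxima of this response volume, which is why the detector fires densely on periodic motion where the 3D Harris detector finds few points.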

Unlike the idea of detecting interest points over the whole human body, Wong [37] first detects interest points in subspaces associated with motion; these subspaces correspond to parts of a movement, such as an arm swing, and sparse interest points are then detected within them. In a similar spirit, Bregonzio [38] first estimates the focus of visual attention by computing differences between consecutive frames, and then uses Gabor filtering to detect salient points in those regions.

2.2.2 Description of local feature points

A local feature descriptor describes an image patch or a block of video, and it should be insensitive to background clutter, scale, and orientation. The spatial and temporal extents of a block are usually determined by the scale of the detected interest point. Figure 5 shows the cuboid descriptor [34].

Figure 5 The cuboid descriptor

A feature block can also be described with a grid of local features: since each grid cell covers part of the locally observed pixels and is treated as one block, the effect of local variations in time and space is reduced. Willems [40] extended the two-dimensional SURF feature [39] to 3 dimensions; each cell of the resulting ESURF feature contains a vector of Haar-wavelet responses. Laptev [14] uses local HOG (histograms of oriented gradients) and HOF (histograms of optical flow). Klaser [41] extends the HOG feature to 3 dimensions, forming 3D-HOG; each bin of the 3D-HOG corresponds to a face of a regular polyhedron, and 3D-HOG allows fast, dense sampling of cuboids at multiple scales. This extension of two-dimensional features to 3 dimensions is similar to Scovanner's work [43] extending the SIFT algorithm [42] to 3D SIFT. In [44], Wang compared various local descriptors and found that, in most cases, descriptors integrating gradient and optical flow information perform best.
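A minimal sketch of the building block shared by HOG and HOF, a gradient-orientation histogram over one cell. HOF uses optical-flow vectors in place of image gradients, and the cell size and bin count here are illustrative:

```python
import numpy as np

def orientation_histogram(patch, n_bins=8):
    """Magnitude-weighted histogram of unsigned gradient orientations
    for one cell, L1-normalised."""
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx) % np.pi              # unsigned, in [0, pi)
    bins = ((ang / np.pi) * n_bins).astype(int) % n_bins
    hist = np.bincount(bins.ravel(), weights=mag.ravel(), minlength=n_bins)
    return hist / (hist.sum() + 1e-8)

patch = np.zeros((8, 8))
patch[:, 4:] = 1.0                                # a vertical edge
print(orientation_histogram(patch).argmax())      # → 0 (horizontal gradient)
```

A full descriptor concatenates the histograms of many such cells over the interest-point neighborhood; the 3D extensions do the same with spatio-temporal cells.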

Another popular descriptor is the bag of words [45][46], which represents a video by a histogram of visual-word frequencies.
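In outline, bag-of-words encoding assigns each local descriptor to its nearest codeword and counts frequencies. The sketch below assumes a codebook has already been learned, for example by k-means clustering of training descriptors:

```python
import numpy as np

def bow_histogram(descriptors, codebook):
    """Normalised word-frequency histogram: each descriptor votes for
    its nearest codeword (Euclidean distance)."""
    dists = np.linalg.norm(descriptors[:, None] - codebook[None], axis=2)
    words = dists.argmin(axis=1)
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / hist.sum()

codebook = np.array([[0.0, 0.0], [10.0, 10.0]])   # 2 codewords
descriptors = np.array([[0.1, 0.2], [9.8, 10.1],
                        [10.2, 9.9], [0.0, 0.1]])
print(bow_histogram(descriptors, codebook))       # → [0.5 0.5]
```

The resulting fixed-length histogram is what allows a variable number of local features per video to be fed to a standard classifier.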

2.3 Global and local feature fusion

The fusion of global and local features combines the rich information of global features with the advantages of local features: insensitivity to viewpoint changes and partial occlusion, and strong robustness to interference. Articles of this kind are numerous, and their main ideas combine the methods of Sections 2.1 and 2.2. Thi [47] integrates the 2 kinds of features well: the global feature is the MHI operator described above, with the AIFT algorithm [48] used to further select better MHIs; the local feature is the STIP feature mentioned earlier, with the SBFC (sparse Bayesian feature selection) algorithm [49] used to filter out noisy feature points. Finally, the 2 kinds of features are fed into an extended 3D ISM model; ISM [50] is a common algorithm for object recognition that trains an implicit shape model of the target. The structure of Thi's method is shown in Figure 6.

Figure 6 combination of local and global features

3. Common databases for behavior recognition

3.1 Weizmann

The Weizmann database [27] contains 10 actions: walking, running, jumping, jumping in place, galloping sideways, waving one hand, waving two hands, skipping, jumping jacks, and bending; each action is performed by 9 people. In this video set the background is static, and silhouette information is provided for the foreground. This dataset is relatively simple.

3.2 KTH

The KTH pedestrian database [45] contains 6 actions: walking, jogging, running, boxing, hand waving, and hand clapping. Each action is performed by 25 different people, and each is completed in 4 different scenarios: outdoors, outdoors with scale variation, outdoors with different clothes, and indoors.

3.3 PETS

PETS [51], the workshop on Performance Evaluation of Tracking and Surveillance, provides databases drawn from real life, mainly video obtained directly from video surveillance systems, such as supermarket monitoring. The workshop has been organized almost every year since 2000.

3.4 UCF

UCF provides several datasets; here we refer to the UCF sports action database [52], which includes 150 video sequences covering 13 actions. Because this is real-life video data, the backgrounds are complex and the actions are difficult to recognize.

3.5 Inria XMAS

The INRIA XMAS database [53] is captured from 5 viewpoints, 4 from the sides of the room and 1 from overhead. A total of 11 people perform 14 different actions, each of which may be carried out in any direction. The cameras are stationary and the ambient lighting is essentially unchanged. In addition, the dataset provides information such as body silhouettes and volume elements (voxels).

3.6 Hollywood

There are several Hollywood movie databases. One video set [14] has 8 kinds of actions: answering the phone, getting out of a car, shaking hands, hugging, kissing, sitting down, sitting up, and standing up. These actions are taken directly from films and are performed by different actors in different environments. Another set [54] adds 4 more actions on this basis: driving a car, eating, fighting, and running. Its training set comes with automatic text annotations derived from the movie descriptions, part of which is manually labeled. Because of factors such as occlusion, moving cameras, and dynamic backgrounds, this dataset is very challenging.

4. Summary

This paper has introduced the feature extraction methods used in behavior recognition, dividing them into two parts: global feature extraction and local feature extraction. Although behavior recognition has achieved a great deal since its beginnings, dynamic environments, occlusion, and other problems make the task extremely challenging. Features that are more robust, more adaptable, and better performing are needed, and this remains a goal to be pursued over the next few years or even decades.

References:

  1. Johansson, G. (1975). "Visual motion perception." Scientific American.
  2. Aggarwal, J. K. and Q. Cai (1997). Human motion analysis: A review, IEEE.
  3. Moeslund, T. B. and E. Granum (2001). "A survey of computer vision-based human motion capture." Computer Vision and Image Understanding 81(3): 231-268.
  4. Moeslund, T. B., A. Hilton, et al. (2006). "A survey of advances in vision-based human motion capture and analysis." Computer Vision and Image Understanding 104(2): 90-126.
  5. Turaga, P., R. Chellappa, et al. (2008). "Machine recognition of human activities: A survey." Circuits and Systems for Video Technology, IEEE Transactions on 18(11): 1473-1488.
  6. Poppe, R. (2010). "A survey on vision-based human action recognition." Image and Vision Computing (6): 976-990.
  7. Aggarwal, J. and M. S. Ryoo (2011). "Human activity analysis: A review." ACM Computing Surveys (CSUR) (3): 16.
  8. Forsyth, D. A., O. Arikan, et al. (2006). Computational studies of human motion: Tracking and motion synthesis, Now Publishers.
  9. Gavrila, D. M. (1999). "The visual analysis of human movement: A survey." Computer Vision and Image Understanding (1): 82-98.

  10. Ikizler-Cinbis, N., R. G. Cinbis, et al. (2009). Learning actions from the Web, IEEE.
  11. Gupta, S. and R. J. Mooney (2009). Using closed captions to train activity recognizers that improve video retrieval, IEEE.
  12. Cour, T., C. Jordan, et al. (2008). Movie/script: Alignment and parsing of video and text transcription.
  13. Duchenne, O., I. Laptev, et al. (2009). Automatic annotation of human actions in video, IEEE.
  14. Laptev, I., M. Marszalek, et al. (2008). Learning realistic human actions from movies, IEEE.
  15. Haritaoglu, I., D. Harwood, et al. (1998). "W4S: A real-time system for detecting and tracking people in 2 1/2 D." Computer Vision - ECCV '98: 877-892.
  16. Tao, D., X. Li, et al. (2006). Human carrying status in visual surveillance, IEEE.
  17. Davis, J. W. and S. Taylor (2002). Analysis and recognition of walking movements, IEEE.
  18. Lv, F., X. Song, et al. (2006). Left luggage detection using Bayesian inference.
  19. Auvinet, E., E. Grossmann, et al. (2006). Left-luggage detection using homographies and simple heuristics.
  20. Bobick, A. F. and J. Davis (2001). "The recognition of human movement using temporal templates." Pattern Analysis and Machine Intelligence, IEEE Transactions on (3): 257-267.
  21. Wang, Y., K. Huang, et al. (2007). Human activity recognition based on R transform, IEEE.
  22. Chen, H. S., H. T. Chen, et al. (2006). Human action recognition using star skeleton, ACM.
  23. Wang, L. and D. Suter (2006). Informative shape representations for human action recognition, IEEE.
  24. Weinland, D., E. Boyer, et al. (2007). Action recognition from arbitrary views using 3D exemplars, IEEE.
  25. Weinland, D. and E. Boyer (2008). Action recognition using exemplar-based embedding, IEEE.
  26. Efros, A. A., A. C. Berg, et al. (2003). Recognizing action at a distance, IEEE.
  27. Blank, M., L. Gorelick, et al. (2005). Actions as space-time shapes, IEEE.
  28. Gorelick, L., M. Blank, et al. (2007). "Actions as space-time shapes." Pattern Analysis and Machine Intelligence, IEEE Transactions on (12): 2247-2253.
  29. Achard, C., X. Qu, et al. (2008). "A novel approach for recognition of human actions with semi-global features." Machine Vision and Applications 19(1): 27-34.
  30. Batra, D., T. Chen, et al. (2008). Space-time shapelets for action recognition, IEEE.
  31. Yilmaz, A. and M. Shah (2008). "A differential geometric approach to representing the human actions." Computer Vision and Image Understanding 109(3): 335-351.
  32. Ke, Y., R. Sukthankar, et al. (2007). Spatio-temporal shape and flow correlation for action recognition, IEEE.
  33. Laptev, I. (2005). "On space-time interest points." International Journal of Computer Vision (2): 107-123.
  34. Dollár, P., V. Rabaud, et al. (2005). Behavior recognition via sparse spatio-temporal features, IEEE.
  35. Rapantzikos, K., Y. Avrithis, et al. (2007). Spatiotemporal saliency for event detection and representation in the 3D wavelet domain: Potential in human action recognition, ACM.
  36. Rapantzikos, K., Y. Avrithis, et al. (2009). Dense saliency-based spatiotemporal feature points for action recognition, IEEE.
  37. Wong, S. F. and R. Cipolla (2007). Extracting spatiotemporal interest points using global information, IEEE.
  38. Bregonzio, M., S. Gong, et al. (2009). Recognising action as clouds of space-time interest points, IEEE.
  39. Bay, H., T. Tuytelaars, et al. (2006). "SURF: Speeded up robust features." Computer Vision - ECCV 2006: 404-417.
  40. Willems, G., T. Tuytelaars, et al. (2008). "An efficient dense and scale-invariant spatio-temporal interest point detector." Computer Vision - ECCV 2008: 650-663.
  41. Klaser, A. and M. Marszalek (2008). "A spatio-temporal descriptor based on 3D-gradients."
  42. Mikolajczyk, K. and C. Schmid (2004). "Scale & affine invariant interest point detectors." International Journal of Computer Vision (1): 63-86.
  43. Scovanner, P., S. Ali, et al. (2007). A 3-dimensional SIFT descriptor and its application to action recognition, ACM.
  44. Wang, H., M. M. Ullah, et al. (2009). "Evaluation of local spatio-temporal features for action recognition."
  45. Niebles, J. C., H. Wang, et al. (2008). "Unsupervised learning of human action categories using spatial-temporal words." International Journal of Computer Vision (3): 299-318.
  46. Schuldt, C., I. Laptev, et al. (2004). Recognizing human actions: A local SVM approach, IEEE.
  47. Thi, T. H., L. Cheng, et al. (2011). "Integrating local action elements for action analysis." Computer Vision and Image Understanding.
  48. Liu, G., Z. Lin, et al. (2009). "Radon representation-based feature descriptor for texture classification." Image Processing, IEEE Transactions on (5): 921-928.
  49. Carbonetto, P., G. Dorkó, et al. (2008). "Learning to recognize objects with little supervision." International Journal of Computer Vision (1): 219-237.
  50. Leibe, B., A. Leonardis, et al. (2008). "Robust object detection with interleaved categorization and segmentation." International Journal of Computer Vision (1): 259-289.
  52. Rodriguez, M. D. (2008). "Action MACH: A spatio-temporal maximum average correlation height filter for action recognition." CVPR.
  53. Weinland, D., R. Ronfard, et al. (2006). "Free viewpoint action recognition using motion history volumes." Computer Vision and Image Understanding 104(2): 249-257.
  54. Marszalek, M., I. Laptev, et al. (2009). Actions in context, IEEE.
