Using RGB-D data for human body detection (with dataset)

Human body detection using RGB-D data

Luciano Spinello, Kai O. Arras

Summary

Human detection is an important problem in robotics and intelligent systems. Previous research has relied on cameras and on 2D or 3D range finders. In this paper, we propose a new method for human body detection in RGB-D data. Inspired by HOG (Histogram of Oriented Gradients), we designed a method for detecting people in dense depth data, called HOD (Histogram of Oriented Depths). HOD encodes the direction of local depth changes and relies on a depth-informed scale-space search that speeds up detection by a factor of three. We then present Combo-HOD, an RGB-D detector that combines HOD and HOG. The experiments include a comprehensive comparison of several detection methods: HOG, several HOD variants, a geometric person detector for 3D point clouds, and an AdaBoost detector based on Haar features. On real-world indoor data acquired with a Kinect sensor, HOD and Combo-HOD prove robust at ranges of up to 8 meters, with an equal error rate of 85%.

1 Introduction

Human detection is an important and fundamental component of many robots, interactive systems and intelligent vehicles. The sensors commonly used for human detection are cameras and range finders. Both have advantages and drawbacks, and now that reliable and inexpensive RGB-D sensors are becoming available, which provide both image and range data, the need to choose between the two is disappearing.

In robotics, many researchers use range data for human detection. Early work used 2D range data; using 3D range data for human detection is a relatively recent problem. Navarro [3] divides the data into virtual 2D slices, looks for salient vertical objects on the ground plane, and uses a set of features with an SVM classifier to decide whether an object is a person. Bajracharya [4] detects people in point clouds from stereo vision, using a fixed human model to evaluate a set of geometric and statistical features of vertical objects in the point cloud. Both methods require a ground plane assumption. Spinello et al. [5] overcome this limitation with a part-based voting method and a top-down verification procedure that learns an optimal feature set.

In computer vision, the problem of detecting people in single images has been studied for a long time. Recent work uses part-based voting or sliding-window search [6-10]. In the former, each body part votes independently for the presence of a person; in the latter, a fixed-size detection window slides over the image at different scales and each region is classified. Multimodal human detection has also been studied: one approach trains a system on 2D range data and camera images, another [] uses a stereo system to combine image data, disparity maps and optical flow, and a third uses intensity images and a low-resolution time-of-flight camera.

The contribution of this paper to the field of human detection is as follows:

(1) We present HOD (Histogram of Oriented Depths), a robust human detection method for dense depth data, inspired by the HOG method and by the depth characteristics of the Kinect RGB-D sensor.

(2) We propose a depth-informed scale-space search, based on a trained scale-to-distance mapping and a novel use of integral images.

(3) We propose Combo-HOD, a new fusion method for human detection in RGB-D data.

(4) The experiments include a comprehensive comparison of several detection methods: HOG, several HOD variants, a geometric person detector for 3D point clouds [5], and an AdaBoost detector based on Haar features [.]

Note that our approach is neither dependent on background learning nor dependent on ground plane assumptions.

The paper is structured as follows: the next section discusses the characteristics of the Kinect sensor. Section 3 describes the HOD descriptor and the Combo-HOD method for detecting people in RGB-D data. Section 4 describes the dataset, the performance metrics and the comparative experiments. Section 5 concludes.

Figure 1: Human detection in RGB-D data (right) and color image data (left). The method depends on neither background learning nor ground plane estimation.

2 Kinect sensor features

In this section we analyze the characteristics of the Microsoft Kinect RGB-D sensor used in the experiments. The Kinect consists of an IR camera, an IR projector and a standard color camera, and measures depth using the infrared structured-light principle [+]. The depth map has a resolution of 640×480 pixels with 11 bits per pixel. Interestingly, not all bits are used to encode depth: out-of-range depth values are assigned Vmax = 1084 and the minimum depth value is Vmin = 290, so only 794 distinct values (about 10 bits) encode the depth of each pixel.

The relationship between the raw depth value v and the distance d in meters is []:

where B = 0.075 m is the distance between the IR projector and the IR camera (the baseline), and Fx is the horizontal focal length of the IR camera. d is ignored if it is negative. Equation (1) is a hyperbolic relationship, similar to the one that relates point correspondences to depth in a stereo camera system. Figure 2 shows the relationship between v and d, together with the "adequate play space" in which, according to the user manual, the sensor works reliably. This space is limited to a maximum of 2 m to 2.5 m in front of the device.
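As a rough illustration of the hyperbolic relationship described above, the sketch below converts raw Kinect values to meters. The closed form `d = 8 * B * FX / (V_MAX - v)` and the focal length `FX = 590.0` pixels are assumptions, chosen to be consistent with the constants quoted in the text (B, Vmax, Vmin); they are not taken from this article.

```python
import numpy as np

B = 0.075      # baseline between IR projector and IR camera in meters (from the text)
V_MAX = 1084   # raw value assigned to out-of-range pixels (from the text)
V_MIN = 290    # smallest raw value (from the text)
FX = 590.0     # ASSUMPTION: horizontal focal length of the IR camera, in pixels


def raw_to_meters(v):
    """Convert raw depth values to meters with an ASSUMED hyperbolic model.

    Values at or above V_MAX carry no range information and map to NaN.
    """
    v = np.asarray(v, dtype=np.float64)
    d = np.full(v.shape, np.nan)
    valid = v < V_MAX
    d[valid] = 8.0 * B * FX / (V_MAX - v[valid])
    return d


if __name__ == "__main__":
    raw = [V_MIN, 600, 900, 1000, 1040, V_MAX]
    for v, d in zip(raw, raw_to_meters(raw)):
        print(f"v = {v:4d} -> d = {d:.2f} m")
```

Under these assumptions, equal steps in the raw value cover ever larger distance intervals as the range grows, which is the resolution loss discussed in this section.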

Figure 2: Characteristics of Kinect depth data. The blue curve shows the relationship between pixel values in the depth map and distance in meters. The red line marks the sensor's minimum measurable depth. The green area is the range recommended in the Kinect manual, and the yellow area is the range used in this article for human detection. Note that we detect people at almost four times the recommended range, where the depth resolution becomes quite coarse.

In this article we detect people at ranges of 0 to 8 meters, almost four times the recommended range, which makes the loss of depth resolution a challenge. 86.9% of the depth values encode distances between 0 and 2.5 meters, leaving only 140 values for distances between 2.5 and 8 meters. This effect, which follows from the hyperbolic nature of equation (1), is clearly visible in the point clouds of two people at different distances in Figure 3. At about 2 meters the shape of the person is captured in detail, while the distant person is described only coarsely, by a few points. The sensor thus loses much of the 3D geometric information of distant targets.

Another effect, especially at long range, is sensitivity to surface material. Strongly IR-absorbing surfaces return very little of the emitted infrared light, so depth information is lost in large blocks, as shown in the right image of Figure 3.

Figure 3, left: Hyperbolic loss of resolution. Side view of two people at different distances from the sensor. The nearby person is described precisely and in detail. The farther away a person is, the coarser the quantization and the greater the loss of shape information; geometric methods for human detection perform poorly on such data. Right: Distant IR-absorbing surfaces can cause large amounts of missing depth data (such as the upper body of the leftmost person; white denotes missing depth).

3 Using RGB-D data for human detection

This section presents the detectors. We first summarize the HOG detector for ordinary images, then introduce HOD, a new HOG-inspired method for dense depth data, and finally introduce Combo-HOD, which combines the two kinds of data.

A. HOG: Histograms of Oriented Gradients

The HOG method proposed by Dalal and Triggs [6] is the most widely used method for visual human detection [9][10]. It uses a fixed-size detection window divided into a grid of cells. The gradient orientations of the pixels in each cell are accumulated in a one-dimensional histogram. The intuition is that local appearance and shape are well described by the distribution of local gradients, without knowing the exact position of those gradients within the cell. Groups of cells are combined into blocks, over which local contrast is normalized. The histograms of all blocks are concatenated into the descriptor vector of the detection window, which is used to train a linear SVM classifier. For detection, the window slides over the image at different scales, the HOG descriptor is computed at each position and scale, and the trained SVM classifies it; see [6] for details.
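The per-cell histogram step can be sketched as follows. This is a simplified illustration, not the reference implementation of [6]: the cell size of 8 pixels, the 9 orientation bins and the function name `cell_histograms` are assumptions, and votes are not interpolated between neighboring bins.

```python
import numpy as np


def cell_histograms(image, cell=8, bins=9):
    """Per-cell histograms of unsigned gradient orientation, weighted by magnitude."""
    image = np.asarray(image, dtype=np.float64)
    gy, gx = np.gradient(image)                      # derivatives along rows, columns
    magnitude = np.hypot(gx, gy)
    angle = np.rad2deg(np.arctan2(gy, gx)) % 180.0   # unsigned orientation in [0, 180)
    bin_idx = np.minimum((angle / (180.0 / bins)).astype(int), bins - 1)

    rows, cols = image.shape[0] // cell, image.shape[1] // cell
    hist = np.zeros((rows, cols, bins))
    for r in range(rows):
        for c in range(cols):
            sl = (slice(r * cell, (r + 1) * cell), slice(c * cell, (c + 1) * cell))
            # Each pixel votes for its orientation bin with its gradient magnitude.
            hist[r, c] = np.bincount(bin_idx[sl].ravel(),
                                     weights=magnitude[sl].ravel(),
                                     minlength=bins)
    return hist
```

The same function can be applied to a preprocessed depth image instead of an intensity image, which is the basic idea behind the depth-based descriptor introduced next.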

B. HOD: Histograms of Oriented Depths

Based on the idea of HOG, we propose HOD, a new human detector for dense depth data.

1) Operating principle: HOD follows the same processing pipeline as HOG, applied to the depth image. A fixed-size window is divided into cells, a descriptor is computed for each cell, and the oriented depth gradients are accumulated in a one-dimensional histogram. Four cells form a block, and blocks are normalized with the L2-Hys method, which improves robustness to depth noise. The intuition is that the array of local depth changes describes local 3D shape and appearance well. The resulting HOD feature vector is used to train a soft linear SVM classifier with the two-stage training method given in [6].
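The block normalization mentioned above can be sketched as follows. L2-Hys normalizes a block vector to unit length, clips large components and normalizes again. The clipping threshold of 0.2 is the commonly used value and is an assumption here, as are the helper names.

```python
import numpy as np


def l2_hys(block, clip=0.2, eps=1e-6):
    """L2-Hys: L2-normalize, clip large components, then L2-normalize again."""
    v = np.asarray(block, dtype=np.float64).ravel()
    v = v / np.sqrt(np.sum(v ** 2) + eps ** 2)
    v = np.minimum(v, clip)          # histogram entries are non-negative
    return v / np.sqrt(np.sum(v ** 2) + eps ** 2)


def block_descriptor(cell_hist, r, c):
    """Concatenate and normalize the 2x2 group of cells whose top-left cell is (r, c)."""
    return l2_hys(cell_hist[r:r + 2, c:c + 2])
```

Clipping limits the influence of a single dominant gradient, which is one plausible reason why this normalization helps with noisy depth data.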

2) Depth image preprocessing: As discussed in Section 2, the raw depth map encodes metric distance very unevenly. For distant targets, a single step in the depth value can correspond to 15 cm. This matters for the HOG/HOD framework because the blocks around the target contour carry a large weight in the result, in particular the blocks with the highest positive weights of the SVM hyperplane. We therefore preprocess the raw depth map with equation (1) to strengthen the separation of foreground and background. To improve the numerical stability of the gradient computation, the resulting distance in meters is multiplied by M/Dmax, where M = 100 is a constant gain and Dmax = 20 is the maximum distance in meters. This preprocessing step is similar in spirit to gamma correction for contrast enhancement: knowledge about the sensor is used to remove its nonlinear effects from the model.
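The scaling step is simple enough to write down directly. The sketch below assumes the depth map has already been converted to meters; the function name is hypothetical, and the two constants are the ones quoted in the text.

```python
import numpy as np

M = 100.0      # constant gain (from the text)
D_MAX = 20.0   # maximum distance in meters (from the text)


def scale_metric_depth(depth_m):
    """Multiply metric depth by M / D_MAX; NaN (missing depth) stays NaN."""
    return np.asarray(depth_m, dtype=np.float64) * (M / D_MAX)
```

With these constants the gain is 5, so a person at 8 m ends up at a value of 40 and the maximum distance of 20 m maps to 100.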

3) Depth-informed scale-space search: Most visual detection methods, including HOG, search the image scale space to find targets. In HOD, depth information can guide this search, which makes it more efficient and more accurate.

Our idea for improving the search is a method that quickly determines the appropriate scale for each position in the depth map. First, the average human height Hm is computed from the training dataset, in which the ground position and the height of each sample are precisely labeled. This information is used to compute a scale-to-depth mapping (shown in Figure 4):

Here Fy is the vertical focal length of the IR camera, Hm = 1.74 m is the average human height, and Hw is the height of the detection window at scale 1 (in meters). The first factor of equation (2) is the image projection of a half-plane of height Hm, perpendicular to the camera, at distance d. To limit memory usage, scales are quantized in steps of 1/3. Computing the scale s for each pixel of the depth map yields a scale map, from which the list S of all scales is obtained. This list contains only the scales at which a person could appear in the image, which avoids an exhaustive search over all scales of the image pyramid.
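A minimal sketch of the scale map and scale list follows. The exact form of equation (2) is not reproduced in this text, so the mapping used here, `s = (fy * H_M / d) / window_height_px`, is an assumption based on the pinhole projection described above, with the window height taken in pixels. The parameter names and the example values in the usage note are hypothetical.

```python
import numpy as np

H_M = 1.74   # average person height in meters (from the text)


def scale_map(depth_m, fy, window_height_px, step=1.0 / 3.0):
    """Quantized detection-window scale for every pixel of a metric depth map.

    ASSUMED form: s = (fy * H_M / d) / window_height_px, i.e. the projected height
    in pixels of an average person at distance d, divided by the window height.
    Pixels without valid depth get scale 0.
    """
    depth_m = np.asarray(depth_m, dtype=np.float64)
    scales = np.zeros(depth_m.shape)
    valid = np.isfinite(depth_m) & (depth_m > 0)
    s = fy * H_M / (depth_m[valid] * window_height_px)
    scales[valid] = np.round(s / step) * step   # quantize in steps of 1/3
    return scales


def scale_list(scales):
    """Sorted list of the distinct non-zero scales present in the scale map."""
    return sorted(float(s) for s in np.unique(scales) if s > 0)
```

For example, `scale_list(scale_map(depth, fy=580.0, window_height_px=128))` would return only the scales at which a person of average height could appear in that particular frame.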

A scale list S is generated for each image, and the scale-space search is then carried out. During the search, a search window is passed to the SVM classifier only if the depth information inside it corresponds to a scale in S.

A naive way to do this is to pick a scale s from S and check whether the depth value at each position in the detection window is compatible with s. This requires scanning every position in the search window to test whether at least one depth value is compatible with s, which is computationally expensive, especially at large scales.

Using integral images [+], we present a method that tests scale compatibility in O(1) time. The integral image is a technique for quickly computing the sum of pixel values in a rectangular region. Each value in the integral image is the sum of the pixels above and to the left of that point in the original image. Building the integral image takes O(n) time, where n is the size of the original image. Its main advantage is that area integrals can be computed with four lookups and a few additions and subtractions. We extend this principle to an integral tensor, a multi-layer integral image with as many layers as there are scales in S. Each layer of the integral tensor is built from a binary image whose non-zero pixels are those at the scale of that layer. This makes it efficient to test whether a given search window contains at least one pixel of a given scale. The integral tensor needs to be built only once per image.

During detection, a scale s is selected from S. For each search window position, the area integral over the window is computed in the layer of the integral tensor that corresponds to s. If the result is greater than 0, at least one depth pixel is compatible with scale s and the HOD descriptor is computed; otherwise the window is skipped and the next window is tested.
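The integral tensor test can be sketched as follows. The function names and the array layout (one zero-padded integral image per scale) are illustrative choices, not details given in the text.

```python
import numpy as np


def build_integral_tensor(scales, scale_values):
    """One zero-padded integral image per scale; layer k counts pixels at scale_values[k]."""
    h, w = scales.shape
    tensor = np.zeros((len(scale_values), h + 1, w + 1), dtype=np.int64)
    for k, s in enumerate(scale_values):
        mask = np.isclose(scales, s).astype(np.int64)   # binary image for this scale
        tensor[k, 1:, 1:] = mask.cumsum(axis=0).cumsum(axis=1)
    return tensor


def window_has_scale(tensor, k, top, left, height, width):
    """O(1) test: does the window contain at least one pixel of scale index k?"""
    ii = tensor[k]
    bottom, right = top + height, left + width
    count = ii[bottom, right] - ii[top, right] - ii[bottom, left] + ii[top, left]
    return count > 0
```

The zero padding in the first row and column removes the special cases at the image border, so every window query is exactly four lookups regardless of window size.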

Figure 4: Quantized regression curve relating metric depth to detection-window scale. The curve is capped at a maximum of 20 to avoid oversized detection windows.

C. Combo-HOD: RGB-D human detection

The two detectors described above consider either the color image or the depth map on its own. To take advantage of RGB-D data, we propose Combo-HOD, a new detection method that combines the two kinds of data. The combination is worthwhile because the modalities are complementary: depth data is invariant to illumination changes but suffers from low return-signal strength and limited resolution, while color image data is rich in color and texture and has high resolution but quickly becomes unusable in non-ideal lighting.

Combo-HOD trains a HOG detector on the image data and, separately, a HOD detector on the depth data. It relies on the depth-informed scale-space search described above: for each detection window at a compatible scale, the HOD descriptor is computed on the depth map and the HOG descriptor is computed on the same window of the color image. When no depth data is available, the detector gracefully degrades to a standard HOG detector. A calibration procedure is required to compute the extrinsic parameters that register the two images.

Once the HOG and HOD descriptors have been classified, the information is fused. The decision function of each detector is the sign of the dot product of the HOD or HOG descriptor with the SVM hyperplane, plus an offset. To fuse the two, we fit a sigmoid function to the output of each SVM, following the method proposed in [Platt], which maps the output values to probabilities. The probability pD from the HOD detector and the probability from the HOG detector are then fused by a weighting filter.

In the fused result, p is the final probability of detecting a person. The weighting depends on the ratio of the number of HOD errors to the number of HOG errors at the equal error rate point on a validation set.
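The two steps, sigmoid calibration and fusion, might look like the sketch below. The sigmoid is the standard Platt form. The fusion rule, in contrast, is an assumption: the article's fusion equation is not reproduced in this text, so `fuse` uses a simple linear blend in which the detector with more validation errors receives less weight. All function and parameter names are hypothetical.

```python
import math


def platt_probability(svm_score, a, b):
    """Platt scaling: map a raw SVM score to a probability with a fitted sigmoid.

    The parameters a and b are fitted on held-out data; a is typically negative.
    """
    return 1.0 / (1.0 + math.exp(a * svm_score + b))


def fuse(p_hod, p_hog, hod_errors, hog_errors):
    """Fuse HOD and HOG probabilities with an ASSUMED error-ratio weighting.

    The detector that made more validation errors gets less weight. If no depth
    data is available (p_hod is None), fall back to the HOG probability alone.
    """
    if p_hod is None:
        return p_hog
    k = hod_errors / (hod_errors + hog_errors)   # weight given to the HOG term
    return p_hod + k * (p_hog - p_hod)
```

The `None` branch mirrors the behavior described above, where the detector degrades to plain HOG when no depth data is available for a window.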

4 Experiments

To compare and evaluate the different detection methods, we collected a large dataset of people indoors. It was recorded in the lobby of a university canteen at lunch time. Additional data was recorded in other university buildings to provide background (negative) samples. This prevents the detector from learning the background of the canteen lobby, especially since the sensor was stationary during recording. The dataset is manually labeled with the bounding box of each target in the 2D depth map and its visibility state (fully visible or partially occluded). A total of 1648 person instances were labeled in 1088 images. The dataset is available from the author's home page.

The evaluation criteria are precision-recall curves and the equal error rate (EER). A detection counts as a true positive when it overlaps a manually labeled target by more than 40%. Following the no-reward-no-penalty principle of [9], a detection that matches a partially occluded person is counted neither as a true positive nor as a false positive.
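The matching rule can be sketched as follows. The text only states that a detection must overlap a labeled target by more than 40%; measuring overlap as intersection over union is an assumption, as are the box format `(x, y, width, height)` and the function names.

```python
def overlap(box_a, box_b):
    """Intersection over union of two boxes given as (x, y, width, height)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def judge(detection, labels, threshold=0.4):
    """Return 'tp', 'fp' or 'ignore' for one detection.

    labels is a list of (box, fully_visible) pairs. A match with a partially
    occluded person is ignored (no reward, no penalty).
    """
    best, best_visible = 0.0, True
    for box, fully_visible in labels:
        o = overlap(detection, box)
        if o > best:
            best, best_visible = o, fully_visible
    if best > threshold:
        return "tp" if best_visible else "ignore"
    return "fp"
```

Detections judged `'ignore'` would simply be left out when precision and recall are accumulated.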

The training set used for all detectors consists of 1030 depth samples of people (plus their horizontally mirrored copies) and 5,000 negative samples randomly selected from the background dataset.

Results

The experiments compare the new HOD detector with other depth-based detection methods, with vision-based detection methods, and with the new multimodal RGB-D detector Combo-HOD.

Given the importance of depth quantization in Kinect data, we evaluated two depth encodings: HOD11, which uses the full available 11-bit depth range, and HOD8, which uses only 8-bit depth data. We also compared the preprocessing of Section 3-B with HOD detectors that use other preprocessing techniques typical in computer vision, including contrast enhancement, intensity equalization (square root, logarithm), and no preprocessing.

The left plot of Figure 5 clearly shows that HOD11 outperforms HOD8 over the entire precision-recall range, indicating that the three additional bits of depth help distinguish people from the background. The choice of preprocessing also affects the results (not shown in Figure 5). For HOD11, the best preprocessing is the one described in Section 3-B, which suggests that a principled approach works better than heuristics. Specifically, HOD11 reaches an EER of 83%, while the best HOD8 variant reaches 75%.

With RGB-D data, a fundamental question is how much depth information adds to purely visual detection. To assess this, we consider the HOG detector on pure RGB data and the Haar-based AdaBoost detector (HA) originally proposed by Viola and Jones []. The left plot of Figure 5 shows that both perform worse than HOD11 and Combo-HOD: the EER of HOG is 73%, and the EER of HA (not shown in Figure 5) is 13%. The main cause is illumination; the ambient lighting in the dataset is poor. In dark areas the Kinect RGB camera automatically lengthens the exposure time to produce brighter images, which blurs moving people. Background areas in direct sunlight produce saturated images with poor contrast. These effects also explain the failure of HA, because Haar wavelets are not invariant to illumination changes. The results show that, for human detection under changing conditions, vision-based detection alone is not enough and depth information should be used to support it.

Figure 5, left: Precision-recall curves of several detection methods. The RGB-D detector Combo-HOD, which combines HOD and HOG, performs best. Two HOD variants are shown, one on 8-bit and one on 11-bit depth data; HOD11 is the best depth-based method. The vision-based HOG detector performs poorly because of the lighting conditions. The BUTD detector performs poorly because of the hyperbolic loss of depth resolution in the Kinect data.

Just as important is the comparison with geometric methods. We therefore compared HOD11 with BUTD [5], a human detector for sparse 3D data such as point clouds from Velodyne sensors. HOD11 performs better (see Figure 5): BUTD reaches an EER of 72%, and comes close only in the high-precision regime, where at a recall of 53% its precision still reaches 98%. BUTD relies heavily on shape information, so it is strongly affected by the loss of resolution at a distance. At close range, where the resolution is high, the two detectors perform similarly, with an EER of 86% (see Figure 5, middle).

Figure 5, middle: Comparison of BUTD and HOD11 within the maximum range of 2.5 meters recommended for the Kinect. The two methods perform similarly.

The right plot of Figure 5 shows the computational performance of the HOD detector. We compared the number of scales tested per image by the depth-informed scale-space search with the number tested without depth information (labeled HOD-). HOD- uses a pyramid search with a scale increment of 5% regardless of image content, whereas HOD derives a scale map from each depth image. Over the images of the entire dataset, the number of tested scales drops by a factor of about three, so the processing time per image also drops by a factor of about three, as shown on the right of Figure 5. The algorithm is implemented entirely on the GPU and processes the Kinect's RGB-D data stream (2 × 640 × 480 at 30 fps) in real time on an Nvidia GTX 480 graphics card.

Figure 5, right: Number of scales tested per image by the depth-informed scale-space search, compared with the number tested without depth information (labeled HOD-). The depth-informed search is about three times faster.

Finally, the Combo-HOD detector presented in this paper performs best of all methods compared. It has the highest EER in Figure 5, 85% (as in the comparisons above, a higher EER value indicates better performance here). This suggests that combining depth and color image information covers a wider range of conditions and makes human detection more reliable. Multimodal data helps where a single-modality detector fails.

Figure 6 shows detection results of the Combo-HOD detector: several people detected at different distances, including cases of partial occlusion and some errors.

Figure 6: Detection results of the Combo-HOD detector on RGB-D data. People are detected under varying partial occlusion in the visual and depth data. Missed detections (false negatives) occur when the data of both sensors is unusable, and false positives occur when both kinds of data contain clutter. In the third column, the person is still detected even though no depth data is available. Our approach relies on neither background learning nor ground plane estimation.

5 Summary

This paper introduced Combo-HOD, a new method for detecting people in RGB-D data. We described the characteristics of the Kinect data used, which may guide follow-up research. The Histogram of Oriented Depths (HOD) encodes the direction of local depth changes, and the depth-informed scale-space search achieves a threefold speedup. Combining HOD and HOG gives Combo-HOD, a method for human detection in RGB and depth data. At ranges up to four times the Kinect's recommended operating space, it reaches an EER of 85%. Comparative experiments analyze what depth data contributes relative to purely visual methods and shape-based 3D methods. Combo-HOD outperforms the other detection methods and achieves real-time detection at 30 fps on a GPU.

Paper download: http://download.csdn.net/detail/masikkk/6947075

Personal page of the author, Dr. Luciano Spinello: http://www.informatik.uni-freiburg.de/~spinello/index.html

Download of the RGB-D dataset used in the paper: http://www.informatik.uni-freiburg.de/~spinello/RGBD-dataset.html
