Paper details
Gesture recognition, or hand tracking, is very important in Human-Computer Interaction and has been studied for decades. However, many difficulties remain: hand motion is made up of many complex finger articulations, and at the same time the hand moves quickly under large viewpoint changes.
Prior work roughly falls into two camps. One camp uses a very detailed mesh model of the hand (I don't know how such mesh models are built); its limitation is that it relies on local optimization and is also very slow. The other uses a polygonal model and can achieve real-time results, but it needs GPU processing.
The "local optimization" mentioned above means that once the optimizer falls into a local optimum, it cannot recover from the resulting tracking error.
There are also global methods that search the entire parameter space, but they are obviously quite slow and often give relatively poor results. Some solutions combine global parameter search with local optimization, but they rely on auxiliary devices such as color gloves and multiple cameras. Other approaches reduce the number of degrees of freedom (DOFs) or require a fixed viewing angle.
This paper, however, remains highly accurate even in complex situations; it can be regarded as the best method available at the time.
Core:
The basic framework is: local optimization plus initialization by part detection. The hand model is a set of spheres that approximates the shape of the hand, and a cost function measures the distance between this model and a sparse point cloud.
So we need to understand the following pieces: the hand model, the hybrid optimization (the authors combine the advantages of several optimization methods), part detection with part-based initialization, and the cost function used here.
The hand model is illustrated in a figure in the paper (not reproduced here).
Panel (a) shows the hand motion model. The authors build a kinematic model of the hand with 26 degrees of freedom (DOFs): 6 dimensions represent the global hand pose, and the remaining 20 give 4 dimensions to each finger. (To be honest, I don't quite understand what each of these finger dimensions refers to.) These 26 motion parameters are collected into a vector θ.
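To make the 26 dimensions concrete, here is a minimal sketch of how θ could be laid out, assuming the 6 global dimensions are a 3D translation plus a 3D rotation and each finger contributes 4 joint angles (a common convention: 2 at the base joint, 1 at each of the two remaining joints). The exact ordering here is my assumption, not taken from the paper.

```python
import numpy as np

# Sketch of one possible layout of the 26-dimensional pose vector theta.
# Assumption: 6 global DOFs (3 translation + 3 rotation) followed by
# 4 DOFs per finger; ordering and joint split are illustrative only.
FINGERS = ["thumb", "index", "middle", "ring", "pinky"]

def split_pose(theta):
    """Split a 26-vector into named global and per-finger parameters."""
    theta = np.asarray(theta, dtype=float)
    assert theta.shape == (26,)
    pose = {
        "global_translation": theta[0:3],  # x, y, z of the wrist/palm
        "global_rotation":    theta[3:6],  # e.g. Euler angles of the hand
    }
    for i, name in enumerate(FINGERS):
        j = 6 + 4 * i
        # Typical convention: 2 DOFs at the base joint (flex + spread),
        # 1 DOF each at the two distal joints.
        pose[name] = theta[j:j + 4]
    return pose

theta0 = np.zeros(26)                   # a "rest" pose, all angles zero
print(split_pose(theta0)["index"])      # -> the 4 DOFs of the index finger
```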
Panels (b, c) show the geometric model of the hand: (b) is a mesh template used in other papers, and (c) is the sphere approximation of (b). 48 spheres are used in total: 6 per finger, except 8 for the thumb (I only count 7 in the figure), and the remaining 16 for the palm. The radii of these spheres together with their centers constitute the geometric model of the hand.
This geometric model is written sphere by sphere: each sphere is a pair (C(θ), r), where C(θ) is its center (sometimes the θ is dropped for simplicity) and r is its radius, which is fixed. [The relationship between the centers and the preceding 26 dimensions confused me at first: the centers are computed from θ through the hand's kinematic chain, which is what ties the motion model to the geometry.]
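As a minimal sketch of the point-to-model distance that the cost function is built on: for each point of the cloud, take the distance to the surface of the nearest sphere and sum over points. This is only the simplest data term and assumes the sphere centers have already been computed from θ; the paper's actual cost function (analyzed in part 2) has additional terms.

```python
import numpy as np

def point_to_model_cost(points, centers, radii):
    """Distance term between a 3D point cloud and a sphere-based hand model.

    points  : (N, 3) array, the sparse 3D point cloud P
    centers : (48, 3) array, sphere centers C(theta) from the kinematic chain
    radii   : (48,) array, fixed sphere radii r

    For each point, the distance to the closest sphere *surface* is
    | ||p - c|| - r |, minimized over spheres; these are summed over points.
    This is only the point-to-model data term, not the paper's full cost.
    """
    # (N, 48) pairwise distances from points to sphere centers
    d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    surface_dist = np.abs(d - radii[None, :])   # distance to each sphere surface
    return surface_dist.min(axis=1).sum()       # nearest sphere per point, summed
```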
The data section of the paper is not easy to follow. Roughly: an Intel depth camera provides the depth information, and a black band is wrapped around the wrist so that the hand region can be segmented reliably. A median filter is then applied to the hand region, followed by a morphological opening (erosion followed by dilation; it removes small objects, separates objects connected at thin necks, and smooths the boundary of large objects without noticeably changing their area). This yields a depth map D, which is converted into a 3D point cloud denoted P.
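A rough sketch of this preprocessing, assuming an OpenCV/NumPy setup and pinhole intrinsics fx, fy, cx, cy for the depth camera; how the wrist-band mask itself is obtained is not shown here and is left as an input.

```python
import cv2
import numpy as np

def preprocess_depth(depth_mm, hand_mask, fx, fy, cx, cy):
    """Clean a hand-region depth map and back-project it to a 3D point cloud P.

    depth_mm  : (H, W) uint16 depth image in millimetres
    hand_mask : (H, W) uint8 binary mask of the hand region (e.g. found via
                the black wrist band); mask extraction is assumed done elsewhere
    fx, fy, cx, cy : pinhole intrinsics of the depth camera (placeholders)
    """
    # Median filter suppresses isolated depth noise inside the hand region.
    depth = cv2.medianBlur(depth_mm, 5)

    # Morphological opening (erosion then dilation) removes small speckles and
    # smooths the mask boundary without changing its overall area much.
    kernel = np.ones((5, 5), np.uint8)
    mask = cv2.morphologyEx(hand_mask, cv2.MORPH_OPEN, kernel)

    # Back-project every valid hand pixel (u, v, depth) to a 3D point.
    v, u = np.nonzero((mask > 0) & (depth > 0))
    z = depth[v, u].astype(np.float32) / 1000.0   # mm -> metres
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)            # the point cloud P, shape (N, 3)
```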
Hybrid Optimization
First a pose must be initialized (from the previous frame or from the current frame), and then it can be tracked by locally optimizing cost equation (1). The iterated closest point (ICP) method is the usual choice for fitting such point models.
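As a toy illustration of "local optimization around an initial pose" (not the paper's actual ICP or PSO update), a simple random-perturbation hill climb on the cost function already captures the idea of refining a pose locally:

```python
import numpy as np

def local_refine(theta, cost_fn, step=0.02, iters=30, rng=None):
    """Illustrative stand-in for local refinement around an initial pose.

    theta   : initial 26-dim pose as a NumPy array (previous frame or detection)
    cost_fn : the model-vs-point-cloud cost to minimize
    This hill climb only keeps perturbations that lower the cost; the paper's
    ICP/PSO machinery is far more effective, this just shows the principle.
    """
    rng = rng or np.random.default_rng(0)
    best, best_cost = theta.copy(), cost_fn(theta)
    for _ in range(iters):
        cand = best + rng.normal(scale=step, size=best.shape)  # small random step
        c = cost_fn(cand)
        if c < best_cost:                                      # keep improvements only
            best, best_cost = cand, c
    return best, best_cost
```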
Hand initialization:
Here the authors propose a simple and robust initialization. Let F be the number of detected fingers. A detected finger is assumed to be straight (2 DOFs), and an undetected finger is assumed to be fully bent (0 DOFs). Such a pose therefore has 2F + 6 DOFs; denote it θ'.
Several candidate poses are then sampled at random in this reduced space, and the one that scores best under the cost function is kept as the initialization.
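A sketch of that sampling step, assuming a helper expand_fn that turns a reduced (2F + 6)-dimensional θ' back into a full 26-dimensional pose (detected fingers straight, undetected fingers fixed bent); both the helper and the sampling range are my placeholders, not the paper's exact procedure.

```python
import numpy as np

def sample_initializations(detected_fingers, cost_fn, expand_fn,
                           n_samples=64, rng=None):
    """Sample candidate poses in the reduced (2F + 6)-dim space, keep the best.

    detected_fingers : list of finger indices found by the part detector
    cost_fn          : the model-vs-point-cloud cost to minimize
    expand_fn        : assumed helper mapping (theta_prime, detected_fingers)
                       to a full 26-dim pose, with detected fingers straight
                       (2 free DOFs each) and undetected fingers fixed bent
    """
    rng = rng or np.random.default_rng(0)
    dim = 2 * len(detected_fingers) + 6
    best, best_cost = None, np.inf
    for _ in range(n_samples):
        theta_prime = rng.uniform(-1.0, 1.0, size=dim)   # illustrative range
        theta = expand_fn(theta_prime, detected_fingers)
        c = cost_fn(theta)
        if c < best_cost:
            best, best_cost = theta, c
    return best
```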
The authors also acknowledge that this initialization is not very effective for complicated gestures, but it is still the best method available.
Summary:
This paper uses Intel's gesture camera to obtain depth images and converts them into a 3D point cloud. A cost function is introduced. Finger positions are then detected (using a 2D + 1D representation), and the detected fingers form constraints for initializing the pose. The cost function is minimized by the ICP-PSO algorithm, which matches the user's 48-sphere hand model to the point cloud in real time and yields the estimated pose.
Next, we will analyze the cost function and the algorithm in detail.
This section is complete.
How to Learn about gesture tracking: Realtime and robust hand tracking from depth (2)