The authors have released runnable code for this paper: https://github.com/CMU-Perceptual-Computing-Lab/caffe_rtpose
The following is my understanding of the paper. If you find errors, discussion is welcome; if you repost, please cite the source.
1. Introduction
The challenges of pose estimation:
1〉 The number of people in an image, their positions, and their scales are all unknown.
2〉 Interactions between people (contact, occlusion) make the problem complicated.
3〉 Real-time requirements: the more people in the image, the greater the computational cost.
A common approach: person detection + pose estimation for each person (top-down).
Problems: 1〉 If the person detector fails, there is no recovery (the detector easily fails when people are close together).
2〉 Runtime is proportional to the number of people: the more people, the more time it takes.
Bottom-up approaches do not have these two problems.
However, bottom-up approaches do not directly benefit from global information, so the key is to exploit contextual cues from other body parts and other people.
This paper uses a bottom-up method, but makes use of global contextual information in the detection of parts and their association.
This paper presents Part Affinity Fields (PAFs), a set of 2D vector fields. Each 2D vector field encodes the position and orientation of a limb.
These fields (which encode part connections and orientations) and the confidence maps for parts (joints) are learned and predicted jointly through a sequential prediction framework.
Confidence maps for parts and part affinity fields are 2D spatial grids that can express the unstructured, multimodal uncertainty that arises due to occlusion and contact, and they can be analyzed with convolutions.
-------
I do not understand the following sentence:
As the confidence maps and affinity fields encode global context in their prediction, they allow an efficient algorithm that uses greedy association over a minimum spanning tree without significant loss in the quality of pose.
------
3. Method
3.1. Confidence Maps for part detection
Each body part j has its own confidence map, so there are as many confidence maps as there are part (joint) types.
Every point in the image has a confidence value, and together these values form the confidence map.
The value at each point depends on its distance from the ground-truth position: the closer the point, the higher the confidence.
This is modeled with a Gaussian distribution whose peak lies at the ground-truth position.
Assuming there are k people in the image, each person has their own confidence map for a given part. The k maps are merged into a single confidence map by taking, at each point, the maximum confidence over all people.
Looking at Figure 2b should make the meaning of the max clear; it feels a bit like an intersection.
The paper says that max is used instead of average so that precision is unaffected even when multiple peaks are close together, as shown in Figure 3a.
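To make the max-versus-average point concrete, here is a minimal sketch in plain Python. It is my own illustration, not the authors' code: the function name `part_confidence_map`, the grid size, and the value of `sigma` are assumptions, and the Gaussian is unnormalized so that each peak has value 1.

```python
import math

def part_confidence_map(width, height, keypoints, sigma=1.0):
    """Confidence map for one part type: per-person Gaussians merged by max.

    keypoints: list of (x, y) ground-truth positions, one per visible person.
    sigma: assumed spread of each peak (a free parameter in this sketch).
    """
    conf = [[0.0] * width for _ in range(height)]
    for kx, ky in keypoints:
        for y in range(height):
            for x in range(width):
                dist_sq = (x - kx) ** 2 + (y - ky) ** 2
                value = math.exp(-dist_sq / sigma ** 2)
                # max (not average) keeps nearby peaks at full height
                conf[y][x] = max(conf[y][x], value)
    return conf

# Two people whose keypoints of the same type are 3 pixels apart.
conf = part_confidence_map(width=8, height=5, keypoints=[(2, 2), (5, 2)])
print(conf[2][2], conf[2][5])  # 1.0 1.0
```

With an average instead of a max, each of the two peaks would drop to roughly 0.5, which is the loss of precision the paper wants to avoid.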
At test time, body part candidates are obtained by performing non-maximum suppression on the predicted confidence maps.
--
[PS: Non-maximum suppression, NMS for short, is an effective way to find local maxima. It is widely used in object detection, localization, and related fields. Whether the localization pipeline uses SW (sliding window) or SS (Selective Search), it produces a lot of candidate regions. Simply put, among candidates that overlap, NMS keeps the one with the highest confidence as the final result, and candidates that do not overlap are kept directly.]
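On a confidence map, NMS amounts to keeping the local maxima. The following is a minimal sketch under my own assumptions: the name `find_peaks`, the 4-neighbour test, and the threshold value are illustrative and not taken from the paper or its code.

```python
def find_peaks(conf, threshold=0.1):
    """Return (x, y, score) for every local maximum above threshold.

    A pixel is kept if it is strictly greater than its 4 neighbours;
    everything else is suppressed.
    """
    height, width = len(conf), len(conf[0])
    peaks = []
    for y in range(height):
        for x in range(width):
            value = conf[y][x]
            if value <= threshold:
                continue
            neighbours = [
                conf[ny][nx]
                for nx, ny in ((x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1))
                if 0 <= nx < width and 0 <= ny < height
            ]
            if all(value > n for n in neighbours):
                peaks.append((x, y, value))
    return peaks

conf = [
    [0.0, 0.1, 0.0, 0.0, 0.0],
    [0.1, 0.9, 0.2, 0.3, 0.1],
    [0.0, 0.2, 0.3, 0.8, 0.2],
    [0.0, 0.0, 0.1, 0.2, 0.0],
]
print(find_peaks(conf))  # [(1, 1, 0.9), (3, 2, 0.8)]
```

Each surviving peak is one detection candidate for that body part; with several people in the image there are several peaks per map.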
3.2. Part Affinity Fields for part Association
Given the detected body parts, how do we assemble them into full-body poses when the number of people is unknown (that is, which parts belong to the same person)?
Idea 1 (not used in the end):
For any two body parts, we need a confidence measure of whether they are connected (whether they belong to the same person).
One option is to take n points along the line between the two body parts and use a confidence map at those points as the measure.
The confidence map S_c of limb c is synthesized by taking the largest value in the confidence maps of the n points on the limb.
Using midpoints can cause spatial ambiguity when multiple people overlap; Fig. 4b shows the case n=1.
The limitation of this method is that only location information is used, while the orientation of the limb is ignored.
To solve these problems, the part affinity field is proposed.
Idea 2: part affinity fields (the core contribution of this paper)
Benefit: both location and orientation information are used.
Each limb type has an affinity field between its two associated body parts, and each point on the limb has a 2D vector describing the limb's direction.
The dimensions of the affinity field map are w*h*2 (because the vectors are two-dimensional).
If the limbs of several people overlap at a point, the vectors of those k people are summed and divided by their number.
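The construction of the ground-truth field, including the averaging where people overlap, can be sketched as follows. This is my own illustration: the name `limb_paf`, the `limb_width` tolerance, and the assumption that the two joints of a limb never coincide are mine, and plain lists stand in for the w*h*2 array.

```python
import math

def limb_paf(width, height, limbs, limb_width=1.0):
    """Ground-truth PAF for one limb type, averaged over overlapping people.

    limbs: list of ((x1, y1), (x2, y2)) joint pairs, one per person
           (the two joints are assumed to be distinct points).
    Returns a height x width grid of (vx, vy) vectors.
    """
    sums = [[[0.0, 0.0] for _ in range(width)] for _ in range(height)]
    counts = [[0] * width for _ in range(height)]
    for (x1, y1), (x2, y2) in limbs:
        length = math.hypot(x2 - x1, y2 - y1)
        vx, vy = (x2 - x1) / length, (y2 - y1) / length  # unit direction
        for y in range(height):
            for x in range(width):
                dx, dy = x - x1, y - y1
                along = vx * dx + vy * dy         # position along the limb
                across = abs(-vy * dx + vx * dy)  # distance from the limb axis
                if 0 <= along <= length and across <= limb_width:
                    sums[y][x][0] += vx
                    sums[y][x][1] += vy
                    counts[y][x] += 1
    return [
        [(s[0] / n, s[1] / n) if n else (0.0, 0.0)
         for s, n in zip(sum_row, count_row)]
        for sum_row, count_row in zip(sums, counts)
    ]

# One horizontal limb and one vertical limb crossing at (3, 3).
paf = limb_paf(7, 7, [((1, 3), (5, 3)), ((3, 1), (3, 5))], limb_width=0.5)
print(paf[3][1])  # (1.0, 0.0): only the horizontal limb covers this point
print(paf[3][3])  # (0.5, 0.5): average of (1, 0) and (0, 1)
```

Points that lie on no limb keep the zero vector, so the field carries both where the limb is and which way it points.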
At test time, the confidence score of a candidate limb is computed as follows:
Measure how well the predicted PAF vectors align with the direction of the candidate limb (the agreement of directions is computed with a dot product).
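A minimal sketch of this score, assuming the alignment is averaged over a few evenly spaced sample points on the segment between the two candidates. The name `limb_score`, the nearest-pixel lookup, and `num_samples` (which must be at least 2 here) are my own choices for illustration.

```python
import math

def limb_score(paf, start, end, num_samples=10):
    """Average dot product between the PAF and the candidate limb direction.

    paf: height x width grid of (vx, vy) vectors for one limb type.
    start, end: (x, y) positions of the two candidate parts.
    """
    (x1, y1), (x2, y2) = start, end
    length = math.hypot(x2 - x1, y2 - y1)
    if length == 0:
        return 0.0
    ux, uy = (x2 - x1) / length, (y2 - y1) / length
    total = 0.0
    for i in range(num_samples):
        t = i / (num_samples - 1)
        # nearest pixel to the sample point on the segment
        px = int(round(x1 + t * (x2 - x1)))
        py = int(round(y1 + t * (y2 - y1)))
        vx, vy = paf[py][px]
        total += vx * ux + vy * uy
    return total / num_samples

# A field pointing right along row 1, zero elsewhere.
paf = [[(1.0, 0.0) if y == 1 else (0.0, 0.0) for x in range(6)] for y in range(4)]
print(limb_score(paf, (0, 1), (5, 1)))  # 1.0: aligned with the field
print(limb_score(paf, (5, 1), (0, 1)))  # -1.0: same segment, opposite direction
print(limb_score(paf, (0, 0), (0, 3)))  # 0.0: perpendicular to the field
```

The sign matters: two candidates that lie on a limb but in the wrong order get a negative score, which is exactly the orientation information that Idea 1 was missing.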
3.3. Multi-person parsing using PAFs
This section is about how to perform inference once the confidence maps and part affinity fields are available; this is the bottom-up parsing step.
Define some notation first:
By applying non-maximum suppression to the confidence maps, we obtain multiple detection candidates for each body part. (There are several people in the image, so there are multiple detection candidates.)
Let d_j^m be the location of the m-th detection candidate of body part j.
Below, z represents the connection relationship, and the goal is to find the optimal set of connections.
The problem of finding the optimal connections between a pair of body part types
becomes a maximum weight bipartite graph matching problem, as shown in Figure 4a.
----
About maximum weight bipartite graph matching problem:
See http://www.csie.ntnu.edu.tw/~u91029/Matching.html for the concept. There are many concrete solution methods; the Hungarian algorithm used in this paper is the one described under that name at the link.
[In addition, the paper cites: D. B. West et al. Introduction to Graph Theory, volume 2. Prentice Hall, Upper Saddle River, 2001.]
----
So, once it is cast as a graph problem, it can be understood like this:
The nodes of the graph are the body part detection candidates,
the edges of the graph are all possible connections between body part candidates,
and the weight on each edge is the part affinity aggregate calculated by Formula 7.
A matching in a bipartite graph is a subset of the edges chosen in such a way that no two edges share a node.
The goal is to find the matching whose edges have the maximum total weight.
The following is a mathematical expression.
This paper uses the Hungarian algorithm to obtain the maximum weight matching.
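To show what a maximum weight matching means here, the sketch below solves a tiny instance by exhaustive search. Note that this is deliberately not the Hungarian algorithm: it is a brute-force stand-in that gives the same answer on small inputs. The name `best_matching`, the rule that only positive scores may be connected, and the score values are assumptions for illustration.

```python
def best_matching(scores):
    """Exhaustive maximum weight bipartite matching for tiny inputs.

    scores[m][n]: affinity between candidate m of one part type and
    candidate n of the other. Returns (total_weight, [(m, n), ...]).
    """
    num_right = len(scores[0]) if scores else 0

    def search(m, used):
        if m == len(scores):
            return 0.0, []
        # Option 1: leave candidate m unmatched.
        best = search(m + 1, used)
        # Option 2: match it with any free candidate on the other side.
        for n in range(num_right):
            if n not in used and scores[m][n] > 0:
                weight, pairs = search(m + 1, used | {n})
                if weight + scores[m][n] > best[0]:
                    best = (weight + scores[m][n], [(m, n)] + pairs)
        return best

    return search(0, frozenset())

# Rows: 2 candidates of one part; columns: 2 candidates of the other
# (made-up scores).
scores = [
    [0.9, 0.8],
    [0.7, 0.1],
]
total, pairs = best_matching(scores)
print(pairs)  # [(0, 1), (1, 0)]
```

Picking the single best edge first (0.9) would force the second candidate onto the 0.1 edge for a total of 1.0, while the matching found here has total weight 1.5. No node appears in two edges, which is the constraint that defines a matching.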
The problem of finding the full-body poses of multiple people becomes:
finding a maximum weight clique partition in a K-partite graph, as shown in Figure 5a.
(In fact, the paper does not optimize over the whole graph but simplifies the problem. What follows explains why the simplification is reasonable.)
-----
PS: In English, "maximal clique" and "maximum clique" mean completely different things.
A clique is a subgraph of an undirected graph in which there is an edge between any two vertices.
A maximal clique is a clique that is not contained in any larger clique; in other words, there is no further vertex that has an edge to every vertex in the clique.
The size of a clique is the number of vertices it contains; a clique of size k is called a k-clique.
A maximum clique is the maximal clique with the largest size in the graph.
--------
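The difference between maximal and maximum is easy to check on a toy graph. The brute-force enumeration below is only my own illustration of the definitions (the function names and the example graph are made up), and it is far too slow for real graphs.

```python
from itertools import combinations

def is_clique(nodes, edges):
    """True if every pair of the given nodes is joined by an edge."""
    return all(frozenset(pair) in edges for pair in combinations(nodes, 2))

def maximal_cliques(vertices, edges):
    """All cliques that cannot be extended by another vertex (brute force)."""
    result = []
    for size in range(1, len(vertices) + 1):
        for nodes in combinations(vertices, size):
            if not is_clique(nodes, edges):
                continue
            extendable = any(
                is_clique(nodes + (v,), edges) for v in vertices if v not in nodes
            )
            if not extendable:
                result.append(set(nodes))
    return result

# A triangle a-b-c with a tail c-d-e.
vertices = ["a", "b", "c", "d", "e"]
edges = {
    frozenset(e)
    for e in [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"), ("d", "e")]
}
cliques = maximal_cliques(vertices, edges)
print(len(cliques))           # 3 maximal cliques: {c, d}, {d, e}, {a, b, c}
print(max(cliques, key=len))  # the maximum clique is the triangle a, b, c
```

Here {c, d} is maximal (nothing can be added to it) but not maximum, because the triangle is larger.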
Solving the maximum weight clique partition above: this problem is NP-hard, and many relaxations exist.
---
PS: For NP-hardness, see http://blog.csdn.net/bitcarmanlee/article/details/51935400
In lay terms, an NP problem is one where the correctness of a solution is easy to check, where "easy" means that a polynomial-time verification algorithm exists.
----
Two relaxations are applied to the optimization in this paper:
1. Choose a minimal number of edges to form a tree skeleton of the human pose instead of using the entire graph.
2. Decompose the clique partition problem into a set of bipartite matching subproblems, and determine the matching between adjacent tree nodes independently.
The paper then discusses why this minimal greedy algorithm (the final result is composed of the optimum of each small step) still incorporates global inference over multiple people: roughly, because the CNN has a large receptive field, the global information is already in the predictions.
(My understanding here is rather superficial: the search for an optimum over the whole graph is simplified into finding the optimal connections for each pair of parts. As long as each individual limb is optimal, the combination is taken to be optimal.)
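Under these two relaxations, the optimization problem reduces to maximizing the matching of each limb type independently and summing the results: max_Z E = sum over c = 1..C of max_{Z_c} E_c.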
So, using Formulas (8)-(10), we can obtain the correct connection candidates for each limb type in turn.
Then the limbs that share the same part candidate are assembled together to obtain the full-body poses.
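The assembly step can be sketched as follows. This is a simplified illustration of my own, assuming the limbs are visited in tree order so that the first part of each limb has already been assigned to a person when it exists; the name `assemble_people` and the part names are hypothetical.

```python
def assemble_people(limb_connections):
    """Merge per-limb connections that share a part candidate into people.

    limb_connections: list of ((part_a, idx_a), (part_b, idx_b)) pairs, where
    idx is the index of the detection candidate for that part type. Limbs are
    assumed to be listed in tree order, starting from the root part.
    Returns a list of dicts mapping part name -> candidate index.
    """
    people = []
    for (part_a, idx_a), (part_b, idx_b) in limb_connections:
        for person in people:
            if person.get(part_a) == idx_a:
                # This limb continues an existing person's skeleton.
                person[part_b] = idx_b
                break
        else:
            # No person owns this part candidate yet: start a new one.
            people.append({part_a: idx_a, part_b: idx_b})
    return people

connections = [
    (("neck", 0), ("shoulder", 1)),
    (("neck", 1), ("shoulder", 0)),
    (("shoulder", 1), ("elbow", 0)),
    (("shoulder", 0), ("elbow", 1)),
]
people = assemble_people(connections)
print(people)
# [{'neck': 0, 'shoulder': 1, 'elbow': 0}, {'neck': 1, 'shoulder': 0, 'elbow': 1}]
```

Because each limb type was matched independently, the shared part candidate is the only thing that ties two limbs to the same person.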
3.4. Joint Learning part detection and association with sequential prediction
The diagram above shows the network structure of this paper. The network is divided into multiple stages, with intermediate supervision at the end of each stage. In the first stage, the first 10 layers produce the image features F, which are reused by the subsequent stages. During training, these first 10 layers are initialized from VGG-19. After these 10 layers, the network splits into 2 branches, each with 5 layers before the loss. S is the set of confidence maps (J maps, each of size w' * h', where J is the number of body part types), and L is the set of PAFs (C fields, each of size w' * h' * 2, where C is the number of limb types; the paper does not state the *2 here, but judging from Formula 6 it should be w*h*2). After each stage, S and L are concatenated with the stage-1 features F as the input to the next stage.
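The data flow between stages can be sketched without any deep learning library by tracking channels only. Everything here is a toy stand-in of my own: `toy_stage` replaces the two CNN branches, channels are represented by labels, and the counts (4 feature channels, 2 parts, 3 limbs) are made up.

```python
def run_stages(features, stages):
    """Chain stages; each later stage sees F plus the previous S and L."""
    outputs = []
    stage_input = list(features)
    for stage in stages:
        conf_maps, pafs = stage(stage_input)
        outputs.append((conf_maps, pafs))  # every stage output gets a loss
        stage_input = list(features) + conf_maps + pafs  # channel concatenation
    return outputs

input_sizes = []

def toy_stage(channels):
    """Hypothetical stand-in for the two branches of one stage."""
    input_sizes.append(len(channels))
    num_parts, num_limbs = 2, 3
    return ["S"] * num_parts, ["L"] * (2 * num_limbs)

outputs = run_stages(["F"] * 4, [toy_stage] * 3)
print(input_sizes)  # [4, 12, 12]
```

The first stage sees only F, while every later stage sees F together with J confidence channels and 2*C PAF channels from the stage before it.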
Loss function: the L2 loss between the predictions and the ideal (ground-truth) values is computed. (Both branches are handled this way.)
The loss is weighted spatially, because some datasets do not label all the people, and the masks they provide indicate regions that may contain unlabeled people.
W is a binary mask. W is 0 at unlabeled positions.
The following is the loss at stage t.
The final objective function is to sum the loss of each stage.
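A minimal sketch of the masked loss and its sum over stages, written by me for illustration. The function names are assumptions, and each confidence map or PAF component is treated as one 2D map so that both branches can share the same code.

```python
def masked_l2_loss(pred, target, mask):
    """Sum over pixels of mask * squared error, for one 2D map."""
    return sum(
        w * (p - t) ** 2
        for pred_row, target_row, mask_row in zip(pred, target, mask)
        for p, t, w in zip(pred_row, target_row, mask_row)
    )

def total_loss(stage_predictions, targets, mask):
    """Sum the masked L2 loss over all stages and all maps."""
    return sum(
        masked_l2_loss(pred, target, mask)
        for predictions in stage_predictions
        for pred, target in zip(predictions, targets)
    )

target = [[1.0, 0.0], [0.0, 0.0]]
mask = [[1, 1], [1, 0]]              # bottom-right pixel is unlabeled
stage1 = [[[0.5, 0.0], [0.0, 9.0]]]  # one map; big error only where mask is 0
stage2 = [[[1.0, 0.0], [0.0, 9.0]]]
print(total_loss([stage1, stage2], [target], mask))  # 0.25
```

The large error at the masked pixel contributes nothing, so predictions in unlabeled regions are not penalized during training.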
4. Results
Tested on 2 datasets: 1〉 MPII 2〉 MS COCO 2016
The method is less effective than other algorithms on people at smaller scales.
It is noteworthy that our method has won, but has lower accuracy than the top-down methods on people of smaller scales (AP^M). The reason is that our method has to deal with a much larger scale range spanned by all people in the image in one shot. In contrast, top-down methods rescale the patch of each detected area to preferable size independently and thus suffer less from small people.