DPM (Deformable Parts Model) - Principle (I) (reprint)


DPM (Deformable Parts Model)

Reference:

Object Detection with Discriminatively Trained Part-Based Models. IEEE Trans. PAMI, 32(9): 1627–1645, 2010.

Support Vector Machines for Multiple-Instance Learning. Advances in Neural Information Processing Systems (NIPS), 2003.

Author's homepage: http://www.cs.berkeley.edu/~rbg/latent/index.html

Supplement and errata:

HOG features (excerpt from my graduation thesis)

DPM object detection algorithm (excerpt from my graduation thesis)

1. General Ideas

DPM is a very successful object detection algorithm; it won the VOC (Visual Object Classes) detection challenge in 2007, 2008, and 2009. It has since become an important component of many classifiers and of segmentation, human pose estimation, and behavior classification systems. In 2010 Pedro Felzenszwalb received a "lifetime achievement" award from the VOC organizers. DPM can be seen as an extension of HOG (Histograms of Oriented Gradients), and its overall idea is consistent with HOG: first compute the histogram of oriented gradients, then train a gradient-template model of the object with an SVM (Support Vector Machine). Such a template can be used directly for classification; put simply, the model is matched against the target. DPM just makes many improvements on top of this model.

The HOG paper trained a human model. It is a single model, and for upright people seen from the front or back it detects very well, a major breakthrough compared with earlier methods. HOG is also one of the best features to date (recently surpassed by the CVPR 2013 paper "Histograms of Sparse Codes for Object Detection"). But what about a person seen from the side? Naturally we would think of using multiple models. DPM uses 2 models, and the latest Version 5 release on the author's homepage uses 12 models.

[Figure: the two-component bicycle model (side view on the left, front view on the right)]

The figure shows the bicycle model: the left is the side view and the right is the front view. Well, I admit it is barely recognizable; it is just a rough initial version. For training we just supply a pile of bicycle pictures, with no label saying which picture belongs to component 1 and which to component 2. The samples are simply split into two halves according to the aspect ratio of their bounding boxes and trained separately. Many of them will certainly end up in the wrong half, so the trained models are naturally distorted. But it does not matter: the paper only uses these two models as initial values. The point is that the author uses multiple models.

The two models on the right each use 6 sub-models; each white rectangular frame is one sub-model. Anyone who has seen a bicycle can basically tell that this is a bike. It is easier to recognize than the left one because the component-category problem is largely solved, and also because the resolution is twice that of the left. I will not go into the details here; see the paper.

With multiple models the viewpoint problem can be solved, but a serious problem remains: animals move, and even lifeless cars come in many styles. With only a single rigid template, if the target moves, say a model strikes a pose, the match between the template and the target drops a lot. In other words, our model is too rigid to accommodate the motion of objects, especially non-rigid objects. Naturally we can again think of adding sub-models, for example a sub-model for the hand: when the hand moves, the sub-model can still detect where the hand is. The matching score of the sub-models is then combined with that of the main model, in the simplest case by summing, so doesn't the overall matching score go up? The idea is that simple! There is one small detail: a sub-model must not be too far from the main model. Imagine a hand located a full body-height away from the torso; is that still a person? Maybe it would be a good model for detecting ghosts. So we add the offset of a sub-model relative to the main model as a cost, i.e. the combined score minus the offset cost. In essence this uses prior knowledge about the spatial relationship between the sub-models and the main model.

Well, finally a figure. The right side shows our offset cost: the center of the circle is the ideal position of the sub-model. If the detected sub-model lands exactly there, the cost is 0; positions in the surrounding rings incur a penalty, and the farther from the center, the larger the penalty.

In fact, the part model was proposed as early as 1973; see "The Representation and Matching of Pictorial Structures" (which I have not actually read...).

For HOG features you can refer to my blog post "OpenCV HOG pedestrian detection source analysis". SIFT features are very similar; I originally wanted to write about them too, but got lazy, and it would have been rather long-winded, so instead refer to the blog series by a Peking University classmate of mine: "SIFT principle and source code analysis (OpenCV)".

In summary, the essence of DPM is a spring deformation model; see the 1973 paper "The Representation and Matching of Pictorial Structures".

2. Detection

The detection process is relatively simple:

Overall score:
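The formula image does not survive in this reprint, so here is a reconstruction in the notation of the PAMI 2010 paper (not the original image): the overall score of a component placed with its root at (x_0, y_0) in pyramid level l_0 is

score(x_0, y_0, l_0) = R_{0, l_0}(x_0, y_0) + \sum_{i=1}^{n} D_{i, l_0 - \lambda}(2(x_0, y_0) + v_i) + b

where R_{i,l} is the response of filter i on level l of the feature pyramid, and l_0 - \lambda is the level at twice the resolution of l_0; the remaining symbols are explained below.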

The first term is the score of the root filter (what I called the main model above), i.e. its degree of match; it is essentially a convolution of the filter with the features, and the same holds for the part filters. The middle sum is the scores of the n part filters (the sub-models mentioned earlier). The last term is the root offset b, introduced so that scores of different components are comparable. (x_0, y_0) is the coordinate of the root filter's left-top position in the root feature map, and 2(x_0, y_0) + v_i maps it to the coordinate of the i-th part filter in the part feature map; the factor of 2 is there because the resolution of the part feature map is twice that of the root feature map, and v_i is the offset of part i's anchor from the root filter's left-top corner.

The score is as follows:
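Again the formula image is missing; following the same paper, the transformed part response is

D_{i,l}(x, y) = \max_{dx, dy} [ R_{i,l}(x + dx, y + dy) - d_i \cdot \phi_d(dx, dy) ], \qquad \phi_d(dx, dy) = (dx, dy, dx^2, dy^2)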

The formula above searches, within a certain range around the part filter's ideal position (its anchor), for the location that is jointly optimal in appearance match and deformation. (dx, dy) is the offset vector, d_i is the weight vector of the offset cost, and in the simplest case, when d_i = (0, 0, 1, 1), the cost is just the squared Euclidean distance. This step is called the (generalized) distance transform, and D_{i,l} is the transformed response. The main programs for this part are train.m, featpyramid.m, and dt.cc.
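To make the distance transform concrete, here is a minimal brute-force sketch in Python (my own illustration, not the linear-time algorithm implemented in dt.cc; the array and parameter names are assumptions):

```python
import numpy as np

def deformed_part_score(response, d, max_disp=4):
    """Brute-force version of D_{i,l}: for every location, take the best
    displaced part response minus the deformation cost d . (dx, dy, dx^2, dy^2).

    response : 2-D array of part filter responses R_{i,l}(x, y)
    d        : deformation weights (d1, d2, d3, d4)
    max_disp : search window for the displacement (dx, dy)
    """
    H, W = response.shape
    D = np.full((H, W), -np.inf)
    for dy in range(-max_disp, max_disp + 1):
        for dx in range(-max_disp, max_disp + 1):
            cost = d[0]*dx + d[1]*dy + d[2]*dx*dx + d[3]*dy*dy
            shifted = np.full((H, W), -np.inf)
            # shifted[y, x] = response[y + dy, x + dx] wherever that index is valid
            shifted[max(0, -dy):min(H, H - dy), max(0, -dx):min(W, W - dx)] = \
                response[max(0, dy):min(H, H + dy), max(0, dx):min(W, W + dx)]
            D = np.maximum(D, shifted - cost)
    return D
```

The real dt.cc computes the same maximization in linear time with the generalized distance transform, so this loop is only meant to show what is being optimized.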

3. Training
3.1 Multiple-Instance Learning
3.1.1 MI-SVM

In general machine learning algorithms, every training sample needs a class label (for two classes: 1/-1). Data of that kind has already been abstracted; in practice, obtaining such labels can be quite difficult, and images are a typical case. Also, the labeling workload is huge and we want to be lazy, so often we only provide a positive sample set and a negative sample set. The samples in a negative set are all negative, but the samples in a positive set are not necessarily all positive; at least one of them is positive. For example, for person detection, a picture of the sky can be a negative sample set, and a self-portrait can be a positive sample set (take N samples from N regions of it, but only some of them are positive samples that actually contain a person). The labels of the positive samples are thus very ambiguous, and traditional methods cannot train on them.

But wait, aren't the images labeled? Shouldn't they at least have category labels? The issue is that the labeling is done by people and the amount of data is huge, so some annotations are inevitably not precise enough; this is called a weakly supervised set. So if the algorithm could automatically find the optimal positions, wouldn't the resulting classifier be more accurate? "The annotated position is not very accurate" may not be an obvious example here, but remember the sub-model positions mentioned earlier? For example, the position of a bicycle's wheel is not annotated at all; we only know that there is a wheel somewhere in the vicinity of the bounding box region. Without the exact position we cannot extract training samples. In this situation the wheel has a number of candidate positions; they form a positive sample set, but only some of them actually contain the wheel.

To address the above problem, "Support Vector Machines for Multiple-Instance Learning" proposed MI-SVM. Its essential idea is to extend the standard SVM's maximization of the sample margin into maximizing the sample-set margin. Concretely, the sample in each positive set that looks most like a positive sample is used for training, while the other samples in the positive set await their fate; in each negative set, the negative sample closest to the separating interface determines the constraint. Our aim is to ensure that the positive sets are classified as positive and that no sample in a negative set is classified as positive, so this is basically still the standard SVM: take the maximum over each positive set (the sample farthest on the positive side of the boundary) and the minimum over each negative set (the sample closest to the interface):
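The formula that followed here is missing; in the notation of the NIPS 2003 MI-SVM paper, the sample-set (bag) margin being maximized is

\gamma_I = Y_I \max_{i \in I} (\langle w, x_i \rangle + b)

where I is a sample set, Y_I its label, and x_i the samples it contains.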

For positive samples: the sample in the positive sample set that looks most like a positive sample is selected:
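That is, the representative of a positive sample set I is (my notation for the missing formula)

x_I^* = \arg\max_{i \in I} (\langle w, x_i \rangle + b)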

For negative samples: the max can be expanded, because if the least-negative sample satisfies the constraint then the remaining negative samples satisfy it too, so for any negative sample:
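Reconstructing the missing formula: for a negative set I (Y_I = -1), the single set-level constraint is equivalent to one constraint per sample,

-(\langle w, x_i \rangle + b) \ge 1 - \xi_I \quad \text{for all } i \in I,

because if the highest-scoring negative sample satisfies it, all the others do as well.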

Objective function:
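The objective image is missing; as given in the MI-SVM paper, it is

\min_{w, b, \xi \ge 0} \; \frac{1}{2}\|w\|^2 + C \sum_I \xi_I
\quad \text{s.t.} \quad Y_I \max_{i \in I} (\langle w, x_i \rangle + b) \ge 1 - \xi_I \;\; \text{for every sample set } I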

This means selecting only the highest-scoring sample from each positive sample set, while using all samples from the negative sample sets. The only difference from the standard SVM lies in the bounds on the Lagrange multipliers.

The constraints of the standard SVM are:
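The missing formula here is presumably the familiar constraint on the dual (Lagrange) coefficients of the standard soft-margin SVM,

0 \le \alpha_i \le C, \qquad \sum_i \alpha_i y_i = 0,

which is the part that changes in the MI-SVM dual.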

This finally leads to an iterative optimization problem:

The idea is simple: the first step optimizes which samples are selected from the positive sets, and the second step optimizes the SVM model. Like K-means, this clustering-like algorithm is just two simple steps, yet it packs endless power.
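A minimal sketch of this alternating procedure in Python, assuming scikit-learn's LinearSVC as the standard SVM (the bag and variable names are my own, not from the DPM release):

```python
import numpy as np
from sklearn.svm import LinearSVC

def mi_svm(pos_bags, neg_instances, n_iter=10, C=1.0):
    """pos_bags: list of (n_i, d) arrays, each a positive sample set;
    neg_instances: (m, d) array, every row treated as a negative sample."""
    # Initialize each positive bag's representative, e.g. with its mean instance.
    reps = np.array([bag.mean(axis=0) for bag in pos_bags])
    clf = None
    for _ in range(n_iter):
        # Step 2: train a standard SVM on the current representatives + negatives.
        X = np.vstack([reps, neg_instances])
        y = np.hstack([np.ones(len(reps)), -np.ones(len(neg_instances))])
        clf = LinearSVC(C=C).fit(X, y)
        # Step 1: in each positive bag, re-select the highest-scoring instance.
        new_reps = np.array([bag[np.argmax(clf.decision_function(bag))]
                             for bag in pos_bags])
        if np.allclose(new_reps, reps):
            break  # the selection has stabilized
        reps = new_reps
    return clf
```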

You can also refer to this blog post on multiple-instance learning.

For a detailed theoretical derivation of SVMs I have to recommend the much-admired MIT PhD pluskid: Support Vector Machine series.

On solving SVMs: SVM Learning--Sequential Minimal Optimization

SVM Learning--Coordinate Descent Method

In addition, the counterpart of multiple-instance learning is multi-label learning, which is worth looking into if you are interested. The two are closely related: in multiple-instance learning the labels of the input samples are ambiguous (positive or negative), whereas in multi-label learning the ambiguity is in the output labels of a sample.

3.1.2 Latent SVM

1) I think MI-SVM can be seen as a special case of Latent SVM. First, what is a latent variable? In MI-SVM, which sample in a positive sample set is the true positive is the latent variable. But that latent variable is single and relatively simple: its value is just an index within the positive sample set. The latent variables of LSVM are far more numerous, for example the actual position x, y of the bounding box, the level in the HOG feature pyramid, and the component ID of the sample. In other words, given a positive image annotated only with a bounding box, we need to extract, at some position and some scale, a region to serve as the positive sample of some component.
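For reference (the formula images are missing from this reprint), the Latent SVM of the PAMI 2010 paper scores an example x by maximizing over its latent values z,

f_\beta(x) = \max_{z \in Z(x)} \beta \cdot \Phi(x, z),

and training minimizes the hinge-loss objective

L_D(\beta) = \frac{1}{2}\|\beta\|^2 + C \sum_{i=1}^{n} \max(0, 1 - y_i f_\beta(x_i)).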

Let's look directly at Latent SVM's training procedure (the pseudocode figure from the paper; the line numbers below refer to its lines):

This part also involves data mining. First, look at lines 3-6 and 12 of the loop.

Lines 3-6 correspond to the first step of MI-SVM, and line 12 corresponds to the second step of MI-SVM. The author solves for the optimal model β directly with gradient descent.

2) Now about data mining. Why doesn't the author optimize directly, but performs data mining as well? Because the number of negative samples is enormous: the total number of samples used in Version 3 is about 2^28, of which the positive samples are only a tiny fraction. With so many negatives, direct optimization would be very slow, since negative samples far from the interface contribute almost nothing to the optimization. The role of data mining is to remove those easy examples that matter very little for the optimization and keep the hard examples close to the interface, corresponding to lines 13 and 10 respectively. The theoretical justification for doing this is proved in the paper.

3) A brief word on the stochastic gradient descent method (Stochastic Gradient Descent):

First, the gradient expression:
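Reconstructed in the paper's notation, the (sub)gradient of the objective above is

\nabla L_D(\beta) = \beta + C \sum_{i=1}^{n} h(\beta, x_i, y_i), \qquad
h(\beta, x_i, y_i) = 0 \text{ if } y_i f_\beta(x_i) \ge 1, \text{ and } -y_i \Phi(x_i, z_i(\beta)) \text{ otherwise,}

where z_i(\beta) = \arg\max_{z} \beta \cdot \Phi(x_i, z) is the best latent value for example i.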

Gradient approximation:
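Summing over all n examples is expensive, so the gradient is approximated by sampling a single example i uniformly at random:

\nabla L_D(\beta) \approx \beta + C\, n\, h(\beta, x_i, y_i)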

Optimization process:
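Putting it together, one stochastic gradient step with learning rate \alpha_t is roughly:

1. Sample an example i at random and compute z_i = \arg\max_z \beta \cdot \Phi(x_i, z).
2. If y_i\, \beta \cdot \Phi(x_i, z_i) \ge 1, update \beta \leftarrow \beta - \alpha_t \beta.
3. Otherwise, update \beta \leftarrow \beta - \alpha_t (\beta - C\, n\, y_i \Phi(x_i, z_i)).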

The main programs for this part: pascal_train.m -> train.m -> detect.m -> learn.cc

3.2 Training Initialization

LSVM is sensitive to its initial values, so initialization is also a big deal. It is divided into three phases. I will not try to outdo the paper's English; it is quoted directly.

Below I briefly describe the work of each phase, mainly an analysis of the latent variables, which the paper does not spell out:

Phase 1: This is a traditional SVM training process, consistent with the HOG algorithm. The author sorts the positive samples by aspect ratio and then roughly splits them into two halves, training one root filter for each of the two components. The sizes of the two root filters are determined directly by the corresponding positive examples. When a positive sample is extracted, it is simply scaled to the root filter's size.
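A tiny sketch of that component split (my own illustration; the release does this inside pascal_train.m with its own data structures):

```python
def split_by_aspect_ratio(boxes):
    """boxes: list of (x1, y1, x2, y2) positive bounding boxes.
    Sort by aspect ratio and cut the sorted list in half, one half per component."""
    ordered = sorted(boxes, key=lambda b: (b[2] - b[0]) / float(b[3] - b[1]))
    mid = len(ordered) // 2
    return ordered[:mid], ordered[mid:]
```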

Phase 2: This is LSVM training. The latent variables are the actual position of the sample in the image, including the spatial position (x, y), the scale (the pyramid level), and the component class c, i.e. whether the sample belongs to component 1 or component 2. The parameters to be trained are the two root filters and the offsets (b).

Phase 3: Also an LSVM process.

First the part filters are added to the model. The author fixes 6 part filters per component, although in practice the number may be reduced according to the situation. To cut down the number of parameters, the part filters are made symmetric. The anchor location of each part filter within the root filter is fixed when the part filter is initialized by selecting the region of maximum energy.

This phase has the most latent variables: rootfilter (x, y, scale) and partfilters (x, y, scale). The parameters to be trained are the rootfilters, the root offset, the partfilters, and the defs (offset costs).

The main program for this part: pascal_train.m

4. Details
4.1 Bounding Box Prediction

Look closely at the bike's front wheel: if we only use the area detected by the root filter, namely the red area, then part of the front wheel is cut off; but if we also take into account the bounding boxes detected by the part filters, we can obtain a more accurate bounding box, as in the right picture.

This part is very simple: it is just least squares regression, and in the code trainbox.m solves it directly with left division.
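A minimal sketch of that regression in Python (illustrative variable names; trainbox.m uses MATLAB left division to solve the same least squares problem):

```python
import numpy as np

def fit_bbox_predictor(A, target):
    """A: (n, d) matrix of detection geometry (e.g. root/part filter locations,
    plus a column of ones for the bias); target: (n,) ground-truth box coordinate.
    Returns w minimizing ||A w - target||^2, i.e. what A \\ target gives in MATLAB."""
    w, *_ = np.linalg.lstsq(A, target, rcond=None)
    return w

# One such predictor is fit for each output coordinate (x1, y1, x2, y2).
```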

4.2 HOG

The author makes major changes to HOG. Instead of the original 4*9=36-dimensional vector, an 18+9+4=31-dimensional feature vector is extracted from each 8x8 cell (18 contrast-sensitive orientation channels, 9 contrast-insensitive channels, and 4 gradient-energy features). Based on a PCA (Principal Component Analysis) visualization, the author also discusses a 9+4-dimensional feature that can achieve the effect of the original 4*9-dimensional HOG.

A lot of this I will not elaborate on. I have not written a word of my thesis proposal and need to rush it... The main file is features.cc; combined with the figure below, I will study it slowly:

Source analysis:

DPM (Deformable Parts Model) source analysis - Detection

DPM (Deformable Parts Model) source analysis - Training

From: http://blog.csdn.net/ttransposition/article/details/12966521

