The focus of this paper is improving bounding-box localization by predicting probabilities rather than regressing coordinates; the model is built on region proposals. The paper proposes the LocNet deep network, which does not depend on bounding-box regression. The authors claim LocNet can easily be combined with existing detection systems, but two things puzzle me: (1) the training method, which is not explicitly described in the paper beyond saying an iterative scheme is used; (2) what is the structure of the two networks after fusion? Can it be seen as a multi-task system, or are there two separate networks?
Detection method
The input candidate bounding boxes (obtained with selective search or sliding windows) are refined into more precise boxes through an iterative approach. Detection consists of two parts: a recognition model and a localization model. The recognition model computes a confidence score for each box, which measures localization accuracy; the localization model adjusts the box borders to generate new candidate boxes, which are then fed back into the recognition model. The pseudo-code is as follows.
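The original pseudo-code figure is not reproduced here, so below is a minimal Python sketch of the loop as I understand it; `recognize`, `localize`, `nms`, the iteration count, and the threshold are my own placeholder names and values, not the paper's.

```python
import numpy as np

def iterative_detection(image, boxes, recognize, localize, nms,
                        num_iters=4, score_thresh=0.5):
    """Sketch of the iterative loop; `recognize`, `localize`, and `nms` are the
    recognition model, localization model, and a standard non-max-suppression
    step, passed in as callables."""
    all_boxes, all_scores = [], []
    for _ in range(num_iters):
        scores = recognize(image, boxes)           # confidence per candidate box
        keep = scores > score_thresh               # prune low-confidence boxes
        boxes, scores = boxes[keep], scores[keep]  #   to reduce computation
        all_boxes.append(boxes)
        all_scores.append(scores)
        boxes = localize(image, boxes)             # adjust borders -> new candidates
    return nms(np.concatenate(all_boxes), np.concatenate(all_scores))
```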
As you can see, some boxes are deleted in the recognition step based on the computed confidence scores, which is done to reduce computational complexity. However, as the procedure shows, the confidence scores are of little use to the localization model itself; strictly speaking, evaluating the recognition model at every iteration is not necessary.
Localization model
But the above is not the main concern; the focus is how localization accuracy is improved. The proposed LocNet model can be summarized in the following steps:
(1) For an input box, enlarge it by a fixed factor to obtain a larger search region R, and project R onto the feature map.
(2) A layer similar to ROI pooling outputs a fixed-size feature map. This step needs some elaboration. The region is divided into M×M cells, from which two vectors can be produced, each entry representing the probability that the corresponding row or column of region R is contained in the bounding box (the in-out case). For the ground-truth box, the target probability of a row or column lying within the borders is 1, otherwise 0:

$$T_{x,i}=\begin{cases}1, & B_l\le i\le B_r\\ 0, & \text{otherwise}\end{cases}\qquad T_{y,i}=\begin{cases}1, & B_t\le i\le B_b\\ 0, & \text{otherwise}\end{cases}\qquad i=1,\dots,M$$
where the $B_s$ stand for the four borders, $s \in \{l, r, t, b\}$. These are called the in-out probabilities.
In addition, border probabilities are defined: the probability that a given row or column is exactly a border. For the ground-truth box,

$$T_{s,i}=\begin{cases}1, & i=B_s\\ 0, & \text{otherwise}\end{cases}\qquad s\in\{l,r,t,b\}$$
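As a concrete illustration, here is a small NumPy sketch that builds both kinds of target vectors; the function names and the 1-indexed border convention are my assumptions:

```python
import numpy as np

def inout_targets(M, l, r, t, b):
    """In-out targets T_x, T_y for ground-truth borders l, r, t, b (1-indexed)."""
    idx = np.arange(1, M + 1)
    T_x = ((idx >= l) & (idx <= r)).astype(float)  # is column i inside [l, r]?
    T_y = ((idx >= t) & (idx <= b)).astype(float)  # is row i inside [t, b]?
    return T_x, T_y

def border_targets(M, l, r, t, b):
    """Border targets: a single 1 at each border's position."""
    T = {s: np.zeros(M) for s in "lrtb"}
    for s, pos in zip("lrtb", (l, r, t, b)):
        T[s][pos - 1] = 1.0                        # 1-indexed border -> 0-indexed array
    return T

T_x, T_y = inout_targets(M=8, l=3, r=6, t=2, b=7)
print(T_x)  # [0. 0. 1. 1. 1. 1. 0. 0.]
```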
(3) After several convolution layers with ReLU activations, the network splits into two branches, corresponding to the two vectors. The row vector and the column vector are each obtained by max pooling over the orthogonal dimension of the feature map.
(4) After an FC layer, a sigmoid outputs the in-out probabilities, the border probabilities, or both.
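A minimal NumPy sketch of the two-branch head, assuming a (C, M, M) feature map over R; the weight shapes and the single FC layer per branch are my simplifications, not the paper's exact architecture:

```python
import numpy as np

def locnet_head(feat, W_x, W_y):
    """feat: (C, M, M) conv features over R; W_x, W_y: (M, C*M) FC weights (sketch)."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    col_feat = feat.max(axis=1)                 # max over rows -> (C, M), one slot per column
    row_feat = feat.max(axis=2)                 # max over cols -> (C, M), one slot per row
    p_x = sigmoid(W_x @ col_feat.reshape(-1))   # per-column probabilities, shape (M,)
    p_y = sigmoid(W_y @ row_feat.reshape(-1))   # per-row probabilities, shape (M,)
    return p_x, p_y
```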
Loss function
The most important part is the definition of the loss function. Each row or column is modeled as a Bernoulli variable with two possible outcomes (inside or not); taking the negative logarithm yields the familiar cross-entropy loss of logistic regression. For the in-out probabilities,

$$L_{\text{in-out}}=-\sum_{a\in\{x,y\}}\sum_{i=1}^{M}\Big[T_{a,i}\log p_{a,i}+(1-T_{a,i})\log(1-p_{a,i})\Big]$$
where the targets $T_{a,i}$ are defined as above. For the border probabilities, analogously,

$$L_{\text{border}}=-\sum_{s\in\{l,r,t,b\}}\sum_{i=1}^{M}\Big[\lambda^{+}\,T_{s,i}\log p_{s,i}+\lambda^{-}\,(1-T_{s,i})\log(1-p_{s,i})\Big]$$
Here $\lambda^{+}$ and $\lambda^{-}$ are two balance factors: since very few rows or columns are actually borders (one out of M for each border), the weight of the positive terms is increased so that positives and negatives contribute comparably to the loss.
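A sketch of such a balanced cross-entropy in NumPy; the default choice of $\lambda^{+}$ below (negatives-to-positives ratio) is my own and not necessarily the paper's value:

```python
import numpy as np

def balanced_bce(p, T, lam_pos=None):
    """Balanced cross-entropy over an M-vector (sketch, not the paper's exact loss)."""
    eps = 1e-7
    p = np.clip(p, eps, 1 - eps)                 # avoid log(0)
    n_pos = max(T.sum(), 1.0)
    n_neg = max(len(T) - T.sum(), 1.0)
    if lam_pos is None:
        lam_pos = n_neg / n_pos                  # up-weight the rare positive entries
    loss = -(lam_pos * T * np.log(p) + (1 - T) * np.log(1 - p))
    return loss.mean()
```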
Problem: for the whole model, it is strange that the final two branches, which correspond to the row and column vectors, can recover border information through max pooling of all things; I really don't understand why. It makes one wonder why pooling works so well and seems so universal. Someone has asked exactly this: what is the pooling layer in a CNN actually for, and what do max pooling and average pooling do to the features in image classification? The conclusion seems to be that it was largely a gut-feeling design choice... The paper "A Theoretical Analysis of Feature Pooling in Visual Recognition" analyzes this systematically, and even it treats pooling as an empirical operation; it does not seem to reach a definitive answer...
The paper "a Theoretical analysis of the feature pooling in Visual recognition" notes, when a porter, mainly records pooling Binary feature part, the back of which has not been understood, The final conclusion is that Pooling can turn federated features into a more important representation while ignoring other irrelevant details.
For simplicity, assume each binary feature over the P patches in a pooling region is an i.i.d. Bernoulli variable with activation probability $\alpha$. Average pooling then computes the mean of the P values, while max pooling computes their maximum:

$$f_{\text{avg}}=\frac{1}{P}\sum_{i=1}^{P}x_i,\qquad f_{\max}=\max_{1\le i\le P}x_i,\qquad x_i\sim\text{Bernoulli}(\alpha)$$
The paper discusses class separability: given two classes C1 and C2 with class-conditional activation probabilities $\alpha_1$ and $\alpha_2$, the class-conditional distributions under max pooling are Bernoulli$(1-(1-\alpha_1)^P)$ and Bernoulli$(1-(1-\alpha_2)^P)$, while under average pooling the class-conditional means are $\alpha_1$ and $\alpha_2$ with variances $\alpha_j(1-\alpha_j)/P$. Although these are conditional distributions given a class, they determine the posterior probability of belonging to each class, so they can be used to measure the separability of the two distributions.
There are two ways to increase the separability of the two distributions: increase the distance between their means, or make their standard deviations smaller.
For average pooling, under the preceding Bernoulli assumption the mean of the pooled feature stays $\alpha$ regardless of P (note that for the class-conditional distributions, each keeps its own mean, $\alpha_1$ or $\alpha_2$), while the variance $\alpha(1-\alpha)/P$ becomes smaller as P grows.
For max pooling, the mean is $1-(1-\alpha)^P$ and the variance is $\big(1-(1-\alpha)^P\big)(1-\alpha)^P$. Defining separability through the distance between the class-conditional means,

$$\psi(P)=\big|(1-\alpha_2)^P-(1-\alpha_1)^P\big|$$
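A quick Monte-Carlo check of these moment formulas; the values of $\alpha$ and P are arbitrary choices of mine:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, P, trials = 0.3, 10, 200_000
x = rng.random((trials, P)) < alpha            # P Bernoulli(alpha) features per region
avg, mx = x.mean(axis=1), x.max(axis=1)
print(avg.mean(), alpha)                       # average pooling mean -> alpha
print(avg.var(), alpha * (1 - alpha) / P)      # average pooling var  -> alpha(1-alpha)/P
print(mx.mean(), 1 - (1 - alpha) ** P)         # max pooling mean     -> 1-(1-alpha)^P
print(mx.var(), (1 - (1 - alpha) ** P) * (1 - alpha) ** P)  # max pooling variance
```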
where $\alpha_1 > \alpha_2$ without loss of generality. Viewing $\psi$ as a function of P and extending P to the real domain, the maximizing point is

$$P^{*}=\frac{\log\!\big(\log(1-\alpha_1)/\log(1-\alpha_2)\big)}{\log\!\big((1-\alpha_2)/(1-\alpha_1)\big)}$$
The function first rises and then falls, with limit 0 as $P \to \infty$. Note that $\psi(1)=\alpha_1-\alpha_2$ is the distance of the means at P = 1, so when $P^{*}>1$ there is a whole range of P values that make the distance larger than at P = 1. If instead $P^{*}<1$, it can be derived that a selected feature represents more than half of the patches in the image (my understanding of this sentence: since $\alpha$ is the probability of selecting/generating the feature under a class, the activation probability is too high). But this generally does not happen when the codebook contains more than 100 codewords, because activation probabilities are then very low.
The variance of max pooling, as a function of P, likewise first rises and then falls.
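A numeric check of the separability curve and its maximizer; the two $\alpha$ values are made-up class-conditional activation probabilities:

```python
import numpy as np

a1, a2 = 0.4, 0.2                        # P(active | C1), P(active | C2), chosen arbitrarily
P = np.arange(1, 61)
psi = (1 - a2) ** P - (1 - a1) ** P      # distance between class-conditional means
P_star = (np.log(np.log(1 - a1) / np.log(1 - a2))
          / np.log((1 - a2) / (1 - a1)))
print(P[psi.argmax()], round(P_star, 2)) # 3 vs ~2.88: discrete/continuous maximizers agree
print(psi[0], psi.max())                 # psi(1)=0.2 < psi(3)~0.296: pooling over P>1 helps
```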
Based on the above, the paper summarizes several points:
1. Max pooling is particularly well suited to separating features that are very sparse (i.e., that have a very low probability of being active, so activations rarely occur).
2. Using all available samples to perform pooling may not be optimal.
3. The optimal pooling cardinality should increase with dictionary size.
"CV paper reading" + "porter" locnet:improving Localization accuracy for Object Detection + A Theoretical analysis of feature pooling In Visual recognition