I have spent some time reading object detection papers, so here is a summary of them. I may not have understood everything correctly; if there are problems, please correct me.

1, RCNN
RCNN uses selective search (SS) to find region proposals (RPs), and then runs CNN inference on each RP separately. The approach is quite direct, and the framework is easy to understand.
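The per-proposal loop can be sketched in a few lines. This is only an illustration: `selective_search` and `cnn_classify` are hypothetical stand-ins for the real components, and the "warp" here is a plain crop.

```python
def rcnn_detect(image, selective_search, cnn_classify):
    """R-CNN sketch: one CNN forward pass per region proposal.

    `selective_search` and `cnn_classify` are hypothetical stand-ins
    for the real components (SS yields roughly 2K proposals per image).
    `image` is a list of rows so plain slicing works.
    """
    detections = []
    for (x, y, w, h) in selective_search(image):
        crop = [row[x:x + w] for row in image[y:y + h]]  # crop the RP
        label, score = cnn_classify(crop)                # per-RP inference
        detections.append(((x, y, w, h), label, score))
    return detections
```

The key point is that the cost scales with the number of proposals, since every RP gets its own forward pass.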
SS extracts approximately 2K RPs per image, and each RP is then fed through the CNN.

2, Sppnet
Much of RCNN's computation is unnecessary: the boxes overlap, so once a region has been computed, why compute it again inside another box? This is where computation sharing began. In Sppnet, everything up to conv5 is computed only once per image.
In addition, Sppnet has three main contributions:
(1) The SPP layer produces a fixed-size feature, which allows the input image to be of any size. Why? The authors observed that in a CNN, apart from the fully connected layers, the other layers basically do not require a fixed input size. In that case, let everything before the fully connected layers handle any size, and only fix the size right before the fully connected layers. How? That is the SPP layer in the figure: pooling with different bin counts yields a fixed-length pooled result.
(2) The SPP layer uses multiple levels of spatial bins while keeping the sliding-window idea, which makes the features more robust to object deformation.
(3) Thanks to the variable input image size, SPP can pool at different scales.
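Point (1) can be illustrated with a minimal sketch: whatever the spatial size of the input feature map, pooling it with a fixed set of bin counts yields a fixed-length vector. The bin-boundary arithmetic below is a simplified assumption, not the paper's exact scheme, and only one channel is shown.

```python
def spp_pool(feature_map, levels=(1, 2, 4)):
    """Spatial pyramid pooling sketch for a single channel.

    For any (h, w) feature map (list of rows), the output length is
    sum(n * n for n in levels) -- a fixed size regardless of input size.
    """
    h, w = len(feature_map), len(feature_map[0])
    out = []
    for n in levels:                      # one pyramid level = n x n bins
        for i in range(n):
            for j in range(n):
                r0 = i * h // n
                r1 = max((i + 1) * h // n, r0 + 1)   # keep bins non-empty
                c0 = j * w // n
                c1 = max((j + 1) * w // n, c0 + 1)
                out.append(max(feature_map[r][c]
                               for r in range(r0, min(r1, h))
                               for c in range(c0, min(c1, w))))
    return out
```

With levels (1, 2, 4) the output always has 1 + 4 + 16 = 21 values per channel, so the fully connected layers downstream always see the same dimensionality.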
Problem: Sppnet training is still multi-stage. The main reasons: the SPP layer is multi-scale, so the receptive field involved in computing the gradients passed back through it is large and the computation is slow; and the classifier used afterwards is an SVM, which is also time-consuming. In practice, Sppnet training uses SS to find the RPs, runs them through the front layers, stores the features to disk, and then loads those features to train the back layers; the training of the front layers and of the back layers is separated.

Fast RCNN
This paper addresses the problems in Sppnet. The main contributions are:
It proposes RoI pooling. RoI pooling is actually a simplified version of SPP: the SPP pyramid makes gradient propagation difficult, so just use a single pyramid level. Of course you then have to give it a new name, so it is called RoI pooling. With RoI pooling, the output is still a fixed-size feature, but the receptive field involved in the gradient is no longer as large as before, so the network can be trained end to end (apart from the proposals).
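RoI pooling is SPP with one level, applied to a sub-region of the shared feature map. A minimal single-channel sketch, with the same simplified bin arithmetic as above (an illustration, not Fast RCNN's exact implementation):

```python
def roi_pool(feature_map, roi, out_size=2):
    """RoI pooling sketch: max-pool one region of interest into a
    fixed out_size x out_size grid.

    `roi` is (x, y, w, h) in feature-map coordinates; whatever its
    size, the output grid is always out_size x out_size.
    """
    x, y, w, h = roi
    pooled = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            r0 = y + i * h // out_size
            r1 = y + max((i + 1) * h // out_size, i * h // out_size + 1)
            c0 = x + j * w // out_size
            c1 = x + max((j + 1) * w // out_size, j * w // out_size + 1)
            row.append(max(feature_map[r][c]
                           for r in range(r0, min(r1, y + h))
                           for c in range(c0, min(c1, x + w))))
        pooled.append(row)
    return pooled
```

Every RoI, large or small, comes out as the same fixed grid, which is exactly what the fully connected layers need.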
Multi-task loss. In Sppnet, bounding box (BB) regression and classification were trained separately. Here there is a single multi-task loss, and the classifier is no longer a complicated SVM; a softmax suffices.
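The multi-task loss combines a classification term with a box-regression term. A sketch in the spirit of Fast RCNN (treating class 0 as background is my assumption here, and the weighting is simplified):

```python
import math

def smooth_l1(x):
    """Smooth L1: quadratic near zero, linear beyond |x| = 1."""
    return 0.5 * x * x if abs(x) < 1 else abs(x) - 0.5

def multitask_loss(class_probs, true_class, box_pred, box_target, lam=1.0):
    """Fast RCNN-style multi-task loss sketch.

    Cross-entropy on the softmax class probabilities, plus smooth-L1
    box regression applied only to non-background RoIs (class 0 is
    assumed to be background for illustration).
    """
    l_cls = -math.log(class_probs[true_class])
    l_loc = 0.0
    if true_class != 0:  # background RoIs contribute no box loss
        l_loc = sum(smooth_l1(p - t) for p, t in zip(box_pred, box_target))
    return l_cls + lam * l_loc
```

Because both terms are differentiable, one backward pass trains classification and regression together, which is exactly what the SVM stage prevented.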
Problem: there are still issues. Notice that SS takes no part in the learning, and SS seems to be computed offline (this needs to be verified). Another important problem: however strong the BB regression and classification network becomes, you are still limited by the RPs that SS extracts. In other words, SS may have become the bottleneck.

Faster-rcnn
This paper picks up where the last one left off: Fast RCNN is limited by SS and cannot be trained end to end, so build a network to replace SS. That network is the key to Faster-rcnn: the RPN. But how does the RPN work?
After several convolutions over the original image, we get 256 feature maps of size 40*60, and a small convolutional network then slides over them. To achieve some robustness to scale, the anchor method is used: at each position, several scale and aspect-ratio transformations are applied, 9 in the paper. So for a 40*60 feature map there are 40*60*9 RPs in total, which is still a lot.
One thing that may not be clear is what the RPN outputs. At each window, the RPN outputs a 256-dimensional feature. Where does 256 come from? It comes from taking the corresponding position in each of the 256 feature maps.
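The anchor enumeration is easy to write down concretely. A sketch of the 40*60*9 count (the stride of 16 and the exact width/height formulas are assumptions for illustration; the paper parameterizes scales and ratios slightly differently):

```python
def make_anchors(fm_h=60, fm_w=40, scales=(8, 16, 32),
                 ratios=(0.5, 1.0, 2.0), stride=16):
    """Anchor sketch: len(scales) * len(ratios) = 9 boxes per cell.

    Returns (cx, cy, w, h) anchors in image coordinates; `stride` is
    the assumed total downsampling of the backbone network.
    """
    anchors = []
    for i in range(fm_h):
        for j in range(fm_w):
            cx, cy = (j + 0.5) * stride, (i + 0.5) * stride
            for s in scales:
                for r in ratios:          # r approximates height/width
                    w = s * stride * (r ** 0.5)
                    h = s * stride / (r ** 0.5)
                    anchors.append((cx, cy, w, h))
    return anchors
```

For a 60x40 feature map this produces 60 * 40 * 9 = 21,600 candidate boxes, which is why the subsequent scoring and non-maximum suppression matter so much.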
The subsequent classification and regression also use fully connected layers. This is how the RPs are computed; in effect, BB regression and classification share computation, because their input is the same 256-d feature.
Problem: the computational cost is still too high to meet real-time requirements.

YOLO
The biggest advantage of this network is speed. How? First, shared computation is of course used. Then the image is divided into cells: the picture is split into an s*s grid, and each cell predicts b BBs, where each BB contains 5 quantities: x, y, w, h, confidence. The confidence is a score indicating whether there is a target worth detecting there. Each cell also predicts a score for each of the C classes. In the test phase, the box confidence is multiplied by the class scores, which gives a class-specific probability for each box. Very neat.
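The test-time multiplication is a one-liner. A sketch for a single cell (the b box confidences and C class probabilities are taken as given inputs):

```python
def yolo_scores(box_confidences, class_probs):
    """YOLO test-time scoring sketch for one grid cell.

    Each of the b boxes has a confidence ~ Pr(object) * IOU, and the
    cell has C conditional class probabilities Pr(class | object).
    Their product gives a class-specific confidence per box.
    """
    return [[conf * p for p in class_probs] for conf in box_confidences]
```

Thresholding these b*C scores (followed by non-maximum suppression) yields the final detections, without any proposal stage at all.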
The following is the network structure:
YOLO's idea is very different from the RPN line of work: there is no RPN at all; the network is trained directly. But there is a problem: if a cell contains several targets, it can only regress one of them, presumably the most salient one. This is also why YOLO's small-target detection is not ideal: small targets have little feature left in the later layers. So YOLO is fast, but its detection accuracy is not satisfactory.

SSD
Although YOLO's accuracy is not great, its groundbreaking "look once" idea is commendable and memorable. SSD inherits YOLO's work and extends it.
The default box (DB) is an important part of SSD. After the convolutions, the features still carry relative position information. Divide the feature map into different DBs and you get the same effect as YOLO's grid, except that each DB also has anchors.
Take a closer look at the figure and the text inside it: even if the label and the predicted BB fall in the same DB, you still have to check whether the anchor matches. This makes the regression for each DB more precise.
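The matching step is driven by intersection-over-union. A sketch of the idea (SSD's matching also always assigns each ground-truth box its best-overlapping default box; only the threshold rule is shown here):

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def match_default_boxes(default_boxes, gt_boxes, threshold=0.5):
    """SSD-style matching sketch: a default box becomes a positive for
    any ground-truth box it overlaps with IoU >= threshold."""
    matches = {}
    for i, db in enumerate(default_boxes):
        for j, gt in enumerate(gt_boxes):
            if iou(db, gt) >= threshold:
                matches[i] = j
    return matches
```

Only the matched (positive) default boxes contribute to the localization loss; the rest are treated as background.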
Then look at the network framework, shown in comparison with YOLO. The front layers of SSD are similar to YOLO's, but the back is different: YOLO regresses BBs and classifies directly from the last feature map, while SSD also fuses features from different layers.
In fact, SSD shows that YOLO did not fully exploit shared computation: the earlier layers' features are only used as input to the next layer, but they are still features and contain a lot of useful information. Why do I say that? The earlier feature maps are larger, so details are better preserved; moreover, earlier layers generally tend to extract information such as edges, and edges are very important for vision. If you use them, you should certainly get an improvement. Of course, this is only one aspect.
Incidentally, another article by the same authors, Parsenet, uses similar ideas and is worth reading alongside.

Inside-outside Network
At first glance, ION is similar to SSD: it also fuses multi-layer features to obtain better features.
Skip connections make good use of the different layers' features. But there is still a problem: think about it, the features of different layers are not alike, and their magnitudes across layers are not normalized. If you combine them mechanically, you may damage the original features and not necessarily get a good improvement.
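The usual fix, as in ParseNet mentioned above, is to L2-normalize each layer's features before concatenating them. A minimal sketch (the scale is learnable in the papers; a fixed constant here is an assumption for illustration):

```python
import math

def l2_normalize(features, scale=1.0):
    """L2-normalize a feature vector (ParseNet-style sketch).

    Dividing by the vector's L2 norm puts features from different
    layers on a comparable scale; `scale` stands in for the learnable
    per-channel scale of the real method.
    """
    norm = math.sqrt(sum(x * x for x in features)) or 1.0
    return [scale * x / norm for x in features]

def fuse(layer_features):
    """Concatenate per-layer feature vectors after normalizing each."""
    fused = []
    for f in layer_features:
        fused.extend(l2_normalize(f))
    return fused
```

After normalization, no single layer dominates the concatenated vector just because its raw activations happen to be larger.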
So the author gives the following method:
If that were all, it could not be called a masterpiece; there has to be some extra trick. Think of the image as a sequence problem, not a time sequence but a spatial sequence. Can we then introduce RNNs to exploit this sequential structure? Yes, that is exactly what the authors did.
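To make the spatial-sequence idea concrete, here is one recurrent sweep along a row of features, in the spirit of ION's four-direction IRNN passes. This is purely illustrative: a scalar recurrent weight and a single direction, whereas the real model uses learned matrices and sweeps left, right, up, and down.

```python
def spatial_rnn_pass(feature_row, w_rec=0.5):
    """One left-to-right recurrent sweep over a row of features.

    Each position's output mixes its own input with the hidden state
    carried from the left, so context propagates spatially; ION runs
    such sweeps in all four directions and combines them.
    """
    out, h = [], 0.0
    for x in feature_row:
        h = max(0.0, x + w_rec * h)   # ReLU(input + recurrent term)
        out.append(h)
    return out
```

Notice how a strong activation leaks into the cells to its right: that is the "outside the box" context the network gains from treating space as a sequence.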
Below are some of the RNN figures from the paper; enjoy the masters' style.
Written at the end
Well, that was a quick pass through several object detection papers, though many details were left unsaid. To reproduce these papers, you certainly cannot rely on the original text alone; read the code as well. Best of all, study the network diagrams in the articles; you will surely come away with a good feel for them. There is also a good online visualization tool; if you are interested, give it a try.

Ps
The last thing is about finding a job. I have been busy with recruitment season recently, but I found that vision jobs are hard to come by. Maybe my own study is not refined enough, but I have always had a strong curiosity about and interest in vision, and I hope to accomplish something in it. So I hope a team with a dream can take me in; a very high salary is not necessary, but the team must have ambition. If the work is NLP, I am also very interested; I do not yet understand that area deeply, but if a team is willing to take me, I would very much like to learn. All right, that's it; comments welcome.

Reference
A rather unprofessional reference list, I admit; the job hunt has been genuinely exhausting. I will fill it in after reading more papers. Please forgive me.