Interpretation of Light-Head R-CNN: Balancing accuracy and speed
Speaker: Li Zeming | Researcher at the Megvii (Face++) Research Institute
Editor: Qu Xin
Produced by QbitAI | WeChat public account QbitAI
On the evening of December 20, QbitAI and the Megvii (Face++) Research Institute held the third session of their paper-interpretation lecture series, in which a researcher from the institute walked through the recently published Light-Head R-CNN paper.
The Light-Head R-CNN proposed in this paper builds a lightweight head for R-CNN-style detectors, surpassing the current best results on the COCO dataset while maintaining high time efficiency.
Li Zeming, a researcher at the Megvii (Face++) Research Institute, a main member of the COCO 2017 Detection competition team, and an author of the Light-Head R-CNN paper, gave the talk.
At readers' request, QbitAI has organized the highlights as follows:
△ Video playback of the talk
The Light-Head R-CNN paper mainly discusses how R-CNN-style detectors balance accuracy and speed in object detection. It proposes a better two-stage detector design, which not only improves accuracy but also makes the complexity of the part of the model beyond the base network (the "head") more flexible and controllable.
Based on ResNet-101, we reach a new state-of-the-art result of 40.6 mmAP, exceeding Mask R-CNN and RetinaNet. At the same time, with a smaller network, such as a small Xception-like model of about 145M FLOPs, Light-Head R-CNN reaches 100+ FPS at 30.7 mmAP, more efficient than SSD and YOLO.
First, we try to find out why two-stage detection methods are not fast enough. In fact, both two-stage and single-stage detectors can achieve very high accuracy. In terms of speed, however, single-stage detectors such as SSD and YOLO are often at an advantage. In this paper, we want to show that, with careful design, a two-stage object detector can also be extremely fast, and with even higher accuracy.
Review of the paper's results
Compared with state-of-the-art algorithms, our algorithm achieves both higher accuracy and higher efficiency.
The red triangle curve corresponds to the results of the paper. The horizontal axis is inference time in milliseconds, i.e., the speed of the object detector at test time. The vertical axis is COCO mmAP, the mAP averaged over IoU thresholds from 0.5 to 0.95. The leftmost red triangle is the result with the small model, the middle one with ResNet-50, and the topmost with ResNet-101. In both accuracy and efficiency, the results in this paper reach the state of the art.
About the title of the paper: some netizens enthusiastically suggested nicknames for Light-Head R-CNN. Looking at the result curve, Light-Head R-CNN is in fact both very fast and very accurate; to sum up, it should be considered a quick and accurate method.
Light-Head R-CNN is very flexible and general, which will be reflected later in the description of the method's structure. The framework is also very unified: from the Light-Head R-CNN perspective, the Faster R-CNN and R-FCN structures are actually very similar.
In addition, we also tested on a Titan Xp. Compared with the older Titan X, the timings differ, but our curve still lies above the blue and green curves. The test phase uses a single card with batch size 1.
Two categories of Object Detection
Currently, object detection methods generally fall into two categories: single-stage object detection, and two-stage object detection.
Two-stage object detection is proposal-based; the classic examples are the R-CNN series of detectors. Single-stage detection does not rely on proposals, for example the anchor-based SSD. Structurally, the single-stage approach therefore lacks the proposal predictor, while the two-stage approach introduces additional computation to perform regression and classification on each proposal (the so-called ROI).
That is to say, a two-stage detector has one more step than a single-stage one: the per-proposal regressor and classifier (the R-CNN part). Purely in terms of that structure, two-stage must be slower than one-stage. However, single-stage detectors such as RetinaNet and SSD also have a problem: every anchor must be classified over all object categories, so the number of channels predicted per anchor is much larger than in the RPN of a two-stage detector.
In a two-stage detector, the first stage does only binary classification (object vs. background), so it does not need many channels. Meanwhile, if we reduce the complexity of the second stage, i.e., the ROI prediction part, until its computation is negligible compared with the base model, then we can say the second stage has no significant impact on the network's speed. In this way, the overall complexity need not be higher than that of single-stage detection.
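The channel-count argument above can be made concrete with a rough back-of-the-envelope comparison. The anchor and class counts below are typical values assumed for illustration, not numbers taken from the paper:

```python
# Rough per-position channel comparison: single-stage classification head
# vs. a two-stage RPN. Assumed typical values: 9 anchors per position,
# 80 object classes (COCO).
num_anchors = 9
num_classes = 80

# A single-stage detector (RetinaNet-style) predicts a score for every
# class at every anchor position.
single_stage_cls_channels = num_anchors * num_classes  # 720

# A two-stage RPN only predicts object vs. background per anchor.
rpn_cls_channels = num_anchors * 2  # 18

print(single_stage_cls_channels, rpn_cls_channels)
```

With these assumed values the first stage of a two-stage detector predicts 40 times fewer classification channels per position, which is why it can stay cheap as long as the second stage is also kept light.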
So this paper discusses one question: how should the second stage be designed? As things currently stand, the second stage is too heavy.
Our conclusion is: in general, two-stage detection methods have an advantage in accuracy, but because they introduce a heavy second stage, speed may suffer. However, compared with one-stage object detection, if the two-stage detector achieves higher accuracy, we can trade some of that extra accuracy for speed.
Why? At the same accuracy, the faster method is the more reasonable choice, and at the same speed, the more accurate one. In the second half of the paper, we mention replacing the base model with a small model, precisely because a small model trades a certain amount of accuracy for performance.
That is to say, at the same accuracy, if the speed matches single-stage detection, two-stage detection still has its advantages.
Therefore, if you can reduce the computation of the second stage until its cost-effectiveness is high enough compared with a single-stage detector, it is worth introducing a second stage.
How to increase speed
Let's look back at Faster R-CNN and R-FCN. In fact, the two are not that different: one places a large amount of computation after the ROI operation, and the other places it before the ROI operation. That is, in the head part, both introduce a relatively large amount of computation.
Let's first analyze specifically why Faster R-CNN and R-FCN are not fast enough on small models. Faster R-CNN uses two heavy fully connected layers (or ResNet's 5th stage) for per-proposal prediction, while R-FCN produces a very large score map with #classes × 7 × 7 channels. Apart from the base model, both methods introduce a relatively large amount of computation.
Based on these observations, we designed a more flexible and general framework. The most important point is to make the feature map fed into pooling very thin. Why make it thinner? Because the complexity of the head is determined by two factors: the thickness of the pooled feature map, and the part that classifies and regresses on the pooled features. If either is heavy, it still affects overall network efficiency.
This raises a question: can the pooled feature map really become very thin?
We ran some verification experiments. First, an experiment on the original R-FCN: we tried to compress the score map to 10 × P, which is equivalent to compressing more than 3900 channels down to 490. We found almost no drop in accuracy. In a series of experiments on VOC, compressing to 10 × P loses nothing; on COCO, it drops only a few tenths of a point, and pressing further down to 5 × P also costs only a few tenths of a point.
Of course, after the feature map is squashed this thin, the final result can no longer be produced by average voting. Instead, a fully connected layer mapping to the 81 classes is appended at the end to produce the final result.
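The compression numbers above can be made concrete. This is a sketch of the channel arithmetic only, taking P = 7 × 7 = 49 as in the talk, not the paper's code:

```python
# Channel arithmetic for thinning the R-FCN score map, with P = 7*7 = 49.
num_classes = 81          # 80 COCO classes + background
p = 7                     # position-sensitive grid size
P = p * p                 # 49

rfcn_channels = num_classes * P   # 81 * 49 = 3969, "more than 3900"
thin_10p = 10 * P                 # 490
thin_5p = 5 * P                   # 245

# After squashing, class-wise average voting is no longer possible, so a
# small fully connected layer maps the pooled 490-d (10 * 7 * 7) feature
# to the 81 classes instead.
fc_weights = thin_10p * num_classes  # 490 * 81 = 39690, tiny

print(rfcn_channels, thin_10p, thin_5p, fc_weights)
```

So the thinned score map has roughly one eighth (10 × P) or one sixteenth (5 × P) of the original channels, and the fc layer that replaces voting adds only on the order of forty thousand weights.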
B1 in the table is the original R-FCN baseline at 32.1; directly shrinking to a 10 × P feature map gives 31.4, a drop of only a few tenths of a point.
B2 is our reproduced R-FCN baseline, which uses the settings from the FPN paper and, in addition, multiplies the R-CNN regression loss by two. Based on B2, even when reduced to 10 × P, mmAP drops only by a few tenths of a point. In fact, if we reduce the channels further to 5 × P, the result is slightly below 10 × P, with a loss of less than 0.2.
I also wanted to run a corresponding feature-map-thinning experiment on Faster R-CNN. However, that experiment cannot be compared directly. Why? Because the two fully connected layers of Faster R-CNN's second stage carry a large amount of computation, you cannot simply thin the pooled feature map; you must also cut down the computation of the second stage. If the pooled feature is thin but a thick second stage is attached, the network becomes unbalanced. The comparison with Faster R-CNN is therefore made against the result after adding the cheap R-CNN subnet described below.
The second part of the head, i.e., the prediction part of the second stage: we introduce a single additional fully connected layer of about 2048 channels. The combination of the two is the so-called Light-Head part of the paper. At this point, the head has become much more flexible and controllable. Because the pooled feature map is very thin, we can also afford a larger convolution kernel when producing it, which brings a small additional performance improvement.
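The saving from a thin pooled map plus a single fc can be sketched with a per-ROI parameter count. The heavy-head sizes below (2048 input channels, two 4096-d fc layers) are typical Faster R-CNN-style values assumed for illustration, not figures from the paper:

```python
# Per-ROI head parameter comparison (illustrative, assumed sizes).
p = 7  # pooled grid size

# Heavy head, Faster R-CNN style: a pooled 7x7 map with many channels,
# flattened into two large fully connected layers (assumed 4096-d each).
heavy_pooled = p * p * 2048                     # 100352-d vector
heavy_fc = heavy_pooled * 4096 + 4096 * 4096    # ~428M weights

# Light head: a pooled 7x7 map with only 10 channels, one 2048-d fc.
light_pooled = p * p * 10                       # 490-d vector
light_fc = light_pooled * 2048                  # ~1M weights

print(heavy_fc, light_fc, heavy_fc // light_fc)
```

Under these assumptions the per-ROI fc computation shrinks by a factor of several hundred, which is why the second stage stops being the bottleneck.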
This table compares against current baselines. Our result is 39.5 with single-scale training and 40.8 with multi-scale training, exceeding all previous state-of-the-art results, for example RetinaNet and Mask R-CNN. We also tried adding a feature pyramid, which gives about 41.5.
Let me describe how this result was obtained. The baseline reaches 37.7; switching pooling to an align operation adds about 1.3 points; changing the NMS threshold used during training from 0.3 to 0.5 adds about 0.5; and multi-scale training adds about another point, giving the final 40.8.
The original intention of our design is to make the second stage more flexible and controllable, so I also tried another direction: replacing the base model with a smaller one. We designed an Xception-like network of about 145M FLOPs.
There is basically no big difference between the large-model and small-model settings. The main differences are that in the fifth stage the large model uses the atrous algorithm while the small model does not, and the small model cuts the RPN convolution down to 256 channels. The small-model results are as follows: Light-Head R-CNN is more efficient than all the fast models, including SSD and YOLO.
In fact, Light-Head R-CNN can be cut further. We ran some additional experiments: cutting the pooled feature map channels to 5 × P leaves the result unchanged, and the kernel need not be as large as 15; reducing it to 7 does not hurt either. Even discarding the large kernel entirely in favor of a 1 × 1 convolution costs less than one point of precision.
Therefore, with Light-Head, the complexity of the second stage (excluding the base model) is much more controllable and flexible. Here is an example comparing the results of the large and small models.
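The thin score map in the paper is produced with a large separable convolution (a k × 1 kernel followed by a 1 × k kernel) rather than a dense k × k one. The parameter saving can be sketched as follows; the input and intermediate channel counts here are assumed illustrative values, not taken from the paper:

```python
# Parameter count: dense 15x15 conv vs. a large separable convolution
# (k x 1 followed by 1 x k), producing a thin 10 * 49 = 490-channel map.
# c_in and c_mid are assumed, illustrative values.
k = 15          # large kernel size discussed in the talk
c_in = 2048     # e.g. a ResNet conv5 output width (assumed)
c_mid = 256     # assumed intermediate width of the separable pair
c_out = 10 * 49 # thin score map, 490 channels

dense_params = k * k * c_in * c_out                          # ~226M
separable_params = (k * c_in * c_mid) + (k * c_mid * c_out)  # ~9.7M

print(dense_params, separable_params)
```

Under these assumptions the separable form is over twenty times cheaper than a dense 15 × 15 kernel, which is what makes a large kernel affordable on top of a thin output map.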
Finally, here is a video comparing detections from the large and small models.
△ Xception-145M △ ResNet-101
Recruitment:
Students who hope to join Face++ and climb the peaks of computer vision can send a resume to Yu Gang at yugang@megvii.com; algorithm interns are recruited year-round, and an excellent internship can lead to joining the Megvii Research Institute without further interviews.
Q&A
In terms of speed, does simplifying the head bring more improvement, or simplifying the base model?
If you use a large network such as ResNet-101, shrinking the head will not speed the network up significantly, because most of the computation is in the base model. In that case, you should first cut down the base model, for example to a very efficient Xception-like 145M network. If the base model is already small, then you need to simplify the head. Which of the two helps more depends on the situation.
Will the code be open-sourced?
It will definitely be open-sourced. All the large-model results of the experiments are ready, but a few small details of the model still need debugging, because there are some differences between TensorFlow and Face++'s internal platform.
Why is Light-Head better than other two-stage detectors?
Light-Head's results are higher than those of two-stage detectors such as R-FCN and Faster R-CNN. How can that be? Compared with R-FCN, we added a cheap sub-network with only one fully connected layer in the second stage, which raises the result by about 1.8 points. We also added a relatively large kernel: because our pooled feature map is thin, we can afford a relatively large kernel size, which raises the result by about another 0.6 points (it would help Faster R-CNN as well).
How many ROIs are used?
We use one thousand ROIs at test time. Light-Head R-CNN's sensitivity to the number of ROIs lies between R-FCN's and Faster R-CNN's, because R-FCN puts essentially no computation in the second stage, while Faster R-CNN piles computation onto it. Our second stage is very lightweight, but its cost is not zero.
If ROI pooling is used, does that conflict with the original motivation of R-FCN's position sensitivity?
If you use ROI pooling and directly vote for the final result, it does not work well, because there is no position sensitivity; but Light-Head has a lightweight fc layer that processes the global position information.
Related Learning Resources
The above is all the content shared by Li Zeming of the Megvii (Face++) Research Institute. Reply "171226" in the QbitAI public account to get the full PPT and the video playback link.
Review of the first session, on object detection: the Megvii (Face++) Research Institute's interpretation of its COCO 2017 object detection algorithm
Review of the second session, on human pose estimation: a detailed explanation of the Megvii (Face++) Research Institute's COCO 2017 human pose estimation champion paper
-End-