Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You Only Look Once: Unified, Real-Time Object Detection. In: CVPR (2016)
As its name implies, YOLO only looks at an image once: the authors cast object detection as a single regression problem covering both target region prediction and target category prediction. The method uses a single neural network to predict bounding boxes and class probabilities directly from the image, realizing end-to-end object detection. As a result, detection speed improves greatly, to 45 frames per second, while the fast version (Fast YOLO, which has fewer convolutional layers) reaches 155 frames per second.
Compared with the best current systems, YOLO localizes targets less precisely, but it produces fewer false positives on the background than the current best methods.
1. Introduction
Humans glance at an image and immediately know what objects are in it, where they are, and how they interact. The human visual system is fast and accurate, enabling us to perform complex tasks such as driving a car.
Traditional object detection systems repurpose classifiers to perform detection: to detect objects, they evaluate a classifier at different locations and scales of the test image. Systems based on deformable parts models (DPM) propose target regions with a sliding-window approach and then apply a classifier for recognition. The more recent R-CNN family uses region proposal methods: first generate potential bounding boxes, then run a classifier on each proposed region, and finally apply post-processing to remove duplicate bounding boxes and refine the results. Such pipelines are complex, slow, and hard to train.
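The duplicate-removal post-processing mentioned above is typically non-maximum suppression (NMS). A minimal sketch, not the paper's implementation (boxes assumed in (x1, y1, x2, y2) corner format):

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes that
    overlap it too much, and repeat on the remainder."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 10, 10), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # the second box overlaps the first and is dropped
```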
We recast object detection as a single regression problem, straight from image pixels to bounding box coordinates and class probabilities, so that the target's category and location can be detected at a glance (You Only Look Once, YOLO).
YOLO is refreshingly simple: a single convolutional neural network simultaneously predicts multiple bounding boxes and their class probabilities. Compared with traditional object detection methods, this unified model has the following advantages:
- Very fast. YOLO's prediction pipeline is simple and fast. Our base version runs at 45 frames/s on a Titan X GPU, and the fast version runs at 155 frames/s. As a result, YOLO can perform real-time detection.
- YOLO uses full-image information to make predictions. Unlike sliding-window and region-proposal-based methods, YOLO sees the entire image during training and prediction. Fast R-CNN, which cannot see the global image during detection, mistakes background patches for targets; YOLO makes less than half as many background errors as Fast R-CNN.
- YOLO can learn generalizable representations of targets, which gives it a degree of universality. We train YOLO on natural images and then test it on artwork; YOLO is far more accurate there than other detection methods such as DPM and R-CNN.
In accuracy, the YOLO algorithm still lags behind the most advanced detection systems. Although it can quickly identify objects in an image, it struggles to precisely localize some objects, especially small ones.
2. Unified Detection
The authors unify the separate components of object detection into a single neural network. The network uses information from the whole image to predict the target's bounding boxes, realizing an end-to-end, real-time object detection task.
As shown in Figure 2-1, YOLO first divides the image into an S×S grid (grid cells). If the center of a target falls into a grid cell, that cell is responsible for detecting the target. Each grid cell predicts B bounding boxes and a confidence score for each box. The confidence score reflects how confident the model is that the box contains a target; we define it as Pr(Object) × IOU(pred, truth). If no target exists in the cell, the confidence should be zero; otherwise, we want the predicted confidence to equal the intersection over union (IOU) between the predicted box and the ground truth.
Each bounding box consists of 5 predictions: x, y, w, h, and confidence. (x, y) is the center of the box relative to the bounds of its grid cell. (w, h) are the width and height of the box relative to the whole image. The confidence prediction represents the IOU between the predicted box and the ground-truth box.
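The grid assignment and box parameterization above can be sketched as follows. This is a hedged illustration, not the paper's code; the 448×448 input size is the one YOLO uses, and `encode_target` is a hypothetical helper name:

```python
def encode_target(cx, cy, bw, bh, img_w, img_h, S=7):
    """Map a ground-truth box (center and size in pixels) to YOLO's
    parameterization: the responsible grid cell (row, col), the center
    offsets (x, y) within that cell in [0, 1), and the width/height
    (w, h) normalized by the full image size."""
    col = int(cx / img_w * S)          # which cell the center falls into
    row = int(cy / img_h * S)
    x = cx / img_w * S - col           # center offset inside the cell
    y = cy / img_h * S - row
    w = bw / img_w                     # size relative to the whole image
    h = bh / img_h
    return (row, col), (x, y, w, h)

# A 112x112 box centered in a 448x448 image lands in cell (3, 3),
# halfway across that cell, with normalized size 0.25 x 0.25.
cell, box = encode_target(224, 224, 112, 112, 448, 448)
print(cell, box)  # (3, 3) (0.5, 0.5, 0.25, 0.25)
```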
Each grid cell also predicts C conditional class probabilities, Pr(Class_i | Object): the probability of each class given that the cell contains a target. Each grid cell predicts only one set of class probabilities, regardless of the number of boxes B. At test time, each box's confidence is multiplied by the conditional class probabilities to obtain class-specific confidence scores: Pr(Class_i | Object) × Pr(Object) × IOU(pred, truth) = Pr(Class_i) × IOU(pred, truth).
This score encodes both the probability that the class appears in the box and how well the predicted box fits the target. For evaluation on the PASCAL VOC dataset, we use S=7, B=2, and C=20 (the dataset contains 20 categories), so the final prediction is a 7×7×30 tensor.
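The score computation over the 7×7×30 output tensor can be sketched with NumPy. The per-cell layout assumed here (B boxes of 5 values first, then C class probabilities) follows the paper's description; the network output is stubbed with random values:

```python
import numpy as np

S, B, C = 7, 2, 20
pred = np.random.rand(S, S, B * 5 + C)   # stand-in for the network output

# Per cell: the first B*5 values are (x, y, w, h, conf) for each box,
# the last C values are the conditional class probabilities Pr(class|object).
box_conf = pred[..., 4:B * 5:5]          # shape (S, S, B): conf of each box
class_prob = pred[..., B * 5:]           # shape (S, S, C)

# Class-specific confidence per box: Pr(class_i|object) * Pr(object) * IOU,
# computed by broadcasting box confidences against class probabilities.
scores = box_conf[..., :, None] * class_prob[..., None, :]
print(scores.shape)                      # (7, 7, 2, 20)
```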
Part 35: The YOLO algorithm for object detection