1. Object Localization
1.1 Introduction to Classification, Localization, and Detection
-Image classification
Image classification: given a picture, determine the category of the object in it, such as a car or a cat.
-Classification with localization
Classification with localization: determine not only the object's category but also its position, for example by drawing a box around it.
-Detection
Detection: there may be multiple objects in the picture, and all of them need to be found and localized.

1.2 Classification with Localization
1.2.1 Given a picture, we want to determine what the object is and where it is. The picture is fed into a convolutional neural network (CNN), which outputs both the category and the position. As shown above, given 4 classes: pedestrian, car, motorcycle, and background (no object), the network outputs class 2 (car) together with a position (bx, by, bh, bw). Taking the top-left corner of the picture as (0, 0) and the bottom-right corner as (1, 1), the vector (bx, by, bh, bw) determines the rectangular box: (bx, by) are the coordinates of the box's center, and bh and bw are the height and width of the box around the car.
1.2.2 Training data consists of pairs (x, y): x is the picture, say 32*32*3, and y is the label, which must encode both the classification and the position of the box, e.g. y = (pc, bx, by, bh, bw, c1, c2, c3). Here pc = 1 means the picture contains a target (pedestrian, car, or motorcycle), and pc = 0 means there is no target, i.e. a background picture. The components c1, c2, c3 indicate which of the three classes the target belongs to. For example, y = (1, 0.3, 0.6, 0.3, 0.4, 0, 1, 0) means the target is a car; y = (0, ?, ?, ?, ?, ?, ?, ?) means a background picture with no target, so the remaining components are "don't care" and need not be known.

2. Feature Point (Landmark) Detection
Feature points (landmarks) are points specified by hand. A picture is fed into a convolutional neural network whose output is extended to include the coordinates of these feature points.

2.1 Face Feature Point Detection
For a human face, suppose there are 64 feature points (l1x, l1y), (l2x, l2y), ..., (l64x, l64y), corresponding to the left outer eye corner, left inner eye corner, right inner eye corner, right outer eye corner, and so on. The output is y = (face, l1x, l1y, ..., l64x, l64y).

2.2 Pose Detection
For human posture, a number of feature points can likewise be specified by hand. Once these feature points are found, they can be used as input to train a model that recognizes different postures, such as exercising, lying down, or squatting.

3. Object Detection
Sliding window detection
The general idea: build a labeled training set (x, y) in which the target object fills the picture as fully as possible, and train a network on it whose output indicates whether the target is present, 1 or 0.
Then apply the trained network by sliding windows of different sizes over the image to be detected; whenever the network recognizes a window's contents, the target object in that region, such as a car, has been detected.
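The sliding-window procedure above can be sketched as follows. The classifier is a stand-in for the trained CNN (here a hypothetical callable `classify_window` returning 1 or 0); the window sizes and stride are illustrative choices:

```python
# Sketch of naive sliding-window detection. `classify_window` stands in for
# the trained CNN (hypothetical); it returns 1 if the crop contains the target.
# Square windows of several sizes are slid over the image with a fixed stride.

def sliding_windows(img_h, img_w, win_sizes, stride):
    """Yield (top, left, size) for every window position that fits."""
    for size in win_sizes:
        for top in range(0, img_h - size + 1, stride):
            for left in range(0, img_w - size + 1, stride):
                yield top, left, size

def detect(image, classify_window, win_sizes=(14, 28), stride=2):
    """Run the classifier on every window crop; return positive positions."""
    h, w = len(image), len(image[0])
    hits = []
    for top, left, size in sliding_windows(h, w, win_sizes, stride):
        crop = [row[left:left + size] for row in image[top:top + size]]
        if classify_window(crop) == 1:
            hits.append((top, left, size))
    return hits
```

Every crop goes through the same network independently; this is exactly the repeated computation that the convolutional implementation in Section 4 removes.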
4. Convolutional Implementation of Sliding Windows

4.1 Converting Fully Connected Layers into Convolutional Layers
As pictured above, the upper half is a typical LeNet-style network: convolution, pooling, flattening, fully connected layers, then the output. The lower half converts those fully connected layers into convolutional layers.
-For the 5*5*16 volume, convolving it with a single 5*5*16 filter gives a 1*1 output; using 400 such filters gives 1*1*400, which is equivalent to a fully connected layer with 400 units.
-To understand why a 5*5*16 convolution turns 5*5*16 into 1*1: think of the volume as 16 pictures of size 5*5. Convolving each 5*5 picture with a 5*5 filter collapses it to a single number, the 16 numbers are summed into one output, and 400 such filters therefore produce 1*1*400.
-The remaining fully connected layers are converted into convolutional layers in the same way.

4.2 Applying Sliding Window Detection
The simple principle is to slide a window of a certain size across the original picture until the target is detected. The idea is straightforward, but a naive implementation repeats many calculations. As shown in the figure:
-Suppose the training images are 14*14*3 and the output y represents the category. The picture to be detected is 16*16*3 and the stride is 2. In the naive algorithm, the 16*16*3 picture is cut into four 14*14*3 windows, each fed through the convolutional network above, giving 4 outputs. This repeats far too much computation: for example, with a 5*5*3 filter, the overlapping middle region is convolved 4 separate times in full.
-If the picture to be detected is 28*28*3, the same method repeats even more computation.

4.3 How to Improve

We can convolve the entire picture to be tested in one pass instead of convolving window by window, obtaining all the predicted values at once and eliminating the duplicated computation. First, train a convolutional neural network on the small (14*14*3) labeled pictures: x1 -> CNN1 -> y1. Sliding a 14*14*3 window over the 28*28*3 test image and running CNN1 on each window, x2 -> x1 -> CNN1 -> y2, repeats computation; instead, we can feed the whole 28*28*3 test image directly through the network, x2 -> CNN1 -> y3, avoiding a large amount of repeated calculation. The result y3 is not necessarily better than y2, but it is obtained much faster.

5. Bounding Box Prediction
Object detection must determine the bounding box while considering both accuracy and real-time performance. The YOLO algorithm, described below, addresses this.

5.1 YOLO Algorithm
Suppose the image to be detected is x (100*100*3). Divide the picture into a 3*3 grid. Following the classification-with-localization output format above, each cell's label is the 8-dimensional vector (pc, bx, by, bh, bw, c1, c2, c3). For each cell, pc = 1 if the center of a target falls inside that cell, otherwise pc = 0; (bx, by) is the target's center point relative to the cell, and (bh, bw) are the target's height and width relative to the cell size. The full output y is therefore 3*3*8. A convolutional neural network x -> CNN -> y trained this way can predict bounding boxes accurately; this is the network's doing, and the precise mechanism is hard to interpret.

5.2 YOLO Supplement
As shown in the figure, the output values (bh, bw) are the target's height and width as ratios to the cell size, so they can be greater than 1; but (bx, by) lie between 0 and 1, because the top-left corner of each cell is taken as (0, 0) and the bottom-right corner as (1, 1).
6. Intersection over Union (IoU)

Intersection over union is used to evaluate object detection algorithms. It is simple: IoU is the ratio of the area of the intersection of the predicted box and the ground-truth box to the area of their union. A prediction is generally judged correct when IoU is greater than 0.5.
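A minimal IoU implementation; the corner-coordinate box format (x1, y1, x2, y2) is an assumption, since the notes do not fix one:

```python
# Intersection over union of two axis-aligned boxes given as corner
# coordinates (x1, y1, x2, y2) with x1 < x2 and y1 < y2.

def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])   # intersection top-left
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])   # intersection bottom-right
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1/7 = 0.14285714285714285
```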
7. Non-Max Suppression

7.1 Algorithm Overview

For a picture to be detected, after applying the YOLO algorithm:
-discard all predicted boxes with pc < 0.6;
-among the remaining boxes, repeatedly select the box with the largest pc as a prediction, and discard any remaining box that has a high IoU with it.
7.2 Examples
For example, when detecting cars, first keep only the predicted boxes with pc greater than 0.6; then, comparing the remaining boxes, the box with pc = 0.8 is selected for the left car and the box with pc = 0.9 for the right car, and all other overlapping prediction boxes are deleted.
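The procedure from 7.1 can be sketched as follows (the pc threshold 0.6 is from the notes; the 0.5 IoU suppression threshold and corner-coordinate box format are assumptions):

```python
# Sketch of non-max suppression. Each detection is (pc, x1, y1, x2, y2).

def iou(a, b):
    """IoU of two corner-coordinate boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def non_max_suppression(dets, pc_thresh=0.6, iou_thresh=0.5):
    """Keep the highest-pc box, drop boxes overlapping it, repeat."""
    boxes = [d for d in dets if d[0] >= pc_thresh]   # step 1: discard low pc
    boxes.sort(key=lambda d: d[0], reverse=True)     # highest pc first
    kept = []
    while boxes:
        best = boxes.pop(0)                          # step 2: pick the max pc
        kept.append(best)
        boxes = [d for d in boxes
                 if iou(best[1:], d[1:]) < iou_thresh]
    return kept

# Two cars, each with a duplicate box: the 0.9 and 0.8 boxes survive.
dets = [(0.8, 0, 0, 2, 2), (0.7, 0.1, 0.1, 2.1, 2.1),
        (0.9, 5, 5, 7, 7), (0.65, 5.2, 5.2, 7.2, 7.2),
        (0.3, 3, 3, 4, 4)]
print([d[0] for d in non_max_suppression(dets)])  # [0.9, 0.8]
```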
8. Anchor Boxes
A grid cell as described above can detect only one target. Detecting multiple targets per cell requires another mechanism: anchor boxes.

8.1 Brief Example

The anchor boxes algorithm handles multiple objects by enlarging the output y. Suppose one cell must detect two objects, i.e. the center points of two objects, say a person and a car, fall in the same cell of the 3*3 grid; then y = (y1, y2), where yi = (pc, bx, by, bh, bw, c1, c2, c3) for i = 1, 2.
8.2 Anchor Boxes Algorithm
Previously each grid cell corresponded to one target; now a target is assigned not just to a grid cell but also to an anchor box, i.e. to the pair (grid cell, anchor box), choosing the anchor box with the highest IoU with the target's shape. With two anchor boxes, for example, the 3*3*8 output becomes 3*3*2*8.
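The "highest IoU" assignment can be sketched as follows (assumptions mine: shapes are compared by the IoU of boxes sharing the same center, which reduces to an intersection/union of (height, width) pairs; the anchor shapes are illustrative):

```python
# Sketch (assumptions mine): assign a ground-truth box to the anchor box
# whose shape it matches best, by IoU of same-centered boxes.

def shape_iou(hw1, hw2):
    """IoU of two boxes given as (height, width), assuming a shared center."""
    inter = min(hw1[0], hw2[0]) * min(hw1[1], hw2[1])
    union = hw1[0] * hw1[1] + hw2[0] * hw2[1] - inter
    return inter / union

def best_anchor(target_hw, anchors):
    """Index of the anchor with the highest shape IoU with the target."""
    return max(range(len(anchors)),
               key=lambda i: shape_iou(target_hw, anchors[i]))

# Two anchors in cell units: a tall one (person-like) and a wide one
# (car-like). A tall target is assigned to the first anchor slot.
anchors = [(2.0, 0.8), (0.8, 2.0)]
print(best_anchor((1.8, 0.7), anchors))  # 0
```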
9. YOLO Algorithm
Having covered the basic elements of object detection, we can now combine them into the YOLO algorithm:
-Input x (100*100*3); divide it into a 3*3 grid; there are 3 target classes (person, car, motorcycle) and two anchor boxes;
-Train the classification-with-localization convolutional neural network, applying non-max suppression with IoU at test time; each cell's output is y = (y1, y2), yi = (pc, bx, by, bh, bw, c1, c2, c3), i = 1, 2;
-The full output y has shape 3*3*16 (i.e. 3*3*2*8).
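Tying the pieces together, the 3*3*2*8 output can be decoded into image-coordinate detections before non-max suppression (a sketch under my own assumptions; the function name and pc threshold of 0.6 follow the earlier sections):

```python
# Sketch (names mine): the network output has shape grid x grid x anchors x 8.
# Decode every (cell, anchor) slot whose pc clears the threshold into an
# image-coordinate box; the results would then go through non-max suppression.

def decode_output(y, grid=3, pc_thresh=0.6):
    """y[row][col][a] = (pc, bx, by, bh, bw, c1, c2, c3); return detections."""
    dets = []
    for row in range(grid):
        for col in range(grid):
            for slot in y[row][col]:
                pc, bx, by, bh, bw = slot[:5]
                if pc < pc_thresh:
                    continue
                cx = (col + bx) / grid           # center, image coordinates
                cy = (row + by) / grid
                h, w = bh / grid, bw / grid      # size, image coordinates
                cls = max(range(3), key=lambda i: slot[5 + i])
                dets.append((pc, cx, cy, h, w, cls))
    return dets

# A 3*3*2*8 output with a single confident car (class index 1) whose center
# falls in the middle cell, first anchor slot.
empty = (0.0,) * 8
y = [[[empty, empty] for _ in range(3)] for _ in range(3)]
y[1][1][0] = (0.9, 0.5, 0.5, 0.6, 1.8, 0.0, 1.0, 0.0)
print(decode_output(y))
```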