Recently I have been reading face recognition surveys and several classic papers. Here I briefly record how I reproduced MTCNN with TensorFlow. Face detection is a sub-field of object detection: general object detection is narrowed down to faces, so multi-class classification becomes binary classification, and many of the targets are small. Face detection can also include key-point (landmark) detection, which can be used to increase the recall rate. The main ideas come from general object detection, but besides the YOLO and R-CNN families there is an additional family of algorithms built on cascaded networks, and MTCNN is one of its representatives. I have previously written about the two common approaches to general object detection. This time I will focus on MTCNN, the cascade-network approach.
My approach to reading papers is as follows. First, I read a survey to sort out the context and find the major breakthroughs among the classic methods. Then I search for blog posts about a method to understand the general idea. After that I read the paper itself, because it has more details, and finally I read the source code alongside the paper and reproduce it. The source code here was written by others, not by me.
For more information, see blog 80820255.
https://zhuanlan.zhihu.com/p/31761796
MTCNN consists of three networks: P-Net, R-Net, and O-Net. First, let's look at the overall flow of the prediction part (I only reproduced the source code of the prediction part, but I will also touch on the loss functions).
1. Stage 1
Unlike in training, the input image at prediction time is not a fixed size. Instead, it is rescaled repeatedly to build an image pyramid, that is, a series of images whose smallest size is 12*12. P-Net is fully convolutional, so it can be viewed as sliding a 12*12 window with stride 2 over each image and predicting, for every window, a face classification and a bounding-box offset (no face key points at this stage). Each 1*1 cell of the network output maps back to a 12*12 receptive field in the input image, so one pass produces a whole grid of predictions. This is why the training input size is 12*12. The network definition itself is not listed here; it is just the network structure and is easy to understand. The preprocessing code is as follows:
    import numpy as np

    # Feed the whole source image in as an image pyramid so that faces of
    # different sizes can be found, which improves accuracy.
    factor_count = 0
    total_boxes = np.empty((0, 9))
    points = []
    h = img.shape[0]
    w = img.shape[1]
    minl = np.amin([h, w])
    m = 12.0 / minsize
    minl = minl * m
    # create scale pyramid
    scales = []
    while minl >= 12:
        scales += [m * np.power(factor, factor_count)]
        minl = minl * factor
        factor_count += 1

    # First stage: send every pyramid image (shorter side at least 12) through
    # P-Net and collect the boxes it returns. `scale` is the resize ratio, used
    # later to map coordinates back to the source image.
    for j in range(len(scales)):
        scale = scales[j]
        hs = int(np.ceil(h * scale))
        ws = int(np.ceil(w * scale))
        im_data = imresample(img, (hs, ws))
        # The input is not 12*12 here, but the receptive field of each output
        # cell is 12, so every image yields a grid of predictions.
        im_data = (im_data - 127.5) * 0.0078125
        img_x = np.expand_dims(im_data, 0)
        img_y = np.transpose(img_x, (0, 2, 1, 3))
        out = pnet(img_y)  # run P-Net
        out0 = np.transpose(out[0], (0, 2, 1, 3))  # box offset regression, 4 channels
        out1 = np.transpose(out[1], (0, 2, 1, 3))  # face / non-face probability, 2 channels
        # With a 12*12 window and stride 2, out0 has shape of roughly
        # (1, (hs - 12) / 2 + 1, (ws - 12) / 2 + 1, 4); out1 is the same with 2 channels.
        # Turn the heatmap into the candidate boxes of this pyramid level.
        boxes, _ = generateBoundingBox(out1[0, :, :, 1].copy(), out0[0, :, :, :].copy(), scale, threshold[0])
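To make the pyramid concrete, here is a minimal standalone sketch of the scale computation above. The function name `pyramid_scales` and the values `minsize=20` and `factor=0.709` are my own assumptions for illustration (they are commonly used settings, but the code above takes both as parameters).

```python
import math

def pyramid_scales(h, w, minsize=20, factor=0.709):
    """Return the resize ratios of the image pyramid for an h x w image.

    minsize and factor are assumed example values, not fixed by MTCNN.
    """
    m = 12.0 / minsize        # maps a face of `minsize` pixels onto the 12x12 window
    minl = min(h, w) * m      # shorter side after that first rescale
    scales = []
    while minl >= 12:         # stop once the image is smaller than one window
        scales.append(m * factor ** len(scales))
        minl *= factor
    return scales

scales = pyramid_scales(250, 250)
for s in scales:
    side = math.ceil(250 * s)
    print(f"scale={s:.4f} -> {side}x{side}")
```

Each level shrinks the image by `factor`, so a face that is too large for the 12*12 window at one level fits it a few levels further down.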
The following is the generateBoundingBox code. For every 12*12 window of the first stage whose score passes the threshold, it outputs the window's coordinates, its offsets, and its prediction score.
    def generateBoundingBox(imap, reg, scale, t):
        # imap: face-probability heatmap, reg: box offsets,
        # scale: resize ratio of this pyramid level, t: score threshold
        # Use heatmap to generate bounding boxes
        stride = 2
        cellsize = 12

        # Transposing is very common in object detection code; it makes the
        # coordinate arithmetic below easier.
        imap = np.transpose(imap)
        dx1 = np.transpose(reg[:, :, 0])
        dy1 = np.transpose(reg[:, :, 1])
        dx2 = np.transpose(reg[:, :, 2])
        dy2 = np.transpose(reg[:, :, 3])
        # Keep the cells whose probability reaches the threshold. Every cell has a
        # predicted probability and four offsets; y and x are the cell indices.
        y, x = np.where(imap >= t)
        if y.shape[0] == 1:
            dx1 = np.flipud(dx1)
            dy1 = np.flipud(dy1)
            dx2 = np.flipud(dx2)
            dy2 = np.flipud(dy2)
        score = imap[(y, x)]  # face probability of the cells that passed the threshold
        reg = np.transpose(np.vstack([dx1[(y, x)], dy1[(y, x)], dx2[(y, x)], dy2[(y, x)]]))  # their offsets
        if reg.size == 0:
            reg = np.empty((0, 3))
        bb = np.transpose(np.vstack([y, x]))
        # Why * 2 + 1? Should it be * 2 + 4? q1 and q2 should be the upper-left
        # and lower-right corners of each box in the source image.
        q1 = np.fix((stride * bb + 1) / scale)
        q2 = np.fix((stride * bb + cellsize - 1 + 1) / scale)
        boundingbox = np.hstack([q1, q2, np.expand_dims(score, 1), reg])
        # Returns the coordinates, score, and offsets of each 12*12 window.
        return boundingbox, reg
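On the "* 2 + 1" question, a small worked example may help. The sketch below maps a single heatmap cell back to the source image with the same arithmetic; `cell_to_box` is a hypothetical helper of mine, not part of the original code. My reading, which is an assumption, is that the cell with index i covers pixels 2*i to 2*i + 11 of the resized image and that the extra +1 shifts this to 1-based pixel positions, which would be consistent with the original MATLAB implementation.

```python
import numpy as np

STRIDE = 2      # overall downsampling of P-Net
CELLSIZE = 12   # receptive field of one output cell

def cell_to_box(ix, iy, scale):
    """Map the heatmap cell (ix, iy) to (x1, y1, x2, y2) in source-image pixels."""
    cell = np.array([ix, iy])
    top_left = np.fix((STRIDE * cell + 1) / scale)
    bottom_right = np.fix((STRIDE * cell + CELLSIZE - 1 + 1) / scale)
    return tuple(int(v) for v in np.concatenate([top_left, bottom_right]))

# Cell (10, 5) on a pyramid level that was resized by 0.5:
box = cell_to_box(10, 5, 0.5)
print(box)
```

At scale 0.5 the 12-pixel window covers about 24 pixels of the source image, which is how the smaller pyramid levels end up detecting the larger faces.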
Next, the candidate boxes are filtered with NMS, corrected with the predicted offsets, made square, and clipped to the image to produce the proposal coordinates. These refined boxes are then cropped out as input to the second network. The code is as follows:
        # (still inside the per-scale loop of stage 1)
        # Inter-scale NMS: run NMS on the boxes of this pyramid level.
        pick = nms(boxes.copy(), 0.5, 'Union')
        if boxes.size > 0 and pick.size > 0:
            boxes = boxes[pick, :]
            total_boxes = np.append(total_boxes, boxes, axis=0)

    numbox = total_boxes.shape[0]  # number of boxes left after filtering
    if numbox > 0:
        pick = nms(total_boxes.copy(), 0.7, 'Union')  # raise the threshold and run NMS again
        total_boxes = total_boxes[pick, :]
        regw = total_boxes[:, 2] - total_boxes[:, 0]
        regh = total_boxes[:, 3] - total_boxes[:, 1]
        qq1 = total_boxes[:, 0] + total_boxes[:, 5] * regw
        qq2 = total_boxes[:, 1] + total_boxes[:, 6] * regh
        qq3 = total_boxes[:, 2] + total_boxes[:, 7] * regw
        qq4 = total_boxes[:, 3] + total_boxes[:, 8] * regh
        # Columns in order: upper-left corner, lower-right corner, score.
        total_boxes = np.transpose(np.vstack([qq1, qq2, qq3, qq4, total_boxes[:, 4]]))
        total_boxes = rerec(total_boxes.copy())  # make the boxes square
        total_boxes[:, 0:4] = np.fix(total_boxes[:, 0:4]).astype(np.int32)  # truncate toward zero
        # Clip the coordinates so that they do not exceed the image size.
        dy, edy, dx, edx, y, ey, x, ex, tmpw, tmph = pad(total_boxes.copy(), w, h)
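The NMS helper called above is not listed in this post. As a rough illustration of what such a function does, here is a minimal greedy NMS sketch with a 'Union' (intersection over union) mode and a 'Min' (intersection over the smaller box) mode. `nms_sketch` is my own stand-in written under those assumptions, not the implementation from the source code.

```python
import numpy as np

def nms_sketch(boxes, threshold, method="Union"):
    """Greedy NMS over rows of (x1, y1, x2, y2, score); returns kept row indices."""
    if boxes.size == 0:
        return np.empty(0, dtype=int)
    x1, y1, x2, y2, score = (boxes[:, i] for i in range(5))
    area = (x2 - x1 + 1) * (y2 - y1 + 1)
    order = np.argsort(score)[::-1]  # highest score first
    keep = []
    while order.size > 0:
        i, rest = order[0], order[1:]
        keep.append(i)
        # Intersection of box i with every remaining box.
        iw = np.maximum(0.0, np.minimum(x2[i], x2[rest]) - np.maximum(x1[i], x1[rest]) + 1)
        ih = np.maximum(0.0, np.minimum(y2[i], y2[rest]) - np.maximum(y1[i], y1[rest]) + 1)
        inter = iw * ih
        if method == "Min":
            overlap = inter / np.minimum(area[i], area[rest])
        else:
            overlap = inter / (area[i] + area[rest] - inter)
        order = rest[overlap <= threshold]  # drop boxes that overlap box i too much
    return np.array(keep, dtype=int)

boxes = np.array([
    [10, 10, 50, 50, 0.9],
    [12, 12, 52, 52, 0.8],      # heavy overlap with the first box
    [100, 100, 140, 140, 0.7],  # far away from both
])
print(nms_sketch(boxes, 0.5))   # the second box is suppressed

nested = np.array([
    [10, 10, 50, 50, 0.9],
    [20, 20, 30, 30, 0.8],      # small box inside the first one
])
print(nms_sketch(nested, 0.7, "Union"), nms_sketch(nested, 0.7, "Min"))
```

The nested example shows the difference between the modes: a small box inside a large one has a low IoU and survives 'Union', while 'Min' removes it.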
2. Stage 2
Network structure: the input size must be 24*24. The input images are the proposals generated by stage 1. This is similar to the first step of Faster R-CNN: most of the pure background is filtered out first so that irrelevant regions are removed, which is the point of the cascade. Note that the network outputs only a score and a box-coordinate correction; there is no key-point information. The code below is similar to stage 1.
    numbox = total_boxes.shape[0]
    if numbox > 0:
        # Second stage: crop the boxes from stage 1 out of the source image,
        # resize them, and feed them to R-Net.
        tempimg = np.zeros((24, 24, 3, numbox))
        for k in range(0, numbox):
            tmp = np.zeros((int(tmph[k]), int(tmpw[k]), 3))
            tmp[dy[k] - 1:edy[k], dx[k] - 1:edx[k], :] = img[y[k] - 1:ey[k], x[k] - 1:ex[k], :]
            if tmp.shape[0] > 0 and tmp.shape[1] > 0 or tmp.shape[0] == 0 and tmp.shape[1] == 0:
                tempimg[:, :, :, k] = imresample(tmp, (24, 24))  # resize each proposal to 24*24
            else:
                return np.empty(0)  # np.empty needs a shape argument
        tempimg = (tempimg - 127.5) * 0.0078125
        tempimg1 = np.transpose(tempimg, (3, 1, 0, 2))
        out = rnet(tempimg1)  # run R-Net
        out0 = np.transpose(out[0])  # predicted box coordinate offsets
        out1 = np.transpose(out[1])  # prediction scores
        score = out1[1, :]
        # total_boxes holds the boxes that survived stage 1. Their coordinates are
        # in the source image, not in the resized 24*24 crop, so the offsets can be
        # applied to the source-image coordinates directly.
        ipass = np.where(score > threshold[1])
        total_boxes = np.hstack([total_boxes[ipass[0], 0:4].copy(), np.expand_dims(score[ipass].copy(), 1)])
        mv = out0[:, ipass[0]]  # offsets predicted in stage 2
        if total_boxes.shape[0] > 0:
            pick = nms(total_boxes, 0.7, 'Union')  # NMS first
            total_boxes = total_boxes[pick, :]
            total_boxes = bbreg(total_boxes.copy(), np.transpose(mv[:, pick]))  # coordinates after applying the offsets
            total_boxes = rerec(total_boxes.copy())  # make the boxes square
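The bbreg and rerec helpers are only called here, not shown. The following sketch illustrates the two steps under my own assumptions: offsets are treated as fractions of the box width and height, and a box is made square by growing it around its centre to the longer side. `apply_offsets` and `to_square` are hypothetical names, not the functions from the source code.

```python
import numpy as np

def apply_offsets(boxes, reg):
    """Shift each corner by an offset given as a fraction of box width/height."""
    out = boxes.astype(float).copy()
    w = out[:, 2] - out[:, 0] + 1
    h = out[:, 3] - out[:, 1] + 1
    out[:, 0] += reg[:, 0] * w
    out[:, 1] += reg[:, 1] * h
    out[:, 2] += reg[:, 2] * w
    out[:, 3] += reg[:, 3] * h
    return out

def to_square(boxes):
    """Grow each box around its centre into a square with side max(w, h)."""
    out = boxes.astype(float).copy()
    w = out[:, 2] - out[:, 0]
    h = out[:, 3] - out[:, 1]
    side = np.maximum(w, h)
    out[:, 0] += w * 0.5 - side * 0.5
    out[:, 1] += h * 0.5 - side * 0.5
    out[:, 2] = out[:, 0] + side
    out[:, 3] = out[:, 1] + side
    return out

boxes = np.array([[10.0, 20.0, 49.0, 99.0]])   # a tall 40 x 80 box
reg = np.array([[0.1, 0.0, -0.1, 0.05]])       # made-up offsets for illustration
refined = apply_offsets(boxes, reg)
squared = to_square(refined)
print(refined, squared)
```

In this example the squared box extends past the left image border (negative x1), which is why the coordinates still have to be clipped by pad before cropping.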
3. Stage 3
There is not much to say about the third network; the process is the same as in stage 2, except that it additionally predicts the key points. This produces the final result.
    numbox = total_boxes.shape[0]
    if numbox > 0:
        # Third stage: as in stage 2, feed the boxes that survived stage 2
        # into the third network.
        total_boxes = np.fix(total_boxes).astype(np.int32)
        dy, edy, dx, edx, y, ey, x, ex, tmpw, tmph = pad(total_boxes.copy(), w, h)
        tempimg = np.zeros((48, 48, 3, numbox))
        for k in range(0, numbox):
            tmp = np.zeros((int(tmph[k]), int(tmpw[k]), 3))
            tmp[dy[k] - 1:edy[k], dx[k] - 1:edx[k], :] = img[y[k] - 1:ey[k], x[k] - 1:ex[k], :]
            if tmp.shape[0] > 0 and tmp.shape[1] > 0 or tmp.shape[0] == 0 and tmp.shape[1] == 0:
                tempimg[:, :, :, k] = imresample(tmp, (48, 48))
            else:
                return np.empty(0)  # np.empty needs a shape argument
        tempimg = (tempimg - 127.5) * 0.0078125
        tempimg1 = np.transpose(tempimg, (3, 1, 0, 2))
        out = onet(tempimg1)
        out0 = np.transpose(out[0])  # box offsets
        out1 = np.transpose(out[1])  # key points
        out2 = np.transpose(out[2])  # scores
        score = out2[1, :]
        points = out1
        ipass = np.where(score > threshold[2])
        points = points[:, ipass[0]]
        total_boxes = np.hstack([total_boxes[ipass[0], 0:4].copy(), np.expand_dims(score[ipass].copy(), 1)])
        mv = out0[:, ipass[0]]

        # Map the key points from box-relative values to source-image coordinates.
        w = total_boxes[:, 2] - total_boxes[:, 0] + 1
        h = total_boxes[:, 3] - total_boxes[:, 1] + 1
        points[0:5, :] = np.tile(w, (5, 1)) * points[0:5, :] + np.tile(total_boxes[:, 0], (5, 1)) - 1
        points[5:10, :] = np.tile(h, (5, 1)) * points[5:10, :] + np.tile(total_boxes[:, 1], (5, 1)) - 1
        if total_boxes.shape[0] > 0:
            total_boxes = bbreg(total_boxes.copy(), np.transpose(mv))
            pick = nms(total_boxes.copy(), 0.7, 'Min')
            total_boxes = total_boxes[pick, :]
            points = points[:, pick]

    return total_boxes, points  # the final predictions
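The key-point step is the only part that is new in stage 3, so here it is in isolation. The sketch assumes, as the slicing above suggests, that rows 0 to 4 of the points array hold the five x values and rows 5 to 9 the five y values, each as a fraction of the box size. `landmarks_to_image` is a hypothetical helper that uses broadcasting instead of np.tile.

```python
import numpy as np

def landmarks_to_image(points, boxes):
    """Convert box-relative key points to source-image coordinates.

    points: (10, n) array, rows 0-4 are x fractions, rows 5-9 are y fractions.
    boxes:  (n, 4 or more) array with columns x1, y1, x2, y2, ...
    """
    w = boxes[:, 2] - boxes[:, 0] + 1
    h = boxes[:, 3] - boxes[:, 1] + 1
    out = np.empty_like(points, dtype=float)
    out[0:5, :] = w * points[0:5, :] + boxes[:, 0] - 1    # x coordinates
    out[5:10, :] = h * points[5:10, :] + boxes[:, 1] - 1  # y coordinates
    return out

box = np.array([[100.0, 50.0, 199.0, 149.0, 0.99]])       # one 100 x 100 box
rel = np.vstack([np.full((5, 1), 0.5), np.full((5, 1), 0.25)])
print(landmarks_to_image(rel, box).ravel())
```

A point at half the box width and a quarter of the box height lands at x = 149, y = 74 in the source image.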
Only the prediction code is covered here. Training differs from other object detectors in that the five detected key points are also included in the loss function, which can improve detection accuracy.
This part is only a rough outline.
1. Face classification loss: cross-entropy loss.
2. Bounding-box regression loss: squared loss.
3. Key-point loss: also a squared loss.
4. Joint training and the overall loss: the three losses are combined with weights that differ from network to network.
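The four items above can be sketched in a few lines of NumPy. This is only an illustration, not training code: the task weights are the ones I recall from the MTCNN paper (1, 0.5, 0.5 for P-Net and R-Net; 1, 0.5, 1 for O-Net) and should be treated as assumptions, as should the function names and the `use` flags that switch a loss off for sample types that lack the corresponding label.

```python
import numpy as np

def face_cls_loss(p, y):
    """Cross-entropy for label y in {0, 1}; p is the predicted face probability."""
    eps = 1e-12  # avoids log(0)
    return -(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

def squared_loss(pred, target):
    """Squared loss, used for both the box offsets and the key points."""
    return float(np.sum((np.asarray(pred) - np.asarray(target)) ** 2))

# (detection, box, key-point) weights per network; assumed values, see above.
TASK_WEIGHTS = {
    "pnet": (1.0, 0.5, 0.5),
    "rnet": (1.0, 0.5, 0.5),
    "onet": (1.0, 0.5, 1.0),
}

def total_loss(net, cls, box, landmark, use=(1, 1, 1)):
    """Weighted sum of the three losses for one sample."""
    a_det, a_box, a_lmk = TASK_WEIGHTS[net]
    b_det, b_box, b_lmk = use  # 0 switches a loss off for this sample
    return a_det * b_det * cls + a_box * b_box * box + a_lmk * b_lmk * landmark

print(face_cls_loss(0.5, 1))
print(total_loss("onet", 1.0, 2.0, 3.0))
print(total_loss("pnet", 1.0, 2.0, 3.0, use=(1, 0, 0)))  # e.g. a pure background sample
```

O-Net gives the key-point loss a larger weight in this sketch because it is the only stage whose key-point output is used at prediction time.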