"Paper reading" illuminating pedestrians via simultaneous Detection & segmentation

Paper sources: ICCV 2017, arXiv report, GitHub code (Caffe-MATLAB)

The problem addressed in this paper is pedestrian detection. The authors discuss how to apply semantic segmentation to pedestrian detection so that the detection rate improves without hurting detection efficiency. They propose a segmentation infusion network to enable joint supervision of semantic segmentation and pedestrian detection. Pedestrian detection is the main task; semantic segmentation mainly corrects and guides the feature generation of the shared layers. The "illuminating pedestrians" in the title refers to the fact that supervision from semantic segmentation makes the generated features focus more on pedestrians, which benefits the downstream pedestrian detection.

In addition, a key contribution of this paper is that the network is set up appropriately, i.e., the segmentation infusion layer is placed just right. As a result, even the semantic information of weak annotations is sufficient to improve performance.

Background Introduction

There are two main approaches in computer vision related to pedestrian detection: object detection and semantic segmentation. The two tasks are highly correlated but have their own pros and cons. Object detection can locate individual objects but rarely gives their boundaries, while semantic segmentation can locate object boundaries pixel by pixel but has great difficulty separating instances of the same class.

Naturally, we hope that knowledge from one task will make the other task easier. This has been achieved in some object detection research, yet in pedestrian detection there is very little such work. This is partly due to the lack of pixel-wise annotations in traditional pedestrian datasets.

For example, let's look at a few traditional datasets.

Caltech


Kitti

Introduction and use of the KITTI dataset: this dataset provides 3D bounding-box annotations (in the LiDAR coordinate system) for moving objects in the camera's field of view. The annotations are divided into 8 categories: 'Car', 'Van', 'Truck', 'Pedestrian', 'Person (sitting)', 'Cyclist', 'Tram', and 'Misc' (e.g., trailers, Segways).

The two datasets above provide only pedestrian bounding boxes, with no semantic information.

COCO


This is a general-purpose object detection dataset. As the figure shows, it provides rich labels, with both location information and semantic information.

Cityscapes



This dataset provides detailed semantic labeling of urban scenes, including semantic labels for pedestrians. The purpose of using it here is to promote the application of semantic segmentation in pedestrian detection, which is where the core of this paper lies.

Simultaneous Detection & Segmentation

There are two lines of work on simultaneous detection and segmentation. The first promotes both tasks at the same time, such as the well-known instance-aware segmentation.
Figure (d) above is typical of this kind of task: unlike plain semantic segmentation, it also requires separating instances of the same class on top of the segmentation. It is effectively segmentation + detection.

The second uses semantic segmentation as a strong cue to clearly improve object detection. Quite early on, I read a paper that combined face detection with facial landmark detection:

Joint Face Detection and Alignment Using Multi-task Cascaded Convolutional Networks

That paper uses facial landmark detection to boost face detection; after all, people often recognize a face by identifying its facial features. In object detection, semantic information can provide powerful features that facilitate detection and suppress background interference. However, some prior works require a separate segmentation network to run first. The framework in this paper overcomes that shortcoming by infusing semantic information into the shared feature maps, improving both accuracy and speed.

Method Parsing



The figure above shows the entire network structure, which contains two stages. The first stage uses an RPN to propose pedestrian candidate boxes and give preliminary scores. The second stage mines hard examples further and gives refined scores. Since the RPN already predicts pedestrian positions accurately enough, the second stage performs classification only, without bounding-box regression. The prediction scores of the two networks are then fused as the final classification score.
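To make the two-stage flow concrete, here is a minimal Python sketch of the pipeline. All names (rpn, bcn, crop, fuse) are hypothetical placeholders rather than the authors' Caffe/MATLAB code, and the multiplicative fusion is only an illustration.

```python
# Minimal sketch of the two-stage pipeline described above.
# rpn and bcn stand in for trained networks; crop and fuse are
# hypothetical helpers, not the authors' actual code.

def crop(image, box):
    # Crop an (H, W, C) image array to the box (x1, y1, x2, y2).
    x1, y1, x2, y2 = [int(v) for v in box]
    return image[y1:y2, x1:x2]

def fuse(rpn_score, bcn_score):
    # Illustrative fusion only: multiply the two pedestrian scores.
    return rpn_score * bcn_score

def detect_pedestrians(image, rpn, bcn, score_thresh=0.5):
    # Stage 1: the RPN proposes boxes with preliminary scores;
    # bounding-box regression happens only in this stage.
    boxes, rpn_scores = rpn(image)

    # Stage 2: the BCN re-scores each proposal (classification only,
    # the box coordinates are left untouched).
    detections = []
    for box, r_score in zip(boxes, rpn_scores):
        b_score = bcn(crop(image, box))
        score = fuse(r_score, b_score)   # final classification score
        if score >= score_thresh:
            detections.append((box, score))
    return detections
```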

Let's take a closer look at the details.

RPN

The RPN comes from Faster R-CNN and is used to propose a set of bounding boxes, with associated confidence scores, around pedestrians.
At every position of the feature map, the RPN places anchor boxes of certain scales and aspect ratios, which is equivalent to sliding a window over a pooled image space. Each proposal box $i$ corresponds to an anchor (a scale and aspect ratio) and a position in image space.
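As an illustration of how anchors tile the image, here is a small NumPy sketch that generates anchors at every feature-map cell. The stride, heights, and aspect ratio are made-up values, not the paper's configuration.

```python
import numpy as np

def generate_anchors(feat_h, feat_w, stride=16,
                     heights=(40, 80, 160), aspect_ratio=0.41):
    # One anchor per height at every feature-map cell; stride maps a
    # cell back to image coordinates. aspect_ratio = width / height
    # (pedestrians are tall and narrow). All numbers are illustrative,
    # not the paper's exact settings.
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            cx, cy = x * stride, y * stride   # anchor center in image space
            for h in heights:
                w = h * aspect_ratio
                anchors.append([cx - w / 2, cy - h / 2,
                                cx + w / 2, cy + h / 2])
    return np.array(anchors)   # shape: (feat_h * feat_w * len(heights), 4)

# e.g. a 38 x 50 feature map yields 38 * 50 * 3 = 5700 anchors.
anchors = generate_anchors(38, 50)
```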

As shown, the RPN uses VGG-16's conv1-5 as the backbone for feature extraction, followed by two branches: one is the segmentation infusion layer, and the other is the traditional proposal layers, with two output layers for classification and bounding-box regression respectively.

From the network structure, the ground truth for the segmentation infusion layer is a mask made of small white boxes, which is in fact just the interior of the labeled pedestrian bounding boxes. After joint training of the two branches, the feature map output by conv1-5 clearly highlights the pedestrians, that is, it "illuminates" them.
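Building this weak ground truth requires nothing beyond the detection labels. A minimal sketch, assuming boxes are given in pixel coordinates (x1, y1, x2, y2):

```python
import numpy as np

def boxes_to_weak_mask(image_h, image_w, boxes):
    # Weak segmentation ground truth for the infusion layer: every pixel
    # inside a labeled pedestrian box is foreground (1), the rest is
    # background (0).
    mask = np.zeros((image_h, image_w), dtype=np.uint8)
    for x1, y1, x2, y2 in boxes:
        mask[int(y1):int(y2), int(x1):int(x2)] = 1
    return mask

# Two labeled pedestrians produce two white rectangles in the mask.
mask = boxes_to_weak_mask(480, 640, [(100, 120, 140, 260), (300, 100, 350, 280)])
```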

The objective function of this RPN is as follows:

$$L_{\mathrm{RPN}} = \lambda_c \sum_i L_c(c_i, \hat{c}_i) + \lambda_r \sum_i L_r(t_i, \hat{t}_i) + \lambda_s L_s$$
$L_c$ is the classification loss, a softmax logistic loss over the two classes (pedestrian vs. background). The rule for labeling a box as a pedestrian is IoU >= 0.5 with a ground-truth box.
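A short sketch of that labeling rule, assuming boxes are given as (x1, y1, x2, y2) corners:

```python
def iou(box_a, box_b):
    # Intersection-over-union of two (x1, y1, x2, y2) boxes.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def label_box(box, gt_boxes, pos_thresh=0.5):
    # A box is labeled pedestrian (1) if it overlaps any ground-truth
    # box with IoU >= 0.5, otherwise background (0).
    return int(any(iou(box, gt) >= pos_thresh for gt in gt_boxes))
```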

$L_r$ is the regression loss, $L_r(t_i, \hat{t}_i) = R(t_i - \hat{t}_i)$, where $R$ is a robust L1 loss. The offset of a bounding box is defined on x, y and w, h, i.e., $t = [t_x, t_y, t_w, t_h]$.
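Assuming $R$ is the smooth L1 loss used in Fast/Faster R-CNN, a minimal sketch:

```python
import numpy as np

def smooth_l1(diff):
    # Robust L1 loss from Fast R-CNN: quadratic for small residuals
    # (|x| < 1), linear otherwise, applied elementwise to t - t_hat.
    diff = np.abs(diff)
    return np.where(diff < 1.0, 0.5 * diff ** 2, diff - 0.5)

# Regression loss for one proposal, summed over [tx, ty, tw, th].
t     = np.array([0.10, -0.20, 0.05, 0.30])   # ground-truth offsets
t_hat = np.array([0.00, -0.10, 0.50, 0.25])   # predicted offsets
loss_r = smooth_l1(t - t_hat).sum()
```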

$L_s$ is the semantic segmentation loss, explained later.

In the experiments, $\lambda_c = \lambda_s = 1$, $\lambda_r = 5$, which clearly highlights the importance of regression: bounding-box regression happens only in this stage, while classification and semantic segmentation are refined again later.

BCN (Binary Classification Network)

The BCN mainly completes pedestrian classification for the proposals produced by the RPN. In general object detection, the recognition part of the Faster R-CNN back end is typically used for this. However, according to "Is Faster R-CNN Doing Well for Pedestrian Detection?", the Faster R-CNN back end actually degrades pedestrian detection accuracy. So the choice here is to build a separate classification network using VGG-16.

This network mainly identifies the hard examples that the RPN misses, raising the scores of occluded, deformed, and other difficult pedestrians so that they can be detected.
Of course, this stage still infuses semantic information to improve the recognition rate.

The objective function for this stage is as follows (classification and segmentation only, no box regression):

$$L_{\mathrm{BCN}} = \lambda_c \sum_i L_c(c_i, \hat{c}_i) + \lambda_s L_s$$

The main points are as follows:
1. $L_c$ is the classification loss, where $c_i$ is the category label of the $i$-th proposal, and $\hat{c}_i$ fuses the scores from the RPN and the BCN.

Specifically, for the $i$-th proposal, the RPN predicts the two class scores $\{\hat{c}^r_{i0}, \hat{c}^r_{i1}\}$, and the BCN predicts $\{\hat{c}^b_{i0}, \hat{c}^b_{i1}\}$; the two are fused into the final score $\hat{c}_i$.
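The source text is cut off before it gives the exact fusion rule. As one plausible illustration (an assumption, not necessarily the paper's formula), the pedestrian softmax probabilities from the two networks can simply be multiplied:

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - np.max(logits))
    return e / e.sum()

def fuse_scores(rpn_logits, bcn_logits):
    # Assumption for illustration, not necessarily the paper's formula:
    # take the pedestrian-class probability from each network's softmax
    # and multiply them to get the final score c_hat.
    p_rpn = softmax(np.array(rpn_logits, dtype=float))[1]   # from {c^r_i0, c^r_i1}
    p_bcn = softmax(np.array(bcn_logits, dtype=float))[1]   # from {c^b_i0, c^b_i1}
    return p_rpn * p_bcn
```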
