SSD: Single Shot MultiBox Detector


by Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, Alexander C. Berg.

Introduction

SSD is a unified framework for object detection with a single network. You can use the code to train/evaluate a network for the object detection task. For more details, please refer to our arXiv paper.

SSD Framework

System                   | VOC2007 test mAP | FPS (Titan X) | Number of Boxes
Faster R-CNN (VGG16)     | 73.2             | 7             | 300
Faster R-CNN (ZF)        | 62.1             | 17            | 300
YOLO                     | 63.4             | 45            | 98
Fast YOLO                | 52.7             | 155           | 98
SSD300 (VGG16)           | 72.1             | 58            | 7308
SSD300 (VGG16, cuDNN v5) | 72.1             | 72            | 7308
SSD500 (VGG16)           | 75.1             | 23            | 20097
Citing SSD

Please cite SSD in your publications if it helps your research:

@article{liu15ssd,
  title = {{SSD}: Single Shot MultiBox Detector},
  author = {Liu, Wei and Anguelov, Dragomir and Erhan, Dumitru and Szegedy, Christian and Reed, Scott and Fu, Cheng-Yang and Berg, Alexander C.},
  journal = {arXiv preprint arXiv:1512.02325},
  year = {2015}
}
Contents

Installation
Preparation
Train/eval
Models
Installation

Get the code. We'll call the directory that you cloned Caffe into $CAFFE_ROOT.

git clone https://github.com/weiliu89/caffe.git
cd caffe
git checkout ssd
Build the code. Follow the Caffe instructions to install all necessary packages and build it.

# Modify Makefile.config according to your Caffe installation.
cp Makefile.config.example Makefile.config
make -j8
# Make sure to include $CAFFE_ROOT/python to your PYTHONPATH.
make py
make test -j8
make runtest -j8
# If you have multiple GPUs installed on your machine, make runtest might fail. If so, try the following:
export CUDA_VISIBLE_DEVICES=0; make runtest -j8
# If you have the error: "Check failed: error == cudaSuccess (10 vs. 0) invalid device ordinal",
# first make sure you have specified valid GPUs, or try the following if you have multiple GPUs:
unset CUDA_VISIBLE_DEVICES
Preparation

Download the fully convolutional reduced (atrous) VGGNet. By default, we assume the model is stored in $CAFFE_ROOT/models/VGGNet/.

Download the VOC2007 and VOC2012 datasets. By default, we assume the data is stored in $HOME/data/.

# Download the data.
cd $HOME/data
wget http://host.robots.ox.ac.uk/pascal/VOC/voc2012/VOCtrainval_11-May-2012.tar
wget http://host.robots.ox.ac.uk/pascal/VOC/voc2007/VOCtrainval_06-Nov-2007.tar
wget http://host.robots.ox.ac.uk/pascal/VOC/voc2007/VOCtest_06-Nov-2007.tar
# Extract the data.
tar -xvf VOCtrainval_11-May-2012.tar
tar -xvf VOCtrainval_06-Nov-2007.tar
tar -xvf VOCtest_06-Nov-2007.tar
Create the LMDB file.

cd $CAFFE_ROOT
# Create the trainval.txt, test.txt, and test_name_size.txt in data/VOC0712/
./data/VOC0712/create_list.sh
# You can modify the parameters in create_data.sh if needed.
# It will create lmdb files for trainval and test with the encoded original image:
#   - $HOME/data/VOCdevkit/VOC0712/lmdb/VOC0712_trainval_lmdb
#   - $HOME/data/VOCdevkit/VOC0712/lmdb/VOC0712_test_lmdb
# and make soft links at examples/VOC0712/
./data/VOC0712/create_data.sh
Train/eval

Train your model and evaluate the model on the fly.

# It will create model definition files and save snapshot models in:
#   - $CAFFE_ROOT/models/VGGNet/VOC0712/SSD_300x300/
# and job file, log file, and the python script in:
#   - $CAFFE_ROOT/jobs/VGGNet/VOC0712/SSD_300x300/
# and save temporary evaluation results in:
#   - $HOME/data/VOCdevkit/results/VOC2007/SSD_300x300/
# It should reach 72.* mAP at 60k iterations.
python examples/ssd/ssd_pascal.py
If you don't have time to train your model, you can download a pre-trained model here.

Evaluate the most recent snapshot.

# If you would like to test a model you trained, you can do:
python examples/ssd/score_ssd_pascal.py
Test your model using a webcam. Note: press ESC to stop.

# If you would like to attach a webcam to a model you trained, you can do:
python examples/ssd/ssd_pascal_webcam.py
Here is a demo video of running an SSD500 model trained on the MS COCO dataset.

Check out examples/ssd_detect.ipynb or examples/ssd/ssd_detect.cpp on how to detect objects using an SSD model.

To train on other datasets, please refer to data/OTHERDATASET for more details. We currently add support for MS COCO and ILSVRC2016.

Models

Models trained on VOC0712: SSD300, SSD500

Models trained on MS COCO trainval35k: SSD300, SSD500

Models trained on ILSVRC2015 trainval1: SSD300, SSD500 (46.4 mAP on val2)


Preface
This is a paper from ECCV 2016, new work by Wei Liu of UNC Chapel Hill (University of North Carolina at Chapel Hill). Paper code: https://github.com/weiliu89/caffe/tree/ssd
Object detection summary link: object detection
Current VOC 2012 leaderboard:

[Figure: VOC 2012 leaderboard]
Abstract

This paper proposes the SSD object detection model, which guarantees speed while also ensuring accuracy, and, unlike the currently popular detection models, folds the detection process into a single deep neural network. This makes it easy to train and optimize while improving detection speed. SSD outputs a set of discretized bounding boxes, which are generated on feature maps at different levels (layers) and with different aspect ratios.

In the prediction phase:

For each default box, compute the probability that the object inside it belongs to each category, i.e. the score. For the PASCAL VOC dataset there are 20 classes in total, so SSD derives a score for each of the 20 categories for every bounding box.

Also, the shapes of these bounding boxes are fine-tuned to better fit the object's bounding rectangle.

Also, in order to handle the same object at different sizes, SSD combines predictions from feature maps of different resolutions.

Relative to detection models that require object proposals, the SSD method completely eliminates the proposal generation, pixel resampling, and feature resampling stages. This makes SSD easier to optimize and train, and makes it easier to integrate the detection model into a larger system.

Experiments on the PASCAL VOC, MS COCO, and ILSVRC datasets show that SSD is much faster than region proposal methods while guaranteeing accuracy.

Compared to other single-stage models (e.g. YOLO), SSD achieves higher accuracy, even when the input image is small. For a 300x300 PASCAL VOC test image on a Titan X, SSD runs at 58 FPS while achieving 72.1% mAP.

If the input image is 500x500, SSD obtains 75.1% mAP, which is better than the current state-of-the-art Faster R-CNN.


Introduction

The currently prevalent state-of-the-art detection systems generally follow these steps: first generate some hypothetical bounding boxes, then extract features from those bounding boxes, and finally pass them through a classifier to determine whether each one contains an object and what it is.

This type of pipeline has prevailed since the IJCV paper Selective Search for Object Recognition, up to today's leader on the PASCAL VOC, MS COCO, and ILSVRC datasets: Faster R-CNN with ResNet. But for embedded systems this kind of method takes too long to compute and is not real-time. Of course, there is a lot of work moving toward real-time detection, but so far it sacrifices detection accuracy for speed.

The real-time detection method proposed in this paper eliminates the intermediate bounding box proposal and pixel/feature resampling steps. Although this paper is not the first to do so (YOLO did it earlier), it makes improvements that guarantee not only speed but also detection accuracy.

Here is a very important sentence that basically sums up the core idea of this paper:

Our improvements include using a small convolutional filter to predict object categories and offsets in bounding box locations, using separate predictors (filters) for different aspect ratio detections, and applying these filters to multiple feature maps from the later stages of a network in order to perform detection at multiple scales.
The main contributions of this paper are summarized as follows:

A new object detection method is proposed: SSD, which is faster than the original YOLO (You Only Look Once) method and more accurate. At the same time, its mAP is comparable to methods using region proposal technology (such as Faster R-CNN).

The core of the SSD approach is to predict, for each default box, the object category and its score, and to use small convolutional kernels on feature maps to predict a set of bounding box offsets.

In order to get high-precision detection results, feature maps at different levels predict objects and box offsets at different scales, and separate predictions are made for different aspect ratios.

The improved design of this paper ensures detection accuracy even when the input resolution is low. At the same time, the design is end-to-end overall, so training is simple. A good trade-off is obtained between detection speed and detection accuracy.

The model presented in this paper is tested on different datasets, such as PASCAL VOC, MS COCO, and ILSVRC. Detection time (timing) and detection accuracy are compared with the state-of-the-art detection methods in the object detection field.


The Single Shot Detector (SSD)

This section explains in detail the SSD object detection framework and the SSD training method.

Here, let's first clarify what a default box and a feature map cell are:

A feature map cell is one of the cells obtained by dividing a feature map into an 8x8 or 4x4 grid;

The default boxes are a set of fixed-size boxes over every cell, drawn as the dashed boxes in the figure.

[Figure: feature map cells and default boxes]
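To make the layout concrete, here is a minimal sketch of generating such a grid of default boxes in Python (the scale and aspect-ratio values are illustrative assumptions, not the paper's exact configuration):

```python
def default_boxes(grid_size, scale, aspect_ratios):
    """Lay one default box per aspect ratio at the center of every cell
    of a grid_size x grid_size feature map. Coordinates are normalized
    to [0, 1]; boxes are (cx, cy, width, height)."""
    boxes = []
    for row in range(grid_size):
        for col in range(grid_size):
            cx = (col + 0.5) / grid_size
            cy = (row + 0.5) / grid_size
            for ar in aspect_ratios:
                w = scale * ar ** 0.5   # wider boxes for larger ratios
                h = scale / ar ** 0.5
                boxes.append((cx, cy, w, h))
    return boxes

# e.g. a 4x4 grid with 3 aspect ratios gives 4*4*3 = 48 default boxes
print(len(default_boxes(4, 0.2, [1.0, 2.0, 0.5])))  # -> 48
```

The key property this illustrates is that box positions are a fixed function of the grid, so a predictor only has to output adjustments to them.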

Model

SSD is based on a feed-forward CNN that generates a fixed-size set of bounding boxes, together with the likelihood that each box contains an object instance, i.e. its score. Afterwards, non-maximum suppression is applied to obtain the final predictions.
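The non-maximum suppression step can be sketched as follows; this is a minimal greedy NMS in plain Python (the corner box format and the IoU threshold are illustrative assumptions, not the repository's implementation):

```python
def iou(a, b):
    """Jaccard overlap of two boxes given as (xmin, ymin, xmax, ymax)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.45):
    """Greedy NMS: repeatedly keep the highest-scoring box and discard
    every remaining box that overlaps it too much. Returns kept indices."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep
```

For example, two heavily overlapping detections of the same dog collapse to the single higher-scoring one, while a distant box survives.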

The first part of the SSD model, called the base network, is a standard architecture for image classification. After the base network, this paper adds an auxiliary structure:

Multi-scale feature maps for detection
After the base network, additional convolutional layers are added. These layers decrease in size progressively, allowing predictions at multiple scales.

Convolutional predictors for detection
Each added feature layer (or a feature layer from the base network) can use a set of convolutional filters to produce a fixed set of predictions; see Fig. 2. For a feature layer of size m×n with p channels, the filters are 3×3×p kernels. Each prediction is either a score for a category or a shape offset relative to the default box coordinates.
Applying such a 3×3 kernel at each of the m×n locations produces one output value. The bounding box offsets are predicted relative to the default box position at each feature map location (the YOLO architecture uses a fully connected layer instead of a convolutional layer here).

Default boxes and aspect ratios
The position of each default box relative to its corresponding feature map cell is fixed. In each feature map cell, we predict the offsets between the predicted box and the default box, along with the per-class scores indicating the presence of an object in the box.
Specifically, for each of the k default boxes at a given position, we compute c class scores and the 4 offsets of the box relative to its default box. This requires (c+4)·k filters around each feature map cell, and an m×n feature map therefore produces (c+4)·k·m·n outputs.
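The filter arithmetic above can be sanity-checked with a tiny sketch (the example numbers are illustrative: 21 VOC classes including background, 4 default boxes per cell, an 8x8 feature map):

```python
def predictor_output_size(m, n, k, c):
    """Each of the m*n feature map cells needs (c + 4) * k filters:
    c class scores plus 4 box offsets for each of its k default boxes,
    giving (c + 4) * k * m * n outputs for the whole map."""
    filters_per_cell = (c + 4) * k
    return filters_per_cell * m * n

# 8x8 map, 4 default boxes per cell, 21 classes: (21+4)*4 = 100 filters
# per cell, 100 * 64 = 6400 outputs for this layer.
print(predictor_output_size(8, 8, 4, 21))  # -> 6400
```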

The default boxes here are very similar to the anchor boxes in Faster R-CNN; for details on anchor boxes, see that paper. But unlike Faster R-CNN, the default boxes in this paper are applied on feature maps of different resolutions.

[Figure: SSD model architecture]

Training

During training, the difference between SSD and methods that use region proposals + pooling is that the ground truth boxes in SSD's training images need to be assigned to specific boxes among the fixed outputs. As mentioned earlier, SSD's output is a predefined, fixed-size set of bounding boxes.

For example, the dog's ground truth is the red bounding box, but when building the labels, the red ground truth box is assigned to some of the fixed output boxes, namely the red dashed boxes in figure (c).

[Figure: matching a ground truth box to default boxes]

In fact, the paper points out that this way of defining ground truth boxes is not unique to this work. It is also used in YOLO, in the region proposal stage of Faster R-CNN, and in MultiBox.

Once the ground truth boxes in the training images are matched to the fixed output boxes, the loss function can be computed end-to-end and back-propagation can update the network.

There are some problems to solve in training:

Selecting a set of default boxes

Selecting the scales mentioned above

Hard negative mining

Strategies for data augmentation

The ways this paper solves these problems are described in the sections below.

Matching strategy:

How do we pair the ground truth boxes with the default boxes to build the labels?

First, as in MultiBox, use the best Jaccard overlap to match each ground truth box with a default box, so that each ground truth box corresponds to exactly one default box.

But unlike MultiBox, a default box is then also paired with any ground truth box whose Jaccard overlap with it is greater than a threshold; the threshold in this paper is 0.5.
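The two-step matching rule can be sketched like this (a simplified illustration in plain Python, not the repository's actual implementation):

```python
def jaccard(a, b):
    """Jaccard overlap (IoU) of two boxes given as (xmin, ymin, xmax, ymax)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

def match(ground_truths, default_boxes, threshold=0.5):
    """Step 1: each ground truth claims its best-overlap default box.
    Step 2: every unclaimed default box whose overlap with some ground
    truth exceeds the threshold is also matched to that ground truth.
    Returns a dict {default box index: ground truth index}."""
    matches = {}
    for j, gt in enumerate(ground_truths):
        best = max(range(len(default_boxes)),
                   key=lambda i: jaccard(gt, default_boxes[i]))
        matches[best] = j
    for i, box in enumerate(default_boxes):
        if i in matches:
            continue
        overlaps = [jaccard(gt, box) for gt in ground_truths]
        j = max(range(len(overlaps)), key=lambda k: overlaps[k])
        if overlaps[j] > threshold:
            matches[i] = j
    return matches
```

Step 2 is what lets one ground truth box collect several positive default boxes, which simplifies learning compared to forcing a single best match.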

Training Objective:

The objective function of SSD training is derived from the MultiBox objective, but this paper extends it to handle multiple object categories. Let x^p_ij = 1 indicate that the i-th default box is matched to the j-th ground truth box of category p, and x^p_ij = 0 otherwise.

According to the matching strategy above, we must have ∑_i x^p_ij ≥ 1, meaning that for the j-th ground truth box, more than one default box may be matched to it.

The overall objective loss function is a weighted sum of the localization loss (loc) and the confidence loss (conf):
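As given in the paper, with N the number of matched default boxes, the combined loss has the form

L(x, c, l, g) = (1/N) * ( L_conf(x, c) + α * L_loc(x, l, g) )

where c are the class confidences, l the predicted boxes, g the ground truth boxes, and α weights the localization term; when N = 0, the loss is set to 0.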
