Getting started with video object segmentation (with datasets)


Source: Heart of the Machine

This article is about 2,413 words; suggested reading time: 5 minutes

This article begins with an introduction to video object segmentation, its datasets, and the DAVIS Challenge, and then presents the two main approaches to video object segmentation.


Recently, Visualead research director Eddie Smolyansky published an introduction to the basics of video object segmentation on Medium. It starts with the video object segmentation problem, its datasets, and the DAVIS Challenge, and also introduces Visualead's newly released video dataset GyGO, as well as the two main video object segmentation methods since 2016: MaskTrack and OSVOS.



Several correctly annotated frames from the DAVIS-2016 video object segmentation dataset


This article introduces the video object segmentation problem and its classic solutions, briefly organized as follows:


The problem, datasets, and the challenge;

The new dataset we are announcing today;

The two main methods since 2016: MaskTrack and OSVOS.


The article assumes the reader is already familiar with some concepts in computer vision and deep learning. My goal is to give a clear and accessible introduction to the DAVIS Challenge so that newcomers can get up to speed quickly.


Introduction


In computer vision, there are three classic object-related tasks: classification, detection, and segmentation. Classification tells you "what" is in the image; the latter two also tell you "where", and segmentation answers that question at the pixel level.



Classic computer vision tasks (image from the Stanford CS231n course slides)


By 2016, semantic segmentation had become a mature technology, with performance even beginning to approach saturation on existing datasets. Meanwhile, 2017 has been a year of explosive growth for a variety of video processing tasks: action classification, action (temporal) segmentation, semantic segmentation, and so on. Here we will focus on video object segmentation.


The problem, datasets, and the challenge


The video object segmentation task differs from semantic segmentation in two fundamental ways:


In video object segmentation, the objects to be segmented are general, non-semantic objects (the task is class-agnostic).

Video object segmentation adds a temporal component: its task is to find the pixels corresponding to the object of interest in every consecutive frame of the video.



A taxonomy of segmentation tasks; each leaf in the diagram has an example dataset


Based on the nature of the video task, we can divide the problem into two subcategories:


Unsupervised (also known as video saliency detection): finding and segmenting the main object in the video. This means the algorithm itself must decide which object is the "main" one.

Semi-supervised: the input provides the correct segmentation mask for (only) the first frame of the video, and the task is to segment the annotated object in every subsequent frame.


The semi-supervised case can be extended to multi-object segmentation, as seen in the DAVIS-2017 Challenge.



The main difference between DAVIS-2016 (left) and DAVIS-2017 (right): multi-object (multi-instance) segmentation


As we can see, DAVIS is a dataset with pixel-perfect annotations. Its goal is to recreate real-world video scenarios such as camera shake, background clutter, occlusion, and other complexities.



Complexity attributes of DAVIS-2016


There are two main metrics for measuring segmentation accuracy:




Region similarity (J): the intersection over union (Jaccard index) between the estimated mask M and the ground truth G, i.e., J = |M ∩ G| / |M ∪ G|.




Contour accuracy (F): the mask is viewed as a set of closed contours, and a contour-based F-measure is computed, i.e., the harmonic mean of contour precision and recall: F = 2PR / (P + R).


Intuitively, region similarity measures the number of mislabeled pixels, while contour accuracy measures the precision of the segmentation boundary.
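
To make the two metrics concrete, here is a minimal NumPy sketch of both. It is only illustrative and is not the official DAVIS evaluation code: in particular, the boundary F-measure below matches contour pixels exactly, whereas the official toolkit allows a small tolerance band around the boundary.

```python
import numpy as np

def region_similarity(mask, gt):
    """Region similarity J: intersection over union between mask M and ground truth G."""
    mask, gt = mask.astype(bool), gt.astype(bool)
    inter = np.logical_and(mask, gt).sum()
    union = np.logical_or(mask, gt).sum()
    return 1.0 if union == 0 else inter / union

def contour_accuracy(mask, gt):
    """Contour accuracy F: F-measure of precision and recall between the two contours
    (strict pixel matching here, no tolerance band as in the official evaluation)."""
    def boundary(m):
        m = m.astype(bool)
        pad = np.pad(m, 1, mode='constant')
        # A pixel is interior if all four of its neighbours belong to the object.
        interior = pad[1:-1, :-2] & pad[1:-1, 2:] & pad[:-2, 1:-1] & pad[2:, 1:-1]
        return m & ~interior  # object pixels touching the background
    bm, bg = boundary(mask), boundary(gt)
    if bm.sum() == 0 or bg.sum() == 0:
        return float(bm.sum() == bg.sum())
    precision = (bm & bg).sum() / bm.sum()
    recall = (bm & bg).sum() / bg.sum()
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)
```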


The new dataset. GyGO: an e-commerce video object segmentation dataset (by Visualead)


Over the next few weeks we will be releasing GyGO in parts. GyGO is a video object segmentation dataset focused on e-commerce, composed of roughly 150 short videos.


Dataset address: https://github.com/ilchemla/gygo-dataset


On the one hand, the video sequences are semantically very simple, with almost no occlusion, fast motion, or other complexity-increasing attributes. On the other hand, the objects in these videos span more categories than the DAVIS-2016 dataset, and many of them belong to known semantic classes (humans, cars, etc.). GyGO focuses on videos captured with smartphones, so the frames are relatively sparse (the frame rate is only about 5 fps).


We are releasing this dataset for two purposes:


Currently, there is a serious shortage of data for video object segmentation, with only a few hundred annotated videos available. We believe every contribution can help improve algorithm performance. Our analysis suggests that joint training on the GyGO and DAVIS datasets yields better results on the video object segmentation task.

To promote a more open culture of sharing and to encourage other researchers to join us. :) The DAVIS dataset and the research ecosystem that has grown around it have been of great help to us, and we hope the community will benefit in turn.


The two main methods on DAVIS-2016


With the release of the DAVIS-2016 dataset for single-object segmentation, two leading methods emerged: OSVOS and MaskTrack. Nearly every team in the DAVIS-2017 Challenge built its solution on top of one of these two, and they have since become classics. Let's see how they work.


One-Shot Video Object Segmentation (OSVOS)


The idea behind OSVOS is simple and powerful:



The OSVOS training process


1. Take a network (such as VGG-16) pre-trained for classification on ImageNet.

2. Convert it into a fully convolutional network (FCN) to preserve spatial information:


Remove the fully connected (FC) layers at the end of the network.

Insert a new loss function: pixel-wise sigmoid balanced cross entropy (previously used in HED). Now each pixel is classified as foreground or background.


3. Train this new fully convolutional network on the DAVIS-2016 training set.

4. One-shot training: at inference time, given a new input video to segment and the ground-truth annotation of its first frame (remember, this is a semi-supervised problem), create a new model initialized with the weights trained in step 3 and fine-tune it on that first frame.


The result of this process is a unique, disposable model for each new video; thanks to the first-frame annotation, the model is effectively overfitted to that video. Since the object and background in most videos do not change drastically in appearance, this model produces good results. Naturally, it would perform much worse if used on an arbitrary, unrelated video sequence.
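
As a rough illustration of steps 3 and 4, here is a minimal PyTorch-style sketch of the one-shot fine-tuning stage. It assumes a fully convolutional parent network already trained offline on DAVIS-2016; names such as parent_net, first_frame, and first_mask are hypothetical, and details such as the learning rate, data augmentation, and number of iterations differ from the original implementation.

```python
import copy
import torch
import torch.nn.functional as F

def balanced_bce(logits, target):
    """Pixel-wise sigmoid cross entropy with class-balancing weights (as in HED),
    so the usually small foreground is not swamped by the background.
    `target` is a float mask in {0, 1} with the same shape as `logits`."""
    pos = target.sum()
    w_pos = (target.numel() - pos) / target.numel()  # weight for foreground pixels
    w_neg = pos / target.numel()                     # weight for background pixels
    weights = torch.where(target > 0.5, w_pos, w_neg)
    return F.binary_cross_entropy_with_logits(logits, target, weight=weights)

def one_shot_finetune(parent_net, first_frame, first_mask, steps=500, lr=1e-5):
    """Clone the parent model (step 3) and overfit it to the annotated first frame (step 4).
    first_frame: 3 x H x W tensor; first_mask: 1 x H x W float mask."""
    net = copy.deepcopy(parent_net)  # per-video, disposable model
    opt = torch.optim.SGD(net.parameters(), lr=lr, momentum=0.9)
    net.train()
    for _ in range(steps):
        opt.zero_grad()
        logits = net(first_frame.unsqueeze(0))            # 1 x 1 x H x W
        loss = balanced_bce(logits, first_mask.unsqueeze(0))
        loss.backward()
        opt.step()
    return net

# Every remaining frame is then segmented independently, e.g.:
# mask_t = torch.sigmoid(net(frame_t.unsqueeze(0))) > 0.5
```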


Note: OSVOS segments each frame of the video independently, so the temporal information in the video is not used.


MaskTrack (Learning Video Object Segmentation from Static Images)


Whereas OSVOS segments each frame of the video independently, MaskTrack also takes the temporal information in the video into account:



The mask propagation module of MaskTrack


For each frame, the predicted mask of the previous frame is fed into the network as an extra input: the input now has four channels (RGB + the previous frame's mask). The process is initialized with the ground-truth annotation of the first frame (a sketch of this propagation loop appears after this list).

The network, originally based on DeepLab VGG-16 (and modular), is trained from scratch on semantic segmentation and image saliency datasets. The previous-frame mask channel is synthesized by slightly deforming the ground-truth annotation of each static image.

An identical second-stream network is added, taking the optical flow field as input. Its weights are the same as those of the RGB stream, and the outputs of the two streams are fused by averaging the two results.

Online training: additional video-specific training data is synthesized from the ground-truth annotation of the first frame.
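
The following is a minimal sketch of MaskTrack-style mask propagation at inference time, assuming a hypothetical network net that takes a 4-channel input (RGB plus the previous frame's mask) and outputs foreground logits. The optical flow stream and the online fine-tuning step described above are omitted for brevity.

```python
import torch

@torch.no_grad()
def propagate_masks(net, frames, first_mask, threshold=0.5):
    """frames: list of 3 x H x W tensors; first_mask: 1 x H x W ground-truth mask of frame 0.
    Returns a list of predicted 1 x H x W masks, one per frame."""
    masks = [first_mask.float()]
    for frame in frames[1:]:
        prev_mask = masks[-1]
        x = torch.cat([frame, prev_mask], dim=0).unsqueeze(0)  # 1 x 4 x H x W input
        logits = net(x)                                        # 1 x 1 x H x W output
        masks.append((torch.sigmoid(logits)[0] > threshold).float())
    return masks
```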


Note: both methods rely heavily on training with static images (compared with static-image datasets, video datasets are fewer and smaller).


To sum up, this introductory article has covered the video object segmentation problem and the best solutions of 2016.


P.S. I would like to thank the team behind the DAVIS dataset and challenge for their outstanding contributions.


Reference documents

The main papers mentioned and analyzed in this article are as follows:

1. A Benchmark Dataset and Evaluation Methodology for Video Object Segmentation. F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, and A. Sorkine-Hornung. Computer Vision and Pattern Recognition (CVPR), 2016.

2. The 2017 DAVIS Challenge on Video Object Segmentation. J. Pont-Tuset, F. Perazzi, S. Caelles, P. Arbeláez, A. Sorkine-Hornung, and L. Van Gool. arXiv:1704.00675, 2017.

3. Learning Video Object Segmentation from Static Images. F. Perazzi, A. Khoreva, R. Benenson, B. Schiele, and A. Sorkine-Hornung. Computer Vision and Pattern Recognition (CVPR), 2017, Honolulu, USA.

4. One-Shot Video Object Segmentation. S. Caelles, K.K. Maninis, J. Pont-Tuset, L. Leal-Taixé, D. Cremers, and L. Van Gool. Computer Vision and Pattern Recognition (CVPR), 2017.


Original link: https://medium.com/@eddiesmo/video-object-segmentation-the-basics-758e77321914


Editor: Wen Yu

