Finding Action Tubes - CVPR 2015


Paper title: Finding Action Tubes (paper link). This is a CVPR 2015 paper, mainly about action tube localization.
Letting the paper's overview figure speak for itself, the core idea can be split into two components:
1 Action detection at every frame of the video
2 Linking detections in time to produce action tubes
Here is each component separately.

1 Action detection at every frame of the video
The idea is to train a spatial-CNN and a motion-CNN as feature extractors, and then train a linear SVM for each action category on top of those features. The steps are as follows:
A. Find the interesting regions in each frame and build positive and negative samples from those regions and the ground-truth action label. IoU against the ground-truth box is used here: regions with IoU > 0.5 are positives, regions with IoU < 0.3 are negatives (a small sketch of this labeling rule appears at the end of step A). Why do it this way? Personally I think the action tube in this paper is built around the actor inside it: the task is essentially tracking and classifying the action of a single actor in the video, and the datasets do provide the action category and the corresponding actor box for every frame. So how are these regions found, and how are the unnecessary ones eliminated?
    There are many ways to generate proposals; the paper uses selective search to generate about 2k proposals per frame of the video.
    Obviously, a large fraction of these proposals is non-discriminative and causes serious computational cost, which is not good for anything close to real-time detection.
    The paper uses a very simple trick to eliminate these non-discriminative regions: motion saliency. The optical flow of the frame is computed, and proposals that contain little or no motion are discarded, which cuts the number of boxes down sharply.
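    A minimal sketch of this kind of flow-based pruning (my own illustration; the exact rule and thresholds in the paper may differ):

    import numpy as np

    def prune_proposals(boxes, flow, mag_thresh=1.0, frac_thresh=0.1):
        """Keep only proposals that contain 'enough' motion.

        boxes: (N, 4) array of [x1, y1, x2, y2] selective-search proposals.
        flow:  (H, W, 2) optical flow of the frame.
        mag_thresh, frac_thresh: illustrative thresholds, not the paper's values.
        """
        mag = np.linalg.norm(flow, axis=2)       # per-pixel flow magnitude
        moving = mag > mag_thresh                # binary motion-saliency map
        keep = []
        for i, (x1, y1, x2, y2) in enumerate(boxes.astype(int)):
            patch = moving[y1:y2 + 1, x1:x2 + 1]
            if patch.size > 0 and patch.mean() > frac_thresh:
                keep.append(i)                   # enough moving pixels inside the box
        return boxes[keep]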
        
    It is important to note that the RGB and motion (flow) images share the same regions: the proposals are extracted on the RGB frames with the method above and then used directly on the flow images.
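    To make the labeling in step A concrete, here is a minimal sketch (mine, not the authors' code) of the IoU rule: IoU > 0.5 against the ground-truth actor box is a positive, IoU < 0.3 a negative, and everything else is ignored:

    import numpy as np

    def iou(box_a, box_b):
        """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
        x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
        x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
        inter = max(0, x2 - x1) * max(0, y2 - y1)
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        return inter / float(area_a + area_b - inter + 1e-9)

    def label_proposals(boxes, gt_box, gt_label, pos_thr=0.5, neg_thr=0.3):
        """Return one label per proposal: gt_label, 0 (background), or -1 (ignored)."""
        labels = np.full(len(boxes), -1, dtype=int)
        for i, box in enumerate(boxes):
            ov = iou(box, gt_box)
            if ov > pos_thr:
                labels[i] = gt_label
            elif ov < neg_thr:
                labels[i] = 0
        return labels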
B. Training spatial-CNN and motion-CNN
  This is where the paper shows the framework figure of the two CNN models; see the paper for the concrete architecture.
  They are trained in the same way as R-CNN; for the concrete procedure you can refer to a labmate's blog post on R-CNN.
    To me, the main points of this training stage are two:
      I. Training is done on single frames.
      II. Initialization of the CNN models.
        As we all know, initialization is very important for deep models.
        The spatial-CNN is initialized with a CNN trained on the PASCAL VOC 2012 detection task.
        The motion-CNN is initialized with a CNN trained on optical-flow (motion) images from UCF101.
  As for the remaining training details, such as the learning rate and data augmentation, please read the paper yourself.
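  The paper fine-tunes the networks R-CNN style in the Caffe framework of the time. Purely as an illustration (not the authors' setup), a minimal fine-tuning sketch in PyTorch could look like the following; the ImageNet-pretrained AlexNet backbone, class count, and hyperparameters are placeholders:

    import torch
    import torch.nn as nn
    import torchvision.models as models

    NUM_ACTIONS = 21   # placeholder number of action classes (+1 background below)

    # Stand-in for the paper's initialization: an ImageNet-pretrained AlexNet.
    # (The paper initializes spatial-CNN from a VOC 2012 detection model and
    #  motion-CNN from a UCF101 flow model; those weights are not reproduced here.)
    net = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
    net.classifier[6] = nn.Linear(4096, NUM_ACTIONS + 1)   # replace the final fc layer

    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(net.parameters(), lr=1e-3, momentum=0.9)

    def finetune_step(crops, labels):
        """One SGD step on a batch of warped region crops.

        crops:  (N, 3, 224, 224) tensor of region crops (RGB frames, or flow
                rendered as a 3-channel image for the motion-CNN).
        labels: (N,) tensor of class ids, 0 = background.
        """
        net.train()
        optimizer.zero_grad()
        loss = criterion(net(crops), labels)
        loss.backward()
        optimizer.step()
        return loss.item()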
C. Extract the fc7 features of spatial-CNN and motion-CNN for the training regions.  The two fc7 feature vectors are simply concatenated together; simple and brute-force. You can also look into how the two features might be fused in a smarter way.
D. Train the linear SVMs for the actions (one per category).
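A rough sketch of steps C and D, assuming the fc7 features have already been extracted; scikit-learn's LinearSVC stands in for the per-class SVMs, and the paper's details (hard-negative mining, the exact C value, etc.) are omitted:

    import numpy as np
    from sklearn.svm import LinearSVC

    def train_action_svms(fc7_spatial, fc7_motion, labels, num_actions):
        """fc7_spatial, fc7_motion: (N, 4096) fc7 features per training region.
        labels: (N,) ints, 0 = background, 1..num_actions = action classes.
        Returns one binary linear SVM per action class."""
        feats = np.hstack([fc7_spatial, fc7_motion])    # simple concatenation of the two fc7 vectors
        svms = {}
        for c in range(1, num_actions + 1):
            clf = LinearSVC(C=1e-3)                     # placeholder C value
            clf.fit(feats, (labels == c).astype(int))   # one-vs-rest training
            svms[c] = clf
        return svms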

2 Linking detections in time to produce action tubes
This step builds on the first component.
A. Extract the corresponding regions of every frame, pass each region through spatial-CNN and motion-CNN to extract fc7 features, then run them through the SVMs to obtain the per-class action scores.
B. For each category and each video, use the linking score described below to find the linked action tubes.
    
    
    That is, between every two adjacent frames, for a given action category, pick the pair of regions (one per frame) with the highest linking score, which combines the two regions' class scores and their IoU overlap; as I recall the paper, roughly e(R_t, R_{t+1}) = s(R_t) + s(R_{t+1}) + λ · ov(R_t, R_{t+1}), and the best path over the whole video is found with the Viterbi algorithm (dynamic programming).
    These regions are then strung together to form the action tube.  So how is the score of an action tube computed?
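    A minimal dynamic-programming sketch of the linking step (my own illustration, not the authors' code); lam and the final tube-score aggregation are placeholders, see the paper for the exact definitions:

    import numpy as np

    def iou(a, b):
        """Same IoU helper as in the labeling sketch above."""
        x1, y1 = max(a[0], b[0]), max(a[1], b[1])
        x2, y2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, x2 - x1) * max(0, y2 - y1)
        union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
        return inter / float(union + 1e-9)

    def link_action_tube(boxes_per_frame, scores_per_frame, lam=1.0):
        """Link one tube for a single action class across T frames.

        boxes_per_frame:  list of (N_t, 4) proposal boxes, one array per frame.
        scores_per_frame: list of (N_t,) SVM scores for this class, one array per frame.
        Maximizes the accumulated per-region score plus lam * IoU between consecutive
        boxes via dynamic programming (Viterbi), a close variant of the paper's
        linking objective.
        """
        T = len(boxes_per_frame)
        best = [scores_per_frame[0].astype(float)]   # best path score ending at each box of frame 0
        back = []
        for t in range(1, T):
            prev_boxes, cur_boxes = boxes_per_frame[t - 1], boxes_per_frame[t]
            overlap = np.array([[iou(p, c) for p in prev_boxes] for c in cur_boxes])  # (N_t, N_{t-1})
            cand = best[-1][None, :] + lam * overlap   # score of extending each previous path
            back.append(cand.argmax(axis=1))           # which previous box each path extends
            best.append(cand.max(axis=1) + scores_per_frame[t])
        # Backtrack the highest-scoring path.
        idx = int(np.argmax(best[-1]))
        path = [idx]
        for t in range(T - 1, 0, -1):
            idx = int(back[t - 1][idx])
            path.append(idx)
        path.reverse()
        tube = [boxes_per_frame[t][i] for t, i in enumerate(path)]
        # Placeholder tube score: mean of the per-region scores along the tube
        # (see the paper for the exact definition).
        tube_score = float(np.mean([scores_per_frame[t][i] for t, i in enumerate(path)]))
        return tube, tube_score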
        

Of course, the paper does not stop there: based on the action tubes, video-level action classification is straightforward; see the formula in the paper.
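Purely as a stand-in (my simplification, not necessarily the paper's exact rule), a video-level prediction could be obtained by aggregating tube scores, for example:

    def classify_video(tubes_per_class):
        """tubes_per_class: dict mapping class id -> list of (tube, tube_score) pairs.

        Placeholder rule (my assumption, not the paper's formula): a video's score
        for a class is the score of its best tube for that class.
        """
        return {c: max(score for _, score in tubes)
                for c, tubes in tubes_per_class.items() if tubes}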

As for the results, they were state-of-the-art at the time, as you would expect.


The paper's main contributions, in my opinion, are the following:
A. It combines appearance and motion signals.
B. It confirms that the appearance and motion signals are complementary.
C. It uses the motion signal to eliminate non-discriminative regions, which is relatively novel.

Of course there are also shortcomings:
A. The datasets mostly contain a single actor, and the method degrades badly when there are multiple actors.
B. Motion (optical flow) is pre-computed rather than learned.
C. The whole framework is very much a multi-stage pipeline rather than end-to-end.
Well, that's about it. Feel free to reach out with questions ...


