Source: AI Technology Base
Paper authors: Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, Juan Carlos Niebles (Stanford University)
Article length: 3,094 words; suggested reading time: 6 minutes.
The first part of this article translates the paper's abstract and introduction; the second part is an interpretation by others (which does not represent this article's viewpoint).
A few days ago, Fei-Fei Li posted a tweet:
Tweet content:
My students' recent paper was selected by TechCrunch as one of 10 papers at the frontier of computer vision, and I am really proud of them. After ImageNet, computer vision keeps breaking through the limits of our imagination.
Coming from Fei-Fei Li's group, it is surely worth reading. In fact, the paper was already published this May, and several Zhihu users have given their own interpretations. The first part of this article is a translation of the paper's abstract and introduction; the second part collects those interpretations (which do not represent this article's viewpoint). We hope it helps.
Refer to the link at the end of the article if you want to see the original.
Abstract
Most videos contain many events. A video of someone playing the piano, for example, may also include dancers or an applauding audience. This paper presents the task of dense event captioning: detecting and describing the events in a video. The authors propose a new model that can identify all events in a single pass over the video and describe each detected event in natural language.
The model introduces a variant of an existing proposal module that, unlike existing methods, can capture events lasting from minutes up to tens of minutes. To capture the relationships between the different events in a video, the model also introduces a new captioning module that describes each event using contextual information from past and future events. The authors further present ActivityNet Captions, a large benchmark dataset for dense event captioning. The dataset contains 20,000 videos (849 hours in total) and 100,000 descriptions, each with a start and end time. Finally, the authors report the model's performance on dense event captioning, video retrieval, and localization tasks.
Figure 1: The dense event captioning task requires a model to detect and describe, in natural language, every event that occurs in a video. Events have their own start and end times, so they can occur simultaneously and overlap in time.
Introduction
With large-scale activity datasets, models can classify the events in a video into a discrete set of action categories. For example, in Figure 1 such a model might output labels like "playing the piano" or "dancing." Although these methods achieve good results, they have an important limitation: the lack of detail.
To address the lack of detail in existing action detection models, the authors explore describing a video's content with sentences. For example, in Figure 1 the model might focus on the old man playing the piano in front of a crowd. Although this description tells us who is playing the piano and that an audience is watching the performance, it fails to recognize and describe all the other events in the video: at some point a woman starts singing along with the performer, and later a man starts dancing to the music. To make a model recognize and describe every event in a video in natural language, the authors propose the dense event captioning task, which requires the model to generate a set of descriptions for the multiple events occurring in the video and to localize those events in time.
Dense event captioning is analogous to dense image captioning. The difference is that the former asks the model to describe events and localize them in time, while the latter asks the model to describe image regions and localize them in space, so the two tasks pose different problems. Events in a video can occur at multiple time scales, and different events can overlap.
In a video, a piano recital may run from beginning to end, while the audience's applause lasts only about ten seconds. To capture every event, both long and short video sequences must be encoded so that the events can be described. Previous methods sidestep this problem by encoding the entire video with mean pooling or a recurrent neural network (RNN). Such methods work well for short videos, but encoding sequences lasting several minutes or tens of minutes runs into vanishing gradients, so the model cannot be trained successfully. To overcome this limitation, the authors apply recent work on action proposal generation to multi-scale event detection. In addition, the proposed module processes each video in a single forward pass, so the model can detect events as they occur.
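To make the multi-scale encoding idea concrete, here is a minimal illustrative sketch (not the authors' code) of sampling a precomputed per-clip feature sequence at several temporal strides, so that a single pass can expose both short and very long events to a downstream proposal model. The feature dimension and sequence length below are invented for illustration.

# Illustrative only: strided views of a clip-feature sequence (assumed shapes).
import numpy as np

def strided_views(features, strides=(1, 2, 4, 8)):
    """features: (T, D) array with one feature vector per short video clip.
    Returns a dict mapping each stride to its subsampled view of the sequence."""
    return {s: features[::s] for s in strides}

feats = np.random.randn(1125, 500)   # hypothetical ~10-minute video, 500-d clip features
for s, view in strided_views(feats).items():
    print(f"stride {s}: {view.shape[0]} time steps")

Larger strides give coarser, longer-horizon views of the same video, which is one simple way to cover events of very different durations without unrolling a recurrent network over the full-resolution sequence.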
The paper also observes that the events in a video are often related. In Figure 1, the audience applauds because the performer has played the piano. The model must therefore be able to describe each event using contextual information from the preceding and following events. A recently published paper attempts to describe the events in a video with multiple sentences, but it uses "cooking" instructional videos, in which events and objects are highly correlated and occur in a fixed order.
The authors show that, unlike such methods, their model applies to "open" domain videos, where events are action-driven and different events may overlap. They propose a captioning module that generates a descriptive sentence for each event using contextual information from all the events produced by the action proposal module. They also provide a variant of the captioning module that, for streaming video, generates descriptions using only the events that have already occurred. By conditioning on preceding and subsequent events, the models in this paper demonstrate the importance of contextual information.
To evaluate the model on dense event captioning and to provide a benchmark, the authors introduce the ActivityNet Captions dataset. ActivityNet Captions contains 20,000 videos taken from ActivityNet, each annotated with a series of temporally localized sentences. To test detection on long video sequences, the dataset includes videos up to 10 minutes long, with an average of 3.65 sentences per video. The sentences describe events that may occur simultaneously, so the annotated video segments can overlap. Although the videos center on human activities, the descriptions may also involve non-human events, e.g. "Two hours later, a delicious cake is ready." The descriptions were collected by crowdsourcing, and annotators showed a high degree of agreement on the temporal segmentation of events, which is consistent with studies finding that humans instinctively segment activity into semantically meaningful events.
With the help of ActivityNet Captions, we provide the first results on the dense event captioning task. Using the proposal module and the online captioning module, we show that events in long or streaming videos can be detected and described. We also show that both long and short video sequences can be handled. In addition, we demonstrate that using contextual information from other events improves performance on dense event captioning. Finally, we show that ActivityNet Captions can be used to study video retrieval and event localization.
Paper Address:
http://openaccess.thecvf.com/content_ICCV_2017/papers/Krishna_Dense-Captioning_Events_in_ICCV_2017_paper.pdf
The editor will not add much analysis of this paper here; below are the interpretations of two Zhihu users, for reference.
Original address:
https://www.zhihu.com/question/59639334/answer/167555411
Zhihu answerer: Mbfel
The entire framework consists of two main parts: a proposal module and a captioning module.
The model is as follows:
1. Given a video, generate a feature sequence. In the experiments, C3D features are extracted over every 16 frames as input.
2. Proposal module. The proposal module slightly modifies DAPs so that it outputs K proposals at each time step. It uses an LSTM that takes the C3D feature sequence as input; the feature sequence is sampled at different strides, strides = {1, 2, 4, 8}. The generated proposals can overlap in time. Whenever an event is detected, the LSTM's current hidden state serves as the visual representation of that event for captioning (as sketched below).
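As a rough illustration of what such a DAPs-style proposal module might look like, here is a short PyTorch sketch. It is an assumption about the structure, not the authors' implementation, and the feature dimension, hidden size, and K are placeholder values.

# Illustrative sketch of a DAPs-style proposal module (not the authors' code).
import torch
import torch.nn as nn

class ProposalModule(nn.Module):
    def __init__(self, feat_dim=500, hidden_dim=512, k=64):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.scores = nn.Linear(hidden_dim, k)   # K proposal confidences per time step

    def forward(self, feats):
        # feats: (batch, T, feat_dim) clip features, possibly subsampled by a stride
        hidden, _ = self.lstm(feats)                  # (batch, T, hidden_dim)
        conf = torch.sigmoid(self.scores(hidden))     # (batch, T, K); score k is the
        # confidence of a proposal ending at step t with the k-th preset length
        return conf, hidden   # the hidden state of an accepted proposal feeds captioning

module = ProposalModule()
conf, hidden = module(torch.randn(1, 300, 500))
print(conf.shape)   # torch.Size([1, 300, 64])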
3. Captioning module. It uses the context of neighboring events to generate each event's caption, again with an LSTM.
For the current event, all other events are split into two buckets: past events and future events. Concurrent events are assigned to the past or future bucket according to their end times. (The formulas are not reproduced here; see the paper.)
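The following sketch shows, under my own simplifying assumptions rather than the paper's exact attention formulas, how past and future events could be bucketed by end time and pooled into a context vector for the current event.

# Simplified sketch of past/future context pooling (not the paper's exact formulas).
import torch

def split_and_pool(current, events):
    """current: dict with 'end' time and hidden state 'h' of shape (D,).
    events: list of dicts with the same keys. Events that end no later than the
    current event go into the past bucket, the rest into the future bucket."""
    past = [e['h'] for e in events if e['end'] <= current['end']]
    future = [e['h'] for e in events if e['end'] > current['end']]

    def pool(bucket):
        if not bucket:
            return torch.zeros_like(current['h'])
        H = torch.stack(bucket)                      # (N, D)
        w = torch.softmax(H @ current['h'], dim=0)   # attention-like weights over events
        return (w.unsqueeze(1) * H).sum(0)           # weighted average, shape (D,)

    # concatenate past context, the event's own state, and future context
    return torch.cat([pool(past), current['h'], pool(future)])

ctx = split_and_pool({'end': 30.0, 'h': torch.randn(512)},
                     [{'end': 10.0, 'h': torch.randn(512)},
                      {'end': 55.0, 'h': torch.randn(512)}])
print(ctx.shape)   # torch.Size([1536])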
4. The loss function consists of two parts:
Both parts use cross-entropy: one term for the event proposals and one for the captions.
5. Experiments. Baselines: LSTM-YT, S2VT, and H-RNN, compared against the full model and the online model. The full model is the model proposed in this paper; the online model is the full model restricted to using only past events, without future events.
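A hedged sketch of what the two-part cross-entropy loss could look like; the exact weighting and label definitions used in the paper may differ.

# Illustrative two-part loss: proposal cross-entropy + caption cross-entropy.
import torch
import torch.nn.functional as F

def total_loss(prop_conf, prop_labels, word_logits, word_targets,
               lambda_prop=1.0, lambda_cap=1.0):
    # prop_conf: (N,) predicted proposal confidences in [0, 1]
    # prop_labels: (N,) binary labels (1 if the proposal matches a ground-truth event)
    l_prop = F.binary_cross_entropy(prop_conf, prop_labels.float())
    # word_logits: (L, vocab_size) caption logits; word_targets: (L,) word indices
    l_cap = F.cross_entropy(word_logits, word_targets)
    return lambda_prop * l_prop + lambda_cap * l_cap

loss = total_loss(torch.rand(8), torch.randint(0, 2, (8,)),
                  torch.randn(12, 1000), torch.randint(0, 1000, (12,)))
print(loss.item())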
6. Evaluation: Proposal module and captioning module are evaluated separately.
Proposal module: recall, which depends on two factors:
the number of proposals, and
the IoU with ground-truth events.
The effect of different strides on event localization is also tested (a small recall/IoU sketch appears after the evaluation items below).
Captioning module: evaluated via video retrieval, i.e., given the descriptions of different parts of a video, retrieve the correct video from the test set.
This model addresses two problems:
Inconsistent video (event) lengths.
The correlations between events.
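For intuition about the proposal-recall evaluation mentioned above, here is a small sketch that computes recall at a temporal-IoU threshold; the threshold value and interval format are assumptions for illustration.

# Toy recall-at-IoU computation for temporal proposals.
def t_iou(a, b):
    """Temporal IoU of two (start, end) intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_iou(proposals, ground_truth, thresh=0.5):
    """Fraction of ground-truth events covered by at least one proposal at the threshold."""
    hit = sum(any(t_iou(p, g) >= thresh for p in proposals) for g in ground_truth)
    return hit / len(ground_truth) if ground_truth else 0.0

# toy example: two proposals vs. two ground-truth events (times in seconds)
print(recall_at_iou([(0, 60), (70, 85)], [(5, 55), (200, 230)]))   # 0.5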
I think the main contributions of this paper are the following:
It proposes the proposal module + captioning module design, which can generate both short and long events in a single pass over the video.
It uses the context of neighboring events to generate the current event's caption.
It proposes the ActivityNet Captions dataset.
Zhihu answerer: younger
The framework is roughly: action segment proposal + video captioning. First, propose action segments (i.e., segments of interest), then run video captioning on each proposal. The action segment proposal follows ECCV 2016 [1].
"1" 2016-eccv-daps Deep action proposals for action understanding
About video caption, you can go to see friends @ Lin Tianwei Recent columns (The Tianwei column is full of dry goods, do video related research can be concerned about)
Video-Analysis-related field introduction videos captioning (to text description)
https://zhuanlan.zhihu.com/p/26730181
I feel the biggest contribution is proposing this dense video captioning dataset (or rather the task); the algorithm only serves as a baseline. (The dataset adds caption annotations on top of ActivityNet, which is currently the most popular video dataset for action recognition/detection challenges.)
This year's ActivityNet challenge is held as a CVPR 2017 workshop, and the dense captioning task presented in this paper from Fei-Fei Li's group is one of its five tracks. Anyone interested is welcome to participate:
http://activity-net.org/challenges/2017/index.html
Editor: Wang Xuan