"CV paper read" Detecting events and key actors in multi-person videos

Last Update:2016-08-15 Source: Internet

Author: User

This paper introduces a method for recognizing events in videos involving multiple interacting people, using an attention model combined with an RNN. I have recently been studying RNNs, which are well suited to sequential data in which context matters.

NCAA Basketball Data Set

This dataset was newly built by the authors. Each event clip is 4 seconds long, and the paper recognizes 11 event classes in total. In addition, the authors annotated player bounding boxes on training-set frames and trained a Multibox detector on them, which is then used to detect the bounding boxes of all players in each frame.

RNN model

The paper uses LSTMs, a type of RNN, to process the frame sequence. In the network structure diagram, BLSTM denotes a bidirectional LSTM.

Each P_i-BLSTM tracks the state of one player across the frame sequence, and the thickness of each box represents the attention weight given to that player as a key actor.

First, a 1024-dimensional feature is extracted from each frame, and a 2805-dimensional feature is extracted for each player in each frame (1440-D spatial location information plus 1365-D appearance information). A BLSTM over the frame features then computes a hidden state $h_t^f$ that holds global context information. The formula is as follows:

$$h_t^f = \mathrm{BLSTM}_{\mathrm{frame}}(h_{t-1}^f, h_{t+1}^f, f_t)$$

where $f_t$ is the 1024-D feature of frame $t$.
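A minimal pure-Python sketch of the two ideas in this paragraph, under stated assumptions: the player vector is simply the concatenation of its spatial and appearance parts, and a bidirectional recurrence gives every frame a state that sees both earlier and later frames. To keep it short, a plain tanh RNN cell stands in for the LSTM cell, the dimensions are toy-sized rather than 1024-D, and the weights are random; all function names are hypothetical.

```python
import math
import random

SPATIAL_DIM, APPEARANCE_DIM = 1440, 1365  # dimensions quoted in the post


def player_feature(spatial, appearance):
    """Concatenate spatial (1440-D) and appearance (1365-D) parts into one 2805-D vector."""
    assert len(spatial) == SPATIAL_DIM and len(appearance) == APPEARANCE_DIM
    return spatial + appearance


def rnn_pass(inputs, W_x, W_h, b):
    """Run a plain tanh RNN over a list of input vectors and return every hidden state.

    Stand-in for an LSTM cell, used only to keep the sketch short.
    """
    hidden = len(b)
    h = [0.0] * hidden
    states = []
    for x in inputs:
        # The comprehension reads the previous h before the name is rebound.
        h = [
            math.tanh(
                sum(W_x[j][i] * x[i] for i in range(len(x)))
                + sum(W_h[j][k] * h[k] for k in range(hidden))
                + b[j]
            )
            for j in range(hidden)
        ]
        states.append(h)
    return states


def bidirectional_states(frames, fwd_params, bwd_params):
    """Forward pass over the frames, backward pass over the reversed frames, concatenated per step."""
    forward = rnn_pass(frames, *fwd_params)
    backward = rnn_pass(frames[::-1], *bwd_params)[::-1]  # re-align to time order
    return [f + bk for f, bk in zip(forward, backward)]


def random_params(in_dim, hidden, rng):
    """Hypothetical small random initialization, for illustration only."""
    W_x = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(hidden)]
    W_h = [[rng.uniform(-0.1, 0.1) for _ in range(hidden)] for _ in range(hidden)]
    return W_x, W_h, [0.0] * hidden


# Toy usage: 5 frames of 8-D features (instead of 1024-D), hidden size 4 per direction.
rng = random.Random(0)
T, D, H = 5, 8, 4
frames = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(T)]
fwd, bwd = random_params(D, H, rng), random_params(D, H, rng)
context = bidirectional_states(frames, fwd, bwd)
print(len(context), len(context[0]))  # 5 8
```

Each per-frame state has twice the hidden size because the forward and backward halves are concatenated, so the state at frame $t$ depends on the whole clip.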

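The event state $h_t^e$ can then be computed with a unidirectional LSTM that takes the previous event state, the frame context $h_t^f$, and the attention-weighted player feature $a_t$ produced by the attention model:

$$h_t^e = \mathrm{LSTM}(h_{t-1}^e, h_t^f, a_t)$$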
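The unidirectional event LSTM can also be written out in pure Python. This is a standard LSTM cell; feeding the frame context and the attended player feature in as one concatenated vector is an assumption of this sketch, as are the toy sizes, the random initialization, and the helper names.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def lstm_step(x, h_prev, c_prev, params):
    """One step of a standard LSTM cell. params maps gate name -> (W, U, b)."""
    def gate(name, act):
        W, U, b = params[name]
        return [act(p + q + r) for p, q, r in zip(matvec(W, x), matvec(U, h_prev), b)]

    i = gate("input", sigmoid)
    f = gate("forget", sigmoid)
    o = gate("output", sigmoid)
    g = gate("cell", math.tanh)
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c


def init_params(in_dim, hidden, rng):
    """Hypothetical small random initialization for the four gates."""
    return {
        name: (
            [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(hidden)],
            [[rng.uniform(-0.1, 0.1) for _ in range(hidden)] for _ in range(hidden)],
            [0.0] * hidden,
        )
        for name in ("input", "forget", "output", "cell")
    }


def event_states(frame_context, attended, params, hidden):
    """One-way pass; the input at step t is the concatenation [h^f_t ; a_t] (assumed)."""
    h, c = [0.0] * hidden, [0.0] * hidden
    states = []
    for h_f, a in zip(frame_context, attended):
        h, c = lstm_step(h_f + a, h, c, params)
        states.append(h)
    return states


# Toy usage: 4 steps, 6-D frame context, 3-D attended player feature, hidden size 5.
rng = random.Random(1)
T, F, A, H = 4, 6, 3, 5
frame_context = [[rng.uniform(-1, 1) for _ in range(F)] for _ in range(T)]
attended = [[rng.uniform(-1, 1) for _ in range(A)] for _ in range(T)]
params = init_params(F + A, H, rng)
h_e = event_states(frame_context, attended, params, H)
```

Because this pass runs in one direction only, changing the input at the last step leaves every earlier event state unchanged, unlike the bidirectional frame states.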

Finally, a weight vector $w_k$ is defined for each event class $k$, and its inner product with the event state $h_t^e$ gives the score used to classify the event. The loss function can be defined as:

$$L = \sum_{t=1}^{T} \sum_{k=1}^{K} \max\left(0,\ 1 - y_k\, w_k^\top h_t^e\right)^2$$

where $y_k$ is the ground-truth label of the video: 1 if it belongs to class $k$, otherwise -1.
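A small sketch of the scoring and loss. The class scores are the inner products $w_k^\top h_t^e$ and the labels follow the ±1 convention above. The `squared` flag is there because this sketch does not pin down whether the hinge is squared, and choosing the class from the last time step is an assumed decision rule; all names are hypothetical.

```python
def class_scores(weights, h_e):
    """Score for class k is the inner product w_k . h^e."""
    return [sum(w * h for w, h in zip(w_k, h_e)) for w_k in weights]


def encode_labels(true_class, num_classes=11):
    """y_k = 1 for the video's event class and -1 for every other class."""
    return [1 if k == true_class else -1 for k in range(num_classes)]


def hinge_loss(event_states, weights, labels, squared=True):
    """Sum over time steps t and classes k of max(0, 1 - y_k * w_k . h^e_t), optionally squared."""
    total = 0.0
    for h_e in event_states:
        for y_k, s_k in zip(labels, class_scores(weights, h_e)):
            margin = max(0.0, 1.0 - y_k * s_k)
            total += margin ** 2 if squared else margin
    return total


def predict(weights, h_e_last):
    """Pick the class with the highest score at the final time step (assumed decision rule)."""
    scores = class_scores(weights, h_e_last)
    return max(range(len(scores)), key=scores.__getitem__)


# Toy usage with 3 classes and a 2-D event state.
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
y = encode_labels(0, num_classes=3)    # [1, -1, -1]
print(hinge_loss([[0.5, 0.0]], W, y))  # 1.5
print(predict(W, [0.5, 0.0]))          # 0
```

A class contributes nothing to the loss once its score is on the correct side of the margin, i.e. at least 1 for the true class and at most -1 for every other class.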

Attention model

The main purpose of the attention model is to identify the key actors and increase their contribution when the event state is computed; a softmax function is used to achieve this. The paper proposes two variants: a model that tracks each player and a model that does not.
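The softmax weighting can be illustrated in isolation. In this hypothetical sketch the per-player relevance scores are given directly (in the model they come from a learned scoring function); the softmax turns them into weights that sum to 1, and the weighted sum lets the highest-scoring player dominate the pooled feature.

```python
import math


def softmax(scores):
    """Numerically stable softmax: subtracting the max does not change the result."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def attend(player_vectors, scores):
    """Turn per-player scores into weights and return the weighted sum of player vectors."""
    weights = softmax(scores)
    dim = len(player_vectors[0])
    pooled = [sum(w * v[d] for w, v in zip(weights, player_vectors)) for d in range(dim)]
    return weights, pooled


# Toy usage: player 0 gets the highest (hypothetical) relevance score.
players = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
weights, pooled = attend(players, [2.0, 0.0, 0.0])
```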

Tracking model

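A KLT tracker combined with graph matching is used to find the corresponding player in every frame, and a separate BLSTM is run for each player to compute its hidden state $h_{ti}^p$. A softmax over the players then assigns each player a weight at every frame, which identifies the key actors:

$$\gamma_{ti}^{\mathrm{track}} = \mathrm{softmax}_i\left(\phi(h_{ti}^p, h_t^f, h_{t-1}^e)\right), \qquad a_t^{\mathrm{track}} = \sum_{i=1}^{N_t} \gamma_{ti}^{\mathrm{track}}\, h_{ti}^p$$

where $\phi$ is a multilayer perceptron, $h_t^f$ is the frame context state, $h_{t-1}^e$ is the previous event state, and $N_t$ is the number of players in frame $t$.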
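The matching half of the tracking step can be sketched as an assignment problem between consecutive frames. This is a generic illustration, not the paper's tracker: it omits the KLT optical-flow part entirely, uses box IoU as the matching score (an assumption), and solves the assignment by brute force over permutations instead of an efficient method such as the Hungarian algorithm. The `min_iou` threshold and the function names are hypothetical.

```python
from itertools import permutations


def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_players(prev_boxes, curr_boxes, min_iou=0.3):
    """One-to-one assignment of boxes in frame t-1 to boxes in frame t maximizing total IoU.

    Brute force over permutations: exact, but only practical for a handful of players.
    Returns sorted (prev_index, curr_index) pairs; links below min_iou are dropped.
    """
    n, m = len(prev_boxes), len(curr_boxes)
    if n <= m:
        candidates = ([(i, j) for i, j in zip(range(n), perm)] for perm in permutations(range(m), n))
    else:
        candidates = ([(i, j) for j, i in zip(range(m), perm)] for perm in permutations(range(n), m))
    best_pairs, best_score = [], -1.0
    for pairs in candidates:
        score = sum(iou(prev_boxes[i], curr_boxes[j]) for i, j in pairs)
        if score > best_score:
            best_pairs, best_score = pairs, score
    return sorted((i, j) for i, j in best_pairs if iou(prev_boxes[i], curr_boxes[j]) >= min_iou)


prev = [(0, 0, 10, 10), (20, 0, 30, 10)]
curr = [(21, 0, 31, 10), (1, 0, 11, 10), (50, 50, 60, 60)]
print(match_players(prev, curr))  # [(0, 1), (1, 0)]
```

Brute force grows factorially with the number of players, so it only suits a demonstration; a practical tracker would use a polynomial-time assignment solver.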

Non-tracking model

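Without tracking, the per-player BLSTM hidden states $h_{ti}^p$ are replaced directly by the per-frame player features $p_{ti}$:

$$\gamma_{ti}^{\mathrm{notrack}} = \mathrm{softmax}_i\left(\phi(p_{ti}, h_t^f, h_{t-1}^e)\right), \qquad a_t^{\mathrm{notrack}} = \sum_{i=1}^{N_t} \gamma_{ti}^{\mathrm{notrack}}\, p_{ti}$$

where $\phi$ is a multilayer perceptron as in the tracking model, $h_t^f$ is the frame context state, and $h_{t-1}^e$ is the previous event state.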
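A sketch of the non-tracking variant, with the scoring function $\phi$ written as a one-hidden-layer MLP over the concatenation of the player feature, the frame context and the previous event state. The layer size, tanh activation, concatenation order, random weights and names are all assumptions. Since each detection is scored independently, the number of detections can change from frame to frame and no identity association is needed.

```python
import math
import random


def mlp_score(x, W, b, v):
    """phi: one tanh hidden layer followed by a linear read-out to a scalar score."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + bj) for row, bj in zip(W, b)]
    return sum(vj * hj for vj, hj in zip(v, hidden))


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def attend_untracked(players, h_f, h_e_prev, W, b, v):
    """Score each detection from [p_ti ; h^f_t ; h^e_{t-1}], softmax over detections, pool p_ti."""
    scores = [mlp_score(p + h_f + h_e_prev, W, b, v) for p in players]
    gamma = softmax(scores)
    dim = len(players[0])
    a_t = [sum(g * p[d] for g, p in zip(gamma, players)) for d in range(dim)]
    return gamma, a_t


# Toy sizes: 4-D player features, 3-D frame context, 2-D event state, 6 hidden units.
rng = random.Random(2)
P, F, E, HID = 4, 3, 2, 6
W = [[rng.uniform(-1, 1) for _ in range(P + F + E)] for _ in range(HID)]
b = [0.0] * HID
v = [rng.uniform(-1, 1) for _ in range(HID)]
h_f = [rng.uniform(-1, 1) for _ in range(F)]
h_e_prev = [0.0] * E
# The number of detections may differ between frames; no identities are required.
for n_detections in (3, 5):
    players = [[rng.uniform(-1, 1) for _ in range(P)] for _ in range(n_detections)]
    gamma, a_t = attend_untracked(players, h_f, h_e_prev, W, b, v)
    print(n_detections, [round(g, 3) for g in gamma])
```

Reordering the detections only reorders the weights; the pooled feature $a_t$ stays the same, which is why this variant does not need consistent player identities across frames.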

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only.
