This paper introduces a method for recognizing events in multi-person videos, built on an attention model combined with an RNN. RNNs, which have been widely studied recently, are well suited to sequential data in which each element depends on its context.
NCAA Basketball Data Set
This dataset was newly built by the authors. Each event clip is 4 seconds long, and the paper recognizes 11 event classes in total. On a subset of the training set, the authors annotated player bounding boxes and used them to train a MultiBox detector, which then detects the bounding boxes of all players in each frame.
RNN model
The paper uses LSTMs, a type of RNN, to process the frame sequence. In the paper's network diagram, BLSTM denotes a bidirectional LSTM.
Each Pi-BLSTM tracks the state of one player (player i) across the frame sequence, and the thickness of each box indicates the attention weight given to that player, with thicker boxes marking the key players.
First, a 1024-dimensional feature is extracted from each frame, and a 2805-dimensional feature is extracted for each player in each frame (1440-D spatial location information plus 1365-D appearance information). A BLSTM then computes the hidden state of each frame, which holds information about the global context. In the paper's notation, with f_t the feature of frame t:
$$h_t^f = \mathrm{BLSTM}_{frame}(h_{t-1}^f, h_{t+1}^f, f_t)$$
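As a rough illustration of why a bidirectional recurrence gives each frame a global view, the sketch below runs a recurrence forward and backward over a toy sequence and pairs the two states at every step. It is not the paper's network: a scalar tanh recurrence stands in for the LSTM cell, and the names `rnn_pass`, `bidirectional_states`, and the weights 0.5 are all assumptions made for illustration.

```python
import math

def rnn_pass(inputs, w_in=0.5, w_rec=0.5):
    """Run a scalar tanh recurrence (a stand-in for an LSTM cell) over a sequence."""
    h, states = 0.0, []
    for x in inputs:
        h = math.tanh(w_in * x + w_rec * h)
        states.append(h)
    return states

def bidirectional_states(inputs):
    """Pair each step's forward state (past context) with its backward state (future context)."""
    forward = rnn_pass(inputs)
    # Run over the reversed sequence, then flip back so index t lines up with frame t.
    backward = rnn_pass(inputs[::-1])[::-1]
    return list(zip(forward, backward))

# Toy scalar stand-ins for the 1024-D frame features.
frame_features = [1.0, 0.0, 0.0, 2.0]
states = bidirectional_states(frame_features)
```

The backward half of the state at frame 1 is non-zero even though frames 1 and 2 have zero input, because it has already seen the last frame. That is the sense in which the hidden state carries context from the whole clip.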
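The event state can then be computed with a unidirectional LSTM that takes the frame hidden state and the attention-weighted player feature a_t (see the attention model section) as inputs:
$$h_t^e = \mathrm{LSTM}(h_{t-1}^e, h_t^f, a_t)$$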
Finally, a weight vector w_k is defined for each event k, and its inner product with the event state is computed to classify the event. The loss is a squared hinge loss:
$$L = \frac{1}{2} \sum_{t=1}^{T} \sum_{k=1}^{K} \max\left(0,\; 1 - y_k\, w_k^{\top} h_t^e\right)^2$$
Here y_k is the ground-truth label of the video: 1 if the video belongs to event k, and -1 otherwise.
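The loss described above can be sketched in a few lines. This is a minimal version assuming the squared hinge form summed over time steps and events; the function names and the toy vectors are made up for illustration and do not come from the paper.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def squared_hinge_loss(event_states, class_weights, labels):
    """event_states: one vector per time step (the event state);
    class_weights: one weight vector per event class;
    labels: +1 if the video belongs to that class, otherwise -1."""
    total = 0.0
    for h in event_states:
        for w, y in zip(class_weights, labels):
            margin = 1.0 - y * dot(w, h)
            total += max(0.0, margin) ** 2
    return 0.5 * total

# Toy example: two time steps, two event classes.
event_states = [[1.0, 0.0], [0.0, 1.0]]
class_weights = [[2.0, 2.0], [-2.0, -2.0]]

loss_correct = squared_hinge_loss(event_states, class_weights, [1, -1])
loss_wrong = squared_hinge_loss(event_states, class_weights, [-1, 1])
```

With the labels that agree with the scores, every margin is satisfied and the loss is zero. Flipping the labels makes every term violate the margin, so the loss grows quadratically.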
Attention model
The main function of the attention model is to identify the key player and increase that player's contribution when computing the event state; a softmax function is used to do this. The paper proposes two variants: a model that tracks each player and a model that does not.
Tracking model
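A KLT tracker combined with bipartite graph matching associates each player across frames, and a separate BLSTM is set up for each player track to compute its hidden state from the player features p_{ti}. A softmax then assigns a weight to each player at each frame, thereby identifying the key players. In the paper's notation:
$$h_{ti}^p = \mathrm{BLSTM}_{track}(h_{t-1,i}^p, h_{t+1,i}^p, p_{ti})$$
$$a_t^{track} = \sum_{i} \gamma_{ti}^{track}\, h_{ti}^p, \qquad \gamma_{ti}^{track} = \mathrm{softmax}\left(\phi(h_t^f, h_{ti}^p, h_{t-1}^e);\, \tau\right)$$
where φ is a multilayer perceptron and τ is the softmax temperature.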
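The attention step can be illustrated with a small sketch: per-player scores go through a softmax, and the resulting weights pool the per-player states into one vector. The scores here are passed in directly as a stand-in for the output of the multilayer perceptron, and dividing by the temperature is an assumption about how the temperature enters. All names are hypothetical.

```python
import math

def softmax(scores, temperature=1.0):
    """Softmax over scores; a lower temperature gives a sharper distribution (assumed convention)."""
    top = max(scores)
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(player_states, scores, temperature=1.0):
    """Return the attention weights and the weighted sum of per-player state vectors."""
    weights = softmax(scores, temperature)
    dim = len(player_states[0])
    pooled = [sum(w * s[d] for w, s in zip(weights, player_states)) for d in range(dim)]
    return weights, pooled

# Two players with toy 2-D hidden states; scores stand in for the MLP output.
player_states = [[1.0, 0.0], [0.0, 1.0]]
uniform_weights, uniform_pooled = attend(player_states, [0.0, 0.0])
skewed_weights, skewed_pooled = attend(player_states, [0.0, 2.0])
sharp_weights, _ = attend(player_states, [0.0, 2.0], temperature=0.5)
```

Equal scores give a plain average of the players, while a higher score pulls the pooled vector toward that player. This is how the key player ends up dominating the input to the event LSTM.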
Non-tracking model
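The non-tracking model skips tracking and directly substitutes each player's detection feature p_{ti} for the per-player hidden state h_{ti}^p, so the attention is computed as
$$a_t^{notrack} = \sum_{i} \gamma_{ti}^{notrack}\, p_{ti}, \qquad \gamma_{ti}^{notrack} = \mathrm{softmax}\left(\phi(h_t^f, p_{ti}, h_{t-1}^e);\, \tau\right)$$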
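One practical consequence of dropping tracking is that each frame can be pooled on its own: detections need no identity across frames, and their number may change from frame to frame. The sketch below illustrates this under simplifying assumptions. `score_fn` is a hypothetical stand-in for the scoring network and only sees the detection feature, whereas the paper's scoring function also conditions on the frame state and the previous event state.

```python
import math

def attention_pool(features, scores):
    """Softmax-weighted sum of the detection features in a single frame."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(features[0])
    return [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]

def pool_video(detections_per_frame, score_fn):
    """Compute one pooled vector per frame, treating every frame independently."""
    return [
        attention_pool(dets, [score_fn(p) for p in dets])
        for dets in detections_per_frame
    ]

# Frame 0 has two detections, frame 1 has one; no identities are shared between frames.
detections = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[2.0, 2.0]],
]
pooled = pool_video(detections, score_fn=lambda p: 0.0)
```

Because nothing links a detection in one frame to a detection in the next, this variant avoids tracking errors, at the cost of losing the per-player temporal context that the tracking model's BLSTMs provide.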
"CV paper reading" Detecting events and key actors in multi-person videos