Paper: "Two-Stream Convolutional Networks for Action Recognition in Videos"
Three contributions of the paper
(1) A two-stream CNN architecture is proposed, consisting of separate spatial and temporal networks.
(2) Stacked multi-frame dense optical flow fields are used as the input to the temporal network, which allows motion information to be extracted.
(3) A multi-task training method is used to train jointly on two datasets.
Structure of the two-stream network
A video can naturally be decomposed into spatial and temporal components. The spatial component is the appearance of individual frames and carries information about objects, scenes, and so on; the temporal component is the optical flow between frames and carries motion information. Accordingly, the proposed architecture consists of two deep networks, one handling the spatial dimension and one the temporal dimension.
Each network ends in a softmax layer, and the two softmax outputs are fused in one of two ways: either by simple averaging, or by training an SVM that takes the softmax outputs as features.
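As a toy illustration of the averaging fusion (not the paper's code; the class-score vectors below are made up), the two streams' softmax scores are simply averaged before taking the argmax:

```python
import numpy as np

def fuse_by_averaging(spatial_scores, temporal_scores):
    """Fuse the two streams by averaging their softmax class scores."""
    return (spatial_scores + temporal_scores) / 2.0

# Hypothetical 3-class scores from the spatial and temporal streams.
spatial_scores = np.array([0.7, 0.2, 0.1])
temporal_scores = np.array([0.3, 0.6, 0.1])
fused = fuse_by_averaging(spatial_scores, temporal_scores)
predicted_class = int(np.argmax(fused))  # class with the highest fused score
```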
Spatial convolutional network
The input to the spatial network is a single frame, so it is essentially an image classification network. Many such architectures already exist (e.g. AlexNet, GoogLeNet), and they can be pre-trained on ImageNet and their parameters transferred to the action-recognition task.
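A minimal sketch of this transfer step, assuming a recent torchvision with an AlexNet backbone; the number of action classes is illustrative:

```python
import torch.nn as nn
import torchvision.models as models

NUM_CLASSES = 101  # e.g. an action-recognition dataset; adjust as needed

# Load an ImageNet-pretrained AlexNet and replace its final layer so the
# transferred parameters can be fine-tuned for action classification.
spatial_net = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
spatial_net.classifier[6] = nn.Linear(4096, NUM_CLASSES)
```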
Optical flow convolutional network (temporal stream)
The input to the temporal network is a stack of optical flow displacement fields computed between several consecutive frames, i.e. multiple layers of frame-to-frame optical flow. Because the optical flow field describes the motion of objects, this input gives the network explicit motion information.
Simple optical flow stacking
This method computes the optical flow between every pair of consecutive frames and simply stacks the resulting fields. To cover L+1 frames (which yields L optical flow fields), each flow field is decomposed into its x and y components, so the stacked input has 2L channels.
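A rough sketch of this stacking using OpenCV's Farneback flow (the paper uses a different optical flow implementation; the function below and its parameters are illustrative):

```python
import cv2
import numpy as np

def stack_optical_flow(frames):
    """Stack dense optical flow computed between consecutive grayscale frames.

    frames: list of L+1 grayscale images (H x W, uint8).
    Returns an (H, W, 2L) array: x and y displacement for each frame pair.
    """
    channels = []
    for prev, nxt in zip(frames[:-1], frames[1:]):
        flow = cv2.calcOpticalFlowFarneback(
            prev, nxt, None, 0.5, 3, 15, 3, 5, 1.2, 0)  # (H, W, 2)
        channels.append(flow[..., 0])  # horizontal (x) displacement
        channels.append(flow[..., 1])  # vertical (y) displacement
    return np.stack(channels, axis=-1)
```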
Trajectory-based optical flow stacking
Starting from a pixel in the first frame, its trajectory through the video can be traced by following the optical flow, and the flow is then sampled at the corresponding trajectory point in each frame (see the sketch below). As before, each flow field is decomposed into x and y components, giving 2L channels.
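A simplified sketch of the trajectory idea, assuming the L per-pair flow fields are already available; the nearest-pixel sampling and the function name are my own simplification:

```python
import numpy as np

def trajectory_stacked_flow(flows, start_points):
    """Sample flow along the trajectory traced from each starting pixel.

    flows: list of L flow fields, each of shape (H, W, 2).
    start_points: iterable of (row, col) positions in the first frame.
    Returns an array of shape (num_points, 2L): x/y flow sampled at the
    point where each trajectory sits in every frame.
    """
    H, W, _ = flows[0].shape
    samples = []
    for r0, c0 in start_points:
        r, c = float(r0), float(c0)
        values = []
        for flow in flows:
            ri = int(np.clip(round(r), 0, H - 1))
            ci = int(np.clip(round(c), 0, W - 1))
            dx, dy = flow[ri, ci]        # flow at the current trajectory point
            values.extend([dx, dy])
            c, r = c + dx, r + dy        # follow the displacement to the next frame
        samples.append(values)
    return np.asarray(samples)
```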
This approach reminds me of a problem raised in the Dense Trajectories (DT) paper: pixel "drift", which tends to occur once a trajectory is traced across many frames. Presumably these L frames are not meant to cover all frames of a training video. On the other hand, this method should be good at distinguishing foreground regions from the background.
Subtracting the mean optical flow
The mean displacement is subtracted from each flow field, mainly to remove the global motion component caused by camera movement.
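A minimal sketch, assuming the mean displacement over the whole field is taken as the camera-motion estimate:

```python
import numpy as np

def subtract_mean_flow(flow):
    """Remove a rough camera-motion estimate from one (H, W, 2) flow field."""
    mean_displacement = flow.reshape(-1, 2).mean(axis=0)  # (mean dx, mean dy)
    return flow - mean_displacement
```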
Multi-task training
For the spatial network, since the input is a single image and the task is ordinary image classification, there are plenty of large datasets available for pre-training, which counters the overfitting problem.
For the temporal network, however, there are few video datasets available for training. The author therefore uses a multi-task approach: there is only one network, but it is given two softmax output layers. The paper treats providing two softmax output layers as a form of regularisation. When the network is trained on two datasets, one softmax layer classifies the videos of one dataset and the other softmax layer classifies the videos of the other; during backpropagation the losses of the two softmax layers are added, and this total error is used to update the network weights (see the sketch below).
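A PyTorch-style sketch of the two-head idea; the trunk is a stand-in for the real convolutional network, and the layer sizes and class counts are illustrative:

```python
import torch
import torch.nn as nn

class TwoHeadTemporalNet(nn.Module):
    """Shared trunk with one classification head per dataset."""

    def __init__(self, in_features=512, hidden=256, classes_a=101, classes_b=51):
        super().__init__()
        # Stand-in for the shared layers of the temporal network.
        self.trunk = nn.Sequential(nn.Linear(in_features, hidden), nn.ReLU())
        self.head_a = nn.Linear(hidden, classes_a)  # classifies videos of dataset A
        self.head_b = nn.Linear(hidden, classes_b)  # classifies videos of dataset B

    def forward(self, x):
        shared = self.trunk(x)
        return self.head_a(shared), self.head_b(shared)

model = TwoHeadTemporalNet()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

def multitask_step(batch_a, labels_a, batch_b, labels_b):
    """One training step: each head sees only its own dataset, losses are summed."""
    logits_a, _ = model(batch_a)   # head A supervised on dataset A
    _, logits_b = model(batch_b)   # head B supervised on dataset B
    loss = criterion(logits_a, labels_a) + criterion(logits_b, labels_b)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```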
Some details.
1. The optical flow is computed in advance and saved as a preprocessing step, because computing it on the fly would slow down the network.
2. At test time, a fixed number of frames is sampled from the input video at equal temporal intervals. For each sampled frame, the stacked optical flow field is computed, and several spatial crops are sampled per frame; the score for the video is the average over all sampled frames and crops (see the sketch below).
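A sketch of the test-time averaging, assuming a hypothetical score_fn that maps a sampled frame (together with its stacked flow, omitted here) to class scores:

```python
import numpy as np

def predict_video(frames, score_fn, num_samples=25):
    """Average class scores over frames sampled at equal temporal intervals."""
    indices = np.linspace(0, len(frames) - 1, num_samples).astype(int)
    scores = [score_fn(frames[i]) for i in indices]
    return np.mean(scores, axis=0)  # video-level prediction
```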