[1] Z. Zhou, Y. Huang, W. Wang, L. Wang, T. Tan, "See the Forest for the Trees: Joint Spatial and Temporal Recurrent Neural Networks for Video-Based Person Re-Identification," 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, New York, 2017, pp. 6776-6785.
Summary:
Surveillance cameras are deployed in many different scenarios, and the task of identifying the same person across different cameras is known as person re-identification. The field has received increasing attention in computer vision recently, but video-based person re-identification has drawn far less attention than its image-based counterpart. Existing work usually consists of two separate steps, feature learning and metric learning, and many methods do not make full use of temporal and spatial information. In this paper, we focus on video-based person re-identification and build an end-to-end deep architecture that jointly performs feature learning and metric learning. We propose a temporal attention model that automatically picks out the most discriminative frames from a given video. In addition, when measuring the similarity to another video, a spatial recurrent model is used to integrate the contextual information around each location. Our approach thus processes spatial and temporal information jointly. Experiments on three public datasets demonstrate the effectiveness of each component of the proposed deep network, which surpasses the performance of state-of-the-art algorithms.
Conclusion:
In this paper, we propose an end-to-end deep neural network architecture that combines a temporal attention model, which selects discriminative frames, with a spatial recurrent model, which exploits contextual information when computing similarity. We carefully design experiments to demonstrate the effectiveness of each component of the proposed approach. Compared with state-of-the-art methods, our method performs better, which indicates that the proposed temporal attention model is effective for feature learning and the spatial recurrent model is useful for metric learning.
In recent years, many efforts have been made to improve person re-identification, yet it is still far from practical application. The remaining problems include severe occlusion, lighting changes, irregular variations in human pose, and differences in clothing and texture between people. In addition, it should be emphasized that the greatest limitation of person re-identification research is the lack of very large datasets, which causes many practical problems, especially as deep networks grow in popularity. Our future work is therefore to collect as much data as possible and to cover as many scenarios as possible.
Method Overview:
As shown in the overall network structure, a triplet of sequences is used as the network input. Each frame is passed through AlexNet to extract features, and the FC7 output is fed into the temporal attention model behind it, which accepts an input of one dimensionality and produces an output of another. The output of this block is then used to build a triplet loss as supervision.
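The triplet supervision on the sequence-level features can be sketched as a standard hinge-style triplet loss. This is a minimal sketch: the feature dimension and margin value below are illustrative assumptions, not values from the paper.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss on sequence-level features.

    The margin value is an assumption, not taken from the paper.
    """
    d_pos = np.sum((anchor - positive) ** 2)  # distance to same identity
    d_neg = np.sum((anchor - negative) ** 2)  # distance to different identity
    return max(0.0, d_pos - d_neg + margin)

# Toy sequence-level features (dimension 4 chosen purely for illustration).
a = np.array([1.0, 0.0, 0.0, 0.0])   # anchor sequence feature
p = np.array([0.9, 0.1, 0.0, 0.0])   # same person, different camera
n = np.array([0.0, 1.0, 0.0, 0.0])   # different person
loss = triplet_loss(a, p, n)
```

The loss pushes the anchor closer to the positive than to the negative by at least the margin; a well-separated triplet, as above, yields zero loss.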
At the same time, the authors take the output of the POOL5 layer and feed it into the spatial recurrent network, with a pair of samples (positive or negative) as input. The goal of this branch is to determine whether the pair belongs to the same person, so it is a binary classification model.
The final overall loss is the sum of the two, and the following formula is used as the basis at test time.
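Since the formula itself is not reproduced in these notes, here is a minimal sketch of the combined objective, assuming a plain sum of the triplet loss and the binary cross-entropy of the same/different classifier; the unweighted sum is an assumption.

```python
import numpy as np

def binary_cross_entropy(p_same, label):
    """Loss for the same/different-person classification branch."""
    p_same = np.clip(p_same, 1e-7, 1 - 1e-7)  # numerical safety
    return -(label * np.log(p_same) + (1 - label) * np.log(1 - p_same))

def combined_loss(triplet_term, p_same, label):
    # The overall loss is described as the sum of the two branch losses;
    # an unweighted sum is assumed here.
    return triplet_term + binary_cross_entropy(p_same, label)

# Example: triplet loss of 0.5, classifier says "same person" with p=0.9,
# and the pair really is the same person (label=1).
loss = combined_loss(0.5, p_same=0.9, label=1)
```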
The following sections describe the authors' specially designed temporal attention model (TAM) and spatial recurrent model (SRM).
The structure of the TAM is shown below; its input is the T feature maps produced by the FC7 layer for the frame sequence X.
This input is first passed through an attention layer, which is structured as follows:
The input can be seen as a matrix of a certain dimensionality, and the final output is a matrix of another; this step produces initial attention weights over the original sequence. The authors use several attention blocks that produce different outputs for the same input; the only non-shared quantities across attention blocks are the hidden states of the previous stage. The results of these initial attention blocks are then fed into an RNN, each step of which produces an output of fixed dimensionality, followed by average pooling over time to give the TAM output.
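The flow just described can be sketched in toy numpy: softmax attention scores over the T frame features, a recurrent pass over the attended sequence, and average pooling over time. All shapes, the plain-tanh recurrence (standing in for the paper's RNN), and the single-attention-block simplification are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, h = 4, 8, 6                # frames, FC7 feature dim, hidden dim (toy sizes)
X = rng.standard_normal((T, d))  # one frame feature per row

# Attention layer: score each frame, normalize with softmax.
w_att = rng.standard_normal(d)
scores = X @ w_att
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()             # attention weights over the T frames
X_att = alpha[:, None] * X       # re-weighted ("attended") sequence

# Simple recurrent pass over the attended frames (stand-in for the RNN).
W_xh = rng.standard_normal((d, h))
W_hh = rng.standard_normal((h, h))
h_t = np.zeros(h)
outputs = []
for t in range(T):
    h_t = np.tanh(X_att[t] @ W_xh + h_t @ W_hh)
    outputs.append(h_t)

# Temporal average pooling gives the TAM output.
tam_out = np.mean(outputs, axis=0)
```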
The goal of the SRM is to handle metric learning between videos; it is structured as follows:
It accepts the POOL5 features as input and subtracts one feature map of a pair from the other, which roughly computes the difference between the two video sequences; the subsequent structure then processes this information.
First, the difference map is passed through spatial RNNs in six different directions. The authors do not detail the RNN structure beyond stating that it is implemented with LSTMs; since the RNN input and output have the same total dimensionality, it can be inferred that each LSTM emits an output at every recurrent step, and these outputs are stacked together. The authors then stack the results of the six spatial RNNs, which represent, at each position along the depth dimension, the information extracted from the six directions, and use a 1x1 convolution kernel to summarize the information from the six directions; the result is called the contextual feature. The authors claim this makes the representation less sensitive to lighting changes and occlusion (?).
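The directional-scan-plus-1x1-convolution idea can be sketched as follows. This is a rough stand-in, not the paper's architecture: a tanh recurrence replaces the LSTM, only the four axis-aligned scans are implemented (two are repeated to keep six directional maps, since the notes do not specify the exact six directions), and all sizes are toy values.

```python
import numpy as np

def directional_scan(fmap, axis, reverse=False):
    """Simple recurrent scan along one spatial axis.

    h[i] = tanh(x[i] + 0.5 * h[i-1]) is a stand-in for the paper's LSTM.
    """
    x = np.flip(fmap, axis=axis) if reverse else fmap
    out = np.zeros_like(x)
    h = np.zeros(np.delete(x.shape, axis))
    for i in range(x.shape[axis]):
        sl = [slice(None)] * x.ndim
        sl[axis] = i
        h = np.tanh(x[tuple(sl)] + 0.5 * h)
        out[tuple(sl)] = h
    return np.flip(out, axis=axis) if reverse else out

H, W, C = 5, 4, 3  # toy spatial size and channel count
diff = np.random.default_rng(1).standard_normal((H, W, C))  # pool5 difference map

# Six directional scans (axis, reversed); the last two repeat the vertical
# scans as placeholders for the paper's unspecified remaining directions.
directions = [(0, False), (0, True), (1, False), (1, True), (0, False), (0, True)]
stacked = np.concatenate(
    [directional_scan(diff, ax, rev) for ax, rev in directions], axis=-1
)  # (H, W, 6*C): per-position information from all six scans

# A 1x1 convolution is just a per-position linear map over channels,
# summarizing the six directions into the "contextual feature".
w = np.random.default_rng(2).standard_normal((6 * C, C))
contextual = stacked @ w  # (H, W, C)
```

The key design point survives the simplification: each output position aggregates information propagated from its surroundings along every scan direction, which is what makes the contextual feature more robust than a purely local difference.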