I. Related theory
This post covers a 2015 CVPR paper on image-patch similarity, "Learning to Compare Image Patches via Convolutional Neural Networks", which improves on the classic Siamese network. To follow the paper's algorithm you should be familiar with Siamese networks (the classic early paper "Signature Verification Using a Siamese Time Delay Neural Network") and with Kaiming He's spatial pyramid pooling ("Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition"), because the paper is essentially a series of modifications to the Siamese network, and it also borrows spatial pyramid pooling so that the network can accept input images of different sizes.
Overall network structure
As the figure shows, the goal is to compare two images for similarity, so the input to the convolutional network we build is a pair of images, and the output is a single score. Strictly speaking, I think "similarity" is slightly the wrong word here; "matching score" would be a better reading. In the training data used by the paper, a matching pair is labeled y = 1 and a non-matching pair is labeled y = -1; that is, the labels encode match/non-match, not a graded similarity. For example, given three objects (a pen, a pencil, and a schoolbag), the pen/pencil pair in the training data would be labeled y = 1, not some graded similarity such as y = 0.9. So the word "similarity" is a little loose, even though the value the network finally computes does lie between -1 and 1.
Main innovation of the paper: in my view, the key idea is to merge the two branches of the Siamese network into one, which improves accuracy, as shown in the figures below.
Siamese Network
The paper's algorithm: the 2-channel network
First, why does the author call it a "2-channel" network? Understanding this term makes the rest of the algorithm much easier to follow. As the Siamese figure above shows, that network has two branches: to compare two patches patch1 and patch2, the general Siamese idea is to pass each patch through the network separately to extract a feature vector, and then apply a similarity loss to the two feature vectors at the last layer to train the network (more on this below). Since feature extraction for patch1 and patch2 is carried out independently, we could also call the Siamese network a "2-branch" network. So what does the paper's proposed 2-channel network mean? Originally patch1 and patch2 are two separate single-channel grayscale images. The author's idea is to put them together and treat the pair as one two-channel image: two (1, 64, 64) single-channel arrays are stacked into a single (2, 64, 64) two-channel array, and that array is fed to the network as its input. Hence the name: 2-channel.
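To make the difference concrete, here is a minimal numpy sketch of the two input conventions. The patch contents are random placeholders; only the shapes matter.

```python
import numpy as np

# Hypothetical 64x64 grayscale patches (values in [0, 1]); in practice these
# would be the two image patches we want to compare.
patch1 = np.random.rand(64, 64)
patch2 = np.random.rand(64, 64)

# Siamese ("2-branch") view: each patch is its own (1, 64, 64) input,
# processed independently by the shared network.
branch_input1 = patch1[np.newaxis, :, :]   # shape (1, 64, 64)
branch_input2 = patch2[np.newaxis, :, :]   # shape (1, 64, 64)

# 2-channel view: stack the two patches along the channel axis, so the
# pair becomes ONE (2, 64, 64) input to a single network.
two_channel_input = np.stack([patch1, patch2], axis=0)
print(two_channel_input.shape)  # (2, 64, 64)
```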
OK, that is the paper's main innovation. Once you understand it, the rest of the paper is much easier to read; you could say this is half the idea of the paper, and it really is that simple.
II. Siamese network theory
This section reviews the classic Siamese network; if you already know its structure, feel free to skip it.
1. "Learning a Similarity Metric Discriminatively, with Application to Face Verification"
Siamese network: this architecture was originally used for signature verification, i.e. what we usually call handwriting identification, an algorithm from many years ago. The idea is to use a neural network to extract a descriptor from each image, obtaining a feature vector, and then judge similarity from the two images' feature vectors. This is somewhat like SIFT, except that a CNN performs the feature extraction, and the feature vectors are used to build a loss function for training the network. Below I use the 2005 CVPR paper "Learning a Similarity Metric Discriminatively, with Application to Face Verification" for a brief explanation. That paper applies the Siamese network to face similarity, so it can even be used for face recognition; I chose it because its network diagram is clean and easy to understand. As the figure shows, there are two branches taking the input images X1 and X2. (Note: the two branches are actually the same CNN with the same parameters, thanks to weight sharing; the paper draws them as two branches only for readability.) Each branch includes convolution, pooling, and related operations. If two branches are hard to think about, consider a single branch instead: put simply, the Siamese network splits into a first half and a second half. The first half does feature extraction: each of the two images passes through it, producing output feature vectors G_W(X1) and G_W(X2). The second half is a distance measure between the two feature vectors, which serves as the similarity function for the two images (Equation 1).
Siamese Network
As shown, to decide whether images X1 and X2 are similar, we build a mapping function G_W(x); feeding in X1 and X2 as arguments gives G_W(X1) and G_W(X2), the feature vectors used to judge whether X1 and X2 are similar. The training objective is then to make this distance function small for similar pairs and large for dissimilar ones:
Training the network with this loss lets us discriminate the similarity of two faces. Throughout the process, the two branches use the same function, i.e. the same weights and the same network structure, so G_W(x) can be regarded simply as a feature extractor. The Siamese network is thus a feature-extraction operator applied to each image, plus a final layer that defines a similarity loss between the feature vectors.
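A tiny numpy sketch of this idea, with a toy linear map standing in for the shared CNN (my simplification, not the paper's architecture): both inputs go through the same weights, and the similarity energy is the Euclidean distance between the two embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the shared mapping G_W: one random linear layer.
# In the real Siamese network this is a full CNN; the key point is that
# BOTH inputs pass through the SAME weights W (weight sharing).
W = rng.standard_normal((32, 64 * 64))

def G_W(x):
    """Shared feature extractor: flatten the patch and apply W."""
    return W @ x.ravel()

def E_W(x1, x2):
    """Similarity energy: Euclidean distance between the two embeddings."""
    return np.linalg.norm(G_W(x1) - G_W(x2))

x1 = rng.random((64, 64))
x2 = rng.random((64, 64))

# Identical inputs through shared weights give distance 0;
# different inputs give a positive distance.
print(E_W(x1, x1), E_W(x1, x2) > 0.0)
```

Training would then push E_W down for matching pairs and up for non-matching pairs.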
2. Siamese network
OK, back to the topic of this post: below is the Siamese network (with shared weights) used by the paper. At the top of the network, the output layer is formed by linear fully-connected layers with ReLU activations; in the paper there are two fully-connected layers, each hidden layer containing 512 neurons.
Besides the Siamese network, the paper also introduces a pseudo-Siamese network. Its biggest difference from the Siamese network is that the two branches do not share weights: they are genuinely two separate branch networks. In the pseudo-Siamese network each branch is a different mapping function, i.e. the structures that extract features differ; the left and right branches can have different weights, different layers, and so on, and the two functions are unrelated until they are joined at the final fully-connected layer. This network has nearly twice as many trainable parameters as the Siamese network, and of course it is also more flexible.
In fact, the original Siamese and pseudo-Siamese networks are mentioned only as baselines for the author's proposed 2-channel network, since the paper ultimately compares their accuracies; that is why the author spends so much space on them. Too much of this background gets messy, so I will not go into more detail; it is not the point. What follows explains the 2-channel network, the main innovation of the paper, so that part deserves a careful read. Because the paper presents its method as a step-by-step improvement of the network, I will follow the same route, with the Siamese network as the starting point. The paper first changes Siamese into 2-channel; this is the first evolution of the algorithm, and it improves accuracy. It then proposes the central-surround two-stream network, which changes only the network's input, not its structure, and can be combined with 2-channel or Siamese to further improve accuracy.
III. First evolution: from Siamese to 2-channel (Innovation 1)
Overall structure of the 2-channel network: in the Siamese and pseudo-Siamese networks above, each branch is essentially a feature-extraction step, and the last layer is essentially a function that computes the similarity of the feature vectors. The author's 2-channel structure skips the explicit per-branch feature extraction and directly learns the similarity function. The two-channel structure simply treats the two input grayscale images as one two-channel image.
Accordingly, the last layer of this network is a fully-connected layer with a single output neuron, which directly represents the similarity of the two images. Training directly on the two-channel image is faster and more convenient. In a CNN, a two-channel input is equivalent to feeding the network two feature maps, so after the first convolutional layer the pixels of the two images are already weighted together in each output map; in other words, with the 2-channel method the two input images are mixed from the first convolution onward. The Siamese network, by contrast, only links the neurons from the two images at the final fully-connected layers; to me this is the biggest difference between 2-channel and Siamese. The author later verifies experimentally the benefit of associating the two images from the first layer; in the author's own words: "This is something that indicates that it is important to jointly use information from both patches right from the first layer of the network."
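To see why the mixing happens immediately, here is a naive numpy sketch of a single first-layer filter over a 2-channel input (a toy 8x8 example, not the paper's layer sizes): every output pixel sums over both channels, i.e. over both patches.

```python
import numpy as np

def conv2d_single(x, kernel):
    """Naive 'valid' 2-D convolution of a multi-channel input with one
    multi-channel kernel, producing a single feature map.
    x: (C, H, W), kernel: (C, kh, kw)."""
    C, H, W = x.shape
    _, kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # The sum runs over BOTH channels, i.e. over both patches:
            # information from patch1 and patch2 is combined right here,
            # in the very first layer.
            out[i, j] = np.sum(x[:, i:i + kh, j:j + kw] * kernel)
    return out

rng = np.random.default_rng(0)
patch1 = rng.random((8, 8))
patch2 = rng.random((8, 8))
x = np.stack([patch1, patch2])           # (2, 8, 8) 2-channel input
kernel = rng.standard_normal((2, 3, 3))  # one 3x3 kernel with 2 input channels

fmap = conv2d_single(x, kernel)
print(fmap.shape)  # (6, 6)
```

In a Siamese network, by contrast, each branch's first convolution only ever sees one of the two patches.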
One architectural detail worth mentioning in passing: the convolution kernel size. All convolution kernels in the paper are 3*3, because, as shown in "Very Deep Convolutional Networks for Large-Scale Image Recognition", stacks of small kernels provide more non-linearity than a single larger kernel and work better. In short, the recent trend in CNNs is toward smaller kernels, which bring many benefits.
IV. Second evolution: the central-surround two-stream network (Innovation 2)
This innovation requires a slight modification of the network structure above. Suppose the input is a 64*64 image. The central-surround two-stream network turns the 64*64 image into two 32*32 images before feeding the network. How are these two 32*32 images computed? That is the central-surround method: the first image is a 32*32 crop taken from the center of the original image, i.e. the light-blue region in the figure.
The second image is obtained by downsampling the whole picture: the 64*64 image is subsampled down to 32*32. Why did the author split one 64*64 image into two 32*32 images? This is essentially multi-scale processing. Image processing often uses multi-resolution and multi-scale representations, e.g. SIFT and Gaussian pyramids; in short, the author reports that multi-resolution improves the matching of two images. The central-surround two-stream scheme can be combined with the 2-channel and Siamese variants above to improve accuracy. Below is the network structure that combines Siamese with the central-surround two-stream network:
The figure above takes the Siamese network as the base model and uses the central-surround two-stream idea to build a new model: only the input part of the network changes, from a single-scale to a multi-scale input.
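The input transformation itself is simple; here is a numpy sketch. The 2x downsampling uses plain subsampling, which is my assumption for illustration; any standard resize would work.

```python
import numpy as np

def central_surround(img):
    """Split a 64x64 patch into the two 32x32 streams described above.
    central: the 32x32 crop around the image centre (fine detail);
    surround: the whole 64x64 image downsampled 2x (coarse context)."""
    h, w = img.shape
    assert h == 64 and w == 64
    central = img[16:48, 16:48]   # centre crop
    surround = img[::2, ::2]      # 2x subsample of the full patch
    return central, surround

img = np.arange(64 * 64, dtype=float).reshape(64, 64)
central, surround = central_surround(img)
print(central.shape, surround.shape)  # (32, 32) (32, 32)
```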
V. Third evolution: combining with SPP (spatial pyramid pooling)
Spatial pyramid pooling: this is the SPP pooling layer. What is it for? It is loosely related to the multi-scale idea above. Existing convolutional networks generally require a fixed input image size; that was my understanding of neural networks until I learned about SPP, which was a real eye-opener for a novice like me. In most current algorithms the training images have a fixed size, say 32*32, 96*96, or 227*227; that is, the training data must be normalized to the same size. But what if my training images come in all sorts of sizes? Must I crop or resize them all to one size before feeding them into the CNN for training? That is exactly the problem SPP solves: the training images no longer need to be normalized, and it reportedly works at least as well as the traditional approach. Below is the network structure that combines Siamese with SPP:
The idea is simply to add an SPP layer in front of the fully-connected layers.
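A minimal numpy sketch of SPP max-pooling shows why this works: each pyramid level pools the feature map into a fixed n*n grid, so the concatenated output length depends only on the channel count and the pyramid levels, never on the spatial size. The levels (1, 2, 4) here are an illustrative choice, not necessarily the paper's.

```python
import numpy as np

def spp(fmap, levels=(1, 2, 4)):
    """Spatial pyramid pooling sketch: max-pool a (C, H, W) feature map
    into an n x n grid for each pyramid level and concatenate.  The output
    length, C * sum(n*n for n in levels), is the same for ANY H, W --
    which is what lets the network accept variable-size inputs."""
    C, H, W = fmap.shape
    pooled = []
    for n in levels:
        # Bin edges that cover the whole map even when H, W don't divide by n.
        hs = np.linspace(0, H, n + 1).astype(int)
        ws = np.linspace(0, W, n + 1).astype(int)
        for i in range(n):
            for j in range(n):
                cell = fmap[:, hs[i]:hs[i + 1], ws[j]:ws[j + 1]]
                pooled.append(cell.max(axis=(1, 2)))
    return np.concatenate(pooled)

rng = np.random.default_rng(0)
# Two different spatial sizes -> identical output length: 8 * (1+4+16) = 168.
v1 = spp(rng.random((8, 13, 17)))
v2 = spp(rng.random((8, 32, 32)))
print(v1.shape, v2.shape)  # (168,) (168,)
```

The fully-connected layers after the SPP layer then always see a fixed-length vector, regardless of the input image size.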
For the details of the SPP pooling method itself, see Kaiming He's paper "Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition"; I will explain the implementation of SPP in a later post, since cramming too much into one article makes it tiring to read.
In short, this step lets the network accept images of various sizes, improving its usability, robustness, and so on.
OK, after the three evolutions above, the final and most accurate network structure, i.e. the paper's final algorithm, is 2-channel + central-surround two-stream + SPP. Since the paper does not include a diagram of this combined structure and I am too lazy to draw one, I cannot provide a figure of the final network.
VI. Network Training
First, the paper uses the following loss function:
The first term of the formula is the regularizer, an L2 penalty on the weights. The second term is the error loss: o_i is the network's output neuron for training pair i, and y_i is +1 or -1, +1 when the input pair matches and -1 when it does not.
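As I read the description above, the loss is a regularized hinge loss; a numpy sketch (the weight-decay value lam is a placeholder, not necessarily the paper's setting):

```python
import numpy as np

def loss(outputs, labels, weights, lam=0.0005):
    """Regularized hinge loss as described above: an L2 penalty on the
    weights plus max(0, 1 - y_i * o_i) per training pair, where o_i is
    the network output and y_i is +1 (match) or -1 (non-match).
    lam is a placeholder weight-decay value."""
    hinge = np.maximum(0.0, 1.0 - labels * outputs).sum()
    reg = lam / 2.0 * np.sum(weights ** 2)
    return reg + hinge

outputs = np.array([2.0, -1.5, 0.2])   # network scores o_i
labels = np.array([1.0, -1.0, -1.0])   # +1 match, -1 non-match
w = np.zeros(4)                        # toy weights, so reg term is 0
# The first two pairs are confidently correct (hinge 0); the third is a
# non-match scored slightly positive, contributing 1 - (-1)*0.2 = 1.2.
print(loss(outputs, labels, w))  # 1.2
```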
The parameters are updated with ASGD, with a constant learning rate of 1.0, momentum 0.9, and a weight decay of:
The training mini-batch size is 128, and the weights are randomly initialized.
OK, those are the network's hyperparameter settings. Next is the data-processing part, which is mainly data augmentation, i.e. enlarging the dataset with rotations, mirror flips, and similar operations. The augmentation used by the paper includes horizontal flips, vertical flips, and rotations of 90, 180, and 270 degrees. For stopping, the authors did not use early stopping; instead they simply let the machine run for a few days and came back when it was free to inspect and compare the results (which is a bit low-tech). If you are new to CNNs and have not heard of data augmentation, see: http://blog.csdn.net/tanhongguang1/article/details/46279991. The paper also applies random data augmentation during training.
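The augmentations listed above are one-liners in numpy; a quick sketch:

```python
import numpy as np

def augment(patch):
    """Generate the augmented copies mentioned above: horizontal flip,
    vertical flip, and 90/180/270-degree rotations of a square patch."""
    return [
        patch,                      # original
        np.fliplr(patch),           # horizontal flip
        np.flipud(patch),           # vertical flip
        np.rot90(patch, 1),         # 90 degrees
        np.rot90(patch, 2),         # 180 degrees
        np.rot90(patch, 3),         # 270 degrees
    ]

patch = np.arange(9).reshape(3, 3)
print(len(augment(patch)))  # 6
```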
References:
1. "Learning to Compare Image Patches via Convolutional Neural Networks"
2. "Discriminative Learning of Local Image Descriptors"
3. "Signature Verification Using a Siamese Time Delay Neural Network"
4. "Learning Visual Similarity for Product Design with Convolutional Neural Networks"
5. "Learning a Similarity Metric Discriminatively, with Application to Face Verification"
Image similarity discrimination based on a 2-channel network