1. Why introduce this paper?
This is one of the works improving on the triplet network; its main idea is hard sample mining on a very large dataset (face recognition). Face recognition work is very useful for image matching, since both come down to feature extraction and to how the sample data is mined.
The triplet network originates from deep metric learning ("Deep metric learning using triplet network"), where the triplet loss for training on three images at a time was also presented. Much similar face recognition and matching work is done on large datasets, which requires using the data efficiently. The reason is that with margin-based loss functions (triplet loss, SoftPN, the improved triplet loss used in PN-Net), most samples satisfy the margin easily and stop contributing gradients towards the end of training; the loss then stops changing and training stalls. How to mine hard samples effectively (OHEM) therefore becomes the key, where the hard samples include matched (same-identity) pairs whose distance is large and non-matched (different-identity) pairs whose distance is small.
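As a reminder (the notation below is mine, not copied from the paper), the margin-based triplet loss for an anchor a, positive p and negative n is

$$ L(a, p, n) = \max\bigl(0,\ \|f(a)-f(p)\|_2^2 - \|f(a)-f(n)\|_2^2 + \alpha \bigr) $$

Once $\|f(a)-f(n)\|_2^2 \ge \|f(a)-f(p)\|_2^2 + \alpha$, the loss is exactly zero and the triplet contributes no gradient, which is why most triplets become useless late in training.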
2. Contributions of this paper
The main improvement in this paper is in hard sample mining. Two kinds of OHNM already exist, but the authors argue that with a very large number of identities (a very large dataset) they cannot mine hard samples effectively; training easily falls into a local minimum and stops updating. The authors therefore propose first clustering the data by similarity and then running batch OHNM, so that the search space shrinks and hard samples become easier to dig out. So why use triplet loss rather than softmax loss? First, as the number of classes increases, the fully connected layer feeding the softmax loss grows with it, which loads GPU memory and lengthens the training cycle. Second, if each class has very little data, training a softmax classifier becomes difficult (the paper illustrates this with a figure).
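To make the memory argument concrete, here is a back-of-the-envelope calculation (the embedding size is an assumed value, not a number from the paper):

```python
# Why a softmax head gets expensive with very many identities:
embedding_dim = 512                          # assumed embedding size of the backbone
num_identities = 100_000                     # the scale in the paper's title
fc_weights = embedding_dim * num_identities  # weights of the final FC layer alone
print(f"softmax FC weights: {fc_weights:,}")  # 51,200,000 parameters, before biases
```

The classification layer alone costs tens of millions of parameters, and its activations and gradients grow with the batch size as well.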
However, triplet loss is not easy to use either: how can it be optimized efficiently at large scale? With triplets the amount of data explodes; the number of composable triplets is enormous, and traversing all combinations is unrealistic, or at least extremely inefficient. There are two main remedies. One is to convert the triplet loss into a softmax-style loss (as in the triplet network paper, which combines the triplet outputs with a softmax and an MSE loss); the other is OHNM or batch OHNM. But the second kind of mining draws its batches directly from the whole sample space and treats any triplet in the batch that violates the margin as a hard sample. This does not guarantee that the batch contains the really confusable samples, and once an (a, p, n) group is selected it is fixed, with no other combinations tried, so it is inefficient. The author therefore argues that finding similar individuals (identities) is the key to improving the triplet network.
My understanding: suppose we have face data for 10 individuals, two of whom (say A and B) look very much alike and are hard to tell apart. The earlier approach to hard mining is to randomly form (a, p, n) triplets over this whole scattered space, train in batches, and treat the misclassified ones as hard samples. But those triplets are already fixed, so the "hard" samples are not necessarily the truly hard ones, unless by luck many triplets happen to contain both A and B. The author's idea is to first cluster all the samples to see which ones lie close together (A and B will end up close in the clustering result), then select (a, p) pairs inside one cluster to compose a batch and pick n within that batch; the negative sample n found this way is a genuinely hard one.
3. Three methods of hard example mining (OHEM)
1) Triplet with OHNM
This formula by itself does not really reflect hard mining: a triplet is simply counted as hard when it violates the margin constraint, i.e. when the positive distance is not smaller than the negative distance by at least the margin α. T denotes the entire triplet set, so the triplets in this part are selected directly from the whole sample space, which is what most of us do.
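The loss being described (my reconstruction, using standard triplet-loss notation) sums over the entire triplet set T, and a triplet counts as hard exactly when it violates the margin:

$$ \mathcal{L} = \sum_{(a,p,n)\in T} \max\bigl(0,\ \|f(a)-f(p)\|_2^2 - \|f(a)-f(n)\|_2^2 + \alpha \bigr) $$

i.e. the hard triplets are those with $\|f(a)-f(n)\|_2^2 < \|f(a)-f(p)\|_2^2 + \alpha$.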
In the illustration above, red is the anchor, light green the positives, blue the negatives, and dark blue the hard negatives.
In this method, all matching pairs and negatives are selected from the whole sample space. As can be seen, most negatives are far away from the positive matching pairs; even though the negatives are still grouped into batches during training, the truly hard negatives are rarely selected.
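For later contrast, here is a minimal sketch of this plain selection scheme (all names below are illustrative; the paper gives no code):

```python
import random

def sample_random_triplets(samples_by_identity, num_triplets):
    """Plain OHNM-style sampling: (a, p, n) triplets are drawn at random from the
    whole dataset; only those that later violate the margin contribute to the loss."""
    identities = list(samples_by_identity)                    # identity -> list of images
    triplets = []
    for _ in range(num_triplets):
        pos_id, neg_id = random.sample(identities, 2)         # two different people
        a, p = random.sample(samples_by_identity[pos_id], 2)  # a matching pair
        n = random.choice(samples_by_identity[neg_id])        # a negative from anywhere
        triplets.append((a, p, n))
    return triplets
```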
Conclusion: the triplets are selected at random from the whole scattered sample space.
2) Triplet with Batch OHNM
The idea here is to dig out as many hard samples as possible within a batch: the negative is no longer selected from the whole sample space but from inside a batch. In other words, only the matching pairs are selected from the whole sample space; training then proceeds batch by batch, and the negatives are chosen within each batch. This narrows the search from a global one to a local one, so hard samples are more likely to be found.
As can be seen, the matching pairs are first composed into a batch and passed through the network to obtain distances; similarity is judged from these distances, triplets are then composed within the batch, and they are fed to the loss function for optimization.
It is worth noting that one should not directly take the negative closest to the matching pair as the hard sample, which can make training go badly; a reasonably close (semi-hard) negative should be chosen instead. My understanding is that three images that are extremely similar are inherently ambiguous in class, so training on them amounts to training on "wrongly labelled" samples, which can push the network into mode collapse. A semi-hard selection is sketched below.
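A minimal sketch of batch OHNM with semi-hard negative selection, assuming a batch built from matching pairs and a PyTorch-style workflow (the function name, margin value and fallback rule are my assumptions):

```python
import torch

def mine_triplets_in_batch(embeddings, labels, margin=0.2):
    """For each anchor in the batch, keep a positive of the same identity and pick a
    semi-hard negative from the other identities in the same batch."""
    dist = torch.cdist(embeddings, embeddings, p=2)        # (B, B) pairwise L2 distances
    same_id = labels.unsqueeze(0) == labels.unsqueeze(1)   # (B, B) same-identity mask
    idx = torch.arange(len(labels), device=labels.device)

    triplets = []
    for i in range(len(labels)):
        pos_idx = (same_id[i] & (idx != i)).nonzero().flatten()
        if len(pos_idx) == 0:
            continue                                        # no positive for this anchor
        p = pos_idx[torch.argmax(dist[i, pos_idx])]         # hardest positive in the batch
        d_ap = dist[i, p]

        # semi-hard: farther away than the positive, but still inside the margin;
        # this avoids the very closest negatives, which can destabilize training
        neg_mask = (~same_id[i]) & (dist[i] > d_ap) & (dist[i] < d_ap + margin)
        neg_idx = neg_mask.nonzero().flatten()
        if len(neg_idx) == 0:                               # fallback: hardest negative
            neg_idx = (~same_id[i]).nonzero().flatten()
        if len(neg_idx) == 0:
            continue                                        # batch holds a single identity
        n = neg_idx[torch.argmin(dist[i, neg_idx])]
        triplets.append((i, int(p), int(n)))
    return triplets
```

The returned index triplets would then be fed into the usual margin-based triplet loss over this batch.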
Summary: randomly select the matching pairs from the whole scattered sample space, then compose the triplets within the batch when computing the loss.
3) Triplet with subspace learning
In this paper the author argues that even selecting matching pairs to compose a batch is a step backwards, because the batch is still drawn at random from the whole shuffled sample space. With an especially long list of identities (very many individual faces), the probability that matching samples of two lookalike people, say A and B, land in the same batch at the same time is very low. And if similar people never share a batch, there are no particularly hard samples to dig up. A simple idea is to generate one representation per identity and run K-means clustering on these representations; each resulting cluster is a subspace. Identities that lie close together within a subspace can be regarded as hard to distinguish, i.e. very similar. To accelerate training, a pre-trained network is used for initialization; although feature extraction and clustering take time, subspace learning greatly reduces the search space and the overall training time.
Two choices matter: the identity representation and the number of clusters (the number of subspaces M). If M is too small, the subspace partition is too coarse: identities that are not really similar get clustered together, and the truly hard samples remain difficult to mine. If M is too large, many subspaces contain only a few distinct identities; in the extreme each identity becomes its own subspace, and mining efficiency drops. In the author's experiments each subspace contains 10K identities.
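A minimal sketch of this subspace construction, assuming one mean embedding per identity from a pre-trained network and scikit-learn's KMeans (the helper name and library choice are mine, not the paper's):

```python
import numpy as np
from sklearn.cluster import KMeans

def build_subspaces(embeddings, identity_ids, num_subspaces):
    """Group identities into M subspaces so that lookalike identities land together;
    batch OHNM is then run inside one subspace at a time.

    embeddings:   (N, D) features from a pre-trained face network
    identity_ids: (N,)   identity label of each embedding
    """
    # 1) one representation per identity: the mean embedding of that person
    ids = np.unique(identity_ids)
    id_reps = np.stack([embeddings[identity_ids == i].mean(axis=0) for i in ids])

    # 2) cluster the identity representations into M subspaces
    km = KMeans(n_clusters=num_subspaces, n_init=10).fit(id_reps)

    # 3) identities sharing a cluster form one subspace
    return {m: ids[km.labels_ == m].tolist() for m in range(num_subspaces)}
```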
Summary: first cluster the whole sample space into subspaces, then apply batch OHNM iteratively within each subspace.
4. Summary and thoughts
Comparing the three methods: suppose we have face data for 100,000 different people, two of whom, A and B, look alike; we naturally hope these two faces can be combined into the same triplets. In the OHNM method, triplets are selected at random from all samples, so the probability of putting A and B into one triplet is minimal. In the batch OHNM method, only the matching pairs are selected at random from all samples, and the negative is then chosen within the batch according to distance; if B happens to be in the batch it can be found, which is more flexible than method one. Method three first clusters the whole sample set, so similar people are necessarily grouped into one class; isn't mining within such a similar subspace easier? A and B are bound to fall into the same subspace, which consists of similar faces, so the mining is far more efficient.
We can see that method two reduces the search space of method one, and the author's method further reduces the search space of method two, getting closer to the source of the problem at each step; this way of tracing a problem back to its root is worth learning.
Paper: How to Train Triplet Networks with 100K Identities?