Paper notes: Contrastive Learning for Image Captioning


Original link: Contrastive Learning for Image Captioning

Introduction

The Contrastive Learning (CL) method proposed in this paper mainly addresses the lack of distinctiveness in the captions generated by image captioning models.

The distinctiveness here can be understood as uniqueness: different images should receive captions that are unique and easy to distinguish, i.e., among all images, a caption should be the best match for its own image. However, most current models generate rather rigid captions; for images of the same class in particular, the generated captions are very similar and fail to describe the other aspects in which these images differ.

Empirical Study

This paper presents a self-retrieval study to demonstrate the lack of distinctiveness. The authors randomly selected 5000 images $I_1, \dots, I_{5000}$ from the MSCOCO test set and used the trained Neuraltalk2 and AdaptiveAttention models to generate the corresponding 5000 captions $c_1, \dots, c_{5000}$. Denoting the model by $p_m(\cdot\,; \theta)$, for each caption $c_t$ they compute its conditional probability for every image, $p_m(c_t \mid I_1), \dots, p_m(c_t \mid I_{5000})$, sort these probabilities, and check whether the caption's original image falls in the top-$k$ of the sorted results. The results are shown in the figure below.
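As a rough sketch of this self-retrieval check (assuming a hypothetical `model_log_prob(caption, image)` function that returns the model's log conditional probability of a caption given an image):

```python
import numpy as np

def self_retrieval_topk_accuracy(model_log_prob, captions, images, k=1):
    """For each generated caption, rank all images by the model's conditional
    probability and check whether the caption's own image lands in the top-k."""
    hits = 0
    for t, caption in enumerate(captions):
        # log p_m(c_t | I_j) for every image I_j
        scores = np.array([model_log_prob(caption, image) for image in images])
        # indices of the k images the model considers most likely
        topk = np.argsort(scores)[::-1][:k]
        hits += int(t in topk)
    return hits / len(captions)
```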

After adding CL to training, the self-retrieval accuracy of the model improves significantly, the ROUGE-L and CIDEr scores increase as well, and the accuracy is positively correlated with both evaluation metrics. This indicates that improving distinctiveness can improve the overall performance of the model.

Contrastive Learning

To introduce the usual training objective, maximum likelihood estimation (MLE), here is a figure from the Show and Tell paper:

After an image is fed in, the model outputs at each step the probability $p_t(S_t)$ of the next target word. We want to maximize these probabilities, so the training goal is to minimize $L(I, S) = -\sum_{t=1}^{N} \log p_t(S_t)$.
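A minimal sketch of this objective for a single caption, assuming `step_probs[t]` holds the model's softmax output over the vocabulary at step $t$ (names are illustrative, not from the paper):

```python
import numpy as np

def mle_loss(step_probs, target_ids):
    """L(I, S) = -sum_t log p_t(S_t): summed negative log-probability of the
    ground-truth words.

    step_probs : array of shape (N, vocab_size), per-step softmax outputs
    target_ids : length-N sequence of ground-truth word indices S_1..S_N
    """
    log_probs = [np.log(step_probs[t][w]) for t, w in enumerate(target_ids)]
    return -float(np.sum(log_probs))
```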

Why MLE training leads to a lack of distinctiveness was explained by the authors in their earlier paper, Towards Diverse and Natural Image Descriptions via a Conditional GAN, which is worth reading.

The central idea of CL is to use a reference model (e.g., a state-of-the-art model; this paper uses Neuraltalk2 and AdaptiveAttention as examples) as a baseline, and to improve distinctiveness on top of it while preserving the quality of the generated captions. The reference model is kept fixed during training.

CL also requires positive and negative samples as input. Both are image/ground-truth-caption pairs: in a positive sample the caption matches the image, while in a negative sample the image is the same as in the positive sample but the caption describes a different image.
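For intuition, a tiny sketch of how such mismatched (negative) pairs could be sampled (purely illustrative; the paper only specifies that the caption comes from a different image):

```python
import random

def sample_negative_pairs(images, gt_captions):
    """Pair each image with a ground-truth caption of some *other* image."""
    negatives = []
    n = len(images)
    for i, image in enumerate(images):
        j = random.choice([idx for idx in range(n) if idx != i])
        negatives.append((gt_captions[j], image))  # mismatched pair (c_/, I)
    return negatives
```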

Notation:
Target model: $p_m(\cdot\,; \theta)$
Reference model: $p_n(\cdot\,; \phi)$
Positive samples (ground-truth pairs): $X = \{(c_1, I_1), \dots, (c_{T_m}, I_{T_m})\}$
Negative samples (mismatched pairs): $Y = \{(c_{/1}, I_1), \dots, (c_{/T_n}, I_{T_n})\}$

Both the target model and the reference model give an estimated conditional probability for every sample, $p_m(c \mid I, \theta)$ and $p_n(c \mid I, \phi)$ respectively.
(The $p_m(c \mid I, \theta)$ here should be obtained by feeding in the image, then feeding in the caption words $S_0, \dots, S_{N-1}$ one by one, and multiplying the resulting next-word probabilities $p_1(S_1), \dots, p_N(S_N)$; this is clearer when viewed together with the figure above.) We would like $p_m(c \mid I, \theta) > p_n(c \mid I, \phi)$ for all positive samples, and $p_m(c_/ \mid I, \theta) < p_n(c_/ \mid I, \phi)$ for all negative samples. In other words, the target model should assign a higher conditional probability than the reference model to a positive sample, and a lower one to a negative sample.
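A rough sketch of that interpretation, assuming a hypothetical `next_word_probs(image, prefix)` method that returns the model's next-word distribution (not an API from the paper):

```python
import numpy as np

BOS_ID = 0  # assumed start-of-sentence token id

def caption_log_prob(model, image, caption_ids):
    """ln p(c | I): feed the image, then the previous words S_0..S_{N-1},
    and accumulate the log-probability of each ground-truth next word."""
    log_prob = 0.0
    prefix = [BOS_ID]
    for word_id in caption_ids:
        probs = model.next_word_probs(image, prefix)  # hypothetical API
        log_prob += np.log(probs[word_id])
        prefix.append(word_id)
    return log_prob
```

The CL requirement is then simply that this quantity be larger under the target model than under the reference model for matched pairs, and smaller for mismatched pairs.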

Define the difference between $p_m(c \mid I, \theta)$ and $p_n(c \mid I, \phi)$ as $D((c, I); \theta, \phi) = p_m(c \mid I, \theta) - p_n(c \mid I, \phi)$.

The loss function is then $L'(\theta; X, Y, \phi) = \sum_{t=1}^{T_m} D((c_t, I_t); \theta, \phi) - \sum_{t=1}^{T_n} D((c_{/t}, I_t); \theta, \phi)$.

This is an objective to be maximized.

In practice, however, there are a few problems:

First, $p_m(c \mid I, \theta)$ and $p_n(c \mid I, \phi)$ are very small (around 1e-8) and may cause numerical problems. Therefore, the logarithm of each is taken, and $G((c, I); \theta, \phi) = \ln p_m(c \mid I, \theta) - \ln p_n(c \mid I, \phi)$ is used in place of $D((c, I); \theta, \phi)$.

Second, because the negative samples are randomly sampled, the values of $G((c, I); \theta, \phi)$ vary from sample to sample: some may be much larger than 0 and some smaller, and updating the smaller ones is more effective for maximizing the loss. The authors therefore use a logistic function (in fact a sigmoid) $r_\nu(z) = \frac{1}{1 + \nu \exp(-z)}$ to saturate these effects, where $\nu = T_n / T_m$ and $T_n = T_m$ to balance the numbers of positive and negative samples. The difference term thus becomes:

$h((c, I); \theta, \phi) = r_\nu\big(G((c, I); \theta, \phi)\big)$

Because $h((c, I); \theta, \phi) \in (0, 1)$, the loss function becomes

$L(\theta; X, Y, \phi) = \sum_{t=1}^{T_m} \ln\big[h((c_t, I_t); \theta, \phi)\big] + \sum_{t=1}^{T_n} \ln\big[1 - h((c_{/t}, I_t); \theta, \phi)\big]$

The first term maintains the probability of ground-truth pairs, while the second suppresses the probability of mismatched pairs and forces the model to learn distinctiveness.
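A minimal sketch of this loss under the definitions above (the per-sample log-probabilities are assumed to be precomputed, e.g. with something like `caption_log_prob`; function and argument names are illustrative):

```python
import numpy as np

def cl_loss(log_pm_pos, log_pn_pos, log_pm_neg, log_pn_neg):
    """L(theta; X, Y, phi) = sum_t ln h(pos_t) + sum_t ln(1 - h(neg_t)),
    where h = r_nu(G) and G = ln p_m - ln p_n.  To be maximized.

    Each argument is an array of per-sample caption log-probabilities under
    the target model (pm) or the fixed reference model (pn).
    """
    tm, tn = len(log_pm_pos), len(log_pm_neg)
    nu = tn / tm  # nu = T_n / T_m balances positive and negative samples

    def h(g):
        # r_nu(z) = 1 / (1 + nu * exp(-z)), a (scaled) sigmoid
        return 1.0 / (1.0 + nu * np.exp(-g))

    g_pos = np.asarray(log_pm_pos) - np.asarray(log_pn_pos)
    g_neg = np.asarray(log_pm_neg) - np.asarray(log_pn_neg)
    return np.sum(np.log(h(g_pos))) + np.sum(np.log(1.0 - h(g_neg)))
```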

In addition, the paper replicates $X$ $K$ times so that it is paired with $K$ different sets of negative samples $Y_k$, which helps prevent overfitting; the paper chooses $K = 5$.
The final loss function is $J(\theta) = \frac{1}{K} \frac{1}{T_m} \sum_{k=1}^{K} L(\theta; X, Y_k, \phi)$.
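A sketch of how this final objective could be assembled from the hypothetical `cl_loss` above, averaging over the $K$ sampled negative sets:

```python
def cl_objective(log_pm_pos, log_pn_pos, neg_sets):
    """J(theta) = (1 / (K * T_m)) * sum_k L(theta; X, Y_k, phi).

    neg_sets : list of K pairs (log_pm_neg, log_pn_neg), one pair per
    randomly sampled negative set Y_k.
    """
    K, Tm = len(neg_sets), len(log_pm_pos)
    total = sum(cl_loss(log_pm_pos, log_pn_pos, log_pm_neg, log_pn_neg)
                for log_pm_neg, log_pn_neg in neg_sets)
    return total / (K * Tm)
```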

These transformations are mainly inspired by noise contrastive estimation (NCE).

Ideally, $J(\theta)$ reaches its upper bound of 0 when the positive and negative samples are perfectly distinguishable, i.e., when the target model gives positive samples $p(c_t \mid I_t)$ a high probability and negative samples $p(c_{/t} \mid I_t)$ a low probability. Then

$G((c_t, I_t); \theta, \phi) \to \infty$, $G((c_{/t}, I_t); \theta, \phi) \to -\infty$,

$h((c_t, I_t); \theta, \phi) \to 1$, $h((c_{/t}, I_t); \theta, \phi) \to 0$,
and $J(\theta)$ approaches its upper bound of 0.

In reality, however, even when the target model assigns a positive sample the maximum probability of 1, I think $G((c_t, I_t); \theta, \phi)$ should equal $-\ln p_n(c_t \mid I_t, \phi)$, which is finite, so $h((c_t, I_t); \theta, \phi) < 1$ and the attainable upper bound of $J(\theta)$ should be less than 0.

Experimental Results


As shown above, the performance of the model improves greatly after adding CL.


The figure above shows some visualization results from CL and the original model.


The paper also contrasts CL with GAN and IL (Introspective Learning). IL uses the target model itself as the reference, and learns by comparing $(I, c)$ with $(I_/, c)$.
The negative samples $(I_/, c)$ in IL are usually predefined and fixed, while CL's negative samples are dynamically sampled. The evaluator in a GAN directly measures distinctiveness, but cannot guarantee accuracy.

In addition, the accuracy of the model drops after adding IL or GAN, indicating that these methods sacrifice accuracy to improve distinctiveness, whereas CL improves distinctiveness while maintaining accuracy.

The table above also compares training with only positive or only negative samples. With only positive samples, the model's performance improves only slightly. I think this is because the reference model is fixed, so the probability it assigns to each positive sample is constant, and without the random sampling of negative samples the set of samples is also fixed; the loss function with the negative-sample term removed is therefore equivalent to the MLE loss minus a constant, i.e., equivalent to MLE, so this amounts to simply training the original model a little longer. With only negative samples, the model's performance drops significantly (since no positive samples are specified and the negative samples are randomly sampled).

Only when both kinds of samples participate in training does the model get a large boost.


The table above tests the generalization ability of CL. "-" indicates that the model is trained with MLE. It can be seen that choosing a better model (AA) as the reference gives NT a larger improvement. (But it still does not exceed AA itself, so arguably it does not become better than the reference model.)


In addition, the lower bound of the model can be raised by periodically substituting the trained target model as a better reference model. However, by the second replacement (Run 2), the improvement is already small, which suggests there is no need to replace it many times.

Summary

In general, the main contribution of this paper is to propose contrastive learning for image captioning, constructing a loss function that brings negative samples into training and thereby enhances the model's distinctiveness. In addition, the self-retrieval experiment presented in this paper is quite distinctive among similar papers.
