Andrej Karpathy (AK) compares his open-source NeuralTalk and NeuralTalk2 projects to this model and concedes that "the Google release should work significantly better as a result of better CNN, some tricks, and more careful engineering." So today let's look at the NIC (Neural Image Caption) model and see what exactly makes it better.
Project code: im2txt.
Overall, the two are not very different: both are end to end, with a CNN extracting image features and an RNN generating the sentence. The differences show up in the details:
1. NIC borrows the encoder-decoder model from sequence-to-sequence machine translation: the encoder is a CNN, the decoder is an RNN. The point is that the CNN and RNN here are quite different from NeuralTalk's. NIC uses a better feature extractor (GoogLeNet in the 2015 paper, a Batch Norm network in the 2016 version), which enriches the image information obtained, and a more complex LSTM, with more layers and more cells in the 2016 version, making the decoder more powerful and the results better. A structural sketch follows.
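To make the layout concrete, here is a minimal PyTorch sketch of the encoder-decoder structure. This is my own illustration, not the official code: im2txt itself is written in TensorFlow, and I use a torchvision ResNet as a stand-in for the GoogLeNet/Inception encoder; all class and parameter names here are hypothetical.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class EncoderCNN(nn.Module):
    """Encoder: a pretrained CNN mapped to a fixed-size image embedding."""
    def __init__(self, embed_size):
        super().__init__()
        cnn = models.resnet50(weights="DEFAULT")  # stand-in for GoogLeNet/Inception
        self.backbone = nn.Sequential(*list(cnn.children())[:-1])  # drop classifier
        self.proj = nn.Linear(2048, embed_size)   # project to the word-embedding size

    def forward(self, images):                    # images: (B, 3, H, W)
        feats = self.backbone(images).flatten(1)  # (B, 2048)
        return self.proj(feats)                   # (B, embed_size)

class DecoderRNN(nn.Module):
    """Decoder: an LSTM that emits the caption one word at a time."""
    def __init__(self, embed_size, hidden_size, vocab_size, num_layers=1):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, image_emb, captions):       # captions: (B, T) word ids
        # The image embedding occupies the first time step; words follow.
        inputs = torch.cat([image_emb.unsqueeze(1), self.embed(captions)], dim=1)
        hiddens, _ = self.lstm(inputs)            # (B, T+1, hidden_size)
        return self.out(hiddens)                  # word logits at every step
```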
2. How the image feature is fed in. NeuralTalk feeds the extracted feature, combined with the other inputs as a bias term, directly into the first RNN cell, which feels a bit hasty. NIC instead devotes the entire first time step to the image feature and makes no prediction there, which acts as a warm-up. The paper also mentions that, in the authors' experience, feeding the image in at every time step did not work well, so it is fed in only once; both schemes are sketched below.
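A sketch of the two input schemes, reusing the decoder pieces from above. The helper functions are hypothetical, and the per-step variant is shown only to illustrate what the authors rejected:

```python
import torch

# NIC scheme: the image embedding is fed once, at t = 0 only. The output at
# that step is discarded -- it exists purely to warm up the LSTM state.
def nic_decode(lstm, embed, out, image_emb, word_ids):
    inputs = torch.cat([image_emb.unsqueeze(1), embed(word_ids)], dim=1)
    hiddens, _ = lstm(inputs)
    return out(hiddens[:, 1:])  # skip step 0: no word is predicted there

# Rejected variant: the image embedding is concatenated to the input at every
# step. This needs an LSTM with a wider input, and the authors report it
# performed worse.
def per_step_decode(lstm, embed, out, image_emb, word_ids):
    words = embed(word_ids)                                   # (B, T, E)
    img = image_emb.unsqueeze(1).expand(-1, words.size(1), -1)
    hiddens, _ = lstm(torch.cat([words, img], dim=-1))        # (B, T, 2E) input
    return out(hiddens)
```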
3. Other details:
NIC takes a pretrained CNN and keeps it fixed, training only the LSTM part. Once the model has stabilized, the whole model is fine-tuned end to end on the COCO dataset. The advantage is that, because COCO's training set contains many color-description words, the new model keeps the prediction stability of the original model while making the captions more specific and accurate. A sketch of this schedule follows.
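A minimal sketch of that two-stage schedule, assuming the EncoderCNN/DecoderRNN classes from the earlier sketch; the optimizer choice and learning rates are my own placeholders, not the paper's values:

```python
import itertools
import torch

encoder = EncoderCNN(embed_size=512)
decoder = DecoderRNN(embed_size=512, hidden_size=512, vocab_size=10000)

# Stage 1: freeze the pretrained CNN backbone; train only the new
# projection layer and the LSTM decoder.
for p in encoder.backbone.parameters():
    p.requires_grad = False
optimizer = torch.optim.SGD(
    [p for p in itertools.chain(encoder.parameters(), decoder.parameters())
     if p.requires_grad], lr=0.1)
# ... train until the loss stabilizes ...

# Stage 2: unfreeze everything and fine-tune end to end on COCO, usually
# with a much smaller learning rate so the pretrained CNN is not destroyed.
for p in encoder.backbone.parameters():
    p.requires_grad = True
optimizer = torch.optim.SGD(
    itertools.chain(encoder.parameters(), decoder.parameters()), lr=1e-3)
```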
The LSTM is randomly initialized; an ensemble of models is used for prediction; dropout further prevents overfitting and improves generalization; beam search explores more candidate captions (a toy version is sketched below); scheduled sampling is used during training; and so on. For more details, please refer to the original paper.
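Of those tricks, beam search is worth spelling out. Below is a deliberately small toy version; the `step` function, which maps the last word id and LSTM state to log-probabilities over the vocabulary and a new state, is a hypothetical stand-in for one decoder step:

```python
import heapq
import torch

def beam_search(step, init_state, start_id, end_id, beam_size=3, max_len=20):
    """Keep the beam_size most probable partial captions at every step."""
    beams = [(0.0, [start_id], init_state)]  # (log-prob, word ids, LSTM state)
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, seq, state in beams:
            if seq[-1] == end_id:            # this caption has already ended
                finished.append((score, seq))
                continue
            log_probs, new_state = step(seq[-1], state)   # (vocab,) tensor
            top = torch.topk(log_probs, beam_size)
            for lp, wid in zip(top.values.tolist(), top.indices.tolist()):
                candidates.append((score + lp, seq + [wid], new_state))
        if not candidates:                   # every beam has finished
            break
        beams = heapq.nlargest(beam_size, candidates, key=lambda c: c[0])
    finished += [(s, seq) for s, seq, _ in beams if seq[-1] != end_id]
    return max(finished, key=lambda f: f[0])[1]
```

A real implementation would usually length-normalize the scores so that short captions are not unfairly favored; this toy version omits that.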
Overall, the whole paper keeps griping about the overfitting caused by the small training set. And rightly so: a major challenge facing image captioning is that the generated captions are always ones that appeared in the training set, and the details are never quite in place. The paper says that with a larger training set, the overfitting problem would be alleviated and results would improve. But I personally think that simply relying on a larger training set is not a long-term solution; finding more general rules from smaller datasets offers more room for long-term development. We might try other means to address overfitting (e.g., batch normalization over the LSTM), or improve the robustness of the model through structural adjustments.