Having covered the network structure and parameter-learning algorithms for recurrent neural networks (RNNs) and Long Short-Term Memory (LSTM) networks, this article lists some applications of RNNs and LSTMs. The input/output of an RNN (LSTM) can take the following forms: 1) both input and output are sequences; 2) the input is a sequence and the output is a single label; 3) the input is a single sample and the output is a sequence. This article looks at some applications of RNNs (LSTMs) to see how they are used in various machine learning tasks. Before explaining the specific tasks, take a look at some general RNN (LSTM) structures, each of which is described in detail in a paper:
The blue nodes in the figure are inputs, the red nodes are outputs, and the green nodes are hidden nodes or memory blocks. (a) is the traditional MLP network: the data points are treated as independent, with no regard for their temporal order; (b) maps an input sequence to a fixed-length vector (a category label) and can be used for text or video classification; (c) takes a single data point as input and produces a sequence as output, the typical example being image captioning; (d) is a sequence-to-sequence structure, often used for machine translation, where the two sequence lengths are not necessarily equal; (e) is a generative model of text in which each step predicts the character of the next time step.
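As a concrete illustration of structure (b), here is a minimal PyTorch sketch that maps an input sequence to a single class label; the class name `SequenceClassifier` and all sizes (vocabulary, embedding, hidden, number of classes) are illustrative assumptions, not taken from any of the cited papers:

```python
import torch
import torch.nn as nn

class SequenceClassifier(nn.Module):
    """Structure (b): a sequence goes in, a single label comes out."""
    def __init__(self, vocab_size=10000, embed_dim=128, hidden_dim=256, num_classes=5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)        # word ids -> vectors
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, num_classes)           # fixed-length summary -> label scores

    def forward(self, tokens):                 # tokens: (batch, seq_len) of word ids
        _, (h_n, _) = self.lstm(self.embed(tokens))
        return self.out(h_n[-1])               # final hidden state summarizes the whole sequence

logits = SequenceClassifier()(torch.randint(0, 10000, (4, 20)))  # -> (4, num_classes)
```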
Machine translation
Machine translation has long been a difficult NLP task. The first difficulty lies in how to represent text. In NLP tasks, if a word-level sequence is processed and the output layer of the RNN is a softmax, the output at each time step is a vector $y^t \in \mathbb{R}^K$, where $K$ is the vocabulary size and the $k$-th component of $y^t$ is the probability of generating the $k$-th word. The input at each time step is also a word: one-hot (bag-of-words) encodings used to be common, while distributed representations (GloVe, word2vec) are now more usual. If the RNN processes character-level sequences, one-hot encoding is the usual choice. Text classification often obtains good results with the one-hot approach (which ignores word order), but word order is very important in translation: for example, "scientist killed by raging virus" and "virus killed by raging scientist" have identical bag-of-words representations, so machine translation must take word order into account. Sutskever et al. [2014] proposed a machine translation model built from two LSTMs that performs very well on the task of translating English into French: the first LSTM encodes the input English sentence (phrase), and the second decodes the French sentence (phrase). The model looks like this:
LSTM1 is the encoding LSTM and LSTM2 is the decoding LSTM; the blue and purple dots are inputs, and the pink nodes are outputs:
1) An end-of-sentence flag is appended to the source sentence $x^t$ (in practice any symbol that does not appear in the sentences can be used). One word is fed into the encoding LSTM at each time step, and the network produces no output during this phase.
2) When the end of $x^t$ is reached (<EOS> in the figure), the target sentence begins to be fed into the decoding LSTM; at the <EOS> step the output corresponds to "J'ai". The decoding LSTM takes the encoding LSTM's output state as its input (the LSTM1 $\rightarrow$ LSTM2 junction in the figure), and at every time step the decoding LSTM outputs a softmax layer of the same size as the vocabulary, whose components give the probability of each word (see the code sketch after this list).
3) At inference time, beam search is used at every time step to find the most likely words, until an <EOS> is output to mark the end of the sentence.
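As a rough illustration of this encoder-decoder setup (not the authors' implementation, which used deep multi-layer LSTMs and the large vocabularies mentioned below), the PyTorch sketch wires an encoding LSTM and a decoding LSTM together in the way just described; the name `EncoderDecoder`, the layer sizes, and the small vocabularies are illustrative assumptions:

```python
import torch
import torch.nn as nn

class EncoderDecoder(nn.Module):
    """Sketch of the two-LSTM translation model: LSTM1 encodes, LSTM2 decodes."""
    def __init__(self, src_vocab, tgt_vocab, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, embed_dim)
        self.tgt_embed = nn.Embedding(tgt_vocab, embed_dim)
        self.encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)  # LSTM1
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)  # LSTM2
        self.proj = nn.Linear(hidden_dim, tgt_vocab)   # softmax-sized layer over the target vocabulary

    def forward(self, src, tgt_in):
        # Encode: the source words produce no output, only a final (h, c) state.
        _, state = self.encoder(self.src_embed(src))
        # Decode: LSTM2 starts from the encoder state and sees the target words (teacher forcing).
        dec_out, _ = self.decoder(self.tgt_embed(tgt_in), state)
        return self.proj(dec_out)                      # (batch, tgt_len, tgt_vocab) scores

model = EncoderDecoder(src_vocab=8000, tgt_vocab=6000)  # the paper used 160k / 80k word vocabularies
src = torch.randint(0, 8000, (2, 7))                    # batch of source sentences (word ids)
tgt_in = torch.randint(0, 6000, (2, 9))                 # target sentences shifted right (teacher forcing)
logits = model(src, tgt_in)                             # -> (2, 9, 6000)
```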
During the training phase, the source sentences are fed into the encoder and the translated sentences into the decoder; the loss is computed on the decoder's outputs, and the whole model is trained by maximizing the probability of the reference translations in the training set. At inference time a left-to-right beam search is run to find the best sequence. The original authors trained with SGD, halving the learning rate every half epoch after the first 5 epochs, and the BLEU score (used to measure translation quality) reached the state of the art. The model itself took about 10 days to train on 8 GPUs. Each LSTM layer contains 1000 memory cells, the English vocabulary is 160,000 words, the French vocabulary is 80,000 words, and the weights are initialized from a uniform distribution on (-0.08, 0.08).
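The left-to-right beam search used at inference time can be sketched independently of the network. In the sketch below, `step_fn` stands in for one decoder step that returns candidate next words with their log-probabilities, and the beam size and start/<EOS> token ids are illustrative choices, not values from the paper:

```python
def beam_search(step_fn, start_token, eos_token, beam_size=4, max_len=50):
    """step_fn(prefix) -> list of (token, log_prob) candidates for the next word."""
    beams = [([start_token], 0.0)]               # (partial sequence, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for tok, logp in step_fn(seq):       # expand each hypothesis with every candidate word
                candidates.append((seq + [tok], score + logp))
        # Keep only the beam_size highest-scoring partial translations.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_size]:
            (finished if seq[-1] == eos_token else beams).append((seq, score))
        if not beams:                            # every surviving hypothesis has emitted <EOS>
            break
    return max(finished + beams, key=lambda c: c[1])[0]
```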
Image captioning
Some recent work has focused on describing an image in natural language ([Vinyals et al., 2015; Karpathy and Fei-Fei, 2014; Mao et al., 2014]).
Image captioning is a supervised learning task whose input is an image $x$ and whose output is a sentence $y$ describing the image. The model of Vinyals et al. [2015] is similar to the machine translation model above, except that the encoder is replaced by a convolutional neural network while the decoder is still an LSTM. Karpathy and Fei-Fei [2014] use a CNN with an attention-style alignment mechanism to encode the image and a standard RNN to decode the image description, with word vectors produced by word2vec. Their model produces a caption for the whole picture and also a correspondence between image regions and text fragments.
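For concreteness, here is a rough PyTorch sketch of the CNN-encoder / LSTM-decoder idea; a torchvision ResNet-18 stands in for the papers' CNN, and the name `CaptionModel` and all layer sizes are illustrative assumptions rather than the published architectures:

```python
import torch
import torch.nn as nn
from torchvision import models

class CaptionModel(nn.Module):
    """A CNN encodes the image; an LSTM decodes the caption word by word."""
    def __init__(self, vocab_size=10000, embed_dim=256, hidden_dim=512):
        super().__init__()
        cnn = models.resnet18()                               # random weights here; the papers use an
        self.encoder = nn.Sequential(*list(cnn.children())[:-1])  # ImageNet-pretrained CNN, head removed
        self.img_proj = nn.Linear(512, embed_dim)             # image feature -> first decoder input
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):                      # images: (B,3,224,224), captions: (B,T)
        feats = self.encoder(images).flatten(1)                # (B, 512) global image feature
        first = self.img_proj(feats).unsqueeze(1)              # image feature acts as the first "word"
        inputs = torch.cat([first, self.embed(captions)], dim=1)
        h, _ = self.decoder(inputs)
        return self.out(h)                                     # vocabulary scores at every step
```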
At inference time the process is similar to machine translation: the caption is decoded one word at a time, and the best word at each step is fed back in as the input for the next time step, until <EOS> is reached. Karpathy and Fei-Fei test on three datasets: Flickr8K, Flickr30K, and COCO (roughly 50 MB / 8,000 images, 200 MB / 30,000 images, and 750 MB / 328,000 images, respectively), and the CNN used by the encoder is trained on ImageNet.
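The decode-one-word-at-a-time loop described above can be written as a simple greedy procedure (a minimal sketch; `step_fn`, `start_token`, and `eos_token` are placeholders for one decoder step and the special-token ids, not part of the papers' code):

```python
def greedy_decode(step_fn, start_token, eos_token, max_len=30):
    """step_fn(prefix) -> scores over the vocabulary for the next word."""
    seq = [start_token]
    for _ in range(max_len):
        scores = step_fn(seq)                                  # e.g. the decoder's softmax at this step
        best = max(range(len(scores)), key=scores.__getitem__) # pick the most likely word
        seq.append(best)
        if best == eos_token:                                  # stop once <EOS> is generated
            break
    return seq[1:]
```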
There are other tasks as well, such as handwriting recognition, where bidirectional LSTMs achieved state-of-the-art results: an HMM reaches an accuracy of 70.1%, while a bidirectional LSTM reaches 81.5%. In recent years people have carried these successful methods over to unsupervised video encoding [Srivastava et al., 2015], video captioning [Venugopalan et al., 2015], and even program execution [Zaremba and Sutskever]. For video captioning, the authors use a CNN to extract per-frame features, feed them into an LSTM to encode the video, and then decode to generate the corresponding words.
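As a small illustration of the bidirectional setup used in such sequence-labeling tasks, the sketch below emits a label at every time step by concatenating the forward and backward hidden states; the name `BiLSTMTagger` and all sizes are illustrative, not taken from the cited systems:

```python
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    """Bidirectional LSTM that emits a score vector at every time step (e.g. per-frame labels)."""
    def __init__(self, feat_dim=32, hidden_dim=128, num_labels=80):
        super().__init__()
        self.bilstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, num_labels)   # forward + backward states

    def forward(self, frames):                  # frames: (batch, time, feat_dim)
        h, _ = self.bilstm(frames)              # (batch, time, 2*hidden_dim)
        return self.out(h)

scores = BiLSTMTagger()(torch.randn(2, 100, 32))   # -> (2, 100, num_labels)
```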