This chapter is split into two parts; this is the second part.
Chapter 14: Recurrent Neural Networks (Part I)
Chapter 14: Recurrent Neural Networks (Part II)
14.4 Deep RNNs
Stacking multiple layers of cells is quite common; the result is called a deep RNN, as shown in Figure 14-12.
Figure 14-12 A deep RNN (left), unrolled through time (right)
To implement a deep RNN in TensorFlow, you create multiple cells and stack them into a MultiRNNCell. The following code creates three identical cells (you could also create three cells with different numbers of neurons, as shown in the sketch after this block):
n_neurons = 100
n_layers = 3

basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)
multi_layer_cell = tf.contrib.rnn.MultiRNNCell([basic_cell] * n_layers)
outputs, states = tf.nn.dynamic_rnn(multi_layer_cell, X, dtype=tf.float32)
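As noted above, each layer can also get its own cell, for example with a different number of neurons. Here is a minimal sketch, with purely illustrative layer sizes, that builds one independent cell object per layer (depending on your TensorFlow version, reusing a single cell object for every layer may end up sharing its weights across layers, so separate objects are generally safer):

import tensorflow as tf

n_steps, n_inputs = 28, 28   # illustrative sizes; X is the input placeholder used earlier
X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])

# One independent cell per layer; the layer sizes are just an example.
layer_sizes = [100, 100, 50]
cells = [tf.contrib.rnn.BasicRNNCell(num_units=size) for size in layer_sizes]
multi_layer_cell = tf.contrib.rnn.MultiRNNCell(cells)
outputs, states = tf.nn.dynamic_rnn(multi_layer_cell, X, dtype=tf.float32)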
14.4.1 Distributing a Deep RNN Across Multiple GPUs
This section is skipped for now.
14.4.2 Applying Dropout
If you build a very deep RNN, it may overfit the training set. A common technique to prevent that is dropout (introduced in Chapter 11). You can simply add a dropout layer before or after the RNN as usual, but if you want to apply dropout between the RNN layers, you need a DropoutWrapper. The following code applies dropout to the inputs of each layer of the RNN, dropping each input with a 50% probability:
keep_prob = 0.5

cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)
cell_drop = tf.contrib.rnn.DropoutWrapper(cell, input_keep_prob=keep_prob)
multi_layer_cell = tf.contrib.rnn.MultiRNNCell([cell_drop] * n_layers)
rnn_outputs, states = tf.nn.dynamic_rnn(multi_layer_cell, X, dtype=tf.float32)
If you want to apply dropout to the outputs instead, set output_keep_prob.
There is one big problem with the code above: it applies dropout both during training and during testing (recall from Chapter 11 that dropout should only be used during training). Unfortunately, DropoutWrapper does not (yet) support an is_training placeholder, so you must either write your own dropout wrapper or build two separate graphs (one for training, one for testing).
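One possible workaround, assuming your TensorFlow version lets DropoutWrapper accept a tensor for its keep probabilities, is to make the keep probability a placeholder that defaults to 1.0 (no dropout) and only feed a smaller value during training. This is only a sketch, not the book's implementation:

import tensorflow as tf

n_steps, n_inputs, n_neurons, n_layers = 28, 28, 100, 3   # illustrative sizes
X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])

# keep_prob defaults to 1.0, so nothing special is needed at test time;
# during training, feed 0.5 (or any other value) through the placeholder.
keep_prob = tf.placeholder_with_default(1.0, shape=())
cells = [tf.contrib.rnn.BasicRNNCell(num_units=n_neurons) for _ in range(n_layers)]
cells_drop = [tf.contrib.rnn.DropoutWrapper(cell, input_keep_prob=keep_prob)
              for cell in cells]
multi_layer_cell = tf.contrib.rnn.MultiRNNCell(cells_drop)
rnn_outputs, states = tf.nn.dynamic_rnn(multi_layer_cell, X, dtype=tf.float32)

# Training step (hypothetical training_op): feed keep_prob explicitly:
#   sess.run(training_op, feed_dict={X: X_batch, keep_prob: 0.5})
# Testing: omit keep_prob so it keeps its default value of 1.0:
#   sess.run(rnn_outputs, feed_dict={X: X_test})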
14.4.3 The Difficulty of Training over Many Time Steps
To train an RNN on long sequences, you need to run it over many time steps, which makes the unrolled RNN a very deep network. Just like any deep network, it suffers from the vanishing (or exploding) gradients problem (Chapter 11). The tricks discussed earlier also work for deep RNNs: good parameter initialization, non-saturating activation functions (such as ReLU), Batch Normalization, Gradient Clipping, and faster optimizers. However, if an RNN has to handle even moderately long sequences (say, 100 time steps), training becomes extremely slow.
The simplest and most common solution is to unroll the RNN over only a limited number of time steps during training, a technique called truncated backpropagation through time. In TensorFlow you can implement it simply by truncating the input sequences. The problem, of course, is that the model cannot learn long-term patterns. One workaround is to make the shortened training sequences contain both recent and older data (for example, a sequence containing monthly data for the last five months, then weekly data for the last five weeks, then daily data for the last five days). But this workaround has its limits: what if fine-grained data from last year is actually important? What if there was an obvious event that must be taken into account (say, an election result) the year before last? A sketch of what this truncation can look like in code follows.
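A minimal sketch, using NumPy and purely illustrative data: instead of feeding one very long series, the training data is sliced into consecutive windows of n_steps values, so backpropagation never unrolls the RNN beyond n_steps time steps.

import numpy as np

series = np.random.rand(10000)   # illustrative long univariate time series
n_steps = 20                     # truncation length: BPTT unrolls only this far

def truncated_windows(series, n_steps):
    """Yield consecutive (input, target) windows of length n_steps."""
    for start in range(0, len(series) - n_steps - 1, n_steps):
        X_window = series[start : start + n_steps]
        y_window = series[start + 1 : start + n_steps + 1]  # next-value targets
        yield X_window.reshape(1, n_steps, 1), y_window.reshape(1, n_steps, 1)

# Each window would then be fed to the RNN's training op (names hypothetical):
# for X_batch, y_batch in truncated_windows(series, n_steps):
#     sess.run(training_op, feed_dict={X: X_batch, y: y_batch})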
Besides the long training time, a second problem faced by long-running RNNs is that the memory of the first inputs gradually fades away. Indeed, as data goes through an RNN, some information is lost at every time step. After a while, the RNN's state contains virtually no trace of the first inputs. This can be a showstopper. For example, suppose you are doing sentiment analysis on movie reviews. The first sentence is "I love this movie", but the rest of the review lists the many things that could have made the movie better. If the RNN forgets the first few words, it is likely to misinterpret the review. To solve this problem, various cells with long-term memory were introduced; the most famous of them is the LSTM cell.
14.5 LSTM Cell
The Long Short-Term Memory (LSTM) cell was proposed by Sepp Hochreiter and Jürgen Schmidhuber in 1997 and gradually improved over the years by several researchers, such as Alex Graves, Haşim Sak, and Wojciech Zaremba. If you think of the LSTM cell as a black box, it can be used much like a basic cell, except it performs much better: training converges more easily, and it is better at detecting long-term dependencies in the data. In TensorFlow, you can simply use a BasicLSTMCell instead of a BasicRNNCell:
lstm_cell = tf.contrib.rnn.BasicLSTMCell(num_units=n_neurons)
The LSTM cell manages two state vectors, which are kept separate by default for performance reasons. You can change this default behavior by setting state_is_tuple=False when creating a BasicLSTMCell.
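A brief sketch (with illustrative shapes) of what this looks like in practice: with the default state_is_tuple=True, the final state returned by dynamic_rnn is an LSTMStateTuple holding the two vectors.

import tensorflow as tf

n_steps, n_inputs, n_neurons = 28, 28, 100   # illustrative sizes
X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])

lstm_cell = tf.contrib.rnn.BasicLSTMCell(num_units=n_neurons)
outputs, states = tf.nn.dynamic_rnn(lstm_cell, X, dtype=tf.float32)

# states is an LSTMStateTuple: states.c is the long-term state c_(t) and
# states.h is the short-term state h_(t), which equals the last output
# outputs[:, -1, :] when no sequence_length is given.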
Figure 14-13 shows a basic LSTM cell:
Figure 14-13 LSTM Cell
If you don't look at what is inside the yellowish box, the LSTM cell looks exactly like a regular cell, except that its state is split into two vectors: $\textbf{h}_{(t)}$ and $\textbf{c}_{(t)}$ (c stands for cell). You can think of $\textbf{h}_{(t)}$ as the short-term state and $\textbf{c}_{(t)}$ as the long-term state.
Now let's look at the logic inside the box. The key idea is that the network can learn what to store in the long-term state, what to throw away, and what to read from it. As the long-term state $\textbf{c}_{(t-1)}$ traverses the network from left to right, it first goes through a forget gate, dropping some memories, and then new memories are added via the addition operation, where the added memories are filtered by the input gate. The result, $\textbf{c}_{(t)}$, is sent straight out without any further transformation. So, at each time step, some memories are dropped and some are added. Moreover, after the addition operation, the long-term state is copied, passed through the tanh function, and then filtered by the output gate to produce the short-term state $\textbf{h}_{(t)}$ (which is equal to this time step's output $\textbf{y}_{(t)}$). Now let's look at where the new memories come from and how the gates work.
First, the current input $\textbf{x}_{(t)}$ and the previous short-term state $\textbf{h}_{(t-1)}$ are fed to four different fully connected layers. These four layers all serve a different purpose:
- The most important layer is the one that outputs $\textbf{g}_{(t)}$. It has the same role as in a basic cell: analyzing the current input $\textbf{x}_{(t)}$ and the previous short-term state $\textbf{h}_{(t-1)}$. In a basic cell, that layer's output goes straight out as $\textbf{y}_{(t)}$ and $\textbf{h}_{(t)}$. In an LSTM cell, however, this layer's output is not sent straight out; instead, it is partially stored in the long-term state.
- The other three layers are gate controllers. They use the logistic activation function, so their outputs range from 0 to 1. Their outputs are fed to element-wise multiplication operations, so if they output 0s the gate is closed, and if they output 1s the gate is open. Specifically:
- The forget gate (controlled by $\textbf{f}_{(t)}$) determines which parts of the long-term state should be erased.
- The input gate (controlled by $\textbf{i}_{(t)}$) determines which parts of $\textbf{g}_{(t)}$ should be added to the long-term state.
- The output gate (controlled by $\textbf{o}_{(t)}$) determines which parts of the long-term state should be read and output at this time step.
The LSTM computations for a single instance are summarized below:
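These are the standard LSTM cell equations, written here with the same notation as the weight and bias definitions that follow ($\sigma$ is the logistic function and $\otimes$ denotes element-wise multiplication):

$$
\begin{aligned}
\textbf{i}_{(t)} &= \sigma(\textbf{W}_{xi}^T \cdot \textbf{x}_{(t)} + \textbf{W}_{hi}^T \cdot \textbf{h}_{(t-1)} + \textbf{b}_i)\\
\textbf{f}_{(t)} &= \sigma(\textbf{W}_{xf}^T \cdot \textbf{x}_{(t)} + \textbf{W}_{hf}^T \cdot \textbf{h}_{(t-1)} + \textbf{b}_f)\\
\textbf{o}_{(t)} &= \sigma(\textbf{W}_{xo}^T \cdot \textbf{x}_{(t)} + \textbf{W}_{ho}^T \cdot \textbf{h}_{(t-1)} + \textbf{b}_o)\\
\textbf{g}_{(t)} &= \tanh(\textbf{W}_{xg}^T \cdot \textbf{x}_{(t)} + \textbf{W}_{hg}^T \cdot \textbf{h}_{(t-1)} + \textbf{b}_g)\\
\textbf{c}_{(t)} &= \textbf{f}_{(t)} \otimes \textbf{c}_{(t-1)} + \textbf{i}_{(t)} \otimes \textbf{g}_{(t)}\\
\textbf{y}_{(t)} &= \textbf{h}_{(t)} = \textbf{o}_{(t)} \otimes \tanh(\textbf{c}_{(t)})
\end{aligned}
$$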
where:
- $W_{xi}, W_{xf}, W_{xo}, W_{xg}$ are the weight matrices of the four fully connected layers for their connection to the input vector $\textbf{x}_{(t)}$.
- $W_{hi}, W_{hf}, W_{ho}, W_{hg}$ are the weight matrices of the four fully connected layers for their connection to the previous short-term state $\textbf{h}_{(t-1)}$.
- $b_{i}, b_{f}, b_{o}, b_{g}$ are the bias terms of the four fully connected layers. TensorFlow initializes $b_{f}$ to a vector of 1s instead of 0s, which prevents anything from being forgotten at the beginning of training (the NumPy sketch below mirrors this initialization).
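To make the equations concrete, here is a small NumPy sketch of a single LSTM step for one instance. The function and variable names are purely illustrative; this is not TensorFlow's implementation.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step for a single instance, following the equations above.
    W holds the eight weight matrices (keys "xi", "hi", ..., "hg") and b the
    four bias vectors (keys "i", "f", "o", "g")."""
    i = sigmoid(W["xi"].T @ x_t + W["hi"].T @ h_prev + b["i"])   # input gate
    f = sigmoid(W["xf"].T @ x_t + W["hf"].T @ h_prev + b["f"])   # forget gate
    o = sigmoid(W["xo"].T @ x_t + W["ho"].T @ h_prev + b["o"])   # output gate
    g = np.tanh(W["xg"].T @ x_t + W["hg"].T @ h_prev + b["g"])   # main layer
    c = f * c_prev + i * g          # new long-term state
    h = o * np.tanh(c)              # new short-term state (= the output y)
    return h, c

# Tiny usage example with random parameters (3 inputs, 5 units).
n_inputs, n_units = 3, 5
rng = np.random.RandomState(42)
W = {k: rng.randn(n_inputs if k.startswith("x") else n_units, n_units)
     for k in ["xi", "xf", "xo", "xg", "hi", "hf", "ho", "hg"]}
b = {k: np.zeros(n_units) for k in ["i", "f", "o", "g"]}
b["f"] = np.ones(n_units)           # forget nothing at the start of training
h, c = lstm_step(rng.randn(n_inputs), np.zeros(n_units), np.zeros(n_units), W, b)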
14.5.1 Peephole Connections
In a basic LSTM cell, the gate controllers can only look at the current input $\textbf{x}_{(t)}$ and the previous short-term state $\textbf{h}_{(t-1)}$. It may be a good idea to also let them peek at the long-term state. This idea was proposed by Felix Gers and Jürgen Schmidhuber in 2000. They proposed an LSTM variant with extra connections called peephole connections: the previous long-term state $\textbf{c}_{(t-1)}$ is added as an input to the forget gate and input gate controllers, and the current long-term state $\textbf{c}_{(t)}$ is added as an input to the output gate controller.
To implement peephole connections in TensorFlow, use LSTMCell instead of BasicLSTMCell and set use_peepholes=True:
lstm_cell = tf.contrib.rnn.LSTMCell(num_units=n_neurons, use_peepholes=True)
There are many other LSTM cell variants; the most famous of them is the GRU cell.
14.6 GRU Cell
The Gated Recurrent Unit (GRU) cell was proposed in a 2014 paper, the same paper that introduced the Encoder-Decoder network we mentioned earlier.
Figure 14-14 GRU Cell
The GRU cell is a simplified version of the LSTM cell, yet it seems to perform just as well (a 2015 paper, "LSTM: A Search Space Odyssey", showed that all LSTM variants perform roughly the same). The main simplifications are:
- The two state vectors are merged into a single vector $\textbf{h}_{(t)}$.
- A single gate controller ($\textbf{z}_{(t)}$) controls both the forget gate and the input gate. If the gate controller outputs a 1, the input gate is open and the forget gate is closed; if it outputs a 0, the input gate is closed and the forget gate is open. In other words, whenever a memory must be stored, the location where it will be stored is erased first.
- There is no output gate; the full state vector is output at every time step. However, there is a new gate controller ($\textbf{r}_{(t)}$) that controls which part of the previous state is shown to the main layer.
The GRU computations for a single instance are summarized below:
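These are the standard GRU cell equations, written with the same style of notation as the LSTM equations above and following the gate convention described in the list ($\textbf{z}_{(t)} = 1$ opens the input gate and closes the forget gate); the bias terms are included for completeness:

$$
\begin{aligned}
\textbf{z}_{(t)} &= \sigma(\textbf{W}_{xz}^T \cdot \textbf{x}_{(t)} + \textbf{W}_{hz}^T \cdot \textbf{h}_{(t-1)} + \textbf{b}_z)\\
\textbf{r}_{(t)} &= \sigma(\textbf{W}_{xr}^T \cdot \textbf{x}_{(t)} + \textbf{W}_{hr}^T \cdot \textbf{h}_{(t-1)} + \textbf{b}_r)\\
\textbf{g}_{(t)} &= \tanh\big(\textbf{W}_{xg}^T \cdot \textbf{x}_{(t)} + \textbf{W}_{hg}^T \cdot (\textbf{r}_{(t)} \otimes \textbf{h}_{(t-1)}) + \textbf{b}_g\big)\\
\textbf{h}_{(t)} &= (1 - \textbf{z}_{(t)}) \otimes \textbf{h}_{(t-1)} + \textbf{z}_{(t)} \otimes \textbf{g}_{(t)}
\end{aligned}
$$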
Creating a GRU cell in TensorFlow:
gru_cell = tf.contrib.rnn.GRUCell(num_units=n_neurons)
LSTM and GRU cells are one of the main reasons behind the success of RNNs in recent years, in particular in natural language processing.
14.7 Natural Language Processing
Most state-of-the-art NLP applications, such as machine translation, automatic summarization, parsing, and sentiment analysis, are based (at least in part) on RNNs. This section takes a look at TensorFlow's Word2Vec and Seq2Seq tutorials.
14.7.1 Word Embeddings
The first problem to solve is how to represent words. (For Chinese, the usual first step is word segmentation; English words are naturally separated by spaces. Chinese can also be handled without segmentation, for example by treating single characters or character bigrams as features.) One representation scheme is the one-hot vector. Suppose the vocabulary contains 50,000 words; the nth word is then represented as a 50,000-dimensional vector with a 1 at the nth position and 0s everywhere else. However, with such a large vocabulary, this sparse representation is very inefficient.
Ideally, we want words with similar meanings to have similar representations, so that the model can generalize the patterns it learns to all similar words. For example, if the model learns that "I drink milk" is a valid sentence, and it knows that "milk" is close to "water" but far from "shoes", then it will know that "I drink water" is probably also a valid sentence, while "I drink shoes" probably is not.
A common solution is to represent each word in the vocabulary with a much smaller, denser vector (for example, 150 dimensions), called an embedding, and to let a neural network learn a good embedding for each word during training. At the beginning of training the embeddings are chosen randomly, but backpropagation makes them better and better. This means that similar words gradually converge toward similar vectors, and the dimensions of the vectors may even end up with real meaning. For example, different dimensions of a vector may come to represent gender, singular/plural, adjective/noun, and so on. (For more information, see Christopher Olah's famous blog post, as well as the series of posts by Sebastian Ruder.)
In TensorFlow, you first need to create a variable to hold the embedding of every word in the vocabulary (initialized randomly):
vocabulary_size = 50000
embedding_size = 150

embeddings = tf.Variable(
    tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
Now suppose you want to feed the sentence "I drink milk" to the neural network for training. The first step of preprocessing is to represent the sentence as a list of known words: for example, remove unnecessary special characters, replace out-of-vocabulary words with a predefined token (such as "[UNK]"), replace numerical values with "[NUM]", replace URLs with "[URL]", and so on. Words that are in the dictionary are then replaced with their IDs (from 0 to 49999), for example [72, 3335, 288]. At that point, you can use the embedding_lookup() function to get the corresponding embeddings:
train_inputs = tf.placeholder(tf.int32, shape=[None])  # from ids...
embed = tf.nn.embedding_lookup(embeddings, train_inputs)  # ...to embeddings
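A short usage sketch, reusing the word IDs from the example above (the IDs are purely illustrative): evaluating embed with a feed_dict returns one 150-dimensional embedding per input word.

import tensorflow as tf

vocabulary_size, embedding_size = 50000, 150
embeddings = tf.Variable(
    tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
train_inputs = tf.placeholder(tf.int32, shape=[None])
embed = tf.nn.embedding_lookup(embeddings, train_inputs)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # IDs for "I drink milk" from the example above (illustrative values).
    embed_val = sess.run(embed, feed_dict={train_inputs: [72, 3335, 288]})
    print(embed_val.shape)   # (3, 150): one embedding per word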
Once your model has learned good word embeddings, they can be reused efficiently in any NLP application.
14.7.2 An Encoder-Decoder Network for Machine Translation
Let's start with a simple machine translation model that translates English sentences to French, as shown in Figure 14-15:
Figure 14-15 A simple machine translation model
The English sentence is fed to the encoder, and the decoder outputs the French translation. The true French translation is also fed to the decoder as an input, but pushed back by one step (the input at the first step is <go>, the input at the second step is "Je", and so on). In other words, the decoder is given as input at each step the word it should have output at the previous step (regardless of what it actually output). The decoder's input starts with a start-of-sequence token (such as <go>), and its output ends with an end-of-sequence token (such as <eos>).
The English sentence fed to the encoder is reversed. For example, "I drink milk" becomes "milk drink I". This ensures that the beginning of the English sentence is fed to the encoder last, which is useful because that is generally the first thing the decoder needs to translate.
At each step, the decoder outputs a score for each word in the output vocabulary (here, the French dictionary), and a Softmax layer turns these scores into probabilities. For example, at the first step the word "Je" may have a probability of 20%, "Tu" a probability of 1%, and so on. The word with the highest probability is output. This is very much like a regular classification task, so you can train the model with the softmax_cross_entropy_with_logits() function.
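A minimal sketch of that training objective (the names and shapes here are illustrative and not the tutorial's code; the sparse variant of the cross-entropy function is used so the targets can stay as word IDs instead of one-hot vectors):

import tensorflow as tf

vocab_size = 50000
# Decoder scores for every word at every step: [batch_size, n_steps, vocab_size]
decoder_logits = tf.placeholder(tf.float32, shape=[None, None, vocab_size])
# Target word IDs (the true French translation): [batch_size, n_steps]
targets = tf.placeholder(tf.int32, shape=[None, None])

cross_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=targets, logits=decoder_logits)   # one loss value per time step
loss = tf.reduce_mean(cross_entropy)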
To use the model at inference time (after training), there is no target sentence to feed to the decoder. Instead, simply feed the decoder the word it output at the previous step, as shown in Figure 14-16 (the embedding lookups are omitted from the figure):
Figure 14-16 At inference time, the output of the previous step is fed as input to the current step
Now you have the big picture of the machine translation architecture. However, if you go through TensorFlow's sequence-to-sequence tutorial and look at the source code in rnn/translate/seq2seq_model.py (in the TensorFlow models repository), you will notice a few differences:
- First, so far we have assumed that all input sequences (to the encoder and to the decoder) have a fixed length. But obviously sentence lengths vary. There are several ways to handle this; for example, the static_rnn() and dynamic_rnn() functions take a sequence_length argument to specify each sentence's length. However, the tutorial uses another approach (presumably for performance reasons): sentences are grouped into buckets of similar lengths (e.g., one bucket for sentences of 1 to 6 words, another for sentences of 7 to 12 words, and so on), and the shorter sentences are padded with a special token (such as "<pad>"). For example, "I drink milk" becomes "<pad> <pad> <pad> milk drink I", and its translation becomes "Je bois du lait <eos> <pad>". Of course, we want to ignore any output past the <eos> token. The tutorial's implementation handles this with a target_weights vector. For example, for the target sentence "Je bois du lait <eos> <pad>", the weights would be set to [1.0, 1.0, 1.0, 1.0, 1.0, 0.0] (the weight of the padding token is 0.0).
- Second, since the vocabulary is large, outputting a probability for every single word in order to compute the cross-entropy would be very slow. One solution is to let the decoder output much smaller vectors, such as 1000-dimensional vectors, and then use a sampling technique to estimate the loss. This sampled Softmax technique was introduced in 2015. In TensorFlow you can use the sampled_softmax_loss() function (see the sketch after this list).
- Third, the tutorial's implementation uses an attention mechanism. Attention mechanisms for RNNs are beyond the scope of this book, but you can look up the papers on machine translation, machine reading, and image captioning.
- Finally, the tutorial's implementation uses the tf.nn.legacy_seq2seq module, which makes it easy to build various Encoder-Decoder models. For example, the embedding_rnn_seq2seq() function creates an Encoder-Decoder model that automatically takes care of word embeddings.
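Coming back to the sampled Softmax point above, here is a minimal sketch of tf.nn.sampled_softmax_loss(). The variable names, shapes, and hyperparameters are illustrative, not the tutorial's actual code: during training the full Softmax over the vocabulary is approximated by sampling a subset of classes, while at inference time the full projection is used.

import tensorflow as tf

vocab_size = 50000
decoder_dim = 1000      # the "much smaller vector" output by the decoder
num_sampled = 512       # number of classes sampled to estimate the loss

# Projection parameters from the decoder's output space to the vocabulary.
proj_w = tf.Variable(tf.random_normal([vocab_size, decoder_dim], stddev=0.1))
proj_b = tf.Variable(tf.zeros([vocab_size]))

decoder_outputs = tf.placeholder(tf.float32, shape=[None, decoder_dim])
labels = tf.placeholder(tf.int64, shape=[None, 1])   # target word IDs

# Training loss: estimated from a sample of the 50,000 classes.
train_loss = tf.reduce_mean(tf.nn.sampled_softmax_loss(
    weights=proj_w, biases=proj_b, labels=labels,
    inputs=decoder_outputs, num_sampled=num_sampled, num_classes=vocab_size))

# Inference: score every word with the full projection.
full_logits = tf.matmul(decoder_outputs, proj_w, transpose_b=True) + proj_b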