I. Why sequence models?
(1) Sequence models are widely used in speech recognition, music generation, sentiment analysis, DNA sequence analysis, machine translation, video activity recognition, named entity recognition, and many other fields.
(2) The above problems can all be viewed as supervised learning with (x, y) pairs as the training set, but there are many possible mappings between input and output:
for example, one-to-one, many-to-many (same length), one-to-many, many-to-one, and many-to-many (different lengths), each suited to different applications.
II. Mathematical notation
Suppose there is a sentence like this:
X: Harry Potter and Hermione Granger invented a new spell.
The goal is to identify the named entities in the sentence. Named entities include person names, place names, and organization names.
The input sentence can be regarded as a sequence of words, so the expected output is the following sequence:
Y: 1 1 0 1 1 0 0 0 0
where 1 represents "entity" and 0 represents "non-entity".
(Of course, real named entity recognition is more complex than this output suggests; it also needs to indicate the start and end positions of each entity. In this example, the simple output format above is used for explanation.)
Obviously, the input sequence x and the output sequence y have the same length, and their elements align by index. The following notation is used for the input and output:
\(x^{<t>}\) denotes the input at time step \(t\), and \(y^{<t>}\) the output at time step \(t\);
\(T_x\) denotes the length of the input sequence \(x\), and \(T_y\) the length of the output sequence produced by the model after receiving \(x\); in this example, both lengths are equal to 9.
There are usually many training samples. The input and output at time step \(t\) of the \(i\)-th sample are written as:
\(x^{(i)<t>}\), \(y^{(i)<t>}\)
and the lengths of the input and output sequences of the \(i\)-th sample are written as:
\(T^{(i)}_x\), \(T^{(i)}_y\)
III. What is a recurrent neural network? (1) Why not a standard neural network?
- 1. The input and output can have different lengths in different samples. Even if a maximum length could be found and every sample padded to it, this representation is still not good enough;
- 2. A standard neural network cannot share features learned at different positions in the text. For example, if Harry at the first position is recognized as a person's name, a standard network cannot automatically recognize Harry at another position as a person's name; it must learn this again.
(2) What is a recurrent neural network?
Still using this sentence as an example:
X: Harry Potter and Hermione Granger invented a new spell.
First, the first word Harry is used as the first input \(x^{<1>}\); it passes through a stack of hidden layers and produces the output \(\hat{y}^{<1>}\).
Then the second word Potter is used as the second input and passes through the same hidden-layer structure to produce an output. This time, however, the input is not only the second word Potter:
the hidden layer also receives a value carried over from the previous word's hidden layer (usually called the activation) \(a\) as an additional input.
Similarly, the third word "and" is input together with the activation from the second word, and so on until the last word.
In addition, an activation value is also required before the first word; it can be fabricated artificially, either as a zero vector or as a value randomly initialized by some method.
The words are input one at a time, one per time step, and all time steps share the same parameters: the input-to-hidden parameters are \(W_{ax}\), and the activation-to-hidden parameters are \(W_{aa}\).
With this structure, a word input at an earlier time step influences the prediction for the next word through the activation value, and can even influence the predictions for all subsequent words. This is a recurrent neural network.
(3) Forward Propagation
\(a^{<0>}\) is initialized manually;
\(x^{<1>}\) is the input at time \(t = 1\);
the input-layer weights are \(W_{ax}\);
the activation (hidden-to-hidden) weights are \(W_{aa}\);
the output-layer weights are \(W_{ya}\);
the activation \(a^{<t>}\) and the output \(\hat{y}^{<t>}\) must be calculated at each time step.
\(a^{<1>}\) is calculated as:
\(a^{<1>} = g(W_{aa} a^{<0>} + W_{ax} x^{<1>} + b_a)\)
\(\hat{y}^{<1>}\) is calculated as:
\(\hat{y}^{<1>} = g(W_{ya} a^{<1>} + b_y)\)
\(g()\) is an activation function, usually tanh, and sometimes ReLU to alleviate the vanishing-gradient problem. If the output is a binary classification, sigmoid is often used as the output activation; in this example, sigmoid is used to decide whether each word is an entity.
In general, the activation \(a^{<t>}\) and the output \(\hat{y}^{<t>}\) are calculated as follows:
\(a^{<t>} = g(W_{aa} a^{<t-1>} + W_{ax} x^{<t>} + b_a)\)
\(\hat{y}^{<t>} = g(W_{ya} a^{<t>} + b_y)\)
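The forward pass above can be sketched in numpy as follows. This is a toy illustration, not the course's code: the dimensions, the function name `rnn_forward`, and the random weights are all made up, and tanh/sigmoid are chosen per the discussion of \(g()\).

```python
import numpy as np

def rnn_forward(x_seq, a0, Waa, Wax, Wya, ba, by):
    """Run an RNN forward over a sequence of input vectors.

    At each step t:
        a<t> = tanh(Waa @ a<t-1> + Wax @ x<t> + ba)
        y<t> = sigmoid(Wya @ a<t> + by)   # binary output, as in the entity example
    """
    a = a0
    outputs = []
    for x in x_seq:
        a = np.tanh(Waa @ a + Wax @ x + ba)
        y = 1.0 / (1.0 + np.exp(-(Wya @ a + by)))  # sigmoid
        outputs.append(y)
    return a, outputs

# Toy dimensions (assumed): hidden size 4, vocabulary size 6, sequence length 3
rng = np.random.default_rng(0)
n_a, n_x = 4, 6
Waa = rng.normal(size=(n_a, n_a))
Wax = rng.normal(size=(n_a, n_x))
Wya = rng.normal(size=(1, n_a))
ba, by = np.zeros(n_a), np.zeros(1)
x_seq = [np.eye(n_x)[i] for i in (0, 2, 5)]  # three one-hot "words"
a_last, ys = rnn_forward(x_seq, np.zeros(n_a), Waa, Wax, Wya, ba, by)
print(len(ys), ys[0].shape)  # one sigmoid output per time step
```

Note how the same weight matrices are reused at every time step; only the activation `a` carries information forward.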
(4) Simplified RNN notation
Here \(W_a\) is the horizontal concatenation of \(W_{aa}\) and \(W_{ax}\): if \(W_{aa}\) has dimensions 100x100 and \(W_{ax}\) has dimensions 100x10000, the combined \(W_a\) has dimensions 100x10100. Correspondingly, \([a^{<t-1>}, x^{<t>}]\) stacks the two vectors vertically into a single 10100-dimensional vector.
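This concatenation identity can be checked numerically; the sketch below uses small made-up dimensions (3 and 5) standing in for the 100 and 10000 in the text:

```python
import numpy as np

rng = np.random.default_rng(1)
n_a, n_x = 3, 5                      # toy sizes standing in for 100 and 10000
Waa = rng.normal(size=(n_a, n_a))
Wax = rng.normal(size=(n_a, n_x))
a_prev = rng.normal(size=n_a)
x_t = rng.normal(size=n_x)

# Wa = [Waa | Wax]: horizontal concatenation, shape (3, 8), i.e. 100 x 10100 in the text
Wa = np.hstack([Waa, Wax])
# [a<t-1>, x<t>]: the two vectors stacked into one vector of length 8
ax = np.concatenate([a_prev, x_t])

lhs = Waa @ a_prev + Wax @ x_t
rhs = Wa @ ax
print(np.allclose(lhs, rhs))  # True
```

The single matrix-vector product reproduces the original two-term sum exactly, which is why the simplified notation loses nothing.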
Similarly, the second formula simplifies to:
\(\hat{y}^{<t>} = g(W_y a^{<t>} + b_y)\)
IV. Backpropagation through time in RNNs (1) Review of forward propagation
Recall the forward propagation algorithm. There is a sequence of length \(T_x\);
from the input \(x\), the activation \(a\) can be computed at each time step.
As described in the previous section, the activation \(a^{<t>}\) at time \(t\) is determined by the activation \(a^{<t-1>}\) at time \(t-1\) and the input \(x^{<t>}\) at time \(t\), multiplied by the parameters:
\(a^{<t>} = g(W_{aa} a^{<t-1>} + W_{ax} x^{<t>} + b_a) \Longrightarrow a^{<t>} = g(W_a [a^{<t-1>}, x^{<t>}] + b_a)\)
The computation at every time step shares the same set of parameters \(W_a\), \(b_a\).
Then the RNN output \(\hat{y}\) is computed.
Recall that the output \(\hat{y}^{<t>}\) is the activation at the current time step multiplied by the parameters:
\(\hat{y}^{<t>} = g(W_y a^{<t>} + b_y)\)
The computation at every time step also shares one set of parameters \(W_y\), \(b_y\).
This completes the forward propagation.
(2) Loss function
The forward propagation process is the basis of backpropagation. To perform backpropagation, a loss function must be defined.
The cross-entropy loss function is used here:
- First, calculate the loss at each time step \(t\), that is, the loss of the single output \(\hat{y}^{<t>}\) for the input \(x^{<t>}\):
\(L^{<t>}(\hat{y}^{<t>}, y^{<t>}) = -y^{<t>} \log \hat{y}^{<t>} - (1 - y^{<t>}) \log (1 - \hat{y}^{<t>})\)
- Then calculate the total loss, that is, the sum of the losses over all time steps:
\(L(\hat{y}, y) = \sum_{t=1}^{T_y} L^{<t>}(\hat{y}^{<t>}, y^{<t>})\)
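The two steps above can be sketched in Python using the binary cross-entropy form of the loss; the predicted values below are invented, and the labels follow the 9-word entity example:

```python
import numpy as np

def timestep_loss(y_hat, y):
    """Cross-entropy loss at a single time step t (binary output case)."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def total_loss(y_hats, ys):
    """Total loss: the sum of the per-time-step losses."""
    return sum(timestep_loss(y_hat, y) for y_hat, y in zip(y_hats, ys))

# Invented predictions and the entity labels for a 9-word sentence
y_hats = [0.9, 0.8, 0.2, 0.7, 0.6, 0.1, 0.3, 0.2, 0.1]
ys     = [1,   1,   0,   1,   1,   0,   0,   0,   0]
print(round(float(total_loss(y_hats, ys)), 4))
```

Confident correct predictions (e.g. 0.9 for label 1) contribute little loss; wrong or uncertain ones dominate the sum.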
(3) Backpropagation
Compute the gradients of the loss function with respect to the parameters and update the parameters by gradient descent. (In the original figure, the red arrows indicate the backpropagation process.)
Backpropagation in an RNN is also called backpropagation through time.
V. Different types of RNN (1) Many-to-many
In named entity recognition, the input and output sequences have the same length, with one output element per input element. This sequence-to-sequence structure is a many-to-many structure.
1] The input and output sequences have the same length.
2] A many-to-many structure can also have input and output sequences of different lengths; the most common example is machine translation. First look at the structure:
The model consists of two parts. The first part, called the encoder, reads the text sequence to be translated one element at a time but produces no output. The second part, called the decoder, takes the encoder's output as its input;
it receives no further input at each time step but produces an output at every time step. The decoder's output sequence is the translation of the encoder's input sequence.
(2) Many-to-one
In text sentiment analysis, the input is a text sequence, with one word per input, and the output is usually a category label: for example, a movie rating on a five-point scale, or a two-way label deciding whether the sentiment of a text is positive or negative.
An output \(y\) is produced only at the last time step, with no output at the other time steps. This is called a many-to-one structure.
(3) One-to-many
Music generation is an example: the output is a sequence of notes, and the input can be an integer indicating the desired musical style, or the first note indicating where to start, or nothing at all, leaving the model free to generate.
First, there is an input \(x\) only at the first time step, and none at the others;
second, there is an output at every time step, and together the outputs form a sequence.
Note one technical detail: when an RNN is used to generate a sequence, the output \(\hat{y}^{<t-1>}\) from the previous time step is also used as the input at time \(t\) (indicated by the red arrow in the original figure).
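That feedback loop can be sketched as follows. This is a toy sampler with untrained random weights and assumed dimensions; the function name `generate` and the softmax/sampling details are illustrative, not a definitive implementation:

```python
import numpy as np

def generate(a0, x0, steps, Waa, Wax, Wya, ba, by, rng):
    """One-to-many generation: the sampled output at step t-1 becomes the input at step t."""
    a, x = a0, x0
    sampled = []
    for _ in range(steps):
        a = np.tanh(Waa @ a + Wax @ x + ba)
        logits = Wya @ a + by
        p = np.exp(logits - logits.max())
        p /= p.sum()                       # softmax over the "note" / word vocabulary
        idx = rng.choice(len(p), p=p)      # sample the next symbol
        sampled.append(int(idx))
        x = np.eye(len(p))[idx]            # feed the sample back as the next input
    return sampled

# Toy run: vocabulary of 6 symbols, hidden size 4, generate 5 steps from an empty input
rng = np.random.default_rng(42)
n_a, n_v = 4, 6
Waa = rng.normal(size=(n_a, n_a))
Wax = rng.normal(size=(n_a, n_v))
Wya = rng.normal(size=(n_v, n_a))
ba, by = np.zeros(n_a), np.zeros(n_v)
seq = generate(np.zeros(n_a), np.zeros(n_v), 5, Waa, Wax, Wya, ba, by, rng)
print(seq)
```

The key line is the last one in the loop: the one-hot encoding of the sampled symbol becomes the input at the next time step, exactly the red-arrow feedback described above.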
(4) Summary
VI. Language model and sequence generation (1) What is a language model?
Take speech recognition as an example: speech recognition converts speech into text. (In China, iFLYTEK (KEDA xunfei) is well known for its speech recognition.)
If you say "the apple and pear salad", the recognizer might produce the following two candidate sentences:
The apple and pair salad.
The apple and pear salad.
"Pair" and "pear" have the same pronunciation, but the second sentence is obviously what the speaker intended. How should the most correct sentence be selected? This is where the language model comes in.
The language model computes the probability of each of the two sentences.
The probability of the second sentence is nearly 100 times that of the first, so the second sentence is selected. In other words, a language model gives the probability of any sentence, so that the most likely correct sentence can be chosen by probability.
Currently, the language model has two main application areas. The first, mentioned above, is speech recognition; the second is machine translation, which similarly finds the most correct translation by computing sentence probabilities.
(2) How to build a language model with an RNN?
1] First, a training set is needed: a large corpus. The larger the corpus, the better.
2] Tokenize the corpus: build a vocabulary from the corpus and represent each word as a one-hot vector. Note:
A. An \(<EOS>\) token is usually appended to indicate the end of a sentence.
B. Words in new text that are not in the vocabulary are represented by the \(<UNK>\) token.
C. Whether punctuation marks are included in the vocabulary can be decided according to the specific requirements and problem.
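Steps 1] and 2] can be sketched as follows; the toy corpus and the helper `encode` are invented for illustration, and real preprocessing is more involved:

```python
# Build a vocabulary with <EOS>/<UNK> from a toy corpus, then map a
# tokenized sentence to word indices (one-hot vectors would index into this).
corpus = ["cats sleep a lot", "dogs sleep a little"]
vocab = {"<EOS>": 0, "<UNK>": 1}
for sentence in corpus:
    for word in sentence.split():
        vocab.setdefault(word, len(vocab))

def encode(sentence):
    """Map each word to its vocabulary index; out-of-vocabulary words map to <UNK>."""
    return [vocab.get(w, vocab["<UNK>"]) for w in sentence.split()] + [vocab["<EOS>"]]

print(encode("cats sleep a day"))  # "day" is not in the corpus -> <UNK>
```

Each index here would become the hot position of the corresponding one-hot input vector.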
3] Build the RNN model.
Suppose there is a sentence like this:
At the first time step \(t = 1\), the input \(x^{<1>}\) is a zero vector, because there is no word before the first word; the activation \(a^{<0>}\) is also initialized to a zero vector. After the computation, the resulting \(\hat{y}^{<1>}\) is a softmax vector
whose length is the size of the vocabulary; the value at each position is the probability of the corresponding word in the vocabulary. Note that the vocabulary also includes \(<EOS>\) and \(<UNK>\).
The input \(x^{<2>}\) at the second time step \(t = 2\) is actually the first word \(y^{<1>}\): cat. Softmax again produces a probability-distribution vector over the vocabulary; this probability is \(P(\text{word} \mid \text{cat})\), that is, the conditional probability
of each word in the vocabulary given that the first word is cat.
This continues until the last time step, where the input is \(x^{<9>} = y^{<8>} = \text{"day"}\).
4] Compute the loss function.
The predicted probability distribution is compared with the actual target to compute the loss. The loss formula is as follows:
\(L^{<t>}(\hat{y}^{<t>}, y^{<t>}) = -\sum_i y_i^{<t>} \log \hat{y}_i^{<t>}\)
where \(i\) traverses the vocabulary, summing the contribution of each word in the vocabulary.
The total loss is the sum of the losses over all time steps:
\(L = \sum_t L^{<t>}(\hat{y}^{<t>}, y^{<t>})\)
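A numeric sketch of this loss: because the target \(y^{<t>}\) is one-hot, the sum over \(i\) reduces to minus the log-probability that the model assigns to the actual next word. The softmax vector below is invented for illustration:

```python
import numpy as np

def lm_timestep_loss(y_hat, target_idx):
    """Softmax cross-entropy at time t: with a one-hot target, the sum over
    the vocabulary reduces to -log of the probability of the true next word."""
    return -np.log(y_hat[target_idx])

# A made-up softmax output over a toy 5-word vocabulary; the true next word is index 2
y_hat = np.array([0.1, 0.2, 0.5, 0.1, 0.1])
loss_t = lm_timestep_loss(y_hat, 2)

# Total loss: the sum of the per-time-step losses over a (made-up) target sequence
total = sum(lm_timestep_loss(y_hat, i) for i in (2, 1, 0))
print(float(loss_t), float(total))
```

The better the model's probability for the true next word, the smaller the per-step loss; training minimizes the sum over all time steps.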
[Sequence Models] Lesson 1 -- Recurrent Sequence Models