Attention Is All You Need and Its Application in TTS: Close to Human Quality TTS with Transformer, and BERT


Paper address: Attention Is All You Need

Sequence encoding

In deep-learning approaches to NLP, the basic recipe is to segment a sentence into words and map each word to a word vector, so that every sentence corresponds to a matrix \(X=(x_1,x_2,\dots,x_n)\), where \(x_i\) is the vector of the \(i\)-th word with dimension \(d\), i.e. \(X\in\mathbb{R}^{n\times d}\).

    • The first basic idea is an RNN layer, which processes the sequence recursively:
      \[y_t=f(y_{t-1},x_t)\]
      The RNN structure itself is simple and well suited to sequence modeling, but it has obvious drawbacks: it cannot be parallelized, so it is slow, and it cannot learn global structure well, because it is essentially a Markov decision process.

    • The second idea is a CNN layer, where the convolution traverses the sequence with a sliding window; for a convolution of size 3:
      \[y_t=f(x_{t-1},x_t,x_{t+1})\]
      CNNs are easy to parallelize and capture some structural information, but a single layer only sees local information, so the receptive field has to be widened by stacking layers.

    • Pure attention

      To obtain global information, an RNN has to recurse step by step (which is why bidirectional RNNs generally work better), and a CNN only sees local information and must stack layers to widen its receptive field. Attention takes a different route:
      \[y_t=f(x_t,A,B)\]
      where \(A,B\) are two other sequences (matrices). If we take \(A=B=X\), this is called self-attention: \(x_t\) is compared directly with every word of the original sequence and \(y_t\) is produced from those comparisons.

Attention Layer

The general form of attention was described in the attention-mechanism section of the spectrogram prediction network (Tacotron2) notes.

  • Attention definition

    Attention here is also a sequence-encoding scheme, i.e. a sequence-encoding layer just like RNN and CNN:
    \[Attention(Q,K,V)=softmax\left(\frac{QK^T}{\sqrt{d_k}}\right)V\]
    where \(Q\in\mathbb{R}^{n\times d_k}\), \(K\in\mathbb{R}^{m\times d_k}\), \(V\in\mathbb{R}^{m\times d_v}\). Ignoring the \(softmax\) and the \(\sqrt{d_k}\), this is simply the product of three matrices of shapes \(n\times d_k\), \(d_k\times m\) and \(m\times d_v\), which yields an \(n\times d_v\) matrix; the attention layer can therefore be seen as encoding an \(n\times d_k\) sequence \(Q\) into a new \(n\times d_v\) sequence. Row by row,
    \[Attention(q_t,K,V)=\sum_{s=1}^{m}\frac{1}{Z}\exp\left(\frac{\langle q_t,k_s\rangle}{\sqrt{d_k}}\right)v_s\]
    where \(Z\) is the normalization factor and \(Q,K,V\) are short for query, key and value; the keys \(k_s\) and values \(v_s\) correspond one to one. In words: for the query \(q_t\), take its inner product with each \(k_s\) and apply softmax to obtain the similarity of \(q_t\) to each \(v_s\), then form the weighted sum, giving a \(d_v\)-dimensional vector. The factor \(\sqrt{d_k}\) acts as a scale that keeps the inner products from growing too large (otherwise the softmax outputs are pushed towards 0 or 1 and are no longer "soft").
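For concreteness, here is a minimal NumPy sketch of scaled dot-product attention; the names and shapes are illustrative, not taken from any reference implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (n, d_k), K: (m, d_k), V: (m, d_v) -> output: (n, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (n, m) scaled similarities <q_t, k_s>/sqrt(d_k)
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V                  # weighted sum of the values

# toy example: encode a length-4 query sequence against 6 key/value pairs
Q, K, V = np.random.randn(4, 8), np.random.randn(6, 8), np.random.randn(6, 16)
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 16)
```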

  • Multi-head Attention

    Map \(Q,K,V\) through parameter matrices, apply attention, repeat this process \(h\) times, and concatenate the results. Concretely,
    \[head_i=Attention(QW_i^Q,KW_i^K,VW_i^V)\]
    where \(W_i^Q\in\mathbb{R}^{d_q\times\tilde{d}_q}\), \(W_i^K\in\mathbb{R}^{d_k\times\tilde{d}_k}\), \(W_i^V\in\mathbb{R}^{d_v\times\tilde{d}_v}\). After that,
    \[MultiHead(Q,K,V)=Concat(head_1,head_2,\dots,head_h)\]
    which finally gives an \(n\times(h\tilde{d}_v)\) matrix. The so-called multi-head attention is just doing the same thing several times in parallel; note that the parameter matrices used to project \(Q,K,V\) are not shared between heads.
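Building on the sketch above, here is a hedged sketch of multi-head attention; the per-head dimensions, the parameter lists and the optional output projection are illustrative assumptions:

```python
def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o=None):
    """W_q, W_k, W_v: lists of h per-head projection matrices (no sharing between heads)."""
    heads = [
        scaled_dot_product_attention(Q @ Wq_i, K @ Wk_i, V @ Wv_i)
        for Wq_i, Wk_i, Wv_i in zip(W_q, W_k, W_v)
    ]
    out = np.concatenate(heads, axis=-1)       # (n, h * d_tilde_v)
    return out if W_o is None else out @ W_o   # optional final projection, as in the paper

h, d_model, d_head = 8, 64, 8
W_q = [np.random.randn(d_model, d_head) for _ in range(h)]
W_k = [np.random.randn(d_model, d_head) for _ in range(h)]
W_v = [np.random.randn(d_model, d_head) for _ in range(h)]
X = np.random.randn(10, d_model)
print(multi_head_attention(X, X, X, W_q, W_k, W_v).shape)  # (10, 64)
```

Passing the same matrix as \(Q\), \(K\) and \(V\), as in the last line, is exactly the self-attention case described next.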

  • Self Attention

    Take reading comprehension as an example: \(Q\) can be the word-vector sequence of the passage and \(K=V\) the word-vector sequence of the question; the output is then the so-called aligned question embedding.

    In this paper, most of the attention is self-attention, also called internal attention.

    Self-attention is simply \(Attention(X,X,X)\), where \(X\) is the input sequence; that is, attention is computed within the sequence itself, looking for connections inside the sequence. One important contribution of the paper is to show that internal attention is very useful for sequence encoding in machine translation (and seq2seq in general), whereas previous seq2seq work had basically applied the attention mechanism only at the decoding end. More precisely, this paper uses multi-head self-attention:
    \[Y=MultiHead(X,X,X)\]

  • Position Embedding

    The model presented above cannot capture word order. In other words, if the rows of \(K,V\) are shuffled, which amounts to shuffling the word order within a sentence, the attention result stays exactly the same. This shows that the attention model described so far is, at best, a sophisticated bag-of-words model.

    Position embedding, i.e. the position vector: each position is numbered and every number corresponds to a vector. By training the position vectors together with the word vectors, each word carries some positional information, and attention can then distinguish words at different positions.

    Position embedding is the only source of positional information for the attention mechanism in this paper. It is constructed by the formula
    \[\left\{\begin{matrix}PE_{2i}(p)=\sin\left(\frac{p}{10000^{2i/d_{pos}}}\right)\\PE_{2i+1}(p)=\cos\left(\frac{p}{10000^{2i/d_{pos}}}\right)\end{matrix}\right.\]
    which maps position id \(p\) to a \(d_{pos}\)-dimensional vector whose \(i\)-th element is \(PE_{i}(p)\). The paper compares trained position vectors with the position vectors computed by this formula and finds that they work about equally well; the formula-based position embedding is obviously the easier one to obtain.

    Position embedding by itself is absolute positional information, but in language relative position also matters a great deal. An important reason for using the sinusoidal formula above is that
    \[\sin(\alpha+\beta)=\sin\alpha\cos\beta+\cos\alpha\sin\beta,\qquad \cos(\alpha+\beta)=\cos\alpha\cos\beta-\sin\alpha\sin\beta\]
    which indicates that the vector at position \(p+k\) can be expressed as a linear transformation of the vectors at positions \(p\) and \(k\); this offers the possibility of expressing relative position. Position vectors and word vectors can be combined either by element-wise addition or by concatenation; intuitively, element-wise addition loses information, but the experiments in the paper show the difference is not large.
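A minimal sketch of the sinusoidal position embedding defined above (the function and variable names are illustrative):

```python
def sinusoidal_position_embedding(num_positions, d_pos):
    """PE[p, 2i] = sin(p / 10000^(2i/d_pos)), PE[p, 2i+1] = cos(p / 10000^(2i/d_pos))."""
    positions = np.arange(num_positions)[:, None]        # (num_positions, 1)
    i = np.arange(0, d_pos, 2)[None, :]                  # even dimension indices 0, 2, 4, ...
    angles = positions / np.power(10000.0, i / d_pos)    # (num_positions, d_pos // 2)
    pe = np.zeros((num_positions, d_pos))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_position_embedding(num_positions=100, d_pos=64)
print(pe.shape)  # (100, 64); added element-wise to the word vectors
```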

The shortcomings

The benefit of the attention layer is that it captures global dependencies in a single step, because it compares every pair of positions in the sequence directly (computational complexity \(O(n^2)\)). Since this is a matrix operation, the amount of computation is not a serious problem in practice. By contrast, an RNN needs step-by-step recursion to capture global connections, and a CNN needs stacked layers to widen its receptive field.

    • The paper specifically names a position-wise feed-forward network, which is in fact a one-dimensional convolution with window size 1.
    • Although attention has no direct connection to CNNs, it borrows heavily from CNN ideas: multi-head attention, i.e. doing attention several times and concatenating the results, mirrors the multiple convolution kernels of a CNN, and the residual structure also comes from CNNs.
    • It cannot model positional information well. Training a text-classification or machine-translation model with this pure attention mechanism works fine, but it should not be expected to work too well on sequence-labeling problems.
    • Not all problems need long-range, global dependencies; many depend only on local structure. The restricted version of self-attention mentioned in the paper assumes the current word is related only to the \(r\) words before and after it, so attention happens only among \(2r+1\) words. This also captures the local information of the sequence, much like the window of a convolution kernel; a mask-based sketch follows.
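As a hedged sketch of that restricted (local) variant, a band mask of width \(2r+1\) can be applied to the attention scores before the softmax; the masking scheme below (reusing the softmax helper from the earlier sketch) is a common implementation choice, not taken from the paper:

```python
def local_self_attention(X, r):
    """Self-attention in which position t attends only to positions within distance r."""
    n, d_k = X.shape
    scores = X @ X.T / np.sqrt(d_k)
    idx = np.arange(n)
    band = np.abs(idx[:, None] - idx[None, :]) <= r  # True inside the 2r+1 window
    scores = np.where(band, scores, -1e9)            # block attention outside the window
    return softmax(scores, axis=-1) @ X

X = np.random.randn(10, 8)
print(local_self_attention(X, r=2).shape)  # (10, 8)
```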
Transformer

In this paper, the Transformer completely abandons the recurrent structure and relies on the attention mechanism alone to mine the relationship between input and output, which makes the computation parallelizable.

    • Encoder: the encoder of the Transformer model is a stack of 6 identical layers. Each layer contains two sub-layers: the first is a multi-head attention sub-layer, the second a fully connected feed-forward network; residual connections and layer normalization are applied to both sub-layers.
    • Decoder: the decoder of the Transformer model is likewise a stack of 6 identical layers. Besides the two sub-layers found in the encoder layer, each decoder layer adds a third attention sub-layer that attends over the encoder output, and the decoder's own self-attention is masked (masked multi-head attention) so that a position cannot attend to later positions. Residual connections and layer normalization are also used.
Summarize

The attention mechanism, simply put, maps a query and a set of key-value pairs to an output. The output is a weighted sum of the values, where the weight assigned to each value is computed from the query and the corresponding key.

    • In the encoder-decoder attention layers, the queries come from the previous decoder layer, and the keys and values come from the output of the encoder; this allows every position of the decoder to attend over all positions of the input sequence.
    • The encoder contains self-attention layers, in which all keys, values and queries come from the output of the previous encoder layer; this allows each position of the encoder to attend over all positions of the previous encoder layer's output.
    • Self-attention layers are also included in the decoder, where masking keeps each position from attending to subsequent positions.

Feed-forward network: this is a position-wise feed-forward network; every layer of the encoder and decoder contains one. The order of operations is linear, ReLU, linear:
\[FFN(x)=max(0,xW_1+b_1)W_2+b_2\]
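A minimal sketch of this position-wise feed-forward sub-layer, wrapped with the residual connection and layer normalization described above (parameter shapes are illustrative):

```python
def layer_norm(x, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def position_wise_ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied to every position independently."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

def ffn_sublayer(x, W1, b1, W2, b2):
    """Residual connection + layer normalization, as in each encoder/decoder layer."""
    return layer_norm(x + position_wise_ffn(x, W1, b1, W2, b2))

d_model, d_ff = 64, 256
W1, b1 = np.random.randn(d_model, d_ff) * 0.1, np.zeros(d_ff)
W2, b2 = np.random.randn(d_ff, d_model) * 0.1, np.zeros(d_model)
x = np.random.randn(10, d_model)
print(ffn_sublayer(x, W1, b1, W2, b2).shape)  # (10, 64)
```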
Positional encoding: introduces position information (see the position embedding section above).

Close to Human Quality TTS with Transformer

Paper address: Close to Human Quality TTS with Transformer

Current speech-synthesis systems still have two problems to solve: 1) training and inference are inefficient; 2) it is hard to model long-range dependencies with an RNN. Inspired by the Transformer network from machine-translation research, this paper embeds the "fully attentional" structure into Tacotron2: a multi-head attention mechanism replaces the RNN structure and the original attention mechanism of Tacotron2. With multi-head self-attention, the encoder and decoder can be trained in parallel, which improves training efficiency; at the same time, inputs at any two different time steps are connected directly by attention, which effectively addresses the long-range dependency problem. This improved version of Tacotron2 adds a phoneme-preprocessing structure at the input and uses WaveNet as the vocoder.

The overall structure is shown in the paper's figure. Before the encoder pre-net, a text-to-phoneme converter is added. The paper's explanation: in English the same letter may be pronounced differently, e.g. the letter 'a' may be pronounced /ei/, /æ/ or /a:/. Previously the neural network was relied upon to learn such pronunciation rules, but with a small dataset it may fail to learn them effectively, so a rule-based front end is used here to convert the text into phonemes before feeding them into the model.

  • Scaled positional Encoding

    The positional encoding differs from Attention Is All You Need, which adds the positional information to the word vectors directly by element-wise addition; this paper instead multiplies the position vector by a trainable weight:
    \[Attention\ Is\ All\ You\ Need:\quad x_i=prenet(phoneme_i)+PE(i)\]
    \[this\ version:\quad x_i=prenet(phoneme_i)+\alpha PE(i)\]
    where \(\alpha\) is the trainable weight. The reason given in the paper: the source space is text while the target space is the mel spectrogram, so using fixed position vectors would impose a very strong constraint on the encoder and decoder pre-nets.
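A hedged sketch of the scaled positional encoding, reusing the sinusoidal embedding sketched earlier (prenet_out stands in for whichever pre-net output precedes it):

```python
def add_scaled_position(prenet_out, alpha):
    """x_i = prenet(phoneme_i) + alpha * PE(i); alpha is a trainable scalar weight."""
    n, d = prenet_out.shape
    return prenet_out + alpha * sinusoidal_position_embedding(n, d)

prenet_out = np.random.randn(20, 64)  # stand-in for encoder pre-net outputs
print(add_scaled_position(prenet_out, alpha=1.0).shape)  # (20, 64)
```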

  • Encoder pre-net
    \[embedding\ (dim\ 512)\to 3\times conv\ (dim\ 512)\to batch\ normalization\to ReLU\to linear\ projection\]
    Note: a linear projection is added after the last ReLU layer, mainly because ReLU outputs values in \([0,+\infty)\) while each dimension of the position vector lies in \([-1,1]\); centring the two ranges differently would hurt the model's performance.
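A minimal sketch of this encoder pre-net, with NumPy stand-ins for the embedding table, convolutions, batch normalization and final projection (all shapes, kernel sizes and initializations are illustrative assumptions, not the paper's code):

```python
def conv1d_same(x, W, b):
    """x: (n, d_in), W: (kernel, d_in, d_out); 1-D convolution with 'same' padding."""
    k = W.shape[0]
    xp = np.pad(x, ((k // 2, k // 2), (0, 0)))
    return np.stack([(xp[t:t + k][:, :, None] * W).sum(axis=(0, 1)) + b
                     for t in range(x.shape[0])])

def batch_norm(x, eps=1e-5):
    return (x - x.mean(axis=0)) / (x.std(axis=0) + eps)

def encoder_prenet(phoneme_ids, emb, convs, W_proj, b_proj):
    x = emb[phoneme_ids]                           # phoneme embedding (dim 512 in the paper)
    for W, b in convs:                             # 3 conv layers, each with BN + ReLU
        x = np.maximum(0.0, batch_norm(conv1d_same(x, W, b)))
    return x @ W_proj + b_proj                     # final linear projection re-centres the ReLU output

emb = np.random.randn(40, 512) * 0.01              # 40 phoneme symbols (illustrative)
convs = [(np.random.randn(5, 512, 512) * 0.01, np.zeros(512)) for _ in range(3)]
W_proj, b_proj = np.random.randn(512, 512) * 0.01, np.zeros(512)
print(encoder_prenet(np.array([3, 7, 1, 1, 9]), emb, convs, W_proj, b_proj).shape)  # (5, 512)
```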

  • Decoder pre-net

    During training, the mel-spectrogram frames first pass through a 2-layer fully connected network whose purpose is to embed the mel frames, just as the text is mapped into a subspace, so that \(<phoneme, mel\ frame>\) pairs can lie in the same space and the attention mechanism can do its work. In the experiments, increasing the size of the fully connected layers from 256 to 512 brought no obvious improvement and slowed convergence, possibly because the mel spectrogram lies in a small, low-dimensional subspace for which 256 units are sufficient, while a larger network makes the model more complex and harder to converge. Likewise, a linear projection is added after the decoder pre-net to keep the same range centre as the position vectors.
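A corresponding sketch of the decoder pre-net: two fully connected ReLU layers followed by a linear projection; the layer width of 256 is the paper's, everything else is an illustrative assumption:

```python
def decoder_prenet(mel_frames, W1, b1, W2, b2, W_proj, b_proj):
    """mel_frames: (t, n_mels) -> (t, d_out); 2 FC + ReLU layers, then a linear projection."""
    h = np.maximum(0.0, mel_frames @ W1 + b1)   # FC layer 1 (256 units)
    h = np.maximum(0.0, h @ W2 + b2)            # FC layer 2 (256 units)
    return h @ W_proj + b_proj                  # projection re-centres to match the position vectors

n_mels, d_hidden = 80, 256
W1, b1 = np.random.randn(n_mels, d_hidden) * 0.01, np.zeros(d_hidden)
W2, b2 = np.random.randn(d_hidden, d_hidden) * 0.01, np.zeros(d_hidden)
W_proj, b_proj = np.random.randn(d_hidden, d_hidden) * 0.01, np.zeros(d_hidden)
print(decoder_prenet(np.random.randn(30, n_mels), W1, b1, W2, b2, W_proj, b_proj).shape)  # (30, 256)
```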

  • Encoder and Decoder

    Replace the bi-directional LSTM encoder and the 2-layer LSTM decoder of Tacotron2 with the Transformer encoder and decoder respectively. The paper mentions wanting to replace the dot-product attention with location-sensitive attention, but found that this doubled the training time and easily exhausted memory.

  • Mel Linear, Stop Linear and Post-net

    Because the stop-token samples are severely imbalanced (only the final frame of each utterance is a positive example), the solution is to put a weight of 5.0~8.0 on the positive term of the stop-token cross-entropy loss.
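A minimal sketch of a positively weighted binary cross-entropy for the stop token; the exact weighting form is a common implementation choice, not taken from the paper's code:

```python
def weighted_stop_token_bce(logits, targets, pos_weight=6.0, eps=1e-7):
    """Binary cross-entropy where positive (stop = 1) frames are up-weighted by pos_weight."""
    probs = 1.0 / (1.0 + np.exp(-logits))
    loss = -(pos_weight * targets * np.log(probs + eps)
             + (1.0 - targets) * np.log(1.0 - probs + eps))
    return loss.mean()

logits = np.random.randn(30)                # one stop logit per decoder frame
targets = np.zeros(30); targets[-1] = 1.0   # only the last frame is a positive stop token
print(weighted_stop_token_bce(logits, targets))
```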

  • Experiment

    In terms of experimental results, increasing the number of Transformer layers and the number of attention heads can improve model quality, but slows training down.

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Paper address: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

The paper states that the model and code will be released this month (2018/10): BERT model & code

In essence, BERT is a pre-trained model with astonishing results. The paper divides pre-trained models into two kinds: one, like word2vec, supplies effective features to downstream models; the other, like a ResNet pre-trained on ImageNet and used as the backbone of Faster R-CNN, serves directly as the skeleton of the downstream model, one model for all tasks. The approach in this paper is the latter: the pre-trained BERT is fine-tuned with just one additional output layer and achieves the best current performance on many tasks.

    • Model structure

      BERT's model structure is a multi-layer bidirectional Transformer encoder (based on the implementation described in Attention Is All You Need above).

    • Model input

      Model input: token embedding + segment embedding + position embedding.
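A hedged sketch of how the three embeddings are summed per input position (the table sizes follow BERT-base and the token ids are purely illustrative):

```python
def bert_input_embedding(token_ids, segment_ids, token_table, segment_table, position_table):
    """Element-wise sum of token, segment and position embeddings, one row per input position."""
    n = len(token_ids)
    return (token_table[token_ids]
            + segment_table[segment_ids]
            + position_table[np.arange(n)])

vocab, max_len, d = 30522, 512, 768                   # BERT-base sizes, for illustration
token_table = np.random.randn(vocab, d) * 0.02
segment_table = np.random.randn(2, d) * 0.02          # sentence A / sentence B
position_table = np.random.randn(max_len, d) * 0.02   # BERT uses learned position embeddings
token_ids = np.array([101, 7592, 2088, 102])          # illustrative ids for [CLS] hello world [SEP]
segment_ids = np.array([0, 0, 0, 0])
print(bert_input_embedding(token_ids, segment_ids, token_table, segment_table, position_table).shape)  # (4, 768)
```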

    • Pre-training tasks

      BERT is not pre-trained with a traditional left-to-right or right-to-left language model; instead, two new unsupervised prediction tasks are used.

      • Masked LM

        Randomly mask some of the input tokens and then predict the masked tokens; this kind of task is referred to in other literature as a Cloze task, and here as masked LM (MLM). Since [MASK] never appears during fine-tuning, the tokens chosen for masking are not always replaced with [MASK]. Instead, 15% of the tokens are selected at random; among these, with 80% probability the token is replaced by [MASK] as described above, with 10% probability it is replaced by another random token, and with 10% probability it is left unchanged. In this way the Transformer must learn a distributed contextual representation for every token.
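A minimal sketch of this 80%/10%/10% masking procedure (the id for [MASK] and the special-token handling are illustrative):

```python
def mask_tokens(token_ids, vocab_size, mask_id, rng, select_prob=0.15):
    """Select 15% of positions: 80% -> [MASK], 10% -> random token, 10% -> unchanged."""
    token_ids = token_ids.copy()
    selected = rng.random(len(token_ids)) < select_prob
    labels = np.where(selected, token_ids, -1)              # predict only at the selected positions
    roll = rng.random(len(token_ids))
    to_mask = selected & (roll < 0.8)                       # 80%: replace with [MASK]
    to_random = selected & (roll >= 0.8) & (roll < 0.9)     # 10%: replace with a random token
    token_ids[to_mask] = mask_id                            # remaining 10%: keep the original token
    token_ids[to_random] = rng.integers(0, vocab_size, to_random.sum())
    return token_ids, labels

rng = np.random.default_rng(0)
ids = rng.integers(1000, 2000, size=20)
masked_ids, labels = mask_tokens(ids, vocab_size=30522, mask_id=103, rng=rng)
```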

      • Next sentence prediction

        Pre-train a binary next-sentence classification task: pick sentences A and B as a training sample, where B is the actual next sentence of A with 50% probability and a random sentence with 50% probability, and require the model to learn to judge which is the case. The motivation is that many downstream tasks, such as QA and natural language inference (NLI), rest on understanding the relationship between two sentences, so a model is trained to understand sentence relationships.

    • Loss function

      \[loss=masked\ LM\ likelihood+mean\ next\ sentence\ likelihood\]

Reference documents

"Attention are all Need" (Introduction + code)

"Attention is all need" reading notes

How to evaluate the BERT model?

The strongest NLP pre-training model! Google Bert sweeps 11 NLP Mission Records

