Applications of deep learning in machine translation



About the author

The author, Dai, is a deep learning enthusiast focusing on NLP. This article introduces the current state of machine translation and the basic principles and processes involved, for beginners who are interested in deep learning.


This article gives only a brief introduction to these applications and does not involve formula derivations (some of the figures are taken from the internet).


1. The development of machine translation


Before the 1980s, machine translation relied mainly on linguistics: parsing syntax, semantics, and pragmatics.

Later, researchers began to apply statistical models to machine translation, generating translations from the analysis of existing text corpora.

Since 2012, with the rise of deep learning, neural networks have been applied to machine translation and have achieved remarkable results in just a few years.


2. Neural machine translation (NMT)

In 2013, Nal Kalchbrenner and Phil Blunsom presented a new end-to-end encoder-decoder architecture for machine translation. In 2014, Sutskever et al. proposed sequence-to-sequence (seq2seq) learning, and Google provided a concrete implementation of this model in the tutorials for its deep learning framework TensorFlow, with good results (see https://www.tensorflow.org/tutorials/seq2seq).


2.1 (Preliminaries) A quick introduction to neural networks

Deep learning (the name sounds impressive) simply refers to multilayer neural networks. See the figure below.


The figure shows a single-layer neural network. A multilayer neural network inserts several hidden layers in the middle, each containing a number of nodes, while the input layer and the output layer remain a single layer each.

Traditional programming gives the input and specifies every step, finally producing the output. A neural network instead starts from a set of known input-output pairs, called training samples, while the steps (i.e. the model) are unknown. How, then, do we determine the model? By "regression/fitting". Using the simplest equation model as an analogy, let's go straight to the formula.
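A minimal version of that analogy (the original formula image is not reproduced here; this is the standard least-squares setup): fitting a line \hat{y} = w x + b to N training samples (x_i, y_i) means finding the w and b that minimize the error

    L(w, b) = \frac{1}{N} \sum_{i=1}^{N} \big( y_i - (w x_i + b) \big)^2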

The training process of a neural network is similar: the coefficients in the hidden-layer nodes are determined by training. The neural network model itself, however, is nonlinear and more complex. Feedforward computation, error backpropagation, and gradient descent are the methods used during training.
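As a toy sketch of that process on the linear model above (the data, learning rate, and step count here are made-up values, not from the article):

    import numpy as np

    # Toy data: y is roughly 3*x + 1 plus noise; training has to rediscover w = 3, b = 1.
    x = np.linspace(0.0, 1.0, 100)
    y = 3.0 * x + 1.0 + 0.05 * np.random.randn(100)

    w, b, lr = 0.0, 0.0, 0.1                # initial parameters and learning rate
    for step in range(2000):
        y_hat = w * x + b                   # feedforward: compute the predictions
        error = y_hat - y                   # prediction error
        grad_w = 2.0 * np.mean(error * x)   # partial derivatives of the mean squared error
        grad_b = 2.0 * np.mean(error)
        w -= lr * grad_w                    # gradient descent: step against the gradient
        b -= lr * grad_b

    print(w, b)                             # should end up close to 3 and 1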


2.2 The basic seq2seq model

The basic seq2seq model consists of three parts: the encoder, the decoder, and the intermediate state vector that connects them. The encoder learns from the input and encodes it into a fixed-size state vector c, which is passed to the decoder; the decoder then learns from the state vector c and produces the output.


2.2.1 RNN and LSTM

The encoder and decoder generally use a variant of the recurrent neural network (RNN) called the long short-term memory network (LSTM). The difference between an LSTM and an ordinary RNN is that the LSTM is much better at preserving state over long distances. See the figure below.

(a) general RNN

(b) LSTM

In an ordinary multilayer neural network (DNN), the hidden-layer state computed for one input is independent of the state computed for any other input; there are no connections across time steps.

In an RNN, the hidden state h_t at the current time step is influenced by the hidden state from the previous time step; that is, the RNN retains some of the earlier memory h_{t-1}. Take machine translation as an example: for the input "My coat is white, hers is blue", the word "coat" in the first half of the sentence provides information that is useful later. However, this memory weakens greatly as the distance in the sequence grows. The underlying reasons are not explained in detail here.

The LSTM uses gated, additive updates in each hidden unit to store memory selectively, much as our childhood memories are also kept selectively, which largely avoids the problem plain RNNs suffer from. In "My coat is white, hers is blue", by the time "hers" is translated, the gates have already decided how much of the earlier "my coat" information is passed along.
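For reference, the standard LSTM update is exactly this kind of gated, additive memory; with \sigma the sigmoid function and \odot elementwise multiplication:

    f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)            (forget gate)
    i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)            (input gate)
    \tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c)     (candidate memory)
    c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t   (additive cell-state update)
    o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)            (output gate)
    h_t = o_t \odot \tanh(c_t)                        (hidden state)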

2.2.2 The encoder-decoder model

The diagram above shows the basic structure of the seq2seq model in machine translation. The encoder accepts the input (for example: "I am a student") and, by passing state along the sequence, obtains the state vector c. c is then fed into the decoder to produce the translated output.
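As a minimal sketch of this structure in TensorFlow/Keras (the vocabulary sizes, embedding size, and number of hidden units below are assumptions, not values from the article):

    import tensorflow as tf
    from tensorflow.keras import layers, Model

    src_vocab, tgt_vocab, embed_dim, units = 10000, 10000, 256, 512   # assumed sizes

    # Encoder: embed the source tokens and run an LSTM; keep only the final states,
    # which play the role of the state vector c.
    enc_in = layers.Input(shape=(None,))
    enc_emb = layers.Embedding(src_vocab, embed_dim)(enc_in)
    _, state_h, state_c = layers.LSTM(units, return_state=True)(enc_emb)

    # Decoder: an LSTM initialized with the encoder states, followed by a softmax
    # over the target vocabulary at every time step.
    dec_in = layers.Input(shape=(None,))
    dec_emb = layers.Embedding(tgt_vocab, embed_dim)(dec_in)
    dec_seq, _, _ = layers.LSTM(units, return_sequences=True,
                                return_state=True)(dec_emb, initial_state=[state_h, state_c])
    dec_pred = layers.Dense(tgt_vocab, activation="softmax")(dec_seq)

    model = Model([enc_in, dec_in], dec_pred)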

There is a problem when this model is applied to machine translation: the decoder receives only a single, global piece of information. When translating "I am a student", by the time "student" is being translated there is actually no need to attend to the earlier "I am"; moreover, if the sentence is very long, c, being of fixed size, can hardly retain all of the information. We would therefore like the encoder to pass its output to the decoder with varying emphasis, as in the figure below.

That is, the decoder receives different state information at different times while translating the sequence. This is the attention mechanism.


2.3 The attention mechanism

Google's TensorFlow tutorial uses the attention mechanism proposed by Luong in 2015, in which the context passed to the decoder is a weighted sum of the encoder hidden states h_i; the weights w_i are produced by a small neural network that is trained along with the rest of the model. The attention mechanism raises the accuracy of machine translation significantly.
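In the "general" form from that paper, the decoder state s_t is scored against every encoder hidden state h_i, the scores are normalized with a softmax, and the context passed to the decoder is the weighted sum:

    score(s_t, h_i) = s_t^{\top} W_a h_i
    a_{t,i} = \frac{\exp(score(s_t, h_i))}{\sum_j \exp(score(s_t, h_j))}
    c_t = \sum_i a_{t,i} h_i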


3. Facebook vs. Google

In May 2017, Facebook applied the convolutional neural network (CNN), which dominates computer vision today (with a pile of principle formulas behind it), to machine translation. By exploiting the parallelism of CNNs, the resulting model (named fairseq) is fast compared with RNN-based systems (up to 9 times faster), and the translation quality is good (measured by BLEU).

Just one month later, Google hit back with the paper "Attention Is All You Need". It proposes a new attention mechanism, abandons both CNNs and RNNs, and builds the translation model directly on attention, pushing BLEU up further.

The principles behind these two models can be found in their respective papers.


4. The process of a complete seq2seq model

The first three parts briefly introduced current NMT research. The following steps walk through a complete seq2seq model. Formula derivations are still not involved, but because machine translation is part of natural language processing (NLP), some NLP knowledge comes up along the way.

1) Obtain the original data set to serve as training samples. The data set contains a large number of English-Chinese sentence pairs, split into two files, train_source and train_target, for example:

train_source (English)          train_target (Chinese)
I am a student                  我是一个学生
You're so clever.               你真聪明
...
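A minimal loading sketch, assuming each file holds one sentence per line and the file names are exactly those given above:

    with open("train_source", encoding="utf-8") as f:
        source_sentences = [line.strip() for line in f]
    with open("train_target", encoding="utf-8") as f:
        target_sentences = [line.strip() for line in f]

    pairs = list(zip(source_sentences, target_sentences))   # aligned sentence pairs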

2) Construct a word-to-index mapping table for each language, which can also be called a dictionary, in key:value format with the key serving as the index:

English dictionary:

{
    0: I,
    1: am,
    ...
}

Chinese dictionary:

{
    0: 我,
    1: 是,
    ...
}
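A sketch of how such a dictionary could be built from the source sentences loaded in step 1 (whitespace tokenization is an assumption and only works for the English side; the Chinese side needs its own word segmentation):

    source_words = sorted({w for s in source_sentences for w in s.split()})
    index_to_word = dict(enumerate(source_words))              # e.g. {0: 'I', 1: 'am', ...}
    word_to_index = {w: i for i, w in index_to_word.items()}   # reverse table used for encoding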

3) With the dictionaries, each sentence can be converted into an index vector such as a = [2, 45, 2, 5, 6]. These index vectors are independent of one another, so we also need to capture the correlations between the words in the training samples. This is what the embedding does: the embedding is a matrix, and the original index vector is mapped through the embedding matrix into another vector.

The mapped embedding vectors are correlated with one another (the underlying principle involves a pile of formulas). For example, "go", "went", and "walk" in the training samples can be represented as three correlated vectors, and there are many ways to measure that correlation (cosine similarity, ...).
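A sketch of the embedding lookup and one of those correlation measures (the embedding size is assumed, and the random matrix stands in for what training would actually learn):

    import numpy as np

    vocab_size, embed_dim = 10000, 128            # assumed sizes
    E = np.random.randn(vocab_size, embed_dim)    # embedding matrix, learned during training
    a = np.array([2, 45, 2, 5, 6])                # the index vector from above
    embedded = E[a]                               # shape (5, embed_dim): one vector per word

    def cosine(u, v):
        # cosine similarity, one way to measure how related two word vectors are
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    # e.g. cosine(embedded[0], embedded[1]) compares the first two words of the sentence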

4) We now have numeric training samples that the computer can recognize and process. Because the data set is very large, possibly millions of sentence pairs, training on all of it at once would take too long and would not work well. The samples are therefore divided into groups, each called a batch, and training proceeds batch by batch, for example:
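A sketch of cutting the sentence pairs from step 1 into batches of 64 (the batch size used in the training loop below):

    batch_size = 64
    batches = [pairs[i:i + batch_size] for i in range(0, len(pairs), batch_size)]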

5) The training process is as follows:

for epoch in range(num_epochs):     # total number of training rounds, e.g. 100,000
    for batch in batches:           # batches of 64 sentence pairs
        # Encoder inputs:  the train_source batch, the number of hidden-layer nodes,
        #                  and the number of hidden layers
        # Encoder outputs: the encoder outputs and the hidden state vector c
        # Decoder inputs:  the train_target batch and the hidden state vector c
        # Decoder outputs: the predicted output
        # Compute the loss function (the error used to correct the model during training)
        # Gradient descent (take partial derivatives of the error, head toward the
        # global minimum, and update the model parameters)
        # Keep searching for better model parameters
# Obtain the best model parameters
# Save the model
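If the hypothetical Keras model sketched in section 2.2.2 were used, this loop would correspond to something like the following, where enc_ids, dec_in_ids, and dec_target_ids stand for the padded index arrays built in steps 2-4 (all of these names are assumptions):

    model.compile(optimizer="adam",                        # a gradient-descent variant
                  loss="sparse_categorical_crossentropy")  # the loss function
    model.fit([enc_ids, dec_in_ids], dec_target_ids,
              batch_size=64, epochs=10)
    model.save("seq2seq_model.keras")                      # save the trained model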

6) Feed in a test sample such as "You are so handsome" and obtain the translated output, the Chinese sentence 你真帅.


5. Summary

The above is the translation process of the basic seq2seq model. seq2seq is not limited to machine translation; it also works well in language generation and other fields.

This article has tried to introduce deep learning for machine translation in a concise way, and some statements in the text may be imprecise.
