From the Cold War to Deep Learning: An Illustrated History of Machine Translation
Selected from vas3k.com
Ilya Pestov
English Translator: Vasily Zubarev
Chinese Translator: Panda
The dream of high-quality machine translation has been around for many years, and many scientists have devoted their time and effort to it. From early rule-based machine translation to today's widely used neural machine translation, the technology has improved enough to meet the basic needs of many application scenarios. Recently, Ilya Pestov wrote an article on the history of machine translation in Russian; Vasily Zubarev translated it into English and published it on vas3k.com, and Machine Heart was authorized to translate it into Chinese. I hope that one day machines can help us accomplish tasks like this one.
- Russian version: http://vas3k.ru/blog/machine_translation/
- English version: http://vas3k.com/blog/machine_translation/
I open Google Translate twice as often as Facebook, and instant translation of price tags no longer feels like cyberpunk to me. It has become reality. It is hard to imagine that this is the result of a century-long struggle to develop machine translation algorithms, and that for half of that time there was no obvious success.
Yet the developments discussed in this article underlie all modern language processing systems, from search engines to voice-controlled microwaves. This article explores the evolution and structure of today's online translation technology.
P. P. Troyanskii's translating machine (illustration drawn from descriptions; sadly, no photographs survive)
In the beginning
The story begins in 1933, when the Soviet scientist Peter Troyanskii presented to the Academy of Sciences of the USSR a "machine for the selection and printing of words when translating one language into another." The invention was very simple: it consisted of cards in four different languages, a typewriter, and an old-school film camera.
The operator took the first word of the text, found the corresponding card, photographed it, and typed its morphological characteristics (noun, plural, gender, and so on) on the typewriter, whose keys each encoded one of these features. The typewriter's tape and the camera's film were used simultaneously, producing a set of frames containing the words and their morphology.
Despite all this promise, the invention was deemed "useless," as so often happened in the USSR. Troyanskii spent 20 years trying to finish his machine before dying of angina. No one in the world knew about it until two Soviet scientists found his patents in 1956.
That was at the dawn of the Cold War, when the Iron Curtain had just descended. On January 7, 1954, at IBM headquarters in New York, the Georgetown–IBM experiment began. For the first time in history, an IBM 701 computer automatically translated 60 Russian sentences into English.
"A girl who doesn't know any of the words of a Soviet language has knocked out these Russian messages on IBM cards. The brain, at an astonishing speed of two lines per second, has made its English translation on an automatic printer. "--IBM's press release
IBM 701
But the triumphant headlines hid one small detail: no one mentioned that the translated examples had been carefully selected and tested to exclude any ambiguity. For everyday use, the system was no better than a pocket phrasebook. Nevertheless, the arms race had begun: Canada, Germany, France, and especially Japan all joined the machine translation race.
The machine translation race
The futile struggle to improve machine translation lasted forty years. In 1966, the US ALPAC committee stated in its famous report that machine translation was expensive, inaccurate, and unpromising. It recommended focusing on dictionary development instead, which sidelined American researchers for almost a decade.
Even so, the foundations of modern natural language processing were laid through the attempts, research, and development of those scientists. All of today's search engines, spam filters, and personal assistants exist thanks to a bunch of countries spying on one another.
Rule-based machine translation (RBMT)
The first ideas about rule-based machine translation appeared in the 1970s. Scientists studied the work of interpreters and tried to make the still extremely slow computers of the time repeat those actions. These systems consisted of:
- A bilingual dictionary (e.g., Russian–English)
- A set of linguistic rules for each language (for example, nouns ending in certain suffixes such as -heit, -keit, -ung are feminine)
That's the whole system. If needed, it could be supplemented with extras such as lists of names, spelling correctors, and transliterators.
PROMT and Systran are the most famous examples of RBMT systems. If you want to feel the gentle breath of this golden age, go try Aliexpress.
Even these had their nuances and subspecies.
Direct machine translation
This is the most straightforward type of machine translation. It splits the text into words, translates them, slightly corrects the morphology, and harmonizes the syntax so that the whole thing sounds more or less right. When the sun goes down, trained linguists write the rules for each word.
The output returns some kind of translation. Usually, it's terrible. It's as if the linguists had wasted their time for nothing.
Modern systems do not use this approach at all, and modern linguists are grateful for that.
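To make the idea concrete, here is a minimal sketch of direct, word-for-word translation. The tiny dictionary and the sentence are made up for illustration; real systems of this kind also applied morphology rules hand-written by those linguists.

```python
# A toy illustration of direct (word-for-word) machine translation.
# The dictionary below is invented purely for this example.
DICTIONARY = {
    "ich": "I",
    "gehe": "go",
    "ins": "to the",
    "kino": "cinema",
}

def direct_translate(sentence: str) -> str:
    words = sentence.lower().rstrip(".").split()
    # Look up each word; keep it untranslated if the dictionary has no entry.
    translated = [DICTIONARY.get(w, w) for w in words]
    return " ".join(translated).capitalize() + "."

print(direct_translate("Ich gehe ins Kino."))  # -> "I go to the cinema."
```

This captures both the appeal and the weakness: anything beyond dictionary lookup and a cosmetic fix-up has to be written by hand as rules.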
Transfer-based machine translation
In contrast with direct translation, we first prepare for translation by determining the grammatical structure of the sentence, just as we were taught at school. Then we manipulate whole constructions, not individual words. This helps to get a reasonably good conversion of word order in the translation. In theory, anyway.
In practice, it still resulted in verbatim translation and exhausted the linguists. On the one hand, it brought simplified general grammar rules. On the other hand, it became more complicated, because the number of word constructions is far larger than the number of single words.
Interlingual machine translation
In this approach, the source text is transformed into an intermediate representation that is unified for all the world's languages (an interlingua). This is the interlingua Descartes dreamed of: a meta-language that follows universal rules and turns translation into a simple "back and forth" task. The interlingua can then be translated into any target language. What a singularity!
Because of this conversion, interlingual machine translation is often confused with transfer-based systems. The difference is that the linguistic rules are specific to each individual language and to the interlingua, not to language pairs. This means we can add a third language to an interlingual system and translate between all three, which is something we cannot do with a transfer-based system.
It looks perfect, but in real life it isn't. Creating such a universal interlingua is extremely hard; many scientists have worked on it their whole lives. They haven't succeeded, but thanks to them we now have morphological, syntactic, and even semantic levels of representation. And the meaning-text theory alone cost a fortune!
The idea of an intermediate language will come back, though. Let's wait and see.
As you can see, all RBMT systems are dumb and terrifying, so they are rarely used except in specific cases such as weather report translation. Among the most frequently mentioned advantages of RBMT are its morphological accuracy (words are not confused), the reproducibility of its results (all translators get the same output), and the ability to tune it to a specific subject area (for example, to teach it the terminology of economists or programmers).
Even if someone did succeed in creating an ideal RBMT, and linguists reinforced it with all the spelling rules, there would always be exceptions: irregular verbs in English, separable prefixes in German, suffixes in Russian, and situations where people simply say things differently. Any attempt to account for all the nuances would cost millions of hours of work.
And don't forget polysemy. The same word can have different meanings in different contexts, which leads to different translations. How many meanings can you find in this sentence: "I saw a man on a hill with a telescope"?
Languages did not develop according to a fixed set of rules, a fact that linguists love. They were shaped far more strongly by the history of invasions over the past three hundred years. How could you explain that to a machine?
Forty years of the Cold War did not help find any definitive solution. RBMT was dead.
Example-based machine translation (EBMT)
Japan was especially interested in the machine translation race. The reason was not the Cold War but something else: very few people in the country knew English. That promised to be a serious problem as globalization approached, so the Japanese were extremely motivated to find a workable method of machine translation.
Rule-based English–Japanese translation is extremely complicated. The two languages have completely different structures; almost all the words have to be rearranged and new ones added. In 1984, Makoto Nagao of Kyoto University came up with an idea: use ready-made phrases instead of translating everything from scratch each time.
Suppose we want to translate a simple sentence: "I'm going to the cinema." We have already translated a similar sentence, "I'm going to the theater," and we can find the word "cinema" in the dictionary.
Then all we need to do is figure out the difference between the two sentences, translate the missing word, and try not to mess it up. The more examples we have, the better the translation.
That's exactly how I build phrases in foreign languages I barely know!
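Here is a minimal sketch of that idea, assuming we have stored one previously translated sentence pair together with its word alignment. The example pair, the alignment, and the mini-dictionary are all invented for illustration.

```python
# Toy example-based translation: reuse a stored translation of a similar
# sentence and patch only the word that differs. The example pair, its word
# alignment, and the dictionary are all made up for this sketch.
example_src = "I'm going to the theater".split()
example_tgt = "Ich gehe ins Theater".split()
# alignment[i] = position in example_tgt produced by example_src[i]
alignment = {0: 0, 1: 1, 2: 2, 3: 2, 4: 3}
dictionary = {"cinema": "Kino"}

def ebmt_translate(sentence: str) -> str:
    words = sentence.split()
    target = list(example_tgt)
    for i, word in enumerate(words):
        if i < len(example_src) and word != example_src[i]:
            # The new sentence differs here: swap in the dictionary translation.
            target[alignment[i]] = dictionary.get(word, word)
    return " ".join(target)

print(ebmt_translate("I'm going to the cinema"))  # -> "Ich gehe ins Kino"
```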
EBMT showed scientists around the world the way: it turns out you can simply feed the machine existing translations instead of spending years building rules and exceptions. Not yet a revolution, but clearly the first step toward one. The revolutionary invention of statistical machine translation would arrive within just five years.
Statistical machine translation (SMT)
In early 1990, the IBM Research Center first demonstrated a machine translation system that knew nothing about rules or linguistics as a whole. It analyzed similar texts in two languages and tried to understand the patterns.
The idea was concise and elegant. An identical sentence in two languages is split into words, which are then matched up. This operation is repeated roughly 500 million times, counting patterns such as how many times "Das Haus" is translated as "house," "building," or "construction."
If most of the time the source word was translated as "house," the machine used that. Note that we set no rules and used no dictionaries: the machine drew all its conclusions on its own, guided by statistics and by the logic of "if people translate it this way, so will I." And so statistical translation was born.
The method was more efficient and accurate than everything that came before it, and no linguists were needed. The more text we used, the better the translation we got.
A glimpse inside Google's statistical translation. It shows not only the probabilities but also reverse-translation statistics.
One question remained: how does the machine match "Das Haus" with "building"? How do we know which translation is correct?
The answer is that we can't know. At first, the machine assumes that the word "Das Haus" is equally related to every word in the translated sentence. Then, as "Das Haus" appears in other sentences, its count of associations with "house" grows. This is the word alignment algorithm, a typical task in university-level machine translation courses.
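For the curious, here is a bare-bones sketch in the spirit of IBM Model 1 word alignment (not any production system): every target word starts out equally likely for every source word, and expectation-maximization over a tiny invented corpus gradually concentrates the probability where words keep co-occurring.

```python
from collections import defaultdict

# A tiny parallel "corpus" (invented) for a Model-1-style alignment sketch.
corpus = [
    ("das haus".split(), "the house".split()),
    ("das buch".split(), "the book".split()),
    ("ein buch".split(), "a book".split()),
]

t = defaultdict(lambda: 1.0)              # t[(e, f)]: start out uniform

for _ in range(20):                       # EM iterations
    count = defaultdict(float)            # expected counts c(e, f)
    total = defaultdict(float)            # expected counts c(f)
    for f_sent, e_sent in corpus:
        for e in e_sent:
            norm = sum(t[(e, f)] for f in f_sent)
            for f in f_sent:
                frac = t[(e, f)] / norm   # E-step: fractional alignment
                count[(e, f)] += frac
                total[f] += frac
    for (e, f), c in count.items():       # M-step: re-estimate t(e | f)
        t[(e, f)] = c / total[f]

print(round(t[("house", "haus")], 2))     # close to 1.0
print(round(t[("the", "haus")], 2))       # close to 0.0
```

A single sentence can't tell "das" from "haus," but across sentences the ambiguity melts away, which is exactly the intuition described above.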
The machine needs millions upon millions of bilingual sentences to collect these statistics for each word. Where do we get them? Well, we simply take the transcripts of European Parliament and UN Security Council meetings, which are available in the languages of all member countries and free to download:
- UN corpora: https://catalog.ldc.upenn.edu/ldc2013t06
- Europarl corpora: http://www.statmt.org/europarl
Word-based SMT
In the beginning, the earliest statistical translation systems worked by splitting sentences into words, since this approach was straightforward and logical. IBM's first statistical translation model was called Model 1. Quite elegant, right? Guess what they called the second one?
Model 1: Word-wise correspondence
Model 1 used the classic approach: split the sentences into words and count the statistics. Word order was not taken into account. The only trick was that one word could be translated into multiple words. For example, "Der Staubsauger" could become "vacuum cleaner," but that did not mean the reverse would hold.
Here's a simple Python implementation: https://github.com/shawa/IBM-Model-1
Model 2: Taking word order into account
Model 1's lack of knowledge about word order was a problem, and in some cases an important one.
Model 2 dealt with this: it memorized the position a word usually takes in the output sentence and shuffled the words into a more natural order in an intermediate step. Things got better, but the results were still unsatisfying.
Model 3: Extra additions
New words appear in translations quite often, such as articles in German or "do" in English negations. For example, "Ich will keine Persimonen" → "I do not want persimmons." To handle this, Model 3 added two more steps:
- Inserting a NULL token if the machine decides a new word is needed
- Choosing the right particle or word for each token–word alignment
Model 4: Word alignment
Model 2 considered word alignment, but it knew nothing about reordering. For example, adjectives often swap places with nouns, and no matter how well word positions are memorized, the output won't improve. Therefore, Model 4 took into account the so-called "relative order": the model learns whether two words always trade places.
Model 5: Fixing errors
Nothing new here. Model 5 got more parameters to learn and fixed the problem of conflicting word positions.
Despite being revolutionary in nature, word-based systems still could not handle grammatical case, gender, or homonymy. According to the machine, every word had a single true translation. Such systems are no longer in use; they have been replaced by more advanced phrase-based methods.
Phrase-based SMT
This approach builds on all the principles of word-based translation: statistics, reordering, and lexical tricks. But when learning, it splits the text not only into words but also into phrases. More precisely, these are n-grams: contiguous sequences of n words.
As a result, the machine learned to translate stable combinations of words, which noticeably improved accuracy.
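Here's a quick sketch of what an n-gram is, using a made-up two-sentence corpus; real phrase tables are of course extracted from aligned bilingual text, not from one language alone.

```python
from collections import Counter

def ngrams(words, n):
    """All contiguous sequences of n words."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

print(ngrams("i am going to the cinema".split(), 2))
# -> [('i', 'am'), ('am', 'going'), ('going', 'to'), ('to', 'the'), ('the', 'cinema')]

# Frequent n-grams become translation units instead of single words.
corpus = ["i am going to the cinema", "i am going to the theater"]
counts = Counter(g for line in corpus for g in ngrams(line.split(), 3))
print(counts.most_common(3))
# -> the trigrams shared by both sentences come out on top, each with count 2
```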
The trick was that these phrases were not always simple syntactic constructions, and translation quality dropped sharply whenever anyone who knew linguistics interfered with the phrase structure. Frederick Jelinek, a pioneer of computational linguistics, once joked: "Every time I fire a linguist, the performance of the speech recognizer goes up."
Besides improving accuracy, phrase-based translation offered more options in choosing bilingual texts to learn from. For word-based translation, an exact match between source texts was critical, which ruled out literary or free translations. Phrase-based translation had no problem learning from them. To improve translation quality, researchers even started parsing news websites in different languages.
Starting in 2006, everyone began using this approach. Google Translate, Yandex, Bing, and other well-known online translators worked as phrase-based systems right up until 2016. You can probably recall moments when Google translated a sentence flawlessly and moments when it produced complete nonsense. That nonsense came from the phrase-based machinery.
The old rule-based approach consistently produced predictable but terrible results. Statistical methods produced surprising and puzzling ones. Google Translate would turn "three hundred" into "300" without any hesitation. This is what's called a statistical anomaly.
Phrase-based translation became so popular that when you hear "statistical machine translation," this is most likely what is meant. Until 2016, every study lauded phrase-based translation as the best-performing approach. Back then, no one even suspected that Google was already stoking its fires, preparing to change the entire landscape of machine translation.
Syntax-based SMT
This method deserves a brief mention. For many years before the emergence of neural networks, syntax-based translation was considered "the future of translation," but the idea never took off.
Proponents of syntax-based translation believed it could be merged with the rule-based approach. The sentence must be given a reasonably accurate syntactic analysis: determine the subject, the predicate, and the other parts, and then build a sentence tree. Using it, the machine learns to convert syntactic units between languages and translates the rest by words or phrases. That would have solved the word alignment problem once and for all.
An example taken from Yamada and Knight [2001] and this great slide deck (http://homepages.inf.ed.ac.uk/pkoehn/publications/esslli-slides-day5.pdf)
The problem is that syntactic parsing works terribly, despite the fact that we consider it solved (since ready-made libraries exist for many languages). I have tried using syntactic trees for tasks slightly more complicated than parsing out the subject and the predicate. Every single time I gave up and used another method.
If you have succeeded at least once, please let me know.
Neural machine translation (NMT)
In 2014, a remarkable paper on using neural networks for machine translation was published: https://arxiv.org/abs/1406.1078. The internet paid no attention to the study, except for Google, which rolled up its sleeves and got to work. Two years later, in September 2016, Google made an announcement that changed the field of machine translation: Google Translate had integrated neural networks, achieving a disruptive breakthrough in machine translation.
The idea is close to style transfer between photos. Do you know apps like Prisma, which render pictures in the style of a famous artwork? There is no magic there: a neural network was taught to recognize the artist's paintings. Next, the final layers containing the network's decision were removed. The resulting stylized image is just an intermediate image the network produced. It's the network's own fantasy, and we consider it beautiful.
If we can transfer a style onto a photo, could we impose another language on a source text? Think of the text as having a certain "artist's style," one we would like to transfer while keeping the essence of the text intact.
Imagine I'm trying to describe my dog: average build, sharp nose, short tail, always barking. If I gave you this set of the dog's features, and the description were precise, you could draw it even though you have never seen it.
Now imagine the source text is a set of specific features. Basically, this means you can encode it and then let another neural network decode it back into text, but in a different language. The decoder knows only its own language; it has no idea where the features came from, but it can express them in, say, Spanish. Continuing the analogy, it doesn't matter how you draw the dog—with crayons, watercolors, or your finger—you can still draw it.
Once again: one neural network can only encode a sentence into a specific set of features, and another can only decode them back into text. Neither knows anything about the other; each knows only its own language. Remind you of anything? The intermediate language is back!
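For the technically inclined, here is a minimal sketch of that encoder-decoder idea in PyTorch. It illustrates the principle only, not Google's actual architecture; the vocabulary sizes and tensor shapes are made up.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Reads source-language token ids and squeezes them into a feature vector."""
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.GRU(hidden_size, hidden_size, batch_first=True)

    def forward(self, src):                  # src: (batch, src_len)
        _, hidden = self.rnn(self.embed(src))
        return hidden                         # the "features" of the sentence

class Decoder(nn.Module):
    """Unfolds the feature vector into target-language tokens, step by step."""
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, prev_tokens, hidden):  # prev_tokens: (batch, tgt_len)
        output, hidden = self.rnn(self.embed(prev_tokens), hidden)
        return self.out(output), hidden       # scores over the target vocabulary

# Toy usage: a batch of 3 sentences, invented vocabulary sizes.
enc, dec = Encoder(1000, 256), Decoder(1200, 256)
features = enc(torch.randint(0, 1000, (3, 7)))           # encode 7 source tokens
logits, _ = dec(torch.randint(0, 1200, (3, 5)), features)
print(logits.shape)                                       # torch.Size([3, 5, 1200])
```

Note that the decoder never sees the source words, only the feature vector, which is the whole point: it knows nothing about the source language.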
The question is, how do we find those features? They're obvious for a dog, but what are they for text? Thirty years ago, scientists were already trying to create a universal language code, and in the end they failed.
Nevertheless, we now have deep learning, and finding features is its essential task. The main difference between deep learning and classical neural networks is precisely this ability to search for specific features without any understanding of their nature. If the neural network is big enough and you have a couple of thousand graphics cards at hand, it can find these features in text as well.
In theory, we could take the features found by the neural networks and hand them to linguists, opening up brave new horizons for them.
The question is, what type of neural network should be used for encoding and decoding? Convolutional neural networks (CNNs) are perfect for images, since they operate on independent blocks of pixels.
But there are no independent blocks in text: every word depends on its surroundings. Text, speech, and music are always continuous. So recurrent neural networks (RNNs) are the best choice for handling them, since they remember previous results—in our case, the previous words.
RNNs are now used in many applications: Siri's speech recognition (parsing sequences of sounds, where each sound depends on the previous one), keyboard suggestions (remembering what came before and guessing the next word), music generation, and chatbots.
For nerds like me: in fact, the architectures of neural translators vary widely. A regular RNN was used at first, then upgraded to a bidirectional RNN, in which the translator considers not only the words before the source word but also the words after it. That is much more effective. It was then followed by a hardcore multi-layer RNN with LSTM units, which allows long-term storage of the translation context.
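In PyTorch terms (again just an illustration, with made-up hyperparameters), moving from the plain GRU in the sketch above to the kind of deep bidirectional LSTM encoder described here is mostly a matter of constructor arguments:

```python
import torch.nn as nn

# A deeper encoder RNN in the spirit described above: several stacked LSTM
# layers reading the sentence in both directions. Sizes are illustrative only.
deep_encoder_rnn = nn.LSTM(
    input_size=256,      # size of the word embeddings fed in
    hidden_size=256,     # size of each layer's hidden state
    num_layers=4,        # stacked layers keep longer-range context
    bidirectional=True,  # also read the sentence right to left
    batch_first=True,
)
# Note: with bidirectional=True, each time step now outputs 2 * 256 features.
```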
In just two years, neural networks surpassed everything that had happened in translation over the previous twenty. Neural translation makes 50% fewer word-order errors, 17% fewer lexical errors, and 19% fewer grammatical errors. The networks have even learned to harmonize gender and case across languages, and nobody taught them to do so.
The most noticeable improvements came in areas where direct translation had never been used. Statistical machine translation methods always worked through English as the key source: to translate from Russian to German, the machine would first translate the Russian text into English and then the English into German, which causes a double loss.
Neural translation doesn't need that: only a decoder is required. For the first time ever, direct translation between languages that share no common dictionary became possible.
Google Translate (since 2016)
In 2016, Google switched on neural translation for nine languages. It developed a system called Google Neural Machine Translation (GNMT), consisting of 8 encoder and 8 decoder RNN layers, plus attention connections from the decoder network.
They split not only the sentences but also the words themselves. That is how they tackled one of NMT's major problems: rare words. NMT is helpless when a word is not in its lexicon—"vas3k," for example. I doubt anyone taught the neural network to translate my nickname. When it meets a rare word, GNMT tries to break it into word fragments and then assembles the translation from those fragments. A very clever way to do it.
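Here's a minimal sketch of that fragment idea: a greedy longest-match segmenter over a tiny made-up subword vocabulary. The real GNMT wordpiece inventory is learned from the training data, not written by hand.

```python
# Toy greedy segmentation of an out-of-vocabulary word into known fragments.
# The subword vocabulary below is invented; real systems learn theirs
# (wordpieces / BPE) from the training corpus.
subwords = {"vas", "va", "s", "3", "k"}

def segment(word):
    pieces, i = [], 0
    while i < len(word):
        # Take the longest known fragment starting at position i.
        for j in range(len(word), i, -1):
            if word[i:j] in subwords:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])  # unknown character: fall back to one char
            i += 1
    return pieces

print(segment("vas3k"))  # -> ['vas', '3', 'k']
```

Each fragment is then handled like an ordinary vocabulary item, and the translated pieces are glued back together.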
Hint: the Google Translate used for website translation inside the browser still runs the old phrase-based algorithm. Google has not upgraded it, and the difference from the online version is quite noticeable.
The online version of Google Translate uses a crowdsourcing mechanism: people can choose the version they consider most correct, and if enough users agree, Google will always translate the phrase that way and mark it as a special case. This works brilliantly for short, everyday phrases such as "Let's go to the cinema" or "I'm waiting for you." Google's conversational English is better than mine—how annoying!
Microsoft's Bing works the same way as Google Translate. But Yandex is different.
Yandex Translate (since 2017)
Yandex launched its neural translation system in 2017. Its main feature, the company claimed, was its hybrid nature: Yandex combines neural and statistical approaches to translate a sentence and then picks the better result with its beloved CatBoost algorithm.
The thing is, neural translation often fails on rare words and short phrases, since it relies on context to choose the right word. If a word appears only a handful of times in the training data, it is hard to get it right. In such cases, a simple statistical translation finds the correct word quickly and easily.
After a period is added at the end of the sentence, Yandex's translation improves, because the neural machine translation kicks in.
Yandex has not shared the technical details. It fends us off with marketing press releases. Fine.
It looks as though Google uses SMT for translating individual words and short phrases. They don't mention this in any article, but it's quite noticeable if you compare the translations of short and long expressions. SMT is also used for displaying word statistics.
Conclusions and future
Everyone is still excited about the idea of a "Babel fish"—an instant voice translator. Google has taken a step in that direction with its Pixel Buds earphones, but it's still not the thing we dream of. Instant voice translation is different from ordinary translation: the system needs to know when to start translating and when to shut up and listen. I haven't yet seen a good way to solve this. Unless, maybe, Skype...
And there's another open problem: all learning is limited to sets of parallel text blocks. The deepest neural networks still learn from parallel texts; without such a resource, a neural network cannot learn. Humans, by contrast, can expand their vocabulary by reading books and articles, even without translating them into their native language.
If humans can do it, then in theory neural networks can too. I have found only one prototype that attempts this: a network that knows one language gains experience by reading texts in another language: https://arxiv.org/abs/1710.04087. I would try it myself, but I'm silly. OK, that's it.
Useful Links
- Philipp Koehn, Statistical Machine Translation: https://www.amazon.com/dp/0521874157/. The most complete collection of methods I have found.
- Moses: http://www.statmt.org/moses/. A popular library for building your own statistical translation system.
- OpenNMT: http://opennmt.net/. Another library, this one for building neural translators.
- An article by my favorite blogger explaining RNNs and LSTMs: https://colah.github.io/posts/2015-08-Understanding-LSTMs/
- A video, "How to Make a Language Translator": https://youtu.be/nrbnh4qbphi. The guy is funny and explains it well, though not in enough depth.
- A text tutorial from TensorFlow on building a neural translator: https://www.tensorflow.org/tutorials/seq2seq. For those who want more examples and want to try the code.