"AI Technology Camp" in-depth read: how deep learning advances in text, speech, and vision point to new trends for the future




[AI Technology Camp editor's note] AlphaZero taught itself to play; Boston Dynamics' Atlas robot landed backflips... In 2017, new advances in artificial intelligence came one after another, and all of them rest on a year of new breakthroughs in deep learning research and engineering. Around Christmas, Statsbot data scientist Ed Tyantov reviewed the year's deep learning research in text, speech, and vision, and tried to summarize the new trends that may shape the future.


What exactly are these trends? Let's take a look.




Text


Google's neural machine translation


About a year ago, Google announced the launch of a new model for Google Translate and described in detail its core architecture: a recurrent neural network.


The biggest breakthrough of this technology is that it narrows the gap between machine translation and human translation by 55-85%. It must be noted that without Google's huge dataset as support, a recurrent translation model like this could hardly achieve such good results.
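Under the hood, models of this kind are sequence-to-sequence networks with attention: at each decoding step, the decoder scores all encoder states and pools them into a single context vector. Below is a minimal, hypothetical sketch of dot-product attention in plain Python; the vectors and dimensions are invented for illustration and are not Google's actual model.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(decoder_state, encoder_states):
    """Dot-product attention: score each encoder state against the
    current decoder state, then return the attention weights and the
    weighted context vector."""
    scores = [sum(d * e for d, e in zip(decoder_state, h))
              for h in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [sum(w * h[i] for w, h in zip(weights, encoder_states))
               for i in range(dim)]
    return weights, context

# Toy example: three encoder states; the second aligns best with the query.
enc = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
dec = [0.0, 2.0]
weights, context = attention(dec, enc)
```

The weights show where the decoder "looks" in the source sentence while emitting each target word.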


Negotiations: will there be a deal?


You may have heard the silly news story that Facebook shut down its chatbot after it went out of control and invented its own language.


The chatbot was created by Facebook for negotiating deals. Its purpose is to negotiate with another agent and reach an agreement on how to split a set of items (such as books, hats, and so on) between the two of them. Each agent has its own goal in the negotiation, and neither knows the other's goal in advance.


To train the bots, the researchers collected a dataset of human negotiations and trained a recurrent neural network model in a supervised way. The bots were then trained further with reinforcement learning, negotiating with themselves, under a constraint that kept their language as similar to human language as possible.
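The negotiation setting itself is easy to state in code. The sketch below only enumerates the possible splits of a hypothetical item pool and finds the joint-value optimum that trained agents can be benchmarked against; the item names and valuations are invented, and the actual agents learn their strategy through an RNN plus reinforcement learning rather than by brute-force search.

```python
from itertools import product

# Hypothetical item pool and per-agent private valuations.
pool = {"book": 2, "hat": 1, "ball": 3}
values_a = {"book": 1, "hat": 5, "ball": 1}
values_b = {"book": 2, "hat": 2, "ball": 2}

def payoff(split, values):
    """An agent's score: the value of every item it receives."""
    return sum(values[item] * n for item, n in split.items())

def best_deal(pool, values_a, values_b):
    """Enumerate every split of the pool and return the one that
    maximises joint payoff -- the benchmark against which learned
    negotiators can be measured."""
    items = list(pool)
    best, best_score = None, -1
    for counts in product(*(range(pool[i] + 1) for i in items)):
        take_a = dict(zip(items, counts))
        take_b = {i: pool[i] - take_a[i] for i in items}
        score = payoff(take_a, values_a) + payoff(take_b, values_b)
        if score > best_score:
            best, best_score = (take_a, take_b), score
    return best, best_score

deal, score = best_deal(pool, values_a, values_b)
```

With these made-up valuations, the optimum gives the hat to agent A (who values it most) and the books and balls to agent B.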


Gradually, the bots learned a genuinely effective negotiating strategy: feigning interest in certain items during the negotiation so they could concede them later, profiting on the items they actually wanted.


Creating such an interactive bot was a new and very successful attempt. Further details and the code have been released as open source.


Of course, the claim that the bots "invented a new language" was blown far out of proportion. When training against the same agent, nothing special happens if the constraint of staying similar to human language is dropped: the algorithm simply modifies the language used in the interaction.


Over the past year, recurrent neural network models have been widely used, and their architectures have grown more complex. In some areas, however, a simple feedforward network, DSSM, can achieve comparable results. For example, Google Mail's "Smart Reply" feature reaches the same quality as the earlier LSTM-based implementation. In addition, Yandex has launched a new search engine based on such networks.
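The core idea of DSSM-style models is two feedforward "towers" that map a message and a candidate reply into a shared vector space, where ranking reduces to cosine similarity. A toy sketch, assuming a made-up deterministic word-hashing embedding in place of the trained towers (the real DSSM hashes letter trigrams and applies several dense layers):

```python
import math

DIM = 16

def bucket(word, dim=DIM):
    # Deterministic toy hash so the sketch is reproducible.
    return sum(ord(c) for c in word) % dim

def embed(text, dim=DIM):
    """Toy 'tower': hash words into a fixed-size bag-of-features
    vector standing in for a trained feedforward embedding."""
    vec = [0.0] * dim
    for word in text.lower().split():
        vec[bucket(word, dim)] += 1.0
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# Rank candidate replies for an incoming message by tower similarity.
query = embed("see you tomorrow")
replies = ["see you then", "the report is attached"]
ranked = sorted(replies, key=lambda r: cosine(query, embed(r)), reverse=True)
```

Because both towers are plain feedforward passes, scoring millions of candidates is far cheaper than running a recurrent decoder.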


Speech


WaveNet: a generative model of raw audio


DeepMind researchers recently reported their results on audio generation. In short, building on earlier autoregressive image-generation methods (PixelRNN and PixelCNN), they present WaveNet, an autoregressive convolutional model.



The network is trained end to end: from input text to output audio. Compared with human-level quality, the study reduced the gap by 50%, an excellent result. The main drawback, however, is low generation efficiency: because of the autoregressive process, samples are generated one at a time, and creating 1 second of audio takes roughly 1-2 minutes.
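The sequential bottleneck is easy to see in a sketch: every new sample requires a full forward pass through the stack of causal dilated convolutions over the history so far. The toy model below uses invented weights and a kernel of size 2; it illustrates only the generation loop, not the real WaveNet architecture (which adds gated activations, residual connections, and many more layers).

```python
import math

def causal_dilated_layer(x, w, dilation):
    """One causal dilated convolution (kernel size 2): each output
    depends only on the current sample and one sample `dilation`
    steps in the past, never on the future."""
    out = []
    for t in range(len(x)):
        past = x[t - dilation] if t >= dilation else 0.0
        out.append(math.tanh(w[0] * past + w[1] * x[t]))
    return out

def generate(seed, steps, dilations=(1, 2, 4), w=(0.6, 0.4)):
    """Autoregressive sampling: each new audio sample needs a full
    forward pass over the whole history, which is why WaveNet-style
    generation is slow."""
    samples = list(seed)
    for _ in range(steps):
        h = samples
        for d in dilations:
            h = causal_dilated_layer(h, w, d)
        samples.append(h[-1])  # emit exactly one sample per pass
    return samples

audio = generate(seed=[0.9, -0.9], steps=8)
```

Doubling the dilation at each layer is what lets the real network cover a long receptive field with few layers; the sequential loop is what makes sampling expensive.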




Listening to the result is a bit disappointing in one respect: if you remove the network's dependence on the input text and keep only its dependence on previously generated samples, the network produces sounds resembling human language, but they carry no meaning.


Here is an example of sound generated with this model. The same approach applies not only to speech but also to music.


Imagine music generated by the same model trained on a piano dataset, again without any dependence on input data.


If you are interested, read DeepMind's full write-up of this research.


Lip reading


Lip reading is another case of deep learning surpassing humans. Google DeepMind, in collaboration with the University of Oxford, published a paper reporting how a model trained on a television dataset outperformed a professional lip reader from the BBC channel.



The dataset contains 100,000 sentences with audio and video. An LSTM model is trained on the audio, and a CNN+LSTM model on the video. The state vectors of the two models are fed into a final LSTM model, which produces the result.



Different types of input data are used during training: audio only, video only, and audio+video combined. In other words, it is a "full channel" model.
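One way to read "full channel" is that the fused model must produce an answer whichever branch is present. A minimal sketch with invented dimensions and weights, where a missing modality is replaced by zeros before the concatenated state vector is fed to a toy linear head:

```python
DIM_A, DIM_V = 4, 4  # made-up sizes of the audio and video state vectors

def forward(audio_state, video_state, w):
    """One 'full channel' forward pass: either branch may be missing
    (replaced by zeros), so a single fused model handles audio-only,
    video-only, and audio+video input."""
    a = audio_state if audio_state is not None else [0.0] * DIM_A
    v = video_state if video_state is not None else [0.0] * DIM_V
    fused = a + v                        # concatenation of state vectors
    return sum(wi * xi for wi, xi in zip(w, fused))  # toy linear head

w = [1.0] * (DIM_A + DIM_V)
both = forward([1.0] * DIM_A, [2.0] * DIM_V, w)   # both modalities present
video_only = forward(None, [2.0] * DIM_V, w)      # audio branch missing
```

Training on all three input combinations is what forces the fused head to stay useful when one modality drops out.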




Synthesizing Obama: syncing lip movements to audio


The University of Washington has done serious work on generating the lip movements of former US President Barack Obama. He was chosen as the subject because of the enormous amount of his recordings available online (17 hours of high-definition video).



Since they could not get more data, the researchers applied several additional techniques to improve the final result. If you are interested, you can look into the details.




As you can see, the results are amazing. In the near future, you may not even be able to trust video of the president speaking.


Computer Vision


OCR: Google Maps and Street View


The Google Brain team reported in a blog post and a paper how they introduced a new OCR (optical character recognition) engine into Maps to recognize street signs and store signs.



In the course of developing this technology, they compiled a new FSNS (French Street Name Signs) dataset, which contains many complex examples. To recognize each sign, the network uses up to four photos of it. Features are extracted with a CNN, weighted with a spatial attention mechanism, and the result is fed into an LSTM.
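Spatial attention here means scoring every cell of the CNN feature map against the decoder's current query and pooling the map into one context vector per decoding step. A hypothetical plain-Python sketch (the grid, feature values, and query are all made up):

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def spatial_attention(feature_grid, query):
    """Score every spatial cell of a CNN feature map against the
    decoder's query vector, then pool the map into a single context
    vector for the next LSTM decoding step."""
    cells = [c for row in feature_grid for c in row]  # flatten HxW grid
    scores = [sum(q * f for q, f in zip(query, c)) for c in cells]
    weights = softmax(scores)
    dim = len(cells[0])
    context = [sum(w * c[i] for w, c in zip(weights, cells))
               for i in range(dim)]
    return weights, context

# 2x2 feature grid with 2-channel features; the query favours cells
# whose first channel is active.
grid = [[[1.0, 0.0], [0.0, 1.0]],
        [[0.0, 0.0], [1.0, 1.0]]]
weights, context = spatial_attention(grid, [2.0, 0.0])
```

This is what lets the model "focus" on the sign itself and ignore the noisy background around it.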




The same approach applies to the task of recognizing store names on signboards (the data can contain a lot of "noise," and the network must "focus" on the right place by itself). This algorithm has been applied to 80 billion photos.


Visual reasoning


In visual reasoning, a neural network is asked to answer a question about a photo, for example: "Is there a rubber object of the same size as the yellow metal cylinder in the picture?" The task is genuinely hard, and until recently it was not solved; accuracy reached only 68.5%.
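Questions of this kind are often modeled as small functional programs over a scene: filter objects by attributes, then compare them. Below is a hand-written sketch of such a program for the example question, over an invented scene; real systems must infer both the scene and the program from pixels and text, which is exactly what makes the task hard.

```python
# Hypothetical scene: each object is a dict of attributes, loosely in
# the spirit of visual-reasoning benchmarks.
scene = [
    {"shape": "cylinder", "material": "metal",  "color": "yellow", "size": 3},
    {"shape": "cube",     "material": "rubber", "color": "red",    "size": 3},
    {"shape": "sphere",   "material": "rubber", "color": "blue",   "size": 1},
]

def filter_objs(objs, **attrs):
    """Keep objects whose attributes match all the given values."""
    return [o for o in objs
            if all(o.get(k) == v for k, v in attrs.items())]

def answer(scene):
    """'Is there a rubber object of the same size as the yellow metal
    cylinder?' expressed as a chain of filter/compare steps."""
    ref = filter_objs(scene, material="metal", color="yellow",
                      shape="cylinder")[0]
    same_size_rubber = [o for o in filter_objs(scene, material="rubber")
                        if o["size"] == ref["size"]]
    return len(same_size_rubber) > 0
```

Executing the hand-written program is trivial; learning to produce and ground such a program from raw input is the open problem the 68.5% figure refers to.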

