DeepMind publishes WaveNet, a deep generative model of raw audio waveforms that opens up countless possibilities for TTS


WaveNet is a convolutional neural network that can model any human voice, and the speech it generates sounds more natural than that of the best existing text-to-speech systems, reducing the gap with real human voices by more than 50%.

We will also show that the same network can synthesize other audio signals, such as music, and present some automatically generated piano pieces.

A machine that can speak

Allowing people to talk naturally with machines has long been a dream of human-computer interaction. Over the past few years, the application of deep neural networks (for example, in Google voice search) has revolutionized computers' ability to understand natural speech. However, generating speech with computers, the task usually called speech synthesis or text-to-speech (TTS), still relies largely on concatenative TTS, in which a very large database of short speech fragments is recorded from a single speaker and then recombined to form complete utterances. Without recording an entirely new database, this approach makes it difficult to modify the voice (for example, to switch to a different speaker, or to change the emphasis or emotion conveyed in the speech).

This has created a great demand for parametric TTS, in which all the information needed to generate speech is stored in the parameters of a model, and the content and characteristics of the speech can be controlled through the model's inputs. So far, however, parametric TTS has tended to sound less natural than concatenative TTS, at least for syllabic languages such as English. Existing parametric models typically generate the audio signal by passing their computed outputs through signal-processing algorithms known as vocoders.

WaveNet changes this paradigm by directly modeling the raw waveform of the audio signal, one sample at a time. As well as yielding speech that sounds more natural, working with raw waveforms means that WaveNet can model any kind of audio, including music.

WaveNet

Researchers usually avoid modeling raw audio because it ticks by so quickly: there are typically 16,000 or more samples per second, with important structure at many time scales. Building a fully autoregressive model, in which the prediction for each sample is influenced by all previous samples (in statistical terms, each predictive distribution is conditioned on all previous observations), is clearly a challenging task.
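To make the autoregressive idea concrete, here is a toy sketch (not the actual WaveNet architecture) of how the joint probability of a waveform factorizes into per-sample conditional distributions. The function `conditional` is a hypothetical stand-in for the network's output; the real model uses a deep dilated convolutional network to compute this distribution.

```python
import numpy as np

N_LEVELS = 256  # WaveNet quantizes each audio sample to 256 amplitude levels

def conditional(history):
    """Hypothetical stand-in for the network: a distribution over the
    next sample given all previous samples. Here it simply favors
    repeating the last sample, to make the dependence on history explicit."""
    logits = np.zeros(N_LEVELS)
    if history:
        logits[history[-1]] = 2.0
    e = np.exp(logits - logits.max())
    return e / e.sum()

def joint_log_prob(samples):
    # Chain rule: log p(x) = sum over t of log p(x_t | x_1 .. x_{t-1})
    return sum(np.log(conditional(samples[:t])[samples[t]])
               for t in range(len(samples)))

lp = joint_log_prob([10, 10, 200, 200])  # log-probability of a tiny "waveform"
```

Because each conditional favors repeating the previous value, a sequence that repeats samples scores a higher log-probability than one that jumps around, which is exactly the kind of structure an autoregressive model exploits.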

However, the PixelRNN and PixelCNN models we published earlier this year showed that it is possible to generate complex natural images one pixel at a time, or even one color channel at a time, requiring thousands of predictions per image. This inspired us to adapt our two-dimensional PixelNets into a one-dimensional WaveNet.

The animation above shows the internal structure of a WaveNet: a fully convolutional neural network in which the convolutional layers have various dilation factors, allowing the receptive field to grow exponentially with depth and cover thousands of time steps.
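The exponential growth of the receptive field can be checked with a few lines of arithmetic. This is a sketch assuming kernel size 2 and a doubling dilation schedule repeated in blocks; the exact number of blocks here is chosen for illustration, not taken from the published model.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (in time steps) of a stack of dilated causal
    convolutions: each layer reaches (kernel_size - 1) * dilation
    further into the past."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

# Dilation doubles at every layer and the schedule repeats in blocks:
# 1, 2, 4, ..., 512, then again, three times over (illustrative depth).
dilations = [2 ** i for i in range(10)] * 3
span = receptive_field(2, dilations)  # covers thousands of time steps
```

With only 30 layers, the network can condition each prediction on roughly 3,000 past samples, which is why dilation is so much cheaper than stacking ordinary convolutions to reach the same span.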

At training time, the input sequences are real waveforms recorded from human speakers. After training, we can sample from the network to generate synthetic utterances. At each step of sampling, a value is drawn from the probability distribution computed by the network. This value is then fed back as input, and a new prediction is made for the next step. Building up samples one step at a time like this is computationally expensive, but we have found it essential for generating complex, realistic-sounding audio.

Improving on the best text-to-speech models
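The sampling loop described above can be sketched as follows. Here `predict_distribution` is a hypothetical placeholder for the trained network's forward pass, which in the real system produces a softmax over 256 quantized amplitude levels.

```python
import numpy as np

rng = np.random.default_rng(0)
N_LEVELS = 256  # quantized amplitude levels

def predict_distribution(history):
    """Hypothetical placeholder for the network's output distribution
    over the next sample, given everything generated so far."""
    logits = 0.01 * rng.normal(size=N_LEVELS)
    e = np.exp(logits - logits.max())
    return e / e.sum()

def generate(n_samples):
    samples = []
    for _ in range(n_samples):
        p = predict_distribution(samples)   # one forward pass per sample
        x = rng.choice(N_LEVELS, p=p)       # draw a value from p
        samples.append(int(x))              # feed it back as the next input
    return samples

audio = generate(100)  # 100 samples of synthetic "audio"
```

One forward pass per sample is what makes generation costly: producing a single second of 16 kHz audio requires 16,000 passes through the network.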

To evaluate WaveNet's performance, we trained it on some of Google's TTS datasets. The figure below compares the quality of WaveNet, on a scale of 1 to 5, with Google's current best TTS systems (parametric and concatenative), using Mean Opinion Scores (MOS). MOS is a standard measure of subjective sound quality, obtained from blind tests with human listeners. As the figure shows, WaveNet reduces the gap between the best existing models and human-level performance (for US English and Mandarin Chinese) by more than 50%.

For both Chinese and English, Google's current TTS systems are considered among the best worldwide, so improving on both with a single model is a major achievement.

Knowing what to say

To use WaveNet to turn text into speech, we have to tell it what the text is. We do this by transforming the text into linguistic and phonetic features (which include phonemes, syllables, words, and so on) and feeding them into WaveNet. This means the network's predictions are conditioned not only on the previous audio samples, but also on what we want it to say.
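One way to picture this conditioning (a minimal sketch, not DeepMind's actual pipeline): linguistic features are computed per frame, while the network predicts one audio sample at a time, so each feature vector has to be upsampled to the audio rate before being fed in alongside the waveform.

```python
import numpy as np

def upsample_features(frame_features, samples_per_frame):
    """Repeat each per-frame feature vector so that every audio sample
    has a matching conditioning vector (simple hold-style upsampling)."""
    return np.repeat(frame_features, samples_per_frame, axis=0)

# 3 frames of 2 hypothetical linguistic features each
frames = np.arange(6, dtype=float).reshape(3, 2)
per_sample = upsample_features(frames, 4)  # one feature row per audio sample
```

Every audio sample within a frame then sees the same linguistic context, while the network's own convolutions supply the fine-grained variation between samples.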

If we train the network without the text sequences, it can still generate speech, but it has to make up what to say. The audio produced under this condition is a kind of babbling, in which real-sounding words are cut together with made-up word-like sounds.

Note that WaveNet sometimes also generates non-speech sounds, such as breathing and mouth movements; this reflects the great flexibility of a raw-audio model.

A single WaveNet can learn the characteristics of many different voices, male and female. To make sure it knows which voice to use for a given utterance, we condition the network on the identity of the speaker. Interestingly, we found that training on many speakers made WaveNet better at modeling any single speaker than training on that speaker alone, suggesting a form of transfer learning.
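Speaker identity can be supplied to the network as a simple global conditioning vector. The one-hot encoding below is a minimal sketch, with the number of speakers chosen arbitrarily for illustration.

```python
import numpy as np

N_SPEAKERS = 100  # illustrative size of a multi-speaker training set

def speaker_vector(speaker_id, n_speakers=N_SPEAKERS):
    """One-hot encoding of the speaker's identity, supplied as a global
    conditioning input so that one network can voice many speakers."""
    h = np.zeros(n_speakers)
    h[speaker_id] = 1.0
    return h

v = speaker_vector(7)  # conditioning vector selecting speaker 7
```

Swapping this vector at generation time is what lets the same trained network say the same sentence in different voices.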

By changing the speaker identity, we can use WaveNet to say the same thing in different voices.

Similarly, we could provide the model with additional inputs, such as emotions or accents, to make the generated speech even more diverse and interesting.

Making music

Since WaveNet can model any kind of audio signal, we thought it would be fun to try generating music as well. Unlike the TTS experiments, we did not condition the network on an input sequence telling it what to play (such as a musical score); instead, we simply let WaveNet generate whatever it wanted. When we trained it on a dataset of classical piano music, it produced fascinating pieces.

WaveNet opens up a great many possibilities for TTS, and beyond that for music generation and audio modeling in general. The fact that directly generating 16 kHz audio one time step at a time with a deep neural network works at all is remarkable, and we are very much looking forward to the surprises WaveNet will bring in the future.
