Understanding DeepMind's WaveNet model and its Keras implementation


This article walks through the basic WaveNet model and its Keras implementation code, to help beginners who, like me, have just entered the field and find the code hard to understand.

Seanliao

blog: www.cnblogs.com/seanliao/

This is an original blog post; please credit the source when reposting.

I. What is WaveNet?

Simply put, WaveNet is a generative model, in the same family as VAEs, GANs, and so on. Its biggest distinguishing feature is that it generates raw audio directly. It was presented by DeepMind in 2016, and it achieves state-of-the-art results on TTS (text-to-speech) tasks.

In addition, WaveNet-style models can also be used to generate text, generate images, do speech recognition, and so on.

For more detail, refer to the following resources:

DeepMind WaveNet blog

Keras Implementation Code

WaveNet paper

I believe that learning a network architecture should combine the paper with the code, and that the basis for understanding any model is first knowing its inputs and outputs. But the most maddening thing about this code is that it has! no! comments!

The DeepMind blog has a very clear animation of how the model works. Be sure to look at it and take it in!

WaveNet's network structure is not complex; it is essentially a variant of a CNN. But the various articles introducing WaveNet only describe its structure at length and never say what the model's inputs and outputs actually are, which is very unfriendly to beginners.

This article focuses on how the input data is organized in the WaveNet Keras implementation code.

II. How the model operates

The principle and structure of the model are not discussed here (in fact, WaveNet is easy to understand as long as you understand CNNs). Let's first talk about what WaveNet actually "does".

Since I don't know how to embed the animated diagram here, please view it on the DeepMind blog, together with the original paper:

Simply put, the core of the model is: given an input sequence (x_1, x_2, x_3, ..., x_n), predict the next sample x_{n+1} from the preceding samples x_1 ~ x_n. The predicted x_{n+1} is then appended to the input sequence, and x_2 ~ x_{n+1} are used to predict x_{n+2}, and so on.

In this way, starting from a seed sequence (x_1, x_2, x_3, ..., x_n) as input, this model can generate a sequence of any length!
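
To make this concrete, here is a minimal sketch of that generation loop. It is purely illustrative: model and seed are placeholder names (not names from the repo), and it assumes the model outputs a probability distribution over the next sample at each time step.

    import numpy as np

    def generate(model, seed, n_samples):
        # `model` and `seed` are hypothetical placeholders, not the repo's API.
        sequence = list(seed)                                    # the seed (x_1, ..., x_n)
        for _ in range(n_samples):
            window = np.array(sequence[-len(seed):])[None, ...]  # last n samples, batch of 1
            probs = model.predict(window)[0, -1]                 # distribution over the next sample
            sequence.append(int(np.argmax(probs)))               # greedy pick (or sample from probs)
        return sequence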

So the question is: how is this implemented in code? I really disliked the lengthy TensorFlow implementation, so I found the most-starred Keras version on GitHub, cloned it, got it running, and started analyzing it.

III. How the data is organized in the Keras code

The Keras code linked at the beginning of this article is very well written, especially the custom-Layer part. But because it is heavily abstracted, it is a little difficult for beginners to follow. I'll walk through a simple analysis of it.

First of all, we need to pay attention to the corresponding part of the paper.

With that, the code should be quite clear. Still not following? That's fine; here's a simple example.

Take the sequence "superstud". Using a window of length 2 and a stride of 1, it is sliced into training data as follows.

The first time, the model's training input is x1 = "su" and the target output is y1 = "up";

the second time: x2 = "up", y2 = "pe";

the third time: x3 = "pe", y3 = "er";

......

When training is finished and the trained model is used to generate data, we feed in x1 = "su" and, ideally, get y1 = "up".

By repeatedly feeding the output back into the input and predicting again, we can recover the complete sequence "superstud"!
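
In code, slicing a sequence into such training pairs can look like this (a toy illustration with window length 2 and stride 1, not taken from the repo):

    text = "superstud"
    pairs = [(text[i:i + 2], text[i + 1:i + 3]) for i in range(len(text) - 2)]
    # pairs[0] == ('su', 'up'), pairs[1] == ('up', 'pe'), pairs[2] == ('pe', 'er'), ...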

(Don't tell me you still don't see what the model's input and output are...)

So what about images and text? I won't explain those here (because I haven't studied them yet). You can look at the original paper and the TensorFlow code on GitHub.

If you've understood this part, then OK: the code is actually easy to understand too. Please read through all of it; below is a brief summary of how the input and output data are organized.

Input to the model

An audio fragment of length fragment_length, i.e. audio[begin : begin + fragment_length].

Output of the model

The fragment_length-long audio sequence shifted one sample ahead of the input, i.e. audio[begin + 1 : begin + 1 + fragment_length]. (fragment_stride controls the spacing between the start positions of successive training fragments, not the input/target offset; see the batch-generation code below.)
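
A toy illustration of this input/target relationship (the variable names are mine, not the repo's):

    import numpy as np

    audio = np.arange(10)                               # stand-in for a processed audio sequence
    fragment_length, begin = 4, 2
    x = audio[begin:begin + fragment_length]            # array([2, 3, 4, 5])  -> model input
    y = audio[begin + 1:begin + 1 + fragment_length]    # array([3, 4, 5, 6])  -> training target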

Processing of the input data

1. Each individual audio file is processed as follows:

a) Read the audio file.

b) Handle the channels (the original code uses mono audio).

c) Convert to floating point (this scales the samples into the range -1 to 1); the code is as follows:

def wav_to_float(x):
    try:
        max_value = np.iinfo(x.dtype).max
        min_value = np.iinfo(x.dtype).min
    except:
        max_value = np.finfo(x.dtype).max
        min_value = np.finfo(x.dtype).min
    x = x.astype('float64', casting='safe')
    x -= min_value
    x /= ((max_value - min_value) / 2.)
    x -= 1.
    return x

d) Convert to u-law encoding. u-law is used because the raw audio is 16-bit: each generated sample would otherwise need 2^16 possible output values (output nodes) going into the softmax, which is expensive, and spreading the data over such a wide range of values hurts accuracy. So the raw audio is first u-law encoded down to 256 levels (reference: 79398055); a sketch of the transform follows.
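
For reference, here is a sketch of the u-law transform as given in the WaveNet paper; the repo's ulaw() helper should be equivalent up to details:

    import numpy as np

    def ulaw(x, u=255.):
        # f(x) = sign(x) * ln(1 + u|x|) / ln(1 + u), for x in [-1, 1]
        return np.sign(x) * np.log(1. + u * np.abs(x)) / np.log(1. + u)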

e) Resample to the target sample rate.
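
One plausible way to implement this step (the repo's ensure_sample_rate() helper may be implemented differently):

    from scipy.signal import resample

    def resample_audio(audio, file_sample_rate, desired_sample_rate):
        # resample to the number of samples that matches the target rate
        n_target = int(len(audio) * desired_sample_rate / file_sample_rate)
        return resample(audio, n_target)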

f) Convert to uint8 data; the code is as follows:

def float_to_uint8(x):
    x += 1.
    x /= 2.
    uint8_max_value = np.iinfo('uint8').max
    x *= uint8_max_value
    x = x.astype('uint8')
    return x

The complete code for individual audio processing is as follows:

def process_wav(desired_sample_rate, filename, use_ulaw):
    # print('reading wavfile...', filename)
    with warnings.catch_warnings():
        # warnings.simplefilter("error")  # raising the warning level causes np.fromstring errors
        channels = scipy.io.wavfile.read(filename)
    file_sample_rate, audio = channels
    audio = ensure_mono(audio)
    audio = wav_to_float(audio)
    if use_ulaw:
        audio = ulaw(audio)
    audio = ensure_sample_rate(desired_sample_rate, file_sample_rate, audio)
    audio = float_to_uint8(audio)
    return audio

2. The processed audio is then organized into model inputs:

a) Concatenate each speaker's audio, processed as in step 1, into one long sequence: full_sequences.

b) Split off a test set: the code splits full_sequences 9:1 and uses the final 10% as the test set, as sketched below.
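
A minimal sketch of that split on one sequence (illustrative only; the variable names are mine, not the repo's):

    import numpy as np

    full_sequence = np.arange(100)           # stand-in for one concatenated audio sequence
    split = int(0.9 * len(full_sequence))    # 9:1 split point
    train_part = full_sequence[:split]       # first 90% used for training
    test_part = full_sequence[split:]        # last 10% used for testing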

c) Slide a window of length fragment_length over full_sequences with stride fragment_stride, recording only the start index of each fragment. This yields a number of audio sub-sequences. The code is as follows:

def fragment_indices(full_sequences, fragment_length, batch_size, fragment_stride, nb_output_bins):
    for seq_i, sequence in enumerate(full_sequences):
        for i in range(0, sequence.shape[0] - fragment_length, fragment_stride):
            yield seq_i, i
            # i is the start index of the input sequence; seq_i is the id of the audio file

d) Generate batches. Step c) produces a list of the start indices of all sub-sequences of the full_sequences; these are then grouped into batches of batch_size.

batches = cycle(partition_all(batch_size, indices))  # indices is a list
for batch in batches:
    if len(batch) < batch_size:
        continue
    yield (np.array([one_hot(full_sequences[e[0]][e[1]:e[1] + fragment_length])
                     for e in batch], dtype='uint8'),
           np.array([one_hot(full_sequences[e[0]][e[1] + 1:e[1] + fragment_length + 1])
                     for e in batch], dtype='uint8'))

e) Since the data points take values 0~255, they are converted to a one-hot encoding, so each sequence becomes a two-dimensional tensor before being fed to the model. For each start index, two sub-sequences are built: the former is the input and the latter (shifted one sample ahead) is the target. A sketch of the one-hot step follows.
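
A minimal sketch of the one-hot step, assuming 256 u-law levels (the repo's one_hot helper may differ in details):

    import numpy as np

    def one_hot(fragment, nb_bins=256):
        # fragment: 1-D uint8 array of length fragment_length
        # returns a (fragment_length, nb_bins) binary matrix
        return np.eye(nb_bins, dtype='uint8')[fragment]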

The model's definition and underlying principles: to be continued.

I am also working on a simple Keras WaveNet demo; I will attach it here once it is finished.

In fact, the most frustrating part of implementing a model in code is organizing the input data, which this article has now covered. Keras greatly simplifies building the network structure itself: as long as you understand the principles of the model, it is easy to implement.
