A review of deep learning and its application in speech processing

Last Update:2018-07-26 Source: Internet

Author: User


1. Preface

AI is currently a hot topic. From Google's AlphaGo to smart cars, artificial intelligence has entered all aspects of our lives.

Machine learning is one approach to implementing artificial intelligence: it uses algorithms to analyze data, learn from it, and then make predictions and decisions about the real world. Deep learning, in turn, is one technique within machine learning. The emergence of the backpropagation (BP) algorithm in the 1980s and its application to neural networks greatly promoted the development of machine learning. The algorithm is based on gradient descent and is suitable for multilayer neural networks. Models of this period contained only one layer of hidden nodes, so this stage is called shallow learning. From 2006 onward, as research deepened and models came to contain more and more layers, the application of deep learning in engineering developed rapidly.
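As a minimal sketch of the idea, assuming a network with a single hidden layer of sigmoid units, a squared-error loss, and a toy XOR dataset (all of these choices, and every function name, are illustrative and not taken from the original article), backpropagation with gradient descent might look like this:

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(params, x):
    """Return the hidden activations and the network output for input x."""
    w1, b1, w2, b2 = params
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w1, b1)]
    output = sigmoid(sum(w * h for w, h in zip(w2, hidden)) + b2)
    return hidden, output


def mean_squared_error(params, data):
    return sum((forward(params, x)[1] - y) ** 2 for x, y in data) / len(data)


def train_epoch(params, data, lr):
    """One pass over the data with per-sample gradient descent updates."""
    w1, b1, w2, b2 = params
    for x, y in data:
        hidden, output = forward((w1, b1, w2, b2), x)
        # Error signal at the output unit for the loss 0.5 * (output - y)**2,
        # multiplied by the derivative of the sigmoid.
        delta_out = (output - y) * output * (1.0 - output)
        # Propagate the error back to each hidden unit (uses the old w2).
        delta_hidden = [delta_out * w2[j] * hidden[j] * (1.0 - hidden[j])
                        for j in range(len(hidden))]
        # Gradient descent updates for both layers.
        for j in range(len(hidden)):
            w2[j] -= lr * delta_out * hidden[j]
            b1[j] -= lr * delta_hidden[j]
            for i in range(len(x)):
                w1[j][i] -= lr * delta_hidden[j] * x[i]
        b2 -= lr * delta_out
    return w1, b1, w2, b2


random.seed(0)
n_hidden = 3  # assumed size of the single hidden layer
params = ([[random.uniform(-1, 1) for _ in range(2)] for _ in range(n_hidden)],
          [0.0] * n_hidden,
          [random.uniform(-1, 1) for _ in range(n_hidden)],
          0.0)
xor_data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]

loss_before = mean_squared_error(params, xor_data)
for _ in range(2000):
    params = train_epoch(params, xor_data, lr=0.5)
loss_after = mean_squared_error(params, xor_data)
print(loss_before, loss_after)
```

The learning rate, the number of epochs, and the hidden layer size are arbitrary here; the point is only that the output error is pushed back through the hidden layer to obtain the gradients used in each update.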

2. Deep Learning

Compared with shallow learning, deep learning, as the name implies, uses more layers of hidden nodes, often more than five. It represents the original data by extracting features layer by layer, transforming the representation of a sample from the original space into a new feature space.

Deep learning is mainly divided into the following categories:

(1) Supervised learning. Labeled data is used to adjust the weights and thresholds of all layers and then fine-tune the network.

(2) Unsupervised learning. In contrast to supervised learning, it uses unlabeled data to pre-train each layer, and the training result of each layer is then used as the input to the layer above.

(3) Semi-supervised learning. As the name implies, it combines supervised and unsupervised learning: some parts use supervised learning and others use unsupervised learning. This type is the most widely used in practice.

The most commonly used deep learning models currently include:

(1) Convolutional neural networks (CNNs). A CNN is a feedforward neural network in which neurons are arranged in layers; each neuron is connected to neurons in the previous layer, receives their output, and passes its own output to the next layer. It includes convolutional layers and pooling layers. At present, CNNs are mainly used to recognize two-dimensional patterns that are invariant to shift, scaling, and other forms of distortion.

(2) Recurrent neural networks (RNNs). There are two kinds. One is the temporal recurrent neural network, in which the connections between neurons form a directed graph with cycles. The other is the structural recursive neural network, in which a similar network structure is applied recursively to build a more complex deep network. In recurrent neural networks there are not only feedforward connections but also self-connections between units or connections back to earlier layers; these act as short-term memory, so that the network remembers past inputs.

(3) Restricted Boltzmann machines (RBMs). A restricted Boltzmann machine is an unsupervised learning model with two layers: the first is the visible layer and the second is the hidden layer. Nodes within the same layer are not connected to each other. The relationship is shown in Figure 2.1. An RBM has three sets of model parameters: the weights, the visible-layer biases, and the hidden-layer biases.

(4) Autoencoders (AEs). An autoencoder is also an unsupervised learning model, which evolved from the autoassociator. An autoassociator is an MLP in which the output and input dimensions are the same and the target output is defined to equal the input. To be able to regenerate the input at the output layer, the MLP has to find the best representation of the input in the hidden layer. Once training is complete, the first layer, from the input to the hidden layer, acts as an encoder, and the values of the hidden units form the encoded representation. The second layer, from the hidden units to the output units, acts as a decoder that reconstructs the original signal from the encoded representation.
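The core computation of each of the four model families above can be sketched in a few lines. This is an illustration under several assumptions rather than a reference implementation: the convolution is one-dimensional instead of two-dimensional for brevity, all weights are arbitrary untrained values, and every function and variable name is hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector, bias):
    """Compute matrix @ vector + bias using plain lists."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(matrix, bias)]


# (1) CNN: a convolutional layer followed by a pooling layer (1-D here).
def conv1d(signal, kernel):
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]


def max_pool1d(values, size):
    return [max(values[i:i + size])
            for i in range(0, len(values) - size + 1, size)]


# (2) RNN: the previous hidden state is fed back as short-term memory.
def rnn_step(x, h_prev, w_in, w_rec, bias):
    from_input = matvec(w_in, x, bias)
    from_memory = matvec(w_rec, h_prev, [0.0] * len(h_prev))
    return [math.tanh(a + b) for a, b in zip(from_input, from_memory)]


# (3) RBM: weights, visible-layer biases, and hidden-layer biases.
def rbm_hidden_probs(v, weights, hidden_bias):
    """P(h_j = 1 | v); weights[j][i] links hidden unit j to visible unit i."""
    return [sigmoid(a) for a in matvec(weights, v, hidden_bias)]


def rbm_visible_probs(h, weights, visible_bias):
    """P(v_i = 1 | h), using the transpose of the same weight matrix."""
    transposed = [list(col) for col in zip(*weights)]
    return [sigmoid(a) for a in matvec(transposed, h, visible_bias)]


# (4) Autoencoder: encoder to a hidden code, decoder back to input size.
def autoencode(x, w_enc, b_enc, w_dec, b_dec):
    code = [sigmoid(a) for a in matvec(w_enc, x, b_enc)]
    reconstruction = matvec(w_dec, code, b_dec)
    return code, reconstruction


feature_map = conv1d([0, 0, 1, 1, 0, 0], [-1, 1])  # responds to edges
pooled = max_pool1d(feature_map, 2)

h = [0.0]
for x in ([1.0], [0.0]):  # the second input is zero, yet h stays non-zero
    h = rnn_step(x, h, w_in=[[1.0]], w_rec=[[0.5]], bias=[0.0])

rbm_weights = [[0.5, -0.5, 0.0], [1.0, 1.0, -1.0]]  # 2 hidden x 3 visible
p_hidden = rbm_hidden_probs([1, 1, 0], rbm_weights, [0.0, 0.0])
p_visible = rbm_visible_probs([1, 0], rbm_weights, [0.0, 0.0, 0.0])

code, reconstruction = autoencode(
    [1.0, 0.0, 1.0, 0.0],
    w_enc=[[0.1, 0.2, 0.3, 0.4], [0.4, 0.3, 0.2, 0.1]], b_enc=[0.0, 0.0],
    w_dec=[[0.5, 0.5], [0.1, 0.9], [0.9, 0.1], [0.3, 0.7]], b_dec=[0.0] * 4)
```

In the RBM functions, no term couples two units of the same layer, which reflects the restriction described in (3). In the autoencoder, the code has fewer dimensions than the input while the reconstruction has the same number, matching the structure described in (4); training the weights is left out of this sketch.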

3. Application of deep learning in speech processing

With the development of artificial intelligence, free interaction between humans and computers is becoming more and more important, and speech processing is an important part of it. At present, speech processing mainly includes speech recognition, speech synthesis, and other technologies.

Speech recognition is the technology of converting human speech into text. Many well-known technology companies in China and abroad, such as Google, Microsoft, and iFlytek, have carried out in-depth research in this field. Products such as Apple Siri and Microsoft Cortana are widely used in everyday life and have made people's lives much more convenient.

The process of speech recognition is shown in Figure 3.1. First, the input training speech signal is preprocessed and its features are extracted, and the acoustic model is trained on them; meanwhile, the language model learns the relationships between words or sentences from a training corpus in order to estimate the probability of a hypothesized word sequence. In decoding search, the test speech is preprocessed and its features are extracted, the acoustic model score and the language model score are computed, and the word sequence with the highest total score is output as the recognition result.

Speech synthesis is the technique of producing artificial speech by mechanical and electronic means. In March 2017, Baidu introduced a real-time neural speech synthesis system (Real-Time Neural Text-to-Speech for Production) named Deep Voice. It consists of five parts: a segmentation model for locating phoneme boundaries; a model for converting graphemes (characters) to phonemes; a model for predicting how long each phoneme lasts; a fundamental frequency prediction model; and an audio synthesis model. On the same CPU and GPU, the system is reported to be 400 times faster than Google DeepMind's WaveNet. The process is shown in Figure 3.2.

The first step is to convert the text into phonemes, using a simple phoneme dictionary to convert each sentence directly into the corresponding phonemes. The second step is duration prediction, since the duration of each phoneme, long or short, should be determined from its context; fundamental frequency prediction (F0 in the diagram) is also required. The final step is to combine the phonemes, durations, and frequencies to produce the output sound.
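The final decoding step can be illustrated with a small sketch. The candidate word sequences, their scores, and the `lm_weight` parameter below are all made up for illustration; a real decoder searches a much larger hypothesis space and obtains the scores from trained models.

```python
import math

# Hypothetical log-probability scores for three candidate word sequences.
acoustic_log_scores = {
    "recognize speech": math.log(0.40),
    "wreck a nice beach": math.log(0.45),
    "recognize peach": math.log(0.15),
}
language_log_scores = {
    "recognize speech": math.log(0.60),
    "wreck a nice beach": math.log(0.10),
    "recognize peach": math.log(0.30),
}


def decode(acoustic, language, lm_weight=1.0):
    """Return the hypothesis with the highest combined log score."""
    def total(hypothesis):
        return acoustic[hypothesis] + lm_weight * language[hypothesis]
    return max(acoustic, key=total)


best = decode(acoustic_log_scores, language_log_scores)
acoustic_only = decode(acoustic_log_scores, language_log_scores, lm_weight=0.0)
print(best)           # recognize speech
print(acoustic_only)  # wreck a nice beach
```

With the language model switched off (`lm_weight=0.0`), the acoustically closest but implausible sequence wins, which shows why the two scores are combined.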
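The three steps can be sketched as a toy pipeline. This is not Baidu's implementation: the phoneme dictionary is a two-word stand-in, the duration and F0 "models" are hand-written rules rather than learned networks, the synthesis stage simply emits a sine wave, and all names are hypothetical.

```python
import math

# Hypothetical phoneme dictionary; a real system would use a full lexicon.
PHONEME_DICT = {"deep": ["D", "IY", "P"], "voice": ["V", "OY", "S"]}
VOWELS = {"IY", "OY"}


def to_phonemes(sentence):
    """Step 1: look up each word in the phoneme dictionary."""
    phonemes = []
    for word in sentence.lower().split():
        phonemes.extend(PHONEME_DICT[word])
    return phonemes


def predict_duration(phonemes, index):
    """Step 2a: toy rule standing in for a learned duration model (seconds)."""
    base = 0.12 if phonemes[index] in VOWELS else 0.06
    # Context dependence: lengthen the last phoneme of the utterance.
    return base * 1.5 if index == len(phonemes) - 1 else base


def predict_f0(phonemes, index):
    """Step 2b: toy rule standing in for a learned F0 model (Hz).

    Only vowels are given a pitch here; 0.0 is treated as silence.
    """
    return 120.0 if phonemes[index] in VOWELS else 0.0


def synthesize(phonemes, sample_rate=16000):
    """Step 3: combine phonemes, durations, and F0 into audio samples."""
    samples = []
    for i in range(len(phonemes)):
        n_samples = int(round(predict_duration(phonemes, i) * sample_rate))
        f0 = predict_f0(phonemes, i)
        samples.extend(math.sin(2 * math.pi * f0 * t / sample_rate)
                       for t in range(n_samples))
    return samples


phonemes = to_phonemes("Deep Voice")
audio = synthesize(phonemes)
print(phonemes, len(audio) / 16000, "seconds")
```

The structure is the point of the sketch: each stage consumes the output of the previous one, so any of the toy rules could be replaced by a trained model without changing the rest of the pipeline.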

4. Concluding remarks

The field of artificial intelligence is now very popular, and more and more people are working in AI-related areas. Over the past few days I gained a preliminary understanding of deep learning by reading a number of related papers and watching Professor Andrew Ng's machine learning videos. I have come to realize that deep learning will develop more widely in many areas, including speech processing and image processing, and that its prospects are very broad. In writing this review I have compiled some of my own notes; I hope to keep them in mind during my three years of postgraduate study, work hard, and make progress.

Note: In writing this review I referred to a number of related papers, and I am grateful to the researchers concerned.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
