Principle of Speech Synthesis Technology

State Key Laboratory of Intelligent Technology and Systems, Department of Computer Science and Technology, Tsinghua University

Wu Zhiyong, Cai Lianhong

---- Research on speech synthesis has now moved into the text-to-speech (TTS) stage. A TTS system can be divided into three main modules: text analysis, prosody generation, and speech synthesis. Of these, speech synthesis is the most fundamental and most important module. Broadly, its function is as follows: guided by the output of the prosody model, it retrieves the corresponding speech units from the recorded unit inventory, adjusts and modifies the prosodic features of those units with a specific synthesis technique, and finally produces the required speech.

---- Speech synthesis technology has developed step by step, from parametric synthesis to concatenative synthesis, and on to a gradual combination of the two. The driving force behind this development is the growth of both our understanding and our demands. At present, the commonly used speech synthesis techniques are formant synthesis, LPC synthesis, PSOLA concatenative synthesis, and synthesis based on the LMA vocal tract model. Each has its own strengths and weaknesses. In practice, several techniques are often combined, or the strengths of one are borrowed to offset the weaknesses of another.

Formant Synthesis

---- The theoretical basis of speech synthesis is the mathematical model of speech production. In this model, an excitation signal drives the process, and the sound wave passes through a resonant cavity (the vocal tract) before being radiated from the mouth or nose. The vocal tract parameters and the resonance characteristics of the vocal tract have therefore been a focus of research. In the frequency response of a speech sound shown in figure 1, Fp1, Fp2, Fp3, ... mark the poles of the frequency response, where the transmission response of the vocal tract reaches a maximum. By convention, these maxima of the vocal tract frequency response are called formants, and the distribution of the formant frequencies (the pole frequencies) determines the timbre of the speech sound.

---- Speech sounds with different timbres have different formant patterns. Taking each formant frequency and its bandwidth as parameters, one can therefore construct a formant filter. A combination of several such filters models the transmission characteristics (frequency response) of the vocal tract and shapes the signal from the excitation source; passing the result through a radiation model yields synthesized speech. This is the basic principle of formant synthesis. Three practical models are built on formant theory.
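---- As an illustration of how one formant frequency and one bandwidth define a filter, the sketch below builds a second-order digital resonator of the kind commonly used in formant synthesizers. The article does not give filter equations, so the coefficient formulas, the function names, and the example values (500 Hz, 60 Hz, 10 kHz) are assumptions chosen for illustration.

```python
import cmath
import math


def formant_filter_coeffs(freq_hz, bandwidth_hz, sample_rate_hz):
    """Coefficients of y[n] = a*x[n] + b*y[n-1] + c*y[n-2].

    The pole radius is set by the bandwidth and the pole angle by the
    formant frequency. `a` is chosen so that the gain at 0 Hz is 1.
    """
    t = 1.0 / sample_rate_hz
    radius = math.exp(-math.pi * bandwidth_hz * t)
    b = 2.0 * radius * math.cos(2.0 * math.pi * freq_hz * t)
    c = -radius * radius
    a = 1.0 - b - c
    return a, b, c


def gain_at(coeffs, freq_hz, sample_rate_hz):
    """Magnitude of the resonator's frequency response at freq_hz."""
    a, b, c = coeffs
    z_inv = cmath.exp(-2j * math.pi * freq_hz / sample_rate_hz)
    return abs(a / (1.0 - b * z_inv - c * z_inv * z_inv))


# Hypothetical example: one formant at 500 Hz, 60 Hz wide, 10 kHz sampling.
coeffs = formant_filter_coeffs(500.0, 60.0, 10000.0)
for f in (0.0, 300.0, 500.0, 1500.0):
    print(f, round(gain_at(coeffs, f, 10000.0), 3))
```

---- The response peaks near the chosen formant frequency and falls off on either side; a narrower bandwidth gives a sharper peak.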

---- Cascade model. In this model, the vocal tract is treated as a set of second-order resonators connected in series. It is mainly used to synthesize most vowels.

---- Parallel model. Many researchers hold that the cascade model cannot describe and model unusual vowels and most consonants well. The parallel formant model was constructed for this reason.

---- Hybrid model. In the cascade model, the formant filters are connected end to end. In the parallel model, the input signal first passes through an amplitude adjustment and is then fed to each formant filter, and the outputs of all branches are summed. Comparing the two: for sounds whose source is at the end of the vocal tract (most vowels), the cascade form agrees with the acoustic theory of speech production and needs no amplitude adjustment for each filter. For sounds whose source lies in the middle of the vocal tract (most voiceless fricatives and stops), the parallel form is more suitable, but its amplitude adjustment is complicated. For these reasons the two were combined into the hybrid formant model shown in figure 2.
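---- The difference between the cascade and parallel arrangements can be sketched in a few lines. This is a simplified illustration, not the structure of any particular synthesizer: the resonator form, the formant values, and the branch gains are all assumed, and a real parallel synthesizer needs more careful amplitude and phase handling than a plain weighted sum.

```python
import math


def resonator(signal, freq_hz, bandwidth_hz, sample_rate_hz):
    """Second-order resonator with unity gain at 0 Hz (assumed filter form)."""
    t = 1.0 / sample_rate_hz
    radius = math.exp(-math.pi * bandwidth_hz * t)
    b = 2.0 * radius * math.cos(2.0 * math.pi * freq_hz * t)
    c = -radius * radius
    a = 1.0 - b - c
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def cascade(signal, formants, sample_rate_hz):
    """Filters connected end to end: each output feeds the next filter."""
    for freq_hz, bandwidth_hz in formants:
        signal = resonator(signal, freq_hz, bandwidth_hz, sample_rate_hz)
    return signal


def parallel(signal, formants, gains, sample_rate_hz):
    """Each branch gets its own amplitude; the branch outputs are summed."""
    out = [0.0] * len(signal)
    for (freq_hz, bandwidth_hz), gain in zip(formants, gains):
        scaled = [gain * x for x in signal]
        branch = resonator(scaled, freq_hz, bandwidth_hz, sample_rate_hz)
        out = [o + v for o, v in zip(out, branch)]
    return out


# Hypothetical formant (frequency, bandwidth) pairs in Hz and branch gains.
FORMANTS = [(700.0, 60.0), (1200.0, 90.0), (2600.0, 150.0)]
impulse = [1.0] + [0.0] * 1999
cascade_ir = cascade(impulse, FORMANTS, 10000.0)
parallel_ir = parallel(impulse, FORMANTS, [1.0, 0.5, 0.25], 10000.0)
```

---- The cascade function takes no gain arguments at all, while the parallel function needs one gain per formant. This matches the trade-off described above: the parallel form is more flexible but requires the amplitudes to be set explicitly.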

---- All three formant models have been used in practical synthesizers. For example, Fant's OVE system uses the cascade formant model, Holmes's synthesizer uses the parallel formant model, and the Klatt synthesizer, the most typical and most successful, is built on the hybrid formant model.

---- For Chinese speech synthesis, researchers have also developed a series of systems based on formant models. Examples include the SiFs synthesizer of the Chinese Academy of Social Sciences; the KX 1 system of the Institute of Acoustics, Chinese Academy of Sciences, based on the Holmes parallel formant model; and the KX FSS, also developed by the Institute of Acoustics, based on the Klatt synthesizer.

---- Formant models are based on a fairly accurate simulation of the vocal tract, so they can synthesize speech with a high degree of naturalness. In addition, because formant parameters have a clear physical meaning and correspond directly to vocal tract parameters, formants can readily be used to describe the phenomena found in natural speech flow and to summarize acoustic rules, which are then applied in a formant synthesis system.

---- At the same time, however, this technique has clear weaknesses. First, it is built on a model of the vocal tract, so any inaccuracy in that model affects the quality of the synthesis. Second, practical work shows that although the formant model captures the most basic and most important part of speech, it cannot represent the many fine details of natural speech. Third, the control of a formant synthesizer is very complex: a good synthesizer has dozens of control parameters, which makes it difficult to implement.

---- For these reasons, researchers went on to look for other synthesis techniques. Inspired by the direct recording and playback of waveforms, they developed synthesis techniques based on waveform concatenation, of which LPC synthesis and PSOLA synthesis are representative. Unlike formant synthesis, waveform concatenation produces speech by splicing together the waveforms of recorded synthesis units rather than by modeling the speech production process.

LPC Parametric Synthesis

---- The development of waveform concatenation is inseparable from the development of speech coding and decoding techniques. Among these, LPC (linear predictive coding) has had a major influence on waveform concatenation.

---- LPC synthesis is essentially a time-domain waveform coding technique. Its aim is to reduce the transmission bit rate of the time-domain signal.
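---- The idea of linear prediction behind LPC can be sketched as follows: each sample is predicted as a weighted sum of the previous few samples, so only the weights and a small residual need to be stored or transmitted. The sketch uses the standard textbook autocorrelation method with the Levinson-Durbin recursion. The article itself does not say how the coefficients are computed, so the procedure and the function names here are assumptions.

```python
def autocorrelation(x, max_lag):
    """r[k] = sum over n of x[n] * x[n - k], for k = 0..max_lag."""
    return [
        sum(x[n] * x[n - k] for n in range(k, len(x)))
        for k in range(max_lag + 1)
    ]


def levinson_durbin(r, order):
    """Solve for predictor weights a[1..order] such that
    x[n] is approximated by sum(a[k] * x[n - k]).

    Returns (weights, residual_energy).
    """
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err  # reflection coefficient for this stage
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= 1.0 - k * k
    return a[1:], err


# Test signal: impulse response of x[n] = 1.3*x[n-1] - 0.8*x[n-2] + e[n].
signal = [1.0, 1.3]
for _ in range(398):
    signal.append(1.3 * signal[-1] - 0.8 * signal[-2])

weights, residual = levinson_durbin(autocorrelation(signal, 2), 2)
print(weights, residual)
```

---- On this synthetic signal the recursion should recover weights close to 1.3 and -0.8. Real speech is analyzed frame by frame with a higher order, and the residual is what excitation schemes such as multi-pulse or code excitation are meant to represent compactly.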

---- The Institute of Acoustics of the Chinese Academy of Sciences has done a great deal of work on applying LPC to Chinese speech synthesis and Chinese text-to-speech conversion. They introduced multi-pulse excited LPC in 1987, vector quantization in 1989, and code excitation in 1993. Their work contributed greatly to the use of LPC synthesis for Chinese.

---- The advantage of LPC synthesis is that it is simple and direct: the synthesis process is essentially just decoding and concatenation. Moreover, because the synthesis units used in waveform concatenation are the waveform data of the speech itself, all of the information in the speech is preserved, so an individual synthesis unit can reach a high degree of naturalness.

---- However, speech in natural continuous flow differs greatly from speech produced in isolation. If isolated units are simply strung together, the quality of the resulting speech flow is bound to be unsatisfactory. LPC is in essence a record-and-playback method, so synthesizing whole stretches of continuous speech with LPC alone gives poor results. LPC therefore has to be combined with other techniques to improve the quality of the synthesis.

---- Figure 3 shows the block diagram of a typical text-to-speech conversion system based on single syllables and VQLPC (vector quantized LPC).

PSOLA Synthesis

---- PSOLA (pitch synchronous overlap add), proposed in the late 1980s, gave waveform concatenation new vitality. PSOLA is concerned with controlling the prosody of the speech signal: pitch, duration, and intensity. These parameters are essential to prosody control and modification, so PSOLA is more modifiable than LPC and can synthesize highly natural speech.

---- The main feature of PSOLA is this: before the waveform segments are spliced, the PSOLA algorithm adjusts the prosodic features of each segment according to its context. The resulting waveform keeps the main segmental features of the original speech while its prosodic features match the requirements of the context, which gives high intelligibility and naturalness.
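---- A heavily simplified sketch of the time-domain overlap-add idea is given below. It assumes that the pitch marks are already known, that the pitch period is constant, and that only the pitch is changed (not duration or intensity); it also skips the amplitude normalization a practical implementation would need. The function name and parameters are hypothetical and are not taken from the systems described in this article.

```python
import math


def psola_pitch_shift(signal, marks, period, factor):
    """Re-space pitch-synchronous segments to change the pitch.

    signal: list of samples
    marks:  analysis pitch marks (sample indices, one per pitch period)
    period: pitch period in samples (assumed constant here)
    factor: > 1 raises the pitch, < 1 lowers it
    """
    # Hann window spanning two pitch periods, centred on a pitch mark.
    size = 2 * period + 1
    window = [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (size - 1))
              for i in range(size)]
    new_period = max(1, round(period / factor))
    out = [0.0] * len(signal)
    t = marks[0]
    while t <= marks[-1]:
        # Take the segment around the nearest analysis mark ...
        m = min(marks, key=lambda k: abs(k - t))
        # ... and overlap-add it at the synthesis mark t.
        for i in range(-period, period + 1):
            src, dst = m + i, t + i
            if 0 <= src < len(signal) and 0 <= dst < len(out):
                out[dst] += window[i + period] * signal[src]
        t += new_period
    return out


# Toy input: one pulse every 100 samples, standing in for voiced speech.
marks = list(range(100, 1000, 100))
pulses = [0.0] * 1000
for m in marks:
    pulses[m] = 1.0

shifted = psola_pitch_shift(pulses, marks, 100, 2.0)
```

---- With a factor of 2, segments are placed every 50 samples instead of every 100, so the pulse rate (and hence the perceived pitch) doubles while each segment keeps its original shape. This is the sense in which the segmental features are preserved while the prosody changes.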

---- Many universities and research institutes in China have studied the application of PSOLA to Chinese text-to-speech systems widely and in depth. Building on this research, groups such as Tsinghua University, Beijing Jiao Tong University, and the Institute of Acoustics of the Chinese Academy of Sciences have developed Chinese text-to-speech systems based on waveform concatenation and are continuing to refine the technique, for example by taking concrete measures to improve the naturalness of the synthesized speech.

---- PSOLA keeps the advantages of traditional waveform concatenation: it is simple and direct, and its computational cost is low. At the same time it makes the prosodic parameters of speech easy to control, so it is suitable for synthesizing natural continuous speech, and it has been widely used.

---- PSOLA also has drawbacks, however. First, PSOLA is a pitch-synchronous analysis/synthesis technique, so the pitch period and its starting point must be determined accurately. Errors in the pitch period or its starting point degrade the result. Second, PSOLA is a simple mapping and splicing of waveforms. Whether such splicing gives smooth transitions, and how it affects the frequency-domain parameters, has not been settled, so unsatisfactory results can appear during synthesis.
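---- To make the first drawback concrete, here is a minimal sketch of one common way to estimate the pitch period of a frame: pick the lag at which the frame best matches a shifted copy of itself (the autocorrelation method). The article does not say which pitch detector was used, so this method, the function name, and the lag range are assumptions.

```python
import math


def estimate_pitch_period(frame, min_lag, max_lag):
    """Return the lag (in samples) with the largest autocorrelation."""
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        score = sum(frame[n] * frame[n - lag]
                    for n in range(lag, len(frame)))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


# Toy frame: a pure tone that repeats every 80 samples.
frame = [math.sin(2.0 * math.pi * n / 80.0) for n in range(800)]
period = estimate_pitch_period(frame, 40, 200)
print(period)
```

---- On clean periodic input this finds the true period, but real speech is noisy and only quasi-periodic, so simple detectors can lock onto half or double the true period. An error of that kind shifts every pitch mark, which is why PSOLA quality depends so strongly on this step.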

LMA Vocal Tract Model

---- As demands on the naturalness and sound quality of synthesized speech grow, the PSOLA algorithm shows its weakness in adjusting prosodic parameters and its difficulty in handling coarticulation. A synthesis method based on the LMA vocal tract model was therefore proposed. Like traditional parametric synthesis, this method allows the prosodic parameters to be adjusted flexibly, and it also gives better sound quality than the PSOLA algorithm.

---- At present, the main speech synthesis techniques are formant synthesis and waveform concatenation based on the PSOLA algorithm. Both have their strengths: formant synthesis is the more mature, with a large body of research results to draw on, while PSOLA is newer and has good prospects.

---- In the past the two techniques developed largely independently of each other. Many researchers are now studying the relationship between them and trying to combine them effectively to produce more natural speech. For example, researchers at Tsinghua University have studied applying formant modification to the PSOLA algorithm and used it to improve the Sonic system, producing a Chinese text-to-speech system with higher naturalness.
