Network speech technology

Document directory
  • I. Conceptual Model
    • 1. Voice collection
    • 2. Encoding
    • 3. Network Transmission
    • 4. Decoding
    • 5. Audio Playback
  • II. Difficulties and Solutions in Practical Applications
    • 1. Echo cancellation (AEC)
    • 2. Noise suppression (DENOISE)
    • 3. Jitter buffer (JitterBuffer)
    • 4. Voice activity detection (VAD)
    • 5. Audio mixing algorithm

When we use tools like Skype and QQ to chat smoothly with friends over voice and video, have you ever wondered what powerful technologies lie behind them? This article gives a brief introduction to the technologies used in network voice calls.

I. Conceptual Model

Network voice calls are usually bidirectional, and at the model level the two directions are symmetric, so for simplicity we can discuss just one direction of the channel: one party speaks and the other party hears the voice. That sounds simple and immediate, but the process behind it is quite complicated. The major steps it goes through are simplified into the conceptual model shown below:

This is the most basic model, consisting of five key stages: collection, encoding, transmission, decoding, and playback.

1. Voice collection

Voice collection refers to obtaining audio data from the microphone, that is, converting the analog audio signal into digital samples. It involves several important parameters: the sampling rate, the sample size (bit depth), and the number of channels.

To put it simply, the sampling rate is the number of samples taken per second, and the sample size is the number of bits used to store each sample.

The size of an audio frame, in bytes, equals (sampling rate × sample size × number of channels × duration) / 8.

A sample frame usually covers 10 ms, that is, every 10 ms of data forms one audio frame. Assume the sampling rate is 16 kHz, the sample size is 16 bits, and there is one channel; the size of a 10 ms audio frame is then (16000 × 16 × 1 × 0.01) / 8 = 320 bytes, where 0.01 is the duration in seconds (10 ms).
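
As a quick sanity check of that arithmetic, here is a tiny Python sketch; the constant names are purely illustrative:

```python
# Frame-size arithmetic from the example above: 16 kHz, 16-bit, mono, 10 ms frames.
SAMPLE_RATE = 16000     # samples per second
BITS_PER_SAMPLE = 16    # bit depth of each sample
CHANNELS = 1            # mono
FRAME_DURATION = 0.01   # 10 ms expressed in seconds

def frame_size_bytes(rate, bits, channels, duration):
    """(sampling rate x sample size x channels x duration) / 8, in bytes."""
    return int(rate * bits * channels * duration / 8)

print(frame_size_bytes(SAMPLE_RATE, BITS_PER_SAMPLE, CHANNELS, FRAME_DURATION))  # 320
```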

2. Encoding

Suppose the captured audio frames were sent directly without encoding; we can then calculate the required bandwidth: 320 × 100 = 32 KB/s, which is 256 kbit/s. That is a huge amount of bandwidth for a voice call. Using a network traffic monitor, we can see that a voice call in IM software such as QQ consumes only about 3-5 KB/s, roughly an order of magnitude less than the raw stream. This is mainly due to audio coding technology.
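
The gap can be illustrated with the same numbers. The codec bitrate below (G.729 at 8 kbit/s) is its published nominal rate, not a measurement from this system, and packet headers are ignored:

```python
# Raw PCM: 320-byte frames, 100 frames per second (payload only).
raw_bits_per_second = 320 * 100 * 8      # 256,000 bit/s
g729_bits_per_second = 8_000             # G.729 nominal bitrate

print(raw_bits_per_second)                          # 256000
print(raw_bits_per_second / g729_bits_per_second)   # 32.0 -> raw PCM is ~32x larger
```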

Therefore, encoding is indispensable in real voice calls. Many speech coding technologies are in common use today, such as G.729, iLBC, AAC, and Speex.

3. Network Transmission

After an audio frame is encoded, it can be sent to the other party over the network. For real-time applications such as voice calls, low latency and stability are very important, which requires a smooth network transmission path.
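
In practice most systems carry encoded frames in RTP packets over UDP. The sketch below is only an illustration of the idea, not real RTP: each frame is prefixed with a sequence number and timestamp so the receiver can detect loss and reordering, and the peer address is a placeholder:

```python
import socket
import struct
import time

# Placeholder address for the other party; a real system learns this through signaling.
PEER = ("198.51.100.10", 40000)
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

def send_frame(payload: bytes, seq: int) -> None:
    # 4-byte sequence number + 8-byte millisecond timestamp, then the encoded frame.
    header = struct.pack("!IQ", seq, int(time.time() * 1000))
    sock.sendto(header + payload, PEER)

# Example: pretend this is one encoded 10 ms frame.
send_frame(b"\x00" * 20, seq=0)
```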

4. Decoding

After receiving an encoded frame, the recipient decodes it to restore data that the sound card can play directly.

5. Audio Playback

After decoding, the audio frame can be submitted to the sound card for playback.

II. Difficulties and Solutions in Practical Applications

If the techniques above were enough to build a good wide-area-network voice call application, there would be no need to write this article. In practice, many real-world factors pose challenges to the conceptual model, which makes implementing a network voice system far from simple and involves many specialized techniques. Fortunately, most of these challenges have mature solutions. First, we need to define what a "good" voice call system should achieve. I think it should meet the following criteria:

(1) Low latency. Only with low latency will both parties feel that the call is truly real-time. Of course, this mainly depends on the speed of the network and the physical distance between the two parties; from the software side alone there is little room for optimization.

(2) Low background noise.

(3) Smooth sound, with no stuttering or pauses.

(4) No echo.

Next, we will go through the additional technologies used in real network voice call systems one by one.

1. Echo cancellation (AEC)

Nowadays almost everyone is used to chatting through the built-in speakers of a PC or laptop, and this small habit poses quite a challenge for speech technology. When the speakers are used, the sound they play is picked up by the microphone and sent back to the other party, so the other party hears an echo of their own voice. Echo cancellation is therefore required in practical applications.

The echo cancellation module does its work in the gap between capturing an audio frame and encoding it.

In simple terms, the echo cancellation module uses the audio frames that have just been played to perform a kind of subtraction on the captured frames, removing the echo from them. This process is fairly complex, and it also depends on the size of the room you are chatting in and where you are in that room, because those factors determine the delay of the acoustic reflections. A smart echo cancellation module dynamically adjusts its internal parameters to adapt to the current environment.
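
As a very rough illustration of the idea (production cancellers such as the one in WebRTC add delay estimation, double-talk detection, and nonlinear post-processing), a normalized LMS adaptive filter can estimate the echo path from the far-end (played) signal and subtract that estimate from the captured signal:

```python
import numpy as np

def nlms_echo_cancel(mic, far_end, taps=128, mu=0.5, eps=1e-6):
    """Subtract an adaptively estimated copy of `far_end` (what the speaker played)
    from `mic` (what the microphone captured). Both are 1-D float arrays."""
    w = np.zeros(taps)            # adaptive estimate of the echo path
    x_buf = np.zeros(taps)        # the most recent far-end samples
    out = np.zeros(len(mic))
    for n in range(len(mic)):
        x_buf = np.roll(x_buf, 1)             # shift in the newest far-end sample
        x_buf[0] = far_end[n]
        echo_estimate = w @ x_buf             # predicted echo at this sample
        e = mic[n] - echo_estimate            # residual: near-end speech + leftover echo
        w += mu * e * x_buf / (x_buf @ x_buf + eps)   # NLMS coefficient update
        out[n] = e
    return out
```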

2. Noise suppression (DENOISE)

Noise suppression is also called noise reduction. Based on the characteristics of speech data, it identifies the background noise in an audio frame and filters it out. Many encoders have this function built in.
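
A real suppressor (for example the one built into Speex) estimates the noise spectrum and attenuates individual frequency bands; the sketch below is only a crude full-band noise gate to show the general idea, with an illustrative threshold:

```python
import numpy as np

def noise_gate(frame, noise_floor, attenuation=0.1):
    """Attenuate a frame whose energy is close to the estimated noise floor.
    `frame` is int16 PCM; `noise_floor` is a running estimate of noise energy."""
    energy = np.mean(frame.astype(np.float64) ** 2)
    if energy < 2.0 * noise_floor:                     # illustrative threshold
        return (frame * attenuation).astype(frame.dtype)
    return frame
```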

3. Jitter buffer (JitterBuffer)

The jitter buffer is used to deal with network jitter, that is, the fact that network latency keeps varying. Because of jitter, even if the sender transmits packets at a fixed interval (for example, one packet per interval), the receiver does not receive them at that same steady pace: sometimes a whole interval passes without a packet arriving, and sometimes several packets arrive within a single interval. Without a buffer, the voice the receiver hears keeps stuttering.

The JitterBuffer sits after the decoder and before audio playback. After a frame is decoded, it is placed into the JitterBuffer, and each time the sound card's playback callback asks for data, the oldest frame is taken out of the JitterBuffer and played.

The buffer depth of the JitterBuffer depends on how severe the network jitter is: the greater the jitter, the greater the buffer depth, and the greater the playback latency. The JitterBuffer thus trades latency for smooth playback, because a slightly larger delay with smooth audio gives a better subjective experience than a lower delay with stuttering.

Of course, the buffer depth of the JitterBuffer is not fixed; it is adjusted dynamically as the degree of network jitter changes. When the network becomes very stable and smooth, the buffer depth becomes very small, so the extra playback latency introduced by the JitterBuffer is negligible.
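
Below is a minimal sketch of the idea, assuming each frame carries a sequence number and the playback callback pulls one frame at a time; real jitter buffers also handle late and duplicate packets and adapt their depth on the fly:

```python
import heapq

class JitterBuffer:
    def __init__(self, min_depth=3):
        self.min_depth = min_depth   # frames to accumulate before playout starts
        self.frames = []             # min-heap ordered by sequence number
        self.started = False

    def put(self, seq, frame):
        """Called when a decoded frame arrives from the network."""
        heapq.heappush(self.frames, (seq, frame))

    def get(self):
        """Called by the sound card's playback callback; None means play silence."""
        if not self.started:
            if len(self.frames) < self.min_depth:
                return None          # still filling the buffer
            self.started = True
        if not self.frames:
            return None              # underrun: the network has stalled
        return heapq.heappop(self.frames)[1]
```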

4. Voice activity detection (VAD)

In a voice call, if one party is not speaking, no traffic needs to be generated; this is what silence detection is for. Silence detection is usually integrated into the encoding module. Combined with the noise suppression algorithm above, the silence detection algorithm can determine whether there is any voice input; if there is none, the encoder can output a special encoded frame (for example, one of length 0).

In multi-party video conferences in particular, usually only one person is speaking at a time, so silence detection can save a considerable amount of bandwidth.
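
As a toy illustration only (real VADs also look at spectral features, not just energy), an energy check against a running noise-floor estimate might look like this:

```python
import numpy as np

def is_speech(frame, noise_floor, threshold=3.0):
    """Classify a frame as speech when its energy clearly exceeds the noise floor."""
    energy = np.mean(frame.astype(np.float64) ** 2)
    return energy > threshold * noise_floor

# When is_speech() is False, the encoder can emit a special frame (for example,
# one of length 0) instead of a full voice frame, saving bandwidth.
```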

5. Audio mixing algorithm

When more than two people are talking, we need to play voice data from several people at the same time, but the sound card has only one playback buffer, so the multiple voices must be mixed into a single stream; this is what the mixing algorithm does. Even if you manage to avoid mixing by playing multiple audio streams simultaneously, you still have to mix them into one stream for echo cancellation; otherwise, at most one of the voices can have its echo cancelled.

Mixing can be done on the client or on the server (server-side mixing saves downstream bandwidth). If a P2P channel is used, mixing can only be done on the client. With client-side mixing, the mix is usually the last step before playback.
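
Assuming the decoded streams are 16-bit PCM frames of equal length, the simplest mix is to sum them and clip to the int16 range; production mixers usually do something smarter to avoid distortion, such as attenuating or normalizing each stream:

```python
import numpy as np

def mix_frames(frames):
    """Mix several int16 PCM frames of equal length into one frame."""
    acc = np.zeros(len(frames[0]), dtype=np.int32)        # widen to avoid overflow
    for f in frames:
        acc += f.astype(np.int32)
    return np.clip(acc, -32768, 32767).astype(np.int16)   # saturate back to int16
```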

 

Combining the conceptual model above with the techniques used in real network voice systems, a complete model diagram is shown below:

This article is a rough summary of our experience implementing some of the voice features of OMCS. It only briefly describes each link in the diagram; any one of them could fill a long paper or even a book. The article is therefore meant as an entry-level map for those who are new to network voice system development, giving them some clues to follow.

 
