EVS, the state-of-the-art audio codec for mobile communications, and the work needed to use it

Voice communication started as wired-only; later, wired and wireless (mobile) communication competed, and once the price of mobile voice fell below that of wired voice, wired voice clearly went into decline. Today the competitor of mobile voice is OTT (Over The Top) voice, the generally free voice services provided by Internet vendors. Voice communication technology is thus split into two camps, the traditional telecom camp and the Internet camp, whose competition drives the technology forward. On the codec side, the Internet camp produced Opus, an audio codec covering both speech and music. Opus was co-developed under the lead of the nonprofit Xiph.Org Foundation together with Skype, Mozilla and others; it covers the full band (sample rates from 8 kHz to 48 kHz), supports both speech (with SILK) and music (with CELT), and has been adopted by the IETF as the voice codec standard for the Internet (RFC 6716). The vast majority of OTT voice apps support it, and it is on the way to unifying the Internet camp. In response to this competition, the mobile communication standards organization 3GPP also produced an audio codec covering both speech and music: EVS (Enhanced Voice Services). I once successfully added EVS to a mobile phone platform and passed China Mobile's real-network testing. Below I discuss the codec itself and the work needed to use it.

3GPP standardized the EVS codec in September 2014, defined in Release 12, primarily for VoLTE but also for VoWiFi and fixed VoIP. EVS was developed jointly by operators, terminal, infrastructure and chip vendors, and experts in speech and audio coding, including Ericsson, Fraunhofer IIS, Huawei, Nokia, NTT, NTT DOCOMO, Orange, Panasonic, Qualcomm, Samsung Electronics, VoiceAge and ZTE. It is the best-performing, highest-quality speech and audio codec in 3GPP so far: it covers the full band (sample rates from 8 kHz to 48 kHz), works at bitrates from 5.9 kbps to 128 kbps, delivers very high quality for both speech and music signals, and is highly robust against frame loss and delay jitter, bringing users a new experience.

The figure below lists the 3GPP EVS-related specifications, from TS 26.441 to TS 26.451.

I have marked the key ones with red boxes. TS 26.441 is the overview. TS 26.442 is the fixed-point reference implementation written in C; it carries most of the weight in the work of using EVS described later. TS 26.444 provides the test sequences: while optimizing the reference code I saved a new version almost every day and ran it against the test sequences; whenever the output differed, the optimization had introduced a problem, and I rolled back to the previous version to find which step broke it. TS 26.445 is the detailed description of the EVS algorithms, nearly 700 pages; frankly it is a headache to read. If you are not doing algorithm work, a rough pass over the algorithm sections is enough, but the feature descriptions must be read carefully.

EVS uses different coders for speech signals and music signals. The speech coder is an improved Algebraic Code-Excited Linear Prediction (ACELP) coder, with linear prediction modes tailored to different speech classes. Music is coded in the frequency domain (MDCT), with particular attention to coding efficiency at low delay and low bitrate, enabling seamless and reliable switching between the speech and audio coders. The figure below is a block diagram of the EVS codec:

At encode time the input PCM signal is preprocessed and classified as speech or audio. A speech signal is encoded by the speech encoder to produce the bitstream; an audio signal is encoded by the perceptual encoder. At decode time, information in the bitstream determines whether the frame is speech or audio: a speech frame is decoded by the speech decoder into PCM data and then bandwidth-extended; an audio frame is decoded by the perceptual decoder and then bandwidth-extended in the frequency domain. Finally, the result is post-processed and becomes the output of the EVS decoder.

The key features of EVS are explained below.

1. EVS supports the full band (sample rates from 8 kHz to 48 kHz) at bitrates from 5.9 kbps to 128 kbps, with a frame length of 20 ms. The figure below shows the audio bandwidth classes:

Narrowband (NB) covers 300 Hz to 3400 Hz at an 8 kHz sample rate; AMR-NB uses this rate. Wideband (WB) covers 50 Hz to 7000 Hz at 16 kHz; AMR-WB uses this rate. Super-wideband (SWB) covers 20 Hz to 14000 Hz at 32 kHz. Fullband (FB) covers 20 Hz to 20000 Hz at 48 kHz. Since EVS supports the full band, it supports all four sample rates: 8 kHz, 16 kHz, 32 kHz and 48 kHz.

The figure below shows the bitrates supported at each sample rate:

As the figure shows, only WB supports the full bitrate range; the other sample rates support only a subset. It is important to note that EVS is backward compatible with AMR-WB, so it also supports all AMR-WB bitrates.

2. EVS supports DTX/VAD/CNG/SID, as AMR-WB does. In a call, each party typically speaks about half the time and listens the rest; there is no need to send voice packets to the other side while listening, hence DTX (Discontinuous Transmission). A VAD (Voice Activity Detection) algorithm decides whether the signal is speech or silence: speech is sent as voice packets, silence as SID packets. On receiving a SID packet, the other side uses a CNG (Comfort Noise Generation) algorithm to synthesize comfort noise. EVS has two CNG algorithms: linear-prediction-domain based CNG and frequency-domain based CNG. The SID transmission scheme in EVS differs from AMR-WB's. In AMR-WB, when the VAD detects silence it sends one SID packet, a second one 40 ms later, and then one every 160 ms; as soon as the VAD detects speech, voice packets are sent immediately. In EVS, SID packets can be sent at a fixed interval (every N frames, N from 3 to 100), or adaptively based on SNR, with the period ranging from 8 to 50 frames. The SID payload size also differs: AMR-WB's is 40 bits (50 × 40 = 2000 bps if sent every frame) while EVS's is 48 bits (2400 bps). From the above, DTX has two advantages: it saves bandwidth and increases capacity, and by skipping codec work during silence it reduces computation, which lowers power consumption and extends battery life.
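The fixed-interval SID scheme above can be sketched as a small decision function. This is a minimal sketch, not code from the reference implementation; all names here are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical frame types for a DTX sketch (not reference-code names). */
typedef enum { FRAME_SPEECH, FRAME_SID, FRAME_NO_DATA } frame_type;

/* Fixed-interval SID scheduler: while the VAD reports silence, emit one
 * SID every sid_interval frames (EVS allows 3..100) and suppress the
 * rest. Speech frames are always sent and reset the counter. */
typedef struct {
    int sid_interval;   /* frames between SID updates, 3..100 */
    int since_last_sid; /* frames since the last SID was sent; 0 = in speech */
} dtx_state;

static frame_type dtx_decide(dtx_state *st, bool vad_speech)
{
    if (vad_speech) {
        st->since_last_sid = 0;   /* speech resumes: restart the interval */
        return FRAME_SPEECH;
    }
    if (st->since_last_sid == 0) {
        st->since_last_sid = 1;
        return FRAME_SID;         /* first silent frame: send a SID */
    }
    if (++st->since_last_sid > st->sid_interval) {
        st->since_last_sid = 1;
        return FRAME_SID;         /* periodic SID update */
    }
    return FRAME_NO_DATA;         /* frame suppressed by DTX */
}
```

With a 20 ms frame and an interval of 8, this sends one SID every 160 ms of silence, which also matches the AMR-WB cadence described above.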

3. EVS also supports PLC (Packet Loss Concealment), as AMR-WB does. But EVS additionally includes a Jitter Buffer Management (JBM) module, something never seen in earlier codecs. I did not use the JBM; time was tight and there was no time to study it. I must study it when time allows: the jitter buffer is one of the hard problems in voice communication and one of the bottlenecks of voice quality.

The algorithmic delay of EVS depends on the sample rate. For WB/SWB/FB, the total delay is 32 ms: a 20 ms frame, 0.9375 ms of input-resampling delay plus 8.75 ms of lookahead on the encoder side, and 2.3125 ms of time-domain bandwidth-extension delay on the decoder side. For NB, the total delay drops to 30.9375 ms, 1.0625 ms less than WB/SWB/FB, with the reduction mainly on the decoder side.

The voice quality (MOS) of EVS is a significant improvement over AMR-NB/AMR-WB. The figure below compares the MOS values of these codecs:

It shows that at NB the MOS of EVS-NB is clearly higher than that of AMR-NB, and at WB the MOS of EVS-WB is clearly higher than AMR-WB's at every bitrate. At SWB with bitrates above 15 kbps, the MOS of EVS-SWB approaches that of uncoded PCM. The voice quality of EVS is clearly quite good.

The work needed to put EVS to use varies by platform. I used it on a phone's audio DSP for voice communication; here is what I did to support EVS on the phone.

1. Study the EVS specs. I read the specs listed above. Since I was not doing algorithm work, I skimmed the algorithm sections, but read the feature descriptions carefully; they matter for the work that follows.

2. Build the encoder/decoder applications on a PC. I did this on Ubuntu: feed a PCM file to the encoder to generate bitstream files under different configurations, then feed a bitstream file to the decoder and recover a PCM file. If the decoded PCM sounds like the original, the implementation is credible (an algorithm released by an authoritative body is credible; if it sounded strange, the applications were built wrong). The applications serve the later optimization work and also help in understanding the peripheral code, such as how coded values are turned into a bitstream. The coded values are stored in indices (up to 1953 of them); each index has two member variables: nb_bits, the number of bits the index occupies, and value, the index's value. There are two storage formats for the indices: G192 (ITU-T G.192) and MIME (Multipurpose Internet Mail Extensions). First, G192; the per-frame G192 layout is shown below:
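The index list described above can be pictured with a small struct sketch. The field names nb_bits and value and the limit of 1953 come from the description above; the surrounding struct and helper are my own simplification, not the reference code's actual types.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of how coded parameters are held: an array of "indices", each
 * with a value and the number of bits it occupies. 1953 is the maximum
 * number of indices mentioned in the text. */
#define MAX_NUM_INDICES 1953

typedef struct {
    int16_t  nb_bits;  /* how many bits this index occupies */
    uint16_t value;    /* the coded value itself */
} Indice;

typedef struct {
    Indice ind_list[MAX_NUM_INDICES];
    int    nb_indices; /* indices actually written this frame */
} Bitstream;

/* Total payload size of a frame is the sum of nb_bits over all indices. */
static int total_bits(const Bitstream *bs)
{
    int total = 0;
    for (int i = 0; i < bs->nb_indices; i++)
        total += bs->ind_list[i].nb_bits;
    return total;
}
```

At 8000 bps, total_bits() for a frame would come to 160, which is the figure used in the G192 and MIME examples below.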

The first word is a sync word, 0x6B21 for a good frame and 0x6B20 for a bad frame; the second word is the length; then each bit follows as one word, 0x0081 for a 1 and 0x007F for a 0. That is, each index's value is written out bit by bit in binary, with set bits stored as 0x0081 and clear bits as 0x007F. For example, at a 16000 Hz sample rate and 8000 bps, a frame has 160 bits (160 = 8000 / 50), so in G192 format it is stored as 160 words of content. The header is 0x6B21, marking a good frame; the length is 0x00A0; and the following 160 words are the content.
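The G192 frame layout just described can be serialized in a few lines. This is a minimal sketch of a writer for that layout; the function name is mine, not from the reference code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* G.192 softbit constants, as described in the text. */
#define G192_GOOD 0x6B21u  /* good-frame sync word */
#define G192_BAD  0x6B20u  /* bad-frame sync word  */
#define G192_ONE  0x0081u  /* a 1 bit */
#define G192_ZERO 0x007Fu  /* a 0 bit */

/* Serialize one frame: sync word, length word, then one word per bit.
 * bits[] holds one bit per entry (0 or 1); out[] must have room for
 * nbits + 2 words. Returns the number of words written. */
static size_t g192_write_frame(const uint8_t *bits, uint16_t nbits,
                               int good, uint16_t *out)
{
    out[0] = good ? G192_GOOD : G192_BAD;
    out[1] = nbits;
    for (uint16_t i = 0; i < nbits; i++)
        out[2 + i] = bits[i] ? G192_ONE : G192_ZERO;
    return (size_t)nbits + 2;
}
```

For the 8000 bps example, nbits is 160, so the length word is 0x00A0 and the whole frame occupies 162 words.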

Now the MIME format, which packs the index values into a serialized bitstream; for the details of the packing, see the pack_bit() function. In MIME format the first byte is a header (the low 4 bits hold the rate index; the 5th and 6th bits are set to 1 when the AMR-WB interoperable mode is used, and are not needed for EVS primary mode), followed by the bitstream. Taking the same example of a 16000 Hz sample rate at 8000 bps: in MIME format a frame's 160 bits need 20 bytes (20 = 160 / 8), as shown:

The first 16 bytes are the file header written by the reference code; the 17th byte, 0x02, is the frame header, indicating EVS coding at 8 kbps (the rate index for 8 kbps is 2); and the 20 bytes after it are the packed payload.

In voice communication, the index values are packed into a serialized bitstream and sent to the peer as the payload; the peer unpacks the received payload and decodes it into PCM values.
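The packing step can be sketched as writing each index's bits MSB-first into a byte buffer. This is a simplified stand-in for the reference code's pack_bit(), written byte-at-a-time for clarity; it is not the actual reference implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Append one coded value of nb_bits bits, MSB first, to a byte buffer.
 * buf must be zero-initialized; *bitpos tracks the running bit position
 * across calls, so successive indices concatenate into one bitstream. */
static void pack_bits(uint8_t *buf, int *bitpos, uint16_t value, int nb_bits)
{
    for (int b = nb_bits - 1; b >= 0; b--) {
        if ((value >> b) & 1)
            buf[*bitpos >> 3] |= (uint8_t)(0x80 >> (*bitpos & 7));
        (*bitpos)++;
    }
}
```

Calling this once per index, in order, yields the serialized payload; the unpack direction reads bits back out the same way to rebuild the (nb_bits, value) pairs for the decoder.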

3. The original reference code usually cannot be used directly and needs optimization. For how to optimize, see an earlier article of mine ("Audio codecs and methods and experience for optimizing them"), which covers the general approach. The DSP I was using runs at a low clock, only a bit over 300 MHz, so relying on compiler optimization alone was not certain to be enough. I had not written assembly for this DSP before, and optimizing it well in a short time would have been very difficult, so the boss weighed the options and decided to use the optimized library provided by the DSP IP vendor; they are more professional at the assembly work.

4. Modify the reference-code applications so they can serve as debugging tools later. The original reference code saves files byte by byte, while the DSP works in 16-bit words, so the pack/unpack functions in the reference code were modified to fit the DSP.
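The byte-to-word adaptation can be sketched as repacking the byte payload into 16-bit words. This is only an illustration under an assumed big-endian byte order, high byte first; the real modification depends on the DSP's conventions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Repack a byte payload into 16-bit words, two bytes per word, high
 * byte first. An odd trailing byte lands in the high half of the last
 * word, with the low half zero-padded. Returns the word count. */
static size_t bytes_to_words(const uint8_t *in, size_t nbytes, uint16_t *out)
{
    size_t nwords = (nbytes + 1) / 2;
    for (size_t w = 0; w < nwords; w++) {
        uint16_t hi = in[2 * w];
        uint16_t lo = (2 * w + 1 < nbytes) ? in[2 * w + 1] : 0;
        out[w] = (uint16_t)((hi << 8) | lo);
    }
    return nwords;
}
```

The inverse (word-to-byte) conversion is needed on the way back out, so both pack and unpack paths in the tools had to agree on the same byte order.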

5. To select EVS as the call codec, corresponding code must be added on both the audio DSP and the CP: first write and self-test your own part, then integrate. I reused the AMR-WB shell (both EVS and AMR-WB use 20 ms frames), so the call flow stays AMR-WB's while the codec changes from AMR-WB to EVS. The main things to verify are the encoder, pack, unpack and decoder, where encoder and pack form the uplink and unpack and decoder form the downlink. Their relationship is as follows:

First tune the uplink. Save the encoder output in G192 format, decode it with the decoder tool, and listen to the PCM in CoolEdit; if it matches what was spoken, the encoder is OK. Then tune pack: save the packed bitstream in MIME format, decode it with the same decoder tool, and listen in CoolEdit; if it matches, pack is OK. Then tune the downlink. Since the CP could not yet deliver a correct EVS stream to the audio DSP, debugging was done in loopback: feed the packed bitstream into unpack, save unpack's output in G192 format, decode it with the decoder tool, and listen in CoolEdit; if it matches, unpack is OK. Finally tune the decoder: listen to its PCM output in CoolEdit; if it matches, the decoder is OK. That completes the self-test.

6. Integrate with the CP. Because the key modules had all been verified during self-testing, integration went relatively smoothly and took only a few days. After that, calls can enjoy the high audio quality of EVS.
