The previous article covered latency in voice communication and ways to reduce it. Starting with this article I will move on to WebRTC's NetEQ. NetEQ is one of the two core audio technologies in WebRTC (the other is the audio front-end/back-end processing, i.e. AEC, ANS, AGC and so on, commonly known as the 3A algorithms). WebRTC is Google's open-sourced repackaging of GIPS and is now a full-featured real-time audio and video communication solution. Most domestic Internet companies building real-time audio/video products base them on WebRTC: some use the WebRTC solution directly, others take only its core technologies, such as the 3A algorithms. And it is not only Internet companies; other kinds of companies (communication companies, for example) also put the essence of WebRTC into their own products.

When I first started working on voice engines, WebRTC was not yet open source, but I already knew that GIPS was a top company in real-time voice communication. After WebRTC was open-sourced I did not have a chance to use it at first; later, when doing OTT (app) voice, I used WebRTC's 3A algorithms. When I moved to audio development on the Android mobile platform, I used NetEQ from WebRTC, but the earlier C-language version rather than the C++ version, and I only worked on the DSP module of NetEQ. NetEQ has two modules, the MCU (micro control unit) and the DSP (signal processing unit). The MCU is responsible for controlling the insertion of voice packets received from the network into the jitter buffer and their extraction from it, and for deciding which algorithm the DSP module uses to process the decoded PCM data; the DSP is responsible for decoding and for processing the decoded PCM signal, and the main PCM processing algorithms are accelerate, preemptive expand (slow down), packet loss concealment, merge and so on. In our product the MCU module ran on the CP (communication processor), and the two modules interacted through messages. After debugging the DSP module I had basically mastered its mechanism. Since the MCU module ran on the CP, I had no source code for it, so I found the corresponding open-source WebRTC version on the Internet and, after studying it for a while, basically worked out its mechanism as well.

Starting with this article I will spend several articles on NetEQ (based on the early C-language version I used). Note that every product adapts the WebRTC code to its own characteristics, and the product I worked on is no exception; I will not go into those details, only the main mechanisms. This article first gives an overview of NetEQ.
The software architecture diagram for real-time IP voice communication is usually as follows:
On the sending side (uplink, TX), the voice data captured from the mic is first pre-processed, then encoded into a bitstream, packed into RTP packets and sent to the peer over a UDP socket. The receiving side (downlink, RX) receives the voice packets from the UDP socket, parses the RTP packets and puts them into the jitter buffer; each time playout is due, a packet is taken out of the jitter buffer and decoded into PCM data, which is post-processed and then sent to the player.
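For reference, "parsing the RTP packet" on the receive side essentially means reading the 12-byte fixed RTP header defined in RFC 3550 to get the payload type, sequence number and timestamp that the jitter buffer needs. Below is a minimal, generic sketch of that step; it is not the WebRTC parser and ignores optional CSRC entries and header extensions.

```c
#include <stdint.h>

/* Fields of the fixed 12-byte RTP header (RFC 3550) that the
 * jitter buffer cares about. Generic sketch, not WebRTC's parser. */
typedef struct {
    uint8_t  payload_type;
    uint16_t seq;
    uint32_t timestamp;
    uint32_t ssrc;
} RtpHeader;

/* Returns 0 on success, -1 if the packet is too short or not RTP v2. */
static int parse_rtp_header(const uint8_t* p, int len, RtpHeader* h)
{
    if (len < 12 || (p[0] >> 6) != 2)          /* version must be 2 */
        return -1;
    h->payload_type = p[1] & 0x7f;
    h->seq       = (uint16_t)((p[2] << 8) | p[3]);
    h->timestamp = ((uint32_t)p[4] << 24) | ((uint32_t)p[5] << 16) |
                   ((uint32_t)p[6] << 8)  |  (uint32_t)p[7];
    h->ssrc      = ((uint32_t)p[8] << 24) | ((uint32_t)p[9] << 16) |
                   ((uint32_t)p[10] << 8) |  (uint32_t)p[11];
    return 0;
}
```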
The NetEQ module sits on the receiving side: it combines the jitter buffer and the decoder and adds processing of the decoded PCM signal, i.e. NetEQ = jitter buffer + decoder + PCM signal processing. With NetEQ, the software architecture block diagram becomes:
As mentioned above, the NetEQ module mainly consists of two units, the MCU and the DSP. Its software block diagram is as follows:
As the two figures above show, the jitter buffer (that is, the packet buffer; from here on I will follow NetEQ and call it the packet buffer; it is used to remove network jitter) is in the MCU unit, while the decoder and the PCM signal processing are in the DSP unit. The MCU unit is mainly responsible for inserting the voice packets received from the network (after RTP parsing) into the packet buffer, and for extracting voice packets from the packet buffer and handing them to the DSP unit for decoding, signal processing and so on. It also calculates the network delay (optBufLevel) and the jitter buffer delay (buffLevelFilt), and based on the network delay, the jitter buffer delay and other factors (such as how the previous frame was processed) decides which signal processing command to give to the DSP unit. There are five main signal processing commands (a simplified decision sketch follows this paragraph):
1, normal: play the decoded audio as-is, no extra signal processing is needed.
2, accelerate: used when the call delay is large; the accelerate algorithm shortens the speech without losing voice information, thereby reducing the delay.
3, preemptive expand (slow down): used when the voice is choppy; the deceleration algorithm lengthens the speech without losing voice information, thereby reducing the discontinuity.
4, packet loss concealment (PLC): used when a packet is lost; the concealment algorithm generates substitute audio for the lost voice.
5, merge: used when the previous frame was lost and the current packet is received normally; since the previous frame was reconstructed by the concealment algorithm, a merge is needed to smooth the transition between the concealed audio and the currently received voice.
This signal processing improves voice quality under harsh network conditions and enhances the user experience. It is arguably the best openly available solution for dealing with network packet loss, delay and jitter.
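To make the five commands concrete, here is a deliberately simplified decision sketch. It only illustrates the idea; the real NetEQ decision logic weighs many more factors (peak detectors, stretch limits, codec state and so on), and all identifiers below are hypothetical rather than actual WebRTC symbols.

```c
/* The five signal-processing commands the MCU can give the DSP. */
typedef enum {
    kCmdNormal,            /* play decoded audio as-is              */
    kCmdAccelerate,        /* time-compress to reduce delay         */
    kCmdPreemptiveExpand,  /* time-stretch to bridge discontinuity  */
    kCmdExpand,            /* packet loss concealment (PLC)         */
    kCmdMerge              /* smooth PLC output into a new packet   */
} NetEqCommand;

/* Simplified illustration of the MCU decision step. */
NetEqCommand decide_command(int packet_available,   /* next packet found in buffer?    */
                            int prev_was_expand,    /* previous frame produced by PLC? */
                            int buff_level_filt,    /* smoothed jitter-buffer delay    */
                            int opt_buf_level)      /* estimated network delay target  */
{
    if (!packet_available)
        return kCmdExpand;               /* nothing to decode: conceal the loss   */
    if (prev_was_expand)
        return kCmdMerge;                /* packet arrived right after PLC: merge */
    if (buff_level_filt > opt_buf_level)
        return kCmdAccelerate;           /* buffer too full: shorten speech       */
    if (buff_level_filt < opt_buf_level)
        return kCmdPreemptiveExpand;     /* buffer running dry: stretch speech    */
    return kCmdNormal;
}
```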
The DSP unit is mainly responsible for decoding and for PCM signal processing. The bitstream extracted from the packet buffer is decoded into PCM data and put into decodedBuffer; the data is then processed according to the command given by the MCU, with the result placed in algorithmBuffer; finally the data in algorithmBuffer is appended to speechBuffer, from which the player takes data. The data in speechBuffer is divided into two parts: data that has already been played (playedOut) and data that has not yet been played (sampleLeft), with curPosition as the dividing point between the two. There is also a variable endTimestamp that records the timestamp of the last sample; it is reported to the MCU, which uses endTimestamp together with the timestamps of the packets in the packet buffer to decide whether a packet should be taken out and, if so, which one.
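The following is a minimal sketch of the DSP-side buffers described above. The names follow the text (decodedBuffer, algorithmBuffer, speechBuffer, curPosition, sampleLeft, endTimestamp), but the sizes, types and the helper function are placeholders I made up for illustration, not the actual WebRTC definitions.

```c
#include <stdint.h>
#include <string.h>

#define SPEECH_BUF_SAMPLES  2048   /* placeholder capacity                */
#define FRAME_SAMPLES        160   /* e.g. one 20 ms frame at 8 kHz       */

typedef struct {
    int16_t  decodedBuffer[FRAME_SAMPLES];       /* raw decoder output                 */
    int16_t  algorithmBuffer[2 * FRAME_SAMPLES]; /* output of accelerate/expand/merge  */
    int16_t  speechBuffer[SPEECH_BUF_SAMPLES];   /* playout buffer                     */
    int      curPosition;   /* boundary: [0, curPosition) has already been played      */
    int      sampleLeft;    /* samples after curPosition still waiting for playout     */
    uint32_t endTimestamp;  /* RTP timestamp of the last sample in speechBuffer        */
} DspState;

/* Append processed samples to the un-played part of speechBuffer. */
static void dsp_append(DspState* dsp, const int16_t* pcm, int n, uint32_t ts_last)
{
    if (dsp->curPosition + dsp->sampleLeft + n > SPEECH_BUF_SAMPLES)
        return;  /* no room: real code first shifts out already-played data */
    memcpy(&dsp->speechBuffer[dsp->curPosition + dsp->sampleLeft],
           pcm, (size_t)n * sizeof(int16_t));
    dsp->sampleLeft  += n;
    dsp->endTimestamp = ts_last;
}
```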
Here is a brief introduction to the NetEQ workflow; later articles will go into detail. The workflow has two parts: one is inserting RTP voice packets into the packet buffer, the other is extracting voice packets from the packet buffer, decoding them and doing PCM signal processing. First look at the process of inserting an RTP voice packet into the packet buffer; there are three main steps:
1, after the first RTP voice packet is received, initialize NetEQ.
2, parse the RTP voice packet and insert it into the packet buffer. Packets are inserted in the order in which they arrive; when the end of the buffer is reached, insertion starts again from the beginning. This is a fairly simple insertion scheme (a sketch follows this list).
3, calculate the network delay optBufLevel.
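Below is a minimal sketch of the arrival-order, wrap-around insertion described in step 2. The slot count, payload size and field layout are placeholders, not the actual WebRTC packet buffer.

```c
#include <stdint.h>
#include <string.h>

#define PACKET_BUF_SLOTS 64            /* placeholder capacity */

typedef struct {
    uint16_t seq;                      /* RTP sequence number     */
    uint32_t timestamp;                /* RTP timestamp           */
    int      len;                      /* payload length in bytes */
    uint8_t  payload[256];
    int      in_use;
} PacketSlot;

typedef struct {
    PacketSlot slots[PACKET_BUF_SLOTS];
    int        write_pos;              /* next slot to write; wraps around */
} PacketBuffer;

/* Insert in arrival order; when the end of the buffer is reached,
 * start again from the beginning (overwriting the oldest entry). */
static void packet_buffer_insert(PacketBuffer* pb, uint16_t seq,
                                 uint32_t ts, const uint8_t* data, int len)
{
    PacketSlot* s = &pb->slots[pb->write_pos];
    s->seq       = seq;
    s->timestamp = ts;
    s->len       = len > (int)sizeof(s->payload) ? (int)sizeof(s->payload) : len;
    memcpy(s->payload, data, (size_t)s->len);
    s->in_use    = 1;
    pb->write_pos = (pb->write_pos + 1) % PACKET_BUF_SLOTS;
}
```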
Now look at how a voice packet is extracted, decoded and its PCM data processed; there are six main steps:
1, the DSP module's endTimestamp is assigned to playedOutTS, and sampleLeft (the number of un-played samples in the speech buffer) is passed to the MCU, telling the MCU the current state of the DSP module.
2, check whether a voice packet should be taken out of the packet buffer and whether one can be taken out. The packet is chosen by traversing the entire packet buffer and finding the smallest timestamp that is greater than or equal to playedOutTS, recorded as availableTS; that packet is taken out (a sketch of this lookup follows the list). If the packet was lost, no packet is taken.
3, calculate the jitter buffer delay buffLevelFilt.
4, decide the MCU control command according to the network delay, the jitter buffer delay, the way the previous frame was processed and other factors.
5, if a packet was extracted from the packet buffer, decode it; otherwise do not decode.
6, according to the control command given by the MCU, do signal processing on the decoded data and on the data in the speech buffer.
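Here is a minimal sketch of the lookup in step 2, reusing the hypothetical PacketBuffer/PacketSlot layout from the insertion sketch above: scan the whole buffer for the smallest timestamp that is not older than playedOutTS; if nothing qualifies, the MCU treats it as a loss and orders packet loss concealment.

```c
/* Find the slot whose timestamp is the smallest value >= played_out_ts.
 * Returns the slot index, or -1 if no usable packet exists (treated as
 * packet loss, which leads to an expand/PLC command). Sketch only;
 * it ignores RTP timestamp wrap-around, which real code must handle. */
static int packet_buffer_find(const PacketBuffer* pb, uint32_t played_out_ts,
                              uint32_t* available_ts)
{
    int best = -1;
    for (int i = 0; i < PACKET_BUF_SLOTS; i++) {
        const PacketSlot* s = &pb->slots[i];
        if (!s->in_use || s->timestamp < played_out_ts)
            continue;                      /* empty slot or already-played data */
        if (best < 0 || s->timestamp < *available_ts) {
            best = i;
            *available_ts = s->timestamp;  /* current best candidate            */
        }
    }
    return best;
}
```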
In my personal opinion, NetEQ has two core technical points. One is the algorithms for estimating the current network delay and the jitter buffer delay: the signal processing command is decided from the network delay, the jitter buffer delay and other factors, and a good decision improves sound quality while a bad one degrades it, so this decision is critical. The other is the various signal processing algorithms themselves, mainly accelerate, preemptive expand (slow down), packet loss concealment (PLC), merge and background noise generation (BNG); these are all highly specialized algorithms.
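As an illustration of the first point, the jitter buffer delay is typically not the instantaneous number of buffered packets but a smoothed value, and a simple way to obtain such a value is a first-order exponential (IIR) filter like the one below. The coefficient and fixed-point scaling here are made up for the example and are not the values used by WebRTC NetEQ.

```c
#include <stdint.h>

/* Smooth the instantaneous packet-buffer level into buffLevelFilt with a
 * first-order IIR filter: filt = alpha*filt + (1-alpha)*current.
 * Q8 fixed point; ALPHA_Q8 = 251 is roughly 0.98. Illustrative values,
 * not the coefficients used by WebRTC NetEQ. */
#define ALPHA_Q8 251

static int32_t buff_level_filt_q8 = 0;   /* smoothed buffer level, Q8 */

static int update_buffer_level(int current_level_packets)
{
    buff_level_filt_q8 = (ALPHA_Q8 * buff_level_filt_q8 +
                          (256 - ALPHA_Q8) * (current_level_packets << 8)) >> 8;
    return (int)(buff_level_filt_q8 >> 8);  /* back to whole packets */
}
```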