Version 1.0, 2008-11-11
1 VoIP Problems
1.1 Problems caused by terminals:
Background noise:
Echo:
Input/output signal level: mismatch in I/O signal intensity;
Amplitude clipping: occurs when the signal level is so large that it exceeds the range representable by the data bits;
Quantization distortion: introduced during analog-to-digital conversion; usually negligible;
Codec distortion: introduced by lossy compression;
Time clipping: caused by silence-suppression technology;
Multiple simultaneous speakers: affects the processing algorithms;
1.2 Network problems:
Circuit noise:
Frequency-dependent distortion: analog lines (e.g., modem links) have different electrical transmission characteristics at different frequencies, so some frequency components travel faster than others;
Delay: the time from when the speaker starts talking until the listener hears it;
Jitter: packet arrival intervals are irregular;
Echo: hearing your own voice delayed by more than 25 ms can make the call unpleasant;
Random bit errors: if an error occurs during packet transmission, the entire packet is discarded and retransmitted;
Burst errors: e.g., packet loss;
Quantization and codec distortion;
2 Problem details
2.1 Latency:
Latency can be broken into several components: serialization delay (the time to clock the data onto the interface; negligible), propagation delay (a signal travelling halfway around the Earth takes roughly 70 ms one way), processing delay (for G.711 with a 20 ms packet size there is a fixed 20 ms packetization delay, while G.729 adds an initial 5 ms look-ahead delay), and queuing delay/output blocking (data stuck waiting in the output queue). A rough delay budget is sketched below.
The ITU-T G.114 recommendation stipulates that one-way end-to-end latency should not exceed 150 ms for good speech quality.
Latency also makes echo problems more noticeable;
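To see how these components add up against the 150 ms G.114 budget, here is a rough worked sketch in Python; the propagation and packetization figures come from the notes above, while the queuing and jitter-buffer values are illustrative assumptions.

# Rough one-way delay budget for a G.711 call with 20 ms packets.
# Propagation (~70 ms halfway around the Earth) and packetization (20 ms)
# come from the notes above; the queuing and jitter-buffer values are
# illustrative assumptions only.
delay_budget_ms = {
    "packetization (G.711, 20 ms frames)": 20,
    "serialization": 0,                      # negligible on fast interfaces
    "propagation (intercontinental)": 70,
    "queuing / output blocking (assumed)": 10,
    "jitter buffer (assumed)": 40,
}

total = sum(delay_budget_ms.values())
for name, ms in delay_budget_ms.items():
    print(f"{name:40s} {ms:3d} ms")
print(f"{'total one-way delay':40s} {total:3d} ms "
      f"({'within' if total <= 150 else 'exceeds'} the 150 ms G.114 limit)")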
2.2 Echo:
Echo is divided into circuit (electrical) echo, typically produced by impedance mismatch in the 2-wire/4-wire hybrid converter, and acoustic echo, produced when the loudspeaker's output is picked up by the microphone; acoustic echo is further divided into direct and indirect echo depending on whether the sound is reflected off the environment. VoIP mainly has to deal with acoustic echo;
Echo suppression: compares the sound about to be played with the sound picked up from the microphone and attenuates one direction; it is not very effective and may cause intermittent (choppy) audio;
Acoustic echo canceller (AEC): builds a model from the correlation between the loudspeaker signal and the echo it generates, estimates an echo signal that approximates the real echo, and removes that estimate from the audio data to be encoded (a sketch of this estimate-and-subtract approach follows below); one of its parameters is the echo tail, i.e., how long to wait for the far-end speech to return as echo; this parameter must be set appropriately;
According to Dialogic documentation, acoustic echo is generally handled by the terminals (including IP phones and gateways); the PSTN does not process it. HMP echo cancellation handles only electrical echo: when a DNI board is used to connect to the PSTN, the board provides on-board echo cancellation; DSI boards and IP access have no electrical echo, so no echo cancellation is needed there (the IP terminal should already have performed echo cancellation when packing voice into IP packets). Therefore echo cancellation is not required in our pure-IP network.
Echo cancellation is applied at the terminal to prevent echo from being encoded into the voice stream, improving the speech quality experienced by the far end.
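As an illustration of the estimate-and-subtract idea behind an AEC, here is a minimal sketch using an NLMS adaptive filter; it is not the algorithm of any particular product, and the tail_samples parameter plays the role of the echo tail mentioned above.

import numpy as np

def nlms_echo_cancel(far_end, mic, tail_samples=256, mu=0.5, eps=1e-6):
    """Sketch of acoustic echo cancellation with an NLMS adaptive filter.

    far_end      : samples played out of the loudspeaker (reference signal)
    mic          : samples picked up by the microphone (near speech + echo)
    tail_samples : filter length, i.e. the echo tail expressed in samples
    Returns the echo-reduced microphone signal.
    """
    w = np.zeros(tail_samples)            # estimated echo path
    out = np.zeros(len(mic))
    x_buf = np.zeros(tail_samples)        # most recent far-end samples

    for n in range(len(mic)):
        x_buf = np.roll(x_buf, 1)
        x_buf[0] = far_end[n]
        echo_est = np.dot(w, x_buf)       # estimated echo at this sample
        e = mic[n] - echo_est             # residual = near speech + estimation error
        out[n] = e
        # NLMS weight update, normalized by the reference signal power
        w += mu * e * x_buf / (np.dot(x_buf, x_buf) + eps)
    return out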
2.3 Jitter:
A jitter buffer is generally used to absorb jitter, but it increases the overall system delay; for the best trade-off, the jitter buffer size should be adjusted dynamically according to the jitter actually observed;
The jitter level can be estimated from the RTP timestamps (as done in Cisco IOS);
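One standard way to estimate jitter from RTP timestamps is the interarrival-jitter formula of RFC 3550; the sketch below assumes arrival times have already been converted into the same units as the RTP timestamps.

def rfc3550_jitter(arrivals, rtp_timestamps):
    """Interarrival jitter estimate as defined in RFC 3550.

    arrivals       : packet arrival times, expressed in RTP timestamp units
    rtp_timestamps : RTP timestamps taken from the packet headers
    Returns the running jitter estimate after the last packet.
    """
    jitter = 0.0
    for i in range(1, len(arrivals)):
        # D(i-1, i): change in the packets' relative transit time
        d = (arrivals[i] - rtp_timestamps[i]) - (arrivals[i - 1] - rtp_timestamps[i - 1])
        jitter += (abs(d) - jitter) / 16.0   # 1/16 smoothing per the RFC
    return jitter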
2.4 Packet loss detection and concealment:
A simple concealment policy: if a new voice packet is not received within the expected time, replay the last packet received; G.729 can tolerate about 5% packet loss over the whole call (per Cisco);
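A minimal sketch of that replay-the-last-packet policy, assuming a playout routine called once per packet interval; the function and state names are illustrative.

def playout_step(received_frame, state):
    """One playout tick: play the new frame if it arrived in time,
    otherwise conceal the loss by replaying the last good frame.

    received_frame : decoded audio frame, or None if nothing arrived in time
    state          : dict holding the last frame played
    """
    if received_frame is not None:
        state["last_frame"] = received_frame
        return received_frame
    # Loss concealment: repeat the previous frame (or silence if none yet)
    return state.get("last_frame", b"\x00" * 160)   # 160 bytes = 20 ms of G.711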
2.5 Noise:
2.6 Voice Activity Detection (VAD, silence suppression):
Whether speech is present is determined from changes in the voice level (dB). Typically, after the last speech amplitude is detected the system waits about 200 ms (the hangover time) before it stops packetizing new speech data. Drawbacks of this algorithm: it is difficult to distinguish speech from noise, and it takes some time for the system to switch from the suppression state back to the transmission state, so the beginning of the speaker's first words can be clipped;
RFC 3389's description of silence suppression in RTP:
"RTP allows discontinuous transmission (silence suppression) on any audio payload format. The receiver can detect silence suppression on the first packet received after the silence by observing that the RTP timestamp is not contiguous with the end of the interval covered by the previous packet even though the RTP sequence number has incremented only by one. The RTP marker bit is also normally set on such a packet."
In other words, the silence duration is obtained from the timestamp, not from the sequence number. The timestamp increment depends on the sampling rate: for 8 kHz G.711 data, a 20 ms RTP packet covers 8000*20/1000 = 160 samples, so consecutive RTP packets differ in timestamp by 160.
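A small sketch of that check, following the RFC 3389 observation: the sequence number advanced by exactly one but the timestamp jumped by more than one packet's worth of samples.

SAMPLES_PER_PACKET = 8000 * 20 // 1000   # 160 samples per 20 ms G.711 packet

def silence_gap(prev_seq, prev_ts, seq, ts):
    """Return the length of a silence-suppressed gap in samples, or 0.

    Detects the RFC 3389 pattern: the sequence number has incremented by
    exactly one, but the timestamp is not contiguous with the interval
    covered by the previous packet.
    """
    seq_delta = (seq - prev_seq) & 0xFFFF          # sequence numbers are 16-bit
    ts_delta = (ts - prev_ts) & 0xFFFFFFFF         # timestamps are 32-bit
    if seq_delta == 1 and ts_delta > SAMPLES_PER_PACKET:
        return ts_delta - SAMPLES_PER_PACKET       # duration of the silence
    return 0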
2.7 Audio mixing:
Requires a second encode/decode cycle;
Each input is decoded back to its pre-compression form, the inputs are linearly summed, adaptive gain control is applied, and the result is re-encoded for output (sketched below);
Problem caused by mixing: increased one-way latency;
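A minimal sketch of that decode-sum-gain-encode pipeline; the decode and encode arguments are placeholders standing in for the real codec functions.

import numpy as np

def mix_parties(encoded_inputs, decode, encode, target_peak=0.8):
    """Sketch of conference mixing: decode each leg, linearly sum the
    signals, apply a simple gain adjustment, then re-encode the mix.

    encoded_inputs : list of encoded frames, one per participant
    decode, encode : codec functions (placeholders for the real codecs)
    """
    pcm = [decode(frame).astype(np.float64) for frame in encoded_inputs]
    mix = np.sum(pcm, axis=0)                 # linear overlay of all legs

    # Simple gain adjustment so the summed signal does not clip 16-bit PCM
    peak = float(np.max(np.abs(mix)))
    limit = target_peak * 32767
    if peak > limit:
        mix *= limit / peak

    return encode(mix.astype(np.int16))       # the second encode step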
2.8 AGC:
AGC solves the problem of mismatched signal levels between the two directions of a speech stream; it is the process of adjusting the signal toward a standard (target) level;
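A very simple per-frame AGC sketch, assuming 16-bit PCM input; the target RMS level and maximum gain are illustrative values, not standard ones.

import numpy as np

def agc_frame(frame, target_rms=3000.0, max_gain=10.0):
    """Per-frame AGC sketch: scale the frame so its RMS level approaches a
    target value; target_rms and max_gain are illustrative assumptions.

    frame : one audio frame as 16-bit PCM samples (numpy int16 array)
    """
    x = frame.astype(np.float64)
    rms = np.sqrt(np.mean(x * x)) + 1e-9
    gain = min(target_rms / rms, max_gain)         # limit boost of quiet/noise frames
    y = np.clip(x * gain, -32768, 32767)           # avoid amplitude clipping
    return y.astype(np.int16)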
3 Speech compression and encoding technology
The 8 kHz sampling rate of PCM comes from the Nyquist theorem: if a signal is sampled at twice its highest frequency, it can be completely reconstructed in analog form, and most speech energy lies below 4 kHz;
Both A-law and μ-law use companding to fit roughly 12-13 bit PCM quality into 8 bits. In some cases μ-law sounds slightly better than A-law; by convention, the μ-law country is responsible for converting between μ-law and A-law;
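To make the companding idea concrete, here is a sketch of the continuous μ-law curve with μ = 255; note this is the mathematical formula, not the segmented 8-bit code layout actually used by G.711.

import math

MU = 255.0

def mu_law_compress(x):
    """Continuous mu-law companding of a sample x in [-1, 1]."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def mu_law_expand(y):
    """Inverse of mu_law_compress for y in [-1, 1]."""
    return math.copysign((math.pow(1.0 + MU, abs(y)) - 1.0) / MU, y)

# Quiet samples get proportionally more of the 8-bit code space than loud
# ones, which is how 8 bits approach the quality of 12-13 bit linear PCM.
for x in (0.01, 0.1, 0.5, 1.0):
    print(f"x={x:4.2f} -> compressed {mu_law_compress(x):.3f}")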
Encoder classification: waveform coders (PCM and ADPCM), which compress by exploiting redundancy in the waveform itself; and source coders, which model how the voice is produced and transmit only simplified feature information of the original speech, including LPC (Linear Predictive Coding), CELP (Code-Excited Linear Prediction), MP-MLQ (Multi-Pulse, Multi-Level Quantization), etc.
Relationship between the G.7xx series and PCM: the former are standards, the latter is an algorithm;
Criteria for judging a codec: bit rate, latency, complexity, and quality;
Evaluation of speech quality: subjective evaluation with MOS (Mean Opinion Score), given by listeners based on their subjective impression; objective evaluation with PSQM (Perceptual Speech Quality Measurement), computed by a machine from objective measurement data.
4 Related Protocols
RTP: the sequence number determines whether packets arrive in order; the timestamp is used to measure jitter (the header fields are shown in the sketch after this list);
RTCP: can be used to support conference applications, synchronize different media streams, and provide QoS feedback;
cRTP: RTP header compression;
RUDP: reliable UDP; the basic approach is to transmit multiple identical packets and have the receiver discard the duplicates;
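As an illustration of where the sequence number and timestamp mentioned above live, here is a sketch that unpacks the 12-byte fixed RTP header defined in RFC 3550.

import struct

def parse_rtp_header(packet):
    """Parse the 12-byte fixed RTP header (RFC 3550).

    Returns the fields relevant to ordering (sequence number) and to
    jitter/silence detection (timestamp), plus the marker bit.
    """
    if len(packet) < 12:
        raise ValueError("packet shorter than the fixed RTP header")
    b0, b1, seq, ts, ssrc = struct.unpack("!BBHII", packet[:12])
    return {
        "version": b0 >> 6,
        "payload_type": b1 & 0x7F,
        "marker": bool(b1 & 0x80),     # typically set on the first packet after silence
        "sequence_number": seq,        # detects reordering and loss
        "timestamp": ts,               # sampling instant; used for jitter
        "ssrc": ssrc,
    }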
5 References
Packet Voice Technology and Network Implementation Solutions
Integrated Voice and Data Networks
VoIP Technical Architecture (version 2)