Summary: This article introduces the principles and basic implementation process of VoIP and analyzes the composition of audio latency in an Ethernet environment. The experimental results show that audio latency in the Ethernet environment consists mainly of buffer latency and audio API call latency, with the API call latency being the dominant part. The article proposes a method to reduce API call latency by using the DirectSound interface functions, and discusses strategies for reducing API latency further.
With the rapid development of network technology, VoIP has come into wide use. Especially in the LAN environment, VoIP has become one of the main methods of instant communication because of its convenience and low cost. Latency is a key factor affecting VoIP voice quality. ITU-T G.114 specifies 300 ms as the acceptable latency for high-quality speech. Generally, when the latency is between 300 ms and 400 ms, call interactivity suffers but remains acceptable; when the latency exceeds 400 ms, interactive communication becomes very difficult. Ensuring real-time audio transmission is therefore one of the central problems of VoIP technology.
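As a rough illustration of those thresholds, the G.114 figures quoted above can be captured in a small helper (the function name and category labels are hypothetical, chosen only for this sketch):

```cpp
#include <string>

// Classify one-way voice latency against the ITU-T G.114 thresholds
// quoted in the text (300 ms acceptable, 400 ms upper bound).
std::string classifyLatency(int oneWayMs) {
    if (oneWayMs <= 300) return "good";        // high-quality interactive speech
    if (oneWayMs <= 400) return "acceptable";  // interactivity suffers but usable
    return "poor";                             // interactive communication very difficult
}
```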
This article first introduces the principles and basic implementation process of VoIP, then presents an experimental study of real-time audio transmission in the Ethernet environment, analyzing the impact of buffer settings and audio API calls on audio latency. Based on the analysis, countermeasures for reducing Ethernet audio latency are proposed.
1. VoIP principles and implementation process based on the PC Platform
The basic principle of VoIP is as follows: the sender compresses the captured raw voice data with a speech compression algorithm, packetizes the compressed data according to the TCP/IP standards, and sends the packets to the receiver over an IP network. The receiver decompresses the packetized voice and reassembles it into the original voice signal, so that the network can carry speech.
Figure 1 shows the VoIP implementation process on the PC platform. A PC-based VoIP application consists of a sending module, a receiving module, and network transmission. The sending module comprises audio capture, audio encoding, and voice packet encapsulation. The receiving module is essentially the inverse of the sending module and mainly comprises voice packet reception, audio decoding, and audio playback.
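The capture, compress, packetize, depacketize, decompress chain described above can be sketched end to end with stub stages. Everything here is illustrative: the "codec" is a trivial placeholder (not GSM 6.10 or any real speech codec), and the 2-byte sequence-number header is a toy stand-in for a real packet header.

```cpp
#include <cstdint>
#include <vector>

// Stub "codec": keep only the high byte of each 16-bit sample.
// Placeholder for illustration only, NOT a real speech codec.
std::vector<uint8_t> stubCompress(const std::vector<int16_t>& pcm) {
    std::vector<uint8_t> out;
    for (int16_t s : pcm) out.push_back(static_cast<uint8_t>((s >> 8) & 0xFF));
    return out;
}
std::vector<int16_t> stubDecompress(const std::vector<uint8_t>& data) {
    std::vector<int16_t> out;
    for (uint8_t b : data) out.push_back(static_cast<int16_t>(b << 8));
    return out;
}

// Packetize: prepend a 2-byte sequence number as a toy header.
std::vector<uint8_t> packetize(uint16_t seq, const std::vector<uint8_t>& payload) {
    std::vector<uint8_t> pkt{static_cast<uint8_t>(seq >> 8),
                             static_cast<uint8_t>(seq & 0xFF)};
    pkt.insert(pkt.end(), payload.begin(), payload.end());
    return pkt;
}
std::vector<uint8_t> depacketize(const std::vector<uint8_t>& pkt) {
    return std::vector<uint8_t>(pkt.begin() + 2, pkt.end());
}
```

A sender would run stubCompress then packetize; the receiver runs the two inverse stages in the opposite order, mirroring the module structure in Figure 1.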
Figure 1 VoIP implementation process based on the PC Platform
The following describes the functions and general implementation methods of each part.
The audio capture and playback module captures and plays back audio signals, converting between analog and digital speech. It is implemented mainly through audio API functions. In Windows, the common audio APIs are waveX, DirectSound, and ASIO.
The audio encoding and decoding module compresses and decompresses audio data. Because the volume of raw speech data captured at the sender is large, it must be compressed and encoded into a specific audio format; the receiver, in turn, must decode and restore the received voice data. In Windows, the ACM (Audio Compression Manager) manages all audio codecs in the system and performs the compression and decompression. A codec is a short piece of code that compresses and decompresses a data stream; a codec may ship with the operating system, or additional codecs may be installed by applications.
The voice packet encapsulation and reception modules add the appropriate headers to the compressed speech data to form voice packets, which are then handed to the transmission module. The TCP/IP protocol suite offers two transport-layer protocols: the connection-oriented Transmission Control Protocol (TCP) and the connectionless User Datagram Protocol (UDP). The difference is that UDP provides a connectionless service: no connection needs to be established before transmission, and the remote host sends no acknowledgement after receiving UDP data. TCP provides a connection-oriented service: a connection must be established before transmission and released afterwards. Audio applications generally use UDP because, although UDP provides no error retransmission, it preserves the real-time character of audio data.
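The connectionless pattern described above, one sendto and one recvfrom with no handshake or acknowledgement, can be shown with a UDP loopback sketch. The article's programs use Winsock; this sketch uses the nearly identical POSIX socket calls, and the port number 50000 is an arbitrary choice for the example:

```cpp
#include <arpa/inet.h>
#include <sys/socket.h>
#include <unistd.h>
#include <string>

// Send one "voice packet" over UDP loopback and receive it back.
// No connection setup, no acknowledgement: the UDP service model.
std::string udpLoopback(const std::string& payload) {
    int rx = socket(AF_INET, SOCK_DGRAM, 0);
    int tx = socket(AF_INET, SOCK_DGRAM, 0);
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(50000);                 // arbitrary example port
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    bind(rx, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
    sendto(tx, payload.data(), payload.size(), 0,
           reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
    char buf[1500];                               // typical Ethernet MTU
    ssize_t n = recvfrom(rx, buf, sizeof(buf), 0, nullptr, nullptr);
    close(rx);
    close(tx);
    return std::string(buf, n > 0 ? static_cast<size_t>(n) : 0);
}
```

Under Winsock the flow is the same apart from WSAStartup initialization and closesocket in place of close.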
The network transmission module sends encapsulated IP voice packets from the sender to the receiver. In Windows, the Winsock function is used.
2. Relationship between the buffer size and latency
The buffer size is closely related to latency. Generally, a large buffer means high latency, but operations such as out-of-order reassembly can be carried out effectively and voice quality is good; a small buffer means low latency, but the buffer cannot effectively absorb latency jitter and similar effects, so voice quality is poor. We therefore need to choose a buffer size that keeps the latency small while maintaining good speech quality.
The experimental program is a PC-to-PC VoIP program written earlier in this work. It is compiled with VC++ and uses the low-level audio API, the waveX functions, for audio capture and playback; ACM for voice compression and decompression; and Winsock for network communication. The program implements the basic functions of network voice transmission. The capture and playback buffers are of equal size, two of each, used in ping-pong fashion.
We measured the relationship between buffer size and end-to-end latency in the Ethernet environment. The idea of the measurement is to run the program, feed a stimulus in at the microphone, and observe the output at the headset; the difference between the two is the end-to-end latency. The test can be run as a loopback call on the local machine, in which case no synchronization is required. Because the test environment is a 100 Mbit/s Ethernet link, the link transmission latency is on the order of microseconds and is negligible, so the local loopback result essentially represents the end-to-end latency. Concretely, an appropriate signal is generated to simulate voice input, and the output is observed on an oscilloscope to obtain the latency between the two. The codec used in the test program is GSM 6.10, the parameters are a sampling frequency of 11.025 kHz in 8-bit mono mode, and the audio API is waveX. The experimental results are shown in Table 1.
Table 1 Relationship between buffer size and latency
| Buffer size (bytes) | 512 | 768 | 1024 | 1536 | 2048 | 4096 |
| --- | --- | --- | --- | --- | --- | --- |
| Voice duration (ms) | 46 | 70 | 93 | 140 | 196 | 392 |
| End-to-end latency (ms, measured) | ~350 | ~400 | ~500 | ~600 | ~700 | ~800 |
In the above test environment, each sample is quantized to one byte and the sampling frequency is 11.025 kHz, so one second of raw speech data is 11025 bytes. The voice duration is therefore the buffer size divided by 11025, and this voice duration is also the buffer (packaging) latency.
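That arithmetic can be written down directly (the function name is illustrative; the 11025 bytes/s figure is the 11.025 kHz, 8-bit mono format from the text):

```cpp
// Voice duration held by one buffer, in milliseconds:
// bytes / (bytes per second) * 1000.
// At 11.025 kHz, 8-bit mono, one second of raw speech is 11025 bytes.
double bufferDurationMs(int bufferBytes, int bytesPerSecond = 11025) {
    return 1000.0 * bufferBytes / bytesPerSecond;
}
```

For the 768-byte buffer used later in the experiments this gives about 70 ms, matching the Voice duration row of Table 1.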
In the experiment we found that with a 512-byte buffer, although the buffer latency is small, the voice pauses are very noticeable and the sound quality is very poor. Setting the buffer to 768 bytes improves the sound quality significantly while adding only a modest amount of packaging latency. We therefore set the buffer to 768 bytes in the later experiments.
As Table 1 shows, the latency grows markedly as the buffer grows. However, when the buffer is already fairly small (512 bytes), the latency does not drop much further, staying at around 350 ms, while the corresponding voice duration is only 46 ms. Clearly, besides buffer packaging and transmission, other links in the VoIP path introduce substantial latency. Section 3 analyzes the composition of the end-to-end latency in detail.
3. Composition of latency in an Ethernet environment
Latency in VoIP arises in every link of the IP telephony path. As shown in Figure 2, it can be roughly divided into four parts: (1) audio capture and playback latency, caused by the audio API; (2) buffer latency, introduced by the wait to fill the sending-side buffer and by unpacking at the receiving end; as the experiment in Section 2 shows, it is determined by the buffer size; (3) speech encoding/decoding latency, caused by the speech coding algorithm; it varies with the algorithm, but not greatly, typically 5 to 40 ms; (4) network transmission latency, the time the data needs to reach the destination across the network.
Figure 2 VoIP latency Distribution
Because of the high bandwidth of Ethernet, the network transmission latency is generally less than 1 ms and can be ignored. The latency of VoIP in the LAN environment therefore consists mainly of the speech encoding/decoding latency, the packaging/buffering latency, and the audio capture and playback latency.
To determine the latency distribution among the parts of VoIP over Ethernet more precisely, we inserted timestamps into the experimental program with the QueryPerformanceCounter function, which allows accurate timing. We ran a loopback call test on the local machine. The codec is GSM 6.10, the parameters are a sampling frequency of 11.025 kHz in 8-bit mono mode, the buffer is 768 bytes, and the audio API is waveX. The latencies of the audio capture, compression, decompression, and playback stages were measured; the unit of raw audio data processed in each measurement is one buffer. Table 2 shows the results:
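The experiment uses the Win32 QueryPerformanceCounter for its timestamps; the same per-stage timestamping idea can be sketched portably with std::chrono::steady_clock (the class name here is illustrative):

```cpp
#include <chrono>

// Minimal stage timer: record a start timestamp on construction,
// report elapsed milliseconds on demand. Portable analogue of the
// QueryPerformanceCounter-based timestamping used in the experiment.
class StageTimer {
    std::chrono::steady_clock::time_point start_;
public:
    StageTimer() : start_(std::chrono::steady_clock::now()) {}
    double elapsedMs() const {
        return std::chrono::duration<double, std::milli>(
            std::chrono::steady_clock::now() - start_).count();
    }
};
```

Wrapping each stage (capture, compress, decompress, playback) in its own StageTimer yields the per-stage breakdown reported in Table 2.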
Table 2 Per-stage latency of the waveX program
| Audio capture latency | Compression latency | Decompression latency | Audio playback latency |
| --- | --- | --- | --- |
| ~180 ms | ~5 ms | ~5 ms | ~200 ms |
Adding up the per-stage latencies gives an end-to-end latency of about 390 ms. This essentially matches the measurement in Section 2, indicating that the results are credible. The results show that the latency consists mainly of the audio capture latency and the audio playback latency. After the buffer latency (the voice duration) is subtracted, well over 200 ms remains; this part must be introduced by the low-level audio API, waveX.
4. Countermeasures for solving Ethernet latency
Based on the experimental results in Section 3, we should consider using a better-performing audio API to reduce the latency.
We modified the program to use DirectSound instead of waveX for audio capture and playback. waveX has no hardware acceleration and suffers from high CPU utilization and high latency. DirectSound is one of the audio components of the DirectX API. It provides fast mixing, hardware acceleration, and direct access to the audio device, and lets applications capture and play sound by controlling the hardware and drivers directly. Compared with waveX, DirectSound is newer and more capable: it supports mixing and hardware acceleration and achieves low latency in both capture and playback.
The DirectSound implementation is as follows. The sound capture process is shown in Figure 3: the DirectSoundCaptureEnumerate function enumerates the recording devices in the system, DirectSoundCaptureCreate creates the capture device object, and CreateCaptureBuffer then creates a capture buffer object. SetNotificationPositions sets the notification positions so that data can be copied out of the capture buffer periodically. The sound playback process is shown in Figure 4: initialization is performed analogously with DirectSoundEnumerate, DirectSoundCreate, and CreateSoundBuffer; the Lock function locks a region of the playback buffer, the audio data is written into the locked region, and Unlock releases it after the write completes.
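The notification-position mechanism is the key to periodic, low-latency copying. DirectSound itself (IDirectSoundNotify::SetNotificationPositions) is Windows-only, so the following portable class only models the mechanism: a circular capture buffer that signals each time the write cursor crosses one of several preset offsets, telling the application to copy that segment out. All names are illustrative.

```cpp
#include <cstddef>
#include <vector>

// Model of DirectSound's notification positions: evenly spaced offsets
// in a circular buffer; crossing one means "a segment is ready to copy".
class NotifyingRingBuffer {
    std::vector<unsigned char> buf_;
    std::vector<size_t> notifyPositions_;
    size_t writePos_ = 0;
public:
    int notifications = 0;  // how many notify offsets have been crossed
    NotifyingRingBuffer(size_t size, size_t numNotifies) : buf_(size) {
        // Place one notification at the end of each equal segment.
        for (size_t i = 1; i <= numNotifies; ++i)
            notifyPositions_.push_back(i * size / numNotifies - 1);
    }
    void write(unsigned char byte) {
        buf_[writePos_] = byte;
        for (size_t p : notifyPositions_)
            if (writePos_ == p) ++notifications;  // signal: copy segment out
        writePos_ = (writePos_ + 1) % buf_.size();
    }
};
```

In the real DirectSound program the "signal" is an event handle the capture thread waits on; the smaller the segments, the more often the application drains the buffer and the lower the capture latency.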
Figure 3 Sound capture process    Figure 4 Sound playback process
In the same experimental environment as Section 3, we measured the latency of the DirectSound program. The end-to-end latency measured with the oscilloscope is about 260 ms, consistent with the sum of the per-stage latencies; the timestamp measurements are shown in Table 3.
Table 3 Per-stage latency of the DirectSound program
| Audio capture latency | Compression latency | Decompression latency | Audio playback latency |
| --- | --- | --- | --- |
| ~120 ms | ~5 ms | ~5 ms | ~130 ms |
The results show that the latency of the DirectSound program is significantly lower than that of the waveX program.
In addition, ASIO (Audio Stream Input/Output) can also be used. ASIO exploits the processing capability of the sound card hardware, greatly reducing the system's latency on the audio stream; ASIO audio capture latency can be shortened to a few milliseconds. However, it requires a professional sound card and is comparatively complicated to use, so it is harder to adopt.
5. Conclusion
This paper analyzes the end-to-end latency of VoIP applications in the LAN environment and verifies that the audio transmission latency in the Ethernet environment consists mainly of buffer latency and API call latency, with the API call latency being the dominant part. When developing an Ethernet VoIP application system, therefore, the implementation of these two parts should be the focus of optimization in order to improve voice quality.