1. Background
With the spread of broadband Internet access and steady technological advances, VoIP has become a technology that can rival traditional telephone services and PBX products. A VoIP terminal samples analog audio and video signals, encodes them into compressed frames, and packs them into IP packets for transmission, thereby achieving multimedia communication over an IP network. Based on the requirements VoIP places on media capture/playback, this article analyzes several factors that may affect capture/playback performance and proposes solutions.
2. VoIP media capture/playback requirements
VoIP generally uses G.723.1 or G.728 audio coding. The sampling rate is 8000 Hz, and each sample must have at least 13 bits of precision. The capture side must therefore deliver, at a specified interval (20~200 ms, default 20 ms), audio data sampled at 8000 Hz with 16 bits per sample; correspondingly, the playback side must consume audio data of the same rate and sample width at the same interval.
Video coding usually uses H.261 or H.263. At intervals of 1/30 second, a complete frame of raw video data in CIF (Common Intermediate Format, 352×288), QCIF (Quarter CIF, 176×144), or SQCIF (Sub-QCIF, 128×96) is required. Correspondingly, video is played at a fixed interval, one complete frame of CIF, QCIF, or SQCIF data at a time.
3. VoIP media encapsulation requirements
After the raw media stream is encoded, it is split into segments according to certain rules and then encapsulated in RTP. Every RTP packet carries a fixed header:
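A minimal sketch of that fixed header follows, laid out per RFC 3550. The bit-field declaration assumes a little-endian compiler's LSB-first allocation, so treat it as illustrative rather than portable wire-format code.

```c
#include <stdint.h>

/* RTP fixed header (RFC 3550): 12 bytes before any CSRC entries.
 * Multi-byte fields are in network byte order on the wire. */
typedef struct {
    uint8_t  cc:4;        /* CSRC count                          */
    uint8_t  x:1;         /* header extension flag               */
    uint8_t  p:1;         /* padding flag                        */
    uint8_t  version:2;   /* RTP version, always 2               */
    uint8_t  pt:7;        /* payload type                        */
    uint8_t  m:1;         /* marker bit                          */
    uint16_t seq;         /* sequence number                     */
    uint32_t timestamp;   /* sampling instant of the first octet */
    uint32_t ssrc;        /* synchronization source identifier   */
} RtpFixedHeader;
```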
The capture module cares only about the timestamp field, which records the sampling instant of the audio/video data. Its initial value is random. For audio data the timestamp resolution is 1/8000 second; for video data it is 1/90000 second.
4. Several factors that may affect system performance, and solutions
Evaluating system performance involves two aspects: processing time and memory footprint. Below we discuss several factors that may affect system performance and propose solutions.
4.1 Data interaction between the capture/playback module and other modules
The captured media data is sent to the audio/video encoding module. To avoid repeated allocation and release of memory, the encoding module should, as far as possible, encode in place within the data block handed over by the capture module rather than allocate a separate block to hold the encoded data. (In video encoding, if a frame differs too much from the previous frame, the encoded frame may not fit into one RTP packet and must be split across several; in that case an extra allocation cannot be avoided.) In addition, because the encoded data will be encapsulated in RTP, the data block passed from the capture module to the encoding module should reserve room for the RTP header, so that repeated copying is avoided.
Because encoding algorithms (H.263, G.728, G.723.1, and so on) need information such as the timestamp, the length of the raw data, and the video frame number from the capture module, a custom structure, here called the encoding-information structure, is defined to carry it. The data block sent from the capture module to the encoding module therefore begins with a union containing the RTP header structure and the encoding-information structure, followed by the raw media data.
The RTP header length also varies with the encoding method: G.728 needs only the fixed RTP header, while H.263 and H.261 require an RTP payload header in addition to the fixed header. For the sake of extensibility, the data block structure must therefore contain a field giving the offset of the "useful" data from the start of the block.
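A minimal sketch of such a data block, assuming illustrative field names and a 16-byte header reserve:

```c
#include <stdint.h>

#define RTP_HDR_RESERVE 16   /* room for fixed header + payload header */

/* Information the encoder needs from the capture module (sec. 4.1). */
typedef struct {
    uint32_t timestamp;      /* sampling timestamp                    */
    uint32_t raw_len;        /* length of the raw media data          */
    uint32_t frame_no;       /* video frame number                    */
} EncodeInfo;

typedef struct {
    union {                  /* same bytes reused in two phases:      */
        uint8_t    rtp_area[RTP_HDR_RESERVE]; /* at packetization     */
        EncodeInfo info;                      /* at capture/encoding  */
    } head;
    uint32_t data_offset;    /* offset of the "useful" data from the  */
                             /* start of the block                    */
    uint8_t  data[];         /* raw media data follows                */
} MediaBlock;
```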
4.2 Internal structure of the audio capture/playback module
The capture module provides the data source for audio and video conferencing, so capturing audio and video data in real time is essential, and the implementation should keep system overhead small. On Windows, for example, audio capture can be done with the Waveform Audio API and video capture with the VFW (Video for Windows) API, without resorting to DirectShow; the same two APIs can likewise be used for audio and video playback.
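As a sketch of the Windows case, the following opens a waveform-audio capture device with the 8000 Hz, 16-bit mono format from section 2; error handling is abbreviated, and the event parameter anticipates the event-based synchronization of section 4.3.

```c
#include <windows.h>
#include <mmsystem.h>
#pragma comment(lib, "winmm.lib")

HWAVEIN open_capture_device(HANDLE block_done_event)
{
    WAVEFORMATEX fmt = {0};
    fmt.wFormatTag      = WAVE_FORMAT_PCM;
    fmt.nChannels       = 1;        /* mono                  */
    fmt.nSamplesPerSec  = 8000;     /* 8000 Hz sampling rate */
    fmt.wBitsPerSample  = 16;       /* 16 bits per sample    */
    fmt.nBlockAlign     = fmt.nChannels * fmt.wBitsPerSample / 8;
    fmt.nAvgBytesPerSec = fmt.nSamplesPerSec * fmt.nBlockAlign;

    HWAVEIN hwi = NULL;
    /* CALLBACK_EVENT: the driver signals the event when a buffer
       completes, matching the choice discussed in section 4.3. */
    if (waveInOpen(&hwi, WAVE_MAPPER, &fmt,
                   (DWORD_PTR)block_done_event, 0,
                   CALLBACK_EVENT) != MMSYSERR_NOERROR)
        return NULL;
    return hwi;
}
```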
Because human hearing is sensitive to gaps, both audio capture and playback should allocate a ring buffer to keep capture/playback continuous and evenly paced; the buffer stores the addresses of the audio data blocks to be encoded/played, not the audio data itself. If the data were stored directly, then on the capture side every block sent to the encoding module would have to be copied out of the buffer, and on the playback side the data received from the synchronization module would have to be copied into the buffer, which hurts performance. Storing only block addresses avoids this: when the audio capture module hands data to the audio encoding module, or the synchronization module hands data to the audio playback module, it only needs to pass the block address and the offset of the audio data, with no repeated copying.
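A minimal sketch of such a pointer ring buffer, with illustrative names and a power-of-two capacity; real code would guard it with the synchronization primitives of section 4.3.

```c
#include <stddef.h>

#define RING_SIZE 32            /* power of two, so masking works  */

typedef struct {
    void   *block;              /* address of the data block       */
    size_t  offset;             /* offset of the audio data inside */
} RingEntry;

typedef struct {
    RingEntry entries[RING_SIZE];
    unsigned  head;             /* next slot to write              */
    unsigned  tail;             /* next slot to read               */
} PtrRing;

/* Enqueue a block address; returns 0 if the ring is full. */
static int ring_push(PtrRing *r, void *block, size_t offset)
{
    if (r->head - r->tail == RING_SIZE) return 0;
    r->entries[r->head & (RING_SIZE - 1)] = (RingEntry){block, offset};
    r->head++;
    return 1;
}

/* Dequeue the oldest block address; returns 0 if the ring is empty. */
static int ring_pop(PtrRing *r, RingEntry *out)
{
    if (r->head == r->tail) return 0;
    *out = r->entries[r->tail & (RING_SIZE - 1)];
    r->tail++;
    return 1;
}
```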
On a real-time operating system, the audio capture/playback module can be split into two independent tasks or threads: audio capture and audio playback. The audio capture thread owns a subthread, the data-sending thread, whose job is to deliver captured audio data to the audio encoding module in real time; the audio playback thread owns a subthread, the data-playing thread, whose job is to pull audio data out of the playback buffer in real time and hand it to the audio driver for playback.
After receiving the capture-start message, the audio capture module gives the driver the addresses of the data blocks to be filled (to keep capture continuous) and tells the driver to start capturing; whenever a block completes, the data-sending thread forwards the captured data to the audio encoding entity and hands the next block to the driver. After receiving the playback-start message, the audio playback module does not start playing immediately. It first inserts the data arriving from the synchronization module into the playback ring buffer; once a threshold is reached, it submits the addresses of several data blocks to the driver, starting from the first block in the buffer, and tells the driver to start playing. Thereafter, the data-playing thread extracts the next block and sends it to the driver.
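On Windows this capture flow maps onto waveInAddBuffer/waveInStart. A sketch, assuming 20 ms blocks and a hypothetical send_to_encoder() handoff:

```c
#define BLOCK_BYTES (8000 / 50 * 2)   /* 20 ms of 16-bit mono = 320 bytes */
#define NUM_BLOCKS  4                 /* blocks kept queued at the driver */

static WAVEHDR g_hdr[NUM_BLOCKS];
static char    g_buf[NUM_BLOCKS][BLOCK_BYTES];

void send_to_encoder(const char *data, DWORD len);   /* hypothetical */

void start_capture(HWAVEIN hwi)
{
    for (int i = 0; i < NUM_BLOCKS; i++) {
        g_hdr[i].lpData         = g_buf[i];
        g_hdr[i].dwBufferLength = BLOCK_BYTES;
        waveInPrepareHeader(hwi, &g_hdr[i], sizeof(WAVEHDR));
        waveInAddBuffer(hwi, &g_hdr[i], sizeof(WAVEHDR));  /* hand block to driver */
    }
    waveInStart(hwi);   /* notify the driver to start capturing */
}

/* In the data-sending thread: when a block completes, forward it to
 * the encoder and immediately requeue it so capture never stalls. */
void on_block_done(HWAVEIN hwi, WAVEHDR *done)
{
    send_to_encoder(done->lpData, done->dwBytesRecorded);
    waveInAddBuffer(hwi, done, sizeof(WAVEHDR));
}
```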
4.3 Synchronization and mutual exclusion between threads
Because pairs of threads cooperate to capture and play audio, synchronization and mutual exclusion between threads are a key concern. The audio capture thread and the data-sending thread, and likewise the audio playback thread and the data-playing thread, must be synchronized: when a data block has been captured/played, the data-sending/playing thread must somehow be told to fetch the next block.
For VoIP applications, real-time notification matters most. The first option is a callback function, which has the best real-time behavior, but on some systems (such as Windows) calling certain audio-driver functions inside the callback can cause deadlocks. The second option is event notification: the data-sending thread waits on a "capture completed" event, and the driver signals the event when a block finishes. This method is also quite real-time; its only drawback is that finding the index of the completed block takes a little work (because more than one block is outstanding at the driver for capture/playback). The driver can also post a message to a designated thread after capture completes, but the message mechanism is not very real-time, so it is not considered. Event notification is therefore the usual choice for synchronization and mutual exclusion.
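A sketch of the event-driven data-sending thread, reusing NUM_BLOCKS and on_block_done() from the earlier sketch; CaptureCtx is a hypothetical context holding the handles. Since the event alone does not say which block finished, the WHDR_DONE flag on each outstanding header identifies it.

```c
typedef struct {
    HWAVEIN          hwi;               /* capture device handle     */
    HANDLE           block_done_event;  /* signaled by the driver    */
    volatile LONG    stopping;          /* set to request shutdown   */
    WAVEHDR         *hdr;               /* the NUM_BLOCKS headers    */
} CaptureCtx;                           /* hypothetical context      */

DWORD WINAPI data_send_thread(LPVOID param)
{
    CaptureCtx *ctx = (CaptureCtx *)param;
    for (;;) {
        WaitForSingleObject(ctx->block_done_event, INFINITE);
        if (ctx->stopping) break;
        /* Scan the outstanding headers for completed blocks. */
        for (int i = 0; i < NUM_BLOCKS; i++) {
            WAVEHDR *h = &ctx->hdr[i];
            if (h->dwFlags & WHDR_DONE)
                on_block_done(ctx->hwi, h);  /* forward and requeue */
        }
    }
    return 0;
}
```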
5. Timestamp generation
According to the protocol, the resolution of the audio timestamp is 1/8000 second and that of the video timestamp is 1/90000 second, so a high-precision counter with a resolution no coarser than 1/90000 second must be used. Because the audio and video timestamps differ only in resolution, one counter suffices to derive both; two counters are unnecessary.
First, obtain the counter frequency Freq. When capture starts, query the initial counter value and save it in a static variable. Then use the MD5 algorithm to generate a 32-bit random number and assign it as the timestamp of the first RTP packet. Each time a data block is captured, query the current counter value and take its difference CntGap from the initial value; multiplying CntGap by the counter period gives the actual elapsed time timeGap, that is:

timeGap = (double)CntGap / Freq

Dividing timeGap by the RTP timestamp period (1/8000 for audio, 1/90000 for video) and adding the initial timestamp value yields the timestamp for this capture.
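A sketch of this on Windows, using QueryPerformanceCounter as the high-precision counter; random_u32() stands in for the MD5-derived random initial value and is hypothetical.

```c
#include <windows.h>

static LARGE_INTEGER g_freq;     /* counter frequency (Freq)      */
static LARGE_INTEGER g_start;    /* counter value at capture start */
static DWORD         g_ts_init;  /* random initial RTP timestamp  */

DWORD random_u32(void);          /* hypothetical: 32-bit random   */

void timestamp_init(void)
{
    QueryPerformanceFrequency(&g_freq);
    QueryPerformanceCounter(&g_start);
    g_ts_init = random_u32();
}

/* clock_rate is 8000 for audio, 90000 for video. */
DWORD timestamp_now(DWORD clock_rate)
{
    LARGE_INTEGER now;
    QueryPerformanceCounter(&now);
    LONGLONG cnt_gap  = now.QuadPart - g_start.QuadPart;    /* CntGap  */
    double   time_gap = (double)cnt_gap / g_freq.QuadPart;  /* seconds */
    /* Divide by the RTP timestamp period (1/clock_rate) and add the
       random initial value; 32-bit wraparound is the RTP norm. */
    return g_ts_init + (DWORD)(time_gap * clock_rate);
}
```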
6. Conclusion
Since audio/video capture/playback performance directly affects call quality, a high-performance VoIP terminal is inseparable from a high-performance media capture/playback module. This paper has presented several factors that affect audio/video capture/playback performance and discussed the design method with Windows as the example OS. When developing VoIP terminals on other platforms, the interfaces between the capture/playback module and the operating system will vary with the actual platform; however, the internal module structure proposed here is independent of the interfaces to the surrounding modules, so it retains reference value.