A New VoIP Speech Quality Measurement Method: the E-model

Source: Internet
Author: User

1 Introduction
In recent years, with the wide adoption of IP network technology, more and more researchers have turned their attention to the quality of service that IP networks can provide. Scientific and reliable measurement and evaluation of service quality is a critical issue in network measurement, planning, and design. As a pioneer of the next-generation converged service network based on packet transmission, VoIP provides reference and experience for measuring the service quality of future networks.

2 VoIP speech characteristics and their requirements on network performance
Voice transmission over an IP network differs from traditional PSTN voice transmission. The analog voice is digitized by a voice codec, packed into IP packets, and sent to the receiving end over a best-effort delivery mechanism, addressed by IP address. The receiving end collects the packets and decodes them back into analog voice. VoIP also differs in many ways from traditional network applications. An FTP file transfer, for example, uses as much network bandwidth as it can, while ERP applications send little data but exchange it frequently between sender and receiver. VoIP, by contrast, occupies only a small amount of bandwidth but cannot tolerate network latency or variation in latency. Even when VoIP and traditional data services run on the same network, voice streams and data streams cannot be handled in the same way, because:

1) They have different packet sizes

2) They send data packets at different rates

3) They cache and transmit data packets to the destination in different ways

4) They must meet different user expectations.

At present, most networks are not ready to provide end-to-end VoIP with the same speech quality and reliability as the PSTN. Existing VoIP networks mainly implement IP trunking, which links two remote ends of the PSTN. The following two major features of VoIP speech determine its specific requirements on network performance:

First, VoIP uses the Real-time Transport Protocol (RTP) to transmit data. RTP is an application protocol that runs over UDP. UDP is connectionless and provides neither acknowledgment nor tracking of packet delivery, so RTP does not retransmit packets lost in the network. Packet loss during transmission must therefore be kept to a minimum. In addition, unlike TCP-based application protocols, RTP has no direct flow control: if the sender sends too many packets too fast, the receiver is overwhelmed. To avoid this problem, RTP applications always send packets at a fixed rate, which in turn requires the network to deliver packets at as steady a rate as possible.
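To make the fixed sending rate concrete, the following sketch works out the packet rate, payload size, and IP-level bandwidth of one voice stream. The codec parameters (a 64 kbit/s codec such as G.711 with 20 ms of speech per packet) and the function name packet_stream are assumptions chosen for illustration and are not taken from the text above. The header sizes are the fixed IPv4, UDP, and RTP headers without options.

```python
# Assumed values for illustration only (not from the article):
# a 64 kbit/s codec such as G.711, sending 20 ms of speech per packet.
CODEC_BITRATE_BPS = 64_000
PACKET_INTERVAL_MS = 20
HEADER_BYTES = 20 + 8 + 12  # IPv4 + UDP + RTP fixed headers, no options


def packet_stream(bitrate_bps, interval_ms, header_bytes=HEADER_BYTES):
    """Return (packets per second, payload bytes, IP-level bandwidth in bit/s)."""
    packets_per_second = 1000 / interval_ms
    payload_bytes = bitrate_bps * interval_ms / 1000 / 8
    ip_bandwidth_bps = (payload_bytes + header_bytes) * 8 * packets_per_second
    return packets_per_second, payload_bytes, ip_bandwidth_bps


pps, payload, bandwidth = packet_stream(CODEC_BITRATE_BPS, PACKET_INTERVAL_MS)
print(f"{pps:.0f} packets/s, {payload:.0f} payload bytes, {bandwidth / 1000:.0f} kbit/s")
```

Under these assumptions the stream is small but strictly periodic, which is why steady delivery matters more to it than raw bandwidth.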

Second, interactive sessions cannot tolerate excessive latency. A typical telephone conversation depends on a great deal of interaction between speaker and listener. The more interactive the conversation, the less latency it can bear. This requires that packet latency through the network be as small as possible.

It can be seen that the transmission of voice over an IP network needs to consider many factors different from the traditional telephone network and the traditional data network. All these factors will restrict the quality of VoIP speech.

3 VoIP Speech Quality Evaluation Standard
How can we judge whether the voice quality of VoIP is good or bad? Ideally, VoIP speech would be as good as that of the PSTN, a level known as toll quality, but this is not necessarily the case. Both before and after VoIP is deployed, we must know the voice quality the network delivers, so we need standards for measuring it. Ever since the telephone was invented, voice quality has been measured subjectively: people pick up a phone and judge the quality by ear. This approach is widely accepted. In its refined form it is now the Mean Opinion Score (MOS) method, defined in ITU-T P.800. In this subjective evaluation, the way people receive and perceive speech quality is investigated and quantified, and the MOS is obtained by having listeners rate the speech they hear. The correspondence between voice quality and MOS provides a standard basis for network configuration, benchmarking, and monitoring.

A MOS of 4 or higher is considered good speech quality. If the MOS is lower than 3.6, most listeners are not satisfied with the speech quality. Although the subjective test is accurate and effective, its biggest problem is that, in practice, it is very difficult and expensive to have a group of people listen to and rate speech quality. People have therefore kept looking for ways to measure speech quality objectively.
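As a small illustration of how a MOS is formed and read, the sketch below averages a set of listener ratings and applies the two thresholds quoted above. It assumes the five-point rating scale (1 = bad, 5 = excellent) used in P.800 listening tests. The function names, the sample ratings, and the "intermediate" label for scores between 3.6 and 4 are illustrative assumptions, since the text does not characterize that range.

```python
from statistics import mean


def mean_opinion_score(ratings):
    """Average listener ratings given on the assumed 1..5 opinion scale."""
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(not 1 <= r <= 5 for r in ratings):
        raise ValueError("ratings must lie between 1 and 5")
    return mean(ratings)


def describe(mos):
    """Apply the thresholds quoted in the text (4 and 3.6)."""
    if mos >= 4.0:
        return "good"
    if mos < 3.6:
        return "most listeners dissatisfied"
    return "intermediate"  # hypothetical label; the text does not name this range


ratings = [4, 5, 4, 3, 4]  # hypothetical panel of five listeners
score = mean_opinion_score(ratings)
print(score, describe(score))
```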

Many objective measurement methods are now in use, such as PSQM/PSQM+ (Perceptual Speech Quality Measure) [2], PESQ (Perceptual Evaluation of Speech Quality) [3], and PAMS (Perceptual Analysis Measurement System, from BT). Both PSQM and PAMS send a reference voice signal through the telephone network. At the other end of the network, digital signal processing is used to compare the reference signal with the received signal, from which the voice quality of the network is estimated. PESQ combines the advantages of PSQM and PAMS, improves support for VoIP and hybrid end-to-end applications, and modifies the MOS and MOS-LQ calculation methods. These methods were first used to evaluate encoding algorithms and were then gradually applied to measuring VoIP network systems. A representative product is the Voice Quality Tester (VQT) from the well-known instrument manufacturer Agilent. It should also be pointed out that MOS is the widely recognized standard for speech quality. Whatever method is used, its results must therefore be mapped to a MOS value in the end. All of the methods above can express their results as MOS values.

4 Proposal of E-model measurement method
The measurement methods described above can be used in the laboratory to analyze individual devices. PSQM and PESQ, for example, are used to analyze the quality of telephones. However, these methods were built around traditional telephone networks and are not well suited to analyzing voice quality on data networks. Their main disadvantage is that they do not reflect problems specific to data networks, such as latency, jitter, and packet loss. They do not consider how network faults affect user perception, and they analyze speech problems only in terms of the difference between the sent and received signals. To overcome these shortcomings, the ITU-T G.107 standard proposed the E-model, which combines the network impairment factors into one rating and is well suited to evaluating voice quality in data networks.

The premise of the E-model is that speech quality impairment factors are additive. Simply put, if network impairments such as noise, echo, latency, codec performance, and jitter can be summed, then an overall objective quality rating of the network, which may be thought of as the caller's experience, can be estimated.

4.1 E-model's basic algorithm formula and its correspondence with MOS values

The final result of the E-model algorithm is the R value. It is called the transmission rating factor and ranges from 0 to 100. The calculation of R starts from the case in which the network and equipment have no impact. Speech quality is then at its best, and R = Ro. Ro is the basic signal-to-noise ratio: the ratio of the signal, before any network delay or equipment impairment, to the noise at the sending and receiving sides, including circuit noise and background noise. Because network and equipment impairments are present, the voice quality through the network is reduced. The basic formula for calculating the R value is as follows:

R = Ro - Is - Id - Ie + A

Where, Is: impairment that occurs simultaneously with voice signal transmission

Id: impairment caused by delay of voice signal transmission

Ie: impairment caused by equipment, such as the codec

A: advantage factor. It accounts for the caller's expectations. In most cases it is set to 0.

According to the formula, the overall R value is calculated by first estimating the signal-to-noise ratio of the connection (Ro), then subtracting the network impairments (Is, Id, Ie), and finally adding the advantage factor A to compensate for the caller's expectation of voice quality. In practical applications, the inputs Ro, Is, Id, and Ie each depend on a variety of actual network impairment factors and are obtained through fairly complicated calculations.
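The basic formula can be sketched in a few lines. This is only the final summation: it takes Ro, Is, Id, Ie, and A as given inputs and does not attempt the detailed G.107 calculations that produce them. The function name r_value and the sample numbers are hypothetical, and clamping the result to the 0 to 100 range is an assumption based on the stated range of R.

```python
def r_value(ro, i_s, i_d, i_e, a=0.0):
    """Basic E-model summation: R = Ro - Is - Id - Ie + A.

    The inputs are assumed to be already computed; the result is
    clamped to 0..100, the range the text gives for R.
    """
    r = ro - i_s - i_d - i_e + a
    return max(0.0, min(100.0, r))


# Hypothetical impairment values, for illustration only.
print(r_value(ro=93.0, i_s=1.0, i_d=5.0, i_e=11.0))
```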

As mentioned above, every measurement method is eventually mapped to the MOS standard, and the E-model is no exception. The following graph shows the mapping between the R value and the MOS. The X axis represents the R value of the E-model, and the Y axis represents the MOS.

The following table lists the correspondence between the R value and the MOS value. Because converting between actual speech and network data involves an inherent loss, the R value reaches only 93.2, which corresponds to a MOS of only 4.4. The maximum R value of G.107 is 94 by default.
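The R-to-MOS mapping can also be computed directly. The sketch below uses the conversion formula given in ITU-T G.107: MOS is 1 for R at or below 0, 4.5 for R at or above 100, and a cubic polynomial in R in between. The function name r_to_mos is an illustrative choice, and the code should be read as a sketch of that conversion rather than a validated implementation.

```python
def r_to_mos(r):
    """Convert an E-model R value to an estimated MOS (G.107 conversion)."""
    if r <= 0:
        return 1.0
    if r >= 100:
        return 4.5
    return 1.0 + 0.035 * r + r * (r - 60.0) * (100.0 - r) * 7e-6


# R = 93.2 gives a MOS of roughly 4.4, matching the figure quoted in the text.
for r in (50.0, 70.0, 80.0, 93.2):
    print(r, round(r_to_mos(r), 2))
```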

4.2 Impact of Speech Encoding methods, latency, jitter, and packet loss on R Value

The main causes of network impairment are voice encoding, echo, average packet delay, jitter, and packet loss rate. Echo arises where an IP network connects to the traditional PSTN, so it is not considered for a pure VoIP network. The formula for calculating R then simplifies to:

R = Ro - Icodec - Idelay - Ipdv - Ipacketloss

The impact of these four major impairment factors on the R value is discussed below.

In speech processing, the codec uses hardware or software to sample speech and determines the packet rate. The ITU standards define almost a dozen encoding methods, each with different characteristics. Low-bit-rate encoding consumes less bandwidth, but because it relies on lossy compression it also degrades speech quality. In practice, choosing a low-bit-rate codec allows more calls to be carried over the same link, but it introduces greater latency and makes speech quality more sensitive to packet loss. Choosing a lower-rate codec will therefore usually reduce the R value of the E-model significantly, although this is not an absolute rule. The following table shows the Ie values and inherent R values of some common encoding methods [4].

Latency refers to the time the voice takes to travel from the speaker to the listener [5]. Generally, the end-to-end latency consists of the following four parts:

1) Propagation latency: the time the voice signal takes to travel through the network from one end to the other, determined by the signal's speed in the medium and the distance covered.

2) Transmission latency: the time taken to pass through all the network devices along the network path.

3) Packetization latency: the time the codec takes to convert between analog and digital voice signals.

4) Jitter buffer latency: the time for which the receiving end holds one or more received packets in order to smooth out variation in packet arrival times, that is, to absorb jitter.

Latency can cause gaps in a conversation, leading to distorted speech and interrupted sessions. In other words, an increase in latency lowers the R value. Delays of about 100 ms and above are noticed by the listener and make the conversation unnatural. The recommended maximum latency is 150 ms. At still higher latency, serious session interruption occurs.
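The four latency components listed above simply add up to the end-to-end figure, which can then be compared with the 150 ms recommended maximum. The sketch below does that bookkeeping. The class name LatencyBudget, its field names, and the sample values are hypothetical and serve only to illustrate the arithmetic.

```python
from dataclasses import dataclass

RECOMMENDED_MAX_MS = 150  # recommended maximum latency quoted in the text


@dataclass
class LatencyBudget:
    """End-to-end latency split into the four components listed above."""
    propagation_ms: float
    transmission_ms: float
    packetization_ms: float
    jitter_buffer_ms: float

    def total_ms(self) -> float:
        return (self.propagation_ms + self.transmission_ms
                + self.packetization_ms + self.jitter_buffer_ms)

    def within_recommendation(self) -> bool:
        return self.total_ms() <= RECOMMENDED_MAX_MS


# Hypothetical component values, for illustration only.
budget = LatencyBudget(propagation_ms=25, transmission_ms=15,
                       packetization_ms=30, jitter_buffer_ms=40)
print(budget.total_ms(), budget.within_recommendation())
```

Splitting the total this way shows which component to attack first: the jitter buffer and packetization delays are set at the endpoints, while propagation delay is fixed by distance.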

Jitter, also known as latency variation, refers to the differences in arrival time among the packets sent during a VoIP call. When a packet is sent, the sender writes a timestamp into the RTP header. When the packet arrives at the other end, the receiver records a second timestamp, and the difference between the two gives the packet's transit time. If the transit times within a call differ, there is jitter. In video applications, jitter shows up as image flicker. In telephone calls, its effect is similar to that of packet loss: some words become unclear or wrong.
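The timestamp comparison described above can be sketched as follows. The transit time of each packet is the receive timestamp minus the send timestamp, and jitter appears as variation between consecutive transit times. The smoothing step, which moves the running estimate by one sixteenth of each new deviation, is the interarrival jitter estimator from RFC 3550. It is brought in here as one common way to reduce the variation to a single number and is not something the text above specifies. The function names and sample timestamps are hypothetical.

```python
def transit_times(send_ts, recv_ts):
    """Per-packet transit time: receive timestamp minus send timestamp."""
    return [r - s for s, r in zip(send_ts, recv_ts)]


def interarrival_jitter(send_ts, recv_ts):
    """Smoothed jitter estimate in the style of RFC 3550.

    Only differences between consecutive transit times are used, so a
    constant offset between sender and receiver clocks cancels out.
    """
    transits = transit_times(send_ts, recv_ts)
    jitter = 0.0
    for previous, current in zip(transits, transits[1:]):
        deviation = abs(current - previous)
        jitter += (deviation - jitter) / 16.0
    return jitter


# Hypothetical timestamps in milliseconds: packets sent every 20 ms.
send = [0, 20, 40, 60]
recv = [50, 70, 95, 110]
print(transit_times(send, recv))
print(interarrival_jitter(send, recv))
```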
