This document describes in detail the AAC audio decoding algorithm specified in ISO/IEC 13818-7 (MPEG-2 AAC) and ISO/IEC 14496-3 (MPEG-4 AAC Low Complexity).

1. Program System Structure

The following is the AAC decoding flowchart:

(Figure: AAC decoding flowchart)

After the main control module starts running, it places part of the AAC bit stream into the input buffer and locates the start of a frame by searching for the syncword. Noiseless decoding (which is in fact Huffman decoding) is then performed according to the syntax described in ISO/IEC 13818-7. After inverse quantization, joint stereo decoding, perceptual noise substitution (PNS), temporal noise shaping (TNS), the inverse modified discrete cosine transform (IMDCT), and spectral band replication (SBR), the PCM samples of the left and right channels are obtained. The main control module then puts them into the output buffer and sends them to the sound playback device.

2. Main Control Module

The main task of the main control module is to manage the input and output buffers and to call the other modules so that they work together. The input and output buffers are provided by the DSP control module. The data stored in the output buffer is decoded PCM data representing the amplitude of the sound; it is kept in a fixed-length buffer whose head pointer is obtained by calling the interface function of the DSP control module. After the output buffer is filled, interrupt processing is invoked to output the samples to the audio DAC chip (a stereo audio DAC with DirectDrive headphone amplifier) connected to the I2S interface.

3. Synchronization and Element Decoding

The synchronization and element decoding module identifies the format, decodes the header information, and decodes the element information. The decoded results are used by the subsequent noiseless decoding and scale factor decoding modules.

AAC audio files can be in either of the following formats:

ADIF: Audio Data Interchange Format. The characteristic of this format is that the start of the audio data can be located deterministically; decoding cannot begin in the middle of the data stream, that is, it must start at a clearly defined beginning. This format is therefore commonly used in disk files.

ADTS: Audio Data Transport Stream. This format is a bit stream with syncwords, so decoding can start at any position in the stream. Its characteristics are similar to those of the MP3 data stream format.

The ADIF format of AAC is organized as follows:

3.1 ADIF Organizational Structure

(Figure: ADIF organizational structure)
The general ADTS format of AAC is as follows:

3.2 ADTS Organizational Structure

(Figure: ADTS organizational structure)
The figure shows the simplified structure of an ADTS frame; the blank rectangles on both sides represent the data before and after the frame. The headers of ADIF and ADTS are different, as shown below:

3.3 ADIF Header Information

3.4 ADTS Fixed Header Information

ADTS Variable Header Information

3.5 Frame Synchronization

Frame synchronization aims to find the position of the frame header in the bit stream. According to ISO/IEC 13818-7, the frame header of the AAC ADTS format is the 12-bit syncword "1111 1111 1111" (0xFFF).
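As an illustration of frame synchronization, the following is a minimal sketch (not taken from the standard; the function name and buffer handling are assumptions for this example) of searching a byte buffer for the 12-bit ADTS syncword 0xFFF:

    #include <stddef.h>
    #include <stdint.h>

    /* Return the byte offset of the first ADTS syncword (12 ones, 0xFFF),
     * or -1 if no syncword is found in the buffer. */
    static long find_adts_sync(const uint8_t *buf, size_t len)
    {
        for (size_t i = 0; i + 1 < len; i++) {
            /* 8 ones in buf[i], then 4 ones in the high nibble of buf[i+1]. */
            if (buf[i] == 0xFF && (buf[i + 1] & 0xF0) == 0xF0)
                return (long)i;
        }
        return -1;
    }

A real decoder would additionally verify the header fields (and, where possible, the syncword of the following frame) to avoid locking onto a false syncword.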
3.6 Header Information Decoding

The ADTS header information consists of two parts: the fixed header information followed by the variable header information. The data in the fixed header is the same for every frame, while the variable header information changes from frame to frame.

3.7 Element Information Decoding

In AAC, a raw data block may be composed of the following elements:

SCE: Single Channel Element. A single channel element basically consists of only one ICS. A raw data block may contain at most 16 SCEs.

CPE: Channel Pair Element. A two-channel element consisting of two ICSs that may share side information, together with the associated joint stereo coding information. A raw data block may contain at most 16 CPEs.

CCE: Coupling Channel Element. Used together with the channel elements; it carries the multi-channel joint stereo information of a block or the dialogue information of multilingual programs.

LFE: Low Frequency Element. Contains a low-frequency enhancement channel.

DSE: Data Stream Element. Contains additional information that is not audio.

PCE: Program Config Element. Contains the channel configuration information of the program; it may appear in the ADIF header information.

FIL: Fill Element. Contains extension information such as SBR data and dynamic range control information.

3.8 Processing Flow

(1) Determine whether the file format is ADIF or ADTS.
(2) If it is ADIF, decode the ADIF header information and jump to step (6).
(3) If it is ADTS, search for the syncword.
(4) Decode the ADTS frame header information.
(5) If error checking is present, perform the error check.
(6) Decode the block information.
(7) Decode the element information.

4. Noiseless Decoding

Noiseless coding is Huffman coding. Its purpose is to further reduce the redundancy of the scale factors and the quantized spectrum, that is, to entropy-code the scale factor and quantized spectral information. The global gain is coded as an 8-bit unsigned integer. The first scale factor is coded as its difference from the global gain, using the scale factor Huffman table; every subsequent scale factor is coded as its difference from the previous scale factor (a minimal sketch of this reconstruction is given after section 4.1 below).

The noiseless coding of the quantized spectrum uses two partitionings of the spectral coefficients: a partitioning into 4-tuples or 2-tuples, and a partitioning into sections. The former determines whether a Huffman table codes 4 or 2 coefficients at a time; the latter determines which Huffman table is used. A section contains several scale factor bands and uses only one Huffman table.

4.1 Sectioning

Noiseless coding divides the 1024 input quantized spectral coefficients into several sections; all coefficients in a section are coded with the same Huffman table. For coding efficiency, the section boundaries should preferably coincide with the scale factor band boundaries. Therefore, the following information must be transmitted for each section: the section length, the corresponding scale factor bands, and the Huffman table used.
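As referenced above, the following is a minimal sketch of the differential scale factor reconstruction described at the start of this section (the dpcm[] input and the names are assumptions for this example; in a real decoder the differences come from the scale factor Huffman table):

    #include <stdint.h>

    /* Minimal sketch: rebuild scale factors from differentially coded values.
     * dpcm[i] is the i-th decoded difference; the first scale factor is
     * relative to the global gain, each later one to the previous value. */
    static void rebuild_scalefactors(uint8_t global_gain, const int *dpcm,
                                     int num_sfb, int *sf_out)
    {
        int sf = global_gain;
        for (int i = 0; i < num_sfb; i++) {
            sf += dpcm[i];
            sf_out[i] = sf;
        }
    }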
4.2 Grouping and Interleaving

Grouping means that, ignoring which window a spectral coefficient belongs to, consecutive spectral coefficients that share the same scale factors are placed into one group so that they share those scale factors, which gives better coding efficiency. This inevitably leads to interleaving: the coefficients, originally ordered as

c[group][window][scale factor band][coefficient index]

are rearranged so that coefficients with the same scale factor are placed together:

c[group][scale factor band][window][coefficient index]

which interleaves the coefficients of the different windows within a group.

4.3 Handling of Large Values

AAC handles large quantized values in two ways: with an escape flag in the Huffman codebook, or with the pulse escape method. The former is similar to the MP3 approach: a dedicated codebook is used for large values, and this codebook implies that a pair of escape values and their signs follow the Huffman-coded pair. With the pulse escape method, a large value is reduced to a smaller value by subtracting a difference, the smaller value is coded with the chosen codebook, and the difference is transmitted in a pulse structure so that the decoder can restore it.

The flowchart of noiseless decoding is as follows:

(Figure: noiseless decoding flowchart)

5. Scale Factor Decoding and Inverse Quantization

In AAC encoding, the spectral coefficients are quantized with a non-uniform quantizer, so the decoder must perform the inverse operation: keep the sign and raise the magnitude to the power 4/3. The basic method of shaping the quantization noise in the frequency domain is to use scale factors. A scale factor is a value used to change the amplitude of all spectral coefficients in a scale factor band. Together with the non-uniform quantizer, the scale factor mechanism changes the bit allocation of the quantization noise over frequency.

5.1 Scale Factor Band (scalefactor band)

The frequency lines are divided into multiple groups according to the auditory characteristics of the human ear; each group corresponds to a scale factor and is called a scale factor band. To reduce the side information for short windows, consecutive short windows may be grouped, that is, several short windows are transmitted as if they were a single window, and the scale factors then apply to all windows of the group.

5.2 Inverse Quantization Formula

x_invquant = sign(x_quant) * |x_quant| ^ (4/3)

where x_invquant is the inverse quantization result, sign(x) is the sign of x, and ^ denotes the power operation.

5.3 Applying the Scale Factor

x_rescal = x_invquant * gain
gain = 2 ^ (0.25 * (sf - SF_OFFSET))

where x_rescal is the value after the scale factor is applied, gain is the gain, sf is the scale factor value, and SF_OFFSET is a constant set to 100.
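As an illustration of sections 5.2 and 5.3, the following is a minimal sketch (function and variable names are assumptions for this example) that inverse-quantizes the coefficients of one scale factor band and applies its scale factor gain:

    #include <math.h>

    #define SF_OFFSET 100  /* constant from section 5.3 */

    /* Minimal sketch of inverse quantization (5.2) and scale factor
     * application (5.3) for the n coefficients of one scale factor band. */
    static void dequant_band(const int *x_quant, int n, int sf, float *x_rescal)
    {
        float gain = powf(2.0f, 0.25f * (float)(sf - SF_OFFSET));
        for (int i = 0; i < n; i++) {
            /* Keep the sign and raise the magnitude to the power 4/3. */
            float mag = powf(fabsf((float)x_quant[i]), 4.0f / 3.0f);
            x_rescal[i] = (x_quant[i] < 0 ? -mag : mag) * gain;
        }
    }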
6. Joint Stereo Decoding

There are two kinds of joint stereo: M/S stereo (mid/side stereo) and intensity stereo.

6.1 M/S Stereo

In M/S stereo mode, the normalized mid/side channels are transmitted instead of the left/right channels. The reconstruction is:

l = m + s
r = m - s

where l and r are the reconstructed left and right channel values, m is the mid channel value, and s is the side channel value.

6.2 Intensity Stereo

In intensity stereo mode, the left channel transmits the amplitude, and the scale factors of the right channel transmit the stereo position is_pos. If the Huffman codebook INTENSITY_HCB or INTENSITY_HCB2 is used in the right channel of a CPE with common_window set to 1, intensity stereo decoding is used. The formulas are as follows:

is_pos += dpcm_is_pos
scale = invert_intensity * 0.5 ^ (0.25 * is_pos)
r_spec = scale * l_spec

where is_pos is the intensity stereo position transmitted in the scale factor data of the right channel, dpcm_is_pos is the differentially coded value relative to the previous is_pos (the initial is_pos is 0), and scale is the intensity factor. invert_intensity indicates whether the intensity direction is inverted (Huffman codebooks 14 and 15); it is determined by ms_used through the relation invert_intensity = 1 - 2 * ms_used. In addition, when ms_mask_present is 0, invert_intensity is 1.

6.3 Processing Flow

(Figure: joint stereo decoding flowchart)

7. PNS

The perceptual noise substitution (PNS) module models noise by parametric coding. When noise is identified in the audio signal, it is not quantized; instead, parameters tell the decoder that this is noise of a certain kind, and the decoder then uses a random sequence to synthesize that noise. In practice, the PNS module examines, for each scale factor band, signal components below 4 kHz. If such a component is neither tonal nor strongly varying in time, it is treated as a noise signal. The tonality and energy variation of the signal are computed in the psychoacoustic model.

If Huffman codebook 13 (NOISE_HCB) is encountered during decoding, PNS is in use. Because M/S stereo decoding and PNS decoding are mutually exclusive, the ms_used parameter can indicate whether the two channels use the same PNS: if ms_used is 1, both channels use the same random vector to generate the noise signal.

The energy of the PNS signal is represented by noise_nrg. If PNS is used, this noise energy is transmitted in place of the corresponding scale factor. The noise energy is coded like the scale factors, with differential coding; the first value is likewise relative to the global gain. It is interleaved with the intensity stereo position values and the scale factors, but they are ignored in the differential decoding: each noise energy value is differential with respect to the previous noise energy value, not with respect to an intensity stereo position or a scale factor. The random vector is generated so that its average energy in the scale factor band equals the energy given by noise_nrg.

7.1 Processing Flow

(Figure: PNS decoding flowchart)

8. TNS

Temporal noise shaping (TNS) is used to control the temporal shape of the quantization noise within a transform window. It is implemented by filtering part of the spectral data of each channel. Traditional transform coding schemes often run into problems with signals that change sharply in time, especially speech signals: the distribution of the quantization noise can be controlled in the frequency domain, but in the time domain it is spread evenly over the transform block. If the signal inside the block changes sharply but the coder does not switch to short blocks, this evenly distributed noise becomes audible.

TNS exploits the duality of the time and frequency domains and the time-frequency symmetry of LPC (linear predictive coding): coding in one of the two domains is equivalent to predictive coding in the other, and predictive coding in one domain increases the resolution in the other domain. Quantization noise is produced in the frequency domain, which reduces the time-domain resolution, so predictive coding is applied in the frequency domain. In aacPlus, which is based on the AAC LC profile, the TNS filter order is limited to 12.

8.1 Processing Flow

(Figure: TNS decoding flowchart)
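In the decoder, TNS is undone by applying the transmitted prediction coefficients as an all-pole (synthesis) filter along the spectral coefficients of the affected range. The following is a simplified sketch under that description; it ignores coefficient decoding, filter direction, and the exact band range, and the names are illustrative:

    /* Simplified sketch of TNS synthesis filtering in the decoder: apply an
     * all-pole filter with coefficients lpc[1..order] in place over a range
     * of spectral coefficients (lpc[0] is implicitly 1). */
    static void tns_synthesis_filter(float *spec, int len, const float *lpc, int order)
    {
        for (int i = 0; i < len; i++) {
            float y = spec[i];
            /* Subtract the prediction formed from already-filtered coefficients. */
            for (int j = 1; j <= order && j <= i; j++)
                y -= lpc[j] * spec[i - j];
            spec[i] = y;
        }
    }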
9. IMDCT

Converting the audio data from the frequency domain back to the time domain is done mainly by feeding the frequency-domain data through a bank of IMDCT filters. After the IMDCT, the output values are windowed and overlap-added to obtain the final time-domain values.

9.1 IMDCT Formula

x[i][n] = (2/N) * Σ (k = 0 .. N/2-1) spec[i][k] * cos((2π/N) * (n + n0) * (k + 1/2)), for 0 <= n < N

where n is the sample index, i is the window index, k is the spectral coefficient index, N is the window length (256 for short window sequences, 2048 for all others), and n0 = (N/2 + 1)/2.

9.2 Block Types

Because long blocks have higher frequency-domain resolution and short blocks have higher time-domain resolution, long blocks suit relatively stationary time-domain signals while short blocks suit rapidly changing time-domain signals. A long block is 2048 points long; a short block is 256 points long.

9.3 Windowing

AAC uses two window functions: the Kaiser-Bessel derived (KBD) window and the sine window. The KBD window is derived from a Kaiser-Bessel kernel window; its full definition is given in ISO/IEC 13818-7. When the KBD window is used, window_shape is 1. The sine window is defined as

w(n) = sin((π/N) * (n + 1/2)), for 0 <= n < N/2

When the sine window is used, window_shape is 0. Each window also has a separate right-half definition, so that the window shape can change between consecutive blocks.

Four different window sequences correspond to the different windowing transitions:
1) Only long blocks (ONLY_LONG_SEQUENCE)
2) Long start block (LONG_START_SEQUENCE)
3) Only short blocks (EIGHT_SHORT_SEQUENCE)
4) Long stop block (LONG_STOP_SEQUENCE)
For each sequence, the window is given for window_shape 1 and for window_shape 0, and the windowed time-domain signal is obtained by multiplying the IMDCT output by w(n).

9.4 Overlap-Add

After windowing, the time-domain values z of the current window are overlapped and added with those of the previous window to obtain the final PCM values:

out[i][n] = z[i][n] + z[i-1][n + N/2], for 0 <= n < N/2

Glossary

AAC: Advanced Audio Coding
AAC LC: AAC Low Complexity, the low-complexity profile of AAC
aacPlus: also called HE-AAC or AAC+; the AAC version formed by adding the SBR module to MPEG-4 AAC LC
MPEG: Moving Picture Experts Group
IMDCT: Inverse Modified Discrete Cosine Transform
ADIF: Audio Data Interchange Format
ADTS: Audio Data Transport Stream
SCE: Single Channel Element
CPE: Channel Pair Element
CCE: Coupling Channel Element
DSE: Data Stream Element
PCE: Program Config Element
FIL: Fill Element
ICS: Individual Channel Stream
PNS: Perceptual Noise Substitution
SBR: Spectral Band Replication
TNS: Temporal Noise Shaping
Ch: Channel
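As a closing illustration of the overlap-add step described in section 9.4, the following is a minimal sketch for the long-block case (buffer names and layout are assumptions for this example):

    #include <string.h>

    #define FRAME_LEN 1024  /* N/2 for the long transform (N = 2048) */

    /* Minimal sketch of overlap-add (9.4): windowed[] holds the 2048 windowed
     * IMDCT output samples of the current block, overlap[] holds the second
     * half of the previous block, pcm_out[] receives 1024 output samples. */
    static void overlap_add(const float *windowed, float *overlap, float *pcm_out)
    {
        for (int n = 0; n < FRAME_LEN; n++)
            pcm_out[n] = overlap[n] + windowed[n];   /* out[i][n] = z[i-1][n + N/2] + z[i][n] */
        /* Save the second half of the current block for the next frame. */
        memcpy(overlap, windowed + FRAME_LEN, FRAME_LEN * sizeof(float));
    }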