This document describes in detail the AAC audio decoding algorithm specified in ISO/IEC 13818-7 (MPEG-2 AAC) and ISO/IEC 14496-3 (MPEG-4 AAC Low Complexity).

1. Program System Structure

The following is the AAC decoding flowchart:

(Figure: AAC decoding flowchart)

After the main control module starts running, it places part of the AAC bit stream into the input buffer and locates the start of a frame by searching for the syncword. Noiseless decoding (which is in fact Huffman decoding) is then performed according to the syntax described in ISO/IEC 13818-7. After inverse quantization, joint stereo decoding, perceptual noise substitution (PNS), temporal noise shaping (TNS), the inverse modified discrete cosine transform (IMDCT), and spectral band replication (SBR), the PCM samples of the left and right channels are obtained. The main control module then puts them into the output buffer and sends them to the sound playback device.

2. Main Control Module

The main task of the main control module is to manage the input and output buffers and to call the other modules so that they work together. The input and output buffers are provided by the DSP control module. The data stored in the output buffer is decoded PCM data representing the amplitude of the sound; it is kept in a fixed-length buffer whose head pointer is obtained by calling the interface function of the DSP control module. After the output buffer is filled, interrupt processing is invoked to output the samples to the audio DAC chip (a stereo audio DAC with DirectDrive headphone amplifier) connected to the I2S interface.

3. Synchronization and Element Decoding

The synchronization and element decoding module identifies the format, decodes the header information, and decodes the element information. The decoded results are used by the subsequent noiseless decoding and scale factor decoding modules.

AAC audio files can be in either of the following formats:

ADIF: Audio Data Interchange Format. The characteristic of this format is that the start of the audio data can be located deterministically; decoding cannot begin in the middle of the data stream, that is, it must start at a clearly defined beginning. This format is therefore commonly used in disk files.

ADTS: Audio Data Transport Stream. This format is a bit stream with syncwords, so decoding can start at any position in the stream. Its characteristics are similar to those of the MP3 data stream format.

The ADIF format of AAC is organized as follows:

3.1 ADIF Organizational Structure

(Figure: ADIF organizational structure)
The general ADTS format of AAC is as follows:

3.2 ADTS Organizational Structure

(Figure: ADTS organizational structure)
The figure shows the simplified structure of an ADTS frame; the blank rectangles on both sides represent the data before and after the frame. The headers of ADIF and ADTS are different, as shown below:

3.3 ADIF Header Information

3.4 ADTS Fixed Header Information

ADTS Variable Header Information

3.5 Frame Synchronization

Frame synchronization aims to find the position of the frame header in the bit stream. According to ISO/IEC 13818-7, the frame header of the AAC ADTS format is the 12-bit syncword "1111 1111 1111" (0xFFF).
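As an illustration of frame synchronization, the following is a minimal sketch (not taken from the standard; the function name and buffer handling are assumptions for this example) of searching a byte buffer for the 12-bit ADTS syncword 0xFFF:

    #include <stddef.h>
    #include <stdint.h>

    /* Return the byte offset of the first ADTS syncword (12 ones, 0xFFF),
     * or -1 if no syncword is found in the buffer. */
    static long find_adts_sync(const uint8_t *buf, size_t len)
    {
        for (size_t i = 0; i + 1 < len; i++) {
            /* 8 ones in buf[i], then 4 ones in the high nibble of buf[i+1]. */
            if (buf[i] == 0xFF && (buf[i + 1] & 0xF0) == 0xF0)
                return (long)i;
        }
        return -1;
    }

A real decoder would additionally verify the header fields (and, where possible, the syncword of the following frame) to avoid locking onto a false syncword.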
3.6 Header Information Decoding

The ADTS header information consists of two parts: the fixed header information followed by the variable header information. The data in the fixed header is the same for every frame, while the variable header information changes from frame to frame.

3.7 Element Information Decoding

In AAC, a raw data block may be composed of the following elements:

SCE: Single Channel Element. A single channel element basically consists of only one ICS. A raw data block may contain at most 16 SCEs.

CPE: Channel Pair Element. A two-channel element consisting of two ICSs that may share side information, together with the associated joint stereo coding information. A raw data block may contain at most 16 CPEs.

CCE: Coupling Channel Element. Used together with the channel elements; it carries the multi-channel joint stereo information of a block or the dialogue information of multilingual programs.

LFE: Low Frequency Element. Contains a low-frequency enhancement channel.

DSE: Data Stream Element. Contains additional information that is not audio.

PCE: Program Config Element. Contains the channel configuration information of the program; it may appear in the ADIF header information.

FIL: Fill Element. Contains extension information such as SBR data and dynamic range control information.

3.8 Processing Flow

(1) Determine whether the file format is ADIF or ADTS.
(2) If it is ADIF, decode the ADIF header information and jump to step (6).
(3) If it is ADTS, search for the syncword.
(4) Decode the ADTS frame header information.
(5) If error checking is present, perform the error check.
(6) Decode the block information.
(7) Decode the element information.

4. Noiseless Decoding

Noiseless coding is Huffman coding. Its purpose is to further reduce the redundancy of the scale factors and the quantized spectrum, that is, to entropy-code the scale factor and quantized spectral information. The global gain is coded as an 8-bit unsigned integer. The first scale factor is coded as its difference from the global gain, using the scale factor Huffman table; every subsequent scale factor is coded as its difference from the previous scale factor (a minimal sketch of this reconstruction is given after section 4.1 below).

The noiseless coding of the quantized spectrum uses two partitionings of the spectral coefficients: a partitioning into 4-tuples or 2-tuples, and a partitioning into sections. The former determines whether a Huffman table codes 4 or 2 coefficients at a time; the latter determines which Huffman table is used. A section contains several scale factor bands and uses only one Huffman table.

4.1 Sectioning

Noiseless coding divides the 1024 input quantized spectral coefficients into several sections; all coefficients in a section are coded with the same Huffman table. For coding efficiency, the section boundaries should preferably coincide with the scale factor band boundaries. Therefore, the following information must be transmitted for each section: the section length, the corresponding scale factor bands, and the Huffman table used.
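As referenced above, the following is a minimal sketch of the differential scale factor reconstruction described at the start of this section (the dpcm[] input and the names are assumptions for this example; in a real decoder the differences come from the scale factor Huffman table):

    #include <stdint.h>

    /* Minimal sketch: rebuild scale factors from differentially coded values.
     * dpcm[i] is the i-th decoded difference; the first scale factor is
     * relative to the global gain, each later one to the previous value. */
    static void rebuild_scalefactors(uint8_t global_gain, const int *dpcm,
                                     int num_sfb, int *sf_out)
    {
        int sf = global_gain;
        for (int i = 0; i < num_sfb; i++) {
            sf += dpcm[i];
            sf_out[i] = sf;
        }
    }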
4.2 Grouping and Interleaving

Grouping means that, ignoring which window a spectral coefficient belongs to, consecutive spectral coefficients that share the same scale factors are placed into one group so that they share those scale factors, which gives better coding efficiency. This inevitably leads to interleaving: the coefficients, originally ordered as

c[group][window][scale factor band][coefficient index]

are rearranged so that coefficients with the same scale factor are placed together:

c[group][scale factor band][window][coefficient index]

which interleaves the coefficients of the different windows within a group.

4.3 Handling of Large Values

AAC handles large quantized values in two ways: with an escape flag in the Huffman codebook, or with the pulse escape method. The former is similar to the MP3 approach: a dedicated codebook is used for large values, and this codebook implies that a pair of escape values and their signs follow the Huffman-coded pair. With the pulse escape method, a large value is reduced to a smaller value by subtracting a difference, the smaller value is coded with the chosen codebook, and the difference is transmitted in a pulse structure so that the decoder can restore it.

The flowchart of noiseless decoding is as follows:

(Figure: noiseless decoding flowchart)

5. Scale Factor Decoding and Inverse Quantization

In AAC encoding, the spectral coefficients are quantized with a non-uniform quantizer, so the decoder must perform the inverse operation: keep the sign and raise the magnitude to the power 4/3. The basic method of shaping the quantization noise in the frequency domain is to use scale factors. A scale factor is a value used to change the amplitude of all spectral coefficients in a scale factor band. Together with the non-uniform quantizer, the scale factor mechanism changes the bit allocation of the quantization noise over frequency.

5.1 Scale Factor Band (scalefactor band)

The frequency lines are divided into multiple groups according to the auditory characteristics of the human ear; each group corresponds to a scale factor and is called a scale factor band. To reduce the side information for short windows, consecutive short windows may be grouped, that is, several short windows are transmitted as if they were a single window, and the scale factors then apply to all windows of the group.

5.2 Inverse Quantization Formula

x_invquant = sign(x_quant) * |x_quant| ^ (4/3)

where x_invquant is the inverse quantization result, sign(x) is the sign of x, and ^ denotes the power operation.

5.3 Applying the Scale Factor

x_rescal = x_invquant * gain
gain = 2 ^ (0.25 * (sf - SF_OFFSET))

where x_rescal is the value after the scale factor is applied, gain is the gain, sf is the scale factor value, and SF_OFFSET is a constant set to 100.
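As an illustration of sections 5.2 and 5.3, the following is a minimal sketch (function and variable names are assumptions for this example) that inverse-quantizes the coefficients of one scale factor band and applies its scale factor gain:

    #include <math.h>

    #define SF_OFFSET 100  /* constant from section 5.3 */

    /* Minimal sketch of inverse quantization (5.2) and scale factor
     * application (5.3) for the n coefficients of one scale factor band. */
    static void dequant_band(const int *x_quant, int n, int sf, float *x_rescal)
    {
        float gain = powf(2.0f, 0.25f * (float)(sf - SF_OFFSET));
        for (int i = 0; i < n; i++) {
            /* Keep the sign and raise the magnitude to the power 4/3. */
            float mag = powf(fabsf((float)x_quant[i]), 4.0f / 3.0f);
            x_rescal[i] = (x_quant[i] < 0 ? -mag : mag) * gain;
        }
    }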
6. Joint Stereo Decoding

There are two kinds of joint stereo: M/S stereo (mid/side stereo) and intensity stereo.

6.1 M/S Stereo

In M/S stereo mode, the normalized mid/side channels are transmitted instead of the left/right channels. The reconstruction is:

l = m + s
r = m - s

where l and r are the reconstructed left and right channel values, m is the mid channel value, and s is the side channel value.

6.2 Intensity Stereo

In intensity stereo mode, the left channel transmits the amplitude, and the scale factors of the right channel transmit the stereo position is_pos. If the Huffman codebook INTENSITY_HCB or INTENSITY_HCB2 is used in the right channel of a CPE with common_window set to 1, intensity stereo decoding is used. The formulas are as follows:

is_pos += dpcm_is_pos
scale = invert_intensity * 0.5 ^ (0.25 * is_pos)
r_spec = scale * l_spec

where is_pos is the intensity stereo position transmitted in the scale factor data of the right channel, dpcm_is_pos is the differentially coded value relative to the previous is_pos (the initial is_pos is 0), and scale is the intensity factor. invert_intensity indicates whether the intensity direction is inverted (Huffman codebooks 14 and 15); it is determined by ms_used through the relation invert_intensity = 1 - 2 * ms_used. In addition, when ms_mask_present is 0, invert_intensity is 1.

6.3 Processing Flow

(Figure: joint stereo decoding flowchart)

7. PNS

The perceptual noise substitution (PNS) module models noise by parametric coding. When noise is identified in the audio signal, it is not quantized; instead, parameters tell the decoder that this is noise of a certain kind, and the decoder then uses a random sequence to synthesize that noise. In practice, the PNS module examines, for each scale factor band, signal components below 4 kHz. If such a component is neither tonal nor strongly varying in time, it is treated as a noise signal. The tonality and energy variation of the signal are computed in the psychoacoustic model.

If Huffman codebook 13 (NOISE_HCB) is encountered during decoding, PNS is in use. Because M/S stereo decoding and PNS decoding are mutually exclusive, the ms_used parameter can indicate whether the two channels use the same PNS: if ms_used is 1, both channels use the same random vector to generate the noise signal.

The energy of the PNS signal is represented by noise_nrg. If PNS is used, this noise energy is transmitted in place of the corresponding scale factor. The noise energy is coded like the scale factors, with differential coding; the first value is likewise relative to the global gain. It is interleaved with the intensity stereo position values and the scale factors, but they are ignored in the differential decoding: each noise energy value is differential with respect to the previous noise energy value, not with respect to an intensity stereo position or a scale factor. The random vector is generated so that its average energy in the scale factor band equals the energy given by noise_nrg.

7.1 Processing Flow

(Figure: PNS decoding flowchart)

8. TNS

Temporal noise shaping (TNS) is used to control the temporal shape of the quantization noise within a transform window. It is implemented by filtering part of the spectral data of each channel. Traditional transform coding schemes often run into problems with signals that change sharply in time, especially speech signals: the distribution of the quantization noise can be controlled in the frequency domain, but in the time domain it is spread evenly over the transform block. If the signal inside the block changes sharply but the coder does not switch to short blocks, this evenly distributed noise becomes audible.

TNS exploits the duality of the time and frequency domains and the time-frequency symmetry of LPC (linear predictive coding): coding in one of the two domains is equivalent to predictive coding in the other, and predictive coding in one domain increases the resolution in the other domain. Quantization noise is produced in the frequency domain, which reduces the time-domain resolution, so predictive coding is applied in the frequency domain. In aacPlus, which is based on the AAC LC profile, the TNS filter order is limited to 12.

8.1 Processing Flow

(Figure: TNS decoding flowchart)
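In the decoder, TNS is undone by applying the transmitted prediction coefficients as an all-pole (synthesis) filter along the spectral coefficients of the affected range. The following is a simplified sketch under that description; it ignores coefficient decoding, filter direction, and the exact band range, and the names are illustrative:

    /* Simplified sketch of TNS synthesis filtering in the decoder: apply an
     * all-pole filter with coefficients lpc[1..order] in place over a range
     * of spectral coefficients (lpc[0] is implicitly 1). */
    static void tns_synthesis_filter(float *spec, int len, const float *lpc, int order)
    {
        for (int i = 0; i < len; i++) {
            float y = spec[i];
            /* Subtract the prediction formed from already-filtered coefficients. */
            for (int j = 1; j <= order && j <= i; j++)
                y -= lpc[j] * spec[i - j];
            spec[i] = y;
        }
    }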
9. IMDCT

Converting the audio data from the frequency domain back to the time domain is done mainly by feeding the frequency-domain data through a bank of IMDCT filters. After the IMDCT, the output values are windowed and overlap-added to obtain the final time-domain values.

9.1 IMDCT Formula

x[i][n] = (2/N) * Σ (k = 0 .. N/2-1) spec[i][k] * cos((2π/N) * (n + n0) * (k + 1/2)), for 0 <= n < N

where n is the sample index, i is the window index, k is the spectral coefficient index, N is the window length (256 for short window sequences, 2048 for all others), and n0 = (N/2 + 1)/2.

9.2 Block Types

Because long blocks have higher frequency-domain resolution and short blocks have higher time-domain resolution, long blocks suit relatively stationary time-domain signals while short blocks suit rapidly changing time-domain signals. A long block is 2048 points long; a short block is 256 points long.

9.3 Windowing

AAC uses two window functions: the Kaiser-Bessel derived (KBD) window and the sine window. The KBD window is derived from a Kaiser-Bessel kernel window; its full definition is given in ISO/IEC 13818-7. When the KBD window is used, window_shape is 1. The sine window is defined as

w(n) = sin((π/N) * (n + 1/2)), for 0 <= n < N/2

When the sine window is used, window_shape is 0. Each window also has a separate right-half definition, so that the window shape can change between consecutive blocks.

Four different window sequences correspond to the different windowing transitions:
1) Only long blocks (ONLY_LONG_SEQUENCE)
2) Long start block (LONG_START_SEQUENCE)
3) Only short blocks (EIGHT_SHORT_SEQUENCE)
4) Long stop block (LONG_STOP_SEQUENCE)
For each sequence, the window is given for window_shape 1 and for window_shape 0, and the windowed time-domain signal is obtained by multiplying the IMDCT output by w(n).

9.4 Overlap-Add

After windowing, the time-domain values z of the current window are overlapped and added with those of the previous window to obtain the final PCM values:

out[i][n] = z[i][n] + z[i-1][n + N/2], for 0 <= n < N/2

Glossary

AAC: Advanced Audio Coding
AAC LC: AAC Low Complexity, the low-complexity profile of AAC
aacPlus: also called HE-AAC or AAC+; the AAC version formed by adding the SBR module to MPEG-4 AAC LC
MPEG: Moving Picture Experts Group
IMDCT: Inverse Modified Discrete Cosine Transform
ADIF: Audio Data Interchange Format
ADTS: Audio Data Transport Stream
SCE: Single Channel Element
CPE: Channel Pair Element
CCE: Coupling Channel Element
DSE: Data Stream Element
PCE: Program Config Element
FIL: Fill Element
ICS: Individual Channel Stream
PNS: Perceptual Noise Substitution
SBR: Spectral Band Replication
TNS: Temporal Noise Shaping
Ch: Channel
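As a closing illustration of the overlap-add step described in section 9.4, the following is a minimal sketch for the long-block case (buffer names and layout are assumptions for this example):

    #include <string.h>

    #define FRAME_LEN 1024  /* N/2 for the long transform (N = 2048) */

    /* Minimal sketch of overlap-add (9.4): windowed[] holds the 2048 windowed
     * IMDCT output samples of the current block, overlap[] holds the second
     * half of the previous block, pcm_out[] receives 1024 output samples. */
    static void overlap_add(const float *windowed, float *overlap, float *pcm_out)
    {
        for (int n = 0; n < FRAME_LEN; n++)
            pcm_out[n] = overlap[n] + windowed[n];   /* out[i][n] = z[i-1][n + N/2] + z[i][n] */
        /* Save the second half of the current block for the next frame. */
        memcpy(overlap, windowed + FRAME_LEN, FRAME_LEN * sizeof(float));
    }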