I. Basic Concepts
1. Bit rate: the number of bits used to encode (compress) one second of audio data. The unit is usually kbps.
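For uncompressed PCM audio, the bit rate follows directly from the sampling rate, bit depth, and channel count. A minimal sketch (the function name is my own, not from the source):

```python
def pcm_bit_rate_kbps(sample_rate_hz: int, bits_per_sample: int, channels: int) -> float:
    """Bit rate of uncompressed PCM audio in kbps (1 kbps = 1000 bits/s)."""
    return sample_rate_hz * bits_per_sample * channels / 1000

# CD audio: 44.1 kHz sampling, 16 bits per sample, stereo
print(pcm_bit_rate_kbps(44100, 16, 2))  # → 1411.2
```

Compressed formats such as MP3 reach far lower bit rates (for example 128 kbps) by discarding perceptually irrelevant information.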
2. Loudness (sound intensity): the subjective attribute that indicates how loud a sound is perceived to be. Loudness grows with the sound's physical intensity, but it is also affected by frequency: in general, a mid-frequency pure tone sounds louder than a low- or high-frequency pure tone of the same intensity.
3. Sampling and sampling rate: sampling converts a continuous-time signal into a discrete-time digital signal. The sampling rate is the number of samples taken per second.
Nyquist sampling theorem: when the sampling rate is at least twice the highest frequency component of a continuous signal, the original continuous signal can be perfectly reconstructed from its samples.
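A small sketch of what happens when the Nyquist condition is violated: a pure tone above half the sampling rate "folds" down to a lower apparent frequency (aliasing). The helper below is an illustration of the folding rule, not something from the source:

```python
def alias_frequency(f_hz: float, fs_hz: float) -> float:
    """Apparent frequency of a pure tone of f_hz after sampling at fs_hz.
    Tones below fs/2 are unchanged; tones above fs/2 fold back into 0..fs/2."""
    f = f_hz % fs_hz              # sampling cannot distinguish f from f + k*fs
    return min(f, fs_hz - f)      # ...nor f from fs - f (spectral mirror)

print(alias_frequency(3000, 8000))  # below Nyquist: stays 3000 Hz
print(alias_frequency(5000, 8000))  # above Nyquist: folds to 3000 Hz
```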
II. Common audio formats
1. WAV is a sound file format developed by Microsoft, also known as the waveform sound file. It is the earliest digital audio format and is widely supported by the Windows platform and its applications, but its compression rate is low.
2. MIDI, short for Musical Instrument Digital Interface, is the unified international standard for digital music and electronic synthesizers. It defines how computer music programs, digital synthesizers, and other electronic devices exchange music signals, specifies the cabling and data-transmission protocol between electronic instruments and computers from different manufacturers, and can simulate the sound of many instruments. A MIDI file stores commands rather than audio samples; these commands are sent to the sound card, which synthesizes the sound according to the instructions.
3. MP3 stands for MPEG-1 Audio Layer 3, which was incorporated into the MPEG specification in 1992. MP3 compresses digital audio files heavily while retaining good sound quality, which has made it the most common format in everyday use.
4. mp3PRO was developed by Coding Technologies of Sweden. It combines two major technologies: one is Coding Technologies' own proprietary coding technology; the other is the MP3 coding technology jointly developed by the MP3 patent holders, Thomson Multimedia of France and the Fraunhofer institute of Germany. mp3PRO improves the sound quality of MP3 music without changing the file size: it compresses audio at a lower bit rate while preserving as much of the pre-compression sound quality as possible.
6. WMA (Windows Media Audio) is Microsoft's flagship format for Internet audio and video. WMA achieves a high compression rate by reducing data traffic while maintaining sound quality. In addition, WMA can protect copyrights through DRM (Digital Rights Management).
7. RealAudio is a file format launched by RealNetworks. Its biggest feature is real-time audio streaming, even over slow network connections, so RealAudio is mainly used for online playback. RealAudio file formats include RA (RealAudio), RM (RealMedia, RealAudio G2), and RMX (RealAudio Secured). What these files have in common is that voice quality adapts to the available network bandwidth: listeners on narrow connections still hear smooth audio, while listeners with more bandwidth enjoy better sound quality.
8. Audible has four different formats: Audible 1, 2, 3, and 4. The audible.com website sells audio books over the Internet and protects the files it sells through one of these four proprietary audio formats. Each format targets a particular combination of audio source and listening device. Formats 1, 2, and 3 use different levels of speech compression, while format 4 uses the same decoding method as MP3 at a lower sampling rate, which makes the speech clearer and the files faster to download. Audible files are played with Audible's own desktop tool, Audible Manager, which can play Audible-format files stored on a PC or transferred to a portable player.
9. AAC is short for Advanced Audio Coding. It is an audio format jointly developed by Fraunhofer IIS, Dolby, and AT&T, and is part of the MPEG-2 specification. The algorithm AAC uses differs from the MP3 algorithm: AAC improves encoding efficiency by combining additional techniques, and its compression capability far exceeds earlier algorithms such as MP3. It also supports up to 48 full-bandwidth audio channels and 15 low-frequency (LFE) channels, more sampling rates and bit rates, multi-language compatibility, and higher decoding efficiency. In short, AAC can provide better sound quality than MP3 with files roughly 30% smaller.
10. Ogg Vorbis is a newer audio compression format, similar in purpose to existing music formats such as MP3. The difference is that it is completely free, open, and unencumbered by patents. Vorbis is the name of the audio compression scheme, while Ogg is the name of a project to design a completely open multimedia system. Vorbis is also a lossy codec, but it reduces the perceptible loss by using a more advanced acoustic model, so at the same bit rate Ogg-encoded audio sounds better than MP3.
11. APE is a lossless audio compression format that shrinks a traditional lossless WAV file to about half its size without degrading the sound quality.
12. FLAC is the abbreviation of Free Lossless Audio Codec, a well-known free and open audio codec. As the name says, its defining feature is lossless compression.
III. Audio encoding principles
Speech coding aims to reduce the channel bandwidth required for transmission while preserving the high quality of the input speech.
The goal of speech coding is thus to design a low-complexity encoder that delivers high-quality speech at the lowest possible bit rate.
1. Threshold in quiet (absolute threshold of hearing): the minimum sound level the human ear can detect at each frequency in a quiet environment; any component below this curve is inaudible and need not be encoded.
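The source does not give the curve's formula, but one widely used approximation (Terhardt's, common in MP3-style psychoacoustic models) can serve as an illustration of its shape, with a pronounced dip around 3-4 kHz where the ear is most sensitive:

```python
import math

def threshold_in_quiet_db(f_hz: float) -> float:
    """Approximate absolute threshold of hearing in dB SPL vs. frequency
    (Terhardt's empirical formula, assumed here for illustration)."""
    khz = f_hz / 1000.0
    return (3.64 * khz ** -0.8
            - 6.5 * math.exp(-0.6 * (khz - 3.3) ** 2)
            + 1e-3 * khz ** 4)

# The ear is most sensitive around 3-4 kHz: the threshold dips there,
# and rises steeply toward very low and very high frequencies.
print(threshold_in_quiet_db(3300), threshold_in_quiet_db(100))
```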
2. Critical bands
Because the ear's frequency resolution varies with frequency, MPEG-1 Audio divides the perceptible range below 22 kHz into 23 to 26 critical bands. Figure 5 lists the center frequency and bandwidth of each ideal critical band; as the figure shows, the ear resolves low frequencies more finely.
Figure 5
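The critical-band division is usually expressed on the Bark scale, where one Bark corresponds to one critical band. A sketch using the Zwicker-Terhardt approximation (the formula is a standard external one, assumed here, not taken from the source):

```python
import math

def hz_to_bark(f_hz: float) -> float:
    """Map a frequency in Hz onto the Bark (critical-band) scale,
    using the Zwicker & Terhardt approximation."""
    return 13.0 * math.atan(0.00076 * f_hz) + 3.5 * math.atan((f_hz / 7500.0) ** 2)

print(hz_to_bark(1000))    # close to 8.5 Bark
print(hz_to_bark(22000))   # close to 25: about 25 critical bands below 22 kHz
```

Note how the scale is nearly linear below 500 Hz and roughly logarithmic above, matching the ear's finer resolution at low frequencies.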
3. Masking in the frequency domain: a signal with a large amplitude masks weaker signals at nearby frequencies. For example, a loud tone can render an otherwise audible, quieter tone at a nearby frequency completely inaudible.
4. Masking in the time domain: when two sounds occur within a very short interval, the sound with the higher SPL (sound pressure level) masks the quieter one. Temporal masking is divided into pre-masking and post-masking; post-masking lasts much longer, roughly 10 times as long as pre-masking.
The temporal masking effect helps suppress pre-echo artifacts.
IV. Encoding methods
1. Quantization and quantizers
Quantization converts a discrete-time, continuous-amplitude signal into a discrete-time, discrete-amplitude signal. Common quantizers include the uniform quantizer, the logarithmic quantizer, and the non-uniform quantizer. The design goal is to minimize the quantization error while keeping the quantizer's complexity low; the two requirements inherently conflict.
(a) Uniform quantizer: the simplest and the worst-performing; suitable only for telephone speech.
(b) Logarithmic quantizer: more complex than the uniform quantizer but still easy to implement, and its performance is better.
(c) Non-uniform quantizer: the quantizer is designed around the signal's distribution; regions where signal values are dense are quantized finely, while sparse regions are quantized coarsely.
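The difference between (a) and (b) can be sketched in a few lines. The μ-law companding below is one concrete instance of logarithmic quantization (the μ = 255 choice follows telephone practice; the function names are my own). At the same bit budget, the logarithmic quantizer spends its levels where speech samples actually cluster, near zero, so its error on small samples is far lower:

```python
import math

def uniform_quantize(x: float, bits: int) -> float:
    """Uniform quantizer on [-1, 1]: equal-width steps everywhere."""
    levels = 2 ** bits
    step = 2.0 / levels
    i = min(levels - 1, int((x + 1.0) / step))   # index of the cell containing x
    return -1.0 + (i + 0.5) * step               # reconstruct at the cell center

def mulaw_quantize(x: float, bits: int, mu: float = 255.0) -> float:
    """Logarithmic quantizer: compress amplitudes (mu-law), quantize
    uniformly in the compressed domain, then expand back."""
    comp = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    q = uniform_quantize(comp, bits)
    return math.copysign((math.exp(abs(q) * math.log1p(mu)) - 1.0) / mu, q)

# A small-amplitude sample (typical of speech): compare absolute errors.
x = 0.01
print(abs(uniform_quantize(x, 8) - x), abs(mulaw_quantize(x, 8) - x))
```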
2. Speech encoders
Speech encoders fall into three classes: (a) waveform coders; (b) vocoders; (c) hybrid coders.
A waveform coder tries to reproduce the input waveform itself, including any background noise. It works on any input signal and produces high-quality output, but consumes a high bit rate. A vocoder does not reproduce the original waveform; instead, it extracts a set of parameters and sends them to the receiver, where they drive a speech-production model, so its audio quality is lower. A hybrid coder combines the strengths of the waveform coder and the vocoder.
2.1 Waveform coders
Waveform coder design is usually independent of the signal, so it applies to the encoding of all kinds of signals, not just speech.
(1) Time-domain coding
a) PCM (pulse code modulation): the simplest coding method, consisting only of discretizing (sampling) and quantizing the signal.
b) DPCM (differential pulse code modulation): encodes only the differences between samples. One or more previous samples are used to predict the current sample; the more samples used, the more accurate the prediction. The difference between the actual value and the predicted value is called the residual, and the residual is what gets encoded.
c) ADPCM (adaptive differential pulse code modulation): builds on DPCM by adapting the quantizer and the predictor to changes in the signal, so the prediction tracks the real signal more closely, the residual shrinks, and compression efficiency rises.
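The DPCM idea above can be sketched with the simplest possible predictor, the previous sample (a first-order predictor; real codecs add quantization of the residual and, in ADPCM, adaptation). Note how the residuals are much smaller than the samples, which is exactly what makes them cheaper to encode:

```python
def dpcm_encode(samples):
    """First-order DPCM: transmit each sample's difference (residual)
    from the previous sample. The first residual is the sample itself."""
    prev = 0
    residuals = []
    for s in samples:
        residuals.append(s - prev)  # residual = actual - predicted
        prev = s
    return residuals

def dpcm_decode(residuals):
    """Invert the encoder by accumulating the residuals."""
    prev = 0
    out = []
    for r in residuals:
        prev += r
        out.append(prev)
    return out

x = [10, 12, 15, 15, 13, 9]
r = dpcm_encode(x)
print(r)  # → [10, 2, 3, 0, -2, -4]  (small values: fewer bits needed)
assert dpcm_decode(r) == x  # lossless as long as residuals are not quantized
```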
(2) Frequency-domain coding
Frequency-domain coding decomposes the signal into components at different frequencies and encodes each component independently.
a) Sub-band coding: the simplest frequency-domain coding technique. The original signal is transformed from the time domain to the frequency domain, split into several sub-bands, and each sub-band is digitally encoded separately. A bank of band-pass filters (BPFs) divides the original signal into several (say, m) sub-bands. Each sub-band is shifted down to near zero frequency by modulation (equivalent to single-sideband amplitude modulation), sampled at its required rate (the Nyquist rate for that band), and the samples are digitally encoded in the usual way, giving m parallel digital encoders. The m coded streams are then fed into a multiplexing system, which outputs the combined sub-band-coded data stream.
Different sub-bands can use different quantization methods and be assigned different numbers of bits according to a model of human auditory perception.
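As a toy stand-in for a real filter bank, a two-band average/difference split already shows the two key properties of sub-band coding: each band is critically sampled (half the samples), and the original signal is perfectly reconstructable before any quantization. This sketch is illustrative only; real codecs use proper QMF or polyphase filter banks:

```python
def analysis(x):
    """Split a signal into a low band (pairwise averages) and a high band
    (pairwise differences). Each band holds half the samples. Assumes
    len(x) is even."""
    low = [(x[2 * i] + x[2 * i + 1]) / 2 for i in range(len(x) // 2)]
    high = [(x[2 * i] - x[2 * i + 1]) / 2 for i in range(len(x) // 2)]
    return low, high

def synthesis(low, high):
    """Perfectly reconstruct the signal from the two bands."""
    out = []
    for l, h in zip(low, high):
        out += [l + h, l - h]
    return out

x = [4.0, 6.0, 10.0, 10.0, 3.0, -1.0]
low, high = analysis(x)
assert synthesis(low, high) == x  # perfect reconstruction before quantization
```

In a real codec, the high band (usually containing less energy and masked more heavily) would then be quantized with fewer bits than the low band.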
b) Transform coding: for example, DCT-based coding.
2.2 Vocoders
Channel vocoder: exploits the human ear's insensitivity to phase.
Homomorphic vocoder: can effectively process convolved (source-filter) signals.
Formant vocoder: based on the fact that most of the information in a speech signal lies in the positions and bandwidths of the formants (spectral resonance peaks).
Linear predictive vocoder: the most widely used vocoder.
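The core of the linear predictive vocoder is finding coefficients that predict each sample from its predecessors. A textbook sketch of the autocorrelation method with the Levinson-Durbin recursion (a standard technique, assumed here since the source names no specific algorithm):

```python
def lpc_coefficients(x, order):
    """Linear-prediction coefficients via the autocorrelation method
    (Levinson-Durbin recursion). Returns a with a[0] == 1; the predictor
    is xhat[n] = -(a[1]*x[n-1] + ... + a[p]*x[n-p])."""
    n = len(x)
    # Autocorrelation r[0..order]
    r = [sum(x[i] * x[i + k] for i in range(n - k)) for k in range(order + 1)]
    a = [1.0] + [0.0] * order
    err = r[0]                               # prediction error energy
    for m in range(1, order + 1):
        acc = r[m] + sum(a[j] * r[m - j] for j in range(1, m))
        k = -acc / err                       # reflection coefficient
        new_a = a[:]
        for j in range(1, m):
            new_a[j] = a[j] + k * a[m - j]
        new_a[m] = k
        a, err = new_a, err * (1.0 - k * k)
    return a

# A decaying exponential obeys x[n] = 0.9 * x[n-1]; order-1 LPC recovers it.
x = [0.9 ** n for n in range(50)]
a = lpc_coefficients(x, 1)
print(a[1])  # close to -0.9, i.e. the predictor is xhat[n] = 0.9 * x[n-1]
```

An LPC vocoder transmits only these coefficients (plus pitch and gain) per frame, which is why it reaches far lower bit rates than waveform coding.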
2.3 Hybrid coders
A waveform coder tries to preserve the waveform of the encoded signal and can deliver high-quality speech at medium bit rates (around 32 kbps), but it breaks down at low bit rates. A vocoder tries to produce a signal that sounds like the encoded signal; it yields intelligible speech at very low bit rates, but the speech does not sound natural. A hybrid coder combines the advantages of both.
RELP (residual-excited linear prediction): encodes the residual of linear prediction. Its mechanism: only a small portion (the baseband) of the residual is transmitted, and the receiver reconstructs the full residual by copying the baseband residual.
MPC (multi-pulse coding): removes the remaining correlation in the residual; it compensates for the defect of the simple voiced/unvoiced excitation model, which allows no intermediate states.
CELP (codebook-excited linear prediction): cascades a vocal-tract (short-term) predictor with a pitch (long-term) predictor, so the codebook excitation approximates the original signal more closely.
MBE (multiband excitation): aims to avoid CELP's heavy computation while achieving higher quality than a plain vocoder.