1. Introduction
Main objectives of H.264:
1. High video compression ratio
2. Good network adaptability
Solution:
VCL: Video Coding Layer
NAL: Network Abstraction Layer
VCL: the syntax-level definition of the core coding algorithms at the block, macroblock, and slice levels.
NAL: the syntax levels above the slice level (such as the sequence parameter set and picture parameter set). It also supports functions such as independent slice decoding, unique start code guarantee, SEI, and transport of coded data in stream format.
VCL design goal: to achieve efficient encoding and decoding, as independently of the network as possible.
NAL design goal: to package the data into the format required by each network, adapting the bit strings produced by the VCL to various network and multiplex environments.
NALU header structure (from most significant bit down): forbidden bit (1 bit), importance indication (2 bits), and NALU type (5 bits).
NALU type: values 1~12 are used by H.264; values 24~31 are available to applications outside H.264.
Importance indication: indicates how important the NAL unit is for reconstruction. The greater the value, the more important the unit.
Forbidden bit: if the network detects a bit error in the NAL unit, it can set this bit to 1 so that the receiver can discard the unit.
2. NAL Syntax and Semantics
NAL layer syntax:
In the bitstream output by the encoder, the basic unit of data is the syntactic element.
The syntax describes the organizational structure of the syntactic elements.
The semantics describe the meanings of the syntactic elements.
Each NAL unit has a header, so the decoder can easily detect NAL boundaries and extract NAL units for decoding.
However, in order to save bitrate, H.264 does not define a syntax element in the NAL header to indicate the unit's starting position.
If the coded data is stored on a medium, the NAL units are packed back to back, so the decoder cannot identify where each NAL unit starts and ends in the data stream.
Solution: add a start code, 0x000001, before each NAL unit.
For some media, the data stream must be length-aligned, or an integer multiple of a constant, for ease of addressing; in that case several zero bytes are added before the start code as padding.
Detecting the start of a NAL unit:
0x000001, 0x000000
We must consider the case where the sequences 0x000001 or 0x000000 occur inside the NAL unit payload itself.
Solution:
H.264 defines an emulation prevention mechanism (to prevent start-code competition):
0x000000--0x00000300
0x000001--0x00000301
0x000002--0x00000302
0x000003--0x00000303
From this we can see that, inside a NAL unit, the following three-byte sequences must not appear at any byte-aligned position:
0x000000
0x000001
0x000002
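As a sketch of the detection side, a scanner for the 0x000001 start-code prefix might look like this (the function name and return convention are ours, not from the spec; a leading zero_byte, i.e. a 4-byte 0x00000001 prefix, is handled naturally because the extra 0x00 is simply skipped):

```c
#include <stddef.h>

/* Scan a byte stream for the next 0x000001 start-code prefix.
 * Returns the offset of the first byte after the prefix, or -1
 * if no start code is found. */
static long find_start_code(const unsigned char *buf, size_t len)
{
    size_t i;
    for (i = 0; i + 2 < len; i++) {
        if (buf[i] == 0x00 && buf[i + 1] == 0x00 && buf[i + 2] == 0x01)
            return (long)(i + 3);
    }
    return -1;
}
```

Because of the emulation prevention rule above, a match found by this scanner is guaranteed to be a real NAL unit boundary, never payload data.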
forbidden_zero_bit = 0;
nal_ref_idc: the priority of the NAL unit, 0~3. A greater value means the current NAL unit is more important and should be protected with higher priority. If the current NAL unit belongs to a reference frame, or is a sequence parameter set or picture parameter set, this syntax element must be greater than 0.
nal_unit_type: the type of the current NAL unit.
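The three fields above share a single header byte. A minimal sketch of splitting it (struct and function names are ours):

```c
/* Split the one-byte NALU header into its three fields.
 * Bit layout, MSB first: forbidden_zero_bit(1) | nal_ref_idc(2) | nal_unit_type(5). */
struct nal_header {
    unsigned forbidden_zero_bit; /* must be 0 in a conforming stream */
    unsigned nal_ref_idc;        /* 0..3, importance for reconstruction */
    unsigned nal_unit_type;      /* 1..12 defined by H.264; 24..31 for external use */
};

static struct nal_header parse_nal_header(unsigned char b)
{
    struct nal_header h;
    h.forbidden_zero_bit = (b >> 7) & 0x01;
    h.nal_ref_idc        = (b >> 5) & 0x03;
    h.nal_unit_type      =  b       & 0x1F;
    return h;
}
```

For example, the common byte 0x67 parses as nal_ref_idc = 3, nal_unit_type = 7 (a sequence parameter set), consistent with the rule that parameter sets must have nal_ref_idc greater than 0.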
H.264 NAL layer processing
Structure:
The NAL uses the NALU (NAL unit) as its basic unit to support transmission of coded data over packet-switched networks.
It defines a data format that meets the requirements of the transport layer or storage medium, and provides header information to form the interface between the video coding layer and the outside world.
NALU: defines a basic format usable by both packet-based and bitstream-based systems.
RTP encapsulation: applies only to the NAL-unit-based local NAL interface.
Three different data forms:
SODB (String Of Data Bits) --> the raw coded data.
RBSP (Raw Byte Sequence Payload) --> the SODB followed by trailing bits (rbsp_trailing_bits: a single '1' bit, then '0' bits as needed) for byte alignment.
EBSP (Encapsulated Byte Sequence Payload) --> the RBSP with emulation prevention bytes (0x03) inserted. The reason: when NALUs are written into a byte stream, a start code prefix must be added before each NALU. If the slice carried by the NALU is the start of a frame, the 4-byte prefix 0x00000001 is used; otherwise the 3-byte prefix 0x000001 is used. To prevent the NALU body from conflicting with the start code, a 0x03 byte is inserted during encoding whenever two consecutive zero bytes are followed by a byte that could complete a start code, and the 0x03 is removed again during decoding. This is also known as "shell removal".
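The encoder-side RBSP-to-EBSP step can be sketched as follows (function name is ours; per the escape table above, 0x03 is inserted when two zero bytes are followed by 0x00, 0x01, 0x02, or 0x03):

```c
#include <stddef.h>

/* Insert emulation_prevention_three_byte (0x03) into an RBSP.
 * `out` must be large enough (len + len/2 is a safe upper bound).
 * Returns the EBSP length. */
static size_t rbsp_to_ebsp(const unsigned char *rbsp, size_t len,
                           unsigned char *out)
{
    size_t i, n = 0, zeros = 0;
    for (i = 0; i < len; i++) {
        if (zeros == 2 && rbsp[i] <= 0x03) {
            out[n++] = 0x03;          /* emulation_prevention_three_byte */
            zeros = 0;
        }
        out[n++] = rbsp[i];
        zeros = (rbsp[i] == 0x00) ? zeros + 1 : 0;
    }
    return n;
}
```

For instance, the RBSP bytes 00 00 01 become 00 00 03 01 in the EBSP, so the payload can no longer be mistaken for a start code.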
Processing:
1. Encapsulate the SODB output by the VCL layer into nal_unit. This is a common encapsulation format, usable for both the ordered byte-stream mode and the IP packet-switched mode.
2. For different transmission networks (circuit switched | packet switched), encapsulate nal_unit into the encapsulation cells appropriate to each network.
The specific process of step 1:
The bitstream SODB (String Of Data Bits) output by the VCL layer is turned into nal_unit in the following steps:
1. Byte-align the SODB and encapsulate it into an RBSP (Raw Byte Sequence Payload).
2. To prevent emulation of the SCP (start_code_prefix_one_3bytes, 0x000001) inside the RBSP byte stream in the ordered byte-stream transmission mode, scan the RBSP three bytes at a time; wherever start-code emulation could occur, insert emulation_prevention_three_byte (0x03) before the third byte. The specific method is as follows:
nal_unit( NumBytesInNALunit ) {
    forbidden_zero_bit
    nal_ref_idc
    nal_unit_type
    NumBytesInRBSP = 0
    for( i = 1; i < NumBytesInNALunit; i++ ) {
        if( i + 2 < NumBytesInNALunit && next_bits( 24 ) == 0x000003 ) {
            rbsp_byte[ NumBytesInRBSP++ ]
            rbsp_byte[ NumBytesInRBSP++ ]
            i += 2
            emulation_prevention_three_byte  /* equal to 0x03 */
        } else
            rbsp_byte[ NumBytesInRBSP++ ]
    }
}
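An executable sketch of the pseudocode above, i.e. the decoder-side de-escaping that recovers the RBSP from the EBSP (the function name and buffer API are ours):

```c
#include <stddef.h>

/* Strip emulation_prevention_three_byte (0x03) wherever the three-byte
 * pattern 0x000003 occurs. `ebsp` is the NALU payload after the one-byte
 * header; the result is written to `rbsp` (which may be `len` bytes).
 * Returns the RBSP length. */
static size_t ebsp_to_rbsp(const unsigned char *ebsp, size_t len,
                           unsigned char *rbsp)
{
    size_t i, n = 0;
    for (i = 0; i < len; i++) {
        if (i + 2 < len && ebsp[i] == 0x00 && ebsp[i + 1] == 0x00 &&
            ebsp[i + 2] == 0x03) {
            rbsp[n++] = ebsp[i++];    /* copy the first zero byte */
            rbsp[n++] = ebsp[i++];    /* copy the second; i now points at 0x03 */
            /* the loop's i++ skips the emulation_prevention_three_byte */
        } else {
            rbsp[n++] = ebsp[i];
        }
    }
    return n;
}
```

This mirrors the nal_unit() syntax: the two zero bytes become rbsp_byte entries and the 0x03 is consumed without being copied.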
3. Add the one-byte header (forbidden_zero_bit + nal_ref_idc + nal_unit_type) to the emulation-prevented RBSP and encapsulate it into nal_unit.
The specific process of step 2:
Case 1: encapsulation as an ordered byte stream

byte_stream_nal_unit( NumBytesInNALunit ) {
    while( next_bits( 24 ) != 0x000001 )
        zero_byte  /* equal to 0x00 */
    if( more_data_in_byte_stream( ) ) {
        start_code_prefix_one_3bytes  /* equal to 0x000001 */
        nal_unit( NumBytesInNALunit )
    }
}
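On the writing side, Case 1 reduces to prepending the start code. A minimal sketch, assuming (per the EBSP discussion above) that the 4-byte prefix is used at frame starts (the function name and `first_in_frame` flag are ours):

```c
#include <stddef.h>
#include <string.h>

/* Prepend the start code to a NALU for the ordered byte-stream format.
 * `first_in_frame` selects the 4-byte form (zero_byte + 0x000001).
 * Returns the number of bytes written to `out`. */
static size_t write_byte_stream_nalu(const unsigned char *nalu, size_t len,
                                     int first_in_frame, unsigned char *out)
{
    size_t n = 0;
    if (first_in_frame)
        out[n++] = 0x00;              /* zero_byte */
    out[n++] = 0x00;                  /* start_code_prefix_one_3bytes */
    out[n++] = 0x00;
    out[n++] = 0x01;
    memcpy(out + n, nalu, len);
    return n + len;
}
```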
Transmission systems such as H.320 and MPEG-2/H.222.0 transmit NAL units as an ordered, continuous byte or bit stream and rely on the data itself to identify NAL unit boundaries. For such systems the H.264/AVC specification defines the byte-stream format: a three-byte start code prefix, the synchronization bytes, is added before each NAL unit. In bitstream applications an extra zero byte is added per picture to ease boundary location. Additional filler data can also be added to the byte stream to increase the amount of data sent, enabling fast boundary location and synchronization recovery.
Case 2: RTP encapsulation for an IP network
Packetization rules:
(1) Keep the additional overhead low, so that the scheme works for MTU sizes ranging from roughly 100 bytes to 64 KB;
(2) The importance of a packet can be identified without decoding the data inside it;
(3) The payload specification shall make it possible to tell, without decoding, whether a packet has become undecodable because of the loss of other packets;
(4) Support dividing one NALU into multiple RTP packets;
(5) Support aggregating multiple NALUs into one RTP packet.
The NALU header can double as the RTP payload header, which makes the rules above straightforward to implement.
To put one NALU into one RTP packet, place the NALU (its header simultaneously serving as the payload header) into the RTP payload and set the RTP header fields. To avoid further fragmentation of large packets at the IP layer, the slice size is generally kept below the MTU size. Because packets may travel different paths, the decoder must reorder them; the sequence information carried in RTP solves this problem.
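A minimal sketch of this single-NALU packetization: a fixed 12-byte RTP header followed by the NALU itself, whose own header doubles as the payload header. The function name is ours, and payload type 96 is an assumption (a common dynamic PT choice, not mandated by any spec):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Build an RTP packet carrying one NALU. Returns the packet size. */
static size_t make_rtp_packet(const unsigned char *nalu, size_t len,
                              uint16_t seq, uint32_t ts, uint32_t ssrc,
                              unsigned char *out)
{
    out[0] = 0x80;                    /* V=2, P=0, X=0, CC=0 */
    out[1] = 96;                      /* M=0, PT=96 (dynamic, assumed) */
    out[2] = (unsigned char)(seq >> 8);
    out[3] = (unsigned char)(seq & 0xFF);
    out[4] = (unsigned char)(ts >> 24);
    out[5] = (unsigned char)(ts >> 16);
    out[6] = (unsigned char)(ts >> 8);
    out[7] = (unsigned char)(ts & 0xFF);
    out[8]  = (unsigned char)(ssrc >> 24);
    out[9]  = (unsigned char)(ssrc >> 16);
    out[10] = (unsigned char)(ssrc >> 8);
    out[11] = (unsigned char)(ssrc & 0xFF);
    memcpy(out + 12, nalu, len);      /* the NALU is the RTP payload */
    return 12 + len;
}
```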
NALU fragmentation
For pre-encoded content, a NALU may exceed the MTU size limit. Although IP-layer fragmentation can reduce data blocks below 64 KB, it offers no protection at the application layer, which weakens unequal-protection schemes. Because UDP datagrams are limited to under 64 KB, and a single slice may be too large for some applications, application-layer fragmentation is made part of the RTP packetization scheme.
The scheme under discussion (IETF) should meet the following requirements:
(1) NALU fragments are transmitted in ascending RTP sequence-number order;
(2) the first and last fragments of a NALU are marked;
(3) lost fragments can be detected.
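The three requirements above are met by the FU-A fragmentation unit later standardized in the IETF RTP payload format (RFC 6184). A sketch of building one fragment (function name is ours): the original NALU header is replaced by an FU indicator (F and NRI bits kept, type 28) plus an FU header whose S and E flags mark the first and last fragment, while RTP sequence numbers give the ordering:

```c
#include <stddef.h>
#include <string.h>

/* Build one FU-A fragment from a chunk of the NALU payload
 * (the payload excludes the original one-byte NALU header).
 * Returns the fragment size. */
static size_t make_fu_a(unsigned char nal_header,
                        const unsigned char *chunk, size_t chunk_len,
                        int start, int end, unsigned char *out)
{
    out[0] = (unsigned char)((nal_header & 0xE0) | 28);  /* FU indicator */
    out[1] = (unsigned char)((start ? 0x80 : 0) |        /* S flag */
                             (end   ? 0x40 : 0) |        /* E flag */
                             (nal_header & 0x1F));       /* original type */
    memcpy(out + 2, chunk, chunk_len);
    return 2 + chunk_len;
}
```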
NALU aggregation
Some NALUs, such as SEI messages and parameter sets, are very small; aggregating them helps reduce header overhead. Two aggregation packet types exist:
(1) the single-time aggregation packet (STAP), which aggregates NALUs sharing one timestamp;
(2) the multi-time aggregation packet (MTAP), which can also aggregate NALUs with different timestamps.
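As a sketch of the single-time case, the STAP-A payload of RFC 6184 carries several small NALUs (e.g. SPS + PPS), each prefixed by a 16-bit size, behind one aggregation header (type 24). The function name is ours, and NRI=3 in the aggregation header is an assumption for this example:

```c
#include <stddef.h>
#include <string.h>

/* Aggregate `count` NALUs into one STAP-A payload. Returns its size. */
static size_t make_stap_a(const unsigned char *nalus[],
                          const size_t sizes[], size_t count,
                          unsigned char *out)
{
    size_t i, n = 0;
    out[n++] = 0x78;                  /* F=0, NRI=3 (assumed), type=24 (STAP-A) */
    for (i = 0; i < count; i++) {
        out[n++] = (unsigned char)(sizes[i] >> 8);   /* 16-bit NALU size */
        out[n++] = (unsigned char)(sizes[i] & 0xFF);
        memcpy(out + n, nalus[i], sizes[i]);
        n += sizes[i];
    }
    return n;
}
```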
The NAL standardizes the format of the video data, mainly providing header information so the data can be transported and stored on various media. The NAL supports a variety of networks, including:
1. Any real-time wired or wireless Internet service using RTP/IP
2. MP4 file storage and multimedia messaging services
3. MPEG-2 systems
4. Other networks
The NAL specifies a common format suitable for both packet-oriented transport and stream transport. In fact the two formats are the same, except that in stream transport a start code prefix is added before each NAL unit.
In a packet-oriented protocol system such as Internet/RTP, the packet structure itself contains the packet-boundary delimiters, so no extra synchronization bytes are needed.
There are two types of NAL units: VCL and non-VCL.
VCL NAL units contain the coded video picture samples.
Non-VCL NAL units carry associated additional information, such as parameter sets (header-like information that applies to a large number of VCL NAL units), supplemental information for improving performance, and timing information.
Parameter sets:
A parameter set changes rarely and is used in decoding a large number of VCL NAL units. There are two types:
1. The sequence parameter set, which applies to a series of consecutive coded pictures, i.e. a video sequence.
The pictures between two IDR pictures form one sequence, governed by one sequence parameter set. The differences between IDR frames and I frames are described below.
2. The picture parameter set, which applies to one or more pictures within a video sequence.
The sequence and picture parameter set mechanism reduces the transmission of repeated parameters. Each VCL NAL unit contains an identifier pointing to the relevant picture parameter set, and each picture parameter set contains an identifier pointing to the relevant sequence parameter set.
Thus a small amount of pointer information references a large number of parameters, greatly reducing the information each VCL NAL unit would otherwise repeat.
Sequence and picture parameter sets can be sent before the VCL NAL units that use them, and can be retransmitted, greatly improving error resilience. They can also be conveyed "out of band" over a more reliable channel.
Storage unit (access unit):
A set of NAL units in a specified form is called a storage unit; each storage unit corresponds to one picture. Each storage unit contains a set of VCL NAL units that make up the primary coded picture, which consists of slices representing the video picture samples. A delimiter may be prefixed to the storage unit to ease locating its start, and supplemental enhancement information (SEI, such as picture timing information) may precede the primary coded picture. VCL NAL units following the primary coded picture contain a redundant representation of the same picture, called the redundant coded picture; when the primary coded picture's data is lost or corrupted, the redundant coded picture can be decoded instead.
Coded video sequence
A coded video sequence consists of a series of consecutive storage units that use the same sequence parameter set. Each coded video sequence can be decoded independently. A sequence starts with an instantaneous decoding refresh (IDR) storage unit. An IDR picture is an I-frame picture, and it signals that subsequent pictures do not reference any picture before it. A NAL unit stream can contain one or more coded video sequences.
RTP protocol:
The Real-time Transport Protocol (RTP) is a network protocol for handling multimedia data streams over the Internet. It can transmit streaming media data in real time in one-to-one (unicast) or one-to-many (multicast) network environments. RTP usually uses UDP to carry the multimedia data, but other protocols such as TCP or ATM can be used if needed. The protocol consists of two closely related parts: the RTP data protocol and the RTP control protocol (RTCP). The Real Time Streaming Protocol (RTSP), first proposed by RealNetworks and Netscape, sits above RTP and RTCP; its purpose is to transmit multimedia data efficiently over an IP network.
RTP data protocol
The RTP data protocol packages streaming media data and implements real-time transport of media streams. Each RTP packet consists of a header and a payload; the first 12 bytes of the header are fixed, while the payload can be audio or video data. The RTP header format is shown in Figure 1:
The important fields and their meanings are as follows:
The CSRC count (CC) indicates the number of CSRC identifiers. The CSRC identifiers follow the fixed RTP header and indicate the contributing sources of the RTP packet. RTP allows multiple data sources in one session, which an RTP mixer can combine into a single source: for example, in a teleconference a mixer can combine the voice data of all speakers into one RTP source, with the CSRC list naming the contributors.
The payload type (PT) indicates the RTP payload format, including the coding algorithm, sampling frequency, and channel count. For example, type 2 indicates voice data encoded with the ITU G.721 algorithm, sampled at 8000 Hz, single channel.
The sequence number lets the receiver detect data loss, but how lost data is handled is the application's own business; RTP itself is not responsible for retransmission.
The timestamp records the sampling instant of the first byte in the payload. The receiver can use it to determine whether data arrival is affected by delay jitter, but compensating for jitter is again the application's own business. As the RTP packet format shows, it carries the payload type, format, sequence number, timestamp, and contributing-source information, which together provide a foundation for real-time streaming media transport. RTP is designed to provide end-to-end transport for real-time data (such as interactive audio and video), so it has no notion of a connection; it can be built on connection-oriented or connectionless lower-layer transports. RTP does not depend on any special network address format, and only requires that the underlying transport support framing and segmentation. RTP itself provides no reliability mechanism; reliability must be ensured by the transport protocol or the application. In typical scenarios RTP is implemented as part of the application, on top of the transport protocol, as shown in Figure 2:
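The fields discussed above can be pulled out of a received fixed 12-byte RTP header like this (struct and function names are ours):

```c
#include <stdint.h>

/* The fields of the fixed 12-byte RTP header. */
struct rtp_header {
    unsigned version, csrc_count, marker, payload_type;
    uint16_t sequence;
    uint32_t timestamp, ssrc;
};

static struct rtp_header parse_rtp_header(const unsigned char *p)
{
    struct rtp_header h;
    h.version      = (p[0] >> 6) & 0x03;   /* always 2 */
    h.csrc_count   =  p[0]       & 0x0F;   /* CC: number of CSRC entries */
    h.marker       = (p[1] >> 7) & 0x01;
    h.payload_type =  p[1]       & 0x7F;   /* PT */
    h.sequence     = (uint16_t)((p[2] << 8) | p[3]);
    h.timestamp    = ((uint32_t)p[4] << 24) | ((uint32_t)p[5] << 16) |
                     ((uint32_t)p[6] << 8)  |  (uint32_t)p[7];
    h.ssrc         = ((uint32_t)p[8] << 24) | ((uint32_t)p[9] << 16) |
                     ((uint32_t)p[10] << 8) |  (uint32_t)p[11];
    return h;
}
```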
RTCP Control Protocol
The RTCP control protocol works together with the RTP data protocol. When an application starts an RTP session, it uses two ports, one for RTP and one for RTCP. RTP itself guarantees neither in-order delivery of data packets nor flow control or congestion control; RTCP takes care of these. Typically, RTCP uses the same distribution mechanism as RTP to periodically send control packets to all session members. Applications receive these packets, obtain information about the session participants together with feedback such as network conditions and packet loss probability, and use it to control service quality or diagnose the network.
RTCP's functions are implemented through different RTCP packet types, mainly the following:
SR: sender report. A sender is an application or terminal that sends RTP packets; a sender can also be a receiver.
RR: receiver report. A receiver is an application or terminal that only receives, but does not send, RTP packets.
SDES: source description. It mainly carries identity information of session members, such as user name, e-mail address, and telephone number, and also provides the ability to send session control information to members.
BYE: its main function is to indicate that one or more sources are no longer active, i.e., to notify the other members that a participant is leaving the session.
APP: application-defined packets. They solve RTCP's extensibility problem and give protocol implementers great flexibility.
RTCP packets carry the information needed for quality-of-service monitoring, allowing service quality to be adjusted dynamically and network congestion to be controlled effectively. Because RTCP packets are multicast, all members of a session can use the control information they carry to learn the current state of the other participants.
In a typical application, the application sending a media stream periodically generates sender reports (SR), which contain synchronization information between different media streams as well as counts of the packets and bytes sent; the receiver can estimate the actual transmission rate from this information. The receiver, in turn, sends receiver reports (RR) to all known senders, containing the highest sequence number received, the number of packets lost, the delay jitter, timestamps, and other important information. From this, the sending application can estimate the round-trip delay, and dynamically adjust its sending rate according to the loss probability and jitter to mitigate network congestion, or smoothly adapt the application's service quality to network conditions.
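The receiver-side statistics an RR carries can be sketched with the RFC 3550 formulas: interarrival jitter is smoothed with a 1/16 gain, and the loss fraction compares expected against received packet counts as an 8-bit fixed-point value (function names are ours):

```c
/* One jitter update, J += (|D| - J) / 16, with transit times in
 * RTP timestamp units (arrival time minus RTP timestamp). */
static int update_jitter(int jitter, int transit, int last_transit)
{
    int d = transit - last_transit;   /* D(i-1, i) */
    if (d < 0)
        d = -d;
    return jitter + (d - jitter) / 16;
}

/* Fraction of packets lost since the last report, scaled by 256
 * as carried in the RR "fraction lost" byte. */
static int loss_fraction_q8(unsigned expected, unsigned received)
{
    unsigned lost;
    if (expected == 0 || received >= expected)
        return 0;                     /* clamp negative loss to 0 */
    lost = expected - received;
    return (int)((lost << 8) / expected);
}
```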
RTSP real-time streaming protocol
As an application-layer protocol, RTSP provides an extensible framework that makes controlled, on-demand delivery of real-time streaming media data possible. In general, RTSP is a streaming media protocol used mainly to control data transmission with real-time properties. RTSP itself does not transmit data; it relies on services provided by lower-layer transport protocols. RTSP provides streaming media operations such as play, pause, and fast forward; it defines the specific control messages, operation methods, status codes, and so on, and also describes its interaction with RTP.
RTSP borrows heavily from HTTP/1.1; many of its descriptions are even identical to those in HTTP/1.1. RTSP deliberately uses syntax and operations similar to HTTP/1.1 in order to be compatible with the existing web infrastructure, so most HTTP/1.1 extensions can be carried over directly to RTSP.
A set of media streams controlled by RTSP can be defined by a presentation description. A presentation is the set of one or more media streams that the streaming media server offers to the client; the description contains information about each media stream, such as the codec, the network address, and the content of the stream.
Although an RTSP server also uses identifiers to distinguish sessions, an RTSP session is not bound to a transport-layer connection (such as TCP); during a session, an RTSP client may open and close multiple reliable transport connections to the server in order to issue RTSP requests. RTSP sessions can also use a connectionless transport protocol (such as UDP).
RTSP currently supports the following operations:
Retrieval of media: the client can request a presentation description via HTTP or some other method. If the presentation is multicast, the description contains the multicast addresses and port numbers used for the media streams; if it is unicast, only the destination address is provided, for security.
Invitation of a media server: a media server can be invited to join an ongoing conference, either to play back media into the presentation or to record all of the media in the presentation, or a subset of it. This is well suited to distributed teaching.
Adding media to an existing presentation: notifying clients of newly available media streams, which is particularly useful for live lectures. As in HTTP/1.1, RTSP requests can be handled by proxies, tunnels, or caches.
3. Processing in jm86
Involved functions:
Flowchart:
Differences between I frames and IDR frames:
1. In H.264, an ordinary I frame does not provide random access; that function is carried by the IDR frame. In earlier standards, the I frame itself served this purpose.
2. An IDR frame causes the DPB (the reference frame list; this is the key point) to be emptied, while an ordinary I frame does not.
3. Both I frames and IDR frames are intra pictures and both use intra-frame prediction. The IDR frame's role is an immediate refresh, so that errors cannot propagate; encoding restarts with a new sequence from the IDR frame.
4. An IDR picture must be an I picture, but an I picture is not necessarily an IDR picture. A sequence can contain many I pictures, and pictures after an I picture may use pictures from before that I picture as motion references.