MP4 File Format Parsing

Source: Internet
Author: User
Tags format definition

Objective

The MP4 file encapsulation format is based on the QuickTime container format, so the QuickTime format definition is a helpful reference for understanding MP4. MP4 is a very open container that can describe almost any media structure. The media descriptions in an MP4 file are kept separate from the media data, and the media data can be organized freely: it does not have to be in chronological order, and it can even reside in other files. MP4 also supports streaming media. Today MP4 is widely used to encapsulate H.264 video and AAC audio, and it is a representative format for HD video.

I. Overview

All data in an MP4 file is stored in boxes (called atoms in QuickTime). In other words, an MP4 file consists of a number of boxes, each with a type and a length, and each can be thought of as a data object. A box can contain other boxes; such a box is called a container box. An MP4 file first has exactly one "ftyp" box, which marks the file as MP4 and carries some information about the file. Then comes exactly one "moov" box (Movie Box), a container box whose child boxes hold the metadata of the media. The media data itself is stored in "mdat" boxes (Media Data Box); there can be several of them, or none at all when all media data is referenced from other files. The structure of the media data is described by the metadata.

Here are some concepts:

track: a collection of samples. For media data, a track represents a video or audio sequence.

hint track: a special track that does not contain media data; instead it contains instructions for packaging other tracks into streaming media.

sample: for a non-hint track, a video sample is a single frame or a group of consecutive frames, and an audio sample is a continuous piece of compressed audio. For a hint track, a sample defines the format of one or more streaming packets.

chunk: a unit made up of several samples of one track.

This article does not cover hint tracks; it only discusses local MP4 files that contain their own media data.

[Figure: structure tree of a typical MP4 file.]

II. Box

First, note that the byte order inside a box is network byte order, that is, big-endian: the most significant byte is stored at the lowest address. A box consists of a header and a body. The header specifies the box's size and type; the body has a different meaning and format depending on the type.

A standard box starts with a 4-byte box size. The size covers the whole box, header and body included, which is what lets us locate each box in the file. If size is 1, the actual size is stored in a 64-bit large size field that follows the box type (in practice only "mdat" boxes are likely to need it). If size is 0, the box is the last one in the file and extends to the end of the file (this, too, normally occurs only for "mdat").

The 32 bits immediately after the size are the box type, normally 4 characters such as "ftyp" or "moov". These box types are predefined, each with a fixed meaning. The type "uuid" indicates a user-extended box. A box whose type is not recognized should be ignored.
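
The header rules above can be sketched in a few lines of Python. This is a minimal illustration, not a complete parser: the names `BoxHeader` and `read_box_header` are made up for this example, the whole file is assumed to be in memory as `bytes`, and the 16-byte extended type that follows a "uuid" type comes from the ISO base media file format rather than from the text above.

```python
import struct
from typing import NamedTuple


class BoxHeader(NamedTuple):
    box_type: str     # four-character code, e.g. "ftyp"
    size: int         # total box size in bytes, header included
    header_size: int  # 8 normally, 16 with a large size, +16 for "uuid"


def read_box_header(buf: bytes, offset: int = 0) -> BoxHeader:
    """Parse one box header starting at `offset` (hypothetical helper)."""
    # Big-endian: 32-bit size followed by the 4-byte type.
    size, raw_type = struct.unpack_from(">I4s", buf, offset)
    header_size = 8
    if size == 1:
        # The real size is a 64-bit field right after the type.
        (size,) = struct.unpack_from(">Q", buf, offset + 8)
        header_size = 16
    elif size == 0:
        # Last box in the file: it runs to the end of the data.
        size = len(buf) - offset
    box_type = raw_type.decode("latin-1")
    if box_type == "uuid":
        header_size += 16  # assumption from the ISO spec: 16-byte user type
    return BoxHeader(box_type, size, header_size)
```

The body of the box then occupies `buf[offset + header_size : offset + size]`, and the next sibling box starts at `offset + size`.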

III. File Type Box (ftyp)

There is exactly one box of this type. It may appear only at the file level and cannot be contained in another box. It should sit at the very beginning of the file, where it indicates how the MP4 file is meant to be used.

The "ftyp" body consists of a 32-bit major brand (4 characters), a 32-bit minor version (an integer), and an array of 32-bit compatible brands. Together they indicate which applications of the format the file is intended for.

[Figure: example bytes of an "ftyp" box.]
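
As a rough sketch of that layout, the following Python function decodes an "ftyp" body. The function name `parse_ftyp` and the returned dictionary keys are assumptions made for this example; `body` is taken to be the bytes after the 8-byte box header.

```python
import struct


def parse_ftyp(body: bytes) -> dict:
    """Decode the body of an "ftyp" box (hypothetical helper)."""
    # 4-character major brand, then a 32-bit big-endian minor version.
    major, minor = struct.unpack_from(">4sI", body, 0)
    # The rest of the body is a list of 4-character compatible brands.
    brands = [
        body[i:i + 4].decode("latin-1")
        for i in range(8, len(body) - 3, 4)
    ]
    return {
        "major_brand": major.decode("latin-1"),
        "minor_version": minor,
        "compatible_brands": brands,
    }
```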

IV. Movie Box (moov)

This box contains the metadata of the file's media. "moov" is a container box; the actual information is carried by its child boxes. Like the File Type Box, there is only one "moov" box, and it appears only at the file level. "moov" usually follows "ftyp" directly.

In general, "moov" contains one "mvhd" and several "trak" boxes. "mvhd" is the header box and is normally the first child of "moov" (in other container boxes, too, the header box should come first). "trak" is a container box holding the information for one track.

[Figure: part of a "moov" byte dump, with the box header in red, "mvhd" in green, and part of "trak" in yellow.]

1. Movie Header Box (mvhd)

The "mvhd" structure is shown in the following table:

| Field | Bytes | Meaning |
| --- | --- | --- |
| Box size | 4 | Box size |
| Box type | 4 | Box type |
| Version | 1 | Box version, 0 or 1, typically 0 (the byte counts below assume version = 0) |
| Flags | 3 | |
| Creation time | 4 | Creation time (seconds since 1904-01-01 00:00 UTC) |
| Modification time | 4 | Modification time |
| Time scale | 4 | Number of time units in one second; it defines the time unit used by the duration |
| Duration | 4 | Length in time scale units; dividing the duration by the time scale gives the length in seconds. For example, an audio track with time scale = 8000 and duration = 560128 lasts 70.016 s, and a video track with time scale = 600 and duration = 42000 lasts 70 s |
| Rate | 4 | Recommended playback rate in [16.16] format: the high 16 bits are the integer part and the low 16 bits the fractional part. 1.0 (0x00010000) means normal forward playback |
| Volume | 2 | Like rate, but in [8.8] format; 1.0 (0x0100) means maximum volume |
| Reserved | 10 | Reserved |
| Matrix | 36 | Video transformation matrix |
| Pre-defined | 24 | |
| Next track ID | 4 | ID to be used for the next track |
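
To make the field offsets and the fixed-point formats concrete, here is a small Python sketch that decodes a version-0 "mvhd" body. The name `parse_mvhd_v0` and the dictionary keys are assumptions for this example, and `body` is assumed to start right after the box size and box type fields.

```python
import struct
from datetime import datetime, timedelta, timezone

# Times are counted in seconds from 1904-01-01 00:00 UTC.
MP4_EPOCH = datetime(1904, 1, 1, tzinfo=timezone.utc)


def parse_mvhd_v0(body: bytes) -> dict:
    """Decode a version-0 "mvhd" body (hypothetical helper)."""
    if body[0] != 0:
        raise ValueError("this sketch only handles version 0")
    # After version (1) + flags (3): four 32-bit fields, rate, volume.
    creation, modification, timescale, duration, rate, volume = (
        struct.unpack_from(">5IH", body, 4)
    )
    # Skip reserved (10), matrix (36) and pre-defined (24) bytes.
    (next_track_id,) = struct.unpack_from(">I", body, 96)
    return {
        "creation_time": MP4_EPOCH + timedelta(seconds=creation),
        "modification_time": MP4_EPOCH + timedelta(seconds=modification),
        "timescale": timescale,
        "duration_seconds": duration / timescale if timescale else None,
        "rate": rate / 0x10000,    # [16.16] fixed point
        "volume": volume / 0x100,  # [8.8] fixed point
        "next_track_id": next_track_id,
    }
```

With time scale = 600 and duration = 42000, `duration_seconds` comes out as 70.0, matching the video example in the table.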

[Figure: example bytes of "mvhd", with each field marked in a different color.]

2. Track Box (trak)

"trak" is also a container box. Its child boxes contain the media data references and descriptions for the track (hint tracks excepted). An MP4 file can contain several tracks and must contain at least one; tracks are independent of each other, and each has its own temporal and spatial information. "trak" must contain one "tkhd" and one "mdia", plus a number of optional boxes (omitted here). "tkhd" is the track header box; "mdia" is the media box, a container box holding boxes with information about the track's media data.

[Figure: part of a "trak" byte dump, with the "trak" box header in yellow, "tkhd" in green, "edts" (an optional box) in blue, and part of "mdia" in red.]

2.1. Track Header Box (tkhd)

The "tkhd" structure is shown in the following table:

| Field | Bytes | Meaning |
| --- | --- | --- |
| Box size | 4 | Box size |
| Box type | 4 | Box type |
| Version | 1 | Box version, 0 or 1, typically 0 (the byte counts below assume version = 0) |
| Flags | 3 | Bitwise OR of the predefined values listed after this table |
| Creation time | 4 | Creation time (seconds since 1904-01-01 00:00 UTC) |
| Modification time | 4 | Modification time |
| Track ID | 4 | ID number; it must be unique and cannot be 0 |
| Reserved | 4 | Reserved |
| Duration | 4 | Duration of the track |
| Reserved | 8 | Reserved |
| Layer | 2 | Video layer, default 0; tracks with smaller values are drawn on top |
| Alternate group | 2 | Track grouping information; the default 0 means the track has no group relationship with other tracks |
| Volume | 2 | [8.8] format; for an audio track 1.0 (0x0100) means maximum volume, otherwise 0 |
| Reserved | 2 | Reserved |
| Matrix | 36 | Video transformation matrix |
| Width | 4 | Width |
| Height | 4 | Height. Width and height are both [16.16] values; they give the display size during playback, which may differ from the actual picture size in the sample description |

The flags field is the bitwise OR of the following predefined values:

0x000001 track_enabled: the track is enabled; otherwise it is not played

0x000002 track_in_movie: the track is used in playback

0x000004 track_in_preview: the track is used in the preview

The value is typically 7. If track_in_movie and track_in_preview are not set on any track of a movie, they are treated as set on all tracks. For a hint track the value is 0.
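
The table translates fairly directly into code. The following Python sketch decodes the flags, track ID, duration, and display size of a version-0 "tkhd" body; the function name `parse_tkhd_v0` and the key names are assumptions for this example, and `body` is assumed to start right after the box size and box type fields.

```python
import struct

TRACK_ENABLED = 0x000001
TRACK_IN_MOVIE = 0x000002
TRACK_IN_PREVIEW = 0x000004


def parse_tkhd_v0(body: bytes) -> dict:
    """Decode selected fields of a version-0 "tkhd" body (hypothetical)."""
    if body[0] != 0:
        raise ValueError("this sketch only handles version 0")
    flags = int.from_bytes(body[1:4], "big")
    (track_id,) = struct.unpack_from(">I", body, 12)
    (duration,) = struct.unpack_from(">I", body, 20)
    layer, alternate_group, volume = struct.unpack_from(">hhH", body, 32)
    width, height = struct.unpack_from(">II", body, 76)
    return {
        "enabled": bool(flags & TRACK_ENABLED),
        "in_movie": bool(flags & TRACK_IN_MOVIE),
        "in_preview": bool(flags & TRACK_IN_PREVIEW),
        "track_id": track_id,
        "duration": duration,          # in movie time scale units
        "layer": layer,
        "alternate_group": alternate_group,
        "volume": volume / 0x100,      # [8.8] fixed point
        "width": width / 0x10000,      # [16.16] fixed point
        "height": height / 0x10000,
    }
```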

2.2. Media Box (mdia)

"mdia" is also a container box, and the structure and types of its child boxes are relatively complex.

[Figure: example box tree of "mdia".]

Overall, "mdia" defines the track's media type and the sample data, and it describes the sample information. "mdia" generally contains one "mdhd", one "hdlr", and one "minf": "mdhd" is the media header box, "hdlr" is the handler reference box, and "minf" is the media information box. The structure of each is described in turn below.

2.2.1 Media Header Box (mdhd)

The "mdhd" structure is shown in the following table:

| Field | Bytes | Meaning |
| --- | --- | --- |
| Box size | 4 | Box size |
| Box type | 4 | Box type |
| Version | 1 | Box version, 0 or 1, typically 0 (the byte counts below assume version = 0) |
| Flags | 3 | |
| Creation time | 4 | Creation time (seconds since 1904-01-01 00:00 UTC) |
| Modification time | 4 | Modification time |
| Time scale | 4 | Number of time units in one second; it defines the time unit used by the duration |
| Duration | 4 | Duration of the track |
| Language | 2 | Media language code. The highest bit is 0, and the remaining 15 bits hold 3 characters |
| Pre-defined | 2 | |
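
The packed language field is the least obvious entry in the table. As a sketch: the ISO base media file format stores each of the three lowercase letters in 5 bits as the character code minus 0x60 (that offset comes from the specification, not from the table above). The helper names below are made up for this example.

```python
def decode_mdhd_language(packed: int) -> str:
    """Unpack the 16-bit language field into a three-letter code."""
    # Three 5-bit groups, most significant first; add 0x60 to get ASCII.
    return "".join(
        chr(((packed >> shift) & 0x1F) + 0x60) for shift in (10, 5, 0)
    )


def encode_mdhd_language(code: str) -> int:
    """Pack a three-letter lowercase code into the 16-bit field."""
    if len(code) != 3:
        raise ValueError("expected a three-letter language code")
    packed = 0
    for ch in code:
        packed = (packed << 5) | ((ord(ch) - 0x60) & 0x1F)
    return packed
```

For example, the value 0x55C4 decodes to "und" (undetermined) and 0x15C7 decodes to "eng".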

2.2.2 Handler Reference Box (hdlr)

"hdlr" declares how the media in the track is to be handled during playback. The box can also appear inside a Meta Box ("meta"). The "hdlr" structure is shown in the following table.

| Field | Bytes | Meaning |
| --- | --- | --- |
| Box size | 4 | Box size |
| Box type | 4 | Box type |
| Version | 1 | Box version, 0 or 1, typically 0 (the byte counts below assume version = 0) |
| Flags | 3 | |
| Pre-defined | 4 | |
| Handler type | 4 | In a media box, the value is 4 characters: "vide" for a video track, "soun" for an audio track, "hint" for a hint track |
| Reserved | 12 | |
| Name | Variable | Track type name, a string terminated by '\0' |
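
A short Python sketch of reading the two interesting fields, assuming `body` starts right after the box size and box type fields and that the name is a null-terminated string as in the table. The function name `parse_hdlr` is an assumption for this example.

```python
from typing import Tuple


def parse_hdlr(body: bytes) -> Tuple[str, str]:
    """Return (handler_type, name) from an "hdlr" body (hypothetical)."""
    # version (1) + flags (3) + pre-defined (4) come first.
    handler_type = body[8:12].decode("latin-1")
    # 12 reserved bytes follow, then the name up to the first NUL byte.
    raw_name = body[24:].split(b"\x00", 1)[0]
    return handler_type, raw_name.decode("utf-8", errors="replace")
```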

[Figure: example bytes of "hdlr", with each field marked in a different color.]

2.2.3 Media Information Box (minf)

"minf" stores the handler-specific information needed to interpret the track's media data; the media handler uses it to map media time to media data and to process that data. The format and content of the information in "minf" are tied closely to the media type and to the media handler that interprets it; other media handlers do not know how to interpret it. "minf" is a container box whose actual content is described by its child boxes.

In general, "minf" contains one header box, one "dinf", and one "stbl". Depending on the track type, the header box is "vmhd", "smhd", "hmhd", or "nmhd". "dinf" is the data information box, and "stbl" is the sample table box. Each is described below.

2.2.3.1 Media Information Header Box (vmhd, smhd, hmhd, nmhd)

Video Media Header Box (vmhd)

| Field | Bytes | Meaning |
| --- | --- | --- |
| Box size | 4 | Box size |
| Box type | 4 | Box type |
| Version | 1 | Box version, 0 or 1, typically 0 (the byte counts below assume version = 0) |
| Flags | 3 | |
| Graphics mode | 2 | Video compositing mode; 0 copies the source image, other values use opcolor |
| Opcolor | 2×3 | {red, green, blue} |

Note: MP4 files recorded with the FFmpeg library may differ from the table above.

Sound Media Header Box (smhd)

| Field | Bytes | Meaning |
| --- | --- | --- |
| Box size | 4 | Box size |
| Box type | 4 | Box type |
| Version | 1 | Box version, 0 or 1, typically 0 (the byte counts below assume version = 0) |
| Flags | 3 | |
| Balance | 2 | Stereo balance as an [8.8] value, typically 0; -1.0 is fully left and 1.0 is fully right |
| Reserved | 2 | |

Hint Media Header Box (hmhd)

Omitted.

Null Media Header Box (nmhd)

Media other than audio and video use this box; omitted.

2.2.3.2 Data Information Box (dinf)

"dinf" describes how to locate the media data and is a container box. It normally contains one "dref", the data reference box. "dref" in turn contains a number of "url" or "urn" boxes, which together form a table used to locate the track's data. Put simply, a track can be split into segments, each of which fetches its data from the location given by a "url" or "urn" entry; the sample description refers to these entries by index, and the segments together make up the complete track. When the data is entirely contained in the file itself, the location string in "url" or "urn" is empty.

"dref" byte structure

| Field | Bytes | Meaning |
| --- | --- | --- |
| Box size | 4 | Box size |
| Box type | 4 | Box type |
| Version | 1 | Box version, 0 or 1, typically 0 (the byte counts below assume version = 0) |
| Flags | 3 | |
| Entry count | 4 | Number of elements in the "url" or "urn" table |
| "url" or "urn" list | Variable | |

[Figure: example bytes of "dinf".] In the byte dump, yellow marks the "dinf" box header. The red part shows that the number of "url" or "urn" entries is 1, and it is followed by the content of the "url" box: purple is the "url" box header and green is the box flags, whose value is 1. That value means the "url" string is empty and the track data is contained in this file.
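
The entry list can be walked with the same size/type header logic as any other box. The sketch below assumes `body` is the "dref" content after its box size and box type, and it relies on two details from the ISO base media file format that the text above does not spell out: the entry types are written with a trailing space ("url " and "urn "), and bit 0x000001 of an entry's flags marks data stored in the same file. The function name `parse_dref` is made up for this example.

```python
import struct


def parse_dref(body: bytes) -> list:
    """List the entries of a "dref" body (hypothetical helper)."""
    (entry_count,) = struct.unpack_from(">I", body, 4)
    entries, pos = [], 8
    for _ in range(entry_count):
        size, raw_type = struct.unpack_from(">I4s", body, pos)
        if size < 12:
            raise ValueError("entry too small to hold version and flags")
        # Each entry: size (4), type (4), version (1), flags (3), string.
        flags = int.from_bytes(body[pos + 9:pos + 12], "big")
        location = body[pos + 12:pos + size].rstrip(b"\x00")
        entries.append({
            "type": raw_type.decode("latin-1"),
            "self_contained": bool(flags & 0x000001),
            "location": location.decode("utf-8", errors="replace"),
        })
        pos += size
    return entries
```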

2.2.3.3 Sample Table Box (stbl)

"stbl" is arguably the most complex box in an ordinary MP4 file. First recall the concept of a sample: a sample is the unit in which media data is stored, and samples are stored in chunks. Chunks and samples can each vary in length.

[Figure: samples grouped into chunks of different lengths.]

"stbl" contains the time and location information for every sample in the track, along with the information needed to decode the samples. With this table you can determine the timing, type, size, and position within its container of each sample. "stbl" is a container box whose child boxes include: Sample Description Box ("stsd"), Time to Sample Box ("stts"), Sample Size Box ("stsz" or "stz2"), Sample to Chunk Box ("stsc"), Chunk Offset Box ("stco" or "co64"), Composition Time to Sample Box ("ctts"), Sync Sample Box ("stss"), and others.

"stsd" is mandatory and contains at least one entry. Each entry includes the data reference index used to retrieve the sample data. Without "stsd" the storage location of the media samples cannot be determined. "stsd" also holds the coding information, and what it stores depends on the media type.

Sample Description Box (stsd)

After the box header and the version field comes an entry count field, followed by that many entries. Each entry carries type information and, depending on the media type of the track ("vide", "soun", and so on), provides different information: a "VisualSampleEntry" for a video track and an "AudioSampleEntry" for an audio track.

The video coding type, width and height, and length, as well as the audio channel count and sampling information, all appear in this box.

Time to Sample Box (stts)

"stts" stores the durations of the samples and describes how sample times map to sample numbers; with it, the sample at any given time can be found. The box holds a compact table for this mapping, while other tables supply the size and position of each sample. Each entry in the table gives a number of consecutive samples that share the same duration, together with that duration (the time delta between samples). Accumulating these deltas yields the complete time-to-sample map.

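The run-length table in "stts" can be expanded as follows. This is a sketch under the assumption that the table has already been read into a list of `(sample_count, sample_delta)` pairs in time scale units; the function names are made up for this example, and sample numbers are 1-based.

```python
def decode_times(stts_entries):
    """Expand (sample_count, sample_delta) runs into per-sample start times."""
    times, t = [], 0
    for count, delta in stts_entries:
        for _ in range(count):
            times.append(t)
            t += delta
    return times


def sample_at_time(stts_entries, target):
    """Return the 1-based number of the sample covering `target`, or None."""
    t, first = 0, 1
    for count, delta in stts_entries:
        span = count * delta
        if delta > 0 and target < t + span:
            # The target falls inside this run of equal-length samples.
            return first + (target - t) // delta
        t += span
        first += count
    return None  # target lies at or beyond the end of the track
```

For a table of three samples lasting 1000 units followed by two lasting 500 units, the start times are 0, 1000, 2000, 3000, and 3500, and time 2500 falls in sample 3.
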
Sample Size Box (stsz)

"stsz" defines the size of each sample. It contains the total number of samples in the media and a table giving the size of each one. As a result, this box tends to be relatively large.

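A minimal Python sketch of reading the size table, assuming `body` starts right after the box size and box type fields. It also assumes the layout from the ISO base media file format, in which a non-zero default sample size field means every sample has that size and no table follows; the function name `parse_stsz` is made up for this example.

```python
import struct


def parse_stsz(body: bytes) -> list:
    """Return the size in bytes of every sample (hypothetical helper)."""
    # version/flags (4), default sample size (4), sample count (4).
    uniform_size, count = struct.unpack_from(">II", body, 4)
    if uniform_size:
        # All samples share one size, so no per-sample table is stored.
        return [uniform_size] * count
    return list(struct.unpack_from(f">{count}I", body, 12))
```
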
Sample to Chunk Box (stsc)

Organizing samples into chunks makes data access easier to optimize; a chunk contains one or more samples. The table in "stsc" describes the mapping between samples and chunks: by consulting it you can find the chunk that contains a given sample, and from there the sample itself.
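
The lookup can be sketched as follows. The entry layout `(first_chunk, samples_per_chunk, sample_description_index)` is taken from the ISO base media file format rather than from the text above: each entry applies from its first chunk up to the chunk before the next entry's first chunk. The function names are made up for this example, the total chunk count is assumed to come from the chunk offset table, and chunk and sample numbers are 1-based.

```python
def samples_per_chunk(stsc_entries, chunk_count):
    """Expand run-length "stsc" entries into one sample count per chunk."""
    counts = []
    for i, (first_chunk, per_chunk, _desc_index) in enumerate(stsc_entries):
        if i + 1 < len(stsc_entries):
            next_first = stsc_entries[i + 1][0]
        else:
            next_first = chunk_count + 1  # last run extends to the final chunk
        counts.extend([per_chunk] * (next_first - first_chunk))
    return counts


def locate_sample(stsc_entries, chunk_count, sample_number):
    """Return (chunk_number, index_in_chunk) for a 1-based sample number."""
    remaining = sample_number - 1
    counts = samples_per_chunk(stsc_entries, chunk_count)
    for chunk_number, count in enumerate(counts, start=1):
        if remaining < count:
            return chunk_number, remaining
        remaining -= count
    raise IndexError("sample number beyond the last chunk")
```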

Sync Sample Box (stss)

"stss" identifies the keyframes in the media. In compressed media, a keyframe is the first frame of a compressed sequence: it can be decoded without reference to earlier frames, and the frames that follow depend on it for decoding. "stss" marks the random access points in the media very compactly. It contains a table of sample numbers, sorted in strictly increasing order, that says which samples are keyframes. If the table is absent, every sample is a keyframe and therefore a random access point.

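Because the table is sorted, seeking to the keyframe at or before a given sample is a binary search. The sketch below assumes the table has been read into a Python list of 1-based sample numbers, with `None` standing for a missing "stss" box; the function name `nearest_keyframe` is made up for this example.

```python
from bisect import bisect_right


def nearest_keyframe(sync_samples, sample_number):
    """Return the last keyframe at or before `sample_number`, or None."""
    if sync_samples is None:
        # No "stss" box: every sample is a random access point.
        return sample_number
    i = bisect_right(sync_samples, sample_number)
    return sync_samples[i - 1] if i else None
```

A player that wants to show sample 45 of a track with keyframes at 1, 31, and 61 would start decoding at sample 31.
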
Chunk Offset Box (stco)

"stco" defines the position of each chunk in the media stream. The offsets come in two widths, 32-bit ("stco") and 64-bit ("co64"); the latter is useful for very large movies. A table uses only one of the two. Each position is an offset within the whole file, not within any box, so the media data can be located directly in the file without interpreting the boxes around it. Note that if any box earlier in the file changes, the table must be rebuilt, because the positions have changed.

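Putting the tables together, the file position of every sample follows from the chunk offsets, the number of samples in each chunk, and the sample sizes. The sketch below assumes these three tables are already available as plain Python lists (chunk offsets from "stco" or "co64", per-chunk sample counts expanded from "stsc", sizes from "stsz"); the function name `sample_offsets` is made up for this example.

```python
def sample_offsets(chunk_offsets, samples_in_chunk, sample_sizes):
    """Return the absolute file offset of every sample, in sample order."""
    offsets, sample = [], 0
    for chunk_offset, count in zip(chunk_offsets, samples_in_chunk):
        pos = chunk_offset  # the first sample starts where the chunk starts
        for _ in range(count):
            offsets.append(pos)
            # Samples inside a chunk are stored back to back.
            pos += sample_sizes[sample]
            sample += 1
    return offsets
```

With chunks at offsets 1000 and 5000 holding 2 and 1 samples, and sample sizes of 100, 200, and 50 bytes, the samples start at 1000, 1100, and 5000.
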
V. Free Space Box (free or skip)

The content of "free" (or "skip") is irrelevant and can be ignored. Deleting the box has no effect on playback, provided that any chunk offsets that shift as a result are updated.

VI. Media Data Box (mdat)

This box appears at the file level. There can be several of them, or none (when all media data is referenced from external files). It stores the media data. The data starts immediately after the box type field; interpreting it requires the metadata (mainly the sample table).

That completes the structure of an ordinary MP4 file. It can seem confusing at first, so a picture of a common box tree helps in understanding how the file is organized.

[Figure: common box tree structure of an MP4 file.]
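
A box tree like this can also be printed from a real file. The following Python sketch walks the boxes recursively, descending only into the container boxes named in this article. The function name `box_tree` and the `CONTAINERS` set are assumptions for this example, and the whole file is assumed to be loaded into memory as `bytes`.

```python
import struct
from typing import List, Optional

# Container boxes named in this article; everything else is a leaf here.
CONTAINERS = {"moov", "trak", "edts", "mdia", "minf", "dinf", "stbl"}


def box_tree(buf: bytes, start: int = 0, end: Optional[int] = None,
             depth: int = 0) -> List[str]:
    """Return indented "type (size)" lines for the boxes in buf[start:end]."""
    end = len(buf) if end is None else end
    lines, pos = [], start
    while pos + 8 <= end:
        size, raw_type = struct.unpack_from(">I4s", buf, pos)
        header = 8
        if size == 1:
            if pos + 16 > end:
                break  # truncated large-size header
            (size,) = struct.unpack_from(">Q", buf, pos + 8)
            header = 16
        elif size == 0:
            size = end - pos  # box runs to the end of the enclosing data
        if size < header or pos + size > end:
            break  # malformed or truncated box; stop rather than guess
        name = raw_type.decode("latin-1")
        lines.append("  " * depth + f"{name} ({size})")
        if name in CONTAINERS:
            lines.extend(box_tree(buf, pos + header, pos + size, depth + 1))
        pos += size
    return lines
```

Calling `print("\n".join(box_tree(data)))` on the bytes of a file lists the top-level boxes with their children indented beneath them.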
