gzip Compression Principle Analysis (04) -- Chapter III: gzip File Format in Detail (302): gzip File Header

Source: Internet
Author: User

The file header consists of a fixed-length part and an optional extension part. The extension part is often absent. In particular, when the gzip format is used for HTTP compression over the network, the compressed message usually carries no extension part. The gzip format signals whether the header has an extension part through flag bits in the fixed-length part, which we look at below.

I searched the internet for a long time without finding a satisfactory Chinese explanation of the fields below, so here I give a rough translation of the English descriptions in RFC 1952. Corrections and criticism are welcome.

1. The file starts with a 10-byte fixed-length section

+---+---+---+---+---+---+---+---+---+---+
|ID1|ID2|CM |FLG|     MTIME     |XFL|OS |
+---+---+---+---+---+---+---+---+---+---+

Each cell between two adjacent "+" signs represents one byte, so MTIME occupies four bytes and every other field occupies one byte.

ID1 and ID2 (identification): These two bytes identify the file as a gzip file, where ID1 = 31 (0x1f, \037) and ID2 = 139 (0x8b, \213). If a file starts with these two bytes, you can tentatively treat it as a gzip file, but it only really is one if the rest of the file also conforms fully to the gzip file format.

CM (compression method): This field identifies the compression method used for the data inside the gzip file. Values 0 to 7 are reserved. The only value currently defined is 8, which denotes the deflate compression method that gzip uses.

FLG (flags): The flag byte. Each bit in this byte is an individual flag, and most of them indicate whether a corresponding extension field follows the fixed-length part. The bits are assigned as follows:

Bit 0 FTEXT

Bit 1 FHCRC

Bit 2 FEXTRA

Bit 3 FNAME

Bit 4 FCOMMENT

Bits 5-7 reserved, must all be 0

Bit 0 FTEXT: if set, the file (that is, the original file being compressed) is probably ASCII text. Setting this bit is optional. The compressor may set it after checking a small amount of the input data for non-ASCII characters. In case of doubt, the bit is cleared, which indicates binary data. On systems that use different file formats for ASCII text and binary data, the decompressor can use this bit to choose the appropriate format. The RFC deliberately does not specify the algorithm used to set this bit, because a compressor always has the option of leaving it cleared and a decompressor always has the option of ignoring it and leaving data conversion issues to some other program.

Bit 1 FHCRC: if set, a CRC16 of the gzip header is present, placed immediately before the actual compressed data. The CRC16 consists of the two least significant bytes of the CRC32 computed over all bytes of the gzip header up to, but not including, the CRC16 itself. This bit is usually not set. (Note: this CRC32 is not the same value as the CRC32 stored at the end of the gzip file, which is discussed later. Both use the same 32-bit cyclic redundancy check algorithm, but over different input. Here the input is the gzip header, and only the two least significant bytes of the result are kept to form the CRC16. The CRC32 at the end of the gzip file is computed over all of the original, uncompressed data. Keep the two concepts distinct.)

Bit 2 FEXTRA: if set, the header carries an extra field. The extra field is described later.

Bit 3 FNAME: if set, the header carries the original name of the file that was compressed (the name itself is not compressed), terminated by a zero byte ('\0'), that is, a C-style string. The name must consist of ISO 8859-1 (LATIN-1) characters. On systems that use EBCDIC or another character set, the name must be converted to ISO LATIN-1 before it is stored in the gzip file. It is the original name of the file being compressed and carries no path information, only the bare file name. If the file was compressed on a file system with case-insensitive names, the name is forced to lower case. If the data did not come from a named file, no file name is stored, for example when gzip compresses an HTTP response message or when the data comes from standard input on a Unix system.

Bit 4 FCOMMENT: if set, the header carries a comment about the file, which is also a string terminated by a zero byte ('\0'). The comment is meant only for human readers and is not interpreted, much like the reason phrase that follows the status code in an HTTP response. The comment must also consist of ISO 8859-1 (LATIN-1) characters. Line breaks should be written as a single line feed character (decimal 10).
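The flag bits described above can be decoded with a few bit masks. The following is a minimal sketch in Python; the helper name `decode_flags` and the dictionary layout are assumptions made for illustration and are not part of RFC 1952.

```python
# Bit masks for the FLG byte, following the bit assignments listed above.
FTEXT, FHCRC, FEXTRA, FNAME, FCOMMENT = 0x01, 0x02, 0x04, 0x08, 0x10
RESERVED_MASK = 0xE0  # bits 5-7 are reserved and must be zero


def decode_flags(flg: int) -> dict:
    """Report which FLG bits are set (hypothetical helper)."""
    if flg & RESERVED_MASK:
        raise ValueError("reserved FLG bits are set")
    return {
        "FTEXT": bool(flg & FTEXT),
        "FHCRC": bool(flg & FHCRC),
        "FEXTRA": bool(flg & FEXTRA),
        "FNAME": bool(flg & FNAME),
        "FCOMMENT": bool(flg & FCOMMENT),
    }


# 0x08 has only bit 3 set, so only FNAME is reported as present.
print(decode_flags(0x08))
```

Rejecting a byte with reserved bits set is one possible policy; a more lenient reader might choose to ignore those bits instead.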

MTIME (modification time): This field gives the most recent modification time of the original file being compressed. The time is in Unix format, that is, the number of seconds since 00:00:00 GMT, January 1, 1970. Note that this can cause problems for MS-DOS and other systems that use local time rather than universal time. If the compressed data did not come from a file, this field is the time at which compression started. If the field is 0, no time stamp is available (this is common in HTTP messages compressed with gzip).
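As an illustration of the MTIME field, the sketch below converts its four bytes to a UTC timestamp. It relies on the RFC 1952 rule that multi-byte numbers are stored least significant byte first. The function name `mtime_to_utc` is an assumption for this example.

```python
import struct
from datetime import datetime, timezone


def mtime_to_utc(mtime_bytes: bytes):
    """Convert the 4-byte MTIME field to a UTC datetime (hypothetical helper).

    Returns None when the field is 0, meaning no time stamp is available.
    """
    (seconds,) = struct.unpack("<I", mtime_bytes)  # little-endian unsigned 32-bit
    if seconds == 0:
        return None
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


# 86400 seconds after the epoch is midnight UTC on January 2, 1970.
print(mtime_to_utc(struct.pack("<I", 86400)))
print(mtime_to_utc(b"\x00\x00\x00\x00"))
```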

XFL (extra flags): The meaning of this field depends on the compression method in use. Because gzip currently uses only one compression method, deflate, the field has the following meanings for deflate:

XFL = 2 - the compressor used maximum compression, which is the slowest (the highest compression level);

XFL = 4 - the compressor used the fastest compression (the lowest compression level);

(Note: deflate has compression levels 0 to 9. The compression levels will be analyzed in detail in the later sections on the compression source code.)

OS (operating system): This field identifies the type of file system on which compression took place. It can be useful for determining the end-of-line convention of a text file. The currently defined values are as follows:

0 - FAT filesystem (MS-DOS, OS/2, NT/Win32)

1 - Amiga

2 - VMS (or OpenVMS)

3 - Unix

4 - VM/CMS

5 - Atari TOS

6 - HPFS filesystem (OS/2, NT)

7 - Macintosh

8 - Z-System

9 - CP/M

10 - TOPS-20

11 - NTFS filesystem (NT)

12 - QDOS

13 - Acorn RISCOS

255 - unknown
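To make the layout of the fixed-length section concrete, here is a minimal parsing sketch in Python. The function name `parse_fixed_header`, the returned dictionary, and the choice to reject any CM other than 8 are assumptions for illustration. The sample bytes are hand-written rather than taken from a real file.

```python
import struct

# OS codes as listed above.
OS_NAMES = {
    0: "FAT filesystem", 1: "Amiga", 2: "VMS", 3: "Unix", 4: "VM/CMS",
    5: "Atari TOS", 6: "HPFS filesystem", 7: "Macintosh", 8: "Z-System",
    9: "CP/M", 10: "TOPS-20", 11: "NTFS filesystem", 12: "QDOS",
    13: "Acorn RISCOS", 255: "unknown",
}


def parse_fixed_header(data: bytes) -> dict:
    """Parse the 10-byte fixed-length gzip header (hypothetical helper)."""
    if len(data) < 10:
        raise ValueError("need at least 10 bytes")
    # "<" means little-endian with no padding: 4 bytes, one uint32, 2 bytes.
    id1, id2, cm, flg, mtime, xfl, os_code = struct.unpack("<BBBBIBB", data[:10])
    if (id1, id2) != (0x1F, 0x8B):
        raise ValueError("not a gzip file: bad ID1/ID2")
    if cm != 8:
        raise ValueError("unsupported compression method")
    return {
        "cm": cm,
        "flg": flg,
        "mtime": mtime,
        "xfl": xfl,
        "os": OS_NAMES.get(os_code, "undefined"),
    }


# Hand-written sample: deflate, no flags, no time stamp, XFL = 2, OS = 3.
sample = bytes([0x1F, 0x8B, 8, 0, 0, 0, 0, 0, 2, 3])
print(parse_fixed_header(sample))
```

Passing the ID1/ID2 check only shows that the data might be gzip. As noted earlier, the rest of the file still has to conform to the format.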

2. Header extension fields

The 10 bytes above are always present. Whether each of the extension fields described here is present is determined by the FLG bits in those 10 bytes. There are four extension fields,

in this order: FEXTRA + FNAME + FCOMMENT + FHCRC.

None of them is required, but those that are present must appear in this order, no matter how many there are. For example, if both FHCRC and FNAME are present, FNAME must come before FHCRC. Let's look at them one by one.

FEXTRA:

+---+---+=================================+
| XLEN  |...XLEN bytes of "extra field"...| (more-->)
+---+---+=================================+

XLEN is a two-byte field, stored least significant byte first, that gives the size in bytes of the extra field. The extra field is in turn subdivided into subfields with the following structure:

+---+---+---+---+==================================+
|SI1|SI2|  LEN  |... LEN bytes of subfield data ...|
+---+---+---+---+==================================+

(I translate "subfield" as "secondary domain".) SI1 and SI2 provide an ID for the subfield, typically two ASCII letters with some mnemonic value. Jean-loup Gailly <gzip@prep.ai.mit.edu>, the author of the gzip source, maintains a registry of subfield IDs; send him any subfield ID you wish to use. Subfield IDs with SI2 = 0 are reserved for future use. The following subfield ID is currently defined:

SI1         SI2         Data
----------  ----------  ----
0x41 ('A')  0x70 ('P')  Apollo file type information

LEN gives the length of the subfield data, not counting the four bytes taken by SI1, SI2, and LEN themselves.
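The XLEN and subfield layout can be walked as in the sketch below. The function name `parse_extra_field` and the subfield ID "Xy" used in the demonstration are hypothetical; only the byte layout comes from the description above.

```python
import struct


def parse_extra_field(data: bytes, offset: int):
    """Parse XLEN and the subfields that follow it (hypothetical helper).

    Returns (subfields, next_offset), where subfields is a list of
    (two-byte ID, subfield data) tuples.
    """
    (xlen,) = struct.unpack_from("<H", data, offset)  # little-endian uint16
    pos = offset + 2
    end = pos + xlen
    if end > len(data):
        raise ValueError("extra field runs past the end of the data")
    subfields = []
    while pos < end:
        if end - pos < 4:
            raise ValueError("truncated subfield header")
        si1, si2, length = struct.unpack_from("<BBH", data, pos)
        pos += 4  # SI1, SI2 and LEN are not counted in LEN
        if pos + length > end:
            raise ValueError("subfield data runs past XLEN")
        subfields.append((bytes([si1, si2]), data[pos:pos + length]))
        pos += length
    return subfields, end


# One made-up subfield with ID "Xy" and three bytes of data.
extra = b"Xy" + struct.pack("<H", 3) + b"abc"
blob = struct.pack("<H", len(extra)) + extra + b"rest"
print(parse_extra_field(blob, 0))
```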

FNAME:

+=========================================+
|...original file name, zero-terminated...| (more-->)
+=========================================+

It ends with '\0'; in other words, it is a C-style string.
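Reading such a zero-terminated ISO 8859-1 string, as used for FNAME and also for FCOMMENT, could look like the following sketch. The helper name `read_zstring` is an assumption for illustration.

```python
def read_zstring(data: bytes, offset: int):
    """Read a zero-terminated ISO 8859-1 string (hypothetical helper).

    Returns (text, next_offset). bytes.index raises ValueError when
    no terminating zero byte is found.
    """
    end = data.index(b"\x00", offset)
    return data[offset:end].decode("latin-1"), end + 1


# The returned offset points just past the zero byte.
name, next_offset = read_zstring(b"hello.txt\x00next field", 0)
print(name, next_offset)
```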

FCOMMENT:

+===================================+
|...file comment, zero-terminated...| (more-->)
+===================================+

It ends with '\0'; in other words, it is a C-style string.

FHCRC:

+---+---+
| CRC16 |
+---+---+

This is followed by the actual compressed data, which is the body of the gzip file.
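The ordering rule and the CRC16 definition can be combined into one small walk over the header. The sketch below is only an illustration under stated assumptions: the function name `find_data_offset` is hypothetical, and the test member is built by hand with Python's `zlib` module, using raw deflate plus the CRC32 and length trailer that RFC 1952 defines.

```python
import struct
import zlib

FHCRC, FEXTRA, FNAME, FCOMMENT = 0x02, 0x04, 0x08, 0x10


def find_data_offset(data: bytes) -> int:
    """Return the offset where the compressed data starts (hypothetical helper).

    Walks the optional fields in the required order. When FHCRC is set,
    the stored CRC16 is compared with the low 16 bits of the CRC32 of
    all header bytes before it.
    """
    flg = data[3]
    pos = 10  # skip the fixed-length part
    if flg & FEXTRA:
        (xlen,) = struct.unpack_from("<H", data, pos)
        pos += 2 + xlen
    if flg & FNAME:
        pos = data.index(b"\x00", pos) + 1
    if flg & FCOMMENT:
        pos = data.index(b"\x00", pos) + 1
    if flg & FHCRC:
        (stored,) = struct.unpack_from("<H", data, pos)
        if stored != (zlib.crc32(data[:pos]) & 0xFFFF):
            raise ValueError("header CRC16 mismatch")
        pos += 2
    return pos


# Build a small gzip member by hand with FNAME and FHCRC set, OS = 255.
payload = b"hello gzip"
header = bytes([0x1F, 0x8B, 8, FNAME | FHCRC, 0, 0, 0, 0, 0, 255]) + b"a.txt\x00"
crc16 = struct.pack("<H", zlib.crc32(header) & 0xFFFF)
deflater = zlib.compressobj(wbits=-15)  # negative wbits: raw deflate stream
body = deflater.compress(payload) + deflater.flush()
trailer = struct.pack("<II", zlib.crc32(payload), len(payload))
member = header + crc16 + body + trailer

offset = find_data_offset(member)
inflater = zlib.decompressobj(wbits=-15)
print(offset, inflater.decompress(member[offset:]))
```

Verifying the CRC16 is a design choice of this sketch. A reader may also simply skip the two bytes.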
