Methods for repairing corrupted GZ or tar.gz compressed files

Source: Internet
Author: User
Tags: gz file

To understand the principle behind repairing a corrupted gzip file, first refer to the gzip structure diagram:

[Figure: gzip structure diagram]

As the previous article showed, the key to repairing a corrupted gzip file is to find the start of the next intact compressed block. According to the structure diagram, each compressed block begins with the end flag (whether this is the last block), the type of Huffman coding used, and, for dynamic Huffman blocks, the element counts of the three Huffman trees. If a gzip file has a bad sector in the middle, a valid starting point after it can be found by advancing one bit at a time and attempting to decompress from each position; a position from which decompression proceeds normally is probably the correct start of a block. gzip's compression window is 32 KB, so the search should not need to cover more than about 64 KB. A tight loop in memory gets through that quickly, but it needs a clear criterion for rejecting wrong positions.
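The bit-by-bit search described above can be sketched roughly as follows. This is an illustration only, not code from gzip: `bit_at`, `block_test_fn`, `starts_dynamic_block` and `scan_for_block` are hypothetical names, and the acceptance test is a stand-in that inspects only the 3-bit block header, whereas a real repair would attempt to decompress from each candidate position.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return the single bit at absolute bit position pos. Deflate fills
   each byte starting from its least significant bit. */
static unsigned bit_at(const uint8_t *buf, size_t pos)
{
    return (buf[pos >> 3] >> (pos & 7)) & 1u;
}

/* Hypothetical acceptance test: nonzero if a block seems to start at pos. */
typedef int (*block_test_fn)(const uint8_t *buf, size_t len, size_t pos);

/* Stand-in test that looks only at the 3-bit block header:
   end flag 0 followed by block type 2 (dynamic Huffman). */
static int starts_dynamic_block(const uint8_t *buf, size_t len, size_t pos)
{
    if (pos + 3 > len * 8)
        return 0;
    unsigned end_flag = bit_at(buf, pos);
    unsigned type = bit_at(buf, pos + 1) | (bit_at(buf, pos + 2) << 1);
    return end_flag == 0 && type == 2;
}

/* Try each bit position in [start_bit, start_bit + max_bits) and return
   the first one the test accepts, or (size_t)-1 if none is accepted. */
static size_t scan_for_block(const uint8_t *buf, size_t len,
                             size_t start_bit, size_t max_bits,
                             block_test_fn test)
{
    assert(buf != NULL && test != NULL);
    size_t end = start_bit + max_bits;
    if (end > len * 8)
        end = len * 8;
    for (size_t pos = start_bit; pos < end; pos++) {
        if (test(buf, len, pos))
            return pos;
    }
    return (size_t)-1;
}
```

With a 64 KB search limit, max_bits would be 64 * 1024 * 8. A 3-bit test alone matches far too often, so it only becomes useful once stricter checks or a trial decompression are plugged in as the callback.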

First, the end flag should be 0, because we are searching the data that follows the damage point and most of those blocks are not the last one. The block type is almost always dynamic Huffman (0x02). The element count of CL1 should be between 257 and 286 inclusive, the element count of CL2 should be at most 30, and the element count of CCL, which is stored as a 4-bit value plus 4, is between 4 and 19 inclusive.
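As a rough sketch, these header checks could be written as one function that inspects the 17 header bits of a dynamic Huffman block at an arbitrary bit position. The function name `plausible_dynamic_header` is hypothetical; the field layout (1-bit end flag, 2-bit type, 5-bit and 5-bit tree counts, 4-bit CCL count) follows the deflate format, and the variable names nl, nd and nb mirror those used in gzip's inflate.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return nonzero if the 17 bits starting at bit position pos look like
   the header of a non-final dynamic Huffman block. */
static int plausible_dynamic_header(const uint8_t *buf, size_t len, size_t pos)
{
    assert(buf != NULL);
    if (pos + 17 > len * 8)
        return 0;

    /* Gather the header bits, least significant bit first. */
    uint32_t h = 0;
    for (unsigned i = 0; i < 17; i++) {
        size_t p = pos + i;
        h |= (uint32_t)((buf[p >> 3] >> (p & 7)) & 1u) << i;
    }

    unsigned end_flag = h & 1u;                /* should be 0 */
    unsigned type = (h >> 1) & 3u;             /* 2 = dynamic Huffman */
    unsigned nl = 257 + ((h >> 3) & 0x1fu);    /* CL1 element count */
    unsigned nd = 1 + ((h >> 8) & 0x1fu);      /* CL2 element count */
    unsigned nb = 4 + ((h >> 13) & 0xfu);      /* CCL element count, 4..19 */
    (void)nb;                                  /* every 4-bit value is legal */

    return end_flag == 0 && type == 2 && nl <= 286 && nd <= 30;
}
```

Because of how the counts are encoded, nl can never be below 257 and nd can never be below 1, so only the upper bounds actually reject anything. A function like this passes many random bit positions, which is why checking several consecutive blocks matters.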

Other checks are possible as well, such as whether decoding the Huffman trees fails, or searching by pattern for the end-of-block value 256, but these are more cumbersome. Applying the checks above across several consecutive compressed blocks is sufficient.

The specific method is to modify the gzip source code so that it performs this traversal. For lack of time I did not build a general-purpose tool and only made some quick changes to the code. The approximate modification points are:

First, locate the damage point:

In unzip.c, find the line

error("invalid compressed data--format violated");

Before this line, record the current decoded byte position.

Second, traverse from the damage point to find the next valid block:

1. In inflate.c, change

if (nl > 286 || nd > 30)
    return 1;

to

if (nl > 286 || nd > 30 || nl < 257 || nd < 1)
    return 1;

2. In inflate.c, in the function int inflate_block(e), before the following code

bb = b;
bk = k;

add:

if ((t != 2) || (*e != 0))
    return 2;

3. In inflate.c, at the end of the function int inflate_block(e), change the if (t == 0) and if (t == 1) branches so that they return the error value 2 directly.

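4. In inflate.c, in the function int inflate(), change

if ((r = inflate_block(&e)) != 0)
    return r;

to

unsigned t;           /* block type */
register ulg b;       /* bit buffer */
register unsigned k;  /* number of bits in bit buffer */

while (inptr <= insize) {
    /* save the decoder state before trying this bit position */
    unsigned int tptr = inptr;
    unsigned int tbk = bk;
    unsigned long tbb = bb;
    unsigned int twp = wp;
    long long tstart = *(long long *)(inbuf + tptr);

    if ((r = inflate_block(&e)) != 0) {
        /* this position failed: restore the saved state */
        inptr = tptr;
        bb = tbb;
        bk = tbk;
        wp = twp;     /* assumed: twp is saved above, so restore it here */
        b = bb;
        k = bk;
        /* skip one bit and retry from the next bit position */
        NEEDBITS(1)
        DUMPBITS(1)
        bb = b;       /* assumed: write back so the skip takes effect */
        bk = k;
    } else {
        printf("Get by!");  /* can also print tstart, bb and bk */
    }
}

If you reprint this, please retain the copyright information: Yu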

After these four changes, try running the modified gzip on the damaged .gz file. You can also add a seek in the code after the header structure has been parsed, to jump directly to the damaged location.

Typically, when the printf("Get by!") line produces output, the correct starting bit has been found.

Once the starting bit is found, construct or copy a normal gzip file header and append the bitstream from that position; the result can then be extracted. (If the bitstream is not byte-aligned, the whole stream has to be bit-shifted.) Many files spliced this way can be opened and even decompressed, though an error may still be reported, mainly a mismatch in the checksum and size stored at the end of the file, which can be ignored.
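A minimal sketch of the splicing step, under stated assumptions: the header is the smallest valid gzip header (magic bytes, deflate method, no flags, zero timestamp, OS byte 3 for Unix), and `realign_bits` and `splice_gzip` are hypothetical helper names. The sketch shifts the whole remaining stream so that the chosen start bit lands on a byte boundary, and it does not append a new checksum.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Minimal 10-byte gzip header: magic 1f 8b, method 8 (deflate),
   no flags, zero mtime, no extra flags, OS = 3 (Unix). */
static const uint8_t GZIP_HEADER[10] = {
    0x1f, 0x8b, 0x08, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x03
};

/* Copy the bitstream that begins at bit position start_bit of src into
   dst so that it begins on a byte boundary. Deflate fills each byte from
   its least significant bit, so each output byte takes its low bits from
   the top of one source byte and its high bits from the bottom of the
   next. Returns the number of bytes written (len - start_bit / 8). */
static size_t realign_bits(const uint8_t *src, size_t len,
                           size_t start_bit, uint8_t *dst)
{
    size_t first = start_bit >> 3;
    unsigned shift = (unsigned)(start_bit & 7);
    assert(first <= len);
    size_t n = len - first;
    for (size_t i = 0; i < n; i++) {
        unsigned lo = (unsigned)src[first + i] >> shift;
        unsigned hi = (first + i + 1 < len) ? src[first + i + 1] : 0u;
        dst[i] = (uint8_t)(lo | (hi << (8 - shift)));
    }
    return n;
}

/* Write a gzip header followed by the realigned bitstream into out,
   which must have room for 10 + len bytes. Returns the total size. */
static size_t splice_gzip(const uint8_t *src, size_t len,
                          size_t start_bit, uint8_t *out)
{
    memcpy(out, GZIP_HEADER, sizeof GZIP_HEADER);
    return sizeof GZIP_HEADER +
           realign_bits(src, len, start_bit, out + sizeof GZIP_HEADER);
}
```

When the shift is not a multiple of 8, the original checksum and size at the end of the file are shifted along with everything else, which is one reason the spliced file is expected to report a checksum error.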

If you do the splicing under Linux, do not decompress the result directly with "gzip -d": because of the CRC error, decompression fails at 99% and gzip then deletes the output file. Use redirection instead:

gunzip < damaged.gz > damaged
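The CRC error comes from the 8-byte trailer at the end of a gzip file, which stores the CRC-32 of the uncompressed data followed by the uncompressed size modulo 2^32, both little-endian. As a small sketch, assuming the file is held in memory and its trailer is still byte-aligned, the trailer could be read like this to compare the stored size against the number of bytes actually recovered. The names `gzip_trailer`, `read_le32` and `read_gzip_trailer` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct gzip_trailer {
    uint32_t crc32;  /* CRC-32 of the uncompressed data */
    uint32_t isize;  /* uncompressed size modulo 2^32 */
};

/* Decode a 32-bit little-endian value. */
static uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Read the trailer from the last 8 bytes of a gzip file in memory.
   Returns 0 on success, or -1 if the buffer is too short to hold a
   10-byte header plus the 8-byte trailer. */
static int read_gzip_trailer(const uint8_t *file, size_t len,
                             struct gzip_trailer *out)
{
    assert(file != NULL && out != NULL);
    if (len < 18)
        return -1;
    out->crc32 = read_le32(file + len - 8);
    out->isize = read_le32(file + len - 4);
    return 0;
}
```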
