On the ZIP and RAR file formats

Ma Jian
Email:[email protected]
Published: 2006.11.21
Last update: 2006.11.25

Contents
I. Table of contents (TOC) and split volumes
II. Solid compression
III. Security
IV. Openness
V. Conclusion

Disclaimer: This article is not an academic paper. It reflects only my personal views and experience, carries no authority, and is offered purely as a reference for interested readers. If you are not able to judge its claims critically, I recommend not reading it, so as not to be misled.

I. Table of contents (TOC) and split volumes

Compression algorithms aside, I think the biggest difference between ZIP and RAR lies in the table of contents (TOC): ZIP has a TOC, and RAR does not.

The term TOC is borrowed from publishing, where it refers to the contents page at the front of a book. Everyone knows what it is for: if you want to find something in a book quickly, you look it up in the TOC and then turn directly to the page number it gives.

In a paper book the TOC is a printed table; in an electronic file it is a table made up of structured data. Its purpose is the same, fast positioning: to find a piece of content in a file, you consult the TOC, learn where that content sits in the file, and jump straight there. The most common users are multimedia files such as AVI and RM: many people click around on the seek bar during playback (that is, "random access"), and without a TOC, seeking back and forth in a file of several hundred megabytes would be painfully slow.

In a ZIP file, the TOC is a table at the end of the file that lists the attributes (file name, length, and so on) of each file in the ZIP package and where it is stored in the package. To randomly access a file in the package, simply look it up in the TOC and jump to it.
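To make "a table at the end of the file" concrete, here is a minimal sketch that builds a small ZIP in memory and then finds its TOC by hand. It relies on the ZIP specification's end-of-central-directory record, a 22-byte structure (plus an optional comment) that points at the TOC. The helper names `make_zip` and `locate_toc` are made up for this illustration, and the backwards search is a simplification that ignores ZIP64 and signature bytes that happen to occur inside a comment.

```python
import io
import struct
import zipfile

EOCD_SIGNATURE = b"PK\x05\x06"
# Fixed part of the end record: signature, disk number, disk holding the TOC,
# entries on this disk, total entries, TOC size, TOC offset, comment length.
EOCD = struct.Struct("<4sHHHHIIH")

def make_zip(names):
    """Build a small in-memory ZIP whose members have placeholder contents."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name in names:
            zf.writestr(name, "contents of " + name)
    return buf.getvalue()

def locate_toc(data):
    """Search backwards for the end record; return (entries, toc_offset, toc_size)."""
    pos = data.rfind(EOCD_SIGNATURE)
    if pos < 0:
        raise ValueError("no end-of-central-directory record found")
    (_sig, _disk, _toc_disk, _entries_here, entries_total,
     toc_size, toc_offset, _comment_len) = EOCD.unpack_from(data, pos)
    return entries_total, toc_offset, toc_size

if __name__ == "__main__":
    archive = make_zip(["a.txt", "b.txt", "c.txt"])
    entries, offset, size = locate_toc(archive)
    print(f"{entries} entries, TOC at byte {offset} of {len(archive)}, {size} bytes long")
```

A reader therefore has to see the end of the archive before it can learn where any member lives, which is the property the split-volume discussion later in this section turns on.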

There is no TOC in a RAR file; all files are stored sequentially after the file header.

The result of this difference is that random access is faster with ZIP than with RAR, while sequential access is faster with RAR than with ZIP.
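The trade-off can be sketched with a toy container. The record layout below (a 2-byte name length, a 4-byte data length, then the name and the data) is invented for illustration and is not the real RAR layout. It only shows the general idea: without a TOC, finding a file means walking the headers from the start, whereas with a TOC it is a single lookup.

```python
import struct

# Hypothetical record header: 2-byte name length, 4-byte data length.
HEADER = struct.Struct("<HI")

def pack_sequential(files):
    """Store (name, data) pairs one after another, with no index at all."""
    out = bytearray()
    for name, data in files:
        raw_name = name.encode("utf-8")
        out += HEADER.pack(len(raw_name), len(data)) + raw_name + data
    return bytes(out)

def find_by_scanning(blob, wanted):
    """Walk the headers from the start; return (data, headers_visited)."""
    pos = 0
    visited = 0
    while pos < len(blob):
        name_len, data_len = HEADER.unpack_from(blob, pos)
        pos += HEADER.size
        name = blob[pos:pos + name_len].decode("utf-8")
        pos += name_len
        visited += 1
        if name == wanted:
            return blob[pos:pos + data_len], visited
        pos += data_len  # skip this file's data and move on to the next header
    raise KeyError(wanted)

def build_toc(blob):
    """One full pass that records name -> (data offset, data length)."""
    toc = {}
    pos = 0
    while pos < len(blob):
        name_len, data_len = HEADER.unpack_from(blob, pos)
        pos += HEADER.size
        name = blob[pos:pos + name_len].decode("utf-8")
        pos += name_len
        toc[name] = (pos, data_len)
        pos += data_len
    return toc

def find_with_toc(blob, toc, wanted):
    """With a table of contents, a lookup is a single jump."""
    offset, length = toc[wanted]
    return blob[offset:offset + length]

if __name__ == "__main__":
    files = [(f"page{i}.html", f"<p>page {i}</p>".encode()) for i in range(100)]
    blob = pack_sequential(files)
    data, visited = find_by_scanning(blob, "page99.html")
    print(visited, data)
    print(find_with_toc(blob, build_toc(blob), "page99.html"))
```

In this toy, reaching the last of 100 files by scanning touches all 100 headers, while the TOC lookup touches none of them.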

Random access, as mentioned earlier, means accessing an arbitrary specified file inside a compressed package. A simple example: an e-book that has been decompiled or downloaded from the web consists of a large number of HTML, image, CSS and JS files, which are then packed into one compressed package. Now suppose the pages must be browsable without unpacking the package. Every HTML page you open pulls in images, CSS, JS and other files that may be scattered anywhere in the package; without a TOC, each of those files has to be searched for from the very beginning, and you can imagine how slow that would be. This is why JAR packages are standard ZIP packages, and why I use only the ZIP format to store decompiled e-books, comics, PDG books and other things that may need random access.
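The e-book scenario can be sketched with Python's standard zipfile module, which reads the TOC when the archive is opened and then fetches individual members on request. The file names and contents below are placeholders invented for the example.

```python
import io
import zipfile

def build_ebook():
    """Pack a few placeholder e-book files into an in-memory ZIP."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("index.html", "<html><body>chapter list</body></html>")
        zf.writestr("css/book.css", "body { margin: 0 }")
        zf.writestr("js/nav.js", "function next() {}")
        zf.writestr("img/cover.txt", "pretend this is an image")
    return buf.getvalue()

def list_members(archive):
    """Read the TOC: (name, uncompressed size, offset of the member's header)."""
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        return [(i.filename, i.file_size, i.header_offset) for i in zf.infolist()]

def read_member(archive, name):
    """Random access: look the name up in the TOC and read only that member."""
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        return zf.read(name)

if __name__ == "__main__":
    ebook = build_ebook()
    for entry in list_members(ebook):
        print(entry)
    print(read_member(ebook, "css/book.css"))
```

Each `read_member` call goes straight to the requested file, regardless of where it sits in the package.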

Sequential access means processing the entire compressed package from beginning to end. Here RAR has a natural advantage. To save the time WinRAR spends listing files, I usually extract a single RAR directly from the right-click menu and rarely double-click the package to open it before extracting. When extracting multiple RAR files I use Batchunrar, of course.

I do not believe the exact reason for this difference has ever been documented, but my personal speculation is that it may be related to the backup software rivalry of the DOS era. In those days hard disks were nowhere near as lavish as they are now; even 20 MB counted as big. That much data could be backed up onto two boxes of floppy disks, and the cost of the backup was very low relative to the value of the data itself. So many companies and organizations had a regular hard disk backup policy, to avoid irreparable data loss from human or non-human causes (hard disks of that time were not very reliable). As for backup software, Microsoft did ship Backup/Restore tools with DOS, but they had essentially no data compression ability, so offering a backup function in compression software became a fashion of the DOS era. Because the backup media of the DOS era were floppy disks, the backup function of compression software effectively turned into a very common feature, split-volume compression: the archive is compressed into volumes sized to fit a floppy disk, the volume files are backed up to floppy disks, and when needed they are decompressed, or restored, to the hard disk.

The most famous ZIP tool of the DOS era was PKZIP, which appeared earlier than the DOS version of RAR. For split-volume compression, PKZIP followed the ZIP file specification and stored the TOC at the end of the archive, that is, in the last volume, which caused the following problems:

1. During recovery, before each disk could be decompressed, the last disk had to be inserted first so that the TOC could be read.
2. If the TOC on the last disk was damaged, nothing could be decompressed normally, even if all the other disks were fine.

These two shortcomings, especially the first, were so notorious that there were strong calls for change. At this critical moment the DOS version of RAR appeared. Not only was its compression ratio higher than PKZIP's (very important in the DOS era; after all, floppy disks were expensive and small), but it also responded to the criticism of the ZIP format by dropping the TOC, so that:

1. When restoring a split-volume backup, there is no need to keep inserting the volume that holds the TOC; the disks are simply swapped in order.
2. Even if one volume is damaged, it can be skipped and decompression can resume from the next good volume.

For these reasons (and others, of course), RAR succeeded quickly after its launch. PKZIP began losing users while still in the DOS era, and by the Windows era it had basically faded into obscurity. WinZip, introduced in the Windows era, gave up split-volume compression entirely (the ZIP format's everlasting pain?). And judging from the UnRAR source code published for WinRAR, WinRAR's approach to decompression is clearly still to process the files in order from beginning to end. The influence of that backup/restore tool rivalry seems to have been far-reaching indeed.

II. Solid compression

As far as compression algorithms go, I think the most distinctive feature of the RAR format is solid compression. The WinRAR v3.42 help file describes solid archives as follows:

A solid archive is a RAR archive stored using a special compression method that treats all the files in the archive as one continuous data stream.

This description actually reveals why solid compression can improve the compression ratio. The basis of data compression is "repetition". Take the string aaaabbb: it contains repetition, and if it is written as a4b3 it gets shorter. That is "data compression". "Repetition" is a relative concept: data that seems to contain little or no repetition within one range may reveal more repetition when the range is widened, and that is the secret of solid compression.
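The aaaabbb to a4b3 idea is run-length encoding, and a minimal sketch of it fits in a few lines. The function names are made up for this example, and the decoder assumes the original text contains no digits.

```python
import re
from itertools import groupby

def rle_encode(text):
    """Replace each run of a repeated character with the character and its count."""
    return "".join(f"{ch}{len(list(run))}" for ch, run in groupby(text))

def rle_decode(encoded):
    """Invert rle_encode (assumes the original text contained no digits)."""
    return "".join(ch * int(count) for ch, count in re.findall(r"(\D)(\d+)", encoded))

if __name__ == "__main__":
    print(rle_encode("aaaabbb"))  # a4b3
    print(rle_encode("abc"))      # a1b1c1
```

The second call shows the other side of the coin: with no repetition to exploit, the "compressed" form is longer than the input.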

A simple example: compress a batch of JPG files with ZIP or ordinary RAR and they hardly shrink, but with RAR's solid compression they can. The reason is that JPG is itself a compressed format, so it is hard to find repeated data within a single JPG file. ZIP and ordinary RAR process the files one at a time, separately, so both struggle to compress them. Solid RAR, however, compresses all the JPG files as a whole, and there is repeated data between the files, such as identical file headers (including various data tables), so there is room for compression. From what I have seen, Flash files use a similar technique for JPG: if a Flash file contains multiple JPG images, they can share a single file header.
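The effect can be imitated with Python's zlib module. This is only a sketch under stated assumptions: zlib's DEFLATE stands in for RAR's compressor, and the "JPG files" are synthetic stand-ins made of an identical pseudo-random header followed by a unique pseudo-random body, so that a single file is incompressible on its own while the set of files shares data.

```python
import hashlib
import zlib

def noise(seed, n):
    """Deterministic bytes that look random (and so do not compress) via SHA-256."""
    out = bytearray()
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(f"{seed}:{counter}".encode()).digest()
        counter += 1
    return bytes(out[:n])

def make_files(count=20):
    """Synthetic stand-ins for JPGs: one shared 600-byte header, unique 200-byte bodies."""
    shared_header = noise("header", 600)
    return [shared_header + noise(f"body{i}", 200) for i in range(count)]

def size_separate(files):
    """Compress every file on its own, as ZIP or non-solid RAR would."""
    return sum(len(zlib.compress(f, 9)) for f in files)

def size_solid(files):
    """Compress all files as one continuous stream, as a solid archive would."""
    return len(zlib.compress(b"".join(files), 9))

if __name__ == "__main__":
    files = make_files()
    print("original:", sum(len(f) for f in files))
    print("separate:", size_separate(files))
    print("solid:   ", size_solid(files))
```

Compressed separately, each file stays about as large as it was; compressed as one stream, the repeated header is stored once and referenced afterwards.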

Of course, there is no free lunch. Solid compression improves the compression ratio, but it also brings some limitations, which the WinRAR v3.42 help file describes as follows:

Solid archiving can increase compression performance, especially when adding a large number of small files, but it also has some important disadvantages:

    • updating an existing solid archive is slow;
    • to extract a single file from a solid archive, all the files before it must be analyzed first, which makes extracting individual files from a solid archive slower than from a normal archive. However, when all the files are extracted from a solid archive, decompression speed is not affected;
    • if any file in a solid archive is damaged, none of the files after the damaged point can be extracted. Therefore, if a solid archive is stored on media such as floppy disks, it is recommended to add a "recovery record" when creating it.

Solid compression is suitable when:

    • the archive is seldom updated;
    • there is no need to frequently extract one file or a few files from the archive;
    • compression ratio is more important than compression speed.

Tying back to the "random access" discussed earlier, a solid RAR file may be the least suitable format in the world for random access: to access one file in a solid RAR package, decompression has to start at the beginning of the archive and continue until that file has been decompressed.
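A rough sketch of that cost, again using zlib as a stand-in for RAR's compressor: all files are compressed as one stream, and extracting a member means decoding from the first byte until the member's last byte has been produced. The `index` of offsets is a helper invented for this example so the code knows where each file ends; it does not remove the need to decode everything in front of the file.

```python
import hashlib
import zlib

def noise(seed, n):
    """Deterministic bytes that look random, built from SHA-256."""
    out = bytearray()
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(f"{seed}:{counter}".encode()).digest()
        counter += 1
    return bytes(out[:n])

def pack_solid(files):
    """Compress all files as one stream; return (stream, name -> (start, length))."""
    index = {}
    pos = 0
    for name, data in files:
        index[name] = (pos, len(data))
        pos += len(data)
    stream = zlib.compress(b"".join(data for _, data in files))
    return stream, index

def extract_from_solid(stream, index, wanted, chunk=64):
    """Decode from the very start until `wanted` is complete.

    Returns (file data, total bytes that had to be decoded)."""
    start, length = index[wanted]
    end = start + length
    decoder = zlib.decompressobj()
    produced = bytearray()
    for i in range(0, len(stream), chunk):
        produced += decoder.decompress(stream[i:i + chunk])
        if len(produced) >= end:
            break
    else:
        produced += decoder.flush()
    return bytes(produced[start:end]), len(produced)

if __name__ == "__main__":
    files = [(f"file{i}.bin", noise(f"file{i}", 1000)) for i in range(5)]
    stream, index = pack_solid(files)
    for name in ("file0.bin", "file4.bin"):
        data, decoded = extract_from_solid(stream, index, name)
        print(name, "needed", decoded, "decoded bytes for", len(data), "bytes of data")
```

The later a file sits in the stream, the more unrelated data has to be decoded before it can be delivered.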

In RAR, solid compression has to be selected manually, so relatively few people use it. 7z, in pursuit of compression ratio, uses solid compression by default, which is why software that needs random access, such as CV and UV, has never wanted to support the 7z format.

III. Security

Security here has several meanings: file system security, password protection security, and file data security.

When the ZIP format specification was drawn up, the file security of the operating system itself had not yet received enough attention, so the ZIP format records only the most basic file attributes, such as the read-only attribute, and no additional security attributes.

When the RAR format was first introduced, file system security could only follow DOS as well, much like ZIP. But RAR is, after all, a closed format, and changing it takes nothing more than its author's say-so. So when Windows gained NTFS and its extended file system security attributes, RAR actively followed suit, and in this respect the RAR format should be considered stronger than ZIP.

Both the ZIP and RAR formats provide password protection, but the strength of that protection differs.

Because the ZIP format is open and its code is open source, ZIP password cracking software appeared earlier and in greater numbers. At first it was brute-force based, and the threat was not serious. What really undermined ZIP password security was the known-plaintext attack: if you know the real content (the plaintext) that a piece of the content of an encrypted ZIP file (the ciphertext) decrypts to, the ZIP encryption password can be worked out in reverse. Under the threat of this attack, and with the laws of some countries restricting cryptographic technology, the well-known open source project zlib announced that it was permanently abandoning support for encrypted ZIP; see the notes on the zlib website (although if you look closely at the source code zlib releases, you can still find the original encryption/decryption code).
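Why a known piece of plaintext is so valuable can be shown with a toy stream cipher. This is deliberately not ZIP's actual cipher: `toy_keystream` is a hypothetical stand-in that derives a byte stream from the password, and the sketch shows only the first step of such an attack, recovering keystream bytes by XORing ciphertext with the plaintext the attacker already knows. The published attacks on traditional ZIP encryption go further and work back from such bytes to the cipher's internal keys, which this example does not attempt.

```python
import random

def toy_keystream(password, n):
    """Hypothetical password-derived keystream (NOT the real ZIP algorithm)."""
    rng = random.Random(password)
    return bytes(rng.randrange(256) for _ in range(n))

def xor_bytes(a, b):
    """XOR two byte strings position by position (up to the shorter length)."""
    return bytes(x ^ y for x, y in zip(a, b))

def encrypt(password, plaintext):
    """Stream cipher: ciphertext = plaintext XOR keystream."""
    return xor_bytes(plaintext, toy_keystream(password, len(plaintext)))

def recover_keystream(ciphertext, known_plaintext):
    """Attacker's step: known plaintext XOR ciphertext reveals the keystream there."""
    return xor_bytes(ciphertext[:len(known_plaintext)], known_plaintext)

if __name__ == "__main__":
    secret = b"%PDF-1.4 confidential report"
    ciphertext = encrypt("hunter2", secret)
    # The attacker does not know the password, only that PDF files start with "%PDF-1.4".
    leaked = recover_keystream(ciphertext, b"%PDF-1.4")
    print(leaked == toy_keystream("hunter2", 8))
```

File formats with predictable headers hand the attacker exactly this kind of known plaintext, which is why hiding even the file names matters.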

As I remember it, when RAR was first launched it was like ZIP: the contents of files in an encrypted archive could not be read, but the names of the files in it could still be listed. Later, probably also out of fear of the known-plaintext attack, an "Encrypt file names" option was added, so that even the files inside an encrypted RAR archive are invisible, and an attacker who wants to guess the plaintext has nothing to go on.

The RAR format was launched later than ZIP and learned its lessons about security, so it uses AES, the symmetric encryption algorithm recommended by the US National Institute of Standards and Technology (NIST) and currently recognized as highly secure, with a key length of 128 bits. Until AES is broken (NIST believed it could not be broken within 30 years), everyone can only fall back on brute force, so its password security should be higher than ZIP's. The WinRAR 3.42 help file puts it this way:

The ZIP format uses a proprietary encryption algorithm. RAR archives are encrypted with the much stronger AES-128 standard. If you need to encrypt important information, it is better to choose the RAR archive format. For real security, the password should be at least 8 characters long. Do not use words from any language as passwords; a random combination of characters and digits is best, and pay attention to the case of the password. Bear in mind that if you lose your password, you will not be able to retrieve the encrypted files; not even the author of WinRAR can extract encrypted files.
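The advice about password length and character mix comes down to simple counting: a brute-force attacker has to try up to alphabet_size ** length candidates. A small sketch of that arithmetic follows; the guessing rate passed to `worst_case_years` is a made-up parameter, not a measurement of any real cracking tool.

```python
import string

LOWERCASE = len(string.ascii_lowercase)                   # 26 symbols
ALPHANUMERIC = len(string.ascii_letters + string.digits)  # 62 symbols

def keyspace(alphabet_size, length):
    """Number of possible passwords of exactly this length."""
    return alphabet_size ** length

def worst_case_years(alphabet_size, length, guesses_per_second):
    """Time to try every candidate at an assumed (hypothetical) guessing rate."""
    seconds = keyspace(alphabet_size, length) / guesses_per_second
    return seconds / (365 * 24 * 3600)

if __name__ == "__main__":
    for size, label in ((LOWERCASE, "lowercase"), (ALPHANUMERIC, "letters+digits")):
        for length in (4, 6, 8):
            print(label, length, keyspace(size, length),
                  worst_case_years(size, length, guesses_per_second=1_000_000))
```

Each extra character multiplies the work by the alphabet size, which is why both length and a mixed-case, mixed-digit alphabet matter, and why dictionary words, which shrink the search to a word list, are a poor choice.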

In terms of data security, the RAR format natively supports a special type of additional information called the "recovery record". If a RAR file has a recovery record, WinRAR can try to repair the data using it when the media is physically damaged or data is lost for some other reason. The ZIP format has no recovery record, so it should be considered weaker than RAR in terms of data security.
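The general idea behind storing extra data for repair can be sketched with simple XOR parity. This is not RAR's actual recovery record scheme, whose internals are not covered here; it is a toy showing how one extra block lets a single lost block be rebuilt, and why such protection necessarily makes the file larger. All names are invented for the example.

```python
from functools import reduce

def split_blocks(data, block_size):
    """Cut data into equal blocks, zero-padding the last one."""
    padded = data + b"\x00" * (-len(data) % block_size)
    return [padded[i:i + block_size] for i in range(0, len(padded), block_size)]

def xor_blocks(blocks):
    """XOR equally sized blocks together, byte column by byte column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def make_parity(blocks):
    """The toy 'recovery record': one extra block, the XOR of all data blocks."""
    return xor_blocks(blocks)

def repair(blocks, parity, lost_index):
    """Rebuild one lost block from the surviving blocks plus the parity block."""
    survivors = [b for i, b in enumerate(blocks) if i != lost_index]
    return xor_blocks(survivors + [parity])

if __name__ == "__main__":
    data = b"archive payload that must survive a bad sector"
    blocks = split_blocks(data, 8)
    parity = make_parity(blocks)
    print(repair(blocks, parity, lost_index=2) == blocks[2])
```

With a single parity block only one lost block can be repaired; tolerating more damage requires more redundancy, and therefore a larger file.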

Although the RAR format itself supports recovery records, the option is turned off by default in WinRAR, and turning it on increases the size of the resulting RAR file (by a percentage that depends on the setting). That may put some people off (I have seen someone on a forum complaining about why a RAR file was so large), so in practice this feature is basically just for show.

IV. Openness

The contrast in openness is obvious. The ZIP format not only has a fully open file format, but also has dedicated open source projects providing source code for working with it, so cross-platform use faces few restrictions. The RAR format is completely closed: the author provides only the source code needed for decompression, not the source code needed for compression, so cross-platform use is somewhat troublesome.

Among the open source ZIP projects, the best known are zlib and Info-ZIP, each with its own focus. zlib emphasizes compression of in-memory buffers, so open source projects such as PNG have adopted it as their internal compression algorithm; even the core of Java's JAR tool comes from zlib, and the JAR packages it produces are standard ZIP files. Info-ZIP emphasizes operations on files (including password protection). It does not seem to be as widely used as zlib, but I personally think it is still perfectly usable, provided you make some necessary changes to its source code.
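The "in-memory buffer" focus of zlib is easy to see through Python's binding to it, where compression is a plain bytes-in, bytes-out call with no files involved. The sketch below also builds a JAR-like package with the zipfile module and checks that it is an ordinary ZIP; the entry names are placeholders, and a real JAR would normally be produced by Java's own tooling.

```python
import io
import zipfile
import zlib

def compress_buffer(data):
    """Compress a bytes buffer entirely in memory."""
    return zlib.compress(data)

def decompress_buffer(packed):
    """Restore the original buffer."""
    return zlib.decompress(packed)

def make_jar_like(entries):
    """Write name -> text entries into an in-memory ZIP, as a JAR package is."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, text in entries.items():
            zf.writestr(name, text)
    return buf.getvalue()

if __name__ == "__main__":
    original = b"the same phrase again and again. " * 50
    packed = compress_buffer(original)
    print(len(original), "->", len(packed))
    print(decompress_buffer(packed) == original)

    jar = make_jar_like({"META-INF/MANIFEST.MF": "Manifest-Version: 1.0\n",
                         "Hello.txt": "placeholder for a class file"})
    print(zipfile.is_zipfile(io.BytesIO(jar)))
```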

The PNG project's web page tells the origin of the PNG format, which I also find very interesting. The people behind PNG originally worked with the GIF format, but when Unisys began to charge royalties for the LZW compression algorithm at the core of GIF, they got angry and simply proposed the PNG format. The overall structure is still chunk-based, but the core compression algorithm is the open source zlib, which in most cases compresses better than GIF's LZW. With no licensing restrictions, PNG is widely used for static images; had it offered animation support and the like for the web in time, I guess GIF would have died long ago.

The RAR extraction source code is available on the official website www.rarlab.com. It is usually released a little later than the corresponding official version of WinRAR, but it is said to come directly from the WinRAR source code, so compatibility should not be a problem.

V. Conclusion

The following views are purely personal, offered for reference only, and carry no guiding significance:

      • If you often need random access to a compressed package, choose ZIP rather than RAR. Recompressing a downloaded RAR as a ZIP is a hassle, but it will save trouble later.
      • If you need split-volume compression (for example, because some websites limit the size of uploaded files), RAR is the only option. In fact, this is the only situation in which I use the RAR format; at all other times it is ZIP, no discussion.
