Microsoft Office composite document

Source: Internet
Author: User
"Laura": the mysterious Microsoft Office file format

A composite document can contain many directories, each of which can contain subdirectories. The directories (which Microsoft calls "storages") and subdirectories contain "streams". One stream is equivalent to one file on a disk, so a composite document forms a tree structure similar to the directories and files on a disk. In a Windows environment, you can use features provided by the operating system to read and write composite files much like ordinary files and directories: you can create a directory inside a composite file, or open a specified directory and read or write a stream (file) in it. DOS and other environments, however, provide no functions for reading and writing composite files. To work with composite files there, for example to develop software that reads and writes Word files on Linux or that scans for and removes macro viruses under DOS, you must have a clear understanding of the binary structure of the Microsoft "composite document".
When Word macro viruses first emerged, domestic anti-virus vendors were, without exception, helpless. Unlike large foreign companies, which cooperate closely with Microsoft and can obtain some of its internal documents, Chinese vendors knew nothing about the internal structure of Word files. To deal with the viruses, the vendors came up with two methods:
The first was to write the detection and removal program in Word Basic, the very language in which macro viruses themselves are written. Developers wrote a small piece of automatically loaded Word Basic code (in a sense, this is also a virus) which, before any Word file was opened, checked whether the file contained macros named "autoopen" or "autosave". If such macros existed, the document was not opened.
The second method was cruder. Having tried and failed to work out the format of Word files, developers fell back on plain search and replace: search the entire Word file for a string such as "autoopen" and, if it is found, overwrite it with spaces, so that Word no longer runs the macro automatically when it opens the file.
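The second method can be sketched in a few lines of Python. The function name `naive_clean` and the sample sentence are made up for illustration; this is not any vendor's actual code. The point to notice is that the string is blanked wherever it occurs, including in ordinary document text.

```python
def naive_clean(data: bytes, needle: bytes = b"autoopen") -> bytes:
    """Overwrite every occurrence of `needle` with spaces, wherever it is."""
    return data.replace(needle, b" " * len(needle))


# Ordinary text that merely mentions the macro name is damaged as well.
sample = b'macro viruses often contain macros named "autoopen"'
cleaned = naive_clean(sample)
print(cleaned.decode("ascii"))
```

The file keeps its length, but the word has been erased from a document that contained no macro at all.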
Both methods have major problems. The first runs only inside Word, so there is no way to scan for viruses without starting Word. The second is worse: scanning for and removing viruses without understanding the file structure is irresponsible. It produces large numbers of false positives and false negatives, because a normal Word file can be mistaken for an infected one. Suppose you write an article about macro viruses that contains the sentence "macro viruses often contain macros named autoopen": after the file has been cleaned in this way, the word "autoopen" is gone. Can this be called virus killing? The damage done to the data in the Word file after such cleaning can be even more serious. As a result, for a long time afterwards some manufacturers repeatedly advertised what is really a basic requirement of anti-virus software as a major technical breakthrough of their products.
"Laura" file format: every file in the "Laura" format is made up of 512-byte data blocks (you may have noticed that the size of every Word, Excel, or other Office file is a multiple of 512). Data block numbers start from -1:
Composite document:

Data block -1 | Data block 0 | Data block 1 | Data block 2 | Data block ...
512 bytes     | 512 bytes    | 512 bytes    | 512 bytes    | ...
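Because the numbering starts at -1, the position of a block in the file follows from simple arithmetic. The sketch below shows it in Python; the names `block_offset` and `read_block` are hypothetical helpers, not part of any library.

```python
BLOCK_SIZE = 512  # every data block is 512 bytes long


def block_offset(block_number: int) -> int:
    """Return the file offset of a data block; block -1 sits at offset 0."""
    if block_number < -1:
        raise ValueError("block numbers start at -1")
    return (block_number + 1) * BLOCK_SIZE


def read_block(data: bytes, block_number: int) -> bytes:
    """Slice one 512-byte block out of the raw file contents."""
    start = block_offset(block_number)
    return data[start:start + BLOCK_SIZE]
```

With this numbering, block 0 begins at offset 0x200 and block 1 at offset 0x400, which are the addresses shown in the hexadecimal dumps in this article.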


Block -1 is the header block of the file and stores overall information about the composite file. Its structure is as follows:

Offset (hex)  Size (bytes)  Content
0             8             Composite file ID (D0 CF 11 E0 A1 B1 1A E1)
2C            4             Size of the large-block map (number of blocks)
30            4             Start block number of the directory chain
3C            4             Start block number of the small-block map
4C            variable      List of the blocks used by the large-block map
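The header fields can be read with Python's `struct` module, as in the sketch below. It assumes that every number is a little-endian signed 32-bit integer and that unused slots in the list at offset 4C hold a negative value; neither assumption is stated in the table. The function name and the dictionary keys are invented for this example.

```python
import struct

SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")  # composite file ID at offset 0
BLOCK_SIZE = 512


def parse_header(block: bytes) -> dict:
    """Decode the header fields at offsets 0, 2C, 30, 3C and 4C."""
    if len(block) < BLOCK_SIZE:
        raise ValueError("the header block must be 512 bytes long")
    if block[:8] != SIGNATURE:
        raise ValueError("composite file ID not found")
    # Assumption: numbers are little-endian signed 32-bit integers.
    table_blocks, directory_start = struct.unpack_from("<ii", block, 0x2C)
    (small_table_start,) = struct.unpack_from("<i", block, 0x3C)
    # The list at 4C runs to the end of the header: (512 - 0x4C) / 4 = 109 slots.
    slot_count = (BLOCK_SIZE - 0x4C) // 4
    slots = struct.unpack_from("<%di" % slot_count, block, 0x4C)
    # Assumption: unused slots hold a negative value.
    table_block_list = [n for n in slots if n >= 0][:max(table_blocks, 0)]
    return {
        "table_blocks": table_blocks,            # field at 2C
        "directory_start": directory_start,      # field at 30
        "small_table_start": small_table_start,  # field at 3C
        "table_block_list": table_block_list,    # list at 4C
    }
```

A file whose list would need more than 109 slots is outside the scope of this sketch.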

On top of the 512-byte data blocks, a composite file contains two basic structures:
The first is the large block chain, a chain of 512-byte blocks. If you are familiar with file systems based on a file allocation table (FAT), the concept is easy to understand: once you know the number of the first block of a chain, you can find the rest of the chain through the large-block map. A typical large-block map looks like this:
00200: fd ff ff ff 05 00 00 00 fe ff ff ff 04 00 00 00
00210: 06 00 00 00 fe ff ff ff 07 00 00 00 08 00 00 00
00220: 09 00 00 00 0a 00 00 00 0b 00 00 00 fe ff ff ff
00230: ff ff ff ff ...
Each entry is a 4-byte little-endian number, and entry n holds the number of the block that follows block n. If a chain starts at block 3 (whose entry holds 4), the chain consists of block 3, block 4 (whose entry holds 6), block 6 (whose entry holds 7), blocks 7, 8, 9 and 0a (each entry pointing to the next block), and block 0b, whose entry FE FF FF FF (-2) marks it as the last block of the chain.
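Following a chain through the table is a short loop. The sketch below assumes that entries are little-endian signed 32-bit integers and treats any negative entry as the end of a chain, so it does not depend on which negative value serves as the end marker. The names `load_table` and `follow_chain` and the sample table are made up for illustration.

```python
import struct


def load_table(block: bytes) -> list:
    """Decode a table block into a list of signed 32-bit entries."""
    count = len(block) // 4
    return list(struct.unpack_from("<%di" % count, block, 0))


def follow_chain(table: list, start: int) -> list:
    """Return the block numbers of the chain that begins at `start`."""
    chain = []
    seen = set()
    block = start
    while block >= 0:  # assumption: a negative entry ends the chain
        if block in seen or block >= len(table):
            raise ValueError("damaged chain at block %d" % block)
        seen.add(block)
        chain.append(block)
        block = table[block]
    return chain


# A small hand-made table: entry n holds the block that follows block n.
table = [-3, 5, -2, 4, 6, -2, 7, 8, 9, 10, 11, -2]
print(follow_chain(table, 3))
```

The `seen` set guards against a damaged table whose entries loop back on themselves, which would otherwise keep the loop running forever.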
For relatively small structures, a 512-byte unit wastes a lot of space. Small blocks are therefore stored inside a large block chain, and any data structure smaller than 4096 bytes is stored as a small block chain. The composition and addressing of a small block chain are very similar to those of a large block chain. The only difference is that small blocks are addressed not within the whole composite file but within one specific large block chain. The starting block number of this large block chain is stored in the root directory item described below.
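Addressing a small block therefore takes two steps: find its position inside the containing large block chain, then translate that position into a file offset. The sketch below assumes a small block size of 64 bytes, which is not stated in this article, and takes the containing chain as an already resolved list of large block numbers. All names are hypothetical.

```python
BLOCK_SIZE = 512
SMALL_BLOCK_SIZE = 64   # assumption: not given in the article
SMALL_DATA_LIMIT = 4096  # data smaller than this uses small blocks


def uses_small_blocks(size: int) -> bool:
    """Data structures smaller than 4096 bytes live in small blocks."""
    return size < SMALL_DATA_LIMIT


def small_block_offset(container_chain: list, small_block: int) -> int:
    """Map a small block number to a file offset.

    `container_chain` lists, in order, the large blocks that hold all
    small blocks.
    """
    if small_block < 0:
        raise ValueError("small block numbers start at 0")
    position = small_block * SMALL_BLOCK_SIZE
    index, within = divmod(position, BLOCK_SIZE)
    if index >= len(container_chain):
        raise ValueError("small block lies outside the containing chain")
    # Large block n starts at (n + 1) * 512 because the header is block -1.
    return (container_chain[index] + 1) * BLOCK_SIZE + within
```

Under the 64-byte assumption, each large block of the containing chain holds eight small blocks.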
The directory chain is the most basic data chain in a composite file: it describes the file's directory structure. Its starting block is recorded in the header block. Each directory item is 128 bytes long, so one block of the directory chain holds four directory items. The first item is the root directory item, named "Root Entry"; it is the first directory item in every composite file. A typical root directory item looks like this:
00400: 52 00 6f 00 6f 00 74 00 20 00 45 00 6e 00 74 00  R.o.o.t. .E.n.t.
00410: 72 00 79 00 00 00 00 00 00 00 00 00 00 00 00 00  r.y.............
00420: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00430: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00440: 16 00 05 00 ff ff ff ff ff ff ff ff 03 00 00 00
00450: 00 09 02 00 00 00 00 00 c0 00 00 00 00 00 00 46
00460: 00 00 00 00 00 00 00 00 00 00 00 00 86 29 f6 1f
00470: ad 57 bb 01 03 00 00 00 00 0f 00 00 00 00 00 00
The directory item structure is described as follows:

Offset (hex)  Size (bytes)  Content
0             64 (hex 40)   Name of the directory item (so a name in a composite file cannot exceed hex 40 bytes)
40            2             Name length
42            2             Directory item type: 1 is a directory, 2 is a file, 5 is the root (as in the dump above)
44            4             Previous directory item
48            4             Next directory item
4C            4             If the item is a directory, points to a subdirectory item
74            4             Start block of the item's content
78            4             Size of the item's content
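A 128-byte directory item can be decoded field by field from the offsets in the table. The sketch below makes several assumptions that the article does not state: the name is UTF-16 little-endian text, the name length counts bytes including a two-byte terminator, only the first byte of the two-byte type field is taken as the type, and the link fields are signed so that FF FF FF FF reads as -1. The function name and dictionary keys are invented for this example, and the type is returned as a raw number.

```python
import struct

ITEM_SIZE = 128


def parse_directory_item(item: bytes) -> dict:
    """Decode one directory item using the offsets from the table."""
    if len(item) != ITEM_SIZE:
        raise ValueError("a directory item is exactly 128 bytes long")
    (name_length,) = struct.unpack_from("<H", item, 0x40)
    if name_length > 0x40:
        raise ValueError("name length exceeds the name field")
    # Assumption: UTF-16LE text, length includes a 2-byte terminator.
    name = item[:max(name_length - 2, 0)].decode("utf-16-le")
    # Assumption: the type is the first byte of the field at 42.
    item_type = item[0x42]
    previous_item, next_item, child_item = struct.unpack_from("<iii", item, 0x44)
    start_block, size = struct.unpack_from("<iI", item, 0x74)
    return {
        "name": name,
        "type": item_type,
        "prev": previous_item,
        "next": next_item,
        "child": child_item,
        "start_block": start_block,
        "size": size,
    }
```

For the root item dumped earlier, the bytes at offsets 74 and 78 decode to start block 3 and a size of hex 0F00 (3840) bytes.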

Because the data structures above do not come from official Microsoft documentation, they contain a good deal of speculation. The meaning of much of the content cannot be determined for the moment, and some of the descriptions may not match Microsoft's original intent. However, we have used this structure to analyze a large number of Microsoft documents, and so far no obvious errors have been found.
On top of the basic "Laura" file structure, word-processing documents and spreadsheet documents have different internal directory structures. A typical internal directory structure of a Word file is shown below:
1.doc
-- 1Table: some data tables
-- CompObj: common object information
-- ObjectPool: the object pool, a directory that holds the images, sounds, or other objects embedded in the Word file
-- WordDocument: the actual text and formatting information
-- SummaryInformation: summary information
-- DocumentSummaryInformation: other summary information
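A listing like the one above can be produced by following the previous, next and subdirectory links of the directory items. The sketch below assumes that the links are indexes into the array of directory items, that -1 means "no item", and that the previous/next links of the items in one directory form a binary tree rooted at the item the parent points to. These are assumptions for illustration, and the sample items are hand-made, not read from a real file.

```python
def walk(items: list, root: int = 0):
    """Yield (depth, name) for the root item and everything below it."""
    seen = {root}
    yield 0, items[root]["name"]
    yield from _siblings(items, items[root]["child"], 1, seen)


def _siblings(items: list, index: int, depth: int, seen: set):
    """Visit one directory level in order: previous, item, next."""
    if index < 0:  # assumption: -1 means there is no item
        return
    if index in seen:
        raise ValueError("directory items form a loop")
    seen.add(index)
    yield from _siblings(items, items[index]["prev"], depth, seen)
    yield depth, items[index]["name"]
    yield from _siblings(items, items[index]["child"], depth + 1, seen)
    yield from _siblings(items, items[index]["next"], depth, seen)


# Hand-made sample items; a real file would supply these from its directory chain.
items = [
    {"name": "Root Entry", "prev": -1, "next": -1, "child": 2},
    {"name": "1Table", "prev": -1, "next": -1, "child": -1},
    {"name": "WordDocument", "prev": 1, "next": 3, "child": -1},
    {"name": "ObjectPool", "prev": -1, "next": -1, "child": -1},
]
for depth, name in walk(items):
    print("-- " * depth + name)
```

The `seen` set stops the walk if a damaged file links an item to itself or to one of its ancestors.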
