The mystery of Office files: parsing Word, PowerPoint, and other files on the .NET platform without Office (Part 1)


"Series Index"

    1. The mystery of Office files: parsing Word, PowerPoint, and other files on the .NET platform without Office (Part 1)
      Getting the DocumentSummaryInformation and SummaryInformation of Office binary files
    2. The mystery of Office files: parsing Word, PowerPoint, and other files on the .NET platform without Office (Part 2)
      Getting the text content of Word binary documents (.doc), including body, headers, footers, comments, and so on
    3. The mystery of Office files: parsing Word, PowerPoint, and other files on the .NET platform without Office (Part 3)
      Details of the storage structure in Office binary files, and getting the text content of PowerPoint binary documents (.ppt)
    4. The mystery of Office files: parsing Word, PowerPoint, and other files on the .NET platform without Office (Conclusion)
      How to parse Office Open XML documents (.docx, .pptx), and common open-source libraries for parsing Office files

"Article index"

    1. How .NET reads Office files
    2. Windows compound binary files and their headers
    3. We start from the directory
    4. DocumentSummaryInformation and SummaryInformation
    5. Related links

"One, how .NET reads Office files"

Back in 2010 I had to build a file retrieval system that included full-text search of Word, PowerPoint, and other file formats. Since I had used .NET before and these are Microsoft's own formats, I assumed reading them from .NET would be easy. Unexpectedly, the only approach I could find for .NET was reading Office files through Interop. Later I came across Java's POI and found that it had been ported to .NET as NPOI, but NPOI only covered the core of Excel reading and writing, with nothing for Word or PowerPoint. In the end I had no choice but to grit my teeth and write the Word and PowerPoint parsing myself.

So what is Interop? Its full name is "Interoperability"; it is the mechanism Microsoft provides for managed .NET code to call unmanaged COM components. Reading and writing Office files through Interop drives the Office software installed on the computer. The benefit is obvious: files are produced and read by Office itself, so the result is no different from opening them in Office by hand. The drawbacks are just as obvious: the matching version of Office must be installed on the machine that runs the program, and every operation actually launches the corresponding Office component, so it is slow, memory-hungry, and can leak memory. There are plenty of examples online of reading and writing Office files through Interop; interested readers can look them up, so I won't go into them here.

So, is there a way to read and write Office files without using the Office software? The answer is certainly yes: like POI in Java and its NPOI port in .NET, the program can read and write the files itself. However, because the Office file structure is so complex, this series only covers parsing the summary information and the text content of a file. Even so, that is enough for full-text search.

"Two, Windows compound binary files and their headers"

A few years ago Microsoft opened up some of its proprietary format specifications so that anyone can parse these files without paying a fee, which is what makes writing our own parser possible; the related links are at the end of the article. A Microsoft Office binary file is essentially a Windows Compound Binary File. Its header is fixed at 512 bytes and records the file's most important parameters. Everything after the header is divided into sectors of five kinds: FAT, Mini-FAT (which belongs to the mini-sectors), Directory, DIF, and Storage. For ease of reference, each sector has a SectorID; the sector immediately after the header is the first sector, and its SectorID is 0.

Let's start with the header. Its layout is listed below; the fields that matter most for parsing are the sector sizes and the counts and starting SectorIDs of the FAT, Directory, Mini-FAT, and DIF.

  1. The first 8 bytes of the header (Byte[]), which are also the first 8 bytes of the whole file, are fixed at 0xD0 0xCF 0x11 0xE0 0xA1 0xB1 0x1A 0xE1; if they differ, the file is not a compound file.
  2. The 16 bytes from 008H to 017H are the class ID, but in many files they are all 0.
  3. The 2-byte UInt16 from 018H to 019H is the minor version of the file format.
  4. The 2-byte UInt16 from 01AH to 01BH is the major version of the file format.
  5. The 2-byte UInt16 from 01CH to 01DH is fixed at 0xFE 0xFF, indicating that the document uses little-endian byte order (low byte first, high byte last).
  6. The 2-byte UInt16 from 01EH to 01FH is the sector size as a power of 2; the default is 9 (0x09 0x00), i.e. 512 bytes per sector.
  7. The 2-byte UInt16 from 020H to 021H is the mini-sector size as a power of 2; the default is 6 (0x06 0x00), i.e. 64 bytes per mini-sector.
  8. The 2-byte UInt16 from 022H to 023H is reserved and must be 0.
  9. The 4-byte UInt32 from 024H to 027H is reserved and must be 0.
  10. The 4-byte UInt32 from 028H to 02BH is reserved and must be 0.
  11. The 4-byte UInt32 from 02CH to 02FH is the number of FAT sectors.
  12. The 4-byte UInt32 from 030H to 033H is the SectorID where the Directory starts.
  13. The 4-byte UInt32 from 034H to 037H is used for transactions and must be 0.
  14. The 4-byte UInt32 from 038H to 03BH is the maximum size of a mini stream; the default is 4096 (0x00 0x10 0x00 0x00).
  15. The 4-byte UInt32 from 03CH to 03FH is the SectorID where the Mini-FAT starts.
  16. The 4-byte UInt32 from 040H to 043H is the number of Mini-FAT sectors.
  17. The 4-byte UInt32 from 044H to 047H is the SectorID where the DIFAT starts.
  18. The 4-byte UInt32 from 048H to 04BH is the number of DIFAT sectors.
  19. The 436-byte UInt32[] from 04CH to 1FFH holds the SectorIDs of the first 109 FAT sectors.

Then we can write the following code to parse the important contents of the header.

    // Members of the reader class; needs: using System; using System.IO;
    #region Fields
    private FileStream m_stream;
    private BinaryReader m_reader;
    private Int64 m_length;
    private DirectoryEntry m_dirRootEntry;

    #region Header information
    private UInt32 m_sectorSize;            // sector size
    private UInt32 m_miniSectorSize;        // mini-sector size
    private UInt32 m_fatCount;              // number of FAT sectors
    private UInt32 m_dirStartSectorID;      // SectorID where the Directory starts
    private UInt32 m_miniFatStartSectorID;  // SectorID where the Mini-FAT starts
    private UInt32 m_miniFatCount;          // number of Mini-FAT sectors
    private UInt32 m_difStartSectorID;      // SectorID where the DIF starts
    private UInt32 m_difCount;              // number of DIF sectors
    #endregion
    #endregion

    #region Read header information
    private void ReadHeader()
    {
        if (this.m_reader == null)
        {
            return;
        }

        // First check the signature; the file must be longer than the 512-byte header
        Byte[] sig = (this.m_length > 512 ? this.m_reader.ReadBytes(8) : null);
        if (sig == null ||
            sig[0] != 0xD0 || sig[1] != 0xCF || sig[2] != 0x11 || sig[3] != 0xE0 ||
            sig[4] != 0xA1 || sig[5] != 0xB1 || sig[6] != 0x1A || sig[7] != 0xE1)
        {
            throw new Exception("The file is not an Office file!");
        }

        // Read header information
        this.m_stream.Seek(22, SeekOrigin.Current);  // skip from 008H to 01EH
        this.m_sectorSize = (UInt32)Math.Pow(2, this.m_reader.ReadUInt16());
        this.m_miniSectorSize = (UInt32)Math.Pow(2, this.m_reader.ReadUInt16());

        this.m_stream.Seek(10, SeekOrigin.Current);  // skip the reserved bytes, 022H to 02CH
        this.m_fatCount = this.m_reader.ReadUInt32();
        this.m_dirStartSectorID = this.m_reader.ReadUInt32();

        this.m_stream.Seek(8, SeekOrigin.Current);   // skip from 034H to 03CH
        this.m_miniFatStartSectorID = this.m_reader.ReadUInt32();
        this.m_miniFatCount = this.m_reader.ReadUInt32();
        this.m_difStartSectorID = this.m_reader.ReadUInt32();
        this.m_difCount = this.m_reader.ReadUInt32();
    }
    #endregion

One interesting aside: .NET's BinaryReader has many read methods, such as ReadUInt16 and ReadInt32, but only the summary of ReadUInt16 says "uses little-endian encoding...". In fact all of the ReadIntX, ReadUIntX, ReadSingle, and ReadDouble methods read from the stream in little-endian order, so we can use them with confidence instead of reading byte by byte and reversing the array, a detour I took back in 2010. The remarks for each method on MSDN explain this: http://msdn.microsoft.com/zh-cn/library/vstudio/system.io.binaryreader_methods.aspx
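To make the little-endian behaviour concrete, here is a minimal sketch that feeds BinaryReader the default byte sequences from the header table. The class name LittleEndianDemo is hypothetical and only exists for this illustration.

```csharp
using System;
using System.IO;

public static class LittleEndianDemo
{
    // Reads a UInt16 from the start of the buffer using BinaryReader,
    // which always interprets the bytes as little-endian.
    public static UInt16 ReadUInt16(Byte[] bytes)
    {
        using (BinaryReader reader = new BinaryReader(new MemoryStream(bytes)))
        {
            return reader.ReadUInt16();
        }
    }

    // Same idea for a 4-byte value.
    public static UInt32 ReadUInt32(Byte[] bytes)
    {
        using (BinaryReader reader = new BinaryReader(new MemoryStream(bytes)))
        {
            return reader.ReadUInt32();
        }
    }
}
```

With these helpers, the bytes 0x09 0x00 read back as 9 (the default sector size exponent), and 0x00 0x10 0x00 0x00 read back as 4096 (the default mini stream size limit), with no manual byte reversal.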

"Three, we start from the directory"

A compound document holds a lot of content, and that much content needs a table of contents; the Directory is exactly that. From the header we can read the SectorID where the Directory starts, and then seek to that position (0x200 + sectorSize * dirStartSectorID). Each DirectoryEntry in the Directory is fixed at 128 bytes, and its main structure is as follows:

    1. The 64 bytes from 000H to 03FH store the name of the DirectoryEntry in Unicode, so each character takes 2 bytes and can in practice be treated as a UInt16.
    2. The 2-byte UInt16 from 040H to 041H is the length in bytes of the DirectoryEntry name (including the trailing "\0").
    3. The 1-byte Byte at 042H is the type of the DirectoryEntry (mainly: 1 for a storage, 2 for a stream, and 5 for the root).
    4. The 4-byte UInt32 from 044H to 047H is the EntryID of the left sibling of the DirectoryEntry (the EntryID of the first DirectoryEntry is 0; the same applies below).
    5. The 4-byte UInt32 from 048H to 04BH is the EntryID of the right sibling of the DirectoryEntry.
    6. The 4-byte UInt32 from 04CH to 04FH is the EntryID of the child of the DirectoryEntry.
    7. The 4-byte UInt32 from 074H to 077H is the SectorID where the content of the DirectoryEntry starts.
    8. The 4-byte UInt32 from 078H to 07BH is the total length in bytes of the content stored by the DirectoryEntry.
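The name field can also be decoded in one call instead of character by character. The following is a minimal sketch, assuming the 128-byte entry has already been read into a byte array; the class and method names are hypothetical and not taken from the article's source.

```csharp
using System;
using System.Text;

public static class DirectoryEntryName
{
    // entry: the raw 128-byte DirectoryEntry.
    // The name occupies 000H..03FH; its byte length, including the
    // 2-byte terminator, is the little-endian UInt16 at 040H.
    public static String Decode(Byte[] entry)
    {
        if (entry == null || entry.Length < 0x42)
        {
            throw new ArgumentException("A directory entry needs at least 66 bytes.");
        }

        Int32 nameBytes = entry[0x40] | (entry[0x41] << 8);

        // Drop the terminator, clamp to the 64-byte field, keep it even.
        Int32 usable = Math.Max(0, Math.Min(nameBytes - 2, 64)) & ~1;

        // Encoding.Unicode is UTF-16 little-endian, matching the file format.
        return Encoding.Unicode.GetString(entry, 0, usable);
    }
}
```

Clamping the length guards against unused entries whose stored name length is 0, which would otherwise produce a negative character count.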

Clearly the Directory is a tree structure, and we can search it recursively starting from the first entry (the Root Entry).

To make development easier, we create a DirectoryEntry class:

    // needs: using System; using System.Collections.Generic;
    public enum DirectoryEntryType : byte
    {
        Invalid = 0,
        Storage = 1,
        Stream = 2,
        LockBytes = 3,
        Property = 4,
        Root = 5
    }

    public class DirectoryEntry
    {
        #region Fields
        private UInt32 m_entryID;
        private String m_entryName;
        private DirectoryEntryType m_entryType;
        private UInt32 m_sectorID;
        private UInt32 m_length;
        private DirectoryEntry m_parent;
        private List<DirectoryEntry> m_children;
        #endregion

        #region Properties
        /// <summary>Gets the EntryID of the DirectoryEntry</summary>
        public UInt32 EntryID
        {
            get { return this.m_entryID; }
        }

        /// <summary>Gets the name of the DirectoryEntry</summary>
        public String EntryName
        {
            get { return this.m_entryName; }
        }

        /// <summary>Gets the type of the DirectoryEntry</summary>
        public DirectoryEntryType EntryType
        {
            get { return this.m_entryType; }
        }

        /// <summary>Gets the SectorID of the DirectoryEntry</summary>
        public UInt32 SectorID
        {
            get { return this.m_sectorID; }
        }

        /// <summary>Gets the content size of the DirectoryEntry</summary>
        public UInt32 Length
        {
            get { return this.m_length; }
        }

        /// <summary>Gets the parent node of the DirectoryEntry</summary>
        public DirectoryEntry Parent
        {
            get { return this.m_parent; }
        }

        /// <summary>Gets the child nodes of the DirectoryEntry</summary>
        public List<DirectoryEntry> Children
        {
            get { return this.m_children; }
        }
        #endregion

        #region Constructors
        /// <summary>Initializes a new DirectoryEntry</summary>
        /// <param name="parent">Parent node</param>
        /// <param name="entryID">DirectoryEntry ID</param>
        /// <param name="entryName">DirectoryEntry name</param>
        /// <param name="entryType">DirectoryEntry type</param>
        /// <param name="sectorID">SectorID</param>
        /// <param name="length">Content size</param>
        public DirectoryEntry(DirectoryEntry parent, UInt32 entryID, String entryName,
            DirectoryEntryType entryType, UInt32 sectorID, UInt32 length)
        {
            this.m_entryID = entryID;
            this.m_entryName = entryName;
            this.m_entryType = entryType;
            this.m_sectorID = sectorID;
            this.m_length = length;
            this.m_parent = parent;

            // Only the root and storages can contain other entries
            if (entryType == DirectoryEntryType.Root || entryType == DirectoryEntryType.Storage)
            {
                this.m_children = new List<DirectoryEntry>();
            }
        }
        #endregion

        #region Methods
        public void AddChild(DirectoryEntry entry)
        {
            if (this.m_children == null)
            {
                this.m_children = new List<DirectoryEntry>();
            }

            this.m_children.Add(entry);
        }

        public DirectoryEntry GetChild(String entryName)
        {
            // Streams have no child list, so there is nothing to search
            if (this.m_children == null)
            {
                return null;
            }

            for (Int32 i = 0; i < this.m_children.Count; i++)
            {
                if (String.Equals(this.m_children[i].EntryName, entryName))
                {
                    return this.m_children[i];
                }
            }

            return null;
        }
        #endregion
    }

We can then search the tree recursively; the listing for this step appears after the download link near the end of this article.

"Four, DocumentSummaryInformation and SummaryInformation"

Office documents contain a lot of summary information, such as the title, author, and editing time.

Summary information comes in two kinds, DocumentSummaryInformation and SummaryInformation, each holding different properties. With the directory-reading code we should be able to find, under the Root Entry, one entry named "\005DocumentSummaryInformation" and one named "\005SummaryInformation".

For DocumentSummaryInformation, the structure is as follows:

    1. The 4-byte UInt32 from 018H to 01BH is the number of property sets stored.
    2. Every 20 bytes starting from 01CH describe one property set:
      • The first 16 bytes (Byte[]) identify the property set: 0x02 0xD5 0xCD 0xD5 0x9C 0x2E 0x1B 0x10 0x93 0x97 0x08 0x00 0x2B 0x2C 0xF9 0xAE means DocumentSummaryInformation, and 0x05 0xD5 0xCD 0xD5 0x9C 0x2E 0x1B 0x10 0x93 0x97 0x08 0x00 0x2B 0x2C 0xF9 0xAE means UserDefinedProperties.
      • The last 4 bytes (UInt32) are the offset of the property set relative to the start of the entry.
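The steps above can be sketched as follows, assuming the whole content of the entry has already been loaded into a byte array. The class name PropertySetHeaders is hypothetical. The 16 identifier bytes are interpreted with the Guid(Byte[]) constructor, which reads the first three groups as little-endian, so the byte sequences quoted in the article map to the GUID strings used in the constants.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class PropertySetHeaders
{
    // Identifiers written as GUIDs; they correspond to the byte sequences in the text.
    public static readonly Guid DocumentSummaryInformation = new Guid("D5CDD502-2E9C-101B-9397-08002B2CF9AE");
    public static readonly Guid UserDefinedProperties = new Guid("D5CDD505-2E9C-101B-9397-08002B2CF9AE");
    public static readonly Guid SummaryInformation = new Guid("F29F85E0-4FF9-1068-AB91-08002B27B3D9");

    // Returns one (identifier, offset) pair per property set in the entry.
    public static List<KeyValuePair<Guid, UInt32>> Read(Byte[] entryContent)
    {
        List<KeyValuePair<Guid, UInt32>> result = new List<KeyValuePair<Guid, UInt32>>();
        using (BinaryReader reader = new BinaryReader(new MemoryStream(entryContent)))
        {
            reader.BaseStream.Seek(0x18, SeekOrigin.Begin);
            UInt32 count = reader.ReadUInt32();          // number of property sets

            for (UInt32 i = 0; i < count; i++)
            {
                Guid id = new Guid(reader.ReadBytes(16)); // 16-byte identifier
                UInt32 offset = reader.ReadUInt32();      // offset relative to the entry
                result.Add(new KeyValuePair<Guid, UInt32>(id, offset));
            }
        }

        return result;
    }
}
```

Comparing Guid values avoids a 16-way byte comparison; truncated input simply surfaces as an exception from the reader, which is acceptable for a sketch but would need handling in real code.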

For each property set, the structure is as follows:

    1. The 4-byte UInt32 from 000H to 003H is the size of the property set.
    2. The 4-byte UInt32 from 004H to 007H is the number of properties in the property set.
    3. Every 8 bytes starting from 008H describe one property:
      • The first 4 bytes (UInt32) are the property number, which identifies the kind of property.
      • The last 4 bytes (UInt32) are the offset of the property content relative to the start of the property set.
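Reading this index is a short loop. A minimal sketch, again assuming the entry content is in a byte array and that the offset of the property set is already known; the class name PropertySetIndex is hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class PropertySetIndex
{
    // Maps each property number to the offset of its content,
    // relative to the start of the property set.
    public static Dictionary<UInt32, UInt32> Read(Byte[] entryContent, UInt32 setOffset)
    {
        Dictionary<UInt32, UInt32> index = new Dictionary<UInt32, UInt32>();
        using (BinaryReader reader = new BinaryReader(new MemoryStream(entryContent)))
        {
            reader.BaseStream.Seek(setOffset, SeekOrigin.Begin);
            reader.ReadUInt32();                  // size of the property set, not needed here
            UInt32 count = reader.ReadUInt32();   // number of properties

            for (UInt32 i = 0; i < count; i++)
            {
                UInt32 propertyId = reader.ReadUInt32();
                UInt32 offset = reader.ReadUInt32();
                index[propertyId] = offset;
            }
        }

        return index;
    }
}
```

To reach a property's content, add its offset to setOffset and seek there before reading the type field.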

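The common DocumentSummaryInformation property numbers were given in a collapsed listing in the original post that is not reproduced here; the full source is available from the download link near the end of the article.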
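The common property numbers can be sketched as a C# enum. The values follow the publicly documented property identifiers for the DocumentSummaryInformation property set; the enum and member names are my own choice and may differ from the author's listing.

```csharp
// Property numbers of the DocumentSummaryInformation property set.
public enum DocumentSummaryPropertyId : uint
{
    CodePage = 0x01,
    Category = 0x02,
    PresentationFormat = 0x03,
    ByteCount = 0x04,
    LineCount = 0x05,
    ParagraphCount = 0x06,
    SlideCount = 0x07,
    NoteCount = 0x08,
    HiddenSlideCount = 0x09,
    MultimediaClipCount = 0x0A,
    Scale = 0x0B,
    HeadingPairs = 0x0C,
    DocumentParts = 0x0D,
    Manager = 0x0E,
    Company = 0x0F,
    LinksDirty = 0x10
}
```

For full-text search the string-valued properties such as Category, Manager, and Company are usually the interesting ones.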

For each property, the structure is as follows:

    1. The 4-byte UInt32 from 000H to 003H is the type of the property content.
      • Type 0x02 is a 16-bit integer (treated as UInt16 here).
      • Type 0x03 is a 32-bit integer (treated as UInt32 here).
      • Type 0x0B is a Boolean.
      • Type 0x1E is a String.
    2. The remaining bytes are the content of the property.
      1. Except for String, the other three types occupy 4 bytes (shorter values are padded with zero bytes).
      2. For String, the first 4 bytes are the length of the string (including the trailing "\0"), so it cannot be read with BinaryReader's ReadString. The string content follows the length. It is stored as bytes in the property set's code page rather than as Unicode, and can be decoded with the GetString method of the matching Encoding.
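The four cases above can be sketched as a single read routine. This is a minimal illustration, not the author's implementation: the class name PropertyValueReader is hypothetical, the reader is assumed to be positioned at the start of the property, and the Encoding is assumed to have been chosen by the caller.

```csharp
using System;
using System.IO;
using System.Text;

public static class PropertyValueReader
{
    // Reads one property value; returns null for types not covered here.
    public static Object Read(BinaryReader reader, Encoding encoding)
    {
        UInt32 type = reader.ReadUInt32();
        switch (type)
        {
            case 0x02: // 16-bit integer, padded to 4 bytes
            {
                UInt16 value = reader.ReadUInt16();
                reader.ReadUInt16(); // skip the padding
                return value;
            }
            case 0x03: // 32-bit integer
                return reader.ReadUInt32();
            case 0x0B: // Boolean, padded to 4 bytes; non-zero means true
            {
                UInt16 flag = reader.ReadUInt16();
                reader.ReadUInt16(); // skip the padding
                return flag != 0;
            }
            case 0x1E: // length-prefixed string in the given encoding
            {
                Int32 length = (Int32)reader.ReadUInt32(); // includes the trailing '\0'
                Byte[] raw = reader.ReadBytes(length);
                return encoding.GetString(raw).TrimEnd('\0');
            }
            default:
                return null;
        }
    }
}
```

A caller would typically obtain the encoding with Encoding.GetEncoding(codePage). On .NET Core and later, many legacy code pages are only available after registering CodePagesEncodingProvider.Instance from the System.Text.Encoding.CodePages package via Encoding.RegisterProvider.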

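To make development easier, we create a DocumentSummary class. Interestingly, in both DocumentSummaryInformation and SummaryInformation the first property is the code page that records the encoding of the property set's content. Passing it to Encoding.GetEncoding() gives the matching encoding, whose GetString method then decodes the strings.

The listings for the DocumentSummary class and for the reading code were collapsed in the original post and are not reproduced here; both are included in the source available from the download link near the end of the article.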

SummaryInformation has the same structure as DocumentSummaryInformation, except that the 16-byte identifier of the property set is 0xE0 0x85 0x9F 0xF2 0xF9 0x4F 0x68 0x10 0xAB 0x91 0x08 0x00 0x2B 0x27 0xB3 0xD9.

The common SummaryInformation property numbers were likewise given in a collapsed listing in the original post that is not reproduced here.
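The common SummaryInformation property numbers can be sketched as a C# enum. The values follow the publicly documented property identifiers for this property set; the enum and member names are my own choice and may differ from the author's listing.

```csharp
// Property numbers of the SummaryInformation property set.
public enum SummaryPropertyId : uint
{
    CodePage = 0x01,
    Title = 0x02,
    Subject = 0x03,
    Author = 0x04,
    Keywords = 0x05,
    Comments = 0x06,
    Template = 0x07,
    LastAuthor = 0x08,
    RevisionNumber = 0x09,
    EditTime = 0x0A,
    LastPrinted = 0x0B,
    CreateTime = 0x0C,
    LastSaveTime = 0x0D,
    PageCount = 0x0E,
    WordCount = 0x0F,
    CharCount = 0x10,
    Thumbnail = 0x11,
    ApplicationName = 0x12,
    Security = 0x13
}
```

The time-valued properties (EditTime through LastSaveTime) use a type that is not among the four described in this article, so a reader limited to those four types would skip them.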

The rest of the code is not given separately because it is similar to that for DocumentSummaryInformation.

All the code for this article can be downloaded from: https://github.com/mayswind/SimpleOfficeReader

The recursive directory-reading code referred to in section three is as follows:

    // Members of the reader class; needs: using System; using System.IO; using System.Text;
    #region Constants
    private const UInt32 HeaderSize = 0x200;          // 512 bytes
    private const UInt32 DirectoryEntrySize = 0x80;   // 128 bytes
    #endregion

    #region Read directory information
    private void ReadDirectory()
    {
        if (this.m_reader == null)
        {
            return;
        }

        UInt32 leftSiblingEntryID, rightSiblingEntryID, childEntryID;
        this.m_dirRootEntry = GetDirectoryEntry(0, null, out leftSiblingEntryID, out rightSiblingEntryID, out childEntryID);
        if (this.m_dirRootEntry == null)
        {
            return; // no readable Root Entry
        }

        this.ReadDirectoryEntry(this.m_dirRootEntry, childEntryID);
    }

    private void ReadDirectoryEntry(DirectoryEntry rootEntry, UInt32 entryID)
    {
        UInt32 leftSiblingEntryID, rightSiblingEntryID, childEntryID;
        DirectoryEntry entry = GetDirectoryEntry(entryID, rootEntry, out leftSiblingEntryID, out rightSiblingEntryID, out childEntryID);

        if (entry == null || entry.EntryType == DirectoryEntryType.Invalid)
        {
            return;
        }

        rootEntry.AddChild(entry);

        if (leftSiblingEntryID < UInt32.MaxValue)   // has a left sibling
        {
            this.ReadDirectoryEntry(rootEntry, leftSiblingEntryID);
        }

        if (rightSiblingEntryID < UInt32.MaxValue)  // has a right sibling
        {
            this.ReadDirectoryEntry(rootEntry, rightSiblingEntryID);
        }

        if (childEntryID < UInt32.MaxValue)         // has a child
        {
            this.ReadDirectoryEntry(entry, childEntryID);
        }
    }

    private DirectoryEntry GetDirectoryEntry(UInt32 entryID, DirectoryEntry parentEntry,
        out UInt32 leftSiblingEntryID, out UInt32 rightSiblingEntryID, out UInt32 childEntryID)
    {
        // UInt32.MaxValue means "no entry", matching the checks in ReadDirectoryEntry
        leftSiblingEntryID = UInt32.MaxValue;
        rightSiblingEntryID = UInt32.MaxValue;
        childEntryID = UInt32.MaxValue;

        this.m_stream.Seek(GetDirectoryEntryOffset(entryID), SeekOrigin.Begin);

        if (this.m_stream.Position >= this.m_length)
        {
            return null;
        }

        // The name field holds 32 two-byte characters (64 bytes)
        StringBuilder temp = new StringBuilder();
        for (Int32 i = 0; i < 32; i++)
        {
            temp.Append((Char)this.m_reader.ReadUInt16());
        }

        // nameLen is in bytes and includes the trailing "\0"; clamp so an empty name cannot go negative
        UInt16 nameLen = this.m_reader.ReadUInt16();
        Int32 nameChars = Math.Max(0, Math.Min(temp.Length, nameLen / 2 - 1));
        String name = temp.ToString(0, nameChars);
        Byte type = this.m_reader.ReadByte();

        if (type > 5)
        {
            return null;
        }

        this.m_stream.Seek(1, SeekOrigin.Current);   // skip 043H
        leftSiblingEntryID = this.m_reader.ReadUInt32();
        rightSiblingEntryID = this.m_reader.ReadUInt32();
        childEntryID = this.m_reader.ReadUInt32();

        this.m_stream.Seek(36, SeekOrigin.Current);  // skip from 050H to 074H
        UInt32 sectorID = this.m_reader.ReadUInt32();
        UInt32 length = this.m_reader.ReadUInt32();

        return new DirectoryEntry(parentEntry, entryID, name, (DirectoryEntryType)type, sectorID, length);
    }
    #endregion

    #region Helper methods
    private Int64 GetSectorOffset(UInt32 sectorID)
    {
        return HeaderSize + (Int64)this.m_sectorSize * sectorID;
    }

    // Note: this treats the Directory sectors as contiguous in the file
    private Int64 GetDirectoryEntryOffset(UInt32 entryID)
    {
        return HeaderSize + (Int64)this.m_sectorSize * this.m_dirStartSectorID + (Int64)DirectoryEntrySize * entryID;
    }
    #endregion

"Five, related links"

1. Microsoft Open Specifications: http://www.microsoft.com/openspecifications/en/us/programs/osp/default.aspx
2. Reading the text of MS Word (.doc) files in PHP: https://imethan.com/post-2009-10-06-17-59.html
3. Office file format: http://www.programmer-club.com.tw/ShowSameTitleN/general/2681.html
4. LAOLA file system: http://stuff.mit.edu/afs/athena/astaff/project/mimeutils/share/laola/guide.html

Postscript

It took me a few days to finish writing up how to read DocumentSummaryInformation and SummaryInformation. Sure enough, writing a program and writing an article about it are very different things: for the former it is enough that it more or less works, while the latter means carefully checking the references. If you found this useful, please click Recommend.

Reprinted from: http://www.cnblogs.com/mayswind/archive/2013/03/17/2962205.html
