PDF File Structure (1): Physical Structure

Author: bobob

Mail: zxbbobob@hotmail.com

Original article: http://blog.csdn.net/bobob/article/details/4328450

PDF (Portable Document Format) is a widely used file format. Its biggest strengths are that it is platform-independent and feature-rich: it supports text, images, forms, links, music, video, and more. To parse a PDF file, you must first become familiar with its physical and logical structure. The physical structure of a PDF file can be divided into the following parts:
1. File Header
The file header is the first line of the PDF file. The format is as follows:

%PDF-1.4

This line has a fixed format and indicates which version of the PDF standard the file complies with. Currently, most PDF generation tools other than the official Acrobat produce version 1.4. For PDF development, the simplest rule of thumb is: when generating PDF files, target a lower version of the specification so that most parsers can read them; when parsing PDF files, support as high a version as possible so that files produced by most tools can be handled.

Since PDF 1.4, the header is no longer the only place where the version is recorded: it may be overridden later by the Version entry of the Catalog. Therefore, when parsing a PDF file whose header version is 1.4 or higher, compare it with the Version entry in the Catalog and use the higher of the two.
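The version rule above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the Catalog's Version value is assumed to have been extracted already and passed in as a string such as "1.6".

```python
import re
from typing import Optional, Tuple

def header_version(data: bytes) -> Tuple[int, int]:
    """Return (major, minor) from a '%PDF-M.N' header at the start of data."""
    match = re.match(rb"%PDF-(\d+)\.(\d+)", data)
    if match is None:
        raise ValueError("not a PDF header")
    return int(match.group(1)), int(match.group(2))

def effective_version(data: bytes, catalog_version: Optional[str]) -> Tuple[int, int]:
    """Pick the higher of the header version and the Catalog's Version entry.

    catalog_version is an assumption of this sketch: the already-extracted
    value of the Version entry (e.g. "1.6"), or None if the entry is absent.
    """
    version = header_version(data)
    if catalog_version is not None and version >= (1, 4):
        major, minor = catalog_version.split(".")
        version = max(version, (int(major), int(minor)))
    return version
```

For example, effective_version(b"%PDF-1.4\n", "1.6") yields (1, 6), while a 1.3 header is returned unchanged.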

2. Object set
This is the most important part of a PDF file. Every object used in the file, including text, images, music, video, fonts, hyperlinks, encryption information, and document structure information, is defined here. The format is as follows:

2 0 obj
...
endobj

An object definition contains four parts:

The leading 2 is the object number, which uniquely identifies an object. The 0 is the generation number. According to the PDF specification, this number is incremented when the file is updated and an object number is reused; together with the object number, it distinguishes the original object from a modified one. In practice, however, PDF files are rarely modified this way, and the objects are simply renumbered. The obj and endobj keywords delimit the object definition and can be thought of as a pair of opening and closing brackets. The ellipsis stands for any legal PDF object (eight types in total, described later in this article).

You can use the R keyword to reference any indirect object. For example, to reference the object above, write 2 0 R. Note that R can reference not only an object that has been defined but also one that does not exist; the latter has the same effect as referencing a null object.
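As a rough illustration of object definitions and references, the following Python sketch scans a byte string with regular expressions. The function names are hypothetical, and the approach is deliberately naive: it assumes the input contains no binary stream data that happens to include the obj or endobj keywords, so it is not a substitute for a real parser.

```python
import re

# "N G obj ... endobj" and "N G R", matched on raw bytes.
OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj\b", re.DOTALL)
REF_RE = re.compile(rb"(\d+)\s+(\d+)\s+R\b")

def find_objects(data: bytes) -> dict:
    """Map (object number, generation number) to the bytes between obj and endobj."""
    return {
        (int(m.group(1)), int(m.group(2))): m.group(3).strip()
        for m in OBJ_RE.finditer(data)
    }

def find_references(body: bytes) -> list:
    """List every (object number, generation number) referenced with R in body."""
    return [(int(num), int(gen)) for num, gen in REF_RE.findall(body)]

sample = b"2 0 obj\n<< /Next 3 0 R >>\nendobj\n3 0 obj\n(hello)\nendobj\n"
objects = find_objects(sample)
```

Looking up a reference with objects.get((9, 0)) returns None for an object that was never defined, which mirrors the rule that a reference to a nonexistent object behaves like a null object.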

3. Cross-reference table
The cross-reference table is a special organizational structure inside a PDF file that makes random access to an object by its object number convenient. The format is as follows:

xref
0 1
0000000000 65535 f
4 1
0000000009 00000 n
8 3
0000000074 00000 n
0000000120 00000 n
0000000179 00000 n

The xref keyword marks the start of a cross-reference table. Each table can be divided into several subsections. The first line of each subsection holds two numbers: the object number of the first object in the subsection, followed by the number of consecutive objects it covers. Each following line describes one object in the subsection. The first 10 digits of a line are the byte offset of the object from the beginning of the file; the next five digits are the generation number (used to track updates to the PDF file, like the generation number of an object); the final f or n indicates whether the object is in use (n means the object is in use, f means it has been deleted or is unused). The cross-reference table above has three subsections containing one, one, and three objects respectively. The object in the first subsection is free; the objects in the other subsections are in use.
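The subsection layout described above can be read with a short Python sketch like the one below. The function name parse_xref and the returned tuple layout are choices made for this illustration; the input is assumed to be the table already decoded to text, optionally followed by the trailer keyword.

```python
def parse_xref(text: str) -> dict:
    """Parse a cross-reference table into {object number: (offset, generation, in_use)}."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines or lines[0] != "xref":
        raise ValueError("expected the xref keyword")
    entries = {}
    i = 1
    while i < len(lines):
        if lines[i] == "trailer":
            break  # the table ends where the trailer begins
        # Subsection header: first object number, then the object count.
        first, count = (int(part) for part in lines[i].split())
        i += 1
        for obj_num in range(first, first + count):
            offset, generation, flag = lines[i].split()
            entries[obj_num] = (int(offset), int(generation), flag == "n")
            i += 1
    return entries

TABLE = """xref
0 1
0000000000 65535 f
4 1
0000000009 00000 n
8 3
0000000074 00000 n
0000000120 00000 n
0000000179 00000 n
"""
entries = parse_xref(TABLE)
```

With the sample table, entries covers objects 0, 4, 8, 9, and 10; object 0 is marked free and the rest are in use.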


4. Trailer
The trailer lets you quickly locate the cross-reference table and, through it, every object. Its dictionary also provides some global information about the file (author, keywords, title), encryption information, and so on. The form is as follows:
trailer
<<
key1 value1
key2 value2
key3 value3
...
>>
startxref
553
%%EOF

The trailer keyword is followed by a dictionary containing several key-value pairs. Their meanings are as follows:

Key | Value type | Description
Size | Integer | The number of entries in the cross-reference table, that is, the number of indirect objects (free ones included). If a PDF file has been updated, it contains multiple object sets, cross-reference tables, and trailers; the Size in the last trailer also counts the objects of all previous sections. The value must be a direct object.
Prev | Integer | Present only when the file contains multiple object sets, cross-reference tables, and trailers. It gives the byte offset of the previous cross-reference table from the beginning of the file. The value must be a direct object.
Root | Dictionary | A reference to the Catalog dictionary (the logical entry point of the file). It must be an indirect object.
Encrypt | Dictionary | The encryption dictionary, present when the document is protected.
Info | Dictionary | The dictionary that stores document information. It must be an indirect object.
ID | Array | The file identifier.

startxref: the number that follows gives the byte offset of the last cross-reference table from the beginning of the file.
%%EOF: the end-of-file marker.
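A parser typically starts from the end of the file. The following Python sketch (find_startxref is a hypothetical name) searches backwards for the last startxref keyword and returns the offset that follows it; it assumes the whole file, or at least its tail, is already in memory.

```python
def find_startxref(data: bytes) -> int:
    """Return the byte offset recorded after the last startxref keyword."""
    pos = data.rfind(b"startxref")
    if pos == -1:
        raise ValueError("startxref keyword not found")
    # After the keyword we expect the offset and then the %%EOF marker.
    tail = data[pos + len(b"startxref"):].split()
    if len(tail) < 2 or not tail[0].isdigit() or not tail[1].startswith(b"%%EOF"):
        raise ValueError("malformed end of file")
    return int(tail[0])

sample_tail = b"trailer\n<< /Size 11 >>\nstartxref\n553\n%%EOF\n"
xref_offset = find_startxref(sample_tail)
```

From that offset the cross-reference table and trailer can be read; if the trailer has a Prev key, the same step is repeated at the offset it gives.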

Every PDF file has the structure described above (except linearized PDF files, which will be discussed separately later). Real PDF files can be very complicated, but these parts are always present; they only vary in how many times they occur. Once you understand the physical structure of a PDF file, you can extract its objects one by one. There are eight types of objects in PDF:

1. Boolean

Represented by the keywords true and false. A boolean can be an element of an array or the value of an entry in a dictionary. It can also be used in a PostScript calculator function as the condition of if or ifelse.

2. Numeric

Integers and real numbers are supported. Non-decimal radixes and exponential notation are not supported.

Example:

1) Integer: 123 4567 +111 -2

Range: -2^31 to 2^31 - 1

2) Real number: 12.3 0.8 +6.3 -4.01 -3. +.03

Range: the largest magnitude is approximately ±3.403 × 10^38; the nonzero values closest to zero are approximately ±1.175 × 10^-38

Note: an integer that exceeds its range is converted to a real number; a real number that exceeds its range causes an error.
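The numeric rules above can be illustrated with a small Python sketch. The name parse_number and the use of ValueError for rejected tokens are assumptions of this example, and Python floats are wider than the range quoted above, so the range check is done explicitly.

```python
import re

NUMBER_RE = re.compile(r"[+-]?(\d+\.?\d*|\.\d+)")
INT_MIN, INT_MAX = -2**31, 2**31 - 1
REAL_MAX = 3.403e38  # approximate limit quoted in the text

def parse_number(token: str):
    """Convert a PDF numeric token to int or float, following the rules above."""
    if not NUMBER_RE.fullmatch(token):
        # Rejects exponential notation (6.02E23) and radix forms (16#FFFE).
        raise ValueError("not a PDF number: " + token)
    if "." not in token:
        value = int(token)
        if INT_MIN <= value <= INT_MAX:
            return value
        return float(value)  # out-of-range integers become reals
    value = float(token)
    if abs(value) > REAL_MAX:
        raise ValueError("real number out of range: " + token)
    return value
```

For instance, parse_number("+111") returns the integer 111, parse_number("-3.") returns -3.0, and parse_number("4294967296") returns a float because the value does not fit the integer range.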

 

3. String

A string is a sequence of bytes, each with a value from 0 to 255. The total length of a string cannot exceed 65535 bytes. A string can be written in two ways:

1) Literal string

A string enclosed in parentheses (). The escape character \ (backslash) can be used inside it.

Example:

(ABC) represents ABC

(A\\) represents A\

The escape sequences are defined as follows:

Escape sequence | Description
\n | Line feed
\r | Carriage return
\t | Horizontal tab
\b | Backspace
\f | Form feed (FF)
\( | Left parenthesis
\) | Right parenthesis
\\ | Backslash
\ddd | Character with octal code ddd
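As a sketch of how these escape sequences might be decoded, the Python function below takes the bytes found between the outer parentheses of a literal string. The name decode_literal is hypothetical, and the sketch does not handle a backslash used to continue a line.

```python
# Single-character escapes from the table, keyed by the byte after the backslash.
ESCAPES = {
    ord("n"): 0x0A, ord("r"): 0x0D, ord("t"): 0x09, ord("b"): 0x08,
    ord("f"): 0x0C, ord("("): ord("("), ord(")"): ord(")"), ord("\\"): ord("\\"),
}

def decode_literal(body: bytes) -> bytes:
    """Decode the content between the outer parentheses of a literal string."""
    out = bytearray()
    i = 0
    while i < len(body):
        byte = body[i]
        if byte != ord("\\"):
            out.append(byte)
            i += 1
            continue
        i += 1  # skip the backslash
        if i >= len(body):
            break  # a trailing lone backslash escapes nothing
        if body[i] in b"01234567":
            # \ddd: one to three octal digits
            j = i
            while j < len(body) and j < i + 3 and body[j] in b"01234567":
                j += 1
            out.append(int(body[i:j], 8) & 0xFF)
            i = j
        else:
            # Known escape, or an unknown one where the backslash is dropped.
            out.append(ESCAPES.get(body[i], body[i]))
            i += 1
    return bytes(out)
```

For example, the body of (A\\) decodes to the two bytes A and \, and \101\102 decodes to AB.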

 

2) Hexadecimal string

A string of hexadecimal digits enclosed in <>. Every two digits represent one byte; if the final digit is missing, it is assumed to be 0.

Example:

<AABB> represents two bytes: AA and BB.

<AAB> represents two bytes: AA and B0.
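The padding rule can be shown with a few lines of Python; decode_hex_string is a hypothetical helper that takes the token including its angle brackets.

```python
def decode_hex_string(token: str) -> bytes:
    """Decode a hexadecimal string token such as '<AABB>' into bytes."""
    token = token.strip()
    if not (token.startswith("<") and token.endswith(">")):
        raise ValueError("hexadecimal strings are enclosed in <>")
    digits = "".join(token[1:-1].split())  # whitespace between digits is dropped
    if len(digits) % 2:
        digits += "0"  # a missing final digit is assumed to be 0
    return bytes.fromhex(digits)
```

Here decode_hex_string("<AAB>") gives the bytes AA and B0, matching the example above.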

 

4. Name

A name consists of a forward slash (/) followed by a sequence of characters. The maximum length is 127. Unlike a string, a name is indivisible and unique. Indivisible means that a name object is atomic: in /Name, for example, the letter N is not an element of the name. Unique means that two identical names always denote the same object. Starting with PDF 1.2, any character except ASCII 0 (null) can be written as # followed by two hexadecimal digits.

Example:

/Name represents Name

/Name#20is represents Name is

/Name#200 represents Name 0
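The # notation can be decoded with a single substitution. In this Python sketch, decode_name is a hypothetical helper and each escaped byte is assumed to map directly to one character.

```python
import re

def decode_name(token: str) -> str:
    """Decode a name token such as '/Name#20is' into its character sequence."""
    if not token.startswith("/"):
        raise ValueError("a name starts with /")
    # Replace every #xx pair with the character whose code is hex xx.
    return re.sub(
        r"#([0-9A-Fa-f]{2})",
        lambda m: chr(int(m.group(1), 16)),
        token[1:],
    )
```

Only the two digits right after # belong to the escape, which is why /Name#200 decodes to "Name 0" rather than consuming the trailing 0.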

 

5. Array

A sequence of objects enclosed in []. The elements can be any PDF objects (including arrays). Although PDF supports only one-dimensional arrays, nesting arrays makes it possible to build arrays of any dimension (but a single array cannot have more than 8191 elements).

Example:

[549 3.14 false (Ralph) /SomeName]

 

6. Dictionary

A sequence of entries enclosed in "<<" and ">>". Each entry consists of a key and a value. The key must be a name object, and keys within one dictionary are unique; the value can be any legal PDF object (including another dictionary).

Example:

<< /IntegerItem 12
   /StringItem (a string)
   /Subdictionary << /Item1 0.4
                     /Item2 true
                     /Lastitem (not!)
                     /Verylastitem (OK)
                  >>
>>

7. Stream

A stream consists of a dictionary followed by the keywords stream and endstream, with a sequence of bytes between the two keywords. Its content is similar to a string, with some differences: a stream can be read incrementally and different parts can be used separately, whereas a string must be read as a whole in one go; a string has a length limit, whereas a stream does not. Large amounts of data are generally represented as streams. Note that a stream must be an indirect object, and the stream dictionary must be a direct object. Since PDF 1.2, stream data may also reside in an external file; in that case, the content between stream and endstream is ignored when parsing the PDF.

Example:

dictionary
stream
...data...
endstream

Common fields in the stream dictionary are as follows:

Field name | Type | Value
Length | Integer | (Required) The length of the data between the keywords stream and endstream. There may be an extra EOL marker before endstream; it is not included in the length.
Filter | Name or Array | (Optional) The name of the filter (or a list of filter names) used to encode the stream. If there are several, the array lists them in the order in which they must be applied to decode the data.
DecodeParms | Dictionary or Array | (Optional) A parameter dictionary, or an array of parameter dictionaries, for the filters named in Filter. If there is only one filter and it requires parameters, DecodeParms must supply them unless every parameter keeps its default value. If there are several filters and any of them uses non-default parameters, DecodeParms must be an array with one element per filter (a null object is used for a filter that needs no parameters or whose parameters all keep their default values). DecodeParms can be omitted if no filter needs parameters or all parameters keep their default values.
F | File specification | (Optional) The file that holds the stream data. If this field is present, the bytes between stream and endstream are ignored, FFilter is used instead of Filter, and FDecodeParms is used instead of DecodeParms. Length still gives the length of the data between stream and endstream, but there is usually no data there, so Length is 0.
FFilter | Name or Array | (Optional) Like Filter, but for the external file.
FDecodeParms | Dictionary or Array | (Optional) Like DecodeParms, but for the external file.
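To show how Length and Filter are used together, here is a minimal Python sketch. It rests on several assumptions that do not hold for every file: Length is a direct integer rather than an indirect reference, the stream uses either no filter or a single FlateDecode filter with default parameters, and the word stream does not appear inside the dictionary. The name read_stream is hypothetical; FlateDecode data is decompressed with the standard zlib module.

```python
import re
import zlib

def read_stream(obj_body: bytes) -> bytes:
    """Return the decoded data of a stream object body (dictionary + stream)."""
    header, _, rest = obj_body.partition(b"stream")
    length_match = re.search(rb"/Length\s+(\d+)", header)
    if length_match is None:
        raise ValueError("stream dictionary has no Length entry")
    length = int(length_match.group(1))
    # The end-of-line marker after the stream keyword is not part of the data.
    if rest.startswith(b"\r\n"):
        rest = rest[2:]
    elif rest.startswith(b"\n"):
        rest = rest[1:]
    raw = rest[:length]  # Length excludes any EOL before endstream
    if re.search(rb"/Filter\s*/FlateDecode\b", header):
        return zlib.decompress(raw)
    return raw

payload = zlib.compress(b"hello stream")
body = (b"<< /Length " + str(len(payload)).encode() +
        b" /Filter /FlateDecode >>\nstream\n" + payload + b"\nendstream")
decoded = read_stream(body)
```

Reading exactly Length bytes, instead of searching for endstream, avoids being misled by binary data that happens to contain that keyword.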

 

8. Null

The keyword null represents the null object. If the value of a key is null, the key can be treated as absent. Referencing a nonexistent object is equivalent to referencing the null object.

Example: (omitted)

 

The eight types above classify objects by meaning. Objects can also be classified by how they are used, as indirect objects or direct objects. Indirect objects are the most common objects in a PDF file: all objects in the object set described earlier are indirect objects, they are referenced elsewhere with the R keyword, and the entries in the cross-reference table point to indirect objects. Direct objects are easier to understand: when any of the eight object types above appears on its own, without an obj/endobj wrapper, it is a direct object.
