Before you can parse a PDF file, you need to understand its physical structure. That is only the foundation, though; the more important step is analyzing the logical structure. The logical structure of a PDF is essentially a tree whose root node is the catalog dictionary. This section follows that tree to cover how pages, outlines (bookmarks), and link information are parsed, and discusses the logical framework of the whole file in detail.
1. Catalog Root Node
The catalog is the root node of the entire PDF logical structure. It is located through the Root entry of the trailer. Although simple, it is very important because it is the connection point between the physical structure and the logical structure of the file. The catalog dictionary contains many entries; only the most important ones are described here.
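As a rough illustration of how the trailer leads to the catalog, the following sketch searches the raw bytes of a file for the last trailer and reads its Root reference. The function name find_root_reference and the sample bytes are made up for this example, and the sketch assumes a classic trailer keyword is present; files that use cross-reference streams would need a different approach.

```python
import re
from typing import Tuple


def find_root_reference(pdf_bytes: bytes) -> Tuple[int, int]:
    """Return (object number, generation) of the catalog named by the last trailer."""
    # Incremental updates append newer trailers, so the last one is the current one.
    trailer_pos = pdf_bytes.rfind(b"trailer")
    if trailer_pos == -1:
        raise ValueError("no classic trailer found")
    match = re.search(rb"/Root\s+(\d+)\s+(\d+)\s+R", pdf_bytes[trailer_pos:])
    if match is None:
        raise ValueError("trailer has no /Root entry")
    return int(match.group(1)), int(match.group(2))


# Hypothetical, heavily simplified file body used only for illustration.
sample = b"""%PDF-1.4
1 0 obj
<< /Type /Catalog /Pages 2 0 R >>
endobj
trailer
<< /Size 3 /Root 1 0 R >>
startxref
0
%%EOF
"""

print(find_root_reference(sample))
```

The returned pair identifies the indirect object that holds the catalog dictionary; a real parser would then look that object up through the cross-reference table.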
(1) Pages Field
This field is required and describes the set of all pages in the document. Its value is a page tree node, a dictionary with the following main fields:
Field | Type | Value
Type | Name | (Required) Must be Pages.
Parent | Dictionary | (Required except in the root node referenced from the catalog; must be an indirect reference) The immediate parent of the current node.
Kids | Array | (Required) An array of indirect references to the immediate children of this node. Each child is either a page object or another page tree node.
Count | Integer | (Required) The number of leaf nodes (page objects) contained in the page tree below this node.
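The way Kids and Count fit together can be sketched with a small recursive walk. In this sketch the objects of a file are assumed to be already loaded into a Python dict keyed by object number, and references are plain integers; both the data model and the name collect_pages are assumptions made for illustration, not part of any real library.

```python
def collect_pages(objects: dict, node_id: int) -> list:
    """Return the object numbers of all page objects under node_id, in page order."""
    node = objects[node_id]
    if node["Type"] == "Page":
        return [node_id]  # a leaf of the page tree
    pages = []
    for kid_id in node["Kids"]:
        pages.extend(collect_pages(objects, kid_id))
    # Count must equal the number of leaf pages below this node.
    if len(pages) != node["Count"]:
        raise ValueError(f"node {node_id}: Count is {node['Count']}, found {len(pages)}")
    return pages


# Hypothetical three-page document with one intermediate page tree node.
objects = {
    1: {"Type": "Pages", "Kids": [2, 5], "Count": 3},
    2: {"Type": "Pages", "Parent": 1, "Kids": [3, 4], "Count": 2},
    3: {"Type": "Page", "Parent": 2},
    4: {"Type": "Page", "Parent": 2},
    5: {"Type": "Page", "Parent": 1},
}

print(collect_pages(objects, 1))
```

A production parser would also guard against cycles in Kids, which this sketch does not do.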
As the fields above show, the main function of a Pages node is to organize the page objects. A page object describes the attributes and resources of a single PDF page. It is a dictionary with several important attributes:
Field | Type | Value
Type | Name | (Required) Must be Page.
Parent | Dictionary | (Required; must be an indirect reference) The page tree node that is the immediate parent of this page.
LastModified | Date | (Required if the PieceInfo entry is present; otherwise optional) The date and time when the page was last modified.
Resources | Dictionary | (Required; inheritable) All resources used by the page. If the page uses no resources, the value is an empty dictionary. If the entry is omitted entirely, the resources are inherited from an ancestor node in the page tree.
MediaBox | Rectangle | (Required; inheritable) The boundaries of the physical medium on which the page is displayed or printed, in default user space units.
CropBox | Rectangle | (Optional; inheritable) The visible region. When the page is displayed or printed, its contents are clipped to this rectangle. Default: the value of MediaBox.
BleedBox | Rectangle | (Optional) The region to which the page contents are clipped when output in a production environment. Default: the value of CropBox.
Contents | Stream or Array | (Optional) The content stream of the page. If this entry is absent, the page is blank. The value is either a single stream or an array of streams. If it is an array, the effect is the same as if all the streams were concatenated in order. This lets a PDF generator create images or other resources as they occur, even in the middle of the page content. The division between streams may fall only on lexical token boundaries and has no logical or organizational meaning.
Rotate | Integer | (Optional; inheritable) The number of degrees by which the page is rotated clockwise when displayed or printed. Must be a multiple of 90. Default: 0.
Thumb | Stream | (Optional) The thumbnail image of the page.
Annots | Array | (Optional) The annotations associated with the page.
Metadata | Stream | (Optional) The metadata for the page.
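Inherited attributes and box defaults are easier to follow in code. The sketch below looks up an inheritable attribute by climbing the Parent chain and then applies the defaults described above (CropBox falls back to MediaBox, BleedBox to CropBox, Rotate to 0). The dict-based object model and the names resolve_inherited and effective_page_geometry are assumptions for illustration only.

```python
def resolve_inherited(objects: dict, page_id: int, key: str):
    """Return the nearest value of key on the page or its ancestors, else None."""
    node_id = page_id
    while node_id is not None:
        node = objects[node_id]
        if key in node:
            return node[key]
        node_id = node.get("Parent")  # the root page tree node has no Parent
    return None


def effective_page_geometry(objects: dict, page_id: int) -> dict:
    """Apply inheritance and default rules for the page boxes and rotation."""
    media_box = resolve_inherited(objects, page_id, "MediaBox")
    if media_box is None:
        raise ValueError("MediaBox is required but was not found")
    crop_box = resolve_inherited(objects, page_id, "CropBox")
    if crop_box is None:
        crop_box = media_box  # default: MediaBox
    # BleedBox is not inherited; default: CropBox.
    bleed_box = objects[page_id].get("BleedBox", crop_box)
    rotate = resolve_inherited(objects, page_id, "Rotate")
    if rotate is None:
        rotate = 0
    if rotate % 90 != 0:
        raise ValueError("Rotate must be a multiple of 90")
    return {"MediaBox": media_box, "CropBox": crop_box,
            "BleedBox": bleed_box, "Rotate": rotate}


# Hypothetical objects: the page inherits MediaBox and Rotate from its parent.
objects = {
    1: {"Type": "Pages", "Kids": [2], "Count": 1,
        "MediaBox": [0, 0, 612, 792], "Rotate": 90},
    2: {"Type": "Page", "Parent": 1, "CropBox": [36, 36, 576, 756]},
}

print(effective_page_geometry(objects, 2))
```

Keeping the lookup separate from the defaulting step makes it easy to reuse resolve_inherited for Resources, which follows the same inheritance rule.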
A simple example:
3 0 obj
<< /Type /Page
   /Parent 4 0 R
   /MediaBox [0 0 612 792]
   /Resources << /Font <<
                   /F3 7 0 R /F5 9 0 R /F7 11 0 R
                 >>
                 /ProcSet [/PDF]
              >>
   /Contents 12 0 R
   /Thumb 14 0 R
   /Annots [23 0 R 24 0 R]
>>
endobj
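To show how such an object might be turned into a data structure, here is a minimal recursive-descent sketch applied to a cleaned-up copy of a page object like the one above. It handles only dictionaries, arrays, names, numbers, and indirect references; strings, streams, comments, booleans, and null are deliberately left out. The names Ref, tokenize, parse_value, and parse_indirect_object are invented for this example.

```python
import re
from typing import NamedTuple


class Ref(NamedTuple):
    """An indirect reference such as `4 0 R`."""
    num: int
    gen: int


# Delimiters, names, numbers, and bare keywords (obj, endobj, R).
TOKEN = re.compile(rb"<<|>>|\[|\]|/[^\s/<>\[\]()]+|[+-]?\d+(?:\.\d+)?|[A-Za-z]+")


def tokenize(data: bytes) -> list:
    return [m.group().decode("ascii") for m in TOKEN.finditer(data)]


def parse_value(tokens: list, pos: int):
    """Parse one value starting at tokens[pos]; return (value, next position)."""
    tok = tokens[pos]
    if tok == "<<":
        result = {}
        pos += 1
        while tokens[pos] != ">>":
            key = tokens[pos][1:]  # dictionary keys are names; drop the slash
            value, pos = parse_value(tokens, pos + 1)
            result[key] = value
        return result, pos + 1
    if tok == "[":
        items = []
        pos += 1
        while tokens[pos] != "]":
            value, pos = parse_value(tokens, pos)
            items.append(value)
        return items, pos + 1
    if tok.startswith("/"):
        return tok, pos + 1  # name values keep their slash
    # Two non-negative integers followed by R form an indirect reference.
    if (tok.isdigit() and pos + 2 < len(tokens)
            and tokens[pos + 1].isdigit() and tokens[pos + 2] == "R"):
        return Ref(int(tok), int(tokens[pos + 1])), pos + 3
    try:
        return (float(tok) if "." in tok else int(tok)), pos + 1
    except ValueError:
        raise ValueError(f"unsupported token: {tok!r}") from None


def parse_indirect_object(data: bytes):
    """Parse `<num> <gen> obj <value> endobj` and return (Ref, value)."""
    tokens = tokenize(data)
    if len(tokens) < 3 or tokens[2] != "obj":
        raise ValueError("expected '<num> <gen> obj'")
    ref = Ref(int(tokens[0]), int(tokens[1]))
    value, pos = parse_value(tokens, 3)
    if pos >= len(tokens) or tokens[pos] != "endobj":
        raise ValueError("missing endobj")
    return ref, value


sample = b"""3 0 obj
<< /Type /Page
   /Parent 4 0 R
   /MediaBox [0 0 612 792]
   /Resources << /Font << /F3 7 0 R /F5 9 0 R /F7 11 0 R >>
                 /ProcSet [/PDF]
              >>
   /Contents 12 0 R
   /Thumb 14 0 R
   /Annots [23 0 R 24 0 R]
>>
endobj
"""

ref, page = parse_indirect_object(sample)
print(ref, page["MediaBox"], page["Resources"]["Font"]["F5"])
```

Because the tokenizer silently skips bytes it does not recognize, this sketch is suitable only for well-formed input of the limited kind shown here.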
(2) Outlines Field
The outline lets users jump from one part of a PDF document to another; it is also called the bookmark list. Because it is a tree, it presents the structure of the document to users intuitively. Users can open or close an outline item with the mouse: opening an item shows all of its child items, and closing it hides them again. Clicking an item makes the reader jump to the corresponding location in the document. The outline dictionary contains the following fields:
Key | Type | Value
Type | Name | (Optional) If present, must be Outlines.
First | Dictionary | (Required if the outline has any items; must be an indirect reference) The first top-level outline item.
Last | Dictionary | (Required if the outline has any items; must be an indirect reference) The last top-level outline item.
Count | Integer | (Required if the document has any open outline items) The total number of open items at all levels of the outline.
The outline dictionary is the top-level object that manages the outline items. Each outline item is itself a dictionary that holds the displayed text, an action, and a destination. An outline item mainly has the following fields:
Key | Type | Value
Title | Text string | (Required) The title displayed for the current item.
Parent | Dictionary | (Required; must be an indirect reference) The parent of the current item in the outline hierarchy. For a top-level item, the parent is the outline dictionary itself.
Prev | Dictionary | (Required for every item except the first one at each level; must be an indirect reference) The previous item at the current level.
Next | Dictionary | (Required for every item except the last one at each level; must be an indirect reference) The next item at the current level.
First | Dictionary | (Required if the current item has any child items; must be an indirect reference) The first immediate child of the current item.
Last | Dictionary | (Required if the current item has any child items; must be an indirect reference) The last immediate child of the current item.
Dest | Name, Byte string, or Array | (Optional; not permitted if an A entry is present) The destination to display when the current outline item is activated.
A | Dictionary | (Optional; not permitted if a Dest entry is present) The action to perform when the current outline item is activated.
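The First and Next links are enough to visit every outline item in display order. The sketch below does that on a dict-based object model (object number to dictionary, references as plain integers) and also rejects items that carry both Dest and A. The model, the sample titles, and the name walk_outline are assumptions made for illustration.

```python
def walk_outline(objects: dict, outlines_id: int) -> list:
    """Return (depth, title) pairs for all outline items in display order."""
    result = []
    visited = set()

    def walk_level(first_id, depth):
        item_id = first_id
        while item_id is not None:
            if item_id in visited:
                raise ValueError("outline contains a cycle")
            visited.add(item_id)
            item = objects[item_id]
            if "Dest" in item and "A" in item:
                raise ValueError(f"item {item_id} has both Dest and A")
            result.append((depth, item["Title"]))
            walk_level(item.get("First"), depth + 1)  # children first
            item_id = item.get("Next")                # then the next sibling

    walk_level(objects[outlines_id].get("First"), 0)
    return result


# Hypothetical outline: two chapters, the first with two sections.
objects = {
    10: {"Type": "Outlines", "First": 11, "Last": 14, "Count": 4},
    11: {"Title": "Chapter 1", "Parent": 10, "Next": 14,
         "First": 12, "Last": 13, "Dest": "ch1"},
    12: {"Title": "Section 1.1", "Parent": 11, "Next": 13, "Dest": "s11"},
    13: {"Title": "Section 1.2", "Parent": 11, "Prev": 12, "Dest": "s12"},
    14: {"Title": "Chapter 2", "Parent": 10, "Prev": 11, "Dest": "ch2"},
}

for depth, title in walk_outline(objects, 10):
    print("  " * depth + title)
```

The visited set is a cheap safeguard, since a damaged file could make Next point back to an earlier item.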
(3) URI Field
The URI (Uniform Resource Identifier) field holds a dictionary of document-level information for URI actions. It is used when processing links in the outline and in the document body.
(4) Metadata Field
This field holds additional information about the document, expressed in XML and conforming to Adobe's XMP specification. It lets a program obtain general information about the file without parsing the entire file.
(5) Others
The catalog dictionary commonly contains the following fields:
Field | Type | Value
Type | Name | (Required) Must be Catalog.
Version | Name | (Optional) The PDF version of the document, used if it is later than the version given in the file header. If this entry is absent, or if the file header gives a later version, the file header prevails. A PDF generator can update this entry to change the version of the file.
Pages | Dictionary | (Required; must be an indirect reference) The root of the page tree of the document.
PageLabels | Number tree | (Optional) A number tree that defines the relationship between pages and page labels.
Names | Dictionary | (Optional) The name dictionary of the document.
Dests | Dictionary | (Optional; must be an indirect reference) A dictionary of names and their corresponding destinations.
ViewerPreferences | Dictionary | (Optional) A viewer preferences dictionary that defines how the document is presented when opened. If absent, the reader uses its own settings.
PageLayout | Name | (Optional) The page layout used when the document is opened. Default: SinglePage.
    SinglePage: one page at a time
    OneColumn: pages in a single column
    TwoColumnLeft: two columns, with odd-numbered pages on the left
    TwoColumnRight: two columns, with odd-numbered pages on the right
    TwoPageLeft: two pages at a time, with odd-numbered pages on the left
    TwoPageRight: two pages at a time, with odd-numbered pages on the right
PageMode | Name | (Optional) How the document is displayed when opened. Default: UseNone.
    UseNone: neither the outline nor the thumbnails are shown
    UseOutlines: the outline is shown
    UseThumbs: the thumbnails are shown
    FullScreen: full-screen mode, with no menu or any other window
    UseOC: the optional content group panel is shown
    UseAttachments: the attachments panel is shown
Outlines | Dictionary | (Optional; must be an indirect reference) The outline dictionary of the document.
Threads | Array | (Optional; must be an indirect reference) An array of the thread dictionaries of the document.
OpenAction | Array or Dictionary | (Optional) A destination to display or an action to perform when the document is opened. If absent, the top of the first page is displayed at the default zoom level.
AA | Dictionary | (Optional) An additional-actions dictionary that defines the actions taken in response to various document-wide events.
URI | Dictionary | (Optional) A URI dictionary that contains document-level information for URI actions.
AcroForm | Dictionary | (Optional) The interactive form dictionary of the document.
Metadata | Stream | (Optional; must be an indirect reference) The metadata stream of the document.
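Two of the rules in this table are simple enough to sketch directly: choosing between the header version and the catalog Version entry, and applying the PageLayout and PageMode defaults. The catalog is assumed to be an already parsed Python dict with names stored without the leading slash; the function names effective_version and viewer_settings are hypothetical.

```python
PAGE_LAYOUTS = {"SinglePage", "OneColumn", "TwoColumnLeft",
                "TwoColumnRight", "TwoPageLeft", "TwoPageRight"}
PAGE_MODES = {"UseNone", "UseOutlines", "UseThumbs",
              "FullScreen", "UseOC", "UseAttachments"}


def effective_version(header_version: str, catalog: dict) -> str:
    """Return the later of the header version and the catalog Version entry."""
    catalog_version = catalog.get("Version")
    if catalog_version is None:
        return header_version

    def as_tuple(version: str):
        # Compare "1.10" and "1.9" numerically rather than as text.
        return tuple(int(part) for part in version.split("."))

    return max(header_version, catalog_version, key=as_tuple)


def viewer_settings(catalog: dict) -> dict:
    """Return PageLayout and PageMode with the documented defaults applied."""
    layout = catalog.get("PageLayout", "SinglePage")
    mode = catalog.get("PageMode", "UseNone")
    if layout not in PAGE_LAYOUTS:
        raise ValueError(f"unknown PageLayout: {layout}")
    if mode not in PAGE_MODES:
        raise ValueError(f"unknown PageMode: {mode}")
    return {"PageLayout": layout, "PageMode": mode}


# Hypothetical catalog that overrides the header version and shows the outline.
catalog = {"Type": "Catalog", "Pages": 2, "Version": "1.7", "PageMode": "UseOutlines"}

print(effective_version("1.4", catalog))
print(viewer_settings(catalog))
```

Raising on unknown names is a design choice for this sketch; a tolerant reader might instead fall back to the default value.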