PDF File Format Analysis

Date: 2010.10.31
Author: Cryin
Blog: http://hi.baidu.com/justear
I. Overview:
PDF (Portable Document Format) is a structured document format first introduced in 1993 by Adobe, an American typesetting and image-processing software company. Most computer users know Adobe Reader, the widely used PDF viewer. Many have also noticed Adobe's software for another reason: its frequent vulnerability disclosures. Adobe, sometimes dubbed the "king of vulnerabilities", seems to have an endless supply of them and still manages to surprise people from time to time. To analyze how these vulnerabilities work, it is therefore especially important to understand the PDF file format first.
II. PDF file structure:
The structure of a PDF can be understood from two aspects: file structure and logical structure. The file structure is the physical organization of the file, and the logical structure is the logical organization of its content [1].
1. Data Object Type:
The basic elements of a PDF file are PDF objects, which are either direct objects or indirect objects. Direct objects come in the following basic types: Boolean, Number, String, Name, Array, Dictionary, Stream, and Null. An indirect object is a PDF object that carries an identifier, called the ID of the indirect object. The purpose of the identifier is to let other PDF objects reference it. Any PDF object that is given an identifier becomes an indirect object.
2. PDF file structure:
The PDF file structure (physical structure) consists of four parts: Header, Body, Cross-reference Table, and Trailer, as shown in Figure-1:
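As a rough illustration of the identifier idea, the sketch below scans raw bytes for indirect object definitions ("N G obj") and references ("N G R"). The function names and the sample bytes are assumptions made for this example, and a regular expression is only an approximation of a real PDF tokenizer.

```python
import re

# "N G obj" introduces an indirect object; "N G R" is a reference to one.
# N is the object number and G is the generation number.
OBJ_DEF = re.compile(rb"(\d+)\s+(\d+)\s+obj\b")
OBJ_REF = re.compile(rb"(\d+)\s+(\d+)\s+R\b")


def find_definitions(data: bytes):
    """Return (object number, generation) pairs for every 'N G obj' found."""
    return [(int(n), int(g)) for n, g in OBJ_DEF.findall(data)]


def find_references(data: bytes):
    """Return (object number, generation) pairs for every 'N G R' found."""
    return [(int(n), int(g)) for n, g in OBJ_REF.findall(data)]


# Hypothetical sample: object 2 holds an array that references object 3.
sample = b"2 0 obj\n[/ICCBased 3 0 R]\nendobj\n"
print(find_definitions(sample))  # [(2, 0)]
print(find_references(sample))   # [(3, 0)]
```

A naive scan like this can produce false matches inside compressed stream data, so it is only suitable for a first look at a file.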


Figure-1 PDF file structure
The Header gives the version of the PDF specification that the file follows, and appears on the first line of the PDF file. For example, %PDF-1.6 indicates that the file conforms to version 1.6 of the specification.
The Body consists of a series of indirect objects. These objects make up the actual content of the PDF file, such as fonts, pages, and images.
The Cross-reference Table is an index of the addresses (byte offsets) of the indirect objects, built to allow random access to them.
The Trailer declares the address of the cross-reference table, specifies the root object (Catalog) of the file body, and holds encryption and other security information. Using the information in the Trailer, a PDF application can locate the cross-reference table and the root object of the PDF file, and from there work through the entire file.
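The two ends of the file can be read with very little code. The following is a minimal sketch, assuming the whole file is already in memory as bytes; the function names are hypothetical, and the sample bytes are a made-up skeleton rather than a complete PDF.

```python
import re


def read_header_version(data: bytes) -> str:
    """Return the version string from a '%PDF-x.y' header on the first line."""
    m = re.match(rb"%PDF-(\d+\.\d+)", data)
    if m is None:
        raise ValueError("file does not start with a %PDF- header")
    return m.group(1).decode("ascii")


def read_startxref(data: bytes) -> int:
    """Return the byte offset that follows the last 'startxref' keyword."""
    pos = data.rfind(b"startxref")
    if pos < 0:
        raise ValueError("no startxref keyword found")
    m = re.match(rb"startxref\s+(\d+)", data[pos:])
    if m is None:
        raise ValueError("startxref is not followed by an offset")
    return int(m.group(1))


# Hypothetical skeleton: only the first and last lines matter here.
sample = b"%PDF-1.6\n...body, xref and trailer omitted...\nstartxref\n8980\n%%EOF\n"
print(read_header_version(sample))  # 1.6
print(read_startxref(sample))       # 8980
```

Searching from the end of the file mirrors how a reader works: it finds the offset of the cross-reference table first, then uses the table to jump to individual objects.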
3. PDF document structure:
The document structure of a PDF reflects the hierarchical relationships among the indirect objects in the file body. It is a tree, as shown in Figure-2. The root node of the tree is the Catalog object of the PDF file. The Catalog is the root object of the PDF document and holds, among other things, the document outline and the page group object (Pages). The root node has four subtrees: the Pages Tree, the Outline Tree, the Article Threads, and the Named Destinations.


Figure-2 PDF document structure
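The walk from the Catalog down to the individual pages can be sketched as a small recursive function. The in-memory model here is an assumption made for illustration: objects are plain Python dictionaries keyed by object number, and references are stored as bare object numbers. The object numbers follow the sample file analyzed later in this article, but the Catalog's Pages entry is assumed, since the Catalog object itself is not shown there.

```python
# Hypothetical in-memory model: object number -> dictionary.
objects = {
    12: {"Type": "Catalog", "Pages": 1},            # assumed Catalog content
    1: {"Type": "Pages", "Count": 1, "Kids": [8]},
    8: {"Type": "Page", "Parent": 1, "Contents": 7},
}


def collect_pages(objs: dict, node_num: int) -> list:
    """Depth-first walk of the page tree; returns Page object numbers in order."""
    node = objs[node_num]
    if node["Type"] == "Page":
        return [node_num]
    pages = []
    for kid in node.get("Kids", []):
        pages.extend(collect_pages(objs, kid))
    return pages


def pages_from_root(objs: dict, root_num: int) -> list:
    """Start at the Catalog and follow its Pages entry into the page tree."""
    return collect_pages(objs, objs[root_num]["Pages"])


print(pages_from_root(objects, 12))  # [8]
```

Intermediate Pages nodes may themselves have Pages nodes as children, which is why the walk is recursive rather than a single loop over Kids.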

4. Resources in PDF:
The content of a PDF page (text, graphics, images, and so on) is stored in the Stream object referenced by the Contents key of the page object. A content stream uses many basic objects, such as numbers and strings, which are written as direct objects. Other objects, such as fonts, are represented by Dictionary or Stream objects and cannot be written inline, and indirect object references cannot appear in a content stream either. These objects are therefore given names, and the content stream refers to them by name. Such named objects are called Named Resources.
The page object has a Resources key that lists all the resources used in the content stream and establishes a mapping table between resource names and resource objects.
Named resources in PDF include: procedure set (ProcSet), Font, Color space, external object (XObject), Extended graphics state, Pattern, and Property list. Unnamed resources include Encoding, Font descriptor, Halftone, Function, and CMap. Because unnamed resources are referenced implicitly, they do not need names.
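The mapping between resource names and resource objects described in this section can be pictured as a nested dictionary. This is only a sketch: the dictionary layout and the function name are assumptions, and the entries mirror the Resources of the page object in the sample file analyzed later in this article, with references written as (object number, generation) pairs.

```python
# Hypothetical model of a page's Resources dictionary.
resources = {
    "Font": {"F4": (4, 0)},
    "ColorSpace": {"CS1": (2, 0)},
    "XObject": {},
    "Shading": {},
}


def resolve_resource(res: dict, category: str, name: str) -> tuple:
    """Map a resource name used in a content stream to its indirect object."""
    try:
        return res[category][name]
    except KeyError:
        raise KeyError(f"/{name} is not listed under /{category}") from None


# A content stream that selects the font with "/F4 12 Tf" would resolve
# the name F4 through the Font category:
print(resolve_resource(resources, "Font", "F4"))  # (4, 0)
```

The indirection is the point of the design: the content stream stays free of object numbers, and the page's Resources entry decides what each name means.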
5. PDF page description instructions:
PDF has a total of 60 page description instructions. These instructions describe a series of graphics objects on the page, which fall into four types: Path Object, Text Object, Image Object, and External Object.
III. PDF file analysis:
A PDF file is a mix of text and binary data, but Adobe prefers that it be treated as a binary file. It is therefore recommended that a file consisting mostly of text include a comment containing some binary characters, so that existing tools treat it as binary. The text in the file mainly describes the file structure. The binary content comes from three sources: 1. images; 2. fonts; 3. compressed PostScript [2].
The following analysis uses a PDF file that contains only one sentence. Open the PDF file in UltraEdit and switch to hex mode to see content similar to the listing below. I pick out the most important parts and explain them with annotations that follow a # mark.
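To make the four types concrete, the sketch below tags a handful of well-known content stream operators with the kind of graphics object they belong to. The table is deliberately partial, the grouping is an illustrative reading of the four types named above, and splitting on whitespace is an assumption that only holds for simple streams (it breaks on strings that contain spaces).

```python
# Partial, illustrative operator table (not the full instruction set).
OPERATOR_KIND = {
    # path construction and painting
    "m": "path", "l": "path", "c": "path", "re": "path", "h": "path",
    "S": "path", "f": "path",
    # text object delimiters and text showing
    "BT": "text", "ET": "text", "Tj": "text",
    # inline image
    "BI": "image", "ID": "image", "EI": "image",
    # paint an external object by name
    "Do": "external object",
}


def kinds_used(content: str) -> list:
    """Return the kinds of graphics objects in order of first appearance."""
    seen = []
    for token in content.split():
        kind = OPERATOR_KIND.get(token)
        if kind is not None and kind not in seen:
            seen.append(kind)
    return seen


# Hypothetical content stream: one text object, one filled rectangle,
# and one external object painted by name.
sample = "BT /F4 12 Tf 72 720 Td (Hello) Tj ET 100 100 50 50 re f /Im1 Do"
print(kinds_used(sample))  # ['text', 'path', 'external object']
```

Operands come before the operator in a content stream, which is why the tokens that are not in the table (names, numbers, strings) can simply be skipped here.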
%PDF-1.6 # file header; the file conforms to version 1.6 of the specification
% (binary comment bytes) # the objects follow below
2 0 obj # an object: 2 is the object number and 0 is the generation number
<< # the object's content sits between << and >>
[/ICCBased 3 0 R]
>>
endobj # keyword that ends the object

7 0 obj
<<
/Filter
/FlateDecode # the stream is compressed with the Flate (zlib/deflate) algorithm
/Length 148 # length of the stream data in bytes
>>
stream # start of the stream data
PDF File Format analysis Author: Cryin # page content; typed in by hand here for readability, since the real stream data is compressed
endstream # end of the stream data
endobj

8 0 obj
<<
/Contents 7 0 R # the page content is object 7
/MediaBox [0 0 595.2 841.68] # page size, in default user space units (1/72 inch)
/PageIndex 1
/Parent 1 0 R # the parent is object 1, the Pages object
/Resources # resources used by this page
<</Font <</F4 4 0 R>> # font
/Shading <<>>
/XObject <<>> # external objects
/ColorSpace <</CS1 2 0 R>> # color space
>> # end of the Resources dictionary
/Type/Page
>>
endobj

1 0 obj
<<
/Count 1 # the number of pages is 1
/Kids [8 0 R] # Kids lists the child page objects; here the only child is object 8
/Type/Pages
>>
endobj

13 0 obj
<<
/Author (? Cryin)
/CreationDate (D:20100926145832+08'00')
/Title (? PDF File Format Analysis)
>>
endobj

xref # start of the cross-reference table
0 14 # 0 is the number of the first object in this subsection; 14 is the number of entries
0000000000 65535 f # cross-reference tables normally begin with this entry: object 0 is free (f), with offset 0 and generation number 65535
0000003195 00000 n # object 1, the Pages object: 3195 is its byte offset, 00000 its generation number, and n means the object is in use
0000000018 00000 n
0000000051 00000 n
0000003464 00000 n
0000000000 00000 f
0000004282 00000 n
0000002728 00000 n
0000002992 00000 n
0000003256 00000 n
0000003892 00000 n
0000003620 00000 n
0000008660 00000 n
0000008712 00000 n
trailer # start of the trailer dictionary
<</Size 14 # 14 is the number of entries in the cross-reference table (objects 0 to 13)
/Root 12 0 R # the root object (Catalog) is object 12
/Info 13 0 R>>
startxref
8980 # byte offset of the cross-reference table, in decimal
%%EOF # end-of-file marker
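The cross-reference entries in the listing have a fixed layout: a 10-digit byte offset, a 5-digit generation number, and a flag that is n (in use) or f (free). The following is a minimal sketch of a parser for such a section, assuming the classic text table shown here rather than a cross-reference stream; the class and function names are hypothetical, and the sample is a shortened version of the table above.

```python
import re
from typing import List, NamedTuple


class XrefEntry(NamedTuple):
    obj_num: int
    offset: int
    generation: int
    in_use: bool


# One entry: 10-digit offset, 5-digit generation number, then 'n' or 'f'.
ENTRY = re.compile(r"^(\d{10}) (\d{5}) ([nf])\s*$")


def parse_xref_section(text: str) -> List[XrefEntry]:
    """Parse an 'xref' section made of one or more 'first count' subsections."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines or lines[0].strip() != "xref":
        raise ValueError("section must start with the xref keyword")
    entries = []
    i = 1
    while i < len(lines):
        first, count = (int(x) for x in lines[i].split())
        i += 1
        for k in range(count):
            m = ENTRY.match(lines[i + k])
            if m is None:
                raise ValueError(f"malformed entry: {lines[i + k]!r}")
            entries.append(
                XrefEntry(first + k, int(m.group(1)), int(m.group(2)), m.group(3) == "n")
            )
        i += count
    return entries


# Shortened sample: the free entry for object 0 and the first two objects.
sample = """xref
0 3
0000000000 65535 f
0000003195 00000 n
0000000018 00000 n
"""
entries = parse_xref_section(sample)
print(entries[1].obj_num, entries[1].offset, entries[1].in_use)  # 1 3195 True
```

With the entries in hand, a reader can seek straight to an object's offset instead of scanning the whole body, which is the random access the table exists for.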

IV. Conclusion:
This concludes the analysis of the basic PDF file format. The analysis of malicious PDF files is not covered here. The sample PDF in this article contains no embedded JavaScript, yet embedded JavaScript or Flash content is usually the starting point of PDF vulnerability analysis. PDF exploits generally use JavaScript to carry out heap-spray overflows [3], so JavaScript will inevitably be embedded in the PDF file. The embedded JavaScript can usually be located through the /OpenAction entry in an object, although it is generally encoded with FlateDecode. In short, when analyzing a PDF exploit you can locate the JavaScript to find the shellcode, or modify the JavaScript yourself, so that the analysis goes smoothly. The full analysis also involves some debugging. I am still learning in this area, so mistakes are inevitable; please bear with me. Knowledge is built up through constant summarizing and accumulation, and I hope this short article helps you a little!
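Since embedded content is generally FlateDecode-encoded, a first step in inspecting it is to decompress the stream data. The sketch below does this with Python's zlib module. It assumes /Length is written as a direct integer in the same dictionary (real files may use an indirect reference instead) and that FlateDecode is the only filter applied; the function name and the sample object are made up for this example.

```python
import re
import zlib

# Capture the /Length value, then skip to the end of the 'stream' keyword line.
STREAM_HEAD = re.compile(rb"/Length\s+(\d+).*?>>\s*stream\r?\n", re.DOTALL)


def inflate_streams(data: bytes) -> list:
    """Return the stream bodies found in data, decompressed where possible."""
    results = []
    pos = 0
    while True:
        m = STREAM_HEAD.search(data, pos)
        if m is None:
            break
        length = int(m.group(1))
        raw = data[m.end():m.end() + length]
        try:
            results.append(zlib.decompress(raw))
        except zlib.error:
            results.append(raw)  # not Flate data, or another filter is in use
        pos = m.end() + length   # continue after this stream's data
    return results


# Hypothetical object built for the example: a compressed content stream.
payload = b"BT /F4 12 Tf (PDF File Format Analysis) Tj ET"
packed = zlib.compress(payload)
sample = (
    b"7 0 obj\n<</Filter /FlateDecode /Length %d>>\nstream\n" % len(packed)
    + packed
    + b"\nendstream\nendobj\n"
)
print(inflate_streams(sample) == [payload])  # True
```

Using /Length to slice the data, rather than searching for the endstream keyword, avoids cutting the stream short when the compressed bytes happen to contain that keyword.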


References:
[1] Yang Daoliang, "Design and Implementation of an Object-Oriented Chinese PDF Reader"
