Author: Cryin
Link: http://hi.baidu.com/justear/blog
Overview
To parse PDF files, you must first become familiar with the PDF file format. The official PDF documentation appears to be available only in English, and there is no way around that. If you are confident in your English, see [1]; otherwise you are limited to the few related materials written in Chinese. Once you are familiar with the format, how do you parse a file? My current method is to search the PDF file for keyword fields. The drawback is that content inside the stream of an obj object cannot be searched this way. In addition, some PDF exploit files use obfuscation, and I have no good way to parse such files yet. For example:
%PDF-1.5
1 0 obj
</#54#79P#65 R 0 5 O#70e#6e#41c#74i#6fn 3 Pages C#61ta#6c#6f#67>
endobj
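Obfuscation of this kind relies on the #xx escape that the PDF format allows inside names, where xx is the two-digit hexadecimal code of a byte. The following is only a minimal Python sketch of undoing that escape before searching for keywords. The function name decode_pdf_name and the sample name are hypothetical, not part of my tool.

```python
import re

# A '#' followed by exactly two hex digits is the PDF name escape form.
_HEX_ESCAPE = re.compile(rb"#([0-9A-Fa-f]{2})")

def decode_pdf_name(raw: bytes) -> bytes:
    """Replace every #xx escape with the single byte it encodes."""
    return _HEX_ESCAPE.sub(lambda m: bytes([int(m.group(1), 16)]), raw)

# Hypothetical obfuscated name used only for illustration.
print(decode_pdf_name(b"/O#70e#6e#41c#74i#6fn"))  # b'/OpenAction'
```

The sketch is meant for names only. Running it over a whole file would also rewrite any #xx sequences that happen to occur inside binary stream data, so a decoded copy should be used for searching and nothing else.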
Keywords
Here we consider typical malicious PDF files. The analysis mainly finds and parses the following keyword fields (fields I personally consider irrelevant to vulnerabilities are left out):
· obj
· endobj
· stream
· endstream
· xref
· trailer
· startxref
· /Page
· /Encrypt
· /ObjStm
· /JS
· /JavaScript
· /AA
· /OpenAction
· /AcroForm
· /URI
· /Filter
· /JBIG2Decode
· /RichMedia
· /Launch
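Counting these keywords can be sketched roughly as follows. This is an illustrative Python sketch, not my actual tool: the function name count_keywords is hypothetical, /AcroForm is assumed to be the PDF name of the form keyword, and the word-boundary rules are a simplification. Names hidden with #xx escapes or stored inside compressed streams are not found.

```python
import re

KEYWORDS = [
    "obj", "endobj", "stream", "endstream", "xref", "trailer", "startxref",
    "/Page", "/Encrypt", "/ObjStm", "/JS", "/JavaScript", "/AA",
    "/OpenAction", "/AcroForm", "/URI", "/Filter", "/JBIG2Decode",
    "/RichMedia", "/Launch",
]

def count_keywords(data: bytes) -> dict:
    """Count how often each keyword occurs in the raw bytes of a PDF file."""
    counts = {}
    for kw in KEYWORDS:
        # Bare keywords must not be preceded by a letter, so "obj" is not
        # counted inside "endobj". Names start with "/" and may directly
        # follow another token, so they get no such check.
        prefix = rb"" if kw.startswith("/") else rb"(?<![A-Za-z])"
        # Nothing alphanumeric may follow, so "/Page" is not counted
        # inside "/Pages".
        pattern = prefix + re.escape(kw.encode()) + rb"(?![A-Za-z0-9])"
        counts[kw] = len(re.findall(pattern, data))
    return counts
```

A file read with open(path, "rb").read() can be passed in directly, and the returned dictionary maps each keyword to its count.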
Analysis ideas
Almost every PDF file contains the first seven fields, although some may not contain stream or endstream. Reportedly a few PDF files have no xref or trailer, but this is rare. The absence of the xref or trailer keyword does not by itself indicate that a PDF file is malicious.
The xref cross-reference table records, for each indirect object, its object number, generation number, and absolute byte offset in the file. The first entry in the table must be object 0 with generation number 65535. Of the two numbers that follow the xref keyword, the first is the number of the first indirect object in the table (that is, object 0) and the second is the number of entries in the table.
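To make the table layout concrete, here is a minimal Python sketch that reads the first subsection of a classic xref table. The function name parse_xref is hypothetical, and the sketch assumes the usual layout of a 10-digit offset, a 5-digit generation number, and an n (in use) or f (free) flag per entry. Cross-reference streams, which PDF 1.5 and later may use instead of a table, are not handled.

```python
import re

# Header: the xref keyword (not the tail of "startxref"), then the first
# object number and the number of entries.
_XREF_HEADER = re.compile(rb"(?<![A-Za-z])xref\s+(\d+)\s+(\d+)\s+")
# Entry: 10-digit byte offset, 5-digit generation number, n or f flag.
_XREF_ENTRY = re.compile(rb"(\d{10}) (\d{5}) ([nf])\s*")

def parse_xref(data: bytes):
    """Return (first_object_number, entries) or None if no table is found.

    Each entry is a tuple (byte_offset, generation_number, flag).
    """
    header = _XREF_HEADER.search(data)
    if header is None:
        return None
    first_obj, count = int(header.group(1)), int(header.group(2))
    entries = []
    pos = header.end()
    for _ in range(count):
        entry = _XREF_ENTRY.match(data, pos)
        if entry is None:
            break  # malformed or truncated table
        entries.append((int(entry.group(1)), int(entry.group(2)),
                        entry.group(3).decode()))
        pos = entry.end()
    return first_obj, entries
```

For a well-formed file the first entry returned should be (0, 65535, 'f'), which matches the rule for object 0 described above.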
/Page gives the number of pages in the PDF file. Most malicious PDF files have only one page.
/Encrypt indicates that the PDF file is protected by DRM or requires a password.
/ObjStm counts the object streams. An object stream is a stream object that can contain other objects.
/JS and /JavaScript indicate that the PDF file contains JavaScript code. Almost all of the malicious PDF files I have seen embed JavaScript, usually either to exploit a vulnerability in the JavaScript engine or to perform a heap spray. Note, however, that JavaScript is also found in many benign PDF files.
/AA, /OpenAction, and /AcroForm indicate that an action is executed automatically when the PDF file or one of its pages is viewed. Almost all malicious PDF files with embedded JavaScript use such an action to run the JavaScript automatically. If a PDF file contains /AA or /OpenAction and also contains JavaScript code, it is likely to be malicious.
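The rule in the last sentence can be written down as a short check. This is a minimal Python sketch under the assumption that the keywords appear unobfuscated in the raw file bytes. The names looks_suspicious and _has are hypothetical, and a True result only means the file deserves a closer look.

```python
import re

def _has(name: bytes, data: bytes) -> bool:
    # The name must not be followed by a letter or digit, so that a longer
    # name starting with the same characters is not mistaken for it.
    pattern = re.escape(name) + rb"(?![A-Za-z0-9])"
    return re.search(pattern, data) is not None

def looks_suspicious(data: bytes) -> bool:
    """True if the file has JavaScript and an automatic action keyword."""
    has_script = _has(b"/JS", data) or _has(b"/JavaScript", data)
    has_auto_action = _has(b"/AA", data) or _has(b"/OpenAction", data)
    return has_script and has_auto_action
```

Because many benign PDF files also contain JavaScript, both conditions are required, which keeps the check in line with the text above.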
/URI is present when the PDF file opens a web page.
/Filter is usually FlateDecode, meaning the stream is compressed with the zlib (deflate) algorithm. For details, see [2].
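Since FlateDecode is zlib compression, the content of such streams can be inflated and then searched for keywords as well. The following is a rough Python sketch using the standard zlib module. The function name inflate_streams is hypothetical, and the sketch simply tries to inflate every stream and skips those that fail, instead of reading each object's /Filter entry. Streams with other filters or with several chained filters are not handled.

```python
import re
import zlib

# Raw bytes between the "stream" line and the "endstream" keyword.
_STREAM = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)

def inflate_streams(data: bytes):
    """Yield the inflated content of every stream that zlib can decompress."""
    for match in _STREAM.finditer(data):
        try:
            # decompressobj tolerates the end-of-line bytes that sit between
            # the compressed data and the endstream keyword.
            content = zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            continue  # not zlib data, or damaged; skipped in this sketch
        yield content
```

Each yielded block can then be searched for /JS, /OpenAction, and the other keywords in the same way as the raw file.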
/JBIG2Decode indicates that the PDF file uses JBIG2 compression. JBIG2 decoding itself has had a vulnerability (CVE-2009-0658), but the /JBIG2Decode keyword alone does not show that a PDF file is suspicious.
/RichMedia indicates embedded Flash content.
/Launch counts launch actions.
The final step is to check whether the objects in the PDF file conform to Adobe's PDF file format specification. Combined with the keyword fields described above, this determines whether the PDF file may be malicious.
Conclusion
With the approach above, my tests so far show fairly good accuracy in detecting malicious PDF files, but detection is not fully reliable, especially for PDF files that have been obfuscated or specially processed, for which I have no good solution yet. If you have good ideas, please feel free to contact me. There are certainly better ideas and methods for detecting malicious PDF files more accurately. I am very curious about how anti-virus software does it. Maybe one day I will try working at an anti-virus company!
References
[1] http://www.adobe.com/devnet/pdf/pdf_reference.html
[2] http://www.zlib.com/