PDF (Portable Document Format) is an electronic file format developed by Adobe. The format is independent of the operating system platform: PDF files work the same way on Windows, UNIX, and Apple's Mac OS. This makes it an ideal format for distributing electronic documents and disseminating digital information on the Internet. More and more e-books, product descriptions, company announcements, online materials, and emails use PDF files, and PDF has become an industry standard for digital information.
PDF files use industry-standard compression algorithms, so they are generally smaller than PostScript files and easy to transmit and store. The format is also page independent: a PDF file contains one or more "pages" that can be processed separately, which suits multi-processor systems especially well. In addition, a PDF file records the PDF version it uses and the location of some important structures in the file. Thanks to these advantages, PDF has gradually become a favorite of the publishing industry.
For ordinary readers, e-books made in PDF have the look and reading feel of paper books: they reproduce the appearance of the original book faithfully, and the display size can be adjusted freely, so each reader can read in the way that suits them. PDF files can be read regardless of the operating system's language, fonts, and display device. These advantages let readers adapt quickly to electronic and online reading, which helps make computers and networks part of daily life. With PDF technology at its core, Adobe provides a complete set of electronic and online publishing solutions, including the commercial software Acrobat for generating and reading PDF files and Illustrator for editing and producing PDF files. Adobe also provides font packs for reading and printing Asian text, which are required for Chinese and Japanese characters.
The simplest structure of a PDF file is as follows:
%PDF-1.4 %(two bytes with values >= 128)
1 0 obj <</Type /Catalog
/Pages 2 0 R
>> endobj
2 0 obj <</Type /Pages
/Kids [3 0 R]
/Count 1
>> endobj
3 0 obj <</Type /Page
/Parent 2 0 R
/MediaBox [0 0 300 200]
/Contents 4 0 R
/Resources <</ProcSet [/PDF]>>
>> endobj
4 0 obj <</Length 0>>
stream
endstream
endobj
xref
0 5
0000000000 65535 f
0000000014 00000 n
0000000071 00000 n
0000000146 00000 n
0000000297 00000 n
trailer <</Size 5
/Root 1 0 R
>>
startxref
350
%%EOF
The first line is the PDF file header. It gives the version of the PDF format that the file uses, so other PDF files may show a version other than 1.4; when this was written, the latest version was 1.7. What follows the version may also differ from this example. Here it is actually two characters whose codes are greater than or equal to 128, which tell other applications that the PDF contains binary data.
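As a rough illustration of the header described above, the following Python sketch reads the version from the first line and checks whether bytes of 128 or more appear near the top of the file. The function name parse_pdf_header and the sample bytes \xe2\xe3 are assumptions made for this example, not values taken from the file above.

```python
import re


def parse_pdf_header(data: bytes):
    """Return (version, has_binary_marker) for the start of a PDF file.

    Hypothetical helper: it only looks at the first two lines.
    """
    match = re.match(rb"%PDF-(\d+)\.(\d+)", data)
    if match is None:
        raise ValueError("data does not start with a PDF header")
    version = f"{int(match.group(1))}.{int(match.group(2))}"
    # The binary marker normally sits on the header line or the line after it.
    first_lines = data.splitlines()[:2]
    has_binary_marker = any(byte >= 128 for line in first_lines for byte in line)
    return version, has_binary_marker


# Sample bytes are made up for illustration.
sample = b"%PDF-1.4 %\xe2\xe3\r\n1 0 obj\r\n"
print(parse_pdf_header(sample))  # ('1.4', True)
```

A file whose first lines are pure ASCII would yield False for the second value; that is legal, but such a file may be mistaken for text by transfer tools.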
Next come the objects numbered 1, 2, 3, and 4, each introduced by the keyword obj. A PDF file is built from objects. In "1 0 obj", the first number is the object number and the 0 after it is the generation number. The generation number is normally 0 and becomes non-zero only after the file has been modified. What follows, enclosed in << and >>, is a dictionary. Every object above is a dictionary, and a dictionary consists of key-value pairs. For example, /Type gives the type of the object, and the values /Pages and /Page denote a page tree and a page respectively. Pages in a PDF are organized as a tree, but the shape of this tree has nothing to do with the structure of the document's content, and the way pages are grouped into subtrees does not change their order. A page tree may have child page trees, but the pages that actually carry content can only be leaf nodes.
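To make the page tree idea concrete, here is a small Python sketch that models a page tree as nested dictionaries and collects its leaf pages in order. It is only a model of the structure, not a PDF parser; the "Name" key is a hypothetical label added so the pages can be told apart.

```python
def leaf_pages(node):
    """Return the leaf page nodes under `node`, in depth-first order."""
    if node["Type"] == "Page":
        return [node]
    pages = []
    for kid in node["Kids"]:
        pages.extend(leaf_pages(kid))
    return pages


# A page tree with one direct page and one child page tree.
tree = {
    "Type": "Pages",
    "Kids": [
        {"Type": "Page", "Name": "p1"},
        {
            "Type": "Pages",
            "Kids": [
                {"Type": "Page", "Name": "p2"},
                {"Type": "Page", "Name": "p3"},
            ],
        },
    ],
}

print([page["Name"] for page in leaf_pages(tree)])  # ['p1', 'p2', 'p3']
```

Flattening the same three pages into a single /Kids array would give the same order, which is why the grouping into subtrees does not matter to the reader of the document.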
Note: the objects mentioned above are all indirect objects, that is, objects that have been given a number. To reference such an object, write its object number and generation number followed by an R, as in 2 0 R.
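The notation for definitions ("1 0 obj") and references ("2 0 R") can be picked out with two regular expressions. This is a deliberately naive sketch: it scans raw bytes and would be fooled by the same patterns appearing inside strings or streams, which a real parser must skip. The names OBJ_DEF, OBJ_REF, and find_pairs are assumptions for this example.

```python
import re

# "N G obj" introduces an indirect object; "N G R" refers to one.
OBJ_DEF = re.compile(rb"(\d+)\s+(\d+)\s+obj\b")
OBJ_REF = re.compile(rb"(\d+)\s+(\d+)\s+R\b")


def find_pairs(pattern, data: bytes):
    """Return (object number, generation number) for every match."""
    return [(int(m.group(1)), int(m.group(2))) for m in pattern.finditer(data)]


sample = b"1 0 obj\n<</Type /Catalog\n/Pages 2 0 R\n>>\nendobj\n"
print(find_pairs(OBJ_DEF, sample))  # [(1, 0)]
print(find_pairs(OBJ_REF, sample))  # [(2, 0)]
```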
The objects in this PDF are explained below in order.
Object 1 is the catalog object of the whole PDF file. Its /Type value is /Catalog, and its /Pages value refers to the page tree object.
Object 2 is the page tree object. Its /Kids value is an array, enclosed in [], that lists the subtrees and leaf objects of the tree in order; this file has only one leaf object. The /Count value is the total number of leaf objects in the page tree, so this PDF has exactly one page.
Object 3 is a leaf object (a page object) whose /Type value is /Page; it represents one page of the PDF. /Parent refers to the page tree object that contains this page, /MediaBox gives the size of the page in points (72 points = 1 inch = 2.54 cm), /Contents refers to the content object of the page, and the /Resources value is a dictionary that identifies the resources used on the page.
Object 4 is the content object of the page. It is a stream object: its dictionary is followed by a binary stream enclosed between the keywords stream and endstream, also called a content stream. In this example the stream is empty, which means the page has no content. To display something on the page, add the corresponding operators here (for the specific operators, see the PDF Reference). The /Length value in the dictionary gives the length of the binary stream in bytes, not counting the line break after stream or the one before endstream.
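The unit conversion for /MediaBox is simple arithmetic. The sketch below assumes a box written as four numbers, [lower-left x, lower-left y, upper-right x, upper-right y], and uses [0 0 300 200] as an example value; the function name mediabox_size is hypothetical.

```python
POINTS_PER_INCH = 72
CM_PER_INCH = 2.54


def mediabox_size(box):
    """Convert a MediaBox given in points to inches and centimetres."""
    llx, lly, urx, ury = box
    width_pt = urx - llx
    height_pt = ury - lly
    return {
        "width_pt": width_pt,
        "height_pt": height_pt,
        "width_in": width_pt / POINTS_PER_INCH,
        "height_in": height_pt / POINTS_PER_INCH,
        "width_cm": width_pt / POINTS_PER_INCH * CM_PER_INCH,
        "height_cm": height_pt / POINTS_PER_INCH * CM_PER_INCH,
    }


size = mediabox_size([0, 0, 300, 200])  # example box, an assumption
print(f"{size['width_cm']:.2f} x {size['height_cm']:.2f} cm")  # 10.58 x 7.06 cm
```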
Everything from the xref keyword after the endobj of object 4 to the end of the file is the cross-reference table and the file trailer. Looking up an object in a PDF file depends on both. The file trailer runs from the trailer keyword to the end of the file. The trailer is a dictionary: its /Size value is the number of objects in the PDF (counting the built-in object 0), and its /Root value is an indirect reference to the catalog object of the file (the object whose /Type is /Catalog). The startxref that follows means what its name suggests: the number after it is the byte offset of the last cross-reference table, that is, the position of the character x of xref, counted from 0 at the start of the file. An application generally begins reading a PDF file from startxref at the end of the file, and the final %%EOF marks the end of the file.
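Reading from the end of the file can be sketched in a few lines of Python. The helper below, read_startxref (a name chosen for this example), finds the last startxref keyword and returns the number after it; the tiny sample file is made up so that the offset is easy to check by hand.

```python
import re


def read_startxref(data: bytes) -> int:
    """Return the byte offset stored after the last 'startxref' keyword."""
    position = data.rfind(b"startxref")
    if position < 0:
        raise ValueError("no startxref keyword found")
    match = re.match(rb"startxref\s+(\d+)", data[position:])
    if match is None:
        raise ValueError("startxref is not followed by a number")
    return int(match.group(1))


# "%PDF-1.4\n" is 9 bytes long, so the x of xref sits at offset 9.
sample = (
    b"%PDF-1.4\n"
    b"xref\n0 1\n0000000000 65535 f \n"
    b"trailer\n<</Size 1>>\n"
    b"startxref\n9\n%%EOF\n"
)
offset = read_startxref(sample)
print(offset)                      # 9
print(sample[offset:offset + 4])   # b'xref'
```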
The cross-reference table is the most important part. After the xref line it generally has /Size plus one rows: one header row followed by one row per object. The header row contains two numbers. The first is the object number at which the table starts (0 in this file; object 0 is built into every PDF and has no special effect), and the second is the number of objects in the table, here 5 (which again includes object 0). Each of the next five rows describes one object. The first field in a row is a ten-digit non-negative integer giving the byte offset at which the object starts. The second is a five-digit non-negative integer giving the generation number (at most 65535). The last is a single character, f or n, where f marks a free entry (the object is not in use) and n marks an object that is in use. The first row, for object 0, is special: its offset is 0, its generation number is 65535, and its last field is f. This is fixed by the PDF format and cannot be changed. When a reader loads objects, it uses the cross-reference table to find where each object starts before parsing it. If the cross-reference table is damaged, applications such as Adobe Acrobat Reader may be unable to open the PDF file.
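The row layout described above maps directly onto a small parser. The following Python sketch handles only the simple case shown in this article, a table with a single header row; parse_xref_table is a hypothetical name, and the sample text copies the table from the listing.

```python
def parse_xref_table(text: str):
    """Parse a single-section xref table.

    Returns {object number: (offset, generation, in_use)}.
    """
    lines = [line.strip() for line in text.strip().splitlines()]
    if lines[0] != "xref":
        raise ValueError("table must start with the xref keyword")
    first_number, count = (int(field) for field in lines[1].split())
    entries = {}
    for number, line in enumerate(lines[2:2 + count], start=first_number):
        offset, generation, kind = line.split()
        entries[number] = (int(offset), int(generation), kind == "n")
    return entries


table = """xref
0 5
0000000000 65535 f
0000000014 00000 n
0000000071 00000 n
0000000146 00000 n
0000000297 00000 n
"""
entries = parse_xref_table(table)
print(entries[0])  # (0, 65535, False)
print(entries[4])  # (297, 0, True)
```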
Finally, if you save the listing above as a PDF and open it with Acrobat, you may be disappointed. When I copied the text out of UltraEdit, some of the line-break characters changed, so the positions of the objects shifted while the values in the cross-reference table did not, and naturally the file would not open. To fix this, re-open the saved PDF file in UltraEdit and switch to the binary view (do not save from the text view). Check the position of each object by placing the cursor on the character just before the object number; the UltraEdit status bar shows the cursor position in both hexadecimal and decimal. Enter the decimal position into the corresponding field of the cross-reference table. It is best to use overwrite mode rather than deleting and retyping; the Insert key, to the right of Backspace, switches between the two modes. Fill in the positions of the four objects (the number after startxref must likewise match the position of xref), save in binary mode, and open the file with Acrobat Reader. If it opens without reporting an error, you have succeeded. All we see is a blank page, but it is the first PDF file we have made by hand.
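The manual offset bookkeeping described above can also be left to a short script. The Python sketch below assembles the same four objects, records where each one starts, and writes those positions into the cross-reference table and startxref. It is one possible approach rather than the method used in this article; the function name build_minimal_pdf, the MediaBox value, the use of \n line endings, and the four marker bytes in the header comment are all assumptions of this example.

```python
def build_minimal_pdf() -> bytes:
    """Build a one-page blank PDF with offsets computed automatically."""
    objects = [
        b"<</Type /Catalog\n/Pages 2 0 R\n>>",
        b"<</Type /Pages\n/Kids [3 0 R]\n/Count 1\n>>",
        b"<</Type /Page\n/Parent 2 0 R\n/MediaBox [0 0 300 200]\n"
        b"/Contents 4 0 R\n/Resources <</ProcSet [/PDF]>>\n>>",
        b"<</Length 0>>\nstream\n\nendstream",
    ]
    # Header plus a comment line of bytes >= 128 (arbitrary choice).
    out = bytearray(b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte position where this object starts
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"
    xref_position = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # fixed entry for object 0
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset  # each entry is 20 bytes
    out += b"trailer\n<</Size %d\n/Root 1 0 R\n>>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_position
    return bytes(out)


pdf = build_minimal_pdf()
print(len(pdf), "bytes")
```

Writing the returned bytes to a file opened in binary mode ("wb") avoids the line-break translation that caused the offsets to drift in the first place.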