Integrate PDF and Java Technologies

Since Adobe first published its PDF reference in 1993, PDF tools and libraries for many languages and platforms have sprung up. Support for this Adobe technology in Java application development, however, has lagged behind. This is odd, because PDF is becoming the standard way for enterprise information systems to store and exchange information, and Java is particularly well suited to that kind of application. Even so, Java developers have had to wait a long time for mature, usable PDF support.

PDFBox (an open source project under the BSD license) is a pure Java class library for developers who need to read and create PDF documents. It provides the following features:

  • Extract text, including Unicode characters.
  • Integrate easily with text search engines such as Jakarta Lucene.
  • Encrypt/decrypt PDF files.
  • Import or export form data in FDF and XFDF formats.
  • Add content to an existing PDF file.
  • Split a PDF file into multiple documents.
  • Overlay one PDF document on another.

  
PDFBox API

PDFBox describes a PDF document with an object-oriented model. The data in a PDF document is a collection of basic objects: arrays, booleans, dictionaries, numbers, strings, and binary streams. PDFBox defines these basic object types in the org.pdfbox.cos package (the COS model). You can use these objects to interact with a PDF document, but you first need a thorough understanding of the internal structure and high-level concepts of PDF. For example, pages and fonts are both dictionary objects with special attributes. The PDF Reference describes the meaning and type of each of these attributes, but looking them up is tedious.

This is why the org.pdfbox.pdmodel package (the PD model) exists. It is built on the COS model but provides a high-level API for accessing PDF document objects in a familiar way (see Figure 1). Classes such as PDPage and PDFont, which wrap the underlying COS model, live in this package.

 

Note: although the PD model provides some excellent functionality, it is still under development. In some cases you may need to drop down to the COS model to reach a specific PDF feature. Every PD model object provides a method that returns its underlying COS model object. You will therefore generally work with the PD model, and operate on the COS model directly only where the PD model falls short.
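The layering described in this note can be illustrated with a small sketch. The classes below are hypothetical stand-ins written for this article, not PDFBox classes: a low-level dictionary plays the role of a COS object, and a wrapper plays the role of a PD model class that offers typed accessors while still handing back the underlying object on request. The "Rotate" key is used only as an example of a page attribute.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical high-level wrapper (the "PD model" role in this sketch).
final class PageView {
    private final LowLevelDictionary dictionary;

    PageView(LowLevelDictionary dictionary) {
        this.dictionary = dictionary;
    }

    // Typed accessor: callers need not know the key name or the value type.
    int getRotation() {
        Object value = dictionary.get("Rotate");
        return value instanceof Integer ? ((Integer) value).intValue() : 0;
    }

    void setRotation(int degrees) {
        dictionary.put("Rotate", Integer.valueOf(degrees));
    }

    // Escape hatch: return the underlying low-level object.
    LowLevelDictionary getLowLevelObject() {
        return dictionary;
    }

    public static void main(String[] args) {
        LowLevelDictionary raw = new LowLevelDictionary();
        raw.put("Type", "Page");
        PageView page = new PageView(raw);
        page.setRotation(90);
        // The change made through the wrapper is visible in the raw dictionary.
        System.out.println(page.getLowLevelObject().get("Rotate"));
    }
}

// Hypothetical low-level dictionary (the "COS model" role in this sketch).
final class LowLevelDictionary {
    private final Map<String, Object> entries = new LinkedHashMap<String, Object>();

    void put(String key, Object value) {
        entries.put(key, value);
    }

    Object get(String key) {
        return entries.get(key);
    }
}
```

The wrapper holds no state of its own, so anything it cannot express can still be done on the dictionary it returns.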

The section above gave a general introduction to PDFBox. Now for some examples, starting with how to read an existing PDF file:

  PDDocument document =
      PDDocument.load("./test.pdf");

The statement above parses the specified PDF file and builds its document object in memory. To handle large documents efficiently, PDFBox keeps only the document structure in memory; objects such as images, embedded fonts, and page content are cached in a temporary file.

Note: when you have finished with a PDDocument object, call its close() method to release the resources acquired when it was created.
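Why the close() call matters can be shown with a self-contained sketch. ScratchBackedDocument below is a hypothetical stand-in, not a PDFBox class; it assumes only what the text says, namely that a document object may hold a temporary file that has to be released. The try/finally block guarantees the release even if processing throws.

```java
import java.io.Closeable;
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for a document that caches bulky data in a temp file.
final class ScratchBackedDocument implements Closeable {
    private final File scratchFile;

    ScratchBackedDocument() {
        try {
            scratchFile = File.createTempFile("scratch", ".tmp");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    File getScratchFile() {
        return scratchFile;
    }

    // Releases the temporary file; calling it twice is harmless.
    @Override
    public void close() {
        scratchFile.delete();
    }

    public static void main(String[] args) {
        ScratchBackedDocument document = new ScratchBackedDocument();
        try {
            System.out.println("working with " + document.getScratchFile().getName());
        } finally {
            document.close(); // runs even if the processing above throws
        }
        System.out.println("scratch file exists after close: "
                + document.getScratchFile().exists());
    }
}
```

On Java 7 and later, a try-with-resources statement gives the same guarantee for any Closeable.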

  
Text Extraction and Lucene Integration

This is the age of information retrieval. Whatever medium information is stored in, applications are expected to support searching and indexing it, so organizing and classifying information into a searchable form is critical. That is simple for plain text and HTML documents, but PDF documents contain a great deal of structure and metadata, and extracting their content is by no means trivial. The PDF language is similar to PostScript: in both, objects are drawn as vectors at specific positions on the page. For example:

  /Helv 12 Tf
  0 13.0847 Td
  (Hello World) Tj

These commands set the font to 12-point Helvetica, move to the next line, and print "Hello World". Such command streams are usually compressed, and the order in which text is displayed on screen is not necessarily the order in which the characters appear in the file. As a result, you often cannot pull strings straight out of a raw PDF document. The mature text extraction algorithm in PDFBox, however, lets developers extract the document content as a reader would see it.
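To see both what the operators look like to a program and why direct string extraction is fragile, here is a deliberately naive sketch. It scans an uncompressed content stream for literal strings followed by the `Tj` show-text operator. The class name is made up for this example, and the approach ignores compression, escape sequences, nested parentheses, font encodings, and display order, which is exactly the work a real extractor has to do.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NaiveShowTextScanner {
    // Matches "(literal string) Tj"; strings with escapes or nested
    // parentheses are skipped on purpose to keep the sketch short.
    private static final Pattern SHOW_TEXT =
            Pattern.compile("\\(([^()\\\\]*)\\)\\s*Tj");

    static List<String> extractShownStrings(String contentStream) {
        List<String> result = new ArrayList<String>();
        Matcher matcher = SHOW_TEXT.matcher(contentStream);
        while (matcher.find()) {
            result.add(matcher.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String stream = "/Helv 12 Tf\n0 13.0847 Td\n(Hello World) Tj\n";
        System.out.println(extractShownStrings(stream)); // prints [Hello World]
    }
}
```

The strings come back in file order, which need not match the order a reader sees on the page.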

Lucene is a subproject of the Apache Jakarta project and a popular open source search engine library. Developers use Lucene to build an index and then run complex searches over large amounts of text based on that index. Lucene indexes text only, so other forms of data must be converted to text before Lucene can use them. For example, Microsoft Word and StarOffice documents must be converted to text before they can be added to a Lucene index.

PDF files are no exception, but PDFBox provides a special integration object that makes it easy to include PDF documents in a Lucene index. Converting a basic PDF document to a Lucene document takes a single statement:

  Document doc = LucenePDFDocument.getDocument(file);

This statement parses the specified PDF document, extracts its content, and creates a Lucene Document object, which you can then add to a Lucene index. As mentioned above, PDF documents also carry author information, keywords, and other metadata, which matters when indexing them. Table 1 lists the fields that PDFBox fills in when it creates the Lucene document.

This integration lets developers use Lucene to search and index PDF documents with little effort. Some applications, of course, need more sophisticated text extraction. In that case you can use the PDFTextStripper class directly, or extend it to meet the more complex requirement.

By extending PDFTextStripper and overriding the showCharacter() method, you can control text extraction in many ways. For example, you can use the x and y positions to extract specific blocks of text. You could ignore all text whose y-coordinate is greater than a certain value, so that the content of the page header is excluded.

Here is another example. It is common to have a set of PDF documents that were created from a form, but for which the raw data has been lost. The documents contain the text you are interested in, always in a similar position, but the form data used to fill them in is gone. For example, you might have a set of envelopes with the name and address in the same location on each. In this case you can use a class derived from PDFTextStripper to extract the expected fields, much like a device that captures a region of the screen.
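The position-based filtering described above can be sketched without any PDF library. RegionTextFilter and PositionedText below are hypothetical names invented for this example. In a real stripper subclass the positions would arrive through the extraction callbacks; here they are supplied by hand, with larger y values assumed to be nearer the top of the page. Excluding a header is then just a region whose upper bound lies below the header.

```java
import java.util.Arrays;
import java.util.List;

final class RegionTextFilter {
    // Hypothetical positioned piece of text; not a PDFBox type.
    static final class PositionedText {
        final String text;
        final float x;
        final float y;

        PositionedText(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    private final float minX;
    private final float minY;
    private final float maxX;
    private final float maxY;

    RegionTextFilter(float minX, float minY, float maxX, float maxY) {
        this.minX = minX;
        this.minY = minY;
        this.maxX = maxX;
        this.maxY = maxY;
    }

    boolean contains(PositionedText piece) {
        return piece.x >= minX && piece.x <= maxX
                && piece.y >= minY && piece.y <= maxY;
    }

    // Joins, in input order, the pieces that fall inside the region.
    String extract(List<PositionedText> pieces) {
        StringBuilder result = new StringBuilder();
        for (PositionedText piece : pieces) {
            if (contains(piece)) {
                if (result.length() > 0) {
                    result.append(' ');
                }
                result.append(piece.text);
            }
        }
        return result.toString();
    }

    public static void main(String[] args) {
        List<PositionedText> page = Arrays.asList(
                new PositionedText("ACME Corp", 72f, 760f), // header near the top
                new PositionedText("Jane Doe", 72f, 600f),
                new PositionedText("1 Main St", 72f, 585f));
        RegionTextFilter addressBlock = new RegionTextFilter(50f, 570f, 300f, 610f);
        System.out.println(addressBlock.extract(page)); // prints Jane Doe 1 Main St
    }
}
```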

  
Encryption/Decryption

A popular feature of PDF is that it allows the document content to be encrypted and access to it controlled. The content of an encrypted document cannot be read until it has been decrypted. A PDF document is encrypted with a master password and an optional user password. If a user password is set, a PDF reader (such as Acrobat) prompts for it before displaying the document. The master password authorizes changes to the document content.

The PDF specification lets the creator of a PDF document restrict certain operations when a user views the document in Acrobat Reader. These restrictions include:

  • Print
  • Modify content
  • Extract content
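As a rough illustration of how such restrictions can be represented, the sketch below models them as bit flags. The bit positions (3 for printing, 4 for modifying, 5 for extracting) follow the permission flags of the standard security handler as I read the PDF Reference; treat them as an assumption to check against the specification. The class and method names are invented for this example and are not part of PDFBox.

```java
final class PermissionFlags {
    // Bit positions in the PDF Reference are 1-based: bit n has value 1 << (n - 1).
    static final int PRINT = 1 << 2;   // bit 3
    static final int MODIFY = 1 << 3;  // bit 4
    static final int EXTRACT = 1 << 4; // bit 5

    // Returns the flags with the given permission cleared.
    static int deny(int flags, int permission) {
        return flags & ~permission;
    }

    static boolean allows(int flags, int permission) {
        return (flags & permission) != 0;
    }

    public static void main(String[] args) {
        int flags = PRINT | MODIFY | EXTRACT;
        flags = deny(flags, PRINT);
        System.out.println("print allowed:   " + allows(flags, PRINT));
        System.out.println("modify allowed:  " + allows(flags, MODIFY));
        System.out.println("extract allowed: " + allows(flags, EXTRACT));
    }
}
```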

The details of PDF document security are beyond the scope of this article; if you are interested, refer to the relevant sections of the PDF specification. The PDF security model is pluggable: you can use different security handlers when encrypting a document. As of this writing, PDFBox supports the standard security handler, which is the one most PDF documents use.

To encrypt a document, you first specify a security handler and then encrypt with a master password and a user password. The following code encrypts a document so that it can be opened in Acrobat without entering a password (no user password is set) but cannot be printed.

  // Load the document
  PDDocument pdf =
      PDDocument.load("test.pdf");
  // Create the encryption options
  PDStandardEncryption encryptionOptions =
      new PDStandardEncryption();
  encryptionOptions.setCanPrint(false);
  pdf.setEncryptionDictionary(
      encryptionOptions);
  // Encrypt the document
  pdf.encrypt("master", null);
  // Save the encrypted document
  // to the file system
  pdf.save("test-output.pdf");

For a more detailed example, see the source code of the encryption tool class included in the PDFBox release: org.pdfbox.Encrypt.

Many applications can generate PDF documents but offer no control over the document's security options. In that case, PDFBox can be used to intercept and encrypt a PDF document before it is sent to the user.

  
Form Integration

When the output of an application is a set of form field values, it is often useful to save the form to a file, and PDF is a good choice for this. Developers can write PDF commands by hand to draw graphics, tables, and text, or save the data as XML and use an XSL-FO template to create the PDF document. However, these approaches are time-consuming, error-prone, and inflexible. For a simple form, a better way is to create a template and then fill it in with the given input data to generate the document.

Employment Eligibility Verification is a form most people are familiar with; it is also known as the "I-9 form". See: http://uscis.gov/graphics/formsfee/forms/files/i-9.pdf

You can use one of the example programs in the PDFBox release to list the names of the form fields:

  java org.pdfbox.examples.fdf.PrintFields i-9.pdf

Another example program inserts text data into a specified field:

  java org.pdfbox.examples.fdf.SetField i-9.pdf name1 Smith

Open the PDF file in Acrobat and you will see that the "Last Name" field has been filled in. You can also perform the same operation in code:

  PDDocument pdf =
      PDDocument.load("i-9.pdf");
  PDDocumentCatalog docCatalog =
      pdf.getDocumentCatalog();
  PDAcroForm acroForm =
      docCatalog.getAcroForm();
  PDField field =
      acroForm.getField("name1");
  field.setValue("Smith");
  pdf.save("i-9-copy.pdf");

The following code extracts the value of the form field that was just filled in:

  PDField field =
      acroForm.getField("name1");
  System.out.println(
      "Last name = " + field.getValue());

Acrobat can import and export form data in a dedicated file format, the Forms Data Format. There are two kinds of such files: FDF and XFDF. An FDF file stores form data in the same format a PDF file uses, whereas an XFDF file stores it as XML. PDFBox handles both FDF and XFDF with a single class: FDFDocument. The following code snippet shows how to export FDF data from the I-9 form above:

  PDDocument pdf =
      PDDocument.load("i-9.pdf");
  PDDocumentCatalog docCatalog =
      pdf.getDocumentCatalog();
  PDAcroForm acroForm =
      docCatalog.getAcroForm();
  FDFDocument fdf = acroForm.exportFDF();
  fdf.save("exportedData.fdf");
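To make the XML flavor more concrete, the sketch below writes a minimal XFDF document by hand for a map of field names and values. This is not what FDFDocument produces and it does not use PDFBox at all. The element names and namespace follow the XFDF format as I understand it, so check them against the XFDF specification before relying on them; the class name is invented for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class XfdfSketch {
    // Escapes the characters that are special in XML text and attributes.
    static String escape(String text) {
        return text.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    static String toXfdf(Map<String, String> fields) {
        StringBuilder xml = new StringBuilder();
        xml.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        xml.append("<xfdf xmlns=\"http://ns.adobe.com/xfdf/\" xml:space=\"preserve\">\n");
        xml.append("  <fields>\n");
        for (Map.Entry<String, String> entry : fields.entrySet()) {
            xml.append("    <field name=\"").append(escape(entry.getKey()))
                    .append("\"><value>").append(escape(entry.getValue()))
                    .append("</value></field>\n");
        }
        xml.append("  </fields>\n");
        xml.append("</xfdf>\n");
        return xml.toString();
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<String, String>();
        fields.put("name1", "Smith");
        System.out.print(toXfdf(fields));
    }
}
```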

PDFBox form integration steps:

  1. Create a PDF form template using Acrobat or another visual tool
  2. Note the name of each required form field
  3. Store the template in a place the application can access
  4. When a PDF file is requested, use PDFBox to parse the PDF template
  5. Fill in the specified form fields
  6. Return the filled-in PDF to the user
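The steps above can be sketched as a small in-memory model. FormTemplateSketch is a hypothetical class written for this article: it stands in for the parsed template and keeps only the field names, so it shows the control flow of steps 2, 4, and 5 (known field names, a fresh copy per request, rejection of unknown fields) rather than any real PDF handling.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

final class FormTemplateSketch {
    // Step 2: the field names noted when the template was designed.
    private final Set<String> fieldNames;

    FormTemplateSketch(String... fieldNames) {
        this.fieldNames = new LinkedHashSet<String>(Arrays.asList(fieldNames));
    }

    // Steps 4 and 5: start from an empty copy of the template, then fill it.
    Map<String, String> fill(Map<String, String> input) {
        Map<String, String> filled = new LinkedHashMap<String, String>();
        for (String name : fieldNames) {
            filled.put(name, "");
        }
        for (Map.Entry<String, String> entry : input.entrySet()) {
            if (!fieldNames.contains(entry.getKey())) {
                throw new IllegalArgumentException("Unknown field: " + entry.getKey());
            }
            filled.put(entry.getKey(), entry.getValue());
        }
        return filled;
    }

    public static void main(String[] args) {
        FormTemplateSketch template = new FormTemplateSketch("name1", "name2");
        Map<String, String> input = new LinkedHashMap<String, String>();
        input.put("name1", "Smith");
        System.out.println(template.fill(input));
    }
}
```

Failing fast on an unknown field name catches a mismatch between the application and the template early, which is the reason for recording the names in step 2.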

  
Tools

In addition to the APIs described above, PDFBox also provides a set of command-line tools. Table 2 lists and briefly describes these tool classes.

 

  
Remarks

The PDF specification runs to 1172 pages, and implementing it is a major undertaking. Accordingly, the PDFBox release is described as a "work in progress", and new features will be added gradually. Its main weakness is creating PDF documents from scratch. However, other open source Java projects can fill this gap. For example, the Apache FOP project generates PDF files from special XML documents that describe the PDF to be produced. In addition, iText provides a high-level API for creating tables and lists.

The next version of PDFBox will support the new object streams and cross-reference streams introduced in PDF 1.5. Support for embedding fonts and images will follow. With the efforts of the PDFBox project, full support for PDF technology in Java applications is within reach.
