Integrating PDF and Java Technologies

Since Adobe published its first public PDF reference in 1993, PDF tools and libraries supporting a wide range of languages and platforms have sprung up. Java support for the format, however, has lagged behind. This is strange, because PDF documents are becoming the standard way for enterprise information systems to store and exchange information, and Java technology is particularly well suited to such applications. Yet only recently have Java developers had mature PDF support available.

PDFBox (an open source project under the BSD license) is a pure Java class library that lets developers read and create PDF documents. It provides the following features:

  • Text extraction, including Unicode characters.
  • Simple integration with text search engines such as Jakarta Lucene.
  • Encryption and decryption of PDF documents.
  • Import and export of form data in FDF and XFDF formats.
  • Appending content to an existing PDF document.
  • Splitting a PDF document into multiple documents.
  • Overlaying a PDF document.

PDFBox API

PDFBox is designed to represent a PDF document in an object-oriented manner. The data in a PDF document is a collection of basic objects: arrays, booleans, dictionaries, numbers, strings, and binary streams. PDFBox defines these basic object types in the org.pdfbox.cos package (the COS model). You can use these objects to interact with a PDF document, but doing so requires a thorough understanding of the document's internal structure and its high-level concepts.
For example, pages and fonts are dictionary objects with special attributes. The PDF Reference describes the meaning and type of each of these attributes, but looking them up is tedious. The org.pdfbox.pdmodel package (the PD model) was created for this reason: it is built on the COS model but provides a high-level API for accessing PDF document objects in a familiar way (see Figure 1). Classes such as PDPage and PDFont, which wrap the underlying COS objects, live in this package.

Note that although the PD model provides some excellent functionality, it is still under development. In some cases you may need to drop down to the COS model to reach a specific PDF feature. Every PD model object provides a way to return its corresponding COS model object. In general, then, you work with the PD model and manipulate the underlying COS model directly only where the PD model falls short.

That is PDFBox in general terms; now for some examples. We start by reading an existing PDF document:

    PDDocument document = PDDocument.load( "./test.pdf" );

The statement above parses the specified PDF file and creates its document object in memory. To handle large documents efficiently, PDFBox keeps only the document structure in memory; objects such as images, embedded fonts, and page content are cached in a temporary file.

Note: When you have finished with a PDDocument object, call its close() method to release the resources acquired when it was created.

Text extraction and Lucene integration

This is an information age (an information retrieval age), and applications should support retrieval and indexing whatever medium the information is stored in. Organizing and categorizing information into a searchable form is critical.
This is simple for text and HTML documents, but PDF documents contain a great deal of structure and metadata, and extracting a document's content is far from trivial. The PDF language is similar to PostScript: in both, objects are drawn as vectors at particular positions on the page. For example:

    /Helv 12 Tf
    0 13.0847 Td
    (Hello world) Tj

These instructions set the font to 12-point Helvetica, move to the next line, and then print "Hello world". Such command streams are usually compressed, and the order in which text appears on the screen is not necessarily the order in which the characters appear in the file. As a result, you sometimes cannot extract strings directly from the raw PDF document. PDFBox's mature text extraction algorithm, however, lets developers extract a document's content in the order in which it appears in a reader.

Lucene is a subproject of the Apache Jakarta project and a popular open source search engine library. A developer can use Lucene to build an index and run complex queries over large amounts of text based on that index. Lucene supports retrieval of text content only, so developers must convert other forms of data into text before using it. For example, Microsoft Word and StarOffice documents must be converted to text before they can be added to a Lucene index.

PDF files are no exception, but PDFBox provides a dedicated integration object that makes it easy to include PDF documents in a Lucene index. Converting a PDF document into a basic Lucene document takes only one statement:

    Document doc = LucenePDFDocument.getDocument( file );

This statement parses the specified PDF document, extracts its content, and creates a Lucene Document object. You can then add that object to the Lucene index.
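
To make the content-stream example above concrete, here is a minimal Java sketch, using only the standard library, that scans an uncompressed content stream and collects the literal strings shown by the Tj operator. It is not how PDFBox extracts text: the class and method names are hypothetical, and the scanner ignores escape sequences, nested parentheses, the TJ operator, and font encodings. It only illustrates that strings come out in file order, which need not match their order on the page.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of PDFBox.
final class ContentStreamSketch {

    // Returns the literal strings shown by Tj operators, in file order.
    static List<String> shownStrings(String stream) {
        List<String> result = new ArrayList<>();
        String pending = null; // most recent string operand, if any
        int i = 0;
        while (i < stream.length()) {
            char c = stream.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
            } else if (c == '(') {
                // Literal string: read up to the next ')' (no escapes, no nesting).
                int close = stream.indexOf(')', i + 1);
                if (close < 0) {
                    break; // unterminated string: stop scanning
                }
                pending = stream.substring(i + 1, close);
                i = close + 1;
            } else {
                // Any other token: a name, a number, or an operator.
                int start = i;
                while (i < stream.length()
                        && !Character.isWhitespace(stream.charAt(i))
                        && stream.charAt(i) != '(') {
                    i++;
                }
                String token = stream.substring(start, i);
                if (token.equals("Tj") && pending != null) {
                    result.add(pending);
                }
                pending = null;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String stream = "/Helv 12 Tf 0 13.0847 Td (Hello world) Tj";
        System.out.println(shownStrings(stream)); // prints [Hello world]
    }
}
```

A stream such as "(world) Tj (Hello ) Tj" yields the strings in that same order, even if the positioning operators around them place "Hello " first on the page. That is why a real extractor has to track text positions rather than file order.
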
As noted above, a PDF document also contains metadata such as author information and keywords, which matters when you index it. Table 1 lists the fields that PDFBox populates when it creates the Lucene document.

This integration lets developers easily use Lucene to support retrieval and indexing of PDF documents. Some applications, of course, require more sophisticated text extraction. In that case you can use the PDFTextStripper class directly, or extend it to meet the requirement.

By extending PDFTextStripper and overriding its showCharacter() method, you can control text extraction in many ways. For example, you can use x, y position information to restrict extraction to a specific block of text. You could ignore all text whose y coordinate is greater than a given value, which excludes the document header.

As another example, a set of PDF documents is often created from a form, but the raw data is then lost. The documents contain text you are interested in, and that text sits in a similar position in each one, but the form data that filled the documents is gone. Suppose you have a set of envelopes with name and address information in the same location. You can use a class derived from PDFTextStripper to extract the desired fields, much like a tool that captures a region of the screen.

Encrypting/decrypting PDFs

A popular feature of PDF is that it lets you encrypt a document's contents, control access to it, and restrict what readers can do with it. PDF documents are encrypted with a master password and an optional user password. If a user password is set, the PDF reader (such as Acrobat) prompts for it before displaying the document. The master password authorizes modification of the document's contents.
The PDF specification allows the creator of a PDF document to restrict what a user can do when viewing it in Acrobat Reader. These restrictions include:

  • Printing
  • Modifying content
  • Extracting content

A full discussion of PDF document security is beyond the scope of this article; interested readers can refer to the relevant parts of the PDF specification. The security model for PDF documents is pluggable, so you can use different security handlers when encrypting documents. At the time of writing, PDFBox supports the standard security handler, which is the one most PDF documents use.

To encrypt a document, you must specify a security handler and then encrypt with a master password and a user password. In the following code, the document is encrypted so that the user can open it in Acrobat without entering a password (no user password is set), but cannot print it.

    // Load the document
    PDDocument pdf = PDDocument.load( "test.pdf" );
    // Create the encryption options
    PDStandardEncryption encryptionOptions = new PDStandardEncryption();
    encryptionOptions.setCanPrint( false );
    pdf.setEncryptionDictionary( encryptionOptions );
    // Encrypt the document
    pdf.encrypt( "master", null );
    // Save the encrypted document to the file system
    pdf.save( "test-output.pdf" );

For a more detailed example, see the source code of the encryption tool class included in the PDFBox release: org.pdfbox.Encrypt.

Many applications can generate PDF documents but do not support the security options that control access to them. In that case PDFBox can be used to intercept and encrypt the PDF document before it is sent to the user.

Form integration

When an application's output is a set of values for a series of form fields, it needs a way to save the form as a file. PDF is a good choice here. Developers can write PDF instructions by hand to draw graphics, tables, and text, or save the data as XML and use an XSL-FO template to create the PDF document. However, these approaches are time-consuming, error-prone, and inflexible. For simple forms, a better approach is to create a template and then populate it with the given input data to generate the document.

The Employment Eligibility Verification form, known as the "I-9 form," is one that most people are familiar with. See:

    http://uscis.gov/graphics/formsfee/forms/files/i-9.pdf

You can use an example program in the PDFBox release to list the form's fields:

    java org.pdfbox.examples.fdf.PrintFields i-9.pdf

Another example program inserts text data into a specified field:

    java org.pdfbox.examples.fdf.SetField i-9.pdf NAME1 Smith

Open the PDF document in Acrobat and you will see that the "Last name" field has been filled in. You can also perform the same operation with the following code:

    PDDocument pdf = PDDocument.load( "i-9.pdf" );
    PDDocumentCatalog docCatalog = pdf.getDocumentCatalog();
    PDAcroForm acroForm = docCatalog.getAcroForm();
    PDField field = acroForm.getField( "NAME1" );
    field.setValue( "Smith" );
    pdf.save( "i-9-copy.pdf" );

The following code extracts the value of the form field you just filled in:

    PDField field = acroForm.getField( "NAME1" );
    System.out.println( "Last name=" + field.getValue() );

Acrobat supports importing and exporting form data in a dedicated file format, the Forms Data Format. There are two kinds of such files: FDF and XFDF. An FDF file holds the form data in the same format as PDF, while XFDF holds it in XML. PDFBox handles both FDF and XFDF with a single class: FDFDocument.
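
To give a rough idea of what the XML flavor looks like, the sketch below builds a minimal XFDF document from a map of field names to values using only the Java standard library. It is a hand-rolled illustration, not PDFBox's exporter: the class and method names are hypothetical, and a real XFDF file exported by Acrobat or PDFBox typically carries additional elements (for example, a reference to the source PDF) that are omitted here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; not part of PDFBox.
final class XfdfSketch {

    // Escapes the characters that are special in XML text and attribute values.
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    // Builds a minimal XFDF document: one <field> element per map entry.
    static String toXfdf(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder();
        sb.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        sb.append("<xfdf xmlns=\"http://ns.adobe.com/xfdf/\">\n");
        sb.append("  <fields>\n");
        for (Map.Entry<String, String> e : fields.entrySet()) {
            sb.append("    <field name=\"").append(escape(e.getKey())).append("\">");
            sb.append("<value>").append(escape(e.getValue())).append("</value>");
            sb.append("</field>\n");
        }
        sb.append("  </fields>\n");
        sb.append("</xfdf>\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("NAME1", "Smith"); // the I-9 field used in this article
        System.out.print(toXfdf(fields));
    }
}
```

Because the data is plain XML, it can be produced or consumed by any XML tooling, which is the main practical difference from the PDF-syntax FDF format.
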
The following code fragment demonstrates how to export FDF data from the I-9 form above:

    PDDocument pdf = PDDocument.load( "i-9.pdf" );
    PDDocumentCatalog docCatalog = pdf.getDocumentCatalog();
    PDAcroForm acroForm = docCatalog.getAcroForm();
    FDFDocument fdf = acroForm.exportFDF();
    fdf.save( "exportedData.fdf" );

The steps for integrating forms with PDFBox are:

  1. Create a PDF form template with Acrobat or another visual tool.
  2. Note the name of each required form field.
  3. Store the template where the application can access it.
  4. When a PDF is requested, use PDFBox to parse the template and populate the specified form fields.
  5. Return the filled-in result (a PDF) to the user.

Tools

In addition to the APIs described earlier, PDFBox provides a series of command-line tools. Table 2 lists these tool classes and briefly describes each.

Notes

The PDF specification runs to 1,172 pages, and implementing it is a vast undertaking. Accordingly, the PDFBox release describes itself as "in progress," and new features are added gradually. Its main weakness is creating a PDF document from scratch. However, some other open source Java projects can fill this gap. For example, the Apache FOP project supports creating PDFs from special XML documents that describe the PDF to be generated. In addition, iText provides a high-level API for creating tables and lists.

The next version of PDFBox will support the new PDF 1.5 object streams and cross-reference streams. After that it will add support for embedded fonts and images. With the PDFBox project's efforts, PDF technology can be expected to gain full support in Java applications.
