Using PDFBox: extracting PDF text by page

Source: Internet
Author: User

Requirement: extract the text of a PDF, page by page, using Java.

PDFBox is a good open source tool for meeting this requirement.

1. PDF Document Structure

To parse the text of a PDF, we first need to understand the structure of a PDF file.

The most important points about PDF documents are:

First, the content of a PDF document is varied and complex: plain text (which can be extracted, and which the "copy" function of a PDF viewer can select), images (which the "copy" function cannot capture), forms, video, audio, and so on.

Second, a PDF file mixes binary streams with plain text, and its text is not stored in a standard character encoding such as Unicode. Character codes are mapped through code tables (CMaps) defined by Adobe, which makes processing a PDF more difficult.

Third, a PDF has its own file structure: a file header, a collection of objects, a cross-reference table, and a trailer (strictly speaking, this is the physical structure of a PDF document; there is also a logical structure, which is covered in a separate blog post).
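The physical structure described above can be spot-checked without any PDF library. The following is only a minimal sketch: the class and method names (PdfStructureCheck, hasHeader, hasTrailer) are made up for illustration, the 1024-byte window is an arbitrary choice, and the sample bytes in main are not a valid PDF, just a string carrying the same markers. It does not parse objects or the cross-reference table; it only looks for the "%PDF-" header at the start and for "startxref" followed by "%%EOF" near the end.

```java
import java.nio.charset.StandardCharsets;

/** Rough check for the physical-structure markers of a PDF file (illustrative only). */
final class PdfStructureCheck {

    /** Returns true if the data starts with the "%PDF-" file header. */
    static boolean hasHeader(byte[] data) {
        byte[] magic = "%PDF-".getBytes(StandardCharsets.US_ASCII);
        if (data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    /** Returns true if the last 1024 bytes contain "startxref" followed by "%%EOF". */
    static boolean hasTrailer(byte[] data) {
        int from = Math.max(0, data.length - 1024); // window size is an assumption
        // ISO-8859-1 maps every byte to one char, so binary content cannot break decoding
        String tail = new String(data, from, data.length - from, StandardCharsets.ISO_8859_1);
        int xref = tail.lastIndexOf("startxref");
        return xref >= 0 && tail.indexOf("%%EOF", xref) >= 0;
    }

    public static void main(String[] args) {
        // Hypothetical sample: carries the markers but is NOT a valid PDF
        byte[] sample = ("%PDF-1.4\n1 0 obj\n<< >>\nendobj\n"
                + "xref\n0 1\n0000000000 65535 f \n"
                + "trailer\n<< /Size 1 >>\nstartxref\n30\n%%EOF\n")
                .getBytes(StandardCharsets.US_ASCII);
        System.out.println("header found:  " + hasHeader(sample));
        System.out.println("trailer found: " + hasTrailer(sample));
    }
}
```

A check like this can be a cheap sanity test before handing a file to PDFBox, but it says nothing about whether the file is actually well formed.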

2. What is PDFBox?
    • License: Apache License
    • Development language: Java
    • Operating systems: cross-platform
    • Official website: http://pdfbox.apache.org/

3. What can PDFBox do?
    • Extracting text from a PDF
    • Merging PDF documents
    • Encrypting and decrypting PDF documents
    • Integration with the Lucene search engine
    • Populating PDF/XFDF form data
    • Creating a PDF document from a text file
    • Creating an image from a PDF page
    • Printing PDF documents

4. Preparation

Again, this demo extracts text from a PDF (so far it has only been tested with English text; Chinese has not been verified).

1) Download the required JAR files (3 of them):

a. fontbox-2.0.0-rc2.jar

b. pdfbox-2.0.0-rc2.jar

c. pdfbox-app-2.0.0-rc2.jar

These can be downloaded from the official website (pay attention to the version).

2) MyEclipse or Eclipse.

5. Start programming

Create a new project and write the following source code:

package com.primeton.pdfbox;

import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

/**
 * Extracts the text of a PDF with PDFBox.
 *
 * @author Mrchen
 */
public class PdfReader {

    /**
     * @param args not used
     */
    public static void main(String[] args) {
        PdfReader pdfReader = new PdfReader();
        try {
            // Extract the contents of androidstudio.pdf on drive E:
            System.out.println("Start extraction");
            File file = new File("E:\\androidstudio.pdf");
            System.out.println("File absolute path is: " + file.getAbsolutePath());
            pdfReader.readFdf(file);
            System.out.println("Extract end");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public void readFdf(File pdfFile) throws Exception {
        // Whether to sort the text by position
        boolean sort = false;
        // Name of the output text file
        String textFileName;
        // Encoding of the output text file
        String encoding = "UTF-8";
        // First page to extract
        int startPage = 1;
        // Last page to extract
        int endPage = 3;
        // Writer for the generated text file
        Writer output = null;
        // PDF document held in memory
        PDDocument document = null;
        File outputFile;
        try {
            // Load the document from a local file.
            // Note: in version 2.0 the parameter is a File.
            System.out.println("Start loading file " + pdfFile.getName());
            document = PDDocument.load(pdfFile);
            if (pdfFile.getName().length() > 4) {
                // Drop the last four characters (".pdf") and append ".txt"
                textFileName = pdfFile.getName().substring(0, pdfFile.getName().length() - 4) + ".txt";
            } else {
                // Name too short to carry a ".pdf" suffix: just append ".txt"
                textFileName = pdfFile.getName() + ".txt";
            }
            outputFile = new File(pdfFile.getParent(), textFileName);
            System.out.println("The new file absolute path is: " + outputFile.getAbsolutePath());
            System.out.println("Load file end");

            System.out.println("Start writing to TXT file");
            // Open the output text file for writing
            output = new OutputStreamWriter(new FileOutputStream(outputFile), encoding);
            // PDFTextStripper extracts the text
            PDFTextStripper stripper = new PDFTextStripper();
            // Set whether to sort
            stripper.setSortByPosition(sort);
            // Set the start page
            stripper.setStartPage(startPage);
            // Set the end page
            stripper.setEndPage(endPage);
            // Call PDFTextStripper's writeText to extract the text and write it out
            System.out.println("Start calling writeText method");
            stripper.writeText(document, output);
            System.out.println("Call writeText method end");
            System.out.println("Write TXT file end");
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            if (output != null) {
                // Close the output stream
                output.close();
            }
            if (document != null) {
                // Close the PDF document
                document.close();
            }
        }
    }
}

The code contains many debug print statements; remove them as you see fit.
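The listing derives the output file name by cutting the last four characters off the PDF's name, which assumes the name ends in ".pdf". A slightly more defensive variant might look like the sketch below. The names OutputNames and textFileFor are hypothetical and not part of PDFBox; the helper uses only java.io.File, and it assumes a case-insensitive ".pdf" suffix is the only extension worth stripping.

```java
import java.io.File;
import java.util.Locale;

/** Illustrative helper for naming the text file that sits next to a PDF. */
final class OutputNames {

    /**
     * Returns a File in the same directory as pdfFile whose name has a trailing
     * ".pdf" (in any letter case) replaced by ".txt". If the name does not end
     * in ".pdf", ".txt" is simply appended.
     */
    static File textFileFor(File pdfFile) {
        String name = pdfFile.getName();
        String base = name.toLowerCase(Locale.ROOT).endsWith(".pdf")
                ? name.substring(0, name.length() - 4)
                : name;
        // A null parent is allowed here: the result is then a relative path
        return new File(pdfFile.getParentFile(), base + ".txt");
    }

    public static void main(String[] args) {
        System.out.println(textFileFor(new File("docs", "androidstudio.pdf")).getName());
        System.out.println(textFileFor(new File("README")).getName());
    }
}
```

In the listing, a call such as textFileFor(pdfFile) could stand in for the substring arithmetic, at the cost of one extra small class.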

6. Problems encountered and solutions

1) At first I did not use PDFBox 2.0 but version 1.8 (2.0 was still a pre-release version, so the earlier 1.8 was chosen). With version 1.8, however, calling PDDocument.load(String) always threw this exception: "java.io.IOException: Push back buffer is full".

Solution: This problem troubled me for a long time. I reviewed IO and NIO and consulted the PDFBox API documentation (version 1.8) without finding a solution. After a lot of searching, it appears that this may be a bug in version 1.8 that was fixed in version 2.0. Switching to version 2.0 resolved it. Note that in version 2.0 the PDDocument.load() method takes a File parameter rather than a String. You can refer to the official API documentation.
