Requirement: extract text from a PDF page by page, using Java.
PDFBox is a good open-source tool for meeting this requirement.
1. PDF Document Structure
To parse the text of a PDF, we first need to understand the structure of a PDF file.
The most important points about PDF documents are:
First, the content of a PDF document is complex: plain text (which can be extracted, for example with the "copy" function of PDF software), images (which cannot be copied that way), forms, video, audio, and so on;
Second, a PDF file mixes binary streams with plain text, and its text is not necessarily stored in a standard character encoding such as Unicode. Character codes are mapped through Adobe's built-in code tables (CMaps), which makes processing a PDF more difficult;
Third, PDF has its own file structure: header, body (a collection of objects), cross-reference table, and trailer. To be precise, this is the physical structure of a PDF document; it also has a logical structure, whose details are covered in a separate blog post.
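The physical structure described in the third point leaves visible markers in the file: the header starts with "%PDF-", and the end of the file contains a "startxref" keyword followed by "%%EOF". A minimal sketch of checking for these markers, using only the Java standard library, might look like the following. The class and method names are hypothetical, and this is a rough sanity check under those assumptions, not a PDF parser.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: looks for the physical markers of a PDF held in memory. */
final class PdfStructureCheck {

    /** Returns true if the data begins with the "%PDF-" header. */
    static boolean hasHeader(byte[] data) {
        return asLatin1(data).startsWith("%PDF-");
    }

    /** Returns true if an "%%EOF" end marker appears after a "startxref" keyword. */
    static boolean hasTrailerMarkers(byte[] data) {
        String text = asLatin1(data);
        int startXref = text.lastIndexOf("startxref");
        int eof = text.lastIndexOf("%%EOF");
        return startXref >= 0 && eof > startXref;
    }

    // ISO-8859-1 maps every byte to exactly one char, so binary stream
    // content in the file cannot cause a decoding error here.
    private static String asLatin1(byte[] data) {
        return new String(data, StandardCharsets.ISO_8859_1);
    }
}
```

Reading a file with java.nio.file.Files.readAllBytes and passing the result to both methods is enough to tell whether the file at least looks like a PDF before handing it to PDFBox.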
2. What is PDFBox?
- License Agreement: Apache
- Development language: Java
- Operating systems: cross-platform
- Official website: http://pdfbox.apache.org/
3. What can PDFBox do?
- Extracting text from a PDF
- Merging PDF documents
- Encrypting and decrypting PDF documents
- Integrating with the Lucene search engine
- Populating PDF/XFDF form data
- Creating a PDF document from a text file
- Creating an image from a PDF page
- Printing PDF documents
4. Preparation
Again, the purpose of this demo is to extract text from a PDF (so far it has only been tested with English text; Chinese has not been verified).
1) Download the required JAR files (three of them):
A. fontbox-2.0.0-rc2.jar
B. pdfbox-2.0.0-rc2.jar
C. pdfbox-app-2.0.0-rc2.jar
All three are available from the official website (mind the version).
2) MyEclipse or Eclipse.
5. Start programming
Create a new project and write the following source code:
package com.primeton.pdfbox;

import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

/**
 * Extracts text from a PDF with PDFBox.
 * @author Mrchen
 */
public class PdfReader {

    /**
     * @param args
     */
    public static void main(String[] args) {
        PdfReader pdfReader = new PdfReader();
        System.out.println("E:\\androidstudio.pdf");
        try {
            // Read androidstudio.pdf from drive E
            System.out.println("Start extraction");
            File file = new File("E:\\androidstudio.pdf");
            System.out.println("File absolute path is: " + file.getAbsolutePath());
            pdfReader.readFdf(file);
            System.out.println("Extraction finished");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public void readFdf(File pdfFile) throws Exception {
        // Whether to sort the text by its position on the page
        boolean sort = false;
        // Name of the output text file
        String textFileName = null;
        // Output encoding
        String encoding = "UTF-8";
        // First page to extract (1-based)
        int startPage = 1;
        // Last page to extract (inclusive)
        int endPage = 3;
        // Writer for the generated text file
        Writer output = null;
        // The PDF document held in memory
        PDDocument document = null;
        File outputFile = null;
        try {
            // Load the file from local disk.
            // Note: the parameter is a File, not a URL as in earlier versions.
            System.out.println("Start loading file " + pdfFile.getName());
            document = PDDocument.load(pdfFile);
            // Replace the 4-character extension (".pdf") with ".txt"; a name too
            // short to carry one just gets ".txt" appended, so outputFile is never null.
            String name = pdfFile.getName();
            textFileName = (name.length() > 4 ? name.substring(0, name.length() - 4) : name) + ".txt";
            outputFile = new File(pdfFile.getParent(), textFileName);
            System.out.println("The new file absolute path is: " + outputFile.getAbsolutePath());
            System.out.println("Load file end");

            System.out.println("Start writing to TXT file");
            // Open the output stream for the text file
            output = new OutputStreamWriter(new FileOutputStream(outputFile), encoding);
            // PDFTextStripper extracts the text
            PDFTextStripper stripper = new PDFTextStripper();
            // Set whether to sort
            stripper.setSortByPosition(sort);
            // Set the start page
            stripper.setStartPage(startPage);
            // Set the end page
            stripper.setEndPage(endPage);
            // writeText extracts the text and writes it to the output
            System.out.println("Start calling writeText method");
            stripper.writeText(document, output);
            System.out.println("Call writeText method end");
            System.out.println("Write TXT file end");
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            if (output != null) {
                // Close the output stream
                output.close();
            }
            if (document != null) {
                // Close the PDF document
                document.close();
            }
        }
    }
}
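The source code in this section extracts one fixed range, pages 1 to 3. To walk through a whole document page by page (or a few pages at a time), one option is to compute consecutive page ranges first and then pass each pair to setStartPage and setEndPage, which take 1-based page numbers with an inclusive end page. The sketch below covers only the range arithmetic and uses no PDFBox classes. PageRanges and split are hypothetical names, and the total page count is assumed to come from PDDocument.getNumberOfPages().

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: splits a document into consecutive page ranges. */
final class PageRanges {

    /**
     * Splits pages 1..totalPages into [start, end] pairs (both inclusive)
     * holding at most batchSize pages each.
     */
    static List<int[]> split(int totalPages, int batchSize) {
        if (totalPages < 0 || batchSize < 1) {
            throw new IllegalArgumentException("totalPages must be >= 0 and batchSize >= 1");
        }
        List<int[]> ranges = new ArrayList<>();
        for (int start = 1; start <= totalPages; start += batchSize) {
            // The last range may be shorter than batchSize.
            int end = Math.min(start + batchSize - 1, totalPages);
            ranges.add(new int[] {start, end});
        }
        return ranges;
    }
}
```

With a batch size of 1, each range covers a single page, so every call to writeText would produce the text of exactly one page.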
The code contains many debugging print statements; remove them as you see fit.
6. Problems encountered and solutions
1) At first I did not use PDFBox 2.0 but version 1.8 (2.0 was still an experimental release, so I chose the earlier 1.8). With 1.8, however, calling PDDocument.load(String) always threw this exception: "java.io.IOException: Push back buffer is full".
Solution: This problem troubled me for a long time. I reviewed my knowledge of IO and NIO and consulted the PDFBox English API documentation (version 1.8), but found no solution. After reading a lot of material, I concluded that this is probably a bug in version 1.8 that was fixed in 2.0. Switching to version 2.0 solved it. Note that in 2.0 the PDDocument.load() method takes a File parameter and no longer a String; see the official API documentation.
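Because load() in 2.0 takes a File, code that still receives paths as strings has to convert them first. A minimal sketch of that step, with a check that the path points to an actual file, might look like this. PdfPaths and toExistingFile are hypothetical names, not part of PDFBox; the returned File is what would then be passed to PDDocument.load(File).

```java
import java.io.File;
import java.io.FileNotFoundException;

/** Hypothetical helper: turns a path string into a File that is known to exist. */
final class PdfPaths {

    /** Returns the File for the given path, or throws if it is not a regular file. */
    static File toExistingFile(String path) throws FileNotFoundException {
        File file = new File(path);
        if (!file.isFile()) {
            // Fail early with the absolute path, which is easier to diagnose
            // than an I/O error raised later during parsing.
            throw new FileNotFoundException("Not a regular file: " + file.getAbsolutePath());
        }
        return file;
    }
}
```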