How to batch-convert PDF documents into text documents using pdfbox-app-1.8.10.jar

Last Update:2015-08-10 Source: Internet

Author: User



1. First, download pdfbox-app-1.8.10.jar (download page: http://pdfbox.apache.org/download.html)

2. Load pdfbox-app-1.8.10.jar into the Eclipse project

1. Create a new Java project: File -> New -> Java Project, named for example Pdftotext. Then right-click the project and choose Build Path -> Configure Build Path. Click Add External JARs and add the pdfbox-app-1.8.10.jar you just downloaded. On the Order and Export tab, tick the box for that jar, and finally click OK.
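To confirm that the jar really ended up on the build path, a quick check along the following lines may help. This is only a sketch: the class ClasspathCheck and its method isOnClasspath are hypothetical names introduced here, not part of the original article. The only PDFBox-specific detail is the class name org.apache.pdfbox.pdmodel.PDDocument, which the article's own code imports.

```java
// Hypothetical helper (not from the original article): checks whether a class
// can be found on the classpath of the running program.
final class ClasspathCheck {

    /** Returns true if the named class can be located by this class's class loader. */
    static boolean isOnClasspath(String className) {
        try {
            // "false" means: locate the class but do not run its static initializers.
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The class name is taken from the imports in the article's source code.
        String pdfBoxClass = "org.apache.pdfbox.pdmodel.PDDocument";
        System.out.println(pdfBoxClass + " available: " + isOnClasspath(pdfBoxClass));
    }
}
```

If this reports that the class is not available when run inside the Eclipse project, the jar was most likely not added under Configure Build Path.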

2. Create a new class named Pdfboxtest. Its source code is as follows:

import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.net.MalformedURLException;
import java.net.URL;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;

// Author: yiutto
// Purpose: batch conversion of PDF files to text documents
public class Pdfboxtest {

    public void getText(String file) throws Exception {
        // Whether to sort the extracted text by position
        boolean sort = false;
        // PDF file name. @1 "e:\\data\\inputpdf\\" is the root folder for the PDF files;
        // put all the PDF files in this directory (you can change it)
        String pdffile = "e:\\data\\inputpdf\\" + file;
        // Name of the output text file
        String textfile = null;
        // Output encoding
        String encoding = "UTF-8";
        // First page to extract
        int startPage = 1;
        // Last page to extract
        int endPage = Integer.MAX_VALUE;
        // Writer used to produce the text file
        Writer output = null;
        // The PDF document held in memory
        PDDocument document = null;
        try {
            try {
                // First try to load the file as a URL; if that fails,
                // load it from the local file system instead
                URL url = new URL(pdffile);
                document = PDDocument.load(url);
                // Get the file name of the PDF
                // String fileName = url.getFile();
                // Name the new .txt file after the original PDF
                if (file.length() > 4) {
                    File outputFile = new File(file.substring(0, file.length() - 4) + ".txt");
                    textfile = outputFile.getName();
                }
            } catch (MalformedURLException e) {
                // Loading as a URL failed, so load from the file system
                document = PDDocument.load(pdffile);
                if (file.length() > 4) {
                    textfile = file.substring(0, file.length() - 4) + ".txt";
                }
            }
            // Open the output writer for textfile. @2 "E:\\data\\outputtxt\\" is the
            // output directory for the text documents (you can change it)
            output = new OutputStreamWriter(
                    new FileOutputStream("E:\\data\\outputtxt\\" + textfile), encoding);
            // PDFTextStripper extracts the text
            PDFTextStripper stripper = new PDFTextStripper();
            // Set whether to sort by position
            stripper.setSortByPosition(sort);
            // Set the start page
            stripper.setStartPage(startPage);
            // Set the end page
            stripper.setEndPage(endPage);
            // Call PDFTextStripper's writeText to extract the text and write it out
            stripper.writeText(document, output);
        } finally {
            if (output != null) {
                // Close the output writer
                output.close();
            }
            if (document != null) {
                // Close the PDF document
                document.close();
            }
        }
    }

    public static void main(String[] args) {
        // @3 "e:\\data\\inputpdf\\" is the root folder for the PDF files;
        // put all the PDF files in this directory (you can change it)
        File input = new File("e:\\data\\inputpdf\\");
        if (input.isDirectory()) {
            String[] fileList = input.list();
            Pdfboxtest test = new Pdfboxtest();
            System.out.println(input.toString() + "\n");
            for (String file : fileList) {
                try {
                    System.out.println(" " + file + " is being converted to text ...");
                    test.getText(file);
                    System.out.println(" " + file + " is done.\n");
                } catch (Exception e) {
                    // TODO Auto-generated catch block
                    e.printStackTrace();
                }
            }
        }
    }
}
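The main method passes every entry returned by input.list() to the converter, and the output name is built by cutting off the last four characters of the file name. If the input folder also holds sub-folders or non-PDF files, a small filtering step may avoid failed conversions. The sketch below uses only the Java standard library; the class PdfFileNames and its methods are hypothetical names introduced for illustration and are not part of the original program.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper (not from the original article): selects PDF files in a
// folder and derives the matching .txt file names.
final class PdfFileNames {

    /** True if the name ends in ".pdf" (any letter case) and has a non-empty base name. */
    static boolean isPdf(String fileName) {
        return fileName.length() > 4
                && fileName.toLowerCase(Locale.ROOT).endsWith(".pdf");
    }

    /** Replaces the ".pdf" extension with ".txt", as the article's code does. */
    static String toTextFileName(String pdfFileName) {
        if (!isPdf(pdfFileName)) {
            throw new IllegalArgumentException("Not a PDF file name: " + pdfFileName);
        }
        return pdfFileName.substring(0, pdfFileName.length() - 4) + ".txt";
    }

    /** Lists the regular files in the directory whose names look like PDFs, sorted by name. */
    static List<String> listPdfFiles(File directory) {
        List<String> result = new ArrayList<>();
        String[] names = directory.list(); // null if not a directory or on an I/O error
        if (names == null) {
            return result;
        }
        Arrays.sort(names);
        for (String name : names) {
            if (isPdf(name) && new File(directory, name).isFile()) {
                result.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumption: the folder to scan is given as the first argument,
        // falling back to the current directory.
        File input = new File(args.length > 0 ? args[0] : ".");
        for (String name : listPdfFiles(input)) {
            System.out.println(name + " -> " + toTextFileName(name));
        }
    }
}
```

In the article's program, the loop in main could iterate over listPdfFiles(input) instead of input.list(), so that only PDF files are handed to getText.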

Note: @1 "e:\\data\\inputpdf\\" is the root folder for the PDF files; put all the PDF files to convert in this directory (you can change it).

@2 "E:\\data\\outputtxt\\" is the output directory for the text documents (you can change it).

@3 "e:\\data\\inputpdf\\" is the root folder for the PDF files; put all the PDF files to convert in this directory (you can change it).

You can change the three paths marked by these comments in the code to suit your own setup. I have explained this in as much detail as I can, so I hope you find it useful!
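Since the paths marked @1, @2 and @3 are hard-coded, changing them means editing and recompiling the class. One possible alternative is to read the two folders from the command line and fall back to the article's values when no arguments are given. The following is a minimal sketch using only the Java standard library; the class ConversionPaths and its members are hypothetical and do not appear in the original program.

```java
import java.io.File;

// Hypothetical helper (not from the original article): holds the input and
// output folders so they do not have to be hard-coded in several places.
final class ConversionPaths {

    // Defaults copied from the article's code (@1/@3 and @2).
    static final String DEFAULT_INPUT_DIR = "e:\\data\\inputpdf\\";
    static final String DEFAULT_OUTPUT_DIR = "E:\\data\\outputtxt\\";

    final File inputDir;
    final File outputDir;

    ConversionPaths(File inputDir, File outputDir) {
        this.inputDir = inputDir;
        this.outputDir = outputDir;
    }

    /** Uses args[0] as the input folder and args[1] as the output folder, if present. */
    static ConversionPaths fromArgs(String[] args) {
        String in = args.length > 0 ? args[0] : DEFAULT_INPUT_DIR;
        String out = args.length > 1 ? args[1] : DEFAULT_OUTPUT_DIR;
        return new ConversionPaths(new File(in), new File(out));
    }

    /** Full path of a PDF inside the input folder. */
    File inputFileFor(String pdfFileName) {
        return new File(inputDir, pdfFileName);
    }

    /** Full path of a text file inside the output folder. */
    File outputFileFor(String textFileName) {
        return new File(outputDir, textFileName);
    }

    public static void main(String[] args) {
        ConversionPaths paths = fromArgs(args);
        System.out.println("Input folder:  " + paths.inputDir);
        System.out.println("Output folder: " + paths.outputDir);
    }
}
```

With something like this in place, getText could build its input path from inputFileFor(file) and its FileOutputStream from outputFileFor(textfile), so each folder is defined in one place only.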

3. The results of the experiment are as follows:

[Figure: How to convert a PDF document to a text document using pdfbox-app-1.8.10.jar batch processing]

