1. First, download pdfbox-app-1.8.10.jar (download page: http://pdfbox.apache.org/download.html)
2. Add pdfbox-app-1.8.10.jar to the Eclipse project
1. Create a new Java project: File -> New -> Java Project, named for example Pdftotext. Then right-click the project and choose Build Path -> Configure Build Path. Click Add External JARs and add the pdfbox-app-1.8.10.jar you just downloaded. Open the Order and Export tab, tick the JAR, and finally click OK.
2. Create a new Pdfboxtest class with the following source code:
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.net.MalformedURLException;
import java.net.URL;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;

// Author: yiutto
// Purpose: batch conversion of PDF files to text documents
public class Pdfboxtest {

    public void getText(String file) throws Exception {
        // Whether to sort the extracted text by position
        boolean sort = false;
        // PDF file name. @1 "e:\\data\\inputpdf\\" is the PDF root folder; all PDF files go in this directory (you can change it)
        String pdfFile = "e:\\data\\inputpdf\\" + file;
        // Output text file name
        String textFile = null;
        // Encoding of the output file
        String encoding = "UTF-8";
        // First page to extract
        int startPage = 1;
        // Last page to extract
        int endPage = Integer.MAX_VALUE;
        // Output stream used to write the text file
        Writer output = null;
        // PDF document held in memory
        PDDocument document = null;
        try {
            try {
                // Try to load the file as a URL first; on an exception, load it from the local file system
                URL url = new URL(pdfFile);
                document = PDDocument.load(url);
                // Get the file name of the PDF
                // String fileName = url.getFile();
                // Name the new TXT file after the original PDF
                if (file.length() > 4) {
                    File outputFile = new File(file.substring(0, file.length() - 4) + ".txt");
                    textFile = outputFile.getName();
                }
            } catch (MalformedURLException e) {
                // Loading as a URL failed, so load from the file system
                document = PDDocument.load(pdfFile);
                if (file.length() > 4) {
                    textFile = file.substring(0, file.length() - 4) + ".txt";
                }
            }
            // Output stream that writes to textFile. @2 "E:\\data\\outputtxt\\" is the text output directory (you can change it)
            output = new OutputStreamWriter(new FileOutputStream("E:\\data\\outputtxt\\" + textFile), encoding);
            // PDFTextStripper extracts the text
            PDFTextStripper stripper = new PDFTextStripper();
            // Set whether to sort
            stripper.setSortByPosition(sort);
            // Set the start page
            stripper.setStartPage(startPage);
            // Set the end page
            stripper.setEndPage(endPage);
            // Call PDFTextStripper's writeText to extract the text and write it out
            stripper.writeText(document, output);
        } finally {
            if (output != null) {
                // Close the output stream
                output.close();
            }
            if (document != null) {
                // Close the PDF document
                document.close();
            }
        }
    }

    public static void main(String[] args) {
        // @3 "e:\\data\\inputpdf\\" is the PDF root folder; all PDF files go in this directory (you can change it)
        File input = new File("e:\\data\\inputpdf\\");
        if (input.isDirectory()) {
            String[] fileList = input.list();
            Pdfboxtest test = new Pdfboxtest();
            System.out.println(input.toString() + "\n");
            for (String file : fileList) {
                try {
                    System.out.println(" " + file + " is being converted to text ...");
                    test.getText(file);
                    System.out.println(" " + file + " is done.\n");
                } catch (Exception e) {
                    // TODO Auto-generated catch block
                    e.printStackTrace();
                }
            }
        }
    }
}
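The listing builds the output name by cutting the last four characters off the file name, and the loop in main hands every entry of the input folder to the converter, whether or not it is a PDF. A minimal sketch of a stricter name check, using only the standard library, might look like this. The class and method names (PdfNames, isPdf, toTextName) are hypothetical and are not part of PDFBox or of the original code:

```java
import java.util.Locale;

// Hypothetical helper, not part of PDFBox or the original listing.
class PdfNames {

    // True when the name ends in ".pdf" (case-insensitive) and has a non-empty base name.
    static boolean isPdf(String name) {
        return name != null
                && name.length() > 4
                && name.toLowerCase(Locale.ROOT).endsWith(".pdf");
    }

    // Replaces the ".pdf" extension with ".txt"; rejects names that are not PDF names.
    static String toTextName(String name) {
        if (!isPdf(name)) {
            throw new IllegalArgumentException("Not a PDF file name: " + name);
        }
        return name.substring(0, name.length() - 4) + ".txt";
    }
}
```

With a helper like this, the loop in main could skip any entry for which isPdf returns false before calling the conversion method, so stray files in the input folder are not passed to PDFBox.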
Notes:
@1 "e:\\data\\inputpdf\\" is the PDF root folder; all the PDF files are placed in this directory (you can change it)
@2 "E:\\data\\outputtxt\\" is the text document output directory (you can change it)
@3 "e:\\data\\inputpdf\\" is the PDF root folder; all the PDF files are placed in this directory (you can change it)
The three paths marked by these comments in the code can be changed to suit your own setup. I have explained it in this much detail, so I hope you will give it a like!!
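If editing the source every time the folders change is inconvenient, the two directories could instead be read from the command line. The following is only a sketch under that assumption: the ConvertDirs class, its fields, and the argument order (input folder first, output folder second) are hypothetical, and the defaults are simply the paths used in this article.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: picks the input and output folders from command-line
// arguments and falls back to the article's hard-coded paths when they are missing.
class ConvertDirs {
    static final String DEFAULT_INPUT = "e:\\data\\inputpdf\\";
    static final String DEFAULT_OUTPUT = "E:\\data\\outputtxt\\";

    final Path input;
    final Path output;

    ConvertDirs(String[] args) {
        // args[0] = PDF folder, args[1] = text output folder (assumed order)
        input = Paths.get(args.length > 0 ? args[0] : DEFAULT_INPUT);
        output = Paths.get(args.length > 1 ? args[1] : DEFAULT_OUTPUT);
    }

    // Full path of a generated text file inside the output folder.
    Path outputFor(String textFileName) {
        return output.resolve(textFileName);
    }
}
```

In main, `new ConvertDirs(args)` would then supply the folder to list and the location passed to FileOutputStream, in place of the three string literals marked @1, @2 and @3.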
3. The results of the experiment are as follows:
How to batch-convert PDF documents to text documents using pdfbox-app-1.8.10.jar