How to batch-convert PDF documents into text documents using pdfbox-app-1.8.10.jar

Source: Internet
Author: User
Tags: gettext

1. First, download pdfbox-app-1.8.10.jar (download page: http://pdfbox.apache.org/download.html)

2. Add pdfbox-app-1.8.10.jar to an Eclipse project

1. Create a new Java project: File -> New -> Java Project, named for example Pdftotext. Then right-click the project and choose Build Path -> Configure Build Path. Click Add External JARs, select the pdfbox-app-1.8.10.jar you just downloaded, open the Order and Export tab, tick the jar, and finally click OK.

2. Create a new Pdfboxtest class. The source code is as follows:

import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.net.MalformedURLException;
import java.net.URL;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;

// Author: yiutto
// Purpose: batch conversion of PDF files to text documents
public class Pdfboxtest {

    public void getText(String file) throws Exception {
        // Whether to sort the text by position
        boolean sort = false;
        // PDF file name. @1 "e:\\data\\inputpdf\\" is the PDF folder root; all the PDF files are placed in this directory (you can set it)
        String pdfFile = "e:\\data\\inputpdf\\" + file;
        // Name of the output text file
        String textFile = null;
        // Encoding of the output file
        String encoding = "UTF-8";
        // First page to extract
        int startPage = 1;
        // Last page to extract
        int endPage = Integer.MAX_VALUE;
        // Output stream used to generate the text file
        Writer output = null;
        // PDF document held in memory
        PDDocument document = null;
        try {
            try {
                // Try to load the file as a URL first; if that throws, load it from the local file system
                URL url = new URL(pdfFile);
                document = PDDocument.load(url);
                // Get the file name of the PDF
                // String fileName = url.getFile();
                // Name the newly generated TXT file after the original PDF
                if (file.length() > 4) {
                    File outputFile = new File(file.substring(0, file.length() - 4) + ".txt");
                    textFile = outputFile.getName();
                }
            } catch (MalformedURLException e) {
                // Loading as a URL failed, so load from the file system
                document = PDDocument.load(pdfFile);
                if (file.length() > 4) {
                    textFile = file.substring(0, file.length() - 4) + ".txt";
                }
            }
            // Output stream that writes to textFile. @2 "E:\\data\\outputtxt\\" is the text document output directory (you can set it)
            output = new OutputStreamWriter(new FileOutputStream("E:\\data\\outputtxt\\" + textFile), encoding);
            // PDFTextStripper extracts the text
            PDFTextStripper stripper = new PDFTextStripper();
            // Set whether to sort
            stripper.setSortByPosition(sort);
            // Set the start page
            stripper.setStartPage(startPage);
            // Set the end page
            stripper.setEndPage(endPage);
            // Call PDFTextStripper's writeText to extract and output the text
            stripper.writeText(document, output);
        } finally {
            if (output != null) {
                // Close the output stream
                output.close();
            }
            if (document != null) {
                // Close the PDF document
                document.close();
            }
        }
    }

    public static void main(String[] args) {
        // @3 "e:\\data\\inputpdf\\" is the PDF folder root directory; all the PDF files are placed in this directory (you can set it)
        File input = new File("e:\\data\\inputpdf\\");
        if (input.isDirectory()) {
            String[] fileList = input.list();
            Pdfboxtest test = new Pdfboxtest();
            System.out.println(input.toString() + "\n");
            for (String file : fileList) {
                try {
                    System.out.println(" " + file + " is being converted to text ...");
                    test.getText(file);
                    System.out.println(" " + file + " is done.\n");
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        }
    }
}
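The listing builds the output name by cutting the last four characters off the file name, which quietly assumes every entry ends in a four-character extension such as .pdf. Below is a minimal sketch of a more defensive variant that uses only the JDK. The class name TextFileNames and the method toTextName are hypothetical; they are not part of the original code or of PDFBox.

```java
import java.util.Locale;

// Hypothetical helper for illustration; not part of the article's code or of PDFBox.
public class TextFileNames {

    // Returns the .txt name for a PDF file name, or null when the
    // name is null, has no base name, or does not end in ".pdf".
    public static String toTextName(String pdfName) {
        if (pdfName == null || pdfName.length() <= 4) {
            return null;
        }
        String lower = pdfName.toLowerCase(Locale.ROOT);
        if (!lower.endsWith(".pdf")) {
            return null;
        }
        // Keep the original case of the base name, replace only the extension.
        return pdfName.substring(0, pdfName.length() - 4) + ".txt";
    }

    public static void main(String[] args) {
        System.out.println(toTextName("report.pdf"));  // prints report.txt
        System.out.println(toTextName("SCAN.PDF"));    // prints SCAN.txt
        System.out.println(toTextName("notes.docx"));  // prints null
    }
}
```

A caller could skip any file for which toTextName returns null instead of passing it to the text stripper.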

Notes:

@1 "e:\\data\\inputpdf\\" is the root folder for the input PDFs; put all the PDF files to convert in this directory (you can change it).

@2 "E:\\data\\outputtxt\\" is the output directory for the text documents (you can change it).

@3 "E:\\data\\inputpdf\\" is the same input PDF folder as @1; it is the directory that main scans, so keep the two paths identical if you change them.

The three paths marked @1, @2 and @3 in the code comments can be changed to suit your own setup. I have explained this in detail, so I hope you find it useful!
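Two details about these directories are worth guarding against: input.list() returns every entry in the folder, not only PDF files, and FileOutputStream does not create a missing output directory. The following is a minimal sketch of both checks using only the JDK. The class PdfFolders, its methods, and the choice to read the two paths from command-line arguments are assumptions made for illustration and are not in the original code.

```java
import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.Locale;

// Hypothetical helper for illustration; not part of the article's code or of PDFBox.
public class PdfFolders {

    // Returns the sorted names of regular files ending in ".pdf"
    // (case-insensitive); empty if dir is not a readable directory.
    public static String[] listPdfNames(File dir) {
        String[] names = dir.list((parent, name) ->
                name.toLowerCase(Locale.ROOT).endsWith(".pdf")
                        && new File(parent, name).isFile());
        if (names == null) {
            return new String[0];
        }
        Arrays.sort(names);
        return names;
    }

    // Creates the directory (and any missing parents) if it does not exist yet.
    public static void ensureDirectory(File dir) throws IOException {
        if (!dir.isDirectory() && !dir.mkdirs()) {
            throw new IOException("Cannot create directory: " + dir);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.out.println("usage: java PdfFolders <input-pdf-dir> <output-txt-dir>");
            return;
        }
        File input = new File(args[0]);   // plays the role of @1 and @3
        File output = new File(args[1]);  // plays the role of @2
        ensureDirectory(output);
        for (String name : listPdfNames(input)) {
            System.out.println(name);
        }
    }
}
```

Run it as, for example, java PdfFolders e:\data\inputpdf E:\data\outputtxt to print the PDF names that a conversion loop would then process.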

3. The results of the experiment are as follows:

[Figure: result of the batch conversion]
