When reprinting, please credit the source: http://blog.csdn.net/loongshawn/article/details/51542309
Related articles:
- Apache PDFBox Development Guide: PDF Text Content Mining
- Apache PDFBox Development Guide: PDF Document Reading
1. Introduction
Apache PDFBox is an open-source, Java-based library for working with PDF documents. It can be used to create new PDF documents, modify existing ones, and extract content from them. Apache PDFBox also includes a number of command-line tools.
Apache PDFBox released version 2.0.1, the latest at the time of writing, on April 26, 2016.
Note: The code in this article is written against version 2.0 and above.
Website: https://pdfbox.apache.org/index.html
PDFBox 2.0.1 online API documentation: https://pdfbox.apache.org/docs/2.0.1/javadocs/
2. Characteristics
Apache PDFBox has the following main features:
Reading, creating, printing, converting, validating, merging, and splitting PDF documents.
3. Development in practice
3.1. Scenario description
1. Read the text content of a PDF. The sample reads the text of a medical report.
2. Extract the images from a PDF document. Here each image is only saved into a separate PDF. Writing image files directly is not implemented; if you need that, you can extend my code, mainly by working with the PDImageXObject objects.
3.2. Required jar packages
pdfbox-2.0.1.jar
fontbox-2.0.1.jar
Add the two jar packages above to the project's libraries.
3.3. Text content extraction
3.3.1. Text content extraction
Create the PdfReader class and write the following main function.
package com.loongshaw;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.pdfbox.io.RandomAccessBuffer;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PdfReader {

    public static void main(String[] args) {
        File pdfFile = new File("/users/dddd/downloads/0571888890423433356rrrr_182-93201510313223336-2.pdf");
        PDDocument document = null;
        try {
            // Method 1:
            /**
            InputStream input = null;
            input = new FileInputStream(pdfFile);
            // Load the PDF document
            PDFParser parser = new PDFParser(new RandomAccessBuffer(input));
            parser.parse();
            document = parser.getPDDocument();
            **/

            // Method 2:
            document = PDDocument.load(pdfFile);

            // Get the number of pages
            int pages = document.getNumberOfPages();

            // Read the text content
            PDFTextStripper stripper = new PDFTextStripper();
            // Output the text sorted by position
            stripper.setSortByPosition(true);
            stripper.setStartPage(1);
            stripper.setEndPage(pages);
            String content = stripper.getText(document);
            System.out.println(content);
        } catch (Exception e) {
            System.out.println(e);
        } finally {
            // Release the resources held by the document
            if (document != null) {
                try {
                    document.close();
                } catch (IOException e) {
                    System.out.println(e);
                }
            }
        }
    }
}
3.3.2. Process description
A PDF file can be loaded in two ways. There is no obvious difference in the result, but method 2 is more concise:
// Method 1:
InputStream input = null;
input = new FileInputStream(pdfFile);
// Load the PDF document
PDFParser parser = new PDFParser(new RandomAccessBuffer(input));
parser.parse();
document = parser.getPDDocument();

// Method 2:
document = PDDocument.load(pdfFile);
3.3.3. Execution results
3.4. Image extraction (added 2016-12-02)
3.4.1. Image extraction
// Needs these imports: java.io.File, java.io.IOException, java.util.Iterator,
// org.apache.pdfbox.cos.COSName, org.apache.pdfbox.pdmodel.PDDocument,
// org.apache.pdfbox.pdmodel.PDPage, org.apache.pdfbox.pdmodel.PDPageContentStream,
// org.apache.pdfbox.pdmodel.PDPageContentStream.AppendMode,
// org.apache.pdfbox.pdmodel.PDResources,
// org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject
public static void readImage() {
    // PDF file to be parsed
    File pdfFile = new File("/users/xiaolong/downloads/test.pdf");
    // Blank PDF
    File pdfFile_out = new File("/users/xiaolong/downloads/testout.pdf");

    PDDocument document = null;
    PDDocument document_out = null;
    try {
        document = PDDocument.load(pdfFile);
        document_out = PDDocument.load(pdfFile_out);
    } catch (IOException e) {
        e.printStackTrace();
        // Nothing to process if either file failed to load
        return;
    }

    int pages_size = document.getNumberOfPages();
    System.out.println("getallpages===============" + pages_size);
    int j = 0;

    for (int i = 0; i < pages_size; i++) {
        PDPage page = document.getPage(i);
        PDPage page1 = document_out.getPage(0);
        PDResources resources = page.getResources();
        Iterable<COSName> xobjects = resources.getXObjectNames();

        if (xobjects != null) {
            Iterator<COSName> imageIter = xobjects.iterator();
            while (imageIter.hasNext()) {
                COSName key = imageIter.next();
                if (resources.isImageXObject(key)) {
                    try {
                        PDImageXObject image = (PDImageXObject) resources.getXObject(key);

                        // Method 1: save the image from the PDF document into a blank PDF.
                        PDPageContentStream contentStream = new PDPageContentStream(document_out, page1, AppendMode.APPEND, true);
                        float scale = 1f;
                        contentStream.drawImage(image, 20, 20, image.getWidth() * scale, image.getHeight() * scale);
                        contentStream.close();
                        document_out.save("/users/xiaolong/downloads/123" + j + ".pdf");

                        System.out.println(image.getSuffix() + "," + image.getHeight() + "," + image.getWidth());

                        /**
                        // Method 2: save each image from the PDF document as a separate image file.
                        File file = new File("/users/xiaolong/downloads/123" + j + ".png");
                        FileOutputStream out = new FileOutputStream(file);
                        InputStream input = image.createInputStream();
                        int byteCount = 0;
                        byte[] bytes = new byte[1024];
                        while ((byteCount = input.read(bytes)) > 0) {
                            out.write(bytes, 0, byteCount);
                        }
                        out.close();
                        input.close();
                        **/
                    } catch (IOException e) {
                        // TODO Auto-generated catch block
                        e.printStackTrace();
                    }
                    // Image count
                    j++;
                }
            }
        }
    }
    System.out.println(j);
}
3.4.2. Process description
This method obtains each image object (PDImageXObject) from the source PDF and then processes it. The code above takes every extracted image object and inserts it into a blank PDF document.
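The drawing call in the code above uses a fixed scale of 1f and an offset of (20, 20), so an image larger than the page would run past its edges. The following is a minimal sketch of one way to choose the scale, using only the JDK. The class ImageFit, the helper fitScale, and the margin parameter are hypothetical names introduced for illustration, and the page size of 595 x 842 points is an assumed approximation of A4; none of this comes from the original code.

```java
final class ImageFit {

    /**
     * Returns a scale factor (at most 1) so that an image of imgW x imgH
     * fits inside a page of pageW x pageH with the given margin on each side.
     * Hypothetical helper, not part of PDFBox.
     */
    static float fitScale(float imgW, float imgH, float pageW, float pageH, float margin) {
        float availW = pageW - 2 * margin;
        float availH = pageH - 2 * margin;
        if (imgW <= 0 || imgH <= 0 || availW <= 0 || availH <= 0) {
            throw new IllegalArgumentException("sizes must be positive and larger than the margins");
        }
        // Use the tighter of the two constraints, and never enlarge the image.
        float scale = Math.min(availW / imgW, availH / imgH);
        return Math.min(1f, scale);
    }

    public static void main(String[] args) {
        // Assumed values: an image of 1110 x 500 pixels on an A4-like page.
        float imgW = 1110f;
        float imgH = 500f;
        float scale = fitScale(imgW, imgH, 595f, 842f, 20f);
        System.out.println("scale=" + scale
                + ", drawn size=" + (imgW * scale) + " x " + (imgH * scale));
    }
}
```

The returned value could replace the constant scale in the drawing call, with the image width and height taken from the extracted image object.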
Note that the commented-out part of the code above was intended to write image files directly, but the files it produced turned out to be invalid. If you have a new idea, you can keep experimenting on the basis of this code.
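One possible explanation, offered as an assumption rather than a confirmed diagnosis: the stream returned by createInputStream() carries decoded image data, not a file in PNG format, so copying its bytes into a .png file does not produce a valid image. In PDFBox 2.x, PDImageXObject.getImage() returns a java.awt.image.BufferedImage, which the JDK's ImageIO can encode. The sketch below uses only the JDK, with a synthetic BufferedImage standing in for an extracted image. The class PngWriteSketch and its helpers are hypothetical names, not code from the original article.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.Arrays;
import javax.imageio.ImageIO;

final class PngWriteSketch {

    // Every PNG file starts with these eight signature bytes.
    static final byte[] PNG_SIGNATURE = {(byte) 0x89, 'P', 'N', 'G', '\r', '\n', 0x1A, '\n'};

    /** Encodes a decoded image as a PNG file; fails if no PNG writer is found. */
    static void writePng(BufferedImage image, File target) throws IOException {
        if (!ImageIO.write(image, "png", target)) {
            throw new IOException("No PNG writer available for " + target);
        }
    }

    /** Returns true when the file begins with the PNG signature. */
    static boolean looksLikePng(File file) throws IOException {
        byte[] all = Files.readAllBytes(file.toPath());
        byte[] head = Arrays.copyOf(all, PNG_SIGNATURE.length);
        return all.length >= PNG_SIGNATURE.length && Arrays.equals(head, PNG_SIGNATURE);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the BufferedImage obtained from an extracted image object.
        BufferedImage image = new BufferedImage(4, 4, BufferedImage.TYPE_INT_RGB);
        File target = File.createTempFile("extracted", ".png");
        target.deleteOnExit();
        writePng(image, target);
        System.out.println("valid PNG header: " + looksLikePng(target));
    }
}
```

Inside the extraction loop, the same writePng call could be given the result of image.getImage() and a target file named after the image counter.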
3.4.3. Execution results
The source PDF file contains 19 images.
19 PDF files are generated, each containing a single image.
4. Summary
This article only introduces reading PDF text with the Apache PDFBox development packages. Other, more complex features are not covered; readers will need to explore and try them on their own.