Apache PDFBox Development Guide: Reading PDF Documents

Source: Internet
Author: User

When reprinting, please credit the source: http://blog.csdn.net/loongshawn/article/details/51542309

Related articles:

    • Apache PDFBox Development Guide: PDF Text Content Mining
    • Apache PDFBox Development Guide: Reading PDF Documents

1. Introduction

Apache PDFBox is an open-source, Java-based library for working with PDF documents. It can be used to create new PDF documents, modify existing ones, and extract content from them. Apache PDFBox also ships with a number of command-line tools.
Version 2.0.1, the latest at the time of writing, was released on April 26, 2016.

Note: The code in this article is written against version 2.0 and above.

Website address: https://pdfbox.apache.org/index.html

PDFBox 2.0.1 online API documentation: https://pdfbox.apache.org/docs/2.0.1/javadocs/

2. Features

The main features of Apache PDFBox are reading, creating, printing, converting, validating, merging, and splitting PDF documents.

3. Hands-on development

3.1. Scenario description

    • 1. Read the text content of a PDF. The sample reads the text of a medical report.

    • 2. Extract the images from a PDF document. The code here only saves each image from the PDF into a separate PDF. Writing image files directly is not implemented; if you need that, you can extend the code, and the main work is handling PDImageXObject objects.

3.2. Required JAR packages

pdfbox-2.0.1.jar

fontbox-2.0.1.jar

Add both JAR packages to the project's libraries.

3.3. Text content extraction

3.3.1. Text content extraction

Create the PdfReader class and write the following code.

    package com.loongshaw;

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;

    import org.apache.pdfbox.io.RandomAccessBuffer;
    import org.apache.pdfbox.pdfparser.PDFParser;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.text.PDFTextStripper;

    public class PdfReader {

        public static void main(String[] args) {
            File pdfFile = new File("/users/dddd/downloads/0571888890423433356rrrr_182-93201510313223336-2.pdf");
            PDDocument document = null;
            try {
                // Approach 1:
                /*
                InputStream input = null;
                input = new FileInputStream(pdfFile);
                // Load the PDF document
                PDFParser parser = new PDFParser(new RandomAccessBuffer(input));
                parser.parse();
                document = parser.getPDDocument();
                */

                // Approach 2:
                document = PDDocument.load(pdfFile);

                // Get the number of pages
                int pages = document.getNumberOfPages();

                // Read the text content
                PDFTextStripper stripper = new PDFTextStripper();
                // Output the text sorted by its position on the page
                stripper.setSortByPosition(true);
                stripper.setStartPage(1);
                stripper.setEndPage(pages);
                String content = stripper.getText(document);
                System.out.println(content);
            } catch (Exception e) {
                System.out.println(e);
            } finally {
                // Release the resources held by the document
                if (document != null) {
                    try {
                        document.close();
                    } catch (IOException e) {
                        System.out.println(e);
                    }
                }
            }
        }
    }
3.3.2. Process description

A PDF file can be loaded in two ways. There is no obvious difference between them, but the second is more concise:

    // Approach 1:
    InputStream input = null;
    input = new FileInputStream(pdfFile);
    // Load the PDF document
    PDFParser parser = new PDFParser(new RandomAccessBuffer(input));
    parser.parse();
    document = parser.getPDDocument();

    // Approach 2:
    document = PDDocument.load(pdfFile);
3.3.3. Execution results

3.4. Image extraction (added 2016-12-02)

3.4.1. Image extraction

    // Imports needed in addition to those used by PdfReader above
    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.pdfbox.cos.COSName;
    import org.apache.pdfbox.pdmodel.PDPage;
    import org.apache.pdfbox.pdmodel.PDPageContentStream;
    import org.apache.pdfbox.pdmodel.PDPageContentStream.AppendMode;
    import org.apache.pdfbox.pdmodel.PDResources;
    import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;

    public static void readImage() {
        // PDF file to be parsed
        File pdfFile = new File("/users/xiaolong/downloads/test.pdf");
        // Blank PDF file
        File pdfFile_out = new File("/users/xiaolong/downloads/testout.pdf");

        PDDocument document = null;
        PDDocument document_out = null;
        try {
            document = PDDocument.load(pdfFile);
            document_out = PDDocument.load(pdfFile_out);
        } catch (IOException e) {
            e.printStackTrace();
            // Nothing can be extracted if either file fails to load
            return;
        }

        int pages_size = document.getNumberOfPages();
        System.out.println("getAllPages===============" + pages_size);
        int j = 0;

        for (int i = 0; i < pages_size; i++) {
            PDPage page = document.getPage(i);
            PDPage page1 = document_out.getPage(0);
            PDResources resources = page.getResources();
            Iterable<COSName> xobjects = resources.getXObjectNames();

            if (xobjects != null) {
                Iterator<COSName> imageIter = xobjects.iterator();
                while (imageIter.hasNext()) {
                    COSName key = imageIter.next();
                    if (resources.isImageXObject(key)) {
                        try {
                            PDImageXObject image = (PDImageXObject) resources.getXObject(key);

                            // Approach 1: save the image from the PDF document into a blank PDF.
                            PDPageContentStream contentStream =
                                    new PDPageContentStream(document_out, page1, AppendMode.APPEND, true);
                            float scale = 1f;
                            contentStream.drawImage(image, 20, 20,
                                    image.getWidth() * scale, image.getHeight() * scale);
                            contentStream.close();
                            document_out.save("/users/xiaolong/downloads/123" + j + ".pdf");

                            System.out.println(image.getSuffix() + "," + image.getHeight() + "," + image.getWidth());

                            /*
                            // Approach 2: save each image from the PDF document as a separate image file.
                            File file = new File("/users/xiaolong/downloads/123" + j + ".png");
                            FileOutputStream out = new FileOutputStream(file);
                            InputStream input = image.createInputStream();
                            int byteCount = 0;
                            byte[] bytes = new byte[1024];
                            while ((byteCount = input.read(bytes)) > 0) {
                                out.write(bytes, 0, byteCount);
                            }
                            out.close();
                            input.close();
                            */
                        } catch (IOException e) {
                            e.printStackTrace();
                        }
                        // Image count
                        j++;
                    }
                }
            }
        }
        System.out.println(j);
    }
3.4.2. Process description

This method obtains each PDImageXObject image object from the source PDF and then processes it. The code extracts every image object and inserts it into a blank PDF document.
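The code above draws every image at position (20, 20) with a fixed scale of 1, so an image whose pixel dimensions exceed the page size would probably run past the page edge. One way to handle this is to compute a scale factor before drawing. The following is a minimal sketch that uses only the Java standard library; the class name ImageScale, the method fitScale, and the sample sizes are hypothetical and are not part of PDFBox or the original code.

```java
public class ImageScale {

    // Returns the largest scale (capped at 1) at which an image of the given
    // size fits inside the page area that remains after subtracting the margin
    // on every side. All lengths are expected in the same unit.
    static float fitScale(float imageWidth, float imageHeight,
                          float pageWidth, float pageHeight, float margin) {
        float availableWidth = pageWidth - 2 * margin;
        float availableHeight = pageHeight - 2 * margin;
        if (imageWidth <= 0 || imageHeight <= 0
                || availableWidth <= 0 || availableHeight <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        float scale = Math.min(availableWidth / imageWidth,
                               availableHeight / imageHeight);
        // Never enlarge an image that already fits
        return Math.min(scale, 1f);
    }

    public static void main(String[] args) {
        // Hypothetical values: a 1110 x 400 image on a 595 x 842 page, margin 20
        float scale = fitScale(1110f, 400f, 595f, 842f, 20f);
        System.out.println(scale);
    }
}
```

The result could then replace the fixed scale value that is passed to drawImage together with image.getWidth() and image.getHeight().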

Note that the commented-out part of the code above was intended to generate image files directly, but the files it produced turned out to be invalid. If this code gives you a new idea, feel free to keep experimenting.
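A likely cause of the invalid files is that the bytes returned by createInputStream() are not PNG-encoded, so copying them into a .png file does not produce a readable image. A common alternative is to obtain a decoded java.awt.image.BufferedImage (in PDFBox 2.0, PDImageXObject.getImage() returns one) and let javax.imageio.ImageIO do the encoding. The sketch below shows only the encoding step with the standard library; the class name ImageSaver, the method savePng, and the small stand-in image are hypothetical, chosen so that the example runs without a PDF file.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

public class ImageSaver {

    // Encodes a decoded image as PNG and writes it to the target file.
    // Returns false if no suitable PNG writer is available.
    static boolean savePng(BufferedImage image, File target) throws IOException {
        return ImageIO.write(image, "png", target);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the BufferedImage that would come from the PDF
        BufferedImage image = new BufferedImage(4, 3, BufferedImage.TYPE_INT_RGB);

        File target = File.createTempFile("pdfbox-image", ".png");
        target.deleteOnExit();

        boolean written = savePng(image, target);

        // Read the file back to confirm that it is a valid image
        BufferedImage readBack = ImageIO.read(target);
        System.out.println(written + " " + readBack.getWidth() + "x" + readBack.getHeight());
    }
}
```

In the extraction loop, the stand-in image would be replaced by the result of image.getImage(), and the temporary file by the desired output path.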

3.4.3. Execution results

The source PDF file contains 19 images.

The run generates 19 PDF files, each containing a single image.

4. Summary

This article only covers using the Apache PDFBox libraries to read PDF documents. Other, more complex features are not covered, and readers will need to explore and try them on their own.
