Apache PDFBox Development Guide: Reading PDF Documents

Last Update:2017-11-15 Source: Internet

Author: User

When reprinting, please cite the source: http://blog.csdn.net/loongshawn/article/details/51542309

Related articles:

"Apache PDFBox Development Guide: PDF Text Content Mining"
"Apache PDFBox Development Guide: Reading PDF Documents"

1. Introduction

Apache PDFBox is an open-source Java library for working with PDF documents. It can be used to create new PDF documents, modify existing ones, and extract content from them. Apache PDFBox also ships with a number of command-line tools.
Version 2.0.1, the latest at the time of writing, was released on April 26, 2016.

Note: the code in this article is written for version 2.0 and above.

Website address: https://pdfbox.apache.org/index.html

PDFBox 2.0.1 API online Documentation: https://pdfbox.apache.org/docs/2.0.1/javadocs/

2. Characteristics

Apache PDFBox provides the following main features:
reading, creating, printing, converting, validating, merging, and splitting PDF documents.

3. Hands-on development

3.1, Scenario description

1. Read the text content of a PDF. The sample reads the text of a medical report.
2. Extract the images from a PDF document. The sample only saves each image into a separate PDF. Writing image files directly is not implemented, but you can extend my code to do so, mainly by handling PDImageXObject objects.

3.2, the required jar package

pdfbox-2.0.1.jar

fontbox-2.0.1.jar

Add the two jar packages above to the project's library path.
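One way to confirm that both jars actually ended up on the classpath is to look up a representative class from each at startup. The sketch below is not from the original article: the class name `PdfBoxClasspathCheck` and the helper `isOnClasspath` are hypothetical, and the two probed class names are simply well-known classes from each jar.

```java
final class PdfBoxClasspathCheck {
    private PdfBoxClasspathCheck() {}

    /** Returns true if the named class can be located on the current classpath. */
    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, PdfBoxClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // One representative class per jar (chosen for illustration only).
        String[] required = {
            "org.apache.pdfbox.pdmodel.PDDocument", // from the pdfbox jar
            "org.apache.fontbox.ttf.TrueTypeFont"   // from the fontbox jar
        };
        for (String name : required) {
            System.out.println(name + " -> " + (isOnClasspath(name) ? "found" : "MISSING"));
        }
    }
}
```

Running `main` inside the project reports which of the two libraries, if any, is missing before any PDF code is executed.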

3.3, Text content extraction

3.3.1, Text content extraction

Create the PdfReader class and write the following method.

package com.loongshaw;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.pdfbox.io.RandomAccessBuffer;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PdfReader {

    public static void main(String[] args) {
        File pdfFile = new File("/users/dddd/downloads/0571888890423433356rrrr_182-93201510313223336-2.pdf");
        PDDocument document = null;
        try {
            // Method 1:
            /**
            InputStream input = null;
            input = new FileInputStream(pdfFile);
            // Load the PDF document
            PDFParser parser = new PDFParser(new RandomAccessBuffer(input));
            parser.parse();
            document = parser.getPDDocument();
            **/

            // Method 2:
            document = PDDocument.load(pdfFile);

            // Get the number of pages
            int pages = document.getNumberOfPages();

            // Read the text content
            PDFTextStripper stripper = new PDFTextStripper();
            // Output the text in reading order, sorted by position on the page
            stripper.setSortByPosition(true);
            stripper.setStartPage(1);
            stripper.setEndPage(pages);
            String content = stripper.getText(document);
            System.out.println(content);
        } catch (Exception e) {
            System.out.println(e);
        } finally {
            // Close the document to release the resources it holds
            if (document != null) {
                try {
                    document.close();
                } catch (IOException e) {
                    System.out.println(e);
                }
            }
        }
    }
}

3.3.2, Process description

A PDF file can be loaded in two ways. There is no obvious difference between them, but the second is more concise:

// Method 1:
InputStream input = null;
input = new FileInputStream(pdfFile);
// Load the PDF document
PDFParser parser = new PDFParser(new RandomAccessBuffer(input));
parser.parse();
document = parser.getPDDocument();

// Method 2:
document = PDDocument.load(pdfFile);

3.3.3, Execution results

3.4, Image extraction (added 2016-12-02)

3.4.1, Image extraction

// Imports used by readImage()
import java.io.File;
import java.io.IOException;
import java.util.Iterator;

import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.PDPageContentStream.AppendMode;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;

public static void readImage() {
    // PDF file to be parsed
    File pdfFile = new File("/users/xiaolong/downloads/test.pdf");
    // Blank PDF file that receives the extracted images
    File pdfFile_out = new File("/users/xiaolong/downloads/testout.pdf");

    PDDocument document = null;
    PDDocument document_out = null;
    try {
        document = PDDocument.load(pdfFile);
        document_out = PDDocument.load(pdfFile_out);
    } catch (IOException e) {
        e.printStackTrace();
        // Nothing can be extracted if either file failed to load
        return;
    }

    int pages_size = document.getNumberOfPages();
    System.out.println("getAllPages===============" + pages_size);
    int j = 0;

    for (int i = 0; i < pages_size; i++) {
        PDPage page = document.getPage(i);
        PDPage page1 = document_out.getPage(0);
        PDResources resources = page.getResources();
        Iterable<COSName> xobjects = resources.getXObjectNames();
        if (xobjects != null) {
            Iterator<COSName> imageIter = xobjects.iterator();
            while (imageIter.hasNext()) {
                COSName key = imageIter.next();
                if (resources.isImageXObject(key)) {
                    try {
                        PDImageXObject image = (PDImageXObject) resources.getXObject(key);

                        // Method 1: save the image from the PDF document into a blank PDF.
                        PDPageContentStream contentStream =
                                new PDPageContentStream(document_out, page1, AppendMode.APPEND, true);
                        float scale = 1f;
                        contentStream.drawImage(image, 20, 20, image.getWidth() * scale, image.getHeight() * scale);
                        contentStream.close();
                        document_out.save("/users/xiaolong/downloads/123" + j + ".pdf");
                        System.out.println(image.getSuffix() + "," + image.getHeight() + "," + image.getWidth());

                        /**
                        // Method 2: save each image in the PDF document as a separate image file.
                        File file = new File("/users/xiaolong/downloads/123" + j + ".png");
                        FileOutputStream out = new FileOutputStream(file);
                        InputStream input = image.createInputStream();
                        int byteCount = 0;
                        byte[] bytes = new byte[1024];
                        while ((byteCount = input.read(bytes)) > 0) {
                            out.write(bytes, 0, byteCount);
                        }
                        out.close();
                        input.close();
                        **/
                    } catch (IOException e) {
                        // TODO Auto-generated catch block
                        e.printStackTrace();
                    }
                    // Image count
                    j++;
                }
            }
        }
    }
    System.out.println(j);

    // Close both documents once all images have been processed
    try {
        document.close();
        document_out.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
}

3.4.2, Process description

This method obtains each PDImageXObject image object from the source PDF and then processes it. The code above extracts every image object and inserts it into a blank PDF document.
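The listing draws every image with a fixed scale of 1, so an image larger than the target page would extend beyond the page edges. A minimal sketch of computing a fitting scale is shown here. It is not part of the original code: `ImageScale` and `fitScale` are hypothetical names, the 612 x 792 page size assumes a US Letter page measured in points, and the margin of 20 simply mirrors the offset passed to drawImage.

```java
final class ImageScale {
    private ImageScale() {}

    /**
     * Returns a scale factor that fits an image of the given size into the
     * page area that remains after reserving a margin on every side.
     * The image is never enlarged, so the result is at most 1.
     */
    static float fitScale(float imageWidth, float imageHeight,
                          float pageWidth, float pageHeight, float margin) {
        if (imageWidth <= 0 || imageHeight <= 0) {
            throw new IllegalArgumentException("image size must be positive");
        }
        float availableWidth = pageWidth - 2 * margin;
        float availableHeight = pageHeight - 2 * margin;
        if (availableWidth <= 0 || availableHeight <= 0) {
            throw new IllegalArgumentException("margin leaves no room on the page");
        }
        // Use the tighter of the two ratios so both dimensions fit.
        float scale = Math.min(availableWidth / imageWidth, availableHeight / imageHeight);
        return Math.min(1f, scale);
    }

    public static void main(String[] args) {
        // A 1144 x 400 image on an assumed 612 x 792 page with a margin of 20.
        float scale = fitScale(1144f, 400f, 612f, 792f, 20f);
        System.out.println("scale = " + scale); // prints "scale = 0.5"
    }
}
```

The returned value could replace the constant assigned to `scale` before the drawImage call, with the real page dimensions substituted for the assumed ones.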

Note that the commented-out part of the code above was intended to write image files directly, but the files it produced turned out to be invalid. If this code gives you a new idea, you are welcome to keep experimenting.
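A likely reason the commented-out approach fails is that createInputStream() returns the decoded stream data rather than an encoded PNG, so copying those bytes into a .png file does not yield a valid image. One option worth trying is to obtain a BufferedImage from PDImageXObject.getImage() and let the JDK's ImageIO encode it. The sketch below shows only the encoding step using the standard library. `ImageFileWriter` and `savePng` are hypothetical names, and the blank sample image stands in for the result of image.getImage().

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

final class ImageFileWriter {
    private ImageFileWriter() {}

    /** Encodes the decoded image as PNG and writes it to the target file. */
    static void savePng(BufferedImage image, File target) throws IOException {
        // ImageIO.write returns false when no suitable writer accepts the image.
        if (!ImageIO.write(image, "png", target)) {
            throw new IOException("No PNG writer available for " + target);
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the BufferedImage that image.getImage() would return.
        BufferedImage sample = new BufferedImage(4, 3, BufferedImage.TYPE_INT_RGB);
        File target = File.createTempFile("pdfbox-image", ".png");
        savePng(sample, target);

        // Read the file back to confirm it is a valid image.
        BufferedImage reread = ImageIO.read(target);
        System.out.println(reread.getWidth() + "x" + reread.getHeight()); // prints "4x3"
        target.delete();
    }
}
```

Inside the extraction loop, the call would look roughly like savePng(image.getImage(), file), placed within the existing try block so that the IOException is handled.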

3.4.3, Execution results

The source PDF file contains 19 images.

19 PDFs are generated, each containing a single image.

4. Summary

This article only covers using the Apache PDFBox libraries to read PDF documents. More complex features are not covered here, and readers will need to explore and try them on their own.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
