7.1 Using PDFBox to process PDF documents


So far in this book, everything we have processed has been a plain text file. In practice, however, people rarely store information as plain text. The more popular storage formats today are Adobe's PDF and Microsoft's Word and Excel. Such files cannot be handled by simply reading characters from them; the content has to be extracted according to each format's own structure. This chapter introduces tools for processing the popular PDF, Word, and Excel formats.

PDF, short for Portable Document Format, is an electronic file format developed by Adobe. The format is independent of the operating system and can be used on Windows, UNIX, Mac OS, and other platforms.
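As a small illustration of why a PDF cannot be treated as plain text, the sketch below checks whether some bytes start with the "%PDF-" marker that a conforming PDF file begins with. The class and method names (PdfHeaderCheck, hasPdfHeader, looksLikePdf) are made up for this illustration and are not part of PDFBox; it uses only the JDK.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of PDFBox: detects the "%PDF-" file header.
class PdfHeaderCheck {
    private static final byte[] MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    // True if the given bytes start with the "%PDF-" marker.
    static boolean hasPdfHeader(byte[] data) {
        if (data.length < MAGIC.length) {
            return false;
        }
        for (int i = 0; i < MAGIC.length; i++) {
            if (data[i] != MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    // Reads only the first few bytes of the file and checks them.
    static boolean looksLikePdf(Path file) throws IOException {
        byte[] head = new byte[MAGIC.length];
        int total = 0;
        try (InputStream in = Files.newInputStream(file)) {
            while (total < head.length) {
                int n = in.read(head, total, head.length - total);
                if (n < 0) {
                    break; // file is shorter than the marker
                }
                total += n;
            }
        }
        return total == head.length && hasPdfHeader(head);
    }
}
```

A check like this only says that a file claims to be a PDF. Getting the text out still requires a real parser such as PDFBox.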

A PDF file encapsulates text, fonts, formatting, colors, and graphics in a single file, in a way that is independent of device and resolution. To extract the text from it, you have to parse the file according to the PDF format. Fortunately, many tools are available to do this for us.

7.1.1 Downloading PDFBox

One of the most commonly used PDF text extraction tools is PDFBox. Visit http://sourceforge.net/projects/pdfbox/ to reach the download page shown in Figure 7-1.

Figure 7-1 PDFBox download page

Readers can download the latest version from this page. This book is based on PDFBox-0.7.3. PDFBox is an open source Java PDF library that gives you access to the various information in a PDF file. The following example demonstrates how to extract text from a PDF file using the API provided by PDFBox.

7.1.2 Configuring PDFBox in Eclipse

The following steps create a project in Eclipse and a utility class for parsing PDF files.

(1) Create an ordinary Java project named ch7 in the Eclipse workspace.

(2) Unzip the downloaded PDFBox-0.7.3.zip. The directory structure after unzipping is shown in Figure 7-2.

Figure 7-2 The PDFBox package after unzipping

(3) Open the external directory, which contains all the external packages that PDFBox uses. Copy the following jar files into the lib directory of project ch7 (create the lib directory first if it does not exist yet).

• bcmail-jdk14-132.jar

• bcprov-jdk14-132.jar

• checkstyle-all-4.2.jar

• FontBox-0.1.0-dev.jar

• lucene-core-2.0.0.jar

Then copy PDFBox-0.7.3.jar from the PDFBox lib directory into the project's lib directory.

(4) Right-click the project, choose "Build Path -> Configure Build Path -> Add JARs" from the shortcut menu, and add the jar files under the project's lib directory to the project's build path. The complete project layout on the author's machine is shown in Figure 7-3.

Figure 7-3 Project screenshot

7.1.3 Using PDFBox to parse PDF content
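Before writing the parser, it can be worth confirming that the jar files added in step (4) are actually visible to the project. The following is a minimal sketch using only the JDK. The class name ClasspathCheck is made up for this illustration, and the class names it probes are the ones assumed for PDFBox 0.7.3 and Lucene 2.0.0; adjust them if your versions differ.

```java
// Hypothetical helper: reports whether given classes can be found on the classpath.
class ClasspathCheck {

    // True if the named class can be located, without initializing it.
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed class names for PDFBox 0.7.3 and Lucene 2.0.0.
        String[] required = {
            "org.pdfbox.pdmodel.PDDocument",
            "org.pdfbox.util.PDFTextStripper",
            "org.apache.lucene.index.IndexWriter"
        };
        for (String name : required) {
            System.out.println(name + " -> " + (isOnClasspath(name) ? "found" : "MISSING"));
        }
    }
}
```

If any entry is reported as missing, revisit the build path settings from step (4) before continuing.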

In the Eclipse project you just created, create a package named ch7.pdfbox and, in it, a class named PdfboxTest. The class contains a getText method that extracts the text from a PDF file, with the following code.

Code 7.1

    // Imports needed at the top of PdfboxTest.java
    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.OutputStreamWriter;
    import java.io.Writer;
    import java.net.MalformedURLException;
    import java.net.URL;
    import org.pdfbox.pdmodel.PDDocument;
    import org.pdfbox.util.PDFTextStripper;

    public void getText(String file) throws Exception {
        // Whether to sort the text by position
        boolean sort = false;
        // Name of the PDF file
        String pdfFile = file;
        // Name of the output text file
        String textFile = null;
        // Encoding of the output file
        String encoding = "UTF-8";
        // First page to extract
        int startPage = 1;
        // Last page to extract
        int endPage = Integer.MAX_VALUE;
        // Writer used to produce the text file
        Writer output = null;
        // PDF document held in memory
        PDDocument document = null;
        try {
            try {
                // First try to load the file as a URL; if that throws an
                // exception, load it from the local file system instead
                URL url = new URL(pdfFile);
                document = PDDocument.load(url);
                // Get the file name of the PDF
                String fileName = url.getFile();
                // Name the new .txt file after the original PDF
                if (fileName.length() > 4) {
                    File outputFile = new File(fileName.substring(0, fileName.length() - 4) + ".txt");
                    textFile = outputFile.getName();
                }
            } catch (MalformedURLException e) {
                // Loading as a URL failed, so load from the file system
                document = PDDocument.load(pdfFile);
                if (pdfFile.length() > 4) {
                    textFile = pdfFile.substring(0, pdfFile.length() - 4) + ".txt";
                }
            }
            // Open the output stream that writes to textFile
            output = new OutputStreamWriter(new FileOutputStream(textFile), encoding);
            // PDFTextStripper extracts the text
            PDFTextStripper stripper = new PDFTextStripper();
            // Set whether to sort
            stripper.setSortByPosition(sort);
            // Set the start page
            stripper.setStartPage(startPage);
            // Set the end page
            stripper.setEndPage(endPage);
            // Call PDFTextStripper's writeText to extract and output the text
            stripper.writeText(document, output);
        } finally {
            if (output != null) {
                // Close the output stream
                output.close();
            }
            if (document != null) {
                // Close the PDF document
                document.close();
            }
        }
    }
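The URL-or-local-path fallback in Code 7.1 can be tried in isolation, without PDFBox on the classpath. The sketch below reproduces only the file-name logic. The class and method names (OutputNameSketch, textFileNameFor) are made up for this illustration, and example.com is a placeholder host.

```java
import java.io.File;
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical helper that mirrors only the naming logic of Code 7.1.
class OutputNameSketch {

    // Returns the name of the .txt file for a PDF location, or null when
    // the name is too short to carry a ".pdf" style extension.
    static String textFileNameFor(String location) {
        try {
            // new URL(String) mirrors Code 7.1; recent JDKs mark it deprecated.
            URL url = new URL(location);
            String fileName = url.getFile();
            if (fileName.length() > 4) {
                // Keep only the last path segment, with ".txt" as the extension.
                return new File(fileName.substring(0, fileName.length() - 4) + ".txt").getName();
            }
            return null;
        } catch (MalformedURLException e) {
            // Not a URL the JDK understands: treat it as a local path.
            if (location.length() > 4) {
                return location.substring(0, location.length() - 4) + ".txt";
            }
            return null;
        }
    }
}
```

A Windows-style path such as "C://index.pdf" takes the catch branch, because the JDK has no handler for a protocol named "c". Note also that when the name has four characters or fewer, the text file name stays null, so a caller should check for null before opening an output stream.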

In Code 7.1, the getText method receives a String parameter giving the location of the PDF file to extract, which can be either a URL or a local file path. The method then uses the PDFTextStripper class provided by PDFBox, setting a few properties (start page, end page, whether to sort by position) before extraction. Finally, the text is extracted and written to the output file.

7.1.4 Running the example

To see the method in action, add a main method to PdfboxTest with the following code.

    public static void main(String[] args) {
        PdfboxTest test = new PdfboxTest();
        try {
            // Extract the contents of index.pdf on drive C
            test.getText("C://index.pdf");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

Suppose we have a file named index.pdf, whose content is shown in Figure 7-4.

Figure 7-4 Content of the PDF document to parse

The text file produced by PdfboxTest is shown in Figure 7-5.

Figure 7-5 The result of the processing

As you can see, the text in the PDF has been extracted and saved to a text file. The hyperlink on line 4, "POI News Webblog", appears in the text file as ordinary text. Readers can explore additional functionality in the API documentation provided with PDFBox.

7.1.5 Integration with Lucene

PDFBox also provides integration with Lucene, offering a simple way to add PDF documents to a Lucene index, as in the following code:

    Document luceneDocument = LucenePDFDocument.getDocument(...);

Here, LucenePDFDocument is a class provided by PDFBox. Its getDocument method has three overloads, which accept a File, an InputStream, or a URL as the parameter. Each one extracts the content of the given PDF and builds a Lucene Document object from it.

Once you have obtained a Lucene Document from a PDF file through PDFBox, you can add it to a Lucene index directly with IndexWriter. LucenePDFDocument automatically extracts various metadata fields from the PDF file and adds them to the Document. The fields it extracts are listed in Table 7-1.

Table 7-1 Fields of the Lucene Document generated by PDFBox

Lucene field name    Description
path                 File system path (if the document is loaded from a file)
url                  URL address (if the document is loaded from the network)
contents             The contents of the entire document, indexed but not stored
summary              The first 500 characters of the document content
modified             Last modification time
uid                  Unique ID of the document
CreationDate         Taken from the PDF metadata
Creator              Taken from the PDF metadata
Keywords             Taken from the PDF metadata
ModificationDate     Taken from the PDF metadata
Producer             Taken from the PDF metadata
Subject              Taken from the PDF metadata
Trapped              Taken from the PDF metadata

Next, we use LucenePDFDocument to index a PDF directly. Create a new PdfLuceneTest class in the ch7.pdfbox package with the following code.

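Code 7.2

    package ch7.pdfbox;

    import java.io.File;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;
    import org.pdfbox.searchengine.lucene.LucenePDFDocument;

    public class PdfLuceneTest {
        public static void main(String[] args) {
            try {
                // IndexWriter stores the index in D:/index
                IndexWriter writer = new IndexWriter("D://index",
                        new StandardAnalyzer(), true);
                // LucenePDFDocument returns the Lucene Document generated from the PDF
                Document d = LucenePDFDocument
                        .getDocument(new File("C://index.pdf"));
                // Write the document to the index
                writer.addDocument(d);
                // Close the index writer
                writer.close();
                // Open the index under D:/index and create an IndexSearcher
                IndexSearcher searcher = new IndexSearcher("D://index");
                // Look for the keyword "poi" in the contents field of the index
                Term t = new Term("contents", "poi");
                // Build a query from the term
                Query q = new TermQuery(t);
                // Search and get the result set
                Hits hits = searcher.search(q);
                // Print the result set
                for (int i = 0; i < hits.length(); i++) {
                    System.out.println(hits.doc(i));
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }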

This program uses LucenePDFDocument's getDocument method to obtain a Lucene Document directly from a PDF file; the Document contains fields such as path, url, modified, contents, and summary. The program writes the Document to the index, then creates an IndexSearcher and searches the contents field for the keyword "poi". Note that the keyword must be lowercase: StandardAnalyzer lowercases terms when the document is indexed, while a TermQuery is matched against the index exactly as written. The result of running the program is shown in Figure 7-6.

Figure 7-6 Running results of the search code
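The reason the keyword has to be lowercase can be seen in a toy model that uses only the JDK. The sketch below is not Lucene: it is a deliberately simplified inverted index (the class LowercaseIndexSketch and its methods are made up for this illustration) that lowercases tokens when text is added, as StandardAnalyzer does, and then looks terms up verbatim, as a TermQuery does.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Toy inverted index, not Lucene: tokens are lowercased when indexed,
// but lookups compare the term exactly as given.
class LowercaseIndexSketch {
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Split text on non-alphanumeric characters and index lowercased tokens.
    void add(int docId, String text) {
        for (String token : text.split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token.toLowerCase(Locale.ROOT),
                        k -> new HashSet<>()).add(docId);
            }
        }
    }

    // Exact lookup with no analysis of the term, like a term query.
    Set<Integer> exactTermLookup(String term) {
        return postings.getOrDefault(term, new HashSet<>());
    }
}
```

Indexing the text "POI News Webblog" as document 1 and then looking up "poi" finds that document, while looking up "POI" finds nothing, because only the lowercased form was ever stored.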

