7.1 Use PDFBox to process PDF documents

Source: Internet
Author: User
Tags: parsing pdf files
Table of contents
  • 7.1 Use PDFBox to process PDF documents
  • 7.1.1 Download of PDFBox
  • 7.1.2 Configure in Eclipse
  • 7.1.3 Use PDFBox to parse PDF content
  • 7.1.4 Running Effect
  • 7.1.5 Integration with Lucene

Earlier chapters of this book handled every file as plain text. In practice, however, much information is stored in other formats; the most popular today are Adobe PDF, Microsoft Word, and Excel. Such files cannot be processed by simply reading characters from them: the content has to be extracted according to each format's structure. This chapter introduces popular tools for processing PDF, Word, and Excel files in turn.

7.1 Use PDFBox to process PDF documents

PDF stands for Portable Document Format, an electronic file format developed by Adobe. The format is independent of the operating system platform and can be used on Windows, UNIX, Mac OS, and other operating systems.

A PDF file encapsulates text, fonts, formatting, colors, and device- and resolution-independent images in a single file. To extract the text, you need to parse the file according to its format. Fortunately, many tools already exist for this task.
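To illustrate that a PDF is a structured format rather than plain text, the following minimal sketch checks whether a file's leading bytes carry the "%PDF-" marker that the PDF specification places at the start of a file. The class and method names are hypothetical and are not part of PDFBox; real parsers do far more than this.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class PdfHeaderCheck {
    // Per the PDF specification, a file begins with "%PDF-" followed by a version such as 1.4.
    private static final byte[] MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the given leading bytes of a file begin with the PDF marker. */
    static boolean looksLikePdf(byte[] head) {
        if (head == null || head.length < MAGIC.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(head, MAGIC.length), MAGIC);
    }

    public static void main(String[] args) {
        byte[] pdfHead = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        byte[] textHead = "plain text".getBytes(StandardCharsets.US_ASCII);
        System.out.println(looksLikePdf(pdfHead));  // prints true
        System.out.println(looksLikePdf(textHead)); // prints false
    }
}
```

Everything after the header (objects, streams, fonts) is encoded according to the format, which is why a dedicated library is needed to recover the text.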

7.1.1 Download of PDFBox

One of the most commonly used PDF text extraction tools is PDFBox. Visit http://sourceforge.net/projects/pdfbox/ to reach the download page, shown in Figure 7-1.

Figure 7-1 Download page of PDFBox

You can download the latest version from this page. This book uses PDFBox-0.7.3. PDFBox is an open-source Java PDF library that gives you access to the information in PDF files. The following example demonstrates how to extract text from a PDF file using the APIs provided by PDFBox.

7.1.2 Configure in Eclipse

The following steps create a project in Eclipse and a utility class for parsing PDF files.

(1) Create an ordinary Java project named ch7 in the Eclipse workspace.

(2) Decompress the downloaded PDFBox-0.7.3.zip file. The resulting directory structure is shown in Figure 7-2.

Figure 7-2 Contents of the unzipped PDFBox package

(3) Open the external directory, which contains all the external packages that PDFBox uses. Copy the following jar packages to the lib directory of the ch7 project (if the lib directory does not exist yet, create it first).

bcmail-jdk14-132.jar

bcprov-jdk14-132.jar

checkstyle-all-4.2.jar

FontBox-0.1.0-dev.jar

lucene-core-2.0.0.jar

Then copy PDFBox-0.7.3.jar from the lib directory of PDFBox to the lib directory of the project.

(4) Right-click the project, choose "Build Path -> Configure Build Path -> Add JARs" from the shortcut menu, and add the packages in the project's lib directory to the project's build path. The complete project directory on the author's machine is shown in Figure 7-3.

Figure 7-3 Project directory
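One way to confirm that the jars really are on the build path is a small classpath probe. This is only a sketch: the class ClasspathCheck is hypothetical, and the two class names it probes are assumptions about where PDFBox 0.7.3 and Lucene 2.0 keep their main classes, so adjust them if your jars differ.

```java
class ClasspathCheck {
    /** Returns true if the named class can be located on the current classpath. */
    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed class names for PDFBox 0.7.3 and Lucene 2.0.
        String[] required = {
            "org.pdfbox.pdmodel.PDDocument",
            "org.apache.lucene.index.IndexWriter"
        };
        for (String name : required) {
            System.out.println(name + " -> " + (isOnClasspath(name) ? "found" : "MISSING"));
        }
    }
}
```

Running it inside the ch7 project should report both classes as found once step (4) has been completed.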

7.1.3 Use PDFBox to parse PDF content

In the Eclipse project just created, create a ch7.pdfbox package and a PdfboxTest class in it. The class contains a getText method that extracts the text from a PDF file. The code is as follows.

Code 7.1:

// Imports needed at the top of PdfboxTest.java
// (PDFBox 0.7.3 is assumed to use the org.pdfbox package)
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.net.MalformedURLException;
import java.net.URL;
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.util.PDFTextStripper;

public void getText(String file) throws Exception {
    // Whether to sort the text by position
    boolean sort = false;
    // PDF file name
    String pdfFile = file;
    // Name of the output text file
    String textFile = null;
    // Output encoding
    String encoding = "UTF-8";
    // First page to extract
    int startPage = 1;
    // Last page to extract
    int endPage = Integer.MAX_VALUE;
    // Writer used to generate the text file
    Writer output = null;
    // PDF document held in memory
    PDDocument document = null;
    try {
        try {
            // First try to load the file as a URL; if that fails,
            // fall back to the local file system
            URL url = new URL(pdfFile);
            document = PDDocument.load(url);
            // Obtain the PDF file name
            String fileName = url.getFile();
            // Name the generated text file after the original PDF
            if (fileName.length() > 4) {
                File outputFile = new File(fileName.substring(0, fileName.length() - 4) + ".txt");
                textFile = outputFile.getName();
            }
        } catch (MalformedURLException e) {
            // Not a valid URL, so load from the file system
            document = PDDocument.load(pdfFile);
            if (pdfFile.length() > 4) {
                textFile = pdfFile.substring(0, pdfFile.length() - 4) + ".txt";
            }
        }
        // Open the output writer on the file named by textFile
        output = new OutputStreamWriter(new FileOutputStream(textFile), encoding);
        // PDFTextStripper performs the text extraction
        PDFTextStripper stripper = new PDFTextStripper();
        // Set whether to sort by position
        stripper.setSortByPosition(sort);
        // Set the start page
        stripper.setStartPage(startPage);
        // Set the end page
        stripper.setEndPage(endPage);
        // Call PDFTextStripper's writeText to extract and output the text
        stripper.writeText(document, output);
    } finally {
        if (output != null) {
            // Close the output writer
            output.close();
        }
        if (document != null) {
            // Close the PDF document
            document.close();
        }
    }
}

In the code above, the getText method takes a String parameter specifying the location of the PDF file to extract, which can be a URL or a local file path. The method then uses the PDFTextStripper class provided by PDFBox, setting several properties for the extraction (such as the start page and whether to sort by position). Finally, it extracts the text and writes it to a file.
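The way Code 7.1 decides between a URL and a local path, and derives the output file name from the input, can be isolated in a small standalone sketch. The class OutputNameSketch and its method are hypothetical names for illustration; the logic follows Code 7.1 and, like it, assumes the input ends in a four-character extension such as ".pdf".

```java
import java.io.File;
import java.net.MalformedURLException;
import java.net.URL;

class OutputNameSketch {
    /** Returns the .txt name used for the output, or null if the name is too short. */
    @SuppressWarnings("deprecation") // new URL(String) is deprecated in recent JDKs but mirrors Code 7.1
    static String deriveTextFileName(String pdfFile) {
        try {
            // A string with a known protocol (http, file, ...) is treated as a URL
            URL url = new URL(pdfFile);
            String fileName = url.getFile();
            if (fileName.length() <= 4) {
                return null;
            }
            // Keep only the last path segment, so the text file lands in the working directory
            return new File(fileName.substring(0, fileName.length() - 4) + ".txt").getName();
        } catch (MalformedURLException e) {
            // Anything else (including "C://..." with the unknown protocol "c") is a local path
            if (pdfFile == null || pdfFile.length() <= 4) {
                return null;
            }
            return pdfFile.substring(0, pdfFile.length() - 4) + ".txt";
        }
    }

    public static void main(String[] args) {
        System.out.println(deriveTextFileName("http://example.com/docs/report.pdf")); // prints report.txt
        System.out.println(deriveTextFileName("reports/a.pdf"));                      // prints reports/a.txt
    }
}
```

Note the asymmetry: for a URL only the file name is kept, while for a local path the text file is written next to the PDF.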

7.1.4 Running Effect

Next, let's look at this method in action by adding a main method to PdfboxTest. The code is as follows.

public static void main(String[] args) {
    PdfboxTest test = new PdfboxTest();
    try {
        // Extract the content of index.pdf on the C drive
        test.getText("C://index.pdf");
    } catch (Exception e) {
        e.printStackTrace();
    }
}

The file to be processed here is index.pdf, shown in Figure 7-4.

Figure 7-4 PDF document to be parsed

The text file produced by PdfboxTest is shown in Figure 7-5.

Figure 7-5 Processing result

As you can see, the text in the PDF has been extracted and saved in the text file. The hyperlink "Poi news webblog" on the first line appears as plain text in the text file. Readers can explore other features through the API documentation provided with PDFBox.

7.1.5 Integration with Lucene

PDFBox also provides integration with Lucene: a simple way to add PDF documents to a Lucene index. See the following code:

Document luceneDocument = LucenePDFDocument.getDocument(...);

Here LucenePDFDocument is a class provided by PDFBox. Its getDocument method has three overloads, which take a File object, an InputStream object, or a URL object respectively, and build a Lucene Document object from the PDF file passed in.

After obtaining a Lucene Document from a PDF file with PDFBox, you can add it to Lucene's index directly with IndexWriter. LucenePDFDocument automatically extracts various metadata fields from the PDF file and adds them to the Document. The extracted fields are listed in Table 7-1.

Table 7-1 Lucene document fields generated by PDFBox

Lucene field name    Description
path                 File system path (if the document is loaded from a file)
url                  URL address (if the document is loaded from the network)
contents             Content of the entire document, indexed but not stored
summary              First 500 characters of the document
modified             Last modification time
uid                  Unique ID of the document
CreationDate         Retrieved from the PDF metadata
Creator              Retrieved from the PDF metadata
Keywords             Retrieved from the PDF metadata
ModificationDate     Retrieved from the PDF metadata
Producer             Retrieved from the PDF metadata
Subject              Retrieved from the PDF metadata
Trapped              Retrieved from the PDF metadata
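As a rough illustration of the summary field in Table 7-1, a 500-character summary can be derived from the extracted text as in the sketch below. The class SummarySketch is hypothetical and only mirrors the description in the table; it is not claimed to be the code PDFBox itself uses.

```java
class SummarySketch {
    // Summary length taken from Table 7-1
    static final int SUMMARY_LENGTH = 500;

    /** Returns at most the first 500 characters of the extracted contents. */
    static String summarize(String contents) {
        if (contents == null) {
            return "";
        }
        return contents.substring(0, Math.min(contents.length(), SUMMARY_LENGTH));
    }

    public static void main(String[] args) {
        StringBuilder longText = new StringBuilder();
        for (int i = 0; i < 600; i++) {
            longText.append('x');
        }
        System.out.println(summarize(longText.toString()).length()); // prints 500
        System.out.println(summarize("short text"));                 // prints short text
    }
}
```

Because the contents field is indexed but not stored, a stored summary like this is what a search result page would typically display.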

The following example uses LucenePDFDocument directly. Create a new PdfLuceneTest class in the ch7.pdfbox package; the code for this class is as follows.

Code 7.2:

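// PDFBox 0.7.3 is assumed to use the org.pdfbox package
import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.pdfbox.searchengine.lucene.LucenePDFDocument;

public class PdfLuceneTest {
    public static void main(String[] args) {
        try {
            // IndexWriter stores the index under D:/Index; true creates a new index
            IndexWriter writer = new IndexWriter("D://Index", new StandardAnalyzer(), true);
            // Obtain the Lucene Document generated from the PDF file
            Document d = LucenePDFDocument.getDocument(new File("C://index.pdf"));
            // Write the document to the index
            writer.addDocument(d);
            // Close the index writer
            writer.close();
            // Open the index in D:/Index and create an IndexSearcher
            IndexSearcher searcher = new IndexSearcher("D://Index");
            // Search the contents field for the keyword "poi" (lower case)
            Term t = new Term("contents", "poi");
            // Build a query from the term
            Query q = new TermQuery(t);
            // Run the search and obtain the result set
            Hits hits = searcher.search(q);
            // Print the result set
            for (int i = 0; i < hits.length(); i++) {
                System.out.println(hits.doc(i));
            }
            // Release the searcher's resources
            searcher.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}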

The program uses the getDocument method of LucenePDFDocument to obtain a Lucene Document directly from a PDF file; the Document contains fields such as path, url, modified, contents, and summary, and is written straight into the index. It then creates an IndexSearcher and searches the contents field for the keyword "poi". Note that the term must be in lower case, because StandardAnalyzer lower-cases terms at index time while TermQuery matches its term exactly. The result of running the program is shown in Figure 7-6.

Figure 7-6 Result of running the search code
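Why the search term has to be lower case can be seen in a toy inverted index. The sketch below is a deliberately simplified stand-in, not Lucene code: the class LowercaseTermSketch and its methods are hypothetical, and its tokenizer merely imitates the lower-casing that StandardAnalyzer applies when indexing, while the lookup is exact like a TermQuery.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class LowercaseTermSketch {
    // term -> ids of the documents that contain it
    private final Map<String, Set<Integer>> index = new HashMap<>();

    /** Index time: split on non-alphanumeric characters and lower-case each token. */
    void add(int docId, String text) {
        for (String token : text.split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) {
                continue;
            }
            index.computeIfAbsent(token.toLowerCase(Locale.ROOT), k -> new HashSet<>()).add(docId);
        }
    }

    /** Query time: exact lookup with no analysis of the term. */
    Set<Integer> searchExact(String term) {
        return index.getOrDefault(term, Collections.<Integer>emptySet());
    }

    public static void main(String[] args) {
        LowercaseTermSketch sketch = new LowercaseTermSketch();
        sketch.add(1, "POI News Weblog");
        System.out.println(sketch.searchExact("poi")); // prints [1]
        System.out.println(sketch.searchExact("Poi")); // prints []
    }
}
```

The index only ever contains "poi", so an exact lookup for "Poi" finds nothing. In Lucene, a query built through a query parser with the same analyzer would lower-case the input for you; a hand-built TermQuery does not.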

 
