Document directory
- 7.1 Using PDFBox to process PDF documents
- 7.1.1 Downloading PDFBox
- 7.1.2 Configuring the project in Eclipse
- 7.1.3 Using PDFBox to parse PDF content
- 7.1.4 Running result
- 7.1.5 Integration with Lucene
In the earlier chapters of this book, every file was handled as plain text. In practice, however, much of the information people save is not stored as plain text. The most popular document formats today are Adobe PDF, Microsoft Word, and Excel. To process such files you cannot simply read characters from them; you must extract the content according to each format's structure. This chapter introduces popular processing tools for the PDF, Word, and Excel formats one by one.
7.1 Using PDFBox to process PDF documents
PDF stands for Portable Document Format, an electronic file format developed by Adobe. The format is independent of the operating system platform and can be used on Windows, UNIX, Mac OS, and other operating systems.
A PDF file encapsulates text, fonts, formatting, colors, and device- and resolution-independent images in a single file. To extract the text, you need to parse the file according to its format. Fortunately, many existing tools can do this work for us.
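To illustrate that a PDF is a structured file rather than plain text, the following minimal sketch checks whether the first bytes of a file carry the PDF signature "%PDF-". The class name PdfSignature and the method looksLikePdf are hypothetical names chosen for this illustration; they are not part of any PDF library, and a signature check says nothing about whether the rest of the file is valid.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; it inspects the file signature and nothing else.
final class PdfSignature {
    // A PDF file begins with the ASCII bytes "%PDF-" followed by a version number such as 1.4.
    private static final byte[] MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    // Returns true if the given leading bytes start with the PDF signature.
    static boolean looksLikePdf(byte[] head) {
        if (head == null || head.length < MAGIC.length) {
            return false;
        }
        for (int i = 0; i < MAGIC.length; i++) {
            if (head[i] != MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] pdfHead = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        byte[] textHead = "plain text".getBytes(StandardCharsets.US_ASCII);
        System.out.println(looksLikePdf(pdfHead));  // true
        System.out.println(looksLikePdf(textHead)); // false
    }
}
```

In a real program the leading bytes would be read from the file itself; the text after the signature is encoded in format-specific structures, which is why a dedicated parser is needed.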
7.1.1 Downloading PDFBox
The most commonly used PDF text extraction tool is PDFBox. Visit http://sourceforge.net/projects/pdfbox/ to reach the download page shown in Figure 7-1.
Figure 7-1 Download page of PDFBox
You can download the latest version from this page. This book uses version PDFBox-0.7.3. PDFBox is an open-source Java PDF library that gives you access to the information in PDF files. The following example demonstrates how to extract text from a PDF file using the APIs provided by PDFBox.
7.1.2 Configuring the project in Eclipse
The following steps create a project in Eclipse and prepare it for a utility class that parses PDF files.
(1) Create an ordinary Java project named ch7 in the Eclipse workspace.
(2) Decompress the downloaded PDFBox-0.7.3.zip file. The resulting directory structure is shown in Figure 7-2.
Figure 7-2 Directory structure of the unzipped PDFBox package
(3) Go to the external directory, which contains all the external packages used by PDFBox. Copy the following jar packages to the lib directory of the ch7 project (create the lib directory first if it does not exist).
bcmail-jdk14-132.jar
bcprov-jdk14-132.jar
checkstyle-all-4.2.jar
FontBox-0.1.0-dev.jar
lucene-core-2.0.0.jar
Then copy PDFBox-0.7.3.jar from the lib directory of PDFBox to the lib directory of the project.
(4) Right-click the project and choose "Build Path -> Configure Build Path -> Add JARs" from the shortcut menu, then add the packages under the project's lib directory to the project's build path. The complete project directory on the author's machine is shown in Figure 7-3.
Figure 7-3 Project directory
7.1.3 Using PDFBox to parse PDF content
In the Eclipse project just created, create a ch7.pdfbox package and a PdfboxTest class inside it. The class contains a getText method that obtains the text of a PDF file. The code is as follows.
Code 7.1:
// Requires imports: java.io.File, java.io.FileOutputStream, java.io.OutputStreamWriter,
// java.io.Writer, java.net.MalformedURLException, java.net.URL,
// org.pdfbox.pdmodel.PDDocument, org.pdfbox.util.PDFTextStripper
public void getText(String file) throws Exception {
    // Whether to sort the text by position
    boolean sort = false;
    // PDF file name
    String pdfFile = file;
    // Name of the output text file
    String textFile = null;
    // Encoding of the output text file
    String encoding = "UTF-8";
    // First page to extract
    int startPage = 1;
    // Last page to extract
    int endPage = Integer.MAX_VALUE;
    // Output stream used to write the text file
    Writer output = null;
    // PDF document held in memory
    PDDocument document = null;
    try {
        try {
            // First try to load the file as a URL; if that fails, load it from the local file system
            URL url = new URL(pdfFile);
            document = PDDocument.load(url);
            // Obtain the PDF file name
            String fileName = url.getFile();
            // Name the generated text file after the original PDF
            if (fileName.length() > 4) {
                File outputFile = new File(fileName.substring(0, fileName.length() - 4) + ".txt");
                textFile = outputFile.getName();
            }
        } catch (MalformedURLException e) {
            // The location is not a valid URL, so load the document from the file system
            document = PDDocument.load(pdfFile);
            if (pdfFile.length() > 4) {
                textFile = pdfFile.substring(0, pdfFile.length() - 4) + ".txt";
            }
        }
        // Output stream that writes to the file named by textFile
        output = new OutputStreamWriter(new FileOutputStream(textFile), encoding);
        // PDFTextStripper performs the text extraction
        PDFTextStripper stripper = new PDFTextStripper();
        // Set whether to sort
        stripper.setSortByPosition(sort);
        // Set the start page
        stripper.setStartPage(startPage);
        // Set the end page
        stripper.setEndPage(endPage);
        // Call PDFTextStripper's writeText to extract the text and write it out
        stripper.writeText(document, output);
    } finally {
        if (output != null) {
            // Close the output stream
            output.close();
        }
        if (document != null) {
            // Close the PDF document
            document.close();
        }
    }
}
In the code above, the getText method takes a String parameter that specifies the location of the PDF file to extract, which can be a URL or a local file. The method then creates an instance of the PDFTextStripper class provided by PDFBox and sets several properties of the extraction process (such as the start page and whether to sort). Finally, it extracts the text and writes it to the output file.
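The way the method decides between a URL and a local file, and derives the name of the output text file, can be isolated in a small sketch that uses only the standard library. The class ExtractionTarget and its methods are hypothetical names introduced for this illustration. Unlike the listing above, this variation also produces a name when the input does not end in ".pdf", which is an assumption about the desired behavior rather than something the original code does.

```java
import java.io.File;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Locale;

// Hypothetical helper for illustration; it performs no PDF parsing.
final class ExtractionTarget {
    // Replaces a trailing ".pdf" with ".txt"; otherwise simply appends ".txt".
    static String toTextFileName(String pdfName) {
        if (pdfName.toLowerCase(Locale.ROOT).endsWith(".pdf")) {
            return pdfName.substring(0, pdfName.length() - 4) + ".txt";
        }
        return pdfName + ".txt";
    }

    // Mirrors the try/catch structure of the listing: treat the location as a URL first,
    // and fall back to a local path when it cannot be parsed as a URL.
    static String outputNameFor(String location) {
        try {
            URL url = new URL(location);
            // For a URL, keep only the last path segment as the output name
            String name = new File(url.getFile()).getName();
            return toTextFileName(name);
        } catch (MalformedURLException e) {
            // Not a URL (for example a Windows path), so keep the full local path
            return toTextFileName(location);
        }
    }

    public static void main(String[] args) {
        System.out.println(outputNameFor("http://example.com/docs/a.pdf"));
        System.out.println(outputNameFor("C://index.pdf"));
    }
}
```

A path such as "C://index.pdf" takes the fallback branch because "C" is not a protocol that java.net.URL recognizes, so the constructor throws MalformedURLException.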
7.1.4 Running result
Next, let's see this method in action by adding a main method to PdfboxTest. The code is as follows.
public static void main(String[] args) {
    PdfboxTest test = new PdfboxTest();
    try {
        // Extract the content of index.pdf on the C drive
        test.getText("C://index.pdf");
    } catch (Exception e) {
        e.printStackTrace();
    }
}
The file to be processed here is index.pdf, shown in Figure 7-4.
Figure 7-4 PDF document to be parsed
The text file produced by PdfboxTest is shown in Figure 7-5.
Figure 7-5 Processing result
We can see that the text in the PDF has been extracted and saved in the text file. The hyperlink "Poi news webblog" in the first line has become plain text in the text file. Readers can explore the other features of PDFBox using the API documentation it provides.
7.1.5 Integration with Lucene
PDFBox also provides integration with Lucene, offering a simple way to add PDF documents to a Lucene index. See the following code:
Document luceneDocument = LucenePDFDocument.getDocument(...);
Here, LucenePDFDocument is a class provided by PDFBox. Its getDocument method has three overloads, which take a File object, an InputStream object, or a URL object respectively, and build a Lucene Document object from the PDF passed in.
After obtaining a Lucene Document from a PDF document with PDFBox, you can add it directly to a Lucene index using IndexWriter. LucenePDFDocument automatically extracts various metadata fields from the PDF file and adds them to the Document. The extracted fields are listed in Table 7-1.
Table 7-1 Lucene document fields generated by PDFBox
Lucene field name | Description
path | File system path (if the document is loaded from a file)
url | URL address (if the document is loaded from the network)
contents | The content of the entire document, indexed but not stored
summary | The first 500 characters of the document
modified | Last modification time
uid | Unique ID of the document
CreationDate | Retrieved from the PDF metadata
Creator | Retrieved from the PDF metadata
Keywords | Retrieved from the PDF metadata
ModificationDate | Retrieved from the PDF metadata
Producer | Retrieved from the PDF metadata
Subject | Retrieved from the PDF metadata
Trapped | Retrieved from the PDF metadata
Next, create a new PdfLuceneTest class in the ch7.pdfbox package that uses LucenePDFDocument directly. The code for this class is as follows.
Code 7.2:
import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.pdfbox.searchengine.lucene.LucenePDFDocument;

public class PdfLuceneTest {
    public static void main(String[] args) {
        try {
            // IndexWriter stores the index under D:/index
            IndexWriter writer = new IndexWriter("D://index",
                    new StandardAnalyzer(), true);
            // Obtain the Lucene Document generated from the PDF file
            Document d = LucenePDFDocument
                    .getDocument(new File("C://index.pdf"));
            // Write the document to the index
            writer.addDocument(d);
            // Close the index writer
            writer.close();
            // Read the index in D:/index to create an IndexSearcher
            IndexSearcher searcher = new IndexSearcher("D://index");
            // Search the contents field of the index for the keyword poi
            Term t = new Term("contents", "poi");
            // Generate a Query from the Term
            Query q = new TermQuery(t);
            // Run the search and obtain the result set
            Hits hits = searcher.search(q);
            // Print the result set
            for (int i = 0; i < hits.length(); i++) {
                System.out.println(hits.doc(i));
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
The program uses the getDocument method of LucenePDFDocument to obtain a Lucene Document directly from the PDF file; the Document contains fields such as path, url, modified, contents, and summary. It writes the Document to the index, then creates an IndexSearcher and searches the contents field for the keyword "poi". The keyword must be lowercase because StandardAnalyzer lowercases tokens at indexing time, while a TermQuery is matched exactly without analysis. The result of running the program is shown in Figure 7-6.
Figure 7-6 Result of running the search code
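The requirement that the search keyword be lowercase can be illustrated with a toy model that uses only the standard library. The sketch below is not Lucene and does not reproduce its internals; the class LowercaseTermDemo is a hypothetical in-memory index that, like an analyzer that lowercases tokens, normalizes words when documents are added but performs exact lookups on the query term.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy inverted index for illustration only; it is not part of Lucene or PDFBox.
final class LowercaseTermDemo {
    // Maps each lowercased token to the IDs of the documents that contain it
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Splits the text on non-word characters and indexes each token in lowercase.
    void add(int docId, String text) {
        for (String token : text.split("\\W+")) {
            if (token.isEmpty()) {
                continue;
            }
            postings.computeIfAbsent(token.toLowerCase(Locale.ROOT), k -> new TreeSet<>())
                    .add(docId);
        }
    }

    // Exact lookup: the term is used as given, with no normalization.
    Set<Integer> termLookup(String term) {
        return postings.getOrDefault(term, Collections.<Integer>emptySet());
    }

    public static void main(String[] args) {
        LowercaseTermDemo index = new LowercaseTermDemo();
        index.add(0, "POI news weblog");
        System.out.println(index.termLookup("poi")); // [0]
        System.out.println(index.termLookup("Poi")); // []
    }
}
```

Because the index only ever stores "poi", a lookup for "Poi" finds nothing, which is the same effect as passing a mixed-case keyword to an exact term match.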