So far in this book, everything we have processed has been a plain text file. In practice, however, people rarely store information as plain text. The more popular storage formats today are Adobe's PDF and Microsoft's Word and Excel. To process such files you cannot simply read characters from them; you have to extract the content according to each format's structure. This chapter introduces tools for handling the popular PDF, Word, and Excel formats.
7.1 Using PDFBox to Process PDF Documents
PDF stands for Portable Document Format, an electronic file format developed by Adobe. The format is independent of the operating system platform, so a PDF file can be used on Windows, UNIX, Mac OS, and other systems.
The PDF format encapsulates text, fonts, formatting, colors, and graphics in a single file, in a way that is independent of device and resolution. To extract text from such a file, you need to parse it according to its format. Fortunately, many tools are available to do this for us.
7.1.1 Downloading PDFBox
One of the most commonly used PDF text extraction tools is PDFBox. Visit http://sourceforge.net/projects/pdfbox/ to reach the download page shown in Figure 7-1.
Figure 7-1 PDFBox download page
Readers can download the latest version from this page. This book is based on PDFBox-0.7.3. PDFBox is an open source Java PDF library that gives you access to the various kinds of information in a PDF file. The example that follows demonstrates how to extract text from a PDF file using the API provided by PDFBox.
7.1.2 Configuring PDFBox in Eclipse
The following steps create a project in Eclipse and a utility class for parsing PDF files.
(1) Create an ordinary Java project named ch7 in the Eclipse workspace.
(2) Unzip the downloaded PDFBox-0.7.3.zip. The resulting directory structure is shown in Figure 7-2.
Figure 7-2 The PDFBox package after decompression
(3) Open the external directory, which contains all the third-party packages that PDFBox uses. Copy the following JAR files into the lib directory of the ch7 project (create the lib directory first if it does not exist yet).
- bcmail-jdk14-132.jar
- bcprov-jdk14-132.jar
- checkstyle-all-4.2.jar
- FontBox-0.1.0-dev.jar
- lucene-core-2.0.0.jar
Then copy PDFBox-0.7.3.jar from the PDFBox lib directory into the project's lib directory.
(4) Right-click the project and choose "Build Path -> Configure Build Path -> Add JARs" from the shortcut menu, then add the packages in the project's lib directory to the project's build path. The complete project layout on the author's machine is shown in Figure 7-3.
Figure 7-3 Project screenshot
7.1.3 Using PDFBox to Parse PDF Content
In the Eclipse project you just created, create a package named ch7.pdfbox and, inside it, a class named PdfboxTest. The class contains a getText method that extracts the text of a PDF file. Its code is as follows.
Code 7.1
// Imports needed at the top of PdfboxTest.java (PDFBox 0.7.3 uses the org.pdfbox packages)
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.net.MalformedURLException;
import java.net.URL;
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.util.PDFTextStripper;

public void getText(String file) throws Exception {
    // Whether to sort the text by position
    boolean sort = false;
    // PDF file name
    String pdfFile = file;
    // Output text file name
    String textFile = null;
    // Encoding
    String encoding = "UTF-8";
    // First page to extract
    int startPage = 1;
    // Last page to extract
    int endPage = Integer.MAX_VALUE;
    // Output stream used to generate the text file
    Writer output = null;
    // PDF document held in memory
    PDDocument document = null;
    try {
        try {
            // Try to load the file as a URL first; if that throws,
            // fall back to loading it from the local file system
            URL url = new URL(pdfFile);
            document = PDDocument.load(url);
            // Get the file name of the PDF
            String fileName = url.getFile();
            // Name the new .txt file after the original PDF
            if (fileName.length() > 4) {
                File outputFile = new File(fileName.substring(0, fileName.length() - 4) + ".txt");
                textFile = outputFile.getName();
            }
        } catch (MalformedURLException e) {
            // Not a valid URL, so load from the file system instead
            document = PDDocument.load(pdfFile);
            if (pdfFile.length() > 4) {
                textFile = pdfFile.substring(0, pdfFile.length() - 4) + ".txt";
            }
        }
        // Open the output stream that writes to textFile
        output = new OutputStreamWriter(new FileOutputStream(textFile), encoding);
        // PDFTextStripper extracts the text
        PDFTextStripper stripper = new PDFTextStripper();
        // Set whether to sort
        stripper.setSortByPosition(sort);
        // Set the start page
        stripper.setStartPage(startPage);
        // Set the end page
        stripper.setEndPage(endPage);
        // Call PDFTextStripper's writeText to extract and output the text
        stripper.writeText(document, output);
    } finally {
        if (output != null) {
            // Close the output stream
            output.close();
        }
        if (document != null) {
            // Close the PDF document
            document.close();
        }
    }
}
In the code above, the getText method takes a String parameter that specifies the PDF file to extract from. The location can be either a URL or a local file path. The method then uses the PDFTextStripper class provided by PDFBox, setting a few properties (such as the start page and whether to sort) before extraction. Finally, it extracts the text and writes it to a file.
7.1.4 Running the Example
To see the method in action, add a main method to PdfboxTest with the following code.
public static void main(String[] args) {
    PdfboxTest test = new PdfboxTest();
    try {
        // Extract the contents of index.pdf on drive C
        test.getText("C://index.pdf");
    } catch (Exception e) {
        e.printStackTrace();
    }
}
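Before looking at the output, it may help to see which branch of Code 7.1 a Windows path such as C://index.pdf takes. The sketch below reproduces only the naming logic of that method, using nothing but the standard library. The class OutputNameSketch and its method outputNameFor are hypothetical names introduced for illustration (they are not part of PDFBox), and the http://example.com address is a placeholder that is parsed but never fetched.

```java
import java.io.File;
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical helper, not part of PDFBox: it mirrors only the way
// Code 7.1 derives the output file name, without loading any PDF.
class OutputNameSketch {

    // Returns the name of the text file that would be written, or null
    // when the input is too short to carry a four-character extension.
    static String outputNameFor(String pdfFile) {
        try {
            // Parsing succeeds only for a protocol Java knows (http, file, ...)
            URL url = new URL(pdfFile);
            String fileName = url.getFile();
            if (fileName.length() > 4) {
                // Only the last path segment is kept, so the text file
                // lands in the current working directory
                return new File(fileName.substring(0, fileName.length() - 4) + ".txt").getName();
            }
            return null;
        } catch (MalformedURLException e) {
            // "C:" is read as an unknown protocol, so a Windows path ends up here
            if (pdfFile.length() > 4) {
                return pdfFile.substring(0, pdfFile.length() - 4) + ".txt";
            }
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(outputNameFor("C://index.pdf"));                     // C://index.txt
        System.out.println(outputNameFor("http://example.com/docs/index.pdf")); // index.txt
    }
}
```

Under these assumptions, the call in main above falls through to the file-system branch and writes its result next to the PDF as C://index.txt, whereas a PDF loaded from a URL would produce index.txt in the working directory.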
The sample file index.pdf has the content shown in Figure 7-4.
Figure 7-4 Content of PDF document to parse
The text file produced by PdfboxTest is shown in Figure 7-5.
Figure 7-5 The results of the processing
As you can see, the text in the PDF has been extracted and saved in a text file. The hyperlink on line 4, "POI News Webblog", appears as plain text in the text file. Readers can consult the API documentation provided with PDFBox for additional functionality.
7.1.5 Integration with Lucene
PDFBox also integrates with Lucene, offering an easy way to add PDF documents to a Lucene index, as the following code shows:
Document luceneDocument = LucenePDFDocument.getDocument(...);
Here LucenePDFDocument is a class provided by PDFBox. Its getDocument method has three overloads, which take a File object, an InputStream object, or a URL object respectively. Each one extracts the content of the PDF passed in and builds a Lucene Document object from it.
Once you have obtained a Lucene Document from a PDF file through PDFBox, you can add it to a Lucene index directly with IndexWriter. LucenePDFDocument automatically extracts various metadata fields from the PDF file and adds them to the Document. The information it extracts is listed in Table 7-1.
Table 7-1 Format of the Lucene Document generated by PDFBox
Lucene Field Name | Description
path | File system path (if the document is loaded from a file)
url | URL address (if the document is loaded from the network)
contents | The contents of the entire document, indexed but not stored
summary | The first 500 characters of the document content
modified | Last modification time
uid | Unique ID of the document
CreationDate | Taken from the PDF metadata
Creator | Taken from the PDF metadata
Keywords | Taken from the PDF metadata
ModificationDate | Taken from the PDF metadata
Producer | Taken from the PDF metadata
Subject | Taken from the PDF metadata
Trapped | Taken from the PDF metadata
The following example uses LucenePDFDocument to index a PDF directly. Create a new class named PdfLuceneTest in the ch7.pdfbox package with the following code.
Code 7.2
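import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.pdfbox.searchengine.lucene.LucenePDFDocument;

public class PdfLuceneTest {
    public static void main(String[] args) {
        try {
            // IndexWriter stores the index in D:/index
            IndexWriter writer = new IndexWriter("D://index",
                    new StandardAnalyzer(), true);
            // LucenePDFDocument returns the Lucene Document generated from the PDF
            Document d = LucenePDFDocument
                    .getDocument(new File("C://index.pdf"));
            // Write the document to the index
            writer.addDocument(d);
            // Close the index writer
            writer.close();
            // Open the index under D:/index with an IndexSearcher
            IndexSearcher searcher = new IndexSearcher("D://index");
            // Look for the keyword "poi" in the contents field
            Term t = new Term("contents", "poi");
            // Build a query from the term
            Query q = new TermQuery(t);
            // Run the search and get the result set
            Hits hits = searcher.search(q);
            // Print the results
            for (int i = 0; i < hits.length(); i++) {
                System.out.println(hits.doc(i));
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}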
The program calls the getDocument method of LucenePDFDocument to obtain a Lucene Document directly from a PDF file. The Document contains fields such as path, url, modified, contents, and summary, and is written straight into the index. The program then creates an IndexSearcher and searches the contents field for the keyword "poi". Note that the keyword must be lowercase: StandardAnalyzer lowercases tokens when it indexes them, while a TermQuery looks up the term exactly as given. The results of running the program are shown in Figure 7-6.
Figure 7-6 Running results of the search code
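The lowercase requirement mentioned above can be illustrated with a toy inverted index. This is only a sketch of the general idea (terms are normalized at index time, but an exact-term lookup is not normalized at query time); it is not Lucene code, and the class LowercaseIndexSketch with its add and lookup methods is invented here for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Toy inverted index, an illustration only and not Lucene's implementation.
class LowercaseIndexSketch {
    // Maps each stored term to the ids of the documents that contain it
    private final Map<String, List<Integer>> postings = new HashMap<>();

    // Index time: split the text into tokens and lowercase each one,
    // roughly what an analyzer with a lowercase filter does.
    void add(int docId, String text) {
        for (String token : text.split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) {
                continue;
            }
            String term = token.toLowerCase(Locale.ROOT);
            List<Integer> docs = postings.computeIfAbsent(term, k -> new ArrayList<>());
            if (!docs.contains(docId)) {
                docs.add(docId);
            }
        }
    }

    // Query time: exact lookup with no normalization, like a raw term query.
    List<Integer> lookup(String term) {
        return postings.getOrDefault(term, new ArrayList<>());
    }

    public static void main(String[] args) {
        LowercaseIndexSketch index = new LowercaseIndexSketch();
        index.add(0, "POI News Webblog");
        System.out.println(index.lookup("poi")); // [0]
        System.out.println(index.lookup("POI")); // []
    }
}
```

Because only the lowercased form "poi" is ever stored, a lookup for "POI" finds nothing, which is the same reason the search term in Code 7.2 is written in lowercase.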