Contents
- 7.2 Using xpdf to process Chinese PDF documents
- 7.2.1 Downloading xpdf
- 7.2.2 Configuration
- 7.2.3 Extracting Chinese text
- 7.2.4 Running result
7.2 Using xpdf to process Chinese PDF documents
PDFBox looks very convenient, its API is powerful, and it can even be integrated seamlessly with Lucene. However, it has one fatal weakness: it does not support Chinese characters. To extract Chinese text, you can use another excellent tool, xpdf.
7.2.1 Downloading xpdf
xpdf can be downloaded from http://www.foolabs.com/xpdf/download.html, as shown in Figure 7-7.
Figure 7-7 xpdf Download Page
This section uses xpdf-3.01pl2-win32.zip. In addition to that archive, download the Simplified Chinese language package, xpdf-chinese-simplified.tar.gz.
7.2.2 Configuration
Decompress xpdf-3.01pl2-win32.zip to the C:/xpdftest directory. Then decompress xpdf-chinese-simplified.tar.gz to the C:/xpdftest/xpdf/ directory. The resulting directory structure is shown in Figure 7-8.
Figure 7-8 Directory structure after extracting xpdf
Open the xpdfrc file in that directory and edit its content as shown in Code 7.3.
Code 7.3:
cidToUnicode     Adobe-GB1    C:/xpdftest/xpdf-chinese-simplified/Adobe-GB1.cidToUnicode
unicodeMap       ISO-2022-CN  C:/xpdftest/xpdf-chinese-simplified/ISO-2022-CN.unicodeMap
unicodeMap       EUC-CN       C:/xpdftest/xpdf-chinese-simplified/EUC-CN.unicodeMap
unicodeMap       GBK          C:/xpdftest/xpdf-chinese-simplified/GBK.unicodeMap
cMapDir          Adobe-GB1    C:/xpdftest/xpdf-chinese-simplified/CMap
toUnicodeDir                  C:/xpdftest/xpdf-chinese-simplified/CMap
fontDir                       C:/Windows/Fonts
displayCIDFontTT Adobe-GB1    C:/Windows/Fonts/simhei.ttf
# dos selects CR+LF line endings (the accepted values are unix, dos, and mac)
textEOL          dos
Readers should adjust these file paths to their own environment, so that each entry points to the place where the language package and the fonts actually reside. On Windows 2000, for example, fontDir is C:/WINNT/Fonts, and displayCIDFontTT Adobe-GB1 points to C:/WINNT/Fonts/simhei.ttf.
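Because a single wrong path in xpdfrc is enough to break Chinese extraction, it can help to check the paths before running the converter. The following is a minimal sketch, not part of xpdf itself: the class name XpdfrcPathCheck and its method missingPaths are hypothetical, and the sketch assumes that every path is the last whitespace-separated token on its line and contains no spaces.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of xpdf): reports xpdfrc paths that do not exist.
class XpdfrcPathCheck {
    // Directives used in Code 7.3 whose last token is a file or directory path.
    private static final List<String> PATH_DIRECTIVES = Arrays.asList(
            "cidToUnicode", "unicodeMap", "cMapDir",
            "toUnicodeDir", "fontDir", "displayCIDFontTT");

    // Returns the paths named in the given xpdfrc lines that are missing on disk.
    // Assumption: paths contain no spaces, so the last token is the whole path.
    static List<String> missingPaths(List<String> lines) {
        List<String> missing = new ArrayList<String>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue; // skip blank lines and comments
            }
            String[] tokens = trimmed.split("\\s+");
            if (tokens.length < 2 || !PATH_DIRECTIVES.contains(tokens[0])) {
                continue; // not a directive that carries a path
            }
            String path = tokens[tokens.length - 1];
            if (!new File(path).exists()) {
                missing.add(path);
            }
        }
        return missing;
    }
}
```

A caller could read xpdfrc line by line, pass the lines to missingPaths, and print whatever comes back; an empty list means every referenced file and directory was found.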
7.2.3 Extracting Chinese text
In the project, create a package named ch7.xpdf and, inside it, a class named Pdf2Text. This class extracts text by invoking pdftotext.exe. The implementation is shown in Code 7.4.
Code 7.4:
package ch7.xpdf;

import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class Pdf2Text {
    // The PDF file to convert
    private File pdffile;
    // Directory that holds the converter; C:/xpdf/ by default
    private String convertorStoredPath = "C:/xpdf/";
    // Name of the converter; pdftotext by default
    private String convertorName = "pdftotext";

    // Constructor. The parameter is the path of the PDF file.
    public Pdf2Text(String pdffile) throws IOException {
        this(new File(pdffile));
    }

    // Constructor. The parameter is the PDF file object.
    public Pdf2Text(File pdffile) throws IOException {
        this.pdffile = pdffile;
    }

    // Convert the PDF file to a text file with the same name and a .txt
    // extension, keeping the layout of the PDF file.
    public void toTextFile() throws IOException {
        String base = pdffile.getPath().replaceFirst("(?i)\\.pdf$", "");
        toTextFile(new File(base + ".txt"), true);
    }

    // Convert the PDF file to a text file. The parameter is the path of the
    // target file. The layout of the PDF file is kept by default.
    public void toTextFile(String targetfile) throws IOException {
        toTextFile(new File(targetfile), true);
    }

    // Convert the PDF file to a text file. Parameter 1 is the path of the
    // target file. If parameter 2 is true, the layout of the PDF file is kept.
    public void toTextFile(String targetfile, boolean isLayout)
            throws IOException {
        toTextFile(new File(targetfile), isLayout);
    }

    // Convert the PDF file to a text file. The parameter is the target file.
    public void toTextFile(File targetfile) throws IOException {
        toTextFile(targetfile, true);
    }

    // Convert the PDF file to a text file. Parameter 1 is the target file.
    // If parameter 2 is true, the layout of the PDF file is kept.
    public void toTextFile(File targetfile, boolean isLayout)
            throws IOException {
        String[] cmd = getCmd(targetfile, isLayout);
        // exec() starts the converter as a separate process and returns at once
        Runtime.getRuntime().exec(cmd);
    }

    // Get the directory of the PDF converter
    public String getConvertorStoredPath() {
        return convertorStoredPath;
    }

    // Set the directory of the PDF converter; a trailing separator is added if missing
    public void setConvertorStoredPath(String path) {
        path = path.trim();
        if (!path.endsWith("/") && !path.endsWith("\\"))
            path = path + "/";
        this.convertorStoredPath = path;
    }

    // Build the command line as an array: the command followed by its arguments
    private String[] getCmd(File targetfile, boolean isLayout) {
        List<String> cmd = new ArrayList<String>();
        // The command itself
        cmd.add(convertorStoredPath + convertorName);
        // Keep the original layout only if isLayout is true
        if (isLayout)
            cmd.add("-layout");
        // Set the output encoding
        cmd.add("-enc");
        cmd.add("GBK");
        // Do not print any messages or errors
        cmd.add("-q");
        // Do not insert page breaks between pages
        cmd.add("-nopgbrk");
        // Absolute path of the PDF file
        cmd.add(pdffile.getAbsolutePath());
        // Absolute path of the output text file
        cmd.add(targetfile.getAbsolutePath());
        return cmd.toArray(new String[0]);
    }
}
This class is built around the toTextFile() method. In its most general form the method takes two parameters: targetfile is the text file to be written, and isLayout indicates whether the layout of the original PDF file is kept. The getCmd() method in the class turns these parameters into a string array that represents one operating system command: the first element is the command itself, and each remaining element is one of its arguments. The array is then passed to Runtime.getRuntime().exec(String[]), which executes the command.
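Runtime.exec() returns as soon as the converter has been started, so the text file may not be complete when toTextFile() returns. As an alternative, the same command can be run with ProcessBuilder and waited on. The following is only a sketch under stated assumptions: the class PdftotextRunner and its methods buildCommand and run are hypothetical names, and the flags are the ones already used in Code 7.4.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: builds the pdftotext command and waits for it to finish.
class PdftotextRunner {
    // Builds the argument list. "-layout" is added only when requested,
    // so no empty argument is ever passed to the converter.
    static List<String> buildCommand(String converter, File pdf, File text,
                                     boolean keepLayout, String encoding) {
        List<String> cmd = new ArrayList<String>();
        cmd.add(converter);
        if (keepLayout) {
            cmd.add("-layout");
        }
        cmd.add("-enc");
        cmd.add(encoding);
        cmd.add("-q");
        cmd.add("-nopgbrk");
        cmd.add(pdf.getAbsolutePath());
        cmd.add(text.getAbsolutePath());
        return cmd;
    }

    // Starts the converter, blocks until it exits, and returns its exit code.
    static int run(List<String> cmd) throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(cmd);
        builder.inheritIO(); // let the converter write to this process's console
        return builder.start().waitFor();
    }
}
```

A caller would pass the result of buildCommand to run and treat an exit code of 0 as success; only after that is it safe to open the text file.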
Note: The getCmd() method sets the encoding to GBK. This does not suit every file, because Chinese text can be encoded in more than one way. You can select a different encoding depending on the PDF file being processed. The encodings available for Simplified Chinese are those defined by the unicodeMap entries in xpdfrc. Three are currently configured: ISO-2022-CN, EUC-CN, and GBK, as shown in Figure 7-9.
Figure 7-9 Contents of xpdfrc
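Whichever encoding is passed to -enc, the text file that pdftotext writes is in that encoding, so it has to be decoded with the matching charset when it is read back into Java. A minimal sketch follows; the class ConvertedTextReader and its read method are hypothetical names, and the sketch assumes that the JDK in use provides the requested charset (GBK in the usage below).

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: reads a converted text file using a named charset.
class ConvertedTextReader {
    // Decodes the whole file with the given charset, for example "GBK".
    // Charset.forName throws an unchecked exception if the name is unknown.
    static String read(Path textFile, String charsetName) throws IOException {
        byte[] bytes = Files.readAllBytes(textFile);
        return new String(bytes, Charset.forName(charsetName));
    }
}
```

For the output of Code 7.4, the call would be ConvertedTextReader.read(path, "GBK"), where path is the location of the text file. If the -enc value is changed, the charset name passed here has to change with it.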
7.2.4 Running result
The Pdf2Text class can be tested with a simple main function. In the ch7.xpdf package, create a Pdf2TextTest class that contains a main function, as shown in Code 7.5.
Code 7.5:
package ch7.xpdf;

public class Pdf2TextTest {
    public static void main(String[] args) {
        try {
            // Location of the input PDF file
            Pdf2Text p2t = new Pdf2Text("C:/test.pdf");
            // Set the location of the converter
            p2t.setConvertorStoredPath("C:/xpdftest/xpdf");
            // Set the location of the output text file and run the conversion
            p2t.toTextFile("C:/test.txt");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
The PDF file to be converted is shown in Figure 7-10.
Figure 7-10 The Chinese PDF file to be converted
After Code 7.5 is run, the result is as shown in Figure 7-11.
Figure 7-11 Running result