7.2 Using xpdf to Process Chinese PDF Documents

Source: Internet
Author: User

PDFBox is convenient to use, its API is powerful, and it can even be integrated seamlessly with Lucene. However, it has one serious weakness: it does not support Chinese text. To extract Chinese text, you can use another excellent tool, xpdf.

7.2.1 Downloading xpdf

xpdf can be downloaded from http://www.foolabs.com/xpdf/download.html, as shown in Figure 7-7.

Figure 7-7 xpdf Download Page

This section uses xpdf-3.01pl2-win32.zip. You also need to download the Simplified Chinese language package, xpdf-chinese-simplified.tar.gz.

7.2.2 Configuration

Decompress xpdf-3.01pl2-win32.zip to the C:/xpdftest directory. Then decompress xpdf-chinese-simplified.tar.gz to the C:/xpdftest/xpdf/ directory. The resulting directory structure is shown in Figure 7-8.

Figure 7-8 Directory structure after extracting xpdf

Open the xpdfrc file in the directory and edit the file content, as shown in the following code.

Code 7.3:

cidToUnicode     Adobe-GB1    C:/xpdftest/xpdf-chinese-simplified/Adobe-GB1.cidToUnicode
unicodeMap       ISO-2022-CN  C:/xpdftest/xpdf-chinese-simplified/ISO-2022-CN.unicodeMap
unicodeMap       EUC-CN       C:/xpdftest/xpdf-chinese-simplified/EUC-CN.unicodeMap
unicodeMap       GBK          C:/xpdftest/xpdf-chinese-simplified/GBK.unicodeMap
cMapDir          Adobe-GB1    C:/xpdftest/xpdf-chinese-simplified/CMap
toUnicodeDir                  C:/xpdftest/xpdf-chinese-simplified/CMap
fontDir                       C:/Windows/Fonts
displayCIDFontTT Adobe-GB1    C:/Windows/Fonts/simhei.ttf
# "dos" selects CR+LF line endings in the text output
textEOL          dos

Readers should adjust the file paths to match their own environment. For example, on Windows 2000 the fontDir is C:/WINNT/Fonts, so displayCIDFontTT for Adobe-GB1 points to C:/WINNT/Fonts/simhei.ttf.
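
Because every path in xpdfrc depends on the local installation, it can help to check that the referenced files and directories actually exist before running the converter. The following is a minimal sketch and is not part of xpdf: the class name XpdfrcCheck, its method, and the default xpdfrc location are hypothetical, and it assumes the last whitespace-separated token of each directive is the path (paths containing spaces are not handled).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of xpdf): lists the paths named in an xpdfrc file.
final class XpdfrcCheck {
    // xpdfrc directives whose last token is a file or directory path
    private static final List<String> PATH_DIRECTIVES = Arrays.asList(
            "cidToUnicode", "unicodeMap", "cMapDir", "toUnicodeDir",
            "fontDir", "displayCIDFontTT");

    // Returns the path argument of every path-carrying directive, in file order.
    static List<String> extractPaths(List<String> lines) {
        List<String> paths = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue; // skip blank lines and comments
            }
            String[] tokens = trimmed.split("\\s+");
            if (tokens.length >= 2 && PATH_DIRECTIVES.contains(tokens[0])) {
                paths.add(tokens[tokens.length - 1]);
            }
        }
        return paths;
    }

    public static void main(String[] args) throws IOException {
        // Assumed location; pass the real xpdfrc path as the first argument.
        Path rc = Paths.get(args.length > 0 ? args[0] : "C:/xpdftest/xpdf/xpdfrc");
        for (String p : extractPaths(Files.readAllLines(rc))) {
            String status = Files.exists(Paths.get(p)) ? "OK      " : "MISSING ";
            System.out.println(status + p);
        }
    }
}
```

Any line reported as MISSING points to a path that needs to be corrected in xpdfrc.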

7.2.3 Extracting Chinese Text

Create a ch7.xpdf package in the project and add a Pdf2Text class to it. This class extracts text by calling pdftotext.exe. The implementation is as follows.

Code 7.4:

package ch7.xpdf;

import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class Pdf2Text {
    // The PDF file to convert
    private File pdfFile;
    // Directory that holds the converter; C:/xpdf/ by default
    private String convertorStoredPath = "C:/xpdf/";
    // Name of the converter executable; pdftotext by default
    private String convertorName = "pdftotext";

    // Constructor. The parameter is the path of the PDF file.
    public Pdf2Text(String pdfFile) throws IOException {
        this(new File(pdfFile));
    }

    // Constructor. The parameter is the PDF file object.
    public Pdf2Text(File pdfFile) throws IOException {
        this.pdfFile = pdfFile;
    }

    // Convert the PDF file to a text file next to it.
    // Assumption: the target is the PDF path with a .txt extension, which
    // mirrors pdftotext's own default. Passing pdfFile itself as the target
    // would overwrite the PDF.
    public void toTextFile() throws IOException {
        String path = pdfFile.getPath();
        if (path.toLowerCase().endsWith(".pdf")) {
            path = path.substring(0, path.length() - 4);
        }
        toTextFile(new File(path + ".txt"), true);
    }

    // Convert the PDF file to a text file. The parameter is the path of the
    // target file. The layout of the PDF file is kept by default.
    public void toTextFile(String targetFile) throws IOException {
        toTextFile(new File(targetFile), true);
    }

    // Convert the PDF file to a text file. Parameter 1 is the path of the
    // target file; if parameter 2 is true, the layout of the PDF file is kept.
    public void toTextFile(String targetFile, boolean isLayout) throws IOException {
        toTextFile(new File(targetFile), isLayout);
    }

    // Convert the PDF file to a text file. The parameter is the target file.
    public void toTextFile(File targetFile) throws IOException {
        toTextFile(targetFile, true);
    }

    // Convert the PDF file to a text file. Parameter 1 is the target file;
    // if parameter 2 is true, the layout of the PDF file is kept.
    public void toTextFile(File targetFile, boolean isLayout) throws IOException {
        String[] cmd = getCmd(targetFile, isLayout);
        Process p = Runtime.getRuntime().exec(cmd);
        try {
            // Wait so that the text file is complete when this method returns
            p.waitFor();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Get the directory of the PDF converter
    public String getConvertorStoredPath() {
        return convertorStoredPath;
    }

    // Set the directory of the PDF converter
    public void setConvertorStoredPath(String path) {
        path = path.trim();
        if (!path.endsWith("/") && !path.endsWith("\\")) {
            path = path + "/";
        }
        this.convertorStoredPath = path;
    }

    // Build the command line as an array: the command, then its arguments
    private String[] getCmd(File targetFile, boolean isLayout) {
        List<String> cmd = new ArrayList<String>();
        // The command itself
        cmd.add(convertorStoredPath + convertorName);
        // Keep the original layout; omitted entirely when isLayout is false,
        // because an empty argument would be read as a file name
        if (isLayout) {
            cmd.add("-layout");
        }
        // Set the output encoding
        cmd.add("-enc");
        cmd.add("GBK");
        // Do not print any messages or errors
        cmd.add("-q");
        // Do not insert page breaks between pages
        cmd.add("-nopgbrk");
        // Absolute path of the PDF file
        cmd.add(pdfFile.getAbsolutePath());
        // Absolute path of the output text file
        cmd.add(targetFile.getAbsolutePath());
        return cmd.toArray(new String[0]);
    }
}

This class provides a toTextFile() method that takes two parameters: targetFile is the target text file, and isLayout indicates whether the layout of the original PDF file is kept. The getCmd() method turns these parameters into a string array that represents an operating system command: the first item is the command itself, and each following item is one of its arguments. The command is then executed with Runtime.getRuntime().exec(String[]).
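
The command array described above can also be run through java.lang.ProcessBuilder, which makes it straightforward to wait for pdftotext to finish and to inspect its exit status. The sketch below is an illustration under stated assumptions, not the chapter's own code: the class PdftotextRunner and its method names are hypothetical, and the executable and file paths in main are assumptions about the local installation.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical wrapper around the pdftotext command line.
final class PdftotextRunner {
    // Builds the argument list: executable, options, input PDF, output text file.
    static List<String> buildCommand(String exe, File pdf, File txt,
                                     boolean keepLayout, String encoding) {
        List<String> cmd = new ArrayList<>();
        cmd.add(exe);
        if (keepLayout) {
            cmd.add("-layout");   // keep the original physical layout
        }
        cmd.add("-enc");          // output text encoding
        cmd.add(encoding);
        cmd.add("-q");            // suppress messages and errors
        cmd.add("-nopgbrk");      // no page-break characters between pages
        cmd.add(pdf.getAbsolutePath());
        cmd.add(txt.getAbsolutePath());
        return cmd;
    }

    // Starts the process, waits for it to end, and returns its exit code.
    static int run(List<String> cmd) throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(cmd);
        pb.inheritIO(); // let the child share this JVM's console streams
        return pb.start().waitFor();
    }

    public static void main(String[] args) throws Exception {
        // Assumed paths; adjust them to the local installation.
        List<String> cmd = buildCommand("C:/xpdftest/xpdf/pdftotext",
                new File("C:/test.pdf"), new File("C:/test.txt"), true, "GBK");
        int exit = run(cmd);
        System.out.println(exit == 0 ? "Conversion finished" : "pdftotext exit code " + exit);
    }
}
```

Passing each argument as a separate list element avoids quoting problems with paths that contain spaces, and leaving out the layout option altogether (rather than passing an empty string) keeps pdftotext from treating it as a file name.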

Note: The getCmd() method sets the encoding to GBK. This does not suit every file, because there is more than one Chinese encoding; choose the encoding that matches your needs. The Simplified Chinese encodings that can be used are those defined by the unicodeMap entries in the xpdfrc file. Three are currently supported: ISO-2022-CN, EUC-CN, and GBK, as shown in Figure 7-9.

Figure 7-9 content of xpdfrc
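
Whichever encoding is passed to pdftotext with -enc also has to be used when the resulting text file is read back in Java, otherwise the Chinese characters come out garbled. The following sketch assumes the output was written as GBK and that the JVM provides the GBK charset (standard JDK builds normally do); the class name ConvertedTextReader and the default file path are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: reads the text produced by pdftotext with a matching charset.
final class ConvertedTextReader {
    // Reads the whole file and decodes its bytes with the named charset.
    static String read(Path textFile, String charsetName) throws IOException {
        byte[] bytes = Files.readAllBytes(textFile);
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) throws IOException {
        // Assumed output location; pass the real path as the first argument.
        Path out = Paths.get(args.length > 0 ? args[0] : "C:/test.txt");
        // The charset here must match the value given to -enc.
        System.out.println(read(out, "GBK"));
    }
}
```

If the conversion is run with a different -enc value, the charset name passed to read() has to change with it.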

7.2.4 Results

The following test exercises the Pdf2Text class. In the ch7.xpdf package, create a Pdf2TextTest class that contains a main function. The code is as follows.

Code 7.5:

package ch7.xpdf;

public class Pdf2TextTest {
    public static void main(String[] args) {
        try {
            // Location of the input PDF file
            Pdf2Text p2t = new Pdf2Text("C:/test.pdf");
            // Set the location of the converter
            p2t.setConvertorStoredPath("C:/xpdftest/xpdf");
            // Set the location of the output text file
            p2t.toTextFile("C:/test.txt");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

The PDF file to be converted is shown in Figure 7-10.

Figure 7-10 The Chinese PDF file to be converted

After Code 7.5 is run, the result is as shown in Figure 7-11.

Figure 7-11 Running result
