Extract Chinese text from PDF using xpdf
1. Download xpdf: ftp://ftp.foolabs.com/pub/xpdf/xpdf-3.02pl4-win32.zip
2. Download the Chinese Simplified support package, ftp://ftp.foolabs.com/pub/xpdf/xpdf-chinese-simplified.tar.gz, along with the fonts gbsn00lp.ttf and gkai00mp.ttf.
3. Decompress the xpdf and font archives, and place the font files in the xpdf\Chinese-Simplified\cmap directory.
4. Edit the paths in the add-to-xpdfrc file so that they point to the local installation directory:
    #----- begin Chinese Simplified support package (2004-jul-27)
    cidToUnicode Adobe-GB1 E:\study\flex\xpdf-Chinese-Simplified\xpdf\Chinese-Simplified/Adobe-GB1.cidToUnicode
    unicodeMap ISO-2022-CN E:\study\flex\xpdf-Chinese-Simplified\xpdf\Chinese-Simplified/ISO-2022-CN.unicodeMap
    unicodeMap EUC-CN E:\study\flex\xpdf-Chinese-Simplified\xpdf\Chinese-Simplified/EUC-CN.unicodeMap
    unicodeMap GBK E:\study\flex\xpdf-Chinese-Simplified\xpdf\Chinese-Simplified/GBK.unicodeMap
    cMapDir Adobe-GB1 E:\study\flex\xpdf-Chinese-Simplified\xpdf\Chinese-Simplified/cmap
    toUnicodeDir E:\study\flex\xpdf-Chinese-Simplified\xpdf\Chinese-Simplified/cmap
    displayCIDFontTT Adobe-GB1 E:\study\flex\xpdf-Chinese-Simplified\xpdf\Chinese-Simplified/cmap\gkai00mp.ttf
    #----- end Chinese Simplified support package
5. Edit the xpdfrc file in the same way, setting each path to the local installation directory:
    cidToUnicode Adobe-GB1 E:\study\flex\xpdf-Chinese-Simplified\xpdf\Chinese-Simplified/Adobe-GB1.cidToUnicode
    unicodeMap ISO-2022-CN E:\study\flex\xpdf-Chinese-Simplified\xpdf\Chinese-Simplified/ISO-2022-CN.unicodeMap
    unicodeMap EUC-CN E:\study\flex\xpdf-Chinese-Simplified\xpdf\Chinese-Simplified/EUC-CN.unicodeMap
    unicodeMap GBK E:\study\flex\xpdf-Chinese-Simplified\xpdf\Chinese-Simplified/GBK.unicodeMap
    cMapDir Adobe-GB1 E:\study\flex\xpdf-Chinese-Simplified\xpdf\Chinese-Simplified/cmap
    toUnicodeDir E:\study\flex\xpdf-Chinese-Simplified\xpdf\Chinese-Simplified/cmap
    displayCIDFontTT Adobe-GB1 E:\study\flex\xpdf-Chinese-Simplified\xpdf\Chinese-Simplified\cmap\gkai00mp.ttf
6. Write a simple program
    using System;
    using System.Diagnostics;

    string xpdfpath = @"E:\study\flex\xpdf-Chinese-Simplified\xpdf\pdftotext.exe";
    string filename = @"E:\work\flashviewer\flex\PDF\mayun.pdf";

    // -cfg selects the config file, -q suppresses messages,
    // and the trailing "-" sends the extracted text to standard output.
    string strcmd = "-cfg xpdfrc -q " + filename + " -";

    Process p = new Process();
    p.StartInfo.FileName = xpdfpath; // exe, bat and so on
    p.StartInfo.WindowStyle = ProcessWindowStyle.Hidden;
    p.StartInfo.Arguments = strcmd;
    p.StartInfo.RedirectStandardOutput = true;
    p.StartInfo.UseShellExecute = false; // required when redirecting output
    try
    {
        p.Start();

        // Read all output before waiting, so a full pipe cannot block the child process.
        string strmsg = p.StandardOutput.ReadToEnd();
        // IOHelp.WriteFile and path (the output file) are defined elsewhere in the project.
        IOHelp.WriteFile(path, strmsg, false);

        p.WaitForExit();
        p.Close();
    }
    catch (Exception e)
    {
        Console.WriteLine(e.Message);
    }
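The program above does not say which encoding the Chinese text uses on standard output. If the bytes that pdftotext writes and the encoding that .NET uses to decode them differ, the result is garbled. The following is only a sketch of one way to make the two sides agree. It assumes pdftotext's `-enc` option is used to request UTF-8, and it sets `StandardOutputEncoding` to match. The class and method names (`PdfTextExtractor`, `BuildArguments`, `ExtractText`) are hypothetical and not part of the original program. The paths are also quoted so that file names containing spaces survive.

```csharp
using System;
using System.Diagnostics;
using System.Text;

// Hypothetical helper, not part of the original program.
public static class PdfTextExtractor
{
    // Builds the pdftotext argument string. Paths are wrapped in quotes
    // so that spaces in them do not split the argument.
    public static string BuildArguments(string configPath, string pdfPath)
    {
        // -enc UTF-8 requests UTF-8 output (an assumption of this sketch);
        // the trailing "-" sends the text to standard output.
        return "-cfg \"" + configPath + "\" -enc UTF-8 -q \"" + pdfPath + "\" -";
    }

    // Runs pdftotext and returns the extracted text as a string.
    public static string ExtractText(string pdftotextPath, string configPath, string pdfPath)
    {
        var startInfo = new ProcessStartInfo
        {
            FileName = pdftotextPath,
            Arguments = BuildArguments(configPath, pdfPath),
            UseShellExecute = false,          // required for redirection
            RedirectStandardOutput = true,
            CreateNoWindow = true,
            // Decode the child's output with the same encoding it was asked to write.
            StandardOutputEncoding = Encoding.UTF8
        };

        using (Process process = Process.Start(startInfo))
        {
            // Read first, then wait, so a full pipe cannot block the child.
            string text = process.StandardOutput.ReadToEnd();
            process.WaitForExit();
            return text;
        }
    }
}
```

With the variables from step 6, a call would look like `string strmsg = PdfTextExtractor.ExtractText(xpdfpath, "xpdfrc", filename);`. If GBK output is preferred, both the `-enc` value and the `StandardOutputEncoding` would need to change together.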