Reposted from: https://www.ibm.com/developerworks/cn/java/l-java-tips/ (thanks to the author for publishing the article)
Using Jacob
Jacob is essentially a bridge: a piece of middleware that connects Java to COM or Win32 functions. Jacob cannot extract Word, Excel, and similar files by itself; it needs a native DLL to do so. You do not have to write that DLL yourself, because the author of Jacob already provides it.
Jacob jar and DLL file download: http://www.matrix.org.cn/down_view.asp?id=13
After you have downloaded Jacob and put the files in place (the DLL in a directory on the PATH, the jar file on the classpath), you can write your own extraction program. Here is a simple example:
import java.io.File;
import com.jacob.com.*;
import com.jacob.activeX.*;

/**
 * Title: pdf extraction
 * Description: email:[email protected]
 * Copyright: Matrix Copyright (c) 2003
 * Company: Matrix.org.cn
 * @author chris
 * @version 1.0, if you use this example, please keep this notice
 */
public class FileExtracter {
    public static void main(String[] args) {
        // Start Word through COM.
        ActiveXComponent component = new ActiveXComponent("Word.Application");
        String inFile = "C:\\test.doc";
        String tpFile = "C:\\temp.htm";
        String otFile = "C:\\temp.xml"; // not used in this example
        boolean flag = false;
        try {
            // Keep the Word window hidden.
            component.setProperty("Visible", new Variant(false));
            Dispatch wordacc = component.getProperty("Documents").toDispatch();
            // Documents.Open(FileName, ConfirmConversions=false, ReadOnly=true)
            Dispatch wordfile = Dispatch.invoke(wordacc, "Open", Dispatch.Method,
                    new Object[] {inFile, new Variant(false), new Variant(true)},
                    new int[1]).toDispatch();
            // Save a copy as HTML; 8 is Word's wdFormatHTML file format.
            Dispatch.invoke(wordfile, "SaveAs", Dispatch.Method,
                    new Object[] {tpFile, new Variant(8)}, new int[1]);
            // Close the document without saving changes.
            Variant f = new Variant(false);
            Dispatch.call(wordfile, "Close", f);
            flag = true;
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            // Always shut Word down, even if the conversion failed.
            component.invoke("Quit", new Variant[] {});
        }
    }
}
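Because Jacob only works once its DLL can be found, it can help to check the search path before running the example above. The following is a minimal sketch, not part of Jacob: the class `DllPathCheck`, the helper `findOnPath`, and the default file name `jacob.dll` are assumptions made for illustration (the actual DLL name depends on the Jacob release you downloaded). It looks through the directories in `java.library.path`, which on Windows normally includes the entries of PATH.

```java
import java.io.File;
import java.util.regex.Pattern;

public class DllPathCheck {

    // Returns the first file named fileName found in the directories listed
    // in pathValue (separated by the platform path separator), or null.
    static File findOnPath(String pathValue, String fileName) {
        if (pathValue == null) {
            return null;
        }
        String separator = Pattern.quote(File.pathSeparator);
        for (String dir : pathValue.split(separator)) {
            if (dir.isEmpty()) {
                continue;
            }
            File candidate = new File(dir, fileName);
            if (candidate.isFile()) {
                return candidate;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // "jacob.dll" is an assumed name; pass the real one as an argument.
        String name = args.length > 0 ? args[0] : "jacob.dll";
        File hit = findOnPath(System.getProperty("java.library.path"), name);
        if (hit == null) {
            System.out.println(name + " was not found on java.library.path");
        } else {
            System.out.println("Found " + hit.getAbsolutePath());
        }
    }
}
```

If the file is not found, an `UnsatisfiedLinkError` when creating the `ActiveXComponent` is the likely symptom.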
Using Apache POI to extract Word and Excel
POI is an Apache project, but even POI can feel cumbersome to use directly. That does not matter; here is a simpler interface that wraps it:
Download the wrapped POI package: http://www.matrix.org.cn/down_view.asp?id=14
After downloading it, put it on your classpath. Here is an example of how to use it:
import java.io.*;
import org.textmining.text.extraction.WordExtractor;

/**
 * <p>Title: word extraction</p>
 * <p>Description: email:[email protected]</p>
 * <p>Copyright: Matrix Copyright (c) 2003</p>
 * <p>Company: Matrix.org.cn</p>
 * @author chris
 * @version 1.0, if you use this example, please keep this notice
 */
public class PdfExtractor {
    public PdfExtractor() {}

    public static void main(String args[]) throws Exception {
        FileInputStream in = new FileInputStream("C:\\a.doc");
        try {
            // The wrapper returns the document text as one string.
            WordExtractor extractor = new WordExtractor();
            String str = extractor.extractText(in);
            System.out.println("The result length is " + str.length());
            System.out.println("The result is " + str);
        } finally {
            in.close();
        }
    }
}
PDFBox: for extracting PDF files
Note that PDFBox's support for Chinese is poor. First download PDFBox: http://www.matrix.org.cn/down_view.asp?id=12
Here is an example of how to extract a PDF file using PDFBox:
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.pdfparser.PDFParser;
import org.pdfbox.util.PDFTextStripper;
import java.io.*;

/**
 * <p>Title: pdf extraction</p>
 * <p>Description: email:[email protected]</p>
 * <p>Copyright: Matrix Copyright (c) 2003</p>
 * <p>Company: Matrix.org.cn</p>
 * @author chris
 * @version 1.0, if you use this example, please keep this notice
 */
public class PdfExtracter {
    public PdfExtracter() {}

    public String getTextFromPdf(String filename) throws Exception {
        PDDocument pdfDocument = null;
        FileInputStream is = new FileInputStream(filename);
        try {
            // Parse the PDF file into a document object.
            PDFParser parser = new PDFParser(is);
            parser.parse();
            pdfDocument = parser.getPDDocument();
            // Write the extracted text into an in-memory buffer.
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            OutputStreamWriter writer = new OutputStreamWriter(out);
            PDFTextStripper stripper = new PDFTextStripper();
            stripper.writeText(pdfDocument.getDocument(), writer);
            writer.close();
            byte[] contents = out.toByteArray();
            String ts = new String(contents);
            System.out.println("The string length is " + contents.length + "\n");
            return ts;
        } finally {
            // Release the document and the input file.
            if (pdfDocument != null) {
                pdfDocument.close();
            }
            is.close();
        }
    }

    public static void main(String args[]) {
        PdfExtracter pf = new PdfExtracter();
        try {
            String ts = pf.getTextFromPdf("C:\\a.pdf");
            System.out.println(ts);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
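Separately from PDFBox's own handling of Chinese, the example above passes the text through bytes using the platform default charset, both when writing (`OutputStreamWriter`) and when decoding (`new String(contents)`). A hedged sketch of the same round trip with the charset pinned to UTF-8 follows; the class `CharsetRoundTrip` and the method `roundTrip` are hypothetical names, and the sketch uses only the JDK, so it illustrates the encoding step rather than PDFBox itself.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class CharsetRoundTrip {

    // Writes text to an in-memory byte buffer and decodes it again,
    // using UTF-8 explicitly on both sides.
    static String roundTrip(String text) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8);
        writer.write(text);
        writer.close(); // flushes the encoder into the byte buffer
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        String sample = "PDF \u4e2d\u6587"; // "PDF" followed by two Chinese characters
        System.out.println(sample.equals(roundTrip(sample))); // prints "true"
    }
}
```

Passing a `StringWriter` to the text stripper instead would avoid the byte round trip altogether, since no encoding or decoding would take place.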
Xpdf: extracting PDF files with Chinese support
Xpdf is an open source project, and we can call its native program from Java to extract text from Chinese PDF files.
Download the Xpdf package: http://www.matrix.org.cn/down_view.asp?id=15
You also need to download the Chinese support patch: http://www.matrix.org.cn/down_view.asp?id=16
Install the patch by following its README; after that you can write the Java program that calls the native program.
Here is an example of how to invoke it:
import java.io.*;

/**
 * <p>Title: pdf extraction</p>
 * <p>Description: email:[email protected]</p>
 * <p>Copyright: Matrix Copyright (c) 2003</p>
 * <p>Company: Matrix.org.cn</p>
 * @author chris
 * @version 1.0, if you use this example, please keep this notice
 */
public class PdfWin {
    public PdfWin() {}

    public static void main(String args[]) throws Exception {
        String PATH_TO_XPDF = "C:\\Program Files\\xpdf\\pdftotext.exe";
        String filename = "C:\\a.pdf";
        // "-enc UTF-8" selects the output encoding, "-q" suppresses messages,
        // and the trailing "-" sends the text to standard output.
        String[] cmd = new String[] {PATH_TO_XPDF, "-enc", "UTF-8", "-q", filename, "-"};
        Process p = Runtime.getRuntime().exec(cmd);
        BufferedInputStream bis = new BufferedInputStream(p.getInputStream());
        InputStreamReader reader = new InputStreamReader(bis, "UTF-8");
        StringWriter out = new StringWriter();
        char[] buf = new char[10000];
        int len;
        // Accumulate every chunk; a single read may return only part of the output.
        while ((len = reader.read(buf)) >= 0) {
            out.write(buf, 0, len);
            System.out.println("The length is " + len);
        }
        reader.close();
        String ts = out.toString();
        System.out.println("The str is " + ts);
    }
}
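The read loop is the step where the output of pdftotext is collected, and it can be tried without Xpdf installed. The sketch below isolates it under stated assumptions: the class `StreamToString` and the method `readAll` are hypothetical names, and an in-memory stream stands in for the process output. A deliberately small buffer forces several reads, which shows why each chunk has to be appended rather than only the last buffer being kept.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;

public class StreamToString {

    // Decodes the whole stream as UTF-8, appending each chunk that was read.
    static String readAll(InputStream in, int bufferSize) throws IOException {
        Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8);
        StringWriter out = new StringWriter();
        char[] buf = new char[bufferSize];
        int len;
        while ((len = reader.read(buf)) >= 0) {
            out.write(buf, 0, len); // keep only the len characters just read
        }
        reader.close();
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for p.getInputStream(); the text is made up for the demo.
        String original = "first chunk, second chunk, \u4e2d\u6587";
        byte[] data = original.getBytes(StandardCharsets.UTF_8);
        String text = readAll(new ByteArrayInputStream(data), 4);
        System.out.println(original.equals(text)); // prints "true"
    }
}
```

With a real process, `p.getInputStream()` would take the place of the `ByteArrayInputStream`.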
Four tools for extracting text from Word and PDF files in Java