Four tools for extracting text from Word and PDF files in Java

Source: Internet
Author: User

Reposted from: https://www.ibm.com/developerworks/cn/java/l-java-tips/ (with thanks to the original author for publishing the article)

Using Jacob

Jacob is a bridge: a piece of middleware that connects Java to COM or Win32 functions. Jacob cannot extract Word, Excel, or other files on its own; it needs a native DLL to do the work. You do not have to write that DLL yourself, because Jacob's author already provides it.

Jacob jar and DLL file download: http://www.matrix.org.cn/down_view.asp?id=13

After downloading Jacob, put the DLL in a directory on the PATH and add the jar file to the CLASSPATH. You can then write your own extraction program. Here is a simple example:
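If the setup step above goes wrong, the failure usually shows up only when the program runs. The following is a minimal sketch, using only the standard library, of how one might check both locations before starting. The file names `jacob.dll` and `jacob` (as a jar name fragment) are assumptions; the actual names depend on the Jacob release you downloaded. The class and method names are hypothetical.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

/** Sketch: check a path-style list for a native library and the classpath for a jar. */
class PathCheck {

    /** Returns every directory in pathList that contains a file named fileName. */
    static List<File> findInPathList(String pathList, String fileName) {
        List<File> hits = new ArrayList<File>();
        if (pathList == null || pathList.isEmpty()) {
            return hits;
        }
        for (String entry : pathList.split(Pattern.quote(File.pathSeparator))) {
            if (entry.isEmpty()) {
                continue;
            }
            if (new File(entry, fileName).isFile()) {
                hits.add(new File(entry));
            }
        }
        return hits;
    }

    /** Returns true if any classpath entry's file name contains the fragment (case-insensitive). */
    static boolean classpathMentions(String classPath, String fragment) {
        if (classPath == null) {
            return false;
        }
        String wanted = fragment.toLowerCase(Locale.ROOT);
        for (String entry : classPath.split(Pattern.quote(File.pathSeparator))) {
            String name = new File(entry).getName().toLowerCase(Locale.ROOT);
            if (name.contains(wanted)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Assumed names: adjust to match the files shipped with your Jacob download.
        String libPath = System.getProperty("java.library.path");
        String classPath = System.getProperty("java.class.path");
        System.out.println("jacob.dll found in: " + findInPathList(libPath, "jacob.dll"));
        System.out.println("classpath mentions jacob: " + classpathMentions(classPath, "jacob"));
    }
}
```

On Windows the JVM normally includes the PATH directories in `java.library.path`, which is why the sketch inspects that property rather than the environment variable directly.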

    import java.io.File;
    import com.jacob.com.*;
    import com.jacob.activeX.*;

    /**
     * Title: pdf extraction
     * Description: email: [email protected]
     * Copyright: Matrix Copyright (c) 2003
     * Company: Matrix.org.cn
     * @author chris
     * @version 1.0, anyone using this example, please retain this notice
     */
    public class FileExtracter {
        public static void main(String[] args) {
            // Start Word through COM.
            ActiveXComponent component = new ActiveXComponent("Word.Application");
            String inFile = "c:\\test.doc";
            String tpFile = "c:\\temp.htm";
            String otFile = "c:\\temp.xml";
            boolean flag = false;
            try {
                // Keep the Word window hidden.
                component.setProperty("Visible", new Variant(false));
                Dispatch wordacc = component.getProperty("Documents").toDispatch();
                // Documents.Open(FileName, ConfirmConversions=false, ReadOnly=true)
                Dispatch wordfile = Dispatch.invoke(wordacc, "Open", Dispatch.Method,
                        new Object[]{inFile, new Variant(false), new Variant(true)},
                        new int[1]).toDispatch();
                // Save the document as HTML (format code 8).
                Dispatch.invoke(wordfile, "SaveAs", Dispatch.Method,
                        new Object[]{tpFile, new Variant(8)}, new int[1]);
                // Close the document without saving changes.
                Variant f = new Variant(false);
                Dispatch.call(wordfile, "Close", f);
                flag = true;
            } catch (Exception e) {
                e.printStackTrace();
            } finally {
                // Always shut Word down, even if the conversion failed.
                component.invoke("Quit", new Variant[]{});
            }
        }
    }
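The example above passes the bare number 8 to SaveAs. That number is one of Word's WdSaveFormat values (8 is wdFormatHTML). As a small sketch, the values most relevant to text extraction can be given names; the enum and its method names below are hypothetical and are not part of Jacob.

```java
/** Sketch: a few of Word's WdSaveFormat values, named for readability. */
enum WordSaveFormat {
    DOCUMENT(0),      // wdFormatDocument
    TEXT(2),          // wdFormatText (plain text)
    RTF(6),           // wdFormatRTF
    UNICODE_TEXT(7),  // wdFormatUnicodeText
    HTML(8);          // wdFormatHTML, the value used in the example above

    private final int code;

    WordSaveFormat(int code) {
        this.code = code;
    }

    /** The integer that Word expects as the FileFormat argument of SaveAs. */
    int code() {
        return code;
    }

    /** Looks up a constant by its integer value. */
    static WordSaveFormat fromCode(int code) {
        for (WordSaveFormat f : values()) {
            if (f.code == code) {
                return f;
            }
        }
        throw new IllegalArgumentException("Unknown save format code: " + code);
    }

    public static void main(String[] args) {
        for (WordSaveFormat f : values()) {
            System.out.println(f + " = " + f.code());
        }
    }
}
```

With such an enum, the SaveAs call could pass `new Variant(WordSaveFormat.HTML.code())` instead of `new Variant(8)`. Passing `WordSaveFormat.TEXT.code()` should ask Word for plain text directly, assuming the installed Word version handles the conversion without prompting.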

Using Apache POI to extract Word and Excel files

POI is an Apache project. Even so, using POI directly can be tedious. That does not matter, because a simpler wrapper interface is available:

Download the encapsulated POI package: http://www.matrix.org.cn/down_view.asp?id=14

After downloading it, put it on your classpath. Here is an example of how to use it:

    import java.io.*;
    import org.textmining.text.extraction.WordExtractor;

    /**
     * <p>Title: word extraction</p>
     * <p>Description: email: [email protected]</p>
     * <p>Copyright: Matrix Copyright (c) 2003</p>
     * <p>Company: Matrix.org.cn</p>
     * @author chris
     * @version 1.0, anyone using this example, please retain this notice
     */
    public class PdfExtractor {
        public PdfExtractor() {
        }

        public static void main(String args[]) throws Exception {
            // Open the Word document as a stream.
            FileInputStream in = new FileInputStream("c:\\a.doc");
            WordExtractor extractor = new WordExtractor();
            // Extract the document text as a single String.
            String str = extractor.extractText(in);
            System.out.println("the result length is " + str.length());
            System.out.println("the result is " + str);
        }
    }

  

PDFBox: extracting PDF files

Note that PDFBox's support for Chinese text is poor. First, download PDFBox: http://www.matrix.org.cn/down_view.asp?id=12

Here is an example of how to extract a PDF file using PDFBox:

    import org.pdfbox.pdmodel.PDDocument;
    import org.pdfbox.pdfparser.PDFParser;
    import java.io.*;
    import org.pdfbox.util.PDFTextStripper;
    import java.util.Date;

    /**
     * <p>Title: pdf extraction</p>
     * <p>Description: email: [email protected]</p>
     * <p>Copyright: Matrix Copyright (c) 2003</p>
     * <p>Company: Matrix.org.cn</p>
     * @author chris
     * @version 1.0, anyone using this example, please retain this notice
     */
    public class PdfExtracter {
        public PdfExtracter() {
        }

        public String getTextFromPdf(String filename) throws Exception {
            String temp = null;
            PDDocument pdfdocument = null;
            FileInputStream is = new FileInputStream(filename);
            // Parse the PDF file into a document object.
            PDFParser parser = new PDFParser(is);
            parser.parse();
            pdfdocument = parser.getPDDocument();
            // Write the extracted text into an in-memory buffer.
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            OutputStreamWriter writer = new OutputStreamWriter(out);
            PDFTextStripper stripper = new PDFTextStripper();
            stripper.writeText(pdfdocument.getDocument(), writer);
            writer.close();
            byte[] contents = out.toByteArray();
            String ts = new String(contents);
            System.out.println("the string length is " + contents.length + "\n");
            return ts;
        }

        public static void main(String args[]) {
            PdfExtracter pf = new PdfExtracter();
            PDDocument pdfDocument = null;
            try {
                String ts = pf.getTextFromPdf("c:\\a.pdf");
                System.out.println(ts);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
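The example above converts the extracted text to bytes and back to a String using the platform default charset. If that charset cannot represent a character (Chinese text on a Western-locale machine, for instance), the character is lost in the round trip. One way to avoid this is to let the extractor write into a StringWriter, so no byte conversion happens at all. The sketch below shows the idea with the standard library only. `TextSource` is a hypothetical stand-in for any API that writes text to a `Writer`; it is not a PDFBox type. This does not address any limitations in PDFBox's own decoding of Chinese fonts.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;

/** Sketch: collect Writer-based output as a String without a byte round trip. */
class WriterCollect {

    /** Hypothetical stand-in for an extractor that writes its text to a Writer. */
    interface TextSource {
        void writeText(Writer out) throws IOException;
    }

    /** Runs the source against an in-memory StringWriter and returns the text. */
    static String collect(TextSource source) throws IOException {
        StringWriter buffer = new StringWriter();
        source.writeText(buffer);
        buffer.flush();
        return buffer.toString();
    }

    public static void main(String[] args) throws IOException {
        String text = collect(new TextSource() {
            public void writeText(Writer out) throws IOException {
                // Two Chinese characters followed by ASCII text.
                out.write("\u4e2d\u6587 PDF text");
            }
        });
        System.out.println("the string length is " + text.length());
    }
}
```

In the PDFBox example, the same idea would mean passing a `StringWriter` to `stripper.writeText(...)` in place of the `OutputStreamWriter` and calling `toString()` on it afterwards.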

xpdf: extracting PDF files with Chinese text

Xpdf is an open source project. We can call its native command-line program from Java to extract text from Chinese PDF files.

Download the xpdf package: http://www.matrix.org.cn/down_view.asp?id=15

You also need to download the Chinese language support patch: http://www.matrix.org.cn/down_view.asp?id=16

Install the patch by following its Readme. You can then write the Java program that calls the native program.

Here is an example of how to invoke it:

    import java.io.*;

    /**
     * <p>Title: pdf extraction</p>
     * <p>Description: email: [email protected]</p>
     * <p>Copyright: Matrix Copyright (c) 2003</p>
     * <p>Company: Matrix.org.cn</p>
     * @author chris
     * @version 1.0, anyone using this example, please retain this notice
     */
    public class PdfWin {
        public PdfWin() {
        }

        public static void main(String args[]) throws Exception {
            String PATH_TO_XPDF = "C:\\Program Files\\xpdf\\pdftotext.exe";
            String filename = "c:\\a.pdf";
            // "-enc UTF-8" sets the output encoding, "-q" suppresses messages,
            // and the final "-" sends the text to standard output.
            String[] cmd = new String[]{PATH_TO_XPDF, "-enc", "UTF-8", "-q", filename, "-"};
            Process p = Runtime.getRuntime().exec(cmd);
            BufferedInputStream bis = new BufferedInputStream(p.getInputStream());
            InputStreamReader reader = new InputStreamReader(bis, "UTF-8");
            StringWriter out = new StringWriter();
            char[] buf = new char[10000];
            int len;
            while ((len = reader.read(buf)) >= 0) {
                // Accumulate every chunk; the buffer alone holds only the last read.
                out.write(buf, 0, len);
                System.out.println("the length is " + len);
            }
            reader.close();
            String ts = out.toString();
            System.out.println("the str is " + ts);
        }
    }
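When calling an external program like this, two details are easy to miss: the output has to be read to the end before its full length is known, and the exit code shows whether the conversion succeeded. The sketch below is one possible way to package those steps with `ProcessBuilder`. The class and method names are hypothetical, and it assumes `pdftotext` accepts the same `-enc UTF-8 -q <file> -` arguments used in this section.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.util.Arrays;
import java.util.List;

/** Sketch: run pdftotext, read all of its output, and check the exit code. */
class PdfToTextRunner {

    /** Builds the pdftotext argument list; the trailing "-" means standard output. */
    static List<String> buildCommand(String pdftotextPath, String pdfPath) {
        return Arrays.asList(pdftotextPath, "-enc", "UTF-8", "-q", pdfPath, "-");
    }

    /** Reads a Reader to the end and returns everything it produced. */
    static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[8192];
        int len;
        while ((len = reader.read(buf)) >= 0) {
            sb.append(buf, 0, len);
        }
        return sb.toString();
    }

    /** Runs pdftotext and returns the extracted text, or throws on a non-zero exit code. */
    static String extractText(String pdftotextPath, String pdfPath)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(buildCommand(pdftotextPath, pdfPath));
        // Let error output go to this process's stderr so the child cannot block on it.
        pb.redirectError(ProcessBuilder.Redirect.INHERIT);
        Process p = pb.start();
        String text;
        InputStream in = p.getInputStream();
        try {
            text = readAll(new InputStreamReader(in, "UTF-8"));
        } finally {
            in.close();
        }
        int exit = p.waitFor();
        if (exit != 0) {
            throw new IOException("pdftotext exited with code " + exit);
        }
        return text;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.out.println("usage: PdfToTextRunner <path-to-pdftotext> <pdf-file>");
            return;
        }
        System.out.println(extractText(args[0], args[1]));
    }
}
```

Reading the stream fully before calling `waitFor()` matters: if the child fills its output pipe while the parent is only waiting, both sides can stall.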
