Four tools for extracting Word and PDF text with Java

Last Update:2017-06-13 Source: Internet

Author: User

Reposted from: https://www.ibm.com/developerworks/cn/java/l-java-tips/ (thanks to the original author for publishing the article)

Using Jacob

Jacob is a bridge: a piece of middleware that connects Java to COM or Win32 functions. Jacob cannot extract Word, Excel, or other files by itself; it needs a native DLL to do the work. You do not have to write that DLL yourself, because Jacob's author already provides one.

Jacob jar and DLL file download: http://www.matrix.org.cn/down_view.asp?id=13

After you have downloaded Jacob and installed it (put the DLL in a directory on the PATH and add the jar file to the CLASSPATH), you can write your own extraction program. Here is a simple example:

import java.io.File;
import com.jacob.com.*;
import com.jacob.activeX.*;

/**
 * Title: pdf extraction
 * Description: email:[email protected]
 * Copyright: Matrix Copyright (c) 2003
 * Company: matrix.org.cn
 * @author Chris
 * @version 1.0, who use this example pls remain the declare
 */
public class FileExtracter {
    public static void main(String[] args) {
        // Start Word through COM; requires Windows with Word installed.
        ActiveXComponent component = new ActiveXComponent("Word.Application");
        String inFile = "C:\\test.doc";
        String tpFile = "C:\\temp.htm";
        String otFile = "C:\\temp.xml";
        boolean flag = false;
        try {
            // Keep the Word window hidden.
            component.setProperty("Visible", new Variant(false));
            // Get the Documents collection and open the input file
            // (ConfirmConversions = false, ReadOnly = true).
            Dispatch wordacc = component.getProperty("Documents").toDispatch();
            Dispatch wordfile = Dispatch.invoke(wordacc, "Open", Dispatch.Method,
                    new Object[]{inFile, new Variant(false), new Variant(true)},
                    new int[1]).toDispatch();
            // Save a copy as HTML (Word file format 8).
            Dispatch.invoke(wordfile, "SaveAs", Dispatch.Method,
                    new Object[]{tpFile, new Variant(8)}, new int[1]);
            // Close the document without saving changes.
            Variant f = new Variant(false);
            Dispatch.call(wordfile, "Close", f);
            flag = true;
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            // Always shut Word down, even if the conversion failed.
            component.invoke("Quit", new Variant[]{});
        }
    }
}
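The Jacob example stops once Word has saved the document as HTML (`temp.htm`). To get from that file to plain text, the markup still has to be removed. The following is a minimal sketch of one way to do that using only the JDK. The class `HtmlTextSketch` and the method `toPlainText` are hypothetical names introduced for illustration, not part of Jacob, and regex-based stripping is only a rough approximation of what a real HTML parser would do.

```java
import java.util.regex.Pattern;

// Hypothetical helper (not part of Jacob): turns an HTML string into rough plain text.
final class HtmlTextSketch {
    // HTML comments, including the conditional comments Word tends to emit.
    private static final Pattern COMMENT = Pattern.compile("(?s)<!--.*?-->");
    // Whole <script> and <style> elements, content included.
    private static final Pattern SCRIPT_OR_STYLE =
            Pattern.compile("(?is)<(script|style)[^>]*>.*?</\\1\\s*>");
    // Any remaining tag.
    private static final Pattern TAG = Pattern.compile("(?s)<[^>]+>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    static String toPlainText(String html) {
        String text = COMMENT.matcher(html).replaceAll(" ");
        text = SCRIPT_OR_STYLE.matcher(text).replaceAll(" ");
        text = TAG.matcher(text).replaceAll(" ");
        // Decode only a handful of common entities; "&amp;" goes last so that
        // an escaped entity such as "&amp;lt;" is not decoded twice.
        text = text.replace("&nbsp;", " ")
                   .replace("&lt;", "<")
                   .replace("&gt;", ">")
                   .replace("&quot;", "\"")
                   .replace("&amp;", "&");
        // Collapse runs of whitespace left behind by the removed markup.
        return WHITESPACE.matcher(text).replaceAll(" ").trim();
    }
}
```

To use it, read the contents of `temp.htm` into a string with the encoding Word used when saving, then pass that string to `HtmlTextSketch.toPlainText`.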

Using Apache POI to extract Word and Excel files

POI is an Apache project, but using POI directly can still feel cumbersome. That does not matter, because a simpler wrapper interface is available:

Download the wrapped POI package: http://www.matrix.org.cn/down_view.asp?id=14

After downloading it, put it on your classpath. Here is an example of how to use it:

import java.io.*;
import org.textmining.text.extraction.WordExtractor;

/**
 * <p>Title: word extraction</p>
 * <p>Description: email:[email protected]</p>
 * <p>Copyright: Matrix Copyright (c) 2003</p>
 * <p>Company: matrix.org.cn</p>
 * @author chris
 * @version 1.0, who use this example pls remain the declare
 */
public class PdfExtractor {
    public PdfExtractor() {
    }

    public static void main(String args[]) throws Exception {
        FileInputStream in = new FileInputStream("C:\\a.doc");
        try {
            // The wrapper reads the .doc stream and returns its text.
            WordExtractor extractor = new WordExtractor();
            String str = extractor.extractText(in);
            System.out.println("the result length is " + str.length());
            System.out.println("the result is " + str);
        } finally {
            // Release the file handle even if extraction fails.
            in.close();
        }
    }
}

PDFBox: for extracting PDF files

Note that PDFBox's support for Chinese text is poor. First download PDFBox: http://www.matrix.org.cn/down_view.asp?id=12

Here is an example of how to extract a PDF file using PDFBox:

import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.pdfparser.PDFParser;
import java.io.*;
import org.pdfbox.util.PDFTextStripper;

/**
 * <p>Title: pdf extraction</p>
 * <p>Description: email:[email protected]</p>
 * <p>Copyright: Matrix Copyright (c) 2003</p>
 * <p>Company: matrix.org.cn</p>
 * @author chris
 * @version 1.0, who use this example pls remain the declare
 */
public class PdfExtracter {
    public PdfExtracter() {
    }

    public String getTextFromPdf(String filename) throws Exception {
        PDDocument pdfDocument = null;
        FileInputStream is = new FileInputStream(filename);
        try {
            // Parse the PDF file into a document object.
            PDFParser parser = new PDFParser(is);
            parser.parse();
            pdfDocument = parser.getPDDocument();
            // Write the extracted text into an in-memory buffer.
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            OutputStreamWriter writer = new OutputStreamWriter(out);
            PDFTextStripper stripper = new PDFTextStripper();
            stripper.writeText(pdfDocument.getDocument(), writer);
            writer.close();
            byte[] contents = out.toByteArray();
            String ts = new String(contents);
            System.out.println("the string length is " + contents.length + "\n");
            return ts;
        } finally {
            // Release the document and the file handle.
            if (pdfDocument != null) {
                pdfDocument.close();
            }
            is.close();
        }
    }

    public static void main(String args[]) {
        PdfExtracter pf = new PdfExtracter();
        try {
            String ts = pf.getTextFromPdf("C:\\a.pdf");
            System.out.println(ts);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
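One detail of the example above is worth a note: it collects the text through a byte buffer, encoding and decoding with the platform default charset. If that default cannot represent Chinese characters, they are replaced during encoding and cannot be recovered. The sketch below uses only the JDK to show two ways of avoiding that dependency. `CharsetRoundTripSketch` and its `writeText` method are hypothetical stand-ins for any extractor that writes to a `java.io.Writer`. This does not address PDFBox's own limitations with Chinese PDFs.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.StringWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; writeText stands in for an extractor writing to a Writer.
final class CharsetRoundTripSketch {
    static void writeText(String text, Writer writer) throws IOException {
        writer.write(text);
    }

    // Option 1: keep the byte buffer, but name the charset on both sides.
    static String viaBytes(String text) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            writeText(text, writer);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    // Option 2: skip bytes entirely and collect characters in a StringWriter.
    static String viaStringWriter(String text) throws IOException {
        StringWriter writer = new StringWriter();
        writeText(text, writer);
        return writer.toString();
    }
}
```

Both methods return the input unchanged regardless of the platform default charset; the second is simpler whenever the extractor accepts an arbitrary `Writer`.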

xpdf: extracting PDF files with Chinese support

Xpdf is an open source project, and we can call its native executable from Java to extract text from Chinese PDF files.

Download the Xpdf package: http://www.matrix.org.cn/down_view.asp?id=15

You also need to download the Chinese language support patch: http://www.matrix.org.cn/down_view.asp?id=16

After following the Readme in the Chinese patch, you can start writing the Java program that calls the native executable.

Here is an example of how to invoke it:

import java.io.*;

/**
 * <p>Title: pdf extraction</p>
 * <p>Description: email:[email protected]</p>
 * <p>Copyright: Matrix Copyright (c) 2003</p>
 * <p>Company: matrix.org.cn</p>
 * @author chris
 * @version 1.0, who use this example pls remain the declare
 */
public class PdfWin {
    public PdfWin() {
    }

    public static void main(String args[]) throws Exception {
        String PATH_TO_XPDF = "C:\\Program Files\\xpdf\\pdftotext.exe";
        String filename = "C:\\a.pdf";
        // -enc UTF-8: output encoding; -q: suppress messages; "-": write text to stdout.
        String[] cmd = new String[]{PATH_TO_XPDF, "-enc", "UTF-8", "-q", filename, "-"};
        Process p = Runtime.getRuntime().exec(cmd);
        BufferedInputStream bis = new BufferedInputStream(p.getInputStream());
        InputStreamReader reader = new InputStreamReader(bis, "UTF-8");
        StringWriter out = new StringWriter();
        char[] buf = new char[10000];
        int len;
        while ((len = reader.read(buf)) >= 0) {
            // Accumulate every chunk; the buffer alone only holds the last read.
            out.write(buf, 0, len);
            System.out.println("the length is " + len);
        }
        reader.close();
        String ts = out.toString();
        System.out.println("the str is " + ts);
    }
}
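The same call can also be made with `ProcessBuilder`, which makes it easier to check the exit code and to keep error output from blocking the child process. The following is a sketch under assumptions rather than a definitive implementation: the class `PdfToTextSketch` and its method names are hypothetical, while the `pdftotext` arguments are the ones used in the example above.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical wrapper around the pdftotext command line used above.
final class PdfToTextSketch {
    // Same arguments as the example: UTF-8 output, quiet mode, text written to stdout.
    static List<String> buildCommand(String pdftotextPath, String pdfFile) {
        return Arrays.asList(pdftotextPath, "-enc", "UTF-8", "-q", pdfFile, "-");
    }

    // Reads the whole stream first, then decodes once, so multi-byte
    // characters are never split across chunk boundaries.
    static String readAllUtf8(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }

    static String extract(String pdftotextPath, String pdfFile)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(buildCommand(pdftotextPath, pdfFile));
        // Send the child's stderr to this process's stderr so it cannot fill up and stall.
        pb.redirectError(ProcessBuilder.Redirect.INHERIT);
        Process process = pb.start();
        String text;
        try (InputStream stdout = process.getInputStream()) {
            text = readAllUtf8(stdout);
        }
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("pdftotext exited with code " + exitCode);
        }
        return text;
    }
}
```

A call such as `PdfToTextSketch.extract("C:\\Program Files\\xpdf\\pdftotext.exe", "C:\\a.pdf")` would then return the extracted text, assuming Xpdf is installed at that path.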
