Recently a customer asked that our system be able to import Word files. Microsoft Word exists in several versions (97, 2003, 2007), and their storage formats differ considerably. Word 97 has largely left the market and few people still use it, so our system only considers the 2003 and 2007 versions. We only need to read the text content of a Word file; text styles, pictures, and other information can be ignored, and we do not need to manipulate the Word file directly. We therefore chose Apache POI for reading.
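Since the two formats we care about are stored so differently, an import feature may want to check what kind of file it was given before choosing how to read it. The following is only a minimal sketch using the plain JDK, not part of the original sample: the class name WordFormatSniffer and its methods are hypothetical. It relies on two well-known signatures: binary .doc files are OLE2 compound documents that begin with the bytes D0 CF 11 E0 A1 B1 1A E1, and .docx files are ZIP archives that begin with "PK" followed by 03 04. Note the limits: Word 97 files use the same OLE2 container as Word 2003 files, and any ZIP archive matches the second signature, so this only narrows the choice down.

```java
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.Arrays;

public class WordFormatSniffer {

    /** Coarse container type, judged only by the first bytes of the file. */
    public enum Format { DOC_OLE2, DOCX_ZIP, UNKNOWN }

    // Signature of an OLE2 compound document (used by binary .doc files).
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // Local file header signature of a ZIP archive (used by .docx files).
    private static final byte[] ZIP_MAGIC = { 'P', 'K', 0x03, 0x04 };

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    /** Classifies a file by its leading bytes (8 bytes are enough). */
    public static Format detect(byte[] header) {
        if (startsWith(header, OLE2_MAGIC)) {
            return Format.DOC_OLE2;
        }
        if (startsWith(header, ZIP_MAGIC)) {
            return Format.DOCX_ZIP;
        }
        return Format.UNKNOWN;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.out.println("usage: java WordFormatSniffer <file>");
            return;
        }
        byte[] header = new byte[8];
        int n;
        InputStream in = new FileInputStream(args[0]);
        try {
            n = in.read(header); // number of bytes read, or -1 for an empty file
        } finally {
            in.close();
        }
        System.out.println(detect(Arrays.copyOf(header, Math.max(n, 0))));
    }
}
```

A caller could use the result to decide which extractor to try, falling back to the file extension when the result is UNKNOWN.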
Reading a Word 2003 file (.doc) is relatively simple and requires only two jar packages: poi-3.5-beta6-20090622.jar and poi-scratchpad-3.5-beta6-20090622.jar. Reading a Word 2007 file (.docx) is more troublesome. The trouble is not in writing the code but in the number of jar packages that have to be imported; there are as many as seven:
1. openxml4j-bin-beta.jar
2. poi-3.5-beta6-20090622.jar
3. poi-ooxml-3.5-beta6-20090622.jar
4. dom4j-1.6.1.jar
5. geronimo-stax-api_1.0_spec-1.0.jar
6. ooxml-schemas-1.0.jar
7. xmlbeans-2.3.0.jar
Items 4-7 are the jar packages that poi-ooxml-3.5-beta6-20090622.jar depends on (they are found in the ooxml-lib directory of poi-bin-3.5-beta6-20090622.tar.gz).
Before writing any code we need to download the jar packages. Only poi-bin-3.5-beta6-20090622.tar.gz and openxml4j-bin-beta.jar have to be downloaded, because all the other jar packages we need are included in poi-bin-3.5-beta6-20090622.tar.gz. The download addresses are:
poi-bin-3.5-beta6-20090622.tar.gz: http://apache.etoak.com/poi/dev/bin/poi-bin-3.5-beta6-20090622.tar.gz
openxml4j-bin-beta.jar: http://mirror.optus.net/sourceforge/o/op/openxml4j/openxml4j-bin-beta.jar
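With this many jar packages, a forgotten one typically only shows up at run time as a ClassNotFoundException or NoClassDefFoundError. As a small sketch (not part of the original sample; the class name ClasspathCheck and the method missingClasses are hypothetical), the following plain-JDK helper reports which of a list of class names cannot be loaded. The three POI class names checked in main are the ones the reading code in this article imports; which jar supplies each of them is not checked here.

```java
import java.util.ArrayList;
import java.util.List;

public class ClasspathCheck {

    /** Returns the names from classNames that cannot be loaded from the classpath. */
    public static List<String> missingClasses(String... classNames) {
        List<String> missing = new ArrayList<String>();
        ClassLoader loader = ClasspathCheck.class.getClassLoader();
        for (String name : classNames) {
            try {
                // initialize=false: only locate and load the class, do not run static initializers
                Class.forName(name, false, loader);
            } catch (ClassNotFoundException e) {
                missing.add(name);
            } catch (LinkageError e) {
                // e.g. NoClassDefFoundError when a class this one depends on is absent
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        List<String> missing = missingClasses(
                "org.apache.poi.hwpf.extractor.WordExtractor",      // used for .doc
                "org.apache.poi.xwpf.extractor.XWPFWordExtractor",  // used for .docx
                "org.apache.poi.openxml4j.opc.OPCPackage");         // used for .docx
        if (missing.isEmpty()) {
            System.out.println("All required classes were found.");
        } else {
            System.out.println("Missing from the classpath: " + missing);
        }
    }
}
```

Running it with the same classpath as the application gives a quick indication of whether the jar packages were added correctly before any Word file is opened.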
The Java code that reads the Word files follows. Two things are worth noting: POI does not read picture information from a Word file, and for Word 2007 (.docx) files that contain tables, the data from all tables is placed at the end of the extracted string.
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.poi.POIXMLDocument;
import org.apache.poi.POIXMLTextExtractor;
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;

/**
 * Test class that uses POI to read the text in Word 2003 and Word 2007 files.
 *
 * @createDate 2009-07-25
 * @author Carl He
 */
public class Test {
    public static void main(String[] args) {
        try {
            // Word 2003 (.doc): pictures are not read
            InputStream is = new FileInputStream(new File("C://files//2003.doc"));
            WordExtractor ex = new WordExtractor(is);
            String text2003 = ex.getText();
            System.out.println(text2003);

            // Word 2007 (.docx): pictures are not read, and the data in
            // tables is placed at the end of the string
            OPCPackage opcPackage = POIXMLDocument.openPackage("C://files//2007.docx");
            POIXMLTextExtractor extractor = new XWPFWordExtractor(opcPackage);
            String text2007 = extractor.getText();
            System.out.println(text2007);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
If you want the complete sample code, you can download it here. The download contains all the jar packages POI needs to read Word 2003 and Word 2007 files, together with sample Word 2003 and Word 2007 files.