Because a Word .doc file is stored in Microsoft's proprietary binary format rather than as plain text, reading it directly with a Java stream produces nothing but garbled characters. You therefore need Jacob as an intermediate bridge. The file can also be read with POI.
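To see why a raw stream read comes out garbled, it helps to look at how the file begins. A legacy .doc file is an OLE2 compound binary file, and such files start with the eight-byte signature D0 CF 11 E0 A1 B1 1A E1. The following is a minimal sketch that checks for that signature; the class and method names (DocHeaderCheck, hasOle2Signature, isOle2File) are made up for illustration and do not come from POI or Jacob.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper, not part of any library.
final class DocHeaderCheck {
    // Signature at the start of every OLE2 compound file, including legacy .doc files.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    /** Returns true if the given bytes start with the OLE2 signature. */
    public static boolean hasOle2Signature(byte[] head) {
        return head.length >= OLE2_MAGIC.length
                && Arrays.equals(Arrays.copyOf(head, OLE2_MAGIC.length), OLE2_MAGIC);
    }

    /** Reads the first eight bytes of the file and checks them against the signature. */
    public static boolean isOle2File(Path file) throws IOException {
        byte[] head = new byte[OLE2_MAGIC.length];
        int filled = 0;
        try (InputStream in = Files.newInputStream(file)) {
            // read() may return fewer bytes than requested, so loop until full or EOF
            while (filled < head.length) {
                int n = in.read(head, filled, head.length - filled);
                if (n < 0) {
                    break;
                }
                filled += n;
            }
        }
        return filled == head.length && hasOle2Signature(head);
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path of the file to inspect
        Path file = Paths.get(args[0]);
        System.out.println(file + (isOle2File(file)
                ? " is an OLE2 binary file"
                : " is not an OLE2 binary file"));
    }
}
```

If the check returns true, the bytes are a binary container and decoding them as text will not give readable output, which is why a parser such as POI or a bridge such as Jacob is needed.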
Let's start with reading the data through POI. To do this, first download tm-extractors-0.4.jar, which can be found with a Baidu search. The code is as follows:
import java.io.FileInputStream;
import org.textmining.text.extraction.WordExtractor;

// filePath is the path of the Word document
try {
    FileInputStream fileInputStream = new FileInputStream(filePath);
    WordExtractor extractor = new WordExtractor();
    // Extract the document body as plain text
    String temp = extractor.extractText(fileInputStream);
    System.out.println(temp + "=temp");
    fileInputStream.close();
} catch (Exception ex) {
    // Covers a missing file as well as extraction failures
    System.out.println("Error reading Word file: " + ex.getMessage());
}
filePath is the path of the Word document, and the extracted text is returned in the string temp. The text read this way is not garbled, but the result is still unsatisfactory because the Word formatting is lost.
Now for Jacob. Download jacob.zip from the official site: http://sourceforge.net/project/showfiles.php?group_id=109543&package_id=118368. After downloading, unzip the file and put jacob.jar under project/WEB-INF/lib. Put jacob.dll under C:/WINDOWS/system32/ and under Java/jdk*.*/jre/bin. The configuration is then complete. Code:
import java.io.File;
import com.jacob.activeX.ActiveXComponent;
import com.jacob.com.Dispatch;
import com.jacob.com.Variant;

public boolean chageFormat(String folderPath, String fileName) {
    String fileFormat = "";
    System.out.println(folderPath);
    // Last four characters of the name, i.e. the extension including the dot
    fileFormat = fileName.substring(fileName.length() - 4, fileName.length());
    System.out.println(fileFormat);
    if (fileFormat.equalsIgnoreCase(".doc")) {
        // Full path of the Word file
        String docFile = folderPath + File.separator + fileName;
        System.out.println("Word file path: " + docFile);
        // Full path of the HTML file: same name with .htm in place of .doc
        String htmlFile = docFile.substring(0, docFile.length() - 4) + ".htm";
        System.out.println("HTM file path: " + htmlFile);
        // Start Word
        ActiveXComponent app = new ActiveXComponent("Word.Application");
        try {
            // Run Word in non-visible mode
            app.setProperty("Visible", new Variant(false));
            // The Documents collection of the Word application
            Dispatch docs = app.getProperty("Documents").toDispatch();
            // Open the Word file (ConfirmConversions = false, ReadOnly = true)
            Dispatch doc = Dispatch.invoke(docs, "Open", Dispatch.Method,
                    new Object[] {docFile, new Variant(false), new Variant(true)},
                    new int[1]).toDispatch();
            // Save the file in HTM format (8 is Word's wdFormatHTML constant)
            Dispatch.invoke(doc, "SaveAs", Dispatch.Method,
                    new Object[] {htmlFile, new Variant(8)}, new int[1]);
            // Close the document without saving changes
            Dispatch.call(doc, "Close", new Variant(false));
        } catch (Exception e) {
            e.printStackTrace();
            // The conversion failed; the finally block still quits Word
            return false;
        } finally {
            // Exit the Word program
            app.invoke("Quit", new Variant[] {});
        }
        // The conversion is complete
        return true;
    }
    return false;
}
folderPath is the directory that holds the Word file, and fileName is the name of the Word file. The method converts the Word file into an HTM file. You can then read the HTM file with a stream: the text is not garbled, and the formatting is preserved.
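Reading the generated HTM file with a stream could look like the sketch below. One assumption to flag: the file Word writes is often encoded in the system code page (for example GBK or windows-1252) rather than UTF-8, so the sketch takes the charset as a parameter instead of guessing; check the charset declared in the file's meta tag. The names HtmReader and readAll are made up for this example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of Jacob or POI.
final class HtmReader {
    /** Reads the whole file as text, decoding its bytes with the given charset. */
    public static String readAll(Path htmFile, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (InputStream in = Files.newInputStream(htmFile);
             BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // readLine() strips the line terminator, so add one back
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // args[0]: path of the .htm file; args[1]: charset name, e.g. "GBK" (assumed, verify per file)
        System.out.println(readAll(Paths.get(args[0]), Charset.forName(args[1])));
    }
}
```

Passing the charset explicitly avoids depending on the JVM default encoding, which is a common source of garbled text when the file is read on a different machine.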
Finally, it should be emphasized that the Jacob components depend on the JDK and Windows versions, so the versions must match; otherwise an error is reported. You have to try the versions one by one. There is no shortcut.
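Before trying versions one by one, it can save time to print what the running JVM reports about itself, since jacob.dll is a native library loaded into the JVM process and a 32-bit DLL cannot be loaded by a 64-bit JVM (or the reverse). The sketch below uses only standard system properties; the class name JacobEnvCheck and the "contains 64" heuristic are assumptions made for this example, not part of Jacob.

```java
import java.io.File;

// Hypothetical diagnostic helper, not part of Jacob.
final class JacobEnvCheck {
    /** Heuristic: treats os.arch values such as "amd64" or "x86_64" as a 64-bit JVM. */
    public static boolean is64BitJvm(String osArch) {
        return osArch != null && osArch.contains("64");
    }

    public static void main(String[] args) {
        String arch = System.getProperty("os.arch");
        System.out.println("java.version = " + System.getProperty("java.version"));
        System.out.println("os.name      = " + System.getProperty("os.name"));
        System.out.println("os.arch      = " + arch
                + (is64BitJvm(arch) ? " (64-bit JVM)" : " (32-bit JVM)"));
        // Directories the JVM searches when loading native libraries
        System.out.println("java.library.path entries:");
        for (String dir : System.getProperty("java.library.path").split(File.pathSeparator)) {
            System.out.println("  " + dir);
        }
    }
}
```

The output narrows down which jacob.dll build to try and shows whether the directory holding the DLL is actually on the native library search path.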