------------------------------------------ 2013.7.26 ------------------------------------------
Clear weather; ground-level air temperature 31 °C.
Yesterday I learned that JACOB (Java-COM Bridge), an open-source Java library, can convert Word documents to HTML.
[Conjecture] Table and chart information in a Word document is carried over into the HTML as table and similar tags.
If the conjecture is correct, the useful information can be parsed out of the HTML document and used to generate an XML document in the required format.
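If the conjecture holds, the extraction step could look roughly like the sketch below. It is only an illustration under assumptions: the class name TableToXml, the method convert, and the output element names (table, row, cell) are made up here, since the required XML format is not specified, and regular expressions are used as a quick first pass rather than a full HTML parser.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: turns the rows and cells of HTML tables into a simple XML layout.
final class TableToXml {
    // One table row; DOTALL lets a row span several lines.
    private static final Pattern ROW =
        Pattern.compile("<tr\\b[^>]*>(.*?)</tr>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    // One cell (<td> or <th>); \b keeps <thead> from matching.
    private static final Pattern CELL =
        Pattern.compile("<t[dh]\\b[^>]*>(.*?)</t[dh]>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    static String convert(String html) {
        StringBuilder xml = new StringBuilder("<table>\n");
        Matcher row = ROW.matcher(html);
        while (row.find()) {
            xml.append("  <row>\n");
            Matcher cell = CELL.matcher(row.group(1));
            while (cell.find()) {
                // Drop nested tags such as <p> or <span> and trim the remaining text.
                String text = cell.group(1).replaceAll("<[^>]+>", "").trim();
                xml.append("    <cell>").append(escape(text)).append("</cell>\n");
            }
            xml.append("  </row>\n");
        }
        return xml.append("</table>\n").toString();
    }

    // Escapes the characters that would break the XML; HTML entities are not decoded.
    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        String html = "<table><tr><td><p>Name</p></td><td>Age</td></tr></table>";
        System.out.print(convert(html));
    }
}
```

Nested tables would confuse this pattern-based approach, so it only suits flat tables.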
[Afternoon work]
The morning conjecture is completely correct.
After reading some material on JACOB, I successfully converted the document containing the table to HTML and TXT formats.
Special thanks to the write-up by the author whose ID is "Wuhan County Magistrate".
Reference: Jacob Office Word file format conversion: http://blog.csdn.net/laoyaotask/article/details/9391435
Along the way, the source Word document could not be read or written because the source file was marked read-only. After changing that setting, the problem was solved.
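A check like the following could catch the read-only case before conversion starts. This is a minimal sketch using only java.io.File; the class name WritableCheck and the method ensureWritable are hypothetical, and it assumes the read-only flag is the only thing blocking access.

```java
import java.io.File;
import java.io.IOException;

// Hypothetical helper: makes sure a source file can be written before converting it.
final class WritableCheck {
    /** Returns true if the file is writable, clearing the read-only flag if needed. */
    static boolean ensureWritable(File file) {
        if (file.canWrite()) {
            return true;
        }
        // setWritable(true) returns false when the permission cannot be changed.
        return file.setWritable(true) && file.canWrite();
    }

    public static void main(String[] args) throws IOException {
        File doc = File.createTempFile("sample", ".doc");
        doc.deleteOnExit();
        doc.setReadOnly();
        System.out.println("writable after fix: " + ensureWritable(doc));
    }
}
```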
[Next plan]
Because the converted HTML document turned out to be plain text without tags, after comparing the two outputs I decided to use the TXT documents directly for Natural Language Processing (NLP) data mining.
------------------------------------------ 2013.7.31 ------------------------------------------
Ground-level air temperature 29 °C; clear weather with slight haze.
I set off for home tomorrow ~ I was fairly happy about it, although I ran into a few small hiccups when picking up the ticket.
[Idea] I hope to get batch conversion working today. I plan to keep using Java, mainly so that it combines easily with the earlier code.
[Morning work]
Today's idea is implemented: all Word documents in a specified directory are converted to TXT documents in one batch.
I also learned a few of Java's file-handling methods.
// Get the file's name (already a String) and check whether it ends with ".doc"
file.getName().endsWith(".doc");
// Determine whether the object is a folder
file.isDirectory();
// Obtain all files and folders under the path [listFiles()]
File[] files = path.listFiles(new FileFilter() { ... });
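Put together, the directory walk behind the batch conversion might look like the sketch below. It is not the code from that day: the class name DocFinder and the methods findDocs and txtTarget are hypothetical, the case-insensitive suffix check is an added assumption, and the JACOB conversion call itself is left out.

```java
import java.io.File;
import java.io.FileFilter;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: collects every .doc file under a directory, including sub-folders.
final class DocFinder {
    // Keeps sub-folders (so the walk can descend) and files ending in ".doc".
    private static final FileFilter DOC_OR_DIR = new FileFilter() {
        @Override
        public boolean accept(File f) {
            return f.isDirectory() || f.getName().toLowerCase().endsWith(".doc");
        }
    };

    static List<File> findDocs(File dir) {
        List<File> found = new ArrayList<File>();
        File[] entries = dir.listFiles(DOC_OR_DIR);
        if (entries == null) {
            return found; // dir is not a directory, or it could not be read
        }
        for (File entry : entries) {
            if (entry.isDirectory()) {
                found.addAll(findDocs(entry));
            } else {
                found.add(entry);
            }
        }
        return found;
    }

    /** Target path for the converted text: same folder, ".doc" replaced by ".txt". */
    static File txtTarget(File doc) {
        String name = doc.getName();
        return new File(doc.getParentFile(), name.substring(0, name.length() - 4) + ".txt");
    }

    public static void main(String[] args) {
        File root = new File(args.length > 0 ? args[0] : ".");
        for (File doc : findDocs(root)) {
            // The actual Word-to-TXT conversion would be called here.
            System.out.println(doc + " -> " + txtTarget(doc));
        }
    }
}
```

Filtering inside listFiles keeps the loop body small: every entry it returns is either a folder to descend into or a document to convert.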
Reference Source:
Traversing all files with the .java suffix in a directory in Java: http://zhidao.baidu.com/question/229445883.html
A Java FileFilter that keeps only folders and .xls files: http://zhidao.baidu.com/question/538907121.html
[Next plan]
This is really the next step of the plan from the 26th, except that the conversion to XML will start out using regular expressions.