[Learning Archives] Converting Word documents to XML format

Source: Internet
Author: User

------------------------------------------ 2013.7.26 ------------------------------------------

The weather is fine; the air temperature near the ground is 31 °C.

Yesterday I learned that Java has an open-source library, Jacob, that can convert Word documents to HTML format.

[Conjecture] Charts and tables in a Word document are stored in the converted HTML as tables and similar tags.

If the conjecture is correct, the useful information can be parsed and extracted from the HTML document to generate an XML document in the specified format.

[Afternoon work]

The morning conjecture is completely correct.

After reading some documentation about Jacob, I successfully converted a document containing a table into HTML and TXT formats.

Special thanks to the document posted under the ID "Wuhan County Magistrate".

Reference: Jacob Office Word file format conversion: http://blog.csdn.net/laoyaotask/article/details/9391435

Along the way, the source Word document could not be read or written. The cause was that the source file was marked read-only; after changing that attribute, the problem was solved.
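The note above does not say how the read-only attribute was changed (it may simply have been cleared by hand). As a minimal sketch, assuming the fix is done from Java before conversion, the standard java.io.File methods can check and clear the flag. The class and method names here are hypothetical.

```java
import java.io.File;

// Hypothetical helper, not from the original notes.
final class ReadOnlyCheck {

    /**
     * Tries to make an existing file readable and writable.
     * Returns true only if the file exists and is readable and writable afterwards.
     */
    static boolean ensureReadWrite(File file) {
        if (!file.isFile()) {
            return false; // missing file or a directory
        }
        if (!file.canRead()) {
            file.setReadable(true); // may fail silently if not permitted
        }
        if (!file.canWrite()) {
            file.setWritable(true); // clears the read-only flag where allowed
        }
        return file.canRead() && file.canWrite();
    }

    public static void main(String[] args) {
        if (args.length == 0) {
            System.out.println("usage: java ReadOnlyCheck <file>");
            return;
        }
        File source = new File(args[0]);
        System.out.println(source + " usable: " + ensureReadWrite(source));
    }
}
```

Calling such a check on each source document before handing it to the converter would surface the problem early instead of failing in the middle of a conversion.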

[Next plan]

Because the converted HTML document turned out to be only plain text without tags, after comparing the two outputs I decided to use the TXT documents directly for natural language processing (NLP) and data mining.

------------------------------------------ 2013.7.31 ------------------------------------------

The air temperature near the ground is 29 °C; the weather is fine, with slight haze.

I set off for home tomorrow ~ I am rather happy about it, although there were a few small hiccups when I picked up the ticket.

[Idea] I hope to get batch conversion working today. I plan to keep using Java, mainly so that it is easy to combine with my earlier code.

[Morning work]

Today's idea has been realized: all Word documents in the specified directory are converted into TXT documents in a batch.

I also learned some Java methods for working with files.

// Get the name of the current File object and check whether it ends with ".doc" (getName() already returns a String, so toString() is not needed)
file.getName().endsWith(".doc");
// Determine whether the object is a folder
file.isDirectory();
// Obtain all files and folders under the path with listFiles()
File[] files = path.listFiles(new FileFilter() ...

Reference sources:

Java: traverse all files with the .java suffix in a directory: http://zhidao.baidu.com/question/229445883.html

Java FileFilter: keep only folders and .xls files: http://zhidao.baidu.com/question/538907121.html
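The file methods noted above can be combined into a small directory scan. This is only a sketch under assumptions: the original batch code is not shown, so the class name, the method names, the recursion into subfolders, and the case-insensitive suffix check are all illustrative choices. The actual Word-to-TXT conversion step is left out.

```java
import java.io.File;
import java.io.FileFilter;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not the author's original code.
final class DocFinder {

    /** True if the file name ends with ".doc", ignoring case. */
    static boolean isWordDocument(String fileName) {
        return fileName.toLowerCase().endsWith(".doc");
    }

    /** Maps "report.doc" to "report.txt"; expects a name accepted by isWordDocument. */
    static String txtNameFor(String docName) {
        return docName.substring(0, docName.length() - ".doc".length()) + ".txt";
    }

    /** Collects all .doc files under dir, descending into subfolders (an assumption). */
    static List<File> findWordDocuments(File dir) {
        List<File> result = new ArrayList<>();
        File[] entries = dir.listFiles(new FileFilter() {
            @Override
            public boolean accept(File f) {
                // Keep folders (to recurse into) and Word documents.
                return f.isDirectory() || isWordDocument(f.getName());
            }
        });
        if (entries == null) {
            return result; // dir is not a directory or could not be read
        }
        for (File entry : entries) {
            if (entry.isDirectory()) {
                result.addAll(findWordDocuments(entry));
            } else {
                result.add(entry);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        File root = new File(args.length > 0 ? args[0] : ".");
        for (File doc : findWordDocuments(root)) {
            // A real batch job would call the converter here.
            System.out.println(doc + " -> " + txtNameFor(doc.getName()));
        }
    }
}
```

Note that listFiles returns null rather than an empty array when the path is not a readable directory, which is why the null check comes before the loop.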

[Next plan]

This is really the next step of the plan from the 26th, except that the conversion to XML will start out using regular expressions.
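As a rough illustration of what a regular-expression approach to XML generation could look like, here is a minimal sketch. The notes do not describe the layout of the TXT files or the target XML format, so the "label: value" line format, the element names, and the class name below are purely hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; the real TXT layout and XML schema are not given in the notes.
final class TxtToXml {

    // Assumed line format: "label: value" (the label contains no colon).
    private static final Pattern FIELD =
            Pattern.compile("^\\s*([^:]+?)\\s*:\\s*(.+?)\\s*$");

    /** Escapes the characters that are special in XML text and attribute values. */
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    /** Turns every matching line into a <field> element; other lines are skipped. */
    static String toXml(String text) {
        StringBuilder xml = new StringBuilder("<document>\n");
        for (String line : text.split("\\R")) {
            Matcher m = FIELD.matcher(line);
            if (m.matches()) {
                xml.append("  <field name=\"").append(escape(m.group(1))).append("\">")
                   .append(escape(m.group(2)))
                   .append("</field>\n");
            }
        }
        return xml.append("</document>\n").toString();
    }

    public static void main(String[] args) {
        System.out.print(toXml("Title: A & B\nno colon here\n Time : 10:30 "));
    }
}
```

Regular expressions are convenient for a first pass like this, but they only work as long as the text follows a predictable line pattern; lines that do not match are silently dropped in this sketch.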
