POI operations on Word documents

Last Update:2014-10-23 Source: Internet

Author: User

97-2003:

    import java.io.FileInputStream;

    import org.apache.poi.POITextExtractor;
    import org.apache.poi.hwpf.extractor.WordExtractor;

    // Get the .doc file extraction tool
    WordExtractor doc = new WordExtractor(new FileInputStream(filePath));
    // Extract the .doc text
    String text = doc.getText();
    // Extract the .doc comments
    String[] comments = doc.getCommentsText();

2007:

    import org.apache.poi.POITextExtractor;
    import org.apache.poi.POIXMLDocument;
    import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
    import org.apache.poi.xwpf.usermodel.XWPFComment;
    import org.apache.poi.xwpf.usermodel.XWPFDocument;

    // Get the .docx file extraction tool
    XWPFWordExtractor docx = new XWPFWordExtractor(POIXMLDocument.openPackage(filePath));
    // Extract the .docx text
    String text = docx.getText();
    // Extract the .docx comments
    XWPFComment[] comments = ((XWPFDocument) docx.getDocument()).getComments();
    for (XWPFComment comment : comments) {
        comment.getId();     // comment ID
        comment.getAuthor(); // comment author
        comment.getText();   // comment content
    }

5: Use POI to extract the total number of pages and characters of the Word...

97-2003:

    // Extractor for Word files in .doc format
    WordExtractor doc = new WordExtractor(new FileInputStream(filePath));
    // Total number of pages
    int pages = doc.getSummaryInformation().getPageCount();
    // Total number of words
    int wordCount = doc.getSummaryInformation().getWordCount();

2007:

    XWPFDocument docx = new XWPFDocument(POIXMLDocument.openPackage(filePath));
    // Total number of pages
    int pages = docx.getProperties().getExtendedProperties().getUnderlyingProperties().getPages();
    // Total number of characters, ignoring spaces
    int characters = docx.getProperties().getExtendedProperties().getUnderlyingProperties().getCharacters();

In addition, the getCharactersWithSpaces() method returns the total number of characters including spaces.

TIPS:
Office 2007 stores documents in the new Office Open XML format, which is far less opaque than the earlier binary file format. You can open an Office 2007 file with WinRAR: word/document.xml holds the main body content, and word/comments.xml holds the comments. Studying these files can help developers understand the format.
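
The tip above can also be tried from code without POI, since a .docx file is a ZIP archive. The following is a minimal sketch that uses only the JDK's java.util.zip classes. The class name DocxPartReader and the method readPart are hypothetical names chosen for illustration, and the sketch assumes the part is encoded as UTF-8.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper, not part of POI: reads one part out of a .docx archive.
final class DocxPartReader {

    /** Returns the named part as text, or null if the archive has no such part. */
    static String readPart(String docxPath, String partName) throws IOException {
        try (ZipFile zip = new ZipFile(docxPath)) {
            ZipEntry entry = zip.getEntry(partName);
            if (entry == null) {
                return null; // e.g. word/comments.xml is absent when there are no comments
            }
            try (InputStream in = zip.getInputStream(entry)) {
                ByteArrayOutputStream buffer = new ByteArrayOutputStream();
                byte[] chunk = new byte[8192];
                int read;
                while ((read = in.read(chunk)) != -1) {
                    buffer.write(chunk, 0, read);
                }
                // Assumption: the XML part is UTF-8 encoded.
                return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
            }
        }
    }
}
```

Calling DocxPartReader.readPart(filePath, "word/document.xml") returns the raw XML of the body, which still has to be parsed to get at the text.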

Introduction to the Office Open XML file formats: www.microsoft.com/china/msdn/library/office/office/officeopenxmlformats.mspx
With the emergence of XML in the 1990s, enterprise computing customers began to realize the commercial value of adopting open formats and standards in their computer products and applications. IT professionals benefit from a common data format, which XML makes possible because it can be read by applications, platforms, and Internet browsers.

Similarly, with the adoption of XML support in Microsoft Office 2000, developers began to see the need to move from the binary file formats of earlier Microsoft Office versions to an XML format. Binary files (.doc, .dot, .xls, and .ppt files) carried the burden of storing and transferring data for years, but they can no longer meet new market demands, including moving data easily between heterogeneous applications and allowing users to gather business information from that data.

The 2007 Microsoft Office system continues this transition by using XML-based file formats for Microsoft Office Excel 2007, Microsoft Office Word 2007, and Microsoft Office PowerPoint 2007. The new file format, known as the Office Open XML format, addresses these market demands and changes the way you build solutions based on Microsoft Office documents.

POI is an open-source Apache project. You can download the jar files and their source files from the Apache website.

POI provides APIs for extracting text content from files that are not plain text, such as Word and Excel files, and it is very convenient to use.

To see how easily POI extracts content from a Word file, we can extract the text of a Word file and get a feel for what the POI APIs do.

Assume that a Word file exists on the local disk.

The file E:\poi\word\jboss3.0 configuration and deployment ejb.doc is in .doc format, with the following content:

[Figure: content of the Word file]

Let's take a look at how simple it is to extract content.

Download the relevant POI jar files from the Apache website.

Create a test class:

    package org.shirdrn.word;

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileNotFoundException;
    import java.io.IOException;

    import org.apache.poi.hwpf.extractor.WordExtractor;

    public class MyWordExtractor {

        public static void main(String[] args) {
            File file = new File("E:\\poi\\word\\jboss3.0 configuration and deployment ejb.doc");
            // try-with-resources closes the stream even if extraction fails
            try (FileInputStream fis = new FileInputStream(file)) {
                WordExtractor wordExtractor = new WordExtractor(fis);
                System.out.println("[The content of the Word file extracted using the getText() method is as follows:]");
                System.out.println(wordExtractor.getText());
            } catch (FileNotFoundException e) {
                e.printStackTrace();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

This extracts the text of the Word file and prints it to the console, as shown below:

[Figure: console output]

Use the getTextFromPieces() method of the WordExtractor class to extract:

    wordExtractor.getTextFromPieces();

The result is the same as above.

The WordExtractor class also provides the getParagraphText() method, which extracts the paragraphs of a Word file and returns a String[] array. Each element in the array is the text content of one paragraph.

Here, a line break in the Word file is also counted as a paragraph. The test is as follows:

    package org.shirdrn.word;

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileNotFoundException;
    import java.io.IOException;

    import org.apache.poi.hwpf.extractor.WordExtractor;

    public class MyWordExtractor {

        public static void main(String[] args) {
            File file = new File("E:\\poi\\word\\jboss3.0 configuration and deployment ejb.doc");
            // try-with-resources closes the stream even if extraction fails
            try (FileInputStream fis = new FileInputStream(file)) {
                WordExtractor wordExtractor = new WordExtractor(fis);
                System.out.println("[The content of the Word file extracted using the getParagraphText() method is as follows:]");
                String[] paragraph = wordExtractor.getParagraphText();
                System.out.println("The Word file has " + paragraph.length + " paragraphs in total.");
                for (int i = 0; i < paragraph.length; i++) {
                    System.out.println("<Paragraph " + (i + 1) + " content>");
                    System.out.println(paragraph[i]);
                }
            } catch (FileNotFoundException e) {
                e.printStackTrace();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

This extracts the text of the Word file paragraph by paragraph and prints it to the console, as shown below:

[Figure: console output]

From the output, we can see that the last line of the Word file is a line break. When WordExtractor extracts that line break, it also converts it into a paragraph by default, because a line break is expected after the end of each paragraph.

If multiple Word files are stored in different directories and their text content needs to be extracted, a recursive function can traverse the directories depth-first and extract each Word file in turn.
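
The recursive traversal could be sketched as follows. This sketch uses only java.io.File and does not call POI. The class name WordFileFinder and its methods are hypothetical, and matching on the ".doc" suffix is an assumption about how Word files are identified. Each returned file could then be passed to WordExtractor as in the test class above.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: collects .doc files under a directory tree, depth-first.
final class WordFileFinder {

    static List<File> findWordFiles(File dir) {
        List<File> found = new ArrayList<>();
        collect(dir, found);
        return found;
    }

    private static void collect(File dir, List<File> found) {
        File[] children = dir.listFiles();
        if (children == null) {
            return; // not a directory, or it could not be read
        }
        for (File child : children) {
            if (child.isDirectory()) {
                collect(child, found); // recurse into the subdirectory
            } else if (child.getName().toLowerCase(Locale.ROOT).endsWith(".doc")) {
                found.add(child); // assumption: the .doc suffix marks a Word file
            }
        }
    }
}
```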

If necessary, you can write the extracted text of the Word file to the local disk, for example, by saving it as a .txt file that can be opened in Notepad.
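
Saving the extracted text might look like the sketch below. It uses only java.nio.file. The class name TextSaver, the method saveAsTxt, the choice to place the .txt file next to the source file, and the UTF-8 encoding are all assumptions made for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: stores extracted text beside the source Word file.
final class TextSaver {

    /** Writes text to a sibling file with a .txt extension and returns its path. */
    static Path saveAsTxt(Path wordFile, String text) throws IOException {
        String name = wordFile.getFileName().toString();
        int dot = name.lastIndexOf('.');
        String base = (dot > 0) ? name.substring(0, dot) : name;
        Path target = wordFile.resolveSibling(base + ".txt");
        // Assumption: UTF-8 is an acceptable encoding for the output file.
        Files.write(target, text.getBytes(StandardCharsets.UTF_8));
        return target;
    }
}
```

The text argument would be the string returned by getText() in the test class above.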

As the examples above show, extracting the text content of a Word file amounts to stripping the file's formatting and keeping only the text.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.