Python parsing methods for docx documents, and extracting inserted text objects and pictures

Last Update:2017-06-18 Source: Internet

Author: User


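First install the python-docx module (it is imported as docx), either with pip install python-docx or by downloading it from the official python-docx site. Note that pip install docx installs a different, older package that does not provide docx.Document.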

Here is how to parse a docx document. The test document has the following format:

It is made up of four parts: 1. body text; 2. one table; 3. one inserted file object; 4. one picture. These four are the most common kinds of content in docx documents. The parsing code is as follows:

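import docx

def docx_try():
    # Open the document from its path
    doc = docx.Document(r'E:\py_prj\test.docx')
    # Print the text of every body paragraph
    for p in doc.paragraphs:
        print(p.text)
    # Print the text of every cell in every table, row by row
    for t in doc.tables:
        for r in t.rows:
            for c in r.cells:
                print(c.text)

docx_try()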

E:\python2.7.11\python.exe e:/py_prj/test3.py
Test document
Name
Role
Python
Parsing data
C language
Invoking the underlying interface
Html
Web page data

First, docx.Document opens the file at the given path. The object structure of a docx file is fairly complex and has three layers: 1. a Document object represents the entire document; 2. the Document contains a list of Paragraph objects, which represent the paragraphs in the document; 3. each Paragraph object contains a list of Run objects. Printing p.text for every paragraph therefore prints the whole body text. doc.tables is used to traverse all the tables, and traversing the rows and cells of each table gives all of its content.

But the output does not include the file object we inserted (text.txt) or the picture. How do we parse those parts? First we need to understand the format of a docx document:

Docx has been used since Microsoft Office 2007, which replaced the previous proprietary default file format with a new XML-based compressed file format. The new formats add the letter "x" to the traditional file extension (that is, ".docx" instead of ".doc", ".xlsx" instead of ".xls", and ".pptx" instead of ".ppt").

A docx file is essentially a zip file. Once its extension has been changed to .zip, it can be opened or decompressed with any unzip tool. In fact, the basic Word 2007 file format is ZIP, which acts as a container for the parts of the docx file.
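Because the container is an ordinary zip archive, Python's standard zipfile module can open it directly. The following is only a minimal sketch, not the code used in this article: so that it runs without a real file, it builds a tiny stand-in archive in memory. The helper names make_sample_docx and list_parts and the placeholder contents are hypothetical, and the stand-in is not a document that Word could open.

```python
import io
import zipfile

def make_sample_docx():
    """Build a tiny in-memory zip that mimics the layout of a docx package.
    Hypothetical stand-in for illustration; not a valid Word document."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, 'w') as z:
        z.writestr('[Content_Types].xml', '<Types/>')
        z.writestr('word/document.xml', '<document/>')
        z.writestr('word/media/image1.png', b'placeholder image bytes')
        z.writestr('word/embeddings/oleObject1.bin', b'placeholder object bytes')
    buf.seek(0)
    return buf

def list_parts(docx_file):
    """Return the member names of a docx package (a path or a file-like object)."""
    with zipfile.ZipFile(docx_file) as z:
        return z.namelist()

sample = make_sample_docx()
print(zipfile.is_zipfile(sample))  # checks for the zip signature
for name in list_parts(sample):
    print(name)
```

With a real file, list_parts(r'E:\py_prj\test.docx') would be called with the path instead of the in-memory sample.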

The main content of a docx file is saved as XML, but the XML is not stored directly on disk. It is saved inside a zip archive that is then given the .docx extension. After changing the extension from .docx to .zip and unzipping, you can see that the extracted folder contains a folder named word, which holds most of the content of the Word document. The document.xml file in it contains the main text content of the document.
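To illustrate how the body text sits in word/document.xml, here is a hedged sketch that reads that part with the standard library only. It assumes the usual WordprocessingML layout, in which paragraphs are w:p elements, runs are w:r elements, and text is held in w:t elements. The sample XML and the names SAMPLE_XML, make_sample_docx, and paragraph_texts are made up for the example, and the stand-in archive is not a complete Word document.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Main WordprocessingML namespace used in word/document.xml
W = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'

# Hypothetical, heavily simplified document.xml with two paragraphs
SAMPLE_XML = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/'
    'wordprocessingml/2006/main"><w:body>'
    '<w:p><w:r><w:t>Test </w:t></w:r><w:r><w:t>document</w:t></w:r></w:p>'
    '<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>'
    '</w:body></w:document>'
)

def make_sample_docx():
    """Build an in-memory zip holding only word/document.xml (illustration only)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, 'w') as z:
        z.writestr('word/document.xml', SAMPLE_XML)
    buf.seek(0)
    return buf

def paragraph_texts(docx_file):
    """Return one string per w:p element, joining the text of its w:t elements."""
    with zipfile.ZipFile(docx_file) as z:
        root = ET.fromstring(z.read('word/document.xml'))
    texts = []
    for p in root.iter(W + 'p'):
        texts.append(''.join(t.text or '' for t in p.iter(W + 't')))
    return texts

print(paragraph_texts(make_sample_docx()))
# prints ['Test document', 'Second paragraph']
```

Note that root.iter visits every w:p element in the tree, so in a real document the paragraphs inside table cells would be returned as well.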

From the above we can see that a docx document is actually a package of XML documents. So all of its parts can be obtained by unzipping it. Let's try it manually first:

1. Change the extension of the docx document to .zip

2. Extract the files

The following files are obtained after decompression

Open the word folder to see the following contents. document.xml is the file that describes the text content.

The embeddings folder holds the file object we inserted, text.txt. It is stored as a .bin file.

The media folder holds the stored picture:

What we did manually to get the inserted file and the picture can also be done in code. The code is as follows.

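import os
import zipfile

os.chdir(r'E:\py_prj')  # first change to the directory that contains the file
os.rename('test.docx', 'test.zip')  # rename it to a ZIP file
f = zipfile.ZipFile('test.zip', 'r')  # open the archive for extraction
for file in f.namelist():
    f.extract(file)  # extract every member into the current directory
# go to the extracted path and read the embedded object as a binary file
file = open(r'E:\py_prj\word\embeddings\oleObject1.bin', 'rb').read()
print(file)
f.close()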


In this way, you can extract all the files and images inserted in a docx document. For how to write docx files, refer to the introduction in the official documentation.
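As a variation on the code above, the rename step can be skipped, because zipfile opens a .docx file directly, and only the inserted parts need to be extracted. The sketch below is one possible way to do this under those assumptions, not the article's original method. The names extract_inserted_parts and make_sample_docx are hypothetical, and the in-memory sample stands in for a real document so that the example runs on its own.

```python
import io
import os
import tempfile
import zipfile

def extract_inserted_parts(docx_file, out_dir):
    """Extract only the members under word/media/ and word/embeddings/
    into out_dir and return their names."""
    prefixes = ('word/media/', 'word/embeddings/')
    extracted = []
    with zipfile.ZipFile(docx_file) as z:
        for name in z.namelist():
            # skip directory entries, keep pictures and embedded objects
            if name.startswith(prefixes) and not name.endswith('/'):
                z.extract(name, out_dir)
                extracted.append(name)
    return extracted

def make_sample_docx():
    """Build a tiny in-memory stand-in for a docx package (illustration only)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, 'w') as z:
        z.writestr('word/document.xml', '<document/>')
        z.writestr('word/media/image1.png', b'placeholder image bytes')
        z.writestr('word/embeddings/oleObject1.bin', b'placeholder object bytes')
    buf.seek(0)
    return buf

out_dir = tempfile.mkdtemp()  # temporary folder so nothing is overwritten
names = extract_inserted_parts(make_sample_docx(), out_dir)
print(sorted(names))
```

For the document used in this article, the call would be extract_inserted_parts(r'E:\py_prj\test.docx', r'E:\py_prj'), which leaves test.docx itself unchanged.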


This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only.
