Parsing docx documents in Python and extracting inserted file objects and pictures

Source: Internet
Author: User

Install the docx module first, either with pip (pip install python-docx, the package that provides the docx.Document API used below) or by downloading it from the official docx link.

Here is how to parse a docx document. The test document is laid out as follows.

It consists of four parts: 1. body text; 2. one table; 3. one inserted file object; 4. one picture. These four kinds of content are the most common ones in docx documents. The parsing code is as follows

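import docx

def docx_try():
    # Open the document (path taken from the example setup)
    doc = docx.Document(r'E:\py_prj\test.docx')
    # Print the text of every body paragraph
    for p in doc.paragraphs:
        print(p.text)
    # Walk every table row by row, cell by cell
    for t in doc.tables:
        for r in t.rows:
            for c in r.cells:
                print(c.text)

docx_try()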

E:\python2.7.11\python.exe e:/py_prj/test3.py

Test document

Name

Role

Python

Parsing data

C language

Invoking the underlying interface

Html

Web page data

First, docx.Document opens the file at the given path. The structure of a docx file is fairly complex and is divided into three layers: 1. a Document object represents the entire document; 2. the Document contains a list of Paragraph objects, each representing one paragraph of the document; 3. each Paragraph contains a list of Run objects. Iterating over doc.paragraphs and printing p.text therefore prints all of the body text. doc.tables is used to traverse all the tables, and the content of each table is read by traversing its rows and the cells in each row.
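The three-layer structure and the table traversal described above can be sketched as two small helpers. This is only an illustration: iter_runs and table_rows are hypothetical names, and both expect an already opened object such as the one returned by docx.Document, so the block itself does not import docx.

```python
def iter_runs(doc):
    """Yield (paragraph_index, run_index, run_text) for every run in the body."""
    for p_idx, paragraph in enumerate(doc.paragraphs):
        for r_idx, run in enumerate(paragraph.runs):
            yield p_idx, r_idx, run.text


def table_rows(table):
    """Return one table as a list of rows, each row a list of cell texts."""
    return [[cell.text for cell in row.cells] for row in table.rows]
```

With the module installed, list(iter_runs(doc)) shows how each paragraph is split into runs, and table_rows(doc.tables[0]) returns the first table as nested lists, where doc is the object opened in the parsing code above.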

However, the output does not include the file object we inserted (text.txt) or the picture. How can these parts be parsed? First we need to understand the format of a docx document:

The docx format has been in use since Microsoft Office 2007, which replaced the earlier proprietary default file formats with new XML-based compressed formats. The new formats are marked by the letter "x" appended to the traditional file name extension (".docx" instead of ".doc", ".xlsx" instead of ".xls", ".pptx" instead of ".ppt").

A docx file is essentially a ZIP file. Once its extension has been changed to .zip, it can be opened or extracted with any unzip tool. In other words, the basic file format of Word 2007 is a ZIP archive, which acts as a container for the parts of the docx document.
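Because the container is a ZIP archive, Python's standard zipfile module can open a docx file directly, without changing the extension. A minimal sketch, in which list_docx_parts is a hypothetical helper name; it accepts either a path or a file-like object:

```python
import zipfile


def list_docx_parts(docx_file):
    """Return the names of all parts stored inside a docx (ZIP) container."""
    if not zipfile.is_zipfile(docx_file):
        raise ValueError("not a ZIP container, so not a docx file")
    with zipfile.ZipFile(docx_file) as archive:
        return archive.namelist()
```

Calling list_docx_parts(r'E:\py_prj\test.docx') on the test document should list entries under word/, including word/document.xml.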

The main content of a docx file is stored as XML, but the XML files are not stored directly on disk. They are packed into a ZIP archive, which is then given the .docx extension. After changing the extension from .docx to .zip and extracting the archive, you can see a folder named word among the extracted files; it contains most of the content of the Word document. The document.xml file inside it contains the main text of the document.
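To illustrate that word/document.xml carries the body text, the sketch below reads that part straight from the archive and joins the text of each paragraph. It assumes the standard WordprocessingML namespace and that text sits in w:t elements inside w:p paragraphs. read_body_text is a hypothetical name, and the sketch ignores tabs, line breaks and other special elements.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by the w: prefix in word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def read_body_text(docx_file):
    """Return the text of every w:p paragraph found in word/document.xml."""
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for p in root.iter("{%s}p" % W_NS):
        # A paragraph's text is spread over the w:t elements of its runs
        texts = [t.text or "" for t in p.iter("{%s}t" % W_NS)]
        paragraphs.append("".join(texts))
    return paragraphs
```

Unlike doc.paragraphs, this simple traversal also picks up the paragraphs inside table cells, because table cells in document.xml contain w:p elements too.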

From the above we can see that a docx document is really a package of XML files. To get at every part of the document, we can simply unzip it. Let's try it by hand first:

1. Change the extension of the docx document to .zip

2. Extract the files

The following files are obtained after decompression

Open the word folder to see the following contents. document.xml is the file that describes the text content

The embeddings folder holds the file object we inserted, text.txt. It is stored as a .bin file
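The .bin file is not the plain text of text.txt itself. Embedded objects of this kind are typically stored as OLE compound files, which begin with a fixed 8-byte signature. The check below is a minimal sketch under the assumption that oleObject1.bin follows that convention; looks_like_ole is a hypothetical helper, and recovering the original text from the container is beyond this sketch.

```python
# Signature found at the start of an OLE compound file
OLE_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"


def looks_like_ole(data):
    """Return True if the bytes start with the OLE compound file signature."""
    return data[:8] == OLE_MAGIC
```

It can be applied to the bytes read from word\embeddings\oleObject1.bin after extraction.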

The media folder holds the stored picture:

Having located the inserted file and the picture by hand, we can do the same thing in code. The code is as follows.

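import os
import zipfile

os.chdir(r'E:\py_prj')  # first change to the directory that contains the file
os.rename('test.docx', 'test.zip')  # rename to a ZIP file
f = zipfile.ZipFile('test.zip', 'r')  # open the archive for extraction
for file in f.namelist():
    f.extract(file)  # extract every part into the current directory
# open the extracted embedded object and read it as a binary file
file = open(r'E:\py_prj\word\embeddings\oleObject1.bin', 'rb').read()
print(file)
f.close()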


In this way you can extract all the files and images inserted in a docx document. For how to write docx files, refer to the official documentation.
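A variant of this approach avoids renaming the original file and extracts only the inserted objects and pictures. It is a sketch that assumes the folder layout seen in the manual walk-through (word/embeddings/ and word/media/); extract_inserted_files and its parameters are hypothetical names.

```python
import zipfile


def extract_inserted_files(docx_file, out_dir):
    """Extract embedded objects and pictures from a docx; return their paths."""
    wanted = ("word/embeddings/", "word/media/")
    extracted = []
    # ZipFile opens the .docx directly, so no rename to .zip is needed
    with zipfile.ZipFile(docx_file) as archive:
        for name in archive.namelist():
            # Skip directory entries, keep only files in the two folders
            if name.startswith(wanted) and not name.endswith("/"):
                extracted.append(archive.extract(name, out_dir))
    return extracted
```

For example, extract_inserted_files(r'E:\py_prj\test.docx', r'E:\py_prj\out') would write the matching parts under the out folder (a hypothetical location) and return their paths.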
