Install the docx module first, either with pip (pip install python-docx; the package is imported as docx) or by downloading the installer from the official python-docx site.
Here is how to parse a docx document. The test document has the following format:
It consists of four parts: 1. body text; 2. one table; 3. one inserted file object; 4. one picture. These four parts are the most common kinds of content in a docx document. The parsing code is as follows:
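import docx

def docx_try():
    # Open the document from its path
    doc = docx.Document(r'E:\py_prj\test.docx')
    # Print the text of every paragraph in the body
    for p in doc.paragraphs:
        print(p.text)
    # Walk every table row by row, cell by cell
    for t in doc.tables:
        for r in t.rows:
            for c in r.cells:
                print(c.text)

docx_try()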
E:\python2.7.11\python.exe e:/py_prj/test3.py
Test document
Name
Role
Python
Parsing data
C language
Invoking the underlying interface
Html
Web page data
First, docx.Document opens the file at the given path. The structure of a docx file is fairly complex and has three layers: 1. the Document object represents the entire document; 2. the Document contains a list of Paragraph objects, each representing one paragraph of the document; 3. each Paragraph contains a list of Run objects. Printing p.text for every paragraph therefore prints all the body text. doc.tables is then used to traverse all the tables, and for each table we traverse the rows and the cells in each row to get all the content.
However, the output does not contain the file object (text.txt) or the picture that we inserted. How can this part be parsed? First we need to understand the format of a docx document:
The docx format was introduced with Microsoft Office 2007, which replaced the earlier proprietary default file formats with new XML-based compressed file formats. The new formats add the letter "x" to the traditional file extension (that is, ".docx" instead of ".doc", ".xlsx" instead of ".xls", and ".pptx" instead of ".ppt").
A docx file is essentially a zip file. After its extension is changed to .zip, it can be opened or extracted with any unzip tool. In fact, the basic file format of Word 2007 is ZIP, which serves as the container for the docx content.
The main content of a docx file is stored as XML, but the XML is not stored directly on disk. It is packed into a zip archive that is then given the .docx extension. If you change the extension of a .docx file to .zip and extract it, the resulting folder contains a word folder that holds most of the document's content. The document.xml file in it contains the main text of the document.
From the above we can see that a docx document is really a package of XML documents, so we can get at all of its parts by unzipping it. Let's try it manually first:
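The three layers can be walked explicitly. The following is a minimal sketch for Python 3, not part of the original code: iter_runs is a hypothetical helper, and the stand-in objects only mimic the paragraphs, runs and text attributes of python-docx so that the sketch runs without a .docx file.

```python
from types import SimpleNamespace

def iter_runs(document):
    """Yield (paragraph_index, run_text) for every run in the document body.

    `document` is expected to look like a python-docx Document:
    document.paragraphs -> paragraph.runs -> run.text
    """
    for index, paragraph in enumerate(document.paragraphs):
        for run in paragraph.runs:
            yield index, run.text

# Stand-in objects with hypothetical data, so no .docx file is needed here.
# With python-docx installed, pass docx.Document(path) instead.
sample = SimpleNamespace(paragraphs=[
    SimpleNamespace(runs=[SimpleNamespace(text="Test "),
                          SimpleNamespace(text="document")]),
    SimpleNamespace(runs=[]),  # an empty paragraph has no runs
])
runs = list(iter_runs(sample))
```

A paragraph's text is spread over its runs, which is why a single paragraph can yield several pieces here.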
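Because document.xml is ordinary XML inside a zip archive, the body text can also be read with the standard library alone. This is a rough sketch for Python 3 under stated assumptions: read_paragraphs and make_sample_docx are hypothetical helpers, and the sample package contains only word/document.xml, so it demonstrates the layout but is not a file Word would open.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Main WordprocessingML namespace used by word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def read_paragraphs(docx_file):
    """Return the text of each <w:p> paragraph found in word/document.xml."""
    with zipfile.ZipFile(docx_file) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for p in root.iter(W_NS + "p"):
        # The text of a paragraph is split across <w:t> elements in its runs.
        paragraphs.append("".join(t.text or "" for t in p.iter(W_NS + "t")))
    return paragraphs

def make_sample_docx():
    """Build a tiny in-memory package for the demo (not a complete Word file)."""
    xml = (
        '<w:document xmlns:w="http://schemas.openxmlformats.org/'
        'wordprocessingml/2006/main"><w:body>'
        '<w:p><w:r><w:t>Test </w:t></w:r><w:r><w:t>document</w:t></w:r></w:p>'
        '<w:p><w:r><w:t>Name</w:t></w:r></w:p>'
        '</w:body></w:document>'
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
    buffer.seek(0)
    return buffer

paragraph_texts = read_paragraphs(make_sample_docx())
```

Note that this walk also picks up paragraphs nested inside table cells, since they are <w:p> elements too.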
1. Change the extension of the docx document to .zip
2. Extract the files
The following files are obtained after extraction:
Open the word folder to see the following contents. document.xml is the file that describes the text:
The embeddings folder holds the inserted text object text.txt, stored as a .bin file:
The media folder holds the stored picture:
Since the inserted text and the picture can be extracted by hand, the same can be done in code. The code is as follows:
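The rename in step 1 is only needed for desktop unzip tools. Python's zipfile module does not look at the extension, so the package can be inspected directly. Below is a small sketch for Python 3; list_parts and make_sample_package are hypothetical helpers, and the member image1.png and the fake byte contents are made up for the demo.

```python
import io
import zipfile

def list_parts(docx_file):
    """Return the member names of a .docx package, opened directly as a zip."""
    with zipfile.ZipFile(docx_file) as archive:
        return archive.namelist()

def make_sample_package():
    """Build an in-memory archive laid out like a docx (demo data only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", "<doc/>")
        archive.writestr("word/media/image1.png", b"fake image bytes")
        archive.writestr("word/embeddings/oleObject1.bin", b"fake object bytes")
    buffer.seek(0)
    return buffer

parts = list_parts(make_sample_package())
# Pictures and inserted objects live under these two folders.
inserted = [name for name in parts
            if name.startswith(("word/media/", "word/embeddings/"))]
```

For a real file, list_parts(r'E:\py_prj\test.docx') would work the same way and leaves the original document untouched.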
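import os
import zipfile

os.chdir(r'E:\py_prj')  # first change to the directory that contains the file
os.rename('test.docx', 'test.zip')  # rename it to a zip file
f = zipfile.ZipFile('test.zip', 'r')  # open the archive for extraction
for file in f.namelist():
    f.extract(file)  # extract every member into the current directory
# Read the extracted embedded object as a binary file
data = open(r'E:\py_prj\word\embeddings\oleObject1.bin', 'rb').read()
print(data)
f.close()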
In this way you can extract all the files and pictures inserted in a docx document. For how to write docx files, refer to the official documentation.
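If only the inserted objects and pictures are wanted, the extraction can be limited to the embeddings and media folders and sent to a separate output directory. This is a sketch for Python 3 under assumptions: extract_inserted_parts and build_demo_package are hypothetical helpers, and the demo archive holds fake bytes. Also note that an embedded .bin file is generally an OLE container, so its bytes are usually not the plain text of the inserted file.

```python
import io
import os
import tempfile
import zipfile

INSERTED_PREFIXES = ("word/media/", "word/embeddings/")

def extract_inserted_parts(docx_file, out_dir):
    """Extract only pictures and embedded objects; return the written paths."""
    extracted = []
    with zipfile.ZipFile(docx_file) as archive:
        for info in archive.infolist():
            if info.filename.startswith(INSERTED_PREFIXES) and not info.is_dir():
                extracted.append(archive.extract(info, out_dir))
    return extracted

def build_demo_package():
    """Build an in-memory archive laid out like a docx (demo data only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", "<doc/>")
        archive.writestr("word/media/image1.png", b"fake image bytes")
        archive.writestr("word/embeddings/oleObject1.bin", b"fake object bytes")
    buffer.seek(0)
    return buffer

out_dir = tempfile.mkdtemp()
paths = extract_inserted_parts(build_demo_package(), out_dir)
# Paths relative to the output directory, with forward slashes on any OS
names = sorted(os.path.relpath(p, out_dir).replace(os.sep, "/") for p in paths)
```

document.xml is skipped on purpose, so the output directory ends up holding only the inserted content.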
Python parsing methods for docx documents, and extracting inserted text objects and pictures