For handling Word documents (.doc, .docx) in Python, two projects are worth a look: python-docx and python-docx2txt. Both target the .docx format. python-docx is the more complex of the two and is suited to creating documents, while python-docx2txt makes it easy to convert documents to plain text:
https://python-docx.readthedocs.org/en/latest/
https://github.com/python-openxml/python-docx
In addition, a .docx file is itself a ZIP archive, and the actual document content is stored as XML, so you can unpack it with unzip:
# unzip test.docx
Archive:  test.docx
  inflating: _rels/.rels
  inflating: word/settings.xml
  inflating: word/_rels/document.xml.rels
  inflating: word/fontTable.xml
  inflating: word/styles.xml
  inflating: word/document.xml
  inflating: docProps/app.xml
  inflating: docProps/core.xml
  inflating: [Content_Types].xml
# ls
[Content_Types].xml  docProps  _rels  test.docx  word
# ls word
document.xml  fontTable.xml  _rels  settings.xml  styles.xml
# cat word/document.xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"><w:body><w:p><w:pPr><w:pStyle w:val="Heading2"/><w:spacing w:lineRule="auto" w:line="0" w:before="0" w:after="0"/><w:rPr></w:rPr></w:pPr><w:r><w:rPr></w:rPr></w:r></w:p><w:p><w:pPr><w:pStyle w:val="Heading5"/><w:spacing w:lineRule="auto" w:line="0" w:before="0" w:after="0"/><w:rPr><w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman"/><w:b w:val="false"/><w:sz w:val="24"/><w:szCs w:val="24"/></w:rPr></w:pPr><w:r><w:rPr><w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman"/><w:b w:val="false"/><w:sz w:val="24"/><w:szCs w:val="24"/></w:rPr><w:t>summary:02</w:t></w:r><w:r><w:rPr><w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman"/><w:b w:val="false"/><w:sz w:val="24"/><w:szCs w:val="24"/></w:rPr><w:t>system basic function</w:t></w:r><w:r><w:rPr><w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman"/><w:b w:val="false"/><w:sz w:val="24"/><w:szCs w:val="24"/></w:rPr><w:t>-01</w:t></w:r><w:r><w:rPr><w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman"/><w:b w:val="false"/><w:sz w:val="24"/><w:szCs w:val="24"/></w:rPr><w:t>System core Functions</w:t></w:r><w:r>
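The archive listing shown by unzip can also be produced from Python. The following is a minimal sketch using only the standard zipfile module. The helper names list_docx_parts and make_sample_archive are made up for illustration, and the tiny in-memory archive merely stands in for a real test.docx (it is not a valid Word document):

```python
import io
import zipfile


def list_docx_parts(source):
    """Return the member names of a .docx archive (path or file-like object)."""
    with zipfile.ZipFile(source) as archive:
        return archive.namelist()


def make_sample_archive():
    """Build a tiny stand-in archive in memory (hypothetical, not a full Word file)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("[Content_Types].xml", "<Types/>")
        archive.writestr("_rels/.rels", "<Relationships/>")
        archive.writestr("word/document.xml", "<document/>")
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    # Replace make_sample_archive() with 'test.docx' to inspect a real file
    for name in list_docx_parts(make_sample_archive()):
        print(name)
```

Passing a file name such as 'test.docx' instead of the in-memory buffer should list the same parts that unzip reports.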
You can also read the XML directly with Python's standard zipfile module, without relying on a ready-made library:
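import zipfile
from xml.dom import minidom

# A .docx file is a ZIP archive; the main body lives in word/document.xml
with zipfile.ZipFile('test.docx') as document:
    xml_content = document.read('word/document.xml')

# Parse the raw bytes and print an indented version of the XML
reparsed = minidom.parseString(xml_content)
print(reparsed.toprettyxml(indent='  '))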
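Once the XML is readable, pulling out plain text is a matter of collecting the w:t elements inside each w:p paragraph. The sketch below shows one way to do that with the standard library only. It is not how python-docx or python-docx2txt are implemented; the names docx_paragraphs and make_sample_docx and the sample content are assumptions for illustration:

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, as declared on <w:document>
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraphs(source):
    """Return the text of each <w:p> paragraph in word/document.xml."""
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for p in root.iter("{%s}p" % W_NS):
        # A paragraph's visible text is split across <w:t> elements in its runs
        paragraphs.append("".join(t.text or "" for t in p.iter("{%s}t" % W_NS)))
    return paragraphs


def make_sample_docx():
    """Build a stand-in archive holding only word/document.xml (hypothetical data)."""
    xml = (
        '<w:document xmlns:w="%s"><w:body>'
        "<w:p><w:r><w:t>Hello, </w:t></w:r><w:r><w:t>Word</w:t></w:r></w:p>"
        "<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>"
        "</w:body></w:document>" % W_NS
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    # Replace make_sample_docx() with 'test.docx' to read a real file
    for text in docx_paragraphs(make_sample_docx()):
        print(text)
```

This sketch only looks at text runs in the main document part; tabs, line breaks, headers, footers and footnotes are ignored, so for anything beyond a quick dump one of the libraries above is likely the better choice.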