This article describes how to parse a Word document with PHP and extract the pictures it contains. It is shared here as a reference for anyone who needs it.
Background
Some time ago I wrote a feature that uses native PHP to read the content of a Word document and import it into a website system. Because the documents contain formulas, pictures, tables and so on, it was fairly troublesome to write.
Ideas
The general idea is to convert the .doc document to .docx, use a preprocessor to convert the formulas in the document into SWF pictures, extract the Word content as XML, and finally convert the XML content into JSON.
Prerequisites
1. Understanding XML Fundamentals
XML (Extensible Markup Language) is an important tool for transmitting data over the Internet. It works across platforms and is not tied to any programming language or operating system, which makes it one of the most widely accepted carriers of data on the Internet.
XML is a common technology for handling structured document information. It simplifies the transfer of structured data between servers and makes it easier for developers to control how data is stored and transmitted.
XML is a markup language for giving electronic documents structure: it can be used to tag data and define data types, and it is a meta-language that lets users define their own markup languages. It is a subset of the Standard Generalized Markup Language (SGML) and is well suited to transport over the web.
2. Two different ways to store Word documents
Word documents come in two storage formats: doc and docx.
doc: the traditional Word format, which stores data in binary form.
docx: the format introduced with Word 2007, which stores data as XML.
If the extension is .docx, why do we say the data is stored as XML?
Take a test.docx, change the extension to .zip, then unzip it to see the directory structure it contains.
In other words, a docx document is actually a compressed archive, and the XML files live inside it.
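The same inspection can be done from PHP without renaming the file. The following is a minimal sketch, assuming the zip extension is enabled; listDocxEntries is a hypothetical helper name, and the tiny archive built here is only a stand-in for a real test.docx.

```php
<?php
// Hypothetical helper: return the names of all entries stored in a .docx (zip) file.
function listDocxEntries(string $path): array
{
    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        throw new RuntimeException("Cannot open $path as a zip archive");
    }
    $names = [];
    for ($i = 0; $i < $zip->numFiles; $i++) {
        $names[] = $zip->getNameIndex($i);
    }
    $zip->close();
    return $names;
}

// Build a tiny stand-in archive so the sketch runs without a real Word file.
$demoPath = sys_get_temp_dir() . DIRECTORY_SEPARATOR . 'demo_' . uniqid() . '.docx';
$demoZip = new ZipArchive();
$demoZip->open($demoPath, ZipArchive::CREATE | ZipArchive::OVERWRITE);
$demoZip->addFromString('word/document.xml', '<w:document/>');
$demoZip->addFromString('word/media/image1.png', 'fake-bytes');
$demoZip->close();

$entries = listDocxEntries($demoPath);
print_r($entries);
unlink($demoPath);
```

Run against a real docx, the list would normally include word/document.xml and, when the document has pictures, files under word/media.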
3. Understanding the DOM and PHP DOM XML parsing
The DOM provides a standard set of objects for HTML and XML documents, together with a standard interface for accessing and manipulating them. The XML DOM is the standard object model for XML documents. PHP's DOM extension lets you operate on the DOM tree from PHP.
Use the PHP DOM to read an XML document:
test.xml:
<?xml version="1.0" encoding="UTF-8"?>
<teststore>
  <test>
    <name>php Dom Test</name>
    <author>test-one</author>
  </test>
  <test>
    <title>php dom Test 2</title>
    <author>test-two</author>
  </test>
</teststore>
test.php:
<?php
$doc = new DOMDocument();
$doc->load("test.xml");

// Get the <test> elements
$book = $doc->getElementsByTagName("test");
// Output the text content of the first one
echo $book->item(0)->nodeValue;
echo "<br>----------------<br>";

// Get the <name> elements and output the first value
$title = $doc->getElementsByTagName("name");
echo $title->item(0)->nodeValue;
echo "<br>----------------<br>";

// Iterate through the contents of all <test> elements
foreach ($book as $note) {
    echo $note->nodeValue;
    echo "<br>";
}
Results: the script prints the text content of the first <test> element, then the value of the first <name> element, then the text content of every <test> element in turn.
4. The XML definition format in Word
How is the data in a Word document defined?
We will only look at one file and one folder:
The file is word/document.xml, which defines the content of the entire Word document.
The folder is word/media, which holds the document's multimedia content: all the pictures in the document, as well as audio and video, are stored in this folder.
The overall structure defined in document.xml:
<w:document mc:Ignorable="w14 w15 wp14"
    xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math"
    xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006"
    xmlns:o="urn:schemas-microsoft-com:office:office"
    xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships"
    xmlns:v="urn:schemas-microsoft-com:vml"
    xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
    xmlns:w10="urn:schemas-microsoft-com:office:word"
    xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml"
    xmlns:w15="http://schemas.microsoft.com/office/word/2012/wordml"
    xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml"
    xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"
    xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing"
    xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas"
    xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup"
    xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk"
    xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape"
    xmlns:wpsCustomData="http://www.wps.cn/officeDocument/2013/wpsCustomData">
  <w:body>
    <w:p>
      <w:pPr>
        <w:pStyle w:val="2"/>
        <w:keepNext w:val="0"/>
        <w:keepLines w:val="0"/>
        <w:widowControl/>
        <w:suppressLineNumbers w:val="0"/>
        <w:pBdr>
          <w:top w:color="auto" w:space="0" w:sz="0" w:val="none"/>
          <w:left w:color="auto" w:space="0" w:sz="0" w:val="none"/>
          <w:bottom w:color="auto" w:space="0" w:sz="0" w:val="none"/>
          <w:right w:color="auto" w:space="0" w:sz="0" w:val="none"/>
        </w:pBdr>
        ...
Document paragraph content:
<w:p>
  <w:pPr>
    <w:pStyle w:val="2"/>
    <w:keepNext w:val="0"/>
    <w:keepLines w:val="0"/>
    <w:widowControl/>
    <w:suppressLineNumbers w:val="0"/>
    <w:pBdr>
      <w:top w:color="auto" w:space="0" w:sz="0" w:val="none"/>
      <w:left w:color="auto" w:space="0" w:sz="0" w:val="none"/>
      <w:bottom w:color="auto" w:space="0" w:sz="0" w:val="none"/>
      <w:right w:color="auto" w:space="0" w:sz="0" w:val="none"/>
    </w:pBdr>
    <w:shd w:fill="FAFAFA" w:val="clear"/>
    <w:spacing w:after="..." w:afterAutospacing="0" w:before="..." w:beforeAutospacing="0" w:line="378" w:lineRule="atLeast"/>
    <w:ind w:firstLine="0" w:left="0" w:right="0"/>
    <w:rPr>
      <w:rFonts w:ascii="Verdana" w:cs="Verdana" w:hAnsi="Verdana" w:hint="default"/>
      <w:i w:val="0"/>
      <w:caps w:val="0"/>
      <w:color w:val="404040"/>
      <w:spacing w:val="0"/>
      <w:sz w:val="..."/>
      <w:szCs w:val="..."/>
    </w:rPr>
  </w:pPr>
  <w:r>
    <w:rPr>
      <w:rFonts w:ascii="Verdana" w:cs="Verdana" w:hAnsi="Verdana" w:hint="default"/>
      <w:i w:val="0"/>
      <w:caps w:val="0"/>
      <w:color w:val="404040"/>
      <w:spacing w:val="0"/>
      <w:sz w:val="21"/>
      <w:szCs w:val="..."/>
      <w:bdr w:color="auto" w:space="0" w:sz="0" w:val="none"/>
      <w:shd w:fill="FAFAFA" w:val="clear"/>
    </w:rPr>
    <w:t>Author: Test</w:t>
  </w:r>
</w:p>
Picture content definition:
<w:r>
  <w:rPr>
    <w:rFonts w:ascii="Verdana" w:cs="Verdana" w:hAnsi="Verdana" w:hint="default"/>
    <w:i w:val="0"/>
    <w:caps w:val="0"/>
    <w:color w:val="404040"/>
    <w:spacing w:val="0"/>
    <w:sz w:val="..."/>
    <w:szCs w:val="..."/>
    <w:bdr w:color="auto" w:space="0" w:sz="0" w:val="none"/>
    <w:shd w:fill="FAFAFA" w:val="clear"/>
  </w:rPr>
  <w:drawing>
    <wp:inline distB="0" distL="114300" distR="114300" distT="0">
      <wp:extent cx="5543550" cy="5543550"/>
      <wp:effectExtent b="0" l="0" r="0" t="0"/>
      <wp:docPr descr="IMG_256" id="1" name="Picture 1"/>
      <wp:cNvGraphicFramePr>
        <a:graphicFrameLocks noChangeAspect="1" xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main"/>
      </wp:cNvGraphicFramePr>
      <a:graphic xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main">
        <a:graphicData uri="http://schemas.openxmlformats.org/drawingml/2006/picture">
          <pic:pic xmlns:pic="http://schemas.openxmlformats.org/drawingml/2006/picture">
            <pic:nvPicPr>
              <pic:cNvPr descr="IMG_256" id="1" name="Picture 1"/>
              <pic:cNvPicPr>
                <a:picLocks noChangeAspect="1"/>
              </pic:cNvPicPr>
            </pic:nvPicPr>
            <pic:blipFill>
              <a:blip r:embed="rId4"/>
              <a:stretch>
                <a:fillRect/>
              </a:stretch>
            </pic:blipFill>
            <pic:spPr>
              <a:xfrm>
                <a:off x="0" y="0"/>
                <a:ext cx="5543550" cy="5543550"/>
              </a:xfrm>
              <a:prstGeom prst="rect">
                <a:avLst/>
              </a:prstGeom>
              <a:noFill/>
              <a:ln w="9525">
                <a:noFill/>
              </a:ln>
            </pic:spPr>
          </pic:pic>
        </a:graphicData>
      </a:graphic>
    </wp:inline>
  </w:drawing>
</w:r>
Conclusion:
<w:document>: the root element, which marks the beginning of the entire document.
<w:body>: child node of the document; the body of the document.
<w:p>: child node of the body; one paragraph of the Word document.
<w:r>: child node of the p element; a run, which is a stretch of content with the same formatting inside a paragraph.
<w:t>: child node of the run element; the text content of the document.
<w:drawing>: child node of the run element; defines a picture.
<wp:inline>: child node of drawing; not studied here.
<a:graphic>: defines the picture content.
<pic:blipFill>: descendant of graphic; holds the index of the picture content (the r:embed attribute of its <a:blip> child).
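To make the node hierarchy concrete, here is a minimal sketch that walks <w:p> and <w:t> with DOMDocument and emits the paragraph texts as JSON. The extractParagraphs function, the W_NS constant and the inline sample XML are assumptions for illustration, not part of the original code.

```php
<?php
// WordprocessingML namespace bound to the "w" prefix in document.xml.
const W_NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main';

// Hypothetical helper: return the text of each <w:p> as one string.
function extractParagraphs(string $documentXml): array
{
    $doc = new DOMDocument();
    $doc->loadXML($documentXml);
    $paragraphs = [];
    foreach ($doc->getElementsByTagNameNS(W_NS, 'p') as $p) {
        $text = '';
        // A paragraph may contain several runs; concatenate the text of each <w:t>.
        foreach ($p->getElementsByTagNameNS(W_NS, 't') as $t) {
            $text .= $t->nodeValue;
        }
        $paragraphs[] = $text;
    }
    return $paragraphs;
}

// Tiny made-up document body with two paragraphs.
$sampleXml = '<w:document xmlns:w="' . W_NS . '"><w:body>'
    . '<w:p><w:r><w:t>Author: </w:t></w:r><w:r><w:t>Test</w:t></w:r></w:p>'
    . '<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>'
    . '</w:body></w:document>';

$paragraphs = extractParagraphs($sampleXml);
echo json_encode($paragraphs), "\n";
```

Matching by namespace URI rather than by the literal "w:" prefix keeps the lookup working even if a document binds the namespace to a different prefix. This sketch does not treat paragraphs nested inside tables or text boxes specially.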
Specifically, if you use Java, parsing a docx document with POI's XWPF is XML parsing under the hood: it reads all the nodes and exposes them through a more convenient API. In POI you can use this index to fetch the corresponding picture resource, so the index is the key to locating the picture.
Unfortunately, I am using PHP, so we need to implement the picture extraction by hand with PHP's own interfaces.
Here is the specific approach: load the XML of the docx document with PHP's built-in DOMDocument class, traverse the nodes to find the element that holds a picture, and walk down from that picture node to the value of its r:embed index. Because a docx document is a compressed archive, traverse it with PHP's built-in ZipArchive class (in effect, traversing a .zip file), find the corresponding picture by its index, read its binary data, and build an img tag that displays the image as base64-encoded data.
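The link between an r:embed value and a file under word/media is recorded in word/_rels/document.xml.rels, where each Relationship element has an Id and a Target. The following is a minimal sketch of that lookup; resolveImageTarget and the sample relationships XML are hypothetical, and the sketch assumes the Target is a path relative to the word/ folder.

```php
<?php
// Hypothetical helper: map a relationship id such as "rId4" to its archive path.
function resolveImageTarget(string $relsXml, string $rId): ?string
{
    $doc = new DOMDocument();
    $doc->loadXML($relsXml);
    foreach ($doc->getElementsByTagName('Relationship') as $rel) {
        if ($rel->getAttribute('Id') === $rId) {
            // Assumption: the Target is relative to the word/ folder.
            return 'word/' . $rel->getAttribute('Target');
        }
    }
    return null; // no relationship with that id
}

// Made-up relationships part containing a single image relationship.
$sampleRels = '<?xml version="1.0" encoding="UTF-8"?>'
    . '<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">'
    . '<Relationship Id="rId4"'
    . ' Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image"'
    . ' Target="media/image1.png"/>'
    . '</Relationships>';

$target = resolveImageTarget($sampleRels, 'rId4');
echo $target, "\n";
```

Looking the target up this way avoids depending on how relationship ids happen to be numbered in a particular document.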
Load the XML parts:
private $rels_xml;
private $doc_xml;

// Load word/document.xml and word/_rels/document.xml.rels from the docx archive
private function readZipPart($filename)
{
    $zip = new ZipArchive();
    $_xml = 'word/document.xml';
    $_xml_rels = 'word/_rels/document.xml.rels';
    $xml = $xml_rels = false;

    // Open the archive once and read both parts
    if (true === $zip->open($filename)) {
        if (($index = $zip->locateName($_xml)) !== false) {
            $xml = $zip->getFromIndex($index);
        }
        if (($index = $zip->locateName($_xml_rels)) !== false) {
            $xml_rels = $zip->getFromIndex($index);
        }
        $zip->close();
    } else {
        die('non zip file');
    }
    if ($xml === false || $xml_rels === false) {
        die('missing document.xml or document.xml.rels');
    }

    $this->doc_xml = new DOMDocument();
    $this->doc_xml->preserveWhiteSpace = false;
    $this->doc_xml->formatOutput = true;
    $this->doc_xml->loadXML($xml);

    $this->rels_xml = new DOMDocument();
    $this->rels_xml->preserveWhiteSpace = false;
    $this->rels_xml->formatOutput = true;
    $this->rels_xml->loadXML($xml_rels);
}
Check whether the node is a picture node:
if ($paragraph->name === 'w:drawing') {
    // Skip the drawing when the text contains either marker string; otherwise render the picture
    if (strstr($ts, '... ...') === false && strstr($ts, '... Line ...') === false) {
        $t .= $this->analysisDrawing($paragraph);
    }
}
Get the image index:
private function analysisDrawing(&$drawingXml)
{
    // $drawingXml is an XMLReader positioned on the <w:drawing> node
    while ($drawingXml->read()) {
        if ($drawingXml->nodeType == XMLReader::ELEMENT && $drawingXml->name === 'a:blip') {
            $rId = $drawingXml->getAttribute('r:embed');
            // Strip the "rId" prefix to keep only the numeric index
            $rIdIndex = substr($rId, 3);
            return $this->checkImageFormating($rIdIndex);
        }
    }
}
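For readers who want to try the XMLReader traversal on its own, here is a minimal standalone sketch. findEmbedIds and the sample fragment are hypothetical, and like the method above it matches on the literal prefixes a: and r:, which is an assumption about how the document declares its namespaces.

```php
<?php
// Hypothetical helper: collect the r:embed value of every <a:blip> in an XML string.
function findEmbedIds(string $xml): array
{
    $reader = new XMLReader();
    $reader->XML($xml);
    $ids = [];
    while ($reader->read()) {
        if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'a:blip') {
            $ids[] = $reader->getAttribute('r:embed');
        }
    }
    $reader->close();
    return $ids;
}

// Made-up drawing fragment with the namespaces declared so that it parses cleanly.
$sampleDrawing = '<w:drawing'
    . ' xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"'
    . ' xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main"'
    . ' xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships">'
    . '<a:blip r:embed="rId4"/>'
    . '</w:drawing>';

$ids = findEmbedIds($sampleDrawing);
print_r($ids);
```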
Display a picture file from the archive:
private function checkImageFormating($rIdIndex)
{
    // In this document the picture number is the relationship index minus 8
    $imgname = 'word/media/image' . ($rIdIndex - 8);
    $zipfileName = __DIR__ . DIRECTORY_SEPARATOR . 'B' . DIRECTORY_SEPARATOR . 'test.docx';
    $result = null;
    $zip = zip_open($zipfileName);
    while ($zip_entry = zip_read($zip)) { // read the files in the archive in order
        $file_name = zip_entry_name($zip_entry); // name of the file inside the zip
        if (strstr($file_name, $imgname) != '') {
            // For a single-digit number the next character must be "." so that image1 does not match image10
            // (the threshold was lost in the original listing; 10 is assumed here)
            $a = ($rIdIndex - 8 < 10) ? mb_substr($file_name, mb_strlen($imgname, 'utf-8'), 1, 'utf-8') : '';
            if ($rIdIndex - 8 < 10 && $a != '.') continue;
            if (zip_entry_open($zip, $zip_entry, 'r')) { // open the file inside the archive
                $ext = pathinfo($file_name, PATHINFO_EXTENSION); // picture file extension
                $content = zip_entry_read($zip_entry, zip_entry_filesize($zip_entry)); // binary data
                // Embed the binary data in an img tag as base64
                // (the format string was lost in the original listing; a data URI is assumed)
                $result = sprintf('<img src="data:image/%s;base64,%s">', $ext, base64_encode($content));
                zip_entry_close($zip_entry); // close the open entry
                break;
            }
        }
    }
    zip_close($zip); // close the zip file
    return $result;
}
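The procedural zip_open / zip_read functions are deprecated as of PHP 8.0, so the same step may be easier to maintain with ZipArchive. The following is a minimal sketch under that assumption; docxImageToImgTag is a hypothetical helper, and the archive built here is a stand-in for a real docx.

```php
<?php
// Hypothetical helper: read one picture entry from a docx and return an <img> tag.
function docxImageToImgTag(string $docxPath, string $entryName): ?string
{
    $zip = new ZipArchive();
    if ($zip->open($docxPath) !== true) {
        return null;
    }
    $bytes = $zip->getFromName($entryName); // false when the entry is missing
    $zip->close();
    if ($bytes === false) {
        return null;
    }
    $ext = strtolower(pathinfo($entryName, PATHINFO_EXTENSION));
    $subtype = ($ext === 'jpg') ? 'jpeg' : $ext; // the MIME subtype for .jpg is "jpeg"
    return sprintf('<img src="data:image/%s;base64,%s">', $subtype, base64_encode($bytes));
}

// Stand-in archive so the sketch runs without a real Word file.
$demoDocx = sys_get_temp_dir() . DIRECTORY_SEPARATOR . 'demo_' . uniqid() . '.docx';
$builder = new ZipArchive();
$builder->open($demoDocx, ZipArchive::CREATE | ZipArchive::OVERWRITE);
$builder->addFromString('word/media/image1.png', 'fake-bytes');
$builder->close();

$tag = docxImageToImgTag($demoDocx, 'word/media/image1.png');
echo $tag, "\n";
unlink($demoDocx);
```

Formats that browsers do not render, such as EMF or WMF pictures, would still need conversion before the tag is useful.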