How to use PHP to parse a Word document and get its pictures

Last Update:2018-07-10 Source: Internet

Author: User

This article introduces how to parse a Word document with PHP and extract the pictures it contains. It may serve as a useful reference for anyone who needs to do the same.

Background

Some time ago I wrote a feature that uses native PHP to read the content of a Word document and import it into a website system. Because the documents contain formulas, pictures, tables, and so on, it was fairly troublesome to write.

Ideas

The general idea is to convert a DOC-format Word document to docx, use a preprocessor to convert the formulas in the document into SWF picture format, convert the Word document to XML, and then convert the contents of the XML into JSON.

Prerequisites

1. Understanding XML fundamentals

XML (Extensible Markup Language) is an important tool for transmitting data over the Internet. It works across platforms and is not tied to any programming language or operating system, which makes it one of the most widely accepted data carriers on the Internet.

XML is a technology for handling structured document information. It makes it easier to pass structured data between servers and gives developers more control over how data is stored and transmitted.

XML is a structured markup language for tagging electronic documents. It can be used to tag data and define data types, and it is a source language that lets users define their own markup languages. It is a subset of the Standard Generalized Markup Language (SGML) and is well suited to transmission over the web.

2. Two ways to store a Word document

Word documents have two storage formats: doc and docx.

doc: the traditional Word format, which stores data in binary

docx: the format introduced with Word 2007, which stores data as XML
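The difference between the two formats is visible in the first bytes of the file. The following is a minimal sketch, not part of the original workflow: the helper name detectWordFormat is hypothetical, and it reports 'docx' for any ZIP archive, since the signature alone cannot tell a docx from other ZIP-based files.

```php
<?php
// Hypothetical helper: guesses the Word storage format from the first bytes of a file.
function detectWordFormat(string $path): string
{
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        throw new RuntimeException("Cannot read $path");
    }
    $head = (string) fread($handle, 8);
    fclose($handle);

    // ZIP archives (and therefore docx files) begin with "PK\x03\x04".
    if (strncmp($head, "PK\x03\x04", 4) === 0) {
        return 'docx';
    }
    // Binary doc files are OLE compound files with this 8-byte signature.
    if ($head === "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1") {
        return 'doc';
    }
    return 'unknown';
}
```

A check like this could be used to decide whether an uploaded file first needs the doc-to-docx conversion step.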

The suffix is clearly docx, so why do we say the format is XML?

Take a test.docx, change its suffix to .zip, and unzip it to get the following directory structure:

So a docx document is actually a compressed (ZIP) file.
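The same check can be done from PHP without renaming the file, because ZipArchive opens a docx directly. This is a minimal sketch; the helper name listDocxEntries is hypothetical.

```php
<?php
// Hypothetical helper: returns the entry names stored inside a docx archive.
function listDocxEntries(string $path): array
{
    $zip = new ZipArchive();
    // open() returns true on success, or an integer error code on failure.
    if ($zip->open($path) !== true) {
        throw new RuntimeException("Cannot open $path as a ZIP archive");
    }
    $names = [];
    for ($i = 0; $i < $zip->numFiles; $i++) {
        $names[] = $zip->getNameIndex($i);
    }
    $zip->close();
    return $names;
}
```

For a typical document, the returned list includes entries such as [Content_Types].xml, word/document.xml and word/_rels/document.xml.rels.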

3. Understanding the DOM and parsing XML with the PHP DOM extension

The DOM provides a standard set of objects for representing HTML and XML documents, and a standard interface for accessing and manipulating them. The XML DOM defines that standard set of objects for XML documents. The PHP DOM extension lets you work with the DOM tree from PHP.

Use the PHP DOM to read an XML document:

test.xml:

<?xml version="1.0" encoding="UTF-8"?>
<teststore>
  <test>
    <name>php Dom Test</name>
    <author>test-one</author>
  </test>
  <test>
    <title>php dom Test 2</title>
    <author>test-two</author>
  </test>
</teststore>

test.php:

<?php
    $doc = new DOMDocument();
    $doc->load("test.xml");
    // Get the element objects
    $book = $doc->getElementsByTagName("test");
    // Output the value of the first one
    echo $book->item(0)->nodeValue;
    echo "<br>----------------<br>";
    $title = $doc->getElementsByTagName("name");
    echo $title->item(0)->nodeValue;
    echo "<br>----------------<br>";
    // Iterate over the contents of all test elements
    foreach ($book as $note) {
        echo $note->nodeValue;
        echo "<br>";
    }

Results:

4. The XML definition format in Word

How is the data in a Word document defined?

We will introduce only one file and one folder:

The file is word/document.xml, which defines the content of the entire Word document.

The folder is word/media, which holds the multimedia content of the document. In other words, all the pictures, audio, and video in the document are stored under this folder.

The overall structure definition in document.xml:

<w:document mc:Ignorable="w14 w15 wp14"
    xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math"
    xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006"
    xmlns:o="urn:schemas-microsoft-com:office:office"
    xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships"
    xmlns:v="urn:schemas-microsoft-com:vml"
    xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
    xmlns:w10="urn:schemas-microsoft-com:office:word"
    xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml"
    xmlns:w15="http://schemas.microsoft.com/office/word/2012/wordml"
    xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml"
    xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"
    xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing"
    xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas"
    xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup"
    xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk"
    xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape"
    xmlns:wpsCustomData="http://www.wps.cn/officeDocument/2013/wpsCustomData">
  <w:body>
    <w:p>
      <w:pPr>
        <w:pStyle w:val="2"/>
        <w:keepNext w:val="0"/>
        <w:keepLines w:val="0"/>
        <w:widowControl/>
        <w:suppressLineNumbers w:val="0"/>
        <w:pBdr>
          <w:top w:color="auto" w:space="0" w:sz="0" w:val="none"/>
          <w:left w:color="auto" w:space="0" w:sz="0" w:val="none"/>
          <w:bottom w:color="auto" w:space="0" w:sz="0" w:val="none"/>
          <w:right w:color="auto" w:space="0" w:sz="0" w:val="none"/>
        </w:pBdr>

Document paragraph content:

<w:p>
  <w:pPr>
    <w:pStyle w:val="2"/>
    <w:keepNext w:val="0"/>
    <w:keepLines w:val="0"/>
    <w:widowControl/>
    <w:suppressLineNumbers w:val="0"/>
    <w:pBdr>
      <w:top w:color="auto" w:space="0" w:sz="0" w:val="none"/>
      <w:left w:color="auto" w:space="0" w:sz="0" w:val="none"/>
      <w:bottom w:color="auto" w:space="0" w:sz="0" w:val="none"/>
      <w:right w:color="auto" w:space="0" w:sz="0" w:val="none"/>
    </w:pBdr>
    <w:shd w:fill="FAFAFA" w:val="clear"/>
    <w:spacing w:after="..." w:afterAutospacing="0" w:before="..." w:beforeAutospacing="0" w:line="378" w:lineRule="atLeast"/>
    <w:ind w:firstLine="0" w:left="0" w:right="0"/>
    <w:rPr>
      <w:rFonts w:ascii="Verdana" w:cs="Verdana" w:hAnsi="Verdana" w:hint="default"/>
      <w:i w:val="0"/>
      <w:caps w:val="0"/>
      <w:color w:val="404040"/>
      <w:spacing w:val="0"/>
      <w:sz w:val="..."/>
      <w:szCs w:val="..."/>
    </w:rPr>
  </w:pPr>
  <w:r>
    <w:rPr>
      <w:rFonts w:ascii="Verdana" w:cs="Verdana" w:hAnsi="Verdana" w:hint="default"/>
      <w:i w:val="0"/>
      <w:caps w:val="0"/>
      <w:color w:val="404040"/>
      <w:spacing w:val="0"/>
      <w:sz w:val="21"/>
      <w:szCs w:val="..."/>
      <w:bdr w:color="auto" w:space="0" w:sz="0" w:val="none"/>
      <w:shd w:fill="FAFAFA" w:val="clear"/>
    </w:rPr>
    <w:t>Author: Test</w:t>
  </w:r>
</w:p>

Picture content definition:

<w:r>
  <w:rPr>
    <w:rFonts w:ascii="Verdana" w:cs="Verdana" w:hAnsi="Verdana" w:hint="default"/>
    <w:i w:val="0"/>
    <w:caps w:val="0"/>
    <w:color w:val="404040"/>
    <w:spacing w:val="0"/>
    <w:sz w:val="..."/>
    <w:szCs w:val="..."/>
    <w:bdr w:color="auto" w:space="0" w:sz="0" w:val="none"/>
    <w:shd w:fill="FAFAFA" w:val="clear"/>
  </w:rPr>
  <w:drawing>
    <wp:inline distB="0" distL="114300" distR="114300" distT="0">
      <wp:extent cx="5543550" cy="5543550"/>
      <wp:effectExtent b="0" l="0" r="0" t="0"/>
      <wp:docPr descr="IMG_256" id="1" name="Picture 1"/>
      <wp:cNvGraphicFramePr>
        <a:graphicFrameLocks noChangeAspect="1" xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main"/>
      </wp:cNvGraphicFramePr>
      <a:graphic xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main">
        <a:graphicData uri="http://schemas.openxmlformats.org/drawingml/2006/picture">
          <pic:pic xmlns:pic="http://schemas.openxmlformats.org/drawingml/2006/picture">
            <pic:nvPicPr>
              <pic:cNvPr descr="IMG_256" id="1" name="Picture 1"/>
              <pic:cNvPicPr>
                <a:picLocks noChangeAspect="1"/>
              </pic:cNvPicPr>
            </pic:nvPicPr>
            <pic:blipFill>
              <a:blip r:embed="rId4"/>
              <a:stretch>
                <a:fillRect/>
              </a:stretch>
            </pic:blipFill>
            <pic:spPr>
              <a:xfrm>
                <a:off x="0" y="0"/>
                <a:ext cx="5543550" cy="5543550"/>
              </a:xfrm>
              <a:prstGeom prst="rect">
                <a:avLst/>
              </a:prstGeom>
              <a:noFill/>
              <a:ln w="9525">
                <a:noFill/>
              </a:ln>
            </pic:spPr>
          </pic:pic>
        </a:graphicData>
      </a:graphic>
    </wp:inline>
  </w:drawing>
</w:r>

Conclusion:

<w:document>    the start of the entire document
  <w:body>    child node of document; the body of the document
    <w:p>    child node of body; one paragraph in the Word document
      <w:r>    child node of the p element; a run defines a stretch of content with the same formatting within the paragraph
        <w:t>    child node of the run element; the text content of the document
        <w:drawing>    child node of the run element; defines a picture
          <wp:inline>    child node of drawing; not studied in this application
            <a:graphic>    defines the picture content
              <pic:blipFill>    descendant node of graphic; defines the index of the picture content
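The structure summarized above can be walked with DOMXPath. The sketch below is an illustration under stated assumptions rather than the code used in this article: the helper name summarizeDocumentXml is hypothetical, and it only visits paragraphs that are direct children of w:body, so paragraphs inside tables are skipped.

```php
<?php
// Hypothetical helper: collects the text and the embedded-picture ids of each
// top-level paragraph from the XML content of word/document.xml.
function summarizeDocumentXml(string $xml): array
{
    $doc = new DOMDocument();
    $doc->loadXML($xml);

    $xpath = new DOMXPath($doc);
    $xpath->registerNamespace('w', 'http://schemas.openxmlformats.org/wordprocessingml/2006/main');
    $xpath->registerNamespace('a', 'http://schemas.openxmlformats.org/drawingml/2006/main');
    $relNs = 'http://schemas.openxmlformats.org/officeDocument/2006/relationships';

    $result = [];
    foreach ($xpath->query('//w:body/w:p') as $p) {
        // Concatenate the text of every run in this paragraph.
        $text = '';
        foreach ($xpath->query('.//w:r/w:t', $p) as $t) {
            $text .= $t->nodeValue;
        }
        // Collect the r:embed value of every picture in this paragraph.
        $ids = [];
        foreach ($xpath->query('.//w:drawing//a:blip', $p) as $blip) {
            $ids[] = $blip->getAttributeNS($relNs, 'embed');
        }
        $result[] = ['text' => $text, 'images' => $ids];
    }
    return $result;
}
```

Registering the namespaces explicitly means the query does not depend on which prefixes a particular document happens to use.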

Specifically, if you use Java, parsing a docx document with XWPF in POI is also XML parsing: it reads all the nodes and turns them into easier-to-use properties exposed through an API. In Java, POI can use this index to get the corresponding picture resource, and this index is the key to locating the picture.

Unfortunately, I am using PHP, so we need to locate the picture by hand using PHP's own interfaces.

Here is my specific idea: get the XML nodes of the docx document through PHP's built-in DOMDocument interface, traverse them to find the node element that holds a picture, and walk down from that picture node to the value of the r:embed index. Because a docx document is a compressed package, traverse it with PHP's built-in ZipArchive interface (in essence, traverse the .zip archive), find the corresponding picture by index, read it as binary data, and build an img tag that displays the image data in base64 format.
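One way to go from an r:embed value to a file in the archive is to look it up in word/_rels/document.xml.rels, where each Relationship element pairs an Id with a Target. The sketch below shows that lookup as an alternative to deriving the file name from the number in the id. It is not the method used in this article; the helper name mapImageRelationships is hypothetical, and it assumes targets are written relative to the word/ folder.

```php
<?php
// Hypothetical helper: maps relationship ids (e.g. "rId4") to the archive paths
// of embedded pictures, given the XML content of word/_rels/document.xml.rels.
function mapImageRelationships(string $relsXml): array
{
    $doc = new DOMDocument();
    $doc->loadXML($relsXml);

    $map = [];
    foreach ($doc->getElementsByTagName('Relationship') as $rel) {
        // Only keep relationships whose type ends in "/image".
        if (substr($rel->getAttribute('Type'), -6) !== '/image') {
            continue;
        }
        // Skip linked pictures; their Target is a URL rather than an archive path.
        if ($rel->getAttribute('TargetMode') === 'External') {
            continue;
        }
        // Assumption: targets are relative to word/, e.g. "media/image1.png".
        $map[$rel->getAttribute('Id')] = 'word/' . $rel->getAttribute('Target');
    }
    return $map;
}
```

With such a map, an id like rId4 resolves to an entry name that can be read from the archive directly, whatever numbering the document uses.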

Convert to XML:

    private $rels_xml;
    private $doc_xml;

    private function readZipPart($filename)
    {
        $zip = new ZipArchive();
        $_xml = 'word/document.xml';
        $_xml_rels = 'word/_rels/document.xml.rels';

        // Read the main document part from the archive
        if (true === $zip->open($filename)) {
            if (($index = $zip->locateName($_xml)) !== false) {
                $xml = $zip->getFromIndex($index);
            }
            $zip->close();
        } else die('non zip file');

        // Read the relationships part from the archive
        if (true === $zip->open($filename)) {
            if (($index = $zip->locateName($_xml_rels)) !== false) {
                $xml_rels = $zip->getFromIndex($index);
            }
            $zip->close();
        } else die('non zip file');

        // Load both parts into DOMDocument objects
        $this->doc_xml = new DOMDocument();
        $this->doc_xml->encoding = mb_detect_encoding($xml);
        $this->doc_xml->preserveWhiteSpace = false;
        $this->doc_xml->formatOutput = true;
        $this->doc_xml->loadXML($xml);
        $this->doc_xml->saveXML();

        $this->rels_xml = new DOMDocument();
        $this->rels_xml->encoding = mb_detect_encoding($xml);
        $this->rels_xml->preserveWhiteSpace = false;
        $this->rels_xml->formatOutput = true;
        $this->rels_xml->loadXML($xml_rels);
        $this->rels_xml->saveXML();
    }

Determine whether the node is a picture node:

    if ($paragraph->name === 'w:drawing') {
        // Only append the picture when neither marker string is found in $ts
        if (strstr($ts, '... ...') == false && strstr($ts, '... Line ...') == false) {
            $t .= $this->analysisDrawing($paragraph);
        }
    }

Get image index:

    private function analysisDrawing(&$drawingXml)
    {
        while ($drawingXml->read()) {
            if ($drawingXml->nodeType == XMLReader::ELEMENT && $drawingXml->name === 'a:blip') {
                $rId = $drawingXml->getAttribute('r:embed');
                // Strip the "rId" prefix to get the numeric index
                $rIdIndex = substr($rId, 3);
                return $this->checkImageFormating($rIdIndex);
            }
        }
    }

Display a picture file from the compressed package:

    private function checkImageFormating($rIdIndex)
    {
        $imgname = 'word/media/image' . ($rIdIndex - 8);
        $zipfileName = __DIR__ . DIRECTORY_SEPARATOR . 'B' . DIRECTORY_SEPARATOR . 'Test.docx';
        $zip = zip_open($zipfileName);
        while ($zip_entry = zip_read($zip)) { // Read the files in the package one by one
            $file_name = zip_entry_name($zip_entry); // Get the file name inside the zip
            if (strstr($file_name, $imgname) != '') {
                // The character after the name must be '.', so that image1 does not also match image10
                $a = mb_substr($file_name, mb_strlen($imgname, 'utf-8'), 1, 'utf-8');
                if ($a != '.') continue;
                if ($enter_zp = zip_entry_open($zip, $zip_entry, "r")) { // Open the file inside the package
                    $ext = pathinfo(zip_entry_name($zip_entry), PATHINFO_EXTENSION); // Get the picture file extension
                    $content = zip_entry_read($zip_entry, zip_entry_filesize($zip_entry)); // Read the binary data of the file
                    // Base64-encode the binary data and return it as an img tag for the page
                    return sprintf('<img src="data:image/%s;base64,%s">', $ext, base64_encode($content));
                }
                zip_entry_close($zip_entry); // Close the open entry in the zip
            }
        }
        zip_close($zip); // Close the zip file
    }
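The procedural zip_open() family is deprecated as of PHP 8.0, so the same step can also be written with ZipArchive. This is a minimal sketch under stated assumptions, not a drop-in replacement for the method above: the helper name docxImageTag is hypothetical, the caller is assumed to already know the entry name (for example word/media/image1.png), and the file extension is used directly as the image subtype, which is a simplification.

```php
<?php
// Hypothetical helper: returns an <img> tag with the picture embedded as base64,
// or null when the entry is not present in the archive.
function docxImageTag(string $docxPath, string $entryName): ?string
{
    $zip = new ZipArchive();
    if ($zip->open($docxPath) !== true) {
        throw new RuntimeException("Cannot open $docxPath as a ZIP archive");
    }
    // getFromName() returns the raw bytes of the entry, or false if it is missing.
    $bytes = $zip->getFromName($entryName);
    $zip->close();
    if ($bytes === false) {
        return null;
    }
    // Simplification: the extension (png, gif, ...) is used as the image subtype.
    $ext = strtolower(pathinfo($entryName, PATHINFO_EXTENSION));
    return sprintf('<img src="data:image/%s;base64,%s">', $ext, base64_encode($bytes));
}
```

Reading the entry by its exact name avoids scanning every file in the archive and the partial-name matching that scanning requires.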

That is all for this article. I hope it helps with your learning. For more related content, see topic.alibabacloud.com.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only.
