Converting HTML pages to Word documents and saving them with PHP

Source: Internet
Author: User
Tags: phpword


This example shows how to convert an HTML page to a Word document and save it using PHP. It is shared here for reference. The details are as follows:

It uses a PHP library called PHPWord.

PHPWord generates a Word file by compressing the XML it produces into a zip package and changing the extension to .docx (or .doc).
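To make the zip-package idea concrete, here is a minimal sketch that writes a .docx by hand with PHP's ZipArchive class, without PHPWord. The function name buildMinimalDocx is hypothetical, and the three-part layout is an assumption about the smallest package Word will accept; PHPWord itself writes additional parts.

```php
<?php
// Sketch only: build a .docx by zipping three XML parts (requires the zip extension).
function buildMinimalDocx(string $path, string $text): bool
{
    $zip = new ZipArchive();
    if ($zip->open($path, ZipArchive::CREATE | ZipArchive::OVERWRITE) !== true) {
        return false;
    }
    $xmlHeader = '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>';

    // Declares the content type of every part in the package.
    $zip->addFromString('[Content_Types].xml', $xmlHeader
        . '<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">'
        . '<Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/>'
        . '<Default Extension="xml" ContentType="application/xml"/>'
        . '<Override PartName="/word/document.xml" ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>'
        . '</Types>');

    // Points the package at its main document part.
    $zip->addFromString('_rels/.rels', $xmlHeader
        . '<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">'
        . '<Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument" Target="word/document.xml"/>'
        . '</Relationships>');

    // The document body: a single paragraph holding the escaped text.
    $zip->addFromString('word/document.xml', $xmlHeader
        . '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
        . '<w:body><w:p><w:r><w:t>' . htmlspecialchars($text, ENT_XML1) . '</w:t></w:r></w:p></w:body>'
        . '</w:document>');

    return $zip->close();
}
```

Calling buildMinimalDocx('hello.docx', 'Hello') should leave a file that unzips to exactly these three XML parts, which is the same kind of package PHPWord assembles for you.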

To use PHPWord, therefore, the zip compression extension (php_zip.dll on Windows) must be enabled in your PHP environment. I wrote a demo.
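Before running the demo it can help to confirm that the required extensions are loaded. The following is a small sketch; the helper name missingExtensions is hypothetical, and checking for gd as well is an assumption based on the image handling the demo performs.

```php
<?php
// Return the names of the required extensions that are not loaded.
function missingExtensions(array $required): array
{
    return array_values(array_filter($required, function ($name) {
        return !extension_loaded($name);
    }));
}

// zip is needed to write the package; gd is used by the demo for images.
$missing = missingExtensions(['zip', 'gd']);
if ($missing) {
    echo 'Missing PHP extensions: ' . implode(', ', $missing) . PHP_EOL;
}
```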

Feature history:

20150507 - extracts <p> tags and <ol> list tags from the HTML
20150508 - added retrieval of images in the document
20150509 - added line spacing and filtering of invalid images
20150514 - added table handling and rewrote the code in an object-oriented style
20150519 - added the GD library to process network images

<?php
require_once 'PHPWord.php';
require_once 'SimpleHtmlDom.class.php';

class Word
{
    private $url;
    private $LinetextArr = array();   // parsed content, in document order
    public $host;                     // scheme://host of the source page
    public $ImgDir;                   // directory that receives downloaded images
    public $CurrentDir;
    public $error = array();          // error array
    public $filename = null;
    public $Allowtag = "p, ol, ul, table";

    /** Statistics **/
    public $DownImg = 0;
    public $expendTime = 0;
    public $HttpRequestTime = 0;
    public $ContentLen = 0;
    public $HttpRequestArr = array();
    public $expendmemory = 0;

    public function __construct($url)
    {
        $startTime = $this->_Time();
        $startMemory = $this->_memory();
        $this->url = $url;
        $UrlArr = parse_url($this->url);
        $this->host = $UrlArr["scheme"] . "://" . $UrlArr['host'];
        $this->CurrentDir = getcwd();
        $this->LinetextArr["table"] = array();
        $html = new simple_html_dom($this->url);
        $this->HttpRequestArr[] = $this->url;
        $this->HttpRequestTime++;
        foreach ($html->find($this->Allowtag) as $key => $value) {
            if ($value->tag == "table") {
                $this->ParseTable($value, 0, $this->LinetextArr["table"]);
            } else {
                $this->AnalysisHtmlDom($value);
            }
            $this->error[] = error_get_last();
        }
        $endTime = $this->_Time();
        $endMemory = $this->_memory();
        $this->expendTime = round(($endTime - $startTime), 2);               // seconds
        $this->expendmemory = round(($endMemory - $startMemory) / 1000, 2);  // KB
        $this->CreateWordDom();
    }

    private function _Time()
    {
        // microtime() returns "msec sec"; the sum is the time in seconds
        return array_sum(explode(" ", microtime()));
    }

    private function _memory()
    {
        return memory_get_usage();
    }

    /**
     * Parse a table in the HTML. Nested tables are taken into account.
     * @param $value HTML DOM node
     * @param $i     traversal level
     * @param $Arr   receives the parsed cells (by reference, so the caller sees them)
     **/
    private function ParseTable($value, $i, &$Arr)
    {
        if ($value->firstChild() && in_array($value->firstChild()->tag, array("table", "tbody", "thead", "tfoot", "tr"))) {
            foreach ($value->children as $k => $v) {
                $this->ParseTable($v, $i++, $Arr);
            }
        } else {
            foreach ($value->children as $k => $v) {
                if ($v->firstChild() && $v->firstChild()->tag != "table") {
                    $Arr[$i][] = array("tag" => $v->tag, "text" => trim($v->plaintext));
                }
                if (!$v->firstChild()) {
                    $Arr[$i][] = array("tag" => $v->tag, "text" => trim($v->plaintext));
                }
            }
        }
    }

    /**
     * Parse the other elements in the HTML
     * @param $value HTML DOM node
     **/
    private function AnalysisHtmlDom($value)
    {
        $tmp = array();
        if ($value->has_child()) {
            foreach ($value->children as $k => $v) {
                $this->AnalysisHtmlDom($v);
            }
        } else {
            if ($value->tag == "a") {
                $tmp = array("tag" => $value->tag, "href" => $value->href, "text" => $value->innertext);
            } else if ($value->tag == "img") {
                $src = $this->unescape($value->src);
                $UrlArr = parse_url($src);
                if (!isset($UrlArr['host'])) {
                    $src = $this->host . $value->src;
                    $UrlArr = parse_url($src);
                }
                // A network image has to be downloaded first
                $src = $this->getImageFromNet($src, $UrlArr);
                if ($src) {
                    $imgsArr = $this->GD($src);
                    $tmp = array("tag" => $value->tag, "src" => $src, "text" => $value->alt, "width" => $imgsArr['width'], "height" => $imgsArr['height']);
                }
            } else {
                $tmp = array("tag" => $value->tag, "text" => strip_tags($value->innertext));
            }
            // Skip images that could not be downloaded
            if (!empty($tmp)) {
                $this->LinetextArr[] = $tmp;
            }
        }
    }

    /**
     * Read the image size with the GD library; if the image is too large, scale it down proportionally
     **/
    private function GD($src)
    {
        list($width, $height, $type, $attr) = getimagesize($src);
        if ($width > 800 || $height > 800) {
            $width = $width / 2;
            $height = $height / 2;
        }
        return array("width" => $width, "height" => $height);
    }

    /**
     * Convert Unicode escapes back to the original characters
     **/
    public function unescape($str)
    {
        $str = rawurldecode($str);
        preg_match_all("/(?:%u.{4})|&#x.{4};|&#\d+;|.+/U", $str, $r);
        $ar = $r[0];
        foreach ($ar as $k => $v) {
            if (substr($v, 0, 2) == "%u") {
                $ar[$k] = iconv("UCS-2BE", "UTF-8", pack("H4", substr($v, -4)));
            } elseif (substr($v, 0, 3) == "&#x") {
                $ar[$k] = iconv("UCS-2BE", "UTF-8", pack("H4", substr($v, 3, -1)));
            } elseif (substr($v, 0, 2) == "&#") {
                $ar[$k] = iconv("UCS-2BE", "UTF-8", pack("n", substr($v, 2, -1)));
            }
        }
        return join("", $ar);
    }

    /**
     * Download an image
     * @param $Src    target resource
     * @param $UrlArr parse_url() array of the target URL
     **/
    private function getImageFromNet($Src, $UrlArr)
    {
        $file = basename($UrlArr['path']);
        $ext = explode('.', $file);
        $this->ImgDir = $this->CurrentDir . "/" . $UrlArr['host'];
        $_supportedImageTypes = array('jpg', 'jpeg', 'gif', 'png', 'bmp', 'tif', 'tiff');
        if (isset($ext[1]) && in_array($ext[1], $_supportedImageTypes)) {
            $file = file_get_contents($Src);
            $this->HttpRequestArr[] = $Src;
            $this->HttpRequestTime++;
            $this->_mkdir(); // create the directory, or record the error
            $imgName = md5($UrlArr['path']) . "." . $ext[1];
            file_put_contents($this->ImgDir . "/" . $imgName, $file);
            $this->DownImg++;
            return $UrlArr['host'] . "/" . $imgName;
        }
        return false;
    }

    /**
     * Create the image directory
     **/
    private function _mkdir()
    {
        if (!is_dir($this->ImgDir)) {
            // The mode must be an octal integer, not a string
            if (!mkdir($this->ImgDir, 0777)) {
                $this->error[] = error_get_last();
            }
        }
    }

    /**
     * Build the Word document
     **/
    private function CreateWordDom()
    {
        $PHPWord = new PHPWord();
        $PHPWord->setDefaultFontName(''); // set the default font name here
        $PHPWord->setDefaultFontSize(11);
        // cellMargin: 80 is an assumed value
        $styleTable = array('borderSize' => 6, 'borderColor' => '000000', 'cellMargin' => 80);
        // New portrait section
        $section = $PHPWord->createSection();
        $section->addText($this->Details(), array(), array('spacing' => 120));
        // Data processing
        foreach ($this->LinetextArr as $key => $lineArr) {
            if (isset($lineArr['tag'])) {
                if ($lineArr['tag'] == "li") {
                    $section->addListItem($lineArr['text'], 0, "", "", array('spacing' => 120));
                } else if ($lineArr['tag'] == "img") {
                    $section->addImage($lineArr['src'], array('width' => $lineArr['width'], 'height' => $lineArr['height'], 'align' => 'center'));
                } else if ($lineArr['tag'] == "p") {
                    $section->addText($lineArr['text'], array(), array('spacing' => 120));
                }
            } else if ($key === "table") {
                $PHPWord->addTableStyle('myOwnTableStyle', $styleTable);
                $table = $section->addTable("myOwnTableStyle");
                foreach ($lineArr as $rowKey => $tr) {
                    $table->addRow();
                    foreach ($tr as $ky => $td) {
                        $table->addCell(2000)->addText($td['text']);
                    }
                }
            }
        }
        $this->downFile($PHPWord);
    }

    public function Details()
    {
        $msg = "Total requests: {$this->HttpRequestTime}, total images downloaded: {$this->DownImg}, elapsed time: about {$this->expendTime} seconds, memory used by the whole program: about {$this->expendmemory} KB.";
        return $msg;
    }

    public function downFile($PHPWord)
    {
        if (empty($this->filename)) {
            $UrlArr = parse_url($this->url);
            $this->filename = $UrlArr['host'] . ".docx";
        }
        // Save the file
        $objWriter = PHPWord_IOFactory::createWriter($PHPWord, 'Word2007');
        $objWriter->save($this->filename);
        header("Pragma: public");
        header("Expires: 0");
        header("Cache-Control: must-revalidate, post-check=0, pre-check=0");
        header("Cache-Control: public");
        header("Content-Description: File Transfer");
        // Output type
        header('Content-type: application/msword');
        // Force the download
        $header = "Content-Disposition: attachment; filename=" . $this->filename . ";";
        header($header);
        @readfile($this->filename);
    }
}
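The GD() method in the listing halves an oversized image once, so a very large image can still exceed 800 pixels afterwards. As a possible alternative, the sketch below scales the dimensions to fit inside a bounding box while keeping the aspect ratio. The function name fitWithin and the default limit of 800 are assumptions for illustration and are not part of the original demo.

```php
<?php
// Scale (width, height) down so that neither side exceeds $max, keeping the aspect ratio.
function fitWithin(int $width, int $height, int $max = 800): array
{
    // Images that already fit are returned unchanged.
    if ($width <= $max && $height <= $max) {
        return ['width' => $width, 'height' => $height];
    }
    // The longer side determines the scale factor.
    $ratio = $max / max($width, $height);
    return [
        'width'  => max(1, (int) round($width * $ratio)),
        'height' => max(1, (int) round($height * $ratio)),
    ];
}
```

The returned array has the same width and height keys that GD() returns, so it could be passed to addImage() in the same way.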

The code above is less about Word generation than about Simplehtmldom, an open-source HTML parser. As mentioned earlier, I have been reading its source code these days, and it pointed me toward two directions for further study:

① Regular expressions

② Sorting out the extension functions

What I gained from reading the source code:

PHP exceptions can be caught, and PHP errors can be captured as well.

error_get_last() // use this function to capture the most recent PHP error on the page.
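A short sketch of both mechanisms follows. The helper name readOrError and the file path are hypothetical; error_clear_last() requires PHP 7 or later.

```php
<?php
// Read a file, reporting failure through error_get_last() instead of a visible warning.
function readOrError(string $path): array
{
    error_clear_last();                 // forget any earlier error
    $data = @file_get_contents($path);  // @ hides the warning, but it is still recorded
    if ($data === false) {
        return ['data' => null, 'error' => error_get_last()];
    }
    return ['data' => $data, 'error' => null];
}

$result = readOrError('/path/that/does/not/exist');
if ($result['error'] !== null) {
    // error_get_last() returns an array with the keys type, message, file and line
    echo 'Captured: ' . $result['error']['message'] . PHP_EOL;
}

// Exceptions, by contrast, are caught with try/catch.
try {
    throw new RuntimeException('something went wrong');
} catch (RuntimeException $e) {
    echo 'Caught: ' . $e->getMessage() . PHP_EOL;
}
```

The demo calls error_get_last() after each parsed element and stores the result in its $error array, which is the same pattern applied in a loop.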
