Using PHP to convert HTML pages to Word and save them

Last Update: 2016-10-20 Source: Internet

Author: User

Tags: phpword

This example shows how to use PHP to convert an HTML page to a Word document and save it. It is shared here for your reference. The details are as follows:

The example uses a PHP library called PHPWord.

PHPWord generates a Word file by packing the XML it builds into a zip archive and saving that archive with a .docx (or .doc) suffix.

To use PHPWord, your PHP environment therefore needs the zip compression extension (php_zip.dll on Windows). I wrote a demo.
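To make the "XML in a zip package" point concrete, the sketch below opens a .docx file with PHP's ZipArchive class and lists the entries inside it. The function name listDocxEntries is made up for this illustration and is not part of PHPWord; the check with extension_loaded('zip') is one way to confirm the extension mentioned above is available.

```php
<?php
/**
 * List the entry names inside a .docx file, which is an ordinary zip archive.
 * listDocxEntries() is a hypothetical helper, not a PHPWord function.
 */
function listDocxEntries($docxPath)
{
    // ZipArchive only exists when the zip extension is loaded.
    if (!extension_loaded('zip')) {
        throw new RuntimeException('The zip extension is not loaded');
    }

    $zip = new ZipArchive();
    if ($zip->open($docxPath) !== true) {
        throw new RuntimeException("Cannot open {$docxPath} as a zip archive");
    }

    $names = array();
    for ($i = 0; $i < $zip->numFiles; $i++) {
        $names[] = $zip->getNameIndex($i);
    }
    $zip->close();

    return $names;
}
```

Calling listDocxEntries() on a file saved by Word or PHPWord should show XML parts such as word/document.xml, which is where the body text lives.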

Function description (change log):

20150507 - extracts the <p> tags and <ol> list tags from the HTML
20150508 - added retrieval of the images in the document
20150509 - added line spacing and filtering of invalid images
20150514 - added table handling and rewrote the code in an object-oriented style
20150519 - added the GD library to process network images
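The 20150519 entry refers to sizing downloaded images. The demo's GD() method simply halves any image wider or taller than 800 pixels; as an alternative sketch (not the author's code), the helper below scales an image so that its longer side fits within a limit while keeping the aspect ratio. The name fitWithin and its default limit of 800 are assumptions chosen for this illustration.

```php
<?php
/**
 * Scale a width/height pair down so that neither side exceeds $max,
 * preserving the aspect ratio. Sizes already within the limit are
 * returned unchanged. fitWithin() is a hypothetical helper.
 */
function fitWithin($width, $height, $max = 800)
{
    // getimagesize() returns false for unreadable images, so callers
    // may end up here with empty values; reject them explicitly.
    if ($width <= 0 || $height <= 0) {
        return null;
    }

    // Never scale up: the factor is capped at 1.
    $scale = min(1, $max / max($width, $height));

    return array(
        'width'  => (int) round($width * $scale),
        'height' => (int) round($height * $scale),
    );
}
```

In the demo, the width and height would come from getimagesize($src), and the result could be passed to addImage() in place of the halved values.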

<?php
require_once 'phpword.php';
require_once 'SimpleHtmlDom.class.php';

class Word
{
    private $url;
    private $host;
    private $ImgDir;
    private $LinetextArr = array();
    public $CurrentDir;
    public $error = array(); // error array
    public $filename = null;
    public $Allowtag = "p,ol,ul,table";

    /** data statistics **/
    public $DownImg = 0;
    public $expendTime = 0;
    public $HttpRequestTime = 0;
    public $ContentLen = 0;
    public $HttpRequestArr = array();
    public $expendmemory = 0;

    public function __construct($url)
    {
        $startTime = $this->_Time();
        $startMemory = $this->_memory();
        $this->url = $url;
        $UrlArr = parse_url($this->url);
        $this->host = $UrlArr["scheme"] . "://" . $UrlArr['host'];
        $this->CurrentDir = getcwd();
        $this->LinetextArr["table"] = array();
        $html = new simple_html_dom($this->url);
        $this->HttpRequestArr[] = $this->url;
        $this->HttpRequestTime++;
        foreach ($html->find($this->Allowtag) as $key => $value) {
            if ($value->tag == "table") {
                $this->ParseTable($value, 0, $this->LinetextArr["table"]);
            } else {
                $this->AnalysisHtmlDom($value);
            }
            $this->error[] = error_get_last();
        }
        $endTime = $this->_Time();
        $endMemory = $this->_memory();
        $this->expendTime = round(($endTime - $startTime), 2); // seconds
        $this->expendmemory = round(($endMemory - $startMemory) / 1000, 2); // KB
        $this->CreateWordDom();
    }

    private function _Time()
    {
        // microtime() returns "msec sec"; the sum is the time in seconds
        return array_sum(explode(" ", microtime()));
    }

    private function _memory()
    {
        return memory_get_usage();
    }

    /**
     * Parse a table in the HTML. Nested tables are taken into account.
     * @param $value HTML DOM node
     * @param $i     traversal level
     * @param $Arr   result array, passed by reference so the rows are kept
     **/
    private function ParseTable($value, $i, &$Arr)
    {
        if ($value->firstChild() && in_array($value->firstChild()->tag, array("table", "tbody", "thead", "tfoot", "tr"))) {
            foreach ($value->children as $k => $v) {
                $this->ParseTable($v, $i++, $Arr);
            }
        } else {
            foreach ($value->children as $k => $v) {
                if ($v->firstChild() && $v->firstChild()->tag != "table") {
                    $Arr[$i][] = array("tag" => $v->tag, "text" => trim($v->plaintext));
                }
                if (!$v->firstChild()) {
                    $Arr[$i][] = array("tag" => $v->tag, "text" => trim($v->plaintext));
                }
            }
        }
    }

    /**
     * Parse the elements in the HTML
     * @param $value HTML DOM node
     **/
    private function AnalysisHtmlDom($value)
    {
        $tmp = array();
        if ($value->has_child()) {
            foreach ($value->children as $k => $v) {
                $this->AnalysisHtmlDom($v);
            }
        } else {
            if ($value->tag == "a") {
                $tmp = array("tag" => $value->tag, "href" => $value->href, "text" => $value->innertext);
            } else if ($value->tag == "img") {
                $src = $this->unescape($value->src);
                $UrlArr = parse_url($src);
                if (!isset($UrlArr['host'])) {
                    $src = $this->host . $value->src;
                    $UrlArr = parse_url($src);
                }
                // a network image has to be downloaded first
                $src = $this->getImageFromNet($src, $UrlArr);
                if ($src) {
                    $imgsArr = $this->GD($src);
                    $tmp = array("tag" => $value->tag, "src" => $src, "text" => $value->alt, "width" => $imgsArr['width'], "height" => $imgsArr['height']);
                }
            } else {
                $tmp = array("tag" => $value->tag, "text" => strip_tags($value->innertext));
            }
            $this->LinetextArr[] = $tmp;
        }
    }

    /**
     * Read the image size with the GD library; if the image is too large,
     * shrink it proportionally
     **/
    private function GD($src)
    {
        list($width, $height, $type, $attr) = getimagesize($src);
        if ($width > 800 || $height > 800) {
            $width = $width / 2;
            $height = $height / 2;
        }
        return array("width" => $width, "height" => $height);
    }

    /**
     * Convert Unicode escape sequences back to the original characters
     **/
    public function unescape($str)
    {
        $str = rawurldecode($str);
        preg_match_all("/(?:%u.{4})|&#x.{4};|&#\d+;|.+/U", $str, $r);
        $ar = $r[0];
        foreach ($ar as $k => $v) {
            if (substr($v, 0, 2) == "%u") {
                $ar[$k] = iconv("UCS-2BE", "UTF-8", pack("H4", substr($v, -4)));
            } elseif (substr($v, 0, 3) == "&#x") {
                $ar[$k] = iconv("UCS-2BE", "UTF-8", pack("H4", substr($v, 3, -1)));
            } elseif (substr($v, 0, 2) == "&#") {
                $ar[$k] = iconv("UCS-2BE", "UTF-8", pack("n", substr($v, 2, -1)));
            }
        }
        return join("", $ar);
    }

    /**
     * Download an image
     * @param $Src    target resource
     * @param $UrlArr parse_url() array of the target URL
     **/
    private function getImageFromNet($Src, $UrlArr)
    {
        $file = basename($UrlArr['path']);
        $ext = explode('.', $file);
        $this->ImgDir = $this->CurrentDir . "/" . $UrlArr['host'];
        $_supportedImageTypes = array('jpg', 'jpeg', 'gif', 'png', 'bmp', 'tif', 'tiff');
        if (isset($ext['1']) && in_array($ext['1'], $_supportedImageTypes)) {
            $file = file_get_contents($Src);
            $this->HttpRequestArr[] = $Src;
            $this->HttpRequestTime++;
            $this->_mkdir(); // create the directory first, otherwise saving fails
            $imgName = md5($UrlArr['path']) . "." . $ext['1'];
            file_put_contents($this->ImgDir . "/" . $imgName, $file);
            $this->DownImg++;
            return $UrlArr['host'] . "/" . $imgName;
        }
        return false;
    }

    /**
     * Create the image directory
     **/
    private function _mkdir()
    {
        if (!is_dir($this->ImgDir)) {
            // the mode must be an octal integer, not the string "7777"
            if (!mkdir($this->ImgDir, 0777)) {
                $this->error[] = error_get_last();
            }
        }
    }

    /**
     * Build the Word document
     **/
    private function CreateWordDom()
    {
        $PHPWord = new PHPWord();
        $PHPWord->setDefaultFontName(' '); // the font name is missing in the source; set the one you need
        $PHPWord->setDefaultFontSize(11);
        // The source has 'cellmargin' => 006699, which is not a valid integer
        // literal; 80 is an assumed value, as is the camelCase spelling of the keys.
        $styleTable = array('borderSize' => 6, 'borderColor' => '000000', 'cellMargin' => 80);

        // New portrait section
        $section = $PHPWord->createSection();
        $section->addText($this->Details(), array(), array('spacing' => 120));

        // data processing
        foreach ($this->LinetextArr as $key => $lineArr) {
            if (isset($lineArr['tag'])) {
                if ($lineArr['tag'] == "li") {
                    $section->addListItem($lineArr['text'], 0, "", "", array('spacing' => 120));
                } else if ($lineArr['tag'] == "img") {
                    $section->addImage($lineArr['src'], array('width' => $lineArr['width'], 'height' => $lineArr['height'], 'align' => 'center'));
                } else if ($lineArr['tag'] == "p") {
                    $section->addText($lineArr['text'], array(), array('spacing' => 120));
                }
            } else if ($key === "table") { // strict comparison: 0 == "table" is true before PHP 8
                $PHPWord->addTableStyle('myOwnTableStyle', $styleTable);
                $table = $section->addTable('myOwnTableStyle');
                foreach ($lineArr as $rowIndex => $tr) {
                    $table->addRow();
                    foreach ($tr as $ky => $td) {
                        $table->addCell(2000)->addText($td['text']);
                    }
                }
            }
        }
        $this->downFile($PHPWord);
    }

    public function Details()
    {
        $msg = "Total requests: {$this->HttpRequestTime}; images downloaded: {$this->DownImg}; download time: about {$this->expendTime} seconds; memory used by the whole program: about {$this->expendmemory} KB.";
        return $msg;
    }

    public function downFile($PHPWord)
    {
        if (empty($this->filename)) {
            $UrlArr = parse_url($this->url);
            $this->filename = $UrlArr['host'] . ".docx";
        }

        // Save the file
        $objWriter = PHPWord_IOFactory::createWriter($PHPWord, 'Word2007');
        $objWriter->save($this->filename);

        header("Pragma: public");
        header("Expires: 0");
        header("Cache-Control: must-revalidate, post-check=0, pre-check=0");
        header("Cache-Control: public");
        header("Content-Description: File Transfer");
        header('Content-type: application/msword'); // output type

        // Force the download
        $header = "Content-Disposition: attachment; filename=" . $this->filename . ";";
        header($header);
        @readfile($this->filename);
    }
}
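Two steps in the image handling above are easy to get wrong: turning a relative src into an absolute URL, and deciding whether a URL points to a supported image type. The sketch below isolates both using parse_url() and pathinfo(). The function names resolveImageUrl and isSupportedImage are hypothetical, and unlike the demo (which takes the second piece of the file name split on dots), this version looks at the last extension and ignores case, so it is a variation rather than a copy of the author's logic.

```php
<?php
/**
 * Return an absolute URL for an <img> src found on $pageUrl.
 * Like the demo, a src without a host is treated as relative to the site root.
 * resolveImageUrl() is a hypothetical helper.
 */
function resolveImageUrl($pageUrl, $src)
{
    $img = parse_url($src);
    if ($img !== false && isset($img['host'])) {
        return $src; // already absolute
    }

    $page = parse_url($pageUrl);
    if ($page === false || !isset($page['scheme'], $page['host'])) {
        return null; // the page URL itself is unusable
    }

    return $page['scheme'] . '://' . $page['host'] . '/' . ltrim($src, '/');
}

/**
 * Check the file extension in the URL path against a list of image types.
 * isSupportedImage() is a hypothetical helper.
 */
function isSupportedImage($url, $types = array('jpg', 'jpeg', 'gif', 'png', 'bmp', 'tif', 'tiff'))
{
    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path)) {
        return false;
    }

    // pathinfo() returns the part after the last dot, e.g. "JPG" for photo.min.JPG
    $ext = strtolower(pathinfo($path, PATHINFO_EXTENSION));

    return in_array($ext, $types, true);
}
```

Using the last extension means a file such as photo.min.jpg is still recognized as an image, whereas splitting on the first dot would see "min" as the extension.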

The code above is less about Word generation than about SimpleHtmlDom, an open-source HTML parser. As mentioned earlier, I have been reading its source code recently, and it pointed me to two directions for further study:

① Regular expressions

② Organizing the extension functions

What I gained from reading the source code:

PHP exceptions can be caught, and PHP errors can be captured as well.

error_get_last() // use this function to capture PHP errors on the page
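As a small illustration of the difference, the sketch below handles both cases in one place: an exception is caught with try/catch, while a plain PHP warning (here from file_get_contents() on a missing file) is read back with error_get_last(). The function name safeRead and the shape of the returned array are assumptions made for this example.

```php
<?php
/**
 * Read a file and report what went wrong, if anything.
 * Exceptions are caught; ordinary PHP errors are fetched with error_get_last().
 * safeRead() is a hypothetical helper.
 */
function safeRead($path)
{
    $result = array('data' => null, 'error' => null, 'exception' => null);

    try {
        if ($path === '') {
            throw new InvalidArgumentException('Empty path');
        }

        // "@" hides the warning from the page; error_get_last() still records it.
        $data = @file_get_contents($path);
        if ($data === false) {
            // Array with the keys type, message, file and line
            $result['error'] = error_get_last();
        } else {
            $result['data'] = $data;
        }
    } catch (Exception $e) {
        $result['exception'] = $e->getMessage();
    }

    return $result;
}
```

The demo class uses the same function in its constructor and in _mkdir(), appending the result of error_get_last() to its $error array.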

This article is an English version of an article originally published in Chinese on aliyun.com.
