For the most faithful conversion of Office documents to PDF or HTML, it is best to use Microsoft Office itself on Windows. LibreOffice does not convert perfectly, and WPS has no API.
First confirm whether the COM module is enabled. If phpinfo() shows a com_dotnet section, the module is already enabled. If not, edit php.ini and find this line:
com.allow_dcom = true
Remove the comment character (the semicolon) in front of it and restart the web server. The official PHP site says the COM module was built in before PHP 5.4.5, but in practice that is not always the case: the PHP 5.3.39 build from the official site does not have the COM module built in.
If the module is not built in, add the following line to php.ini, provided the extension file exists in your ext folder:
extension=php_com_dotnet.dll
Then restart again, and it should work.
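Instead of reading through the phpinfo() output, the same check can be made from a script. This is a minimal sketch; com_is_available() is a hypothetical helper name, not something from the original setup.

```php
<?php
// Hypothetical helper: reports whether the COM extension is usable in this PHP build.
function com_is_available()
{
    return extension_loaded('com_dotnet') && class_exists('COM');
}

if (com_is_available()) {
    echo "com_dotnet is loaded\n";
} else {
    // php_ini_loaded_file() shows which php.ini needs to be edited
    echo "com_dotnet is missing; check php.ini (" . php_ini_loaded_file() . ")\n";
}
```

Running this from the web server rather than the command line matters, because the two may load different php.ini files.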
function word2html($wordname, $htmlname)
{
    // Start Word through COM
    $word = new COM("Word.Application") or die("Unable to instantiate Word");
    $word->Visible = 1;
    // Open the source document
    $word->Documents->Open($wordname);
    // Save it as HTML (8 = wdFormatHTML)
    $word->Documents[1]->SaveAs($htmlname, 8);
    // Close Word and release the COM object
    $word->Quit();
    $word = null;
    unset($word);
}
word2html('D:/www/test/6.docx', 'D:/www/test/6.html');
Notes:
1. If you view the source of the converted HTML, it is rather messy.
2. WINWORD.EXE is launched during the conversion.
3. If the page has already been loaded, rename the document and then open it again.
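The first two notes suggest a slightly more defensive variant of the conversion function. The sketch below assumes PHP 5.5 or later (for try/finally), Windows, an installed copy of Word, and the com_dotnet extension. word_convert() is a hypothetical name. The value 10 is Word's wdFormatFilteredHTML constant, which usually produces less cluttered markup than format 8.

```php
<?php
// Sketch only: needs Windows, Microsoft Word and the com_dotnet extension to run.
// word_convert() is a hypothetical name, not part of the original example.
function word_convert($wordname, $outname, $format = 10)
{
    // 10 = wdFormatFilteredHTML, 8 = wdFormatHTML (WdSaveFormat values)
    $word = new COM("Word.Application"); // throws com_exception on failure
    try {
        $word->Visible = 0;               // keep the Word window hidden
        $doc = $word->Documents->Open($wordname);
        $doc->SaveAs($outname, $format);
        $doc->Close(false);               // close without saving changes
    } finally {
        $word->Quit();                    // always shut down WINWORD.EXE
        $word = null;
    }
}

// Example call (commented out so the file loads on machines without Word):
// word_convert('D:/www/test/6.docx', 'D:/www/test/6.html');
```

Calling Quit() in the finally block means a failed Open or SaveAs should not leave a stray WINWORD.EXE process behind. On Word versions that can save PDF, passing 17 (wdFormatPDF) as the third argument should write a PDF instead.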
Another example, for cleaning up the generated HTML:
function lego_clean($text) {
    // Note: eregi_replace() is deprecated as of PHP 5.3 and removed in PHP 7.
    // $text is expected to be an array of lines, e.g. the result of file()
    $text = implode("\r", $text);
    // normalize white space
    $text = eregi_replace("[[:space:]]+", " ", $text);
    $text = str_replace("> <", ">\r\r<", $text);
    $text = str_replace("<br>", "<br>\r", $text);
    // remove everything before <body>
    $text = strstr($text, "<body");
    // keep tags, strip attributes
    $text = eregi_replace("<p [^>]*BodyTextIndent[^>]*>([^\n|\n\015|\015\n]*)</p>", "<p>\\1</p>", $text);
    $text = eregi_replace("<p [^>]*margin-left[^>]*>([^\n|\n\015|\015\n]*)</p>", "<blockquote>\\1</blockquote>", $text);
    // strip non-breaking spaces (the search string is empty in the source; "&nbsp;" is assumed)
    $text = str_replace("&nbsp;", "", $text);
    // clean up whatever is left inside <p> and <li>
    $text = eregi_replace("<p [^>]*>", "<p>", $text);
    $text = eregi_replace("<li [^>]*>", "<li>", $text);
    // kill unwanted tags
    $text = eregi_replace("</?span[^>]*>", "", $text);
    $text = eregi_replace("</?body[^>]*>", "", $text);
    $text = eregi_replace("</?div[^>]*>", "", $text);
    $text = eregi_replace("<\![^>]*>", "", $text);
    $text = eregi_replace("</?[a-z]\:[^>]*>", "", $text);
    // kill style and on mouse* attributes
    $text = eregi_replace("([ \f\r\t\n\'\"])style=[^>]+", "\\1", $text);
    $text = eregi_replace("([ \f\r\t\n\'\"])on[a-z]+=[^>]+", "\\1", $text);
    // remove empty paragraphs
    $text = str_replace("<p></p>", "", $text);
    // remove closing </html> (the call is truncated in the source; "</html>" is assumed)
    $text = str_replace("</html>", "", $text);
    // clean up white space again
    $text = eregi_replace("[[:space:]]+", " ", $text);
    $text = str_replace("> <", ">\r\r<", $text);
    $text = str_replace("<br>", "<br>\r", $text);
    // hand the cleaned markup back to the caller
    return $text;
}
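The ereg family of functions was removed in PHP 7, so the cleanup above only runs on PHP 5.x. The following is a rough sketch of how part of the same idea might look with preg_replace(). It is not a line-by-line port: clean_word_html() is a hypothetical name, and the rule set is an assumption that covers only the main steps (wrapper tags, namespaced tags, attributes on <p> and <li>, non-breaking spaces, empty paragraphs).

```php
<?php
// Hypothetical PCRE-based cleanup for PHP 7+; takes and returns a string.
function clean_word_html($html)
{
    // Drop everything before <body>, if a <body> tag is present.
    $pos = stripos($html, '<body');
    if ($pos !== false) {
        $html = substr($html, $pos);
    }

    // Patterns are applied in order, each one to the result of the previous.
    $rules = array(
        '~<!--.*?-->~s'                        => '',     // HTML comments
        '~<![^>]*>~'                           => '',     // leftover <!...> declarations
        '~</?(?:span|div|body|html)\b[^>]*>~i' => '',     // unwanted wrapper tags
        '~</?[a-z]+:[^>]*>~i'                  => '',     // namespaced tags such as <o:p>
        '~<(p|li)\s[^>]*>~i'                   => '<$1>', // strip attributes from <p> and <li>
        '~&nbsp;~'                             => ' ',    // non-breaking spaces
        '~\s+~'                                => ' ',    // collapse white space
        '~<p>\s*</p>~i'                        => '',     // empty paragraphs
    );
    $html = preg_replace(array_keys($rules), array_values($rules), $html);

    return trim($html);
}

echo clean_word_html(
    '<html><head></head><body lang=EN-US><div class=WordSection1>'
    . '<p class=MsoNormal style="margin:0"><span lang=EN-US>Hello<o:p></o:p></span></p>'
    . '<p class=MsoNormal>&nbsp;</p></div></body></html>'
);
// prints: <p>Hello</p>
```

Regular expressions are a blunt tool for HTML. For anything beyond this kind of quick cleanup, a real parser such as DOMDocument is probably the safer choice.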