This article covers garbled characters when parsing HTML with PHP's DOM extension.

1. Fixing garbled characters
A garbled-character problem showed up right away. Although I followed the documentation and encoded every string as UTF-8, the output of the following code was garbled:
// The string is UTF-8 encoded; the problem only shows up with non-ASCII text.
$html = '<p>Hi!</p>';
$dom = new DOMDocument();
@$dom->loadHTML($html);
echo $dom->documentElement->nodeValue;
However, if loadHTML is replaced with loadXML:
$html = '<p>Hi!</p>';
$dom = new DOMDocument();
@$dom->loadXML($html);
echo $dom->documentElement->nodeValue;
the output is correct. I later discovered that loadHTML relies on the meta tag declared in the HTML to determine the encoding. If there is no such tag, the input is treated as ISO-8859-1, which garbles UTF-8 text. To solve this, prepend a meta tag declaring the charset to the string:
$meta = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">';
@$dom->loadHTML($meta . $html);
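As a minimal sketch of the fix described above, the snippet below wraps the meta-tag trick in a helper. The function name load_utf8_html and the sample string are assumptions made for illustration, not part of the original code, and libxml_use_internal_errors() is used here in place of the @ operator to keep parser warnings quiet.

```php
<?php
// Hypothetical helper (not from the original article): parse a UTF-8 HTML
// fragment by prepending a charset declaration before calling loadHTML().
function load_utf8_html(string $html): DOMDocument
{
    $meta = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">';
    $dom = new DOMDocument();
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($meta . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $dom;
}

// Sample input with non-ASCII characters, encoded as UTF-8.
$dom = load_utf8_html('<p>你好</p>');
$text = $dom->getElementsByTagName('p')->item(0)->textContent;
echo $text, "\n";
```

The DOM extension returns strings as UTF-8, so once the parser knows the input encoding the text should come back unchanged.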
2. Recursion
HTML/XML has a recursive structure, so a recursive traversal is unavoidable:
function _pretty_html_node($node) {
    // Conditions that end the recursion:
    // 1. XML_TEXT_NODE
    // 2. XML_ELEMENT_NODE
    // 3. no child nodes
    $text = '';
    $child_text = '';
    foreach ($node->childNodes as $n) {
        $child_text .= _pretty_html_node($n);
    }
    // Then handle each tag differently.
    $tag = $node->nodeName;
    switch ($tag) {
        case 'a':
            $href = $node->getAttribute('href');
            $text .= "$child_text";
            ...
    }
    return $text;
}
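To make the traversal concrete, here is a small runnable sketch in the same spirit. It is not the author's pretty_html.php: the function name node_to_text and the choice to render links as "text (href)" are assumptions made purely for illustration.

```php
<?php
// Hypothetical sketch: recursively flatten a DOM subtree to plain text,
// rendering <a> elements as "text (href)".
function node_to_text(DOMNode $node): string
{
    // Base case: a text node contributes its own content.
    if ($node->nodeType === XML_TEXT_NODE) {
        return $node->nodeValue;
    }
    // Anything that is not an element (comments, etc.) is skipped.
    if ($node->nodeType !== XML_ELEMENT_NODE) {
        return '';
    }
    // Recurse into the children first and collect their text.
    $child_text = '';
    foreach ($node->childNodes as $n) {
        $child_text .= node_to_text($n);
    }
    // Then treat each tag differently.
    switch ($node->nodeName) {
        case 'a':
            return $child_text . ' (' . $node->getAttribute('href') . ')';
        default:
            return $child_text;
    }
}

$dom = new DOMDocument();
$dom->loadXML('<p>See <a href="http://example.com/">this <b>page</b></a>.</p>');
$result = node_to_text($dom->documentElement);
echo $result, "\n"; // See this page (http://example.com/).
```

Processing the children before the switch means each tag handler only has to decide how to wrap text that has already been assembled.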
3. Handling escaped characters
For a text node, the nodeValue must be passed through htmlspecialchars() before it is written back out, because entities are decoded when the HTML/XML is read. For example, &gt; is already > in memory.
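A short sketch of this behavior, using a sample string chosen for illustration: the parser decodes the entities on load, and htmlspecialchars() restores them for output.

```php
<?php
$dom = new DOMDocument();
$dom->loadXML('<p>1 &gt; 0 &amp; 0 &lt; 1</p>');

// Entities were decoded while parsing, so the in-memory text is raw.
$raw = $dom->documentElement->nodeValue;

// Escape again before embedding the text in HTML/XML output.
$escaped = htmlspecialchars($raw, ENT_QUOTES, 'UTF-8');

echo $raw, "\n";     // 1 > 0 & 0 < 1
echo $escaped, "\n"; // 1 &gt; 0 &amp; 0 &lt; 1
```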
Download source code: pretty_html.php