The origin of the matter is relatively simple: I needed to write the data from a navigation page into a database. The intuitive approach is to parse the HTML file, and the common method is to match it with PHP regular expressions. But that is hard to develop and maintain, and the resulting code is not very readable.
The navigation page data is laid out in the DOM tree. With JS a few loops would handle it easily, but JS depends on the browser, which makes writing to the database very difficult. In fact, PHP has ready-made classes for creating, deleting, modifying, and querying DOM tree nodes, so here are some notes.
This involves two classes: DOMDocument and DOMXPath.
The idea is fairly clear: DOMDocument turns an HTML file into a DOM tree data structure, an instance of DOMXPath then searches that tree for specific nodes, and from each of those nodes you can traverse the subtree to get the desired results.
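The three steps above (load, query, traverse) can be sketched on a small inline document. This is only an illustration: the markup in $html and the variable names $rows and $cells are made up for the example and do not come from the real navigation page.

```php
<?php
// Hypothetical markup standing in for a navigation page (an assumption).
$html = '<html><body>'
      . '<dl class="fix"><dt>News</dt>'
      . '<dd><a href="#">Site A</a></dd><dd><a href="#">Site B</a></dd></dl>'
      . '</body></html>';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

// Step 1: turn the HTML into a DOM tree.
$dom = new DOMDocument();
$dom->loadHTML($html);

// Step 2: search the tree with an XPath expression.
$xpath = new DOMXPath($dom);
$rows = [];
foreach ($xpath->query('//dl[@class="fix"]') as $dl) {
    // Step 3: traverse the children of each matched node.
    $cells = [];
    foreach ($dl->childNodes as $child) {
        $text = trim($child->textContent);
        if ($text !== '') {
            $cells[] = $text;
        }
    }
    $rows[] = $cells;
}

print_r($rows);
```

Collecting the text into an array rather than echoing it directly is a design choice for this sketch; it would make a later database insert easier to write.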
There is a navigation HTML file "./hao.html" in the current directory.
The goal is to extract the Chinese text of all the <a> tags. The PHP code is as follows:
<?php
// Convert the HTML/XML file into a DOM tree
$dom = new DOMDocument();
$dom->loadHTMLFile("hao.html");

// Create the XPath object before running any query
$xpath = new DOMXPath($dom);

// Example 1: everything with an id
// $elements = $xpath->query("//*[@id]");
// Example 2: node data for a selected id
// $elements = $xpath->query("/html/body/div[@id='yourtagidhere']");
// Example 3: same as above with a wildcard
// $elements = $xpath->query("*/div[@id='yourtagidhere']");

// Get all dl tags whose class is "fix"
$dls = $xpath->query('//dl[@class="fix"]');
foreach ($dls as $dl) {
    // Walk the direct children of each dl and print their text
    $spans = $dl->childNodes;
    foreach ($spans as $span) {
        echo trim($span->textContent) . "\t";
    }
    echo "\n";
}
?>
The output results are as follows:
Note: the default encoding of DOMDocument is Latin-1 (ISO-8859-1), so when handling UTF-8 encoded Chinese, the HTML needs to contain the following meta tag:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
If the tag is put in some other position, or shortened to just <meta content="charset=utf-8">, it is not recognized.
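A minimal sketch of the encoding note, assuming the full http-equiv tag sits inside <head>. The inline markup and the names $meta, $html and $text are made up for the illustration; the source file itself is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical UTF-8 page; the meta tag sits in <head>, before the content.
$meta = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">';
$html = '<html><head>' . $meta . '</head>'
      . '<body><a href="#">导航</a></body></html>';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

// With the meta tag present, the parser decodes the bytes as UTF-8.
$xpath = new DOMXPath($dom);
$text = $xpath->query('//a')->item(0)->textContent;

echo $text, "\n";
```

Removing $meta from $html is a quick way to see the difference on your own setup; the exact garbled result can vary with the libxml version, so none is shown here.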