Parsing an HTML document tree with PHP has always been a challenge. Simple HTML DOM Parser solves this problem well: you can use this PHP class to parse HTML documents and manipulate the HTML elements in them (requires PHP 5 or later).
<?php
// Include the library (adjust the path to wherever simple_html_dom.php is stored)
require_once 'simple_html_dom.php';
// Create a new DOM instance
$html = new simple_html_dom();
// Load from a URL
$html->load_file('http://www.jb51.net');
// Load from a string
$html->load('<html><body>Demo of loading an HTML document from a string</body></html>');
// Load from a file
$html->load_file('path/file/test.html');
?>
To load an HTML document from a string, you first need to download it from the network. Using cURL to fetch the HTML and then loading the result into the DOM is recommended.
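As a rough sketch of that approach, the snippet below downloads a page with cURL and passes the body to load(). The helper name fetch_html, the timeout values, and the location of simple_html_dom.php are assumptions made for illustration; they are not part of the library.

```php
<?php
// Assumption: simple_html_dom.php sits next to this script.
require_once 'simple_html_dom.php';

/**
 * Hypothetical helper: download a page with cURL.
 * Returns the response body as a string, or null on failure.
 */
function fetch_html($url)
{
    $ch = curl_init($url);
    curl_setopt_array($ch, array(
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow redirects
        CURLOPT_CONNECTTIMEOUT => 5,    // illustrative timeouts, in seconds
        CURLOPT_TIMEOUT        => 15,
    ));
    $body = curl_exec($ch);
    // curl_exec() returns false when the transfer fails
    return ($body === false || $body === '') ? null : $body;
}

$body = fetch_html('http://www.jb51.net');
if ($body === null) {
    echo "Download failed\n";
} else {
    $html = new simple_html_dom();
    $html->load($body);
    echo count($html->find('a')), " links found\n";
    $html->clear();
}
```

Keeping the download in its own function makes it easier to handle network errors before any parsing starts.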
Find HTML elements
You can use the find function to look up elements in an HTML document. It returns an array of element objects. We use the functions of the HTML DOM parsing class to access these objects; here are a few examples:
<?php
// Find all hyperlink elements in the HTML document
$a = $html->find('a');
// Find the (N)th hyperlink in the document (zero-based); returns null if it is not found
$a = $html->find('a', 0);
// Find the DIV element whose id is main
$main = $html->find('div[id=main]', 0);
// Find all DIV elements that have an id attribute
$divs = $html->find('div[id]');
// Find all elements that have an id attribute
$divs = $html->find('[id]');
?>
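To make the two return types concrete, here is a small sketch that runs find against an inline HTML string. The markup and the path to simple_html_dom.php are assumptions chosen for illustration.

```php
<?php
require_once 'simple_html_dom.php'; // assumed path

// str_get_html() is the library's helper for building a DOM from a string
$html = str_get_html('<div id="main"><a href="/one">One</a><a href="/two">Two</a></div>');

// Without an index, find() returns an array, so it can be iterated directly
$hrefs = array();
foreach ($html->find('a') as $a) {
    $hrefs[] = $a->href;
}
echo implode(', ', $hrefs), "\n";

// With an index, find() returns a single element, or null when nothing matches
$third = $html->find('a', 2);
if ($third === null) {
    echo "No third link\n";
}
```

Checking for null before touching the result avoids a fatal error when the page does not contain the element you expect.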
You can also use jQuery-like selectors to locate elements:
<?php
// Find the element with id="container"
$ret = $html->find('#container');
// Find all elements with class="foo"
$ret = $html->find('.foo');
// Find several HTML tags at once
$ret = $html->find('a, img');
// Tag lists can also be combined with attribute selectors
$ret = $html->find('a[title], img[title]');
?>
The parser also supports looking up descendant elements:
<?php
// Find all li items in ul lists
$ret = $html->find('ul li');
// Find the li items with class="selected" in ul lists
$ret = $html->find('ul li.selected');
?>
If selectors feel cumbersome, you can use the built-in functions to locate an element's parent, children, and adjacent elements:
<?php
// Returns the parent element
$e->parent();
// Returns an array of child elements
$e->children();
// Returns the child element at the given index
$e->children(0);
// Returns the first child element
$e->first_child();
// Returns the last child element
$e->last_child();
// Returns the previous adjacent element
$e->prev_sibling();
// Returns the next adjacent element
$e->next_sibling();
?>
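The traversal functions above can be tried on a small list. This is a minimal sketch; the markup, the variable names, and the path to simple_html_dom.php are assumptions for illustration.

```php
<?php
require_once 'simple_html_dom.php'; // assumed path

$html = str_get_html('<ul><li>one</li><li class="selected">two</li><li>three</li></ul>');

// Start from the selected item and walk outwards
$e = $html->find('li.selected', 0);

$parentTag = $e->parent()->tag;             // tag name of the enclosing list
$before    = $e->prev_sibling()->plaintext; // text of the item before it
$after     = $e->next_sibling()->plaintext; // text of the item after it

// From the parent, look at its children
$ul    = $e->parent();
$first = $ul->first_child()->plaintext;
$last  = $ul->last_child()->plaintext;
$count = count($ul->children());

echo "$parentTag: $before | $after | $first .. $last ($count items)\n";
```

Note that prev_sibling() and next_sibling() return null at the ends of the list, so real code should check the result before reading a property from it.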
Element attribute operations
Attribute selectors support simple pattern-matching operators:
[attribute] - selects HTML elements that have the given attribute
[attribute=value] - selects HTML elements whose attribute equals the given value
[attribute!=value] - selects HTML elements whose attribute does not equal the given value
[attribute^=value] - selects HTML elements whose attribute value begins with the given value
[attribute$=value] - selects HTML elements whose attribute value ends with the given value
[attribute*=value] - selects HTML elements whose attribute value contains the given value
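The following sketch exercises several of these operators against a small inline document. The markup (including the example.com URL) and the path to simple_html_dom.php are assumptions made up for illustration.

```php
<?php
require_once 'simple_html_dom.php'; // assumed path

$html = str_get_html(
    '<a href="http://example.com/report.pdf" title="Report">Report</a>'
    . '<a href="/local/page.html">Local</a>'
    . '<img src="logo.png" title="Logo">'
);

$withTitle = $html->find('[title]');        // any element that has a title attribute
$absolute  = $html->find('a[href^=http]');  // href begins with "http"
$pdfs      = $html->find('a[href$=pdf]');   // href ends with "pdf"
$local     = $html->find('a[href*=local]'); // href contains "local"

printf(
    "title: %d, absolute: %d, pdf: %d, local: %d\n",
    count($withTitle),
    count($absolute),
    count($pdfs),
    count($local)
);
```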
Accessing element attributes in the parser
In the DOM, element attributes are exposed as properties of the element object:
<?php
// This example assigns the href value of the anchor $a to the $link variable
$link = $a->href;
?>
Or:
<?php
$link = $html->find('a', 0)->href;
?>
Each element object has four basic properties:
tag - returns the HTML tag name
innertext - returns the inner HTML
outertext - returns the outer HTML
plaintext - returns the text inside the HTML tag
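To see how the four properties differ, the sketch below reads each of them from the same element. The markup and the path to simple_html_dom.php are assumptions for illustration.

```php
<?php
require_once 'simple_html_dom.php'; // assumed path

$html = str_get_html('<div id="main"><b>Hello</b> world</div>');
$e = $html->find('#main', 0);

$tag   = $e->tag;       // name of the tag itself
$inner = $e->innertext; // markup between the opening and closing tags
$outer = $e->outertext; // markup including the element's own tags
$plain = $e->plaintext; // text content with the tags removed

echo "tag:       $tag\n";
echo "innertext: $inner\n";
echo "outertext: $outer\n";
echo "plaintext: $plain\n";
```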
Editing elements in the parser
Editing element attributes works much like reading them:
<?php
// Assign a new href value to the anchor $a
$a->href = 'http://www.jb51.net';
// Remove the href attribute from the anchor
$a->href = null;
// Check whether the anchor has an href attribute
if (isset($a->href)) {
    // Code
}
?>
There is no specific method in the parser to add or remove elements, but you can work around it:
<?php
// Wrap an element
$e->outertext = '<div class="wrap">' . $e->outertext . '</div>';
// Delete an element
$e->outertext = '';
// Append an element after it
$e->outertext = $e->outertext . '<div>foo</div>';
// Insert an element before it
$e->outertext = '<div>foo</div>' . $e->outertext;
?>
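As an illustration of the outertext workaround, this sketch deletes one list item and appends a new one after another, then prints the resulting markup. The markup, the ids, and the path to simple_html_dom.php are assumptions chosen for the example.

```php
<?php
require_once 'simple_html_dom.php'; // assumed path

$html = str_get_html('<ul><li id="keep">keep</li><li id="drop">drop</li></ul>');

// Delete: replace the element's outer HTML with an empty string
$html->find('#drop', 0)->outertext = '';

// Add: append new markup after the element's existing outer HTML
$keep = $html->find('#keep', 0);
$keep->outertext = $keep->outertext . '<li>added</li>';

// Converting the DOM to a string applies the changes
$result = (string) $html;
echo $result, "\n";
```

Because these edits only rewrite the serialized markup, the new item is not available as an element object; in this sketch you would reload the output with str_get_html() to query it.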
Saving the modified HTML DOM document is also very simple:
<?php
// The DOM object converts to its HTML markup when used as a string
$doc = $html;
// Output
echo $doc;
?>
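If you want the result on disk rather than on screen, the library's save() function can return the markup or write it to a file. The sketch below assumes a temporary file name and an inline document, both made up for illustration.

```php
<?php
require_once 'simple_html_dom.php'; // assumed path

$html = str_get_html('<p id="msg">old</p>');
$html->find('#msg', 0)->innertext = 'new';

// save() without arguments returns the modified markup as a string
$asString = $html->save();

// save() with a path also writes the markup to that file
$path = sys_get_temp_dir() . '/simple_html_dom_demo.html'; // hypothetical file name
$html->save($path);

$fromFile = file_get_contents($path);
echo $fromFile, "\n";
```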
How to prevent the parser from consuming too much memory
At the beginning of this article, I mentioned that Simple HTML DOM Parser can consume too much memory. If a PHP script uses too much memory, it can cause serious problems, such as the website no longer responding. The solution is simple: after the parser has loaded an HTML document and you have finished with it, remember to clear the object. Of course, don't take the problem too seriously. If you load only 2 or 3 documents, clearing makes little difference. When you load 5, 10, or more documents, clear the memory after you finish with each one. ^_^
<?php
$html->clear ();
?>
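When processing many documents in a loop, one possible pattern is to copy out the values you need and then clear each DOM before moving on. This is a sketch under assumptions: the inline documents stand in for real pages, and the path to simple_html_dom.php is assumed.

```php
<?php
require_once 'simple_html_dom.php'; // assumed path

// Stand-ins for pages that would normally be downloaded or read from disk
$documents = array(
    '<html><head><title>First</title></head><body></body></html>',
    '<html><head><title>Second</title></head><body></body></html>',
);

$titles = array();
foreach ($documents as $source) {
    $html = str_get_html($source);
    if (!$html) {
        continue; // skip input the parser could not load
    }

    // Copy the needed values into plain strings before clearing
    $title = $html->find('title', 0);
    $titles[] = ($title !== null) ? $title->plaintext : '';

    // Release the parser's internal references, then drop the object
    $html->clear();
    unset($html);
}

echo implode(', ', $titles), "\n";
echo 'Memory in use: ', memory_get_usage(), " bytes\n";
```

Only plain strings are kept in $titles, so nothing holds on to the parsed tree after clear() is called.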