Parsing an HTML document tree in PHP has long been awkward. Simple HTML DOM Parser solves this problem well. This PHP class (PHP 5 or later) lets you parse HTML documents and operate on their elements.
The parser handles well-formed HTML and also copes with documents that do not comply with W3C standards. It uses jQuery-like element selectors to locate elements by id, class, and tag, and it can add, delete, and modify nodes in the document tree. Such a powerful HTML DOM parser is not perfect, though: you need to watch memory consumption carefully when using it. Don't worry; later in this article I explain how to avoid excessive memory consumption.
Getting started
After uploading the class file (simple_html_dom.php) and including it in your script, you can load a document in three ways:
Load an HTML document from a URL
Load an HTML document from a string
Load an HTML document from a file
<?php
// Create a DOM instance
$html = new simple_html_dom();
// Load from a URL
$html->load_file('http://www.cnphp.info/php-simple-html-dom-parser-intro.html');
// Load from a string
$html->load('<html><body>Loading an HTML document from a string</body></html>');
// Load from a file
$html->load_file('path/file/test.html');
?>
To load an HTML document from a string, you first have to download it from the network. We recommend fetching the document with cURL and then loading the result into the DOM.
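The cURL approach can be sketched roughly as follows. This is a minimal sketch, not the author's code: the helper name fetch_html is hypothetical, the timeout is an arbitrary choice, and it assumes simple_html_dom.php sits in the include path.

```php
<?php
// Assumption: the library file is reachable through the include path.
require_once 'simple_html_dom.php';

// Hypothetical helper: download a page with cURL.
// Returns the response body as a string, or null if the request failed.
function fetch_html($url)
{
    $ch = curl_init($url);
    curl_setopt_array($ch, array(
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow HTTP redirects
        CURLOPT_TIMEOUT        => 10,   // give up after 10 seconds (arbitrary)
    ));
    $body = curl_exec($ch);
    return $body === false ? null : $body;
}

$source = fetch_html('http://www.cnphp.info/php-simple-html-dom-parser-intro.html');
if ($source !== null) {
    $html = new simple_html_dom();
    $html->load($source);   // parse the downloaded string
    // ... query or modify $html here ...
    $html->clear();         // release the parser's memory when done
}
```

Fetching the page yourself keeps network concerns such as timeouts, redirects, and error handling separate from parsing.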
Finding HTML elements
Use the find function to locate elements in an HTML document. Called without an index, it returns an array of element objects, which you access through the methods and properties of the parser's classes. For example:
<?php
// Find all hyperlink elements in the document
$a = $html->find('a');
// Find the Nth hyperlink in the document (zero-based); returns null if it is not found
$a = $html->find('a', 0);
// Find the p element whose id is main
$main = $html->find('p[id=main]', 0);
// Find all p elements that have an id attribute
$ps = $html->find('p[id]');
// Find all elements that have an id attribute
$ps = $html->find('[id]');
?>
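To make the two return types concrete, here is a small sketch that loops over the array form and checks the indexed form for a missing element. The sample markup and variable names are made up for illustration, and the include path is an assumption.

```php
<?php
// Assumption: the library file is reachable through the include path.
require_once 'simple_html_dom.php';

$html = new simple_html_dom();
// Hypothetical sample document with two links.
$html->load('<html><body><a href="/a.html">First</a> <a href="/b.html">Second</a></body></html>');

// Without an index, find() returns an array that can be iterated directly.
$hrefs = array();
foreach ($html->find('a') as $a) {
    $hrefs[] = $a->href;
}

// With an index, find() returns one element, or null when no such element exists.
$third = $html->find('a', 2);
$missing = ($third === null);

$html->clear();
```

Checking the indexed result for null before reading its properties avoids errors on pages that lack the expected element.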
You can also locate elements with jQuery-like selectors:
<?php
// Find the element whose id is container
$ret = $html->find('#container');
// Find all elements whose class is foo
$ret = $html->find('.foo');
// Find all anchors and images
$ret = $html->find('a, img');
// Find all anchors and images that have a title attribute
$ret = $html->find('a[title], img[title]');
?>
The parser also supports descendant selectors.
<?php
// Find all li items inside ul lists
$ret = $html->find('ul li');
// Find the li items with class selected inside ul lists
$ret = $html->find('ul li.selected');
?>
If you find selectors cumbersome, built-in methods give direct access to an element's parent, children, and siblings.
<?php
// Return the parent element
$e->parent();
// Return an array of child elements
$e->children();
// Return the child element at the given index
$e->children(0);
// Return the first child element
$e->first_child();
// Return the last child element
$e->last_child();
// Return the previous sibling element
$e->prev_sibling();
// Return the next sibling element
$e->next_sibling();
?>
Element attribute operations
Attribute selectors support a few simple, regular-expression-like operators:
[attribute] - selects elements that have the given attribute
[attribute=value] - selects elements whose attribute equals the given value
[attribute!=value] - selects elements whose attribute does not equal the given value
[attribute^=value] - selects elements whose attribute value starts with the given value
[attribute$=value] - selects elements whose attribute value ends with the given value
[attribute*=value] - selects elements whose attribute value contains the given value
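A few of these operators applied to a small list of links, as a sketch. The markup, URLs, and variable names are invented for illustration, and the include path is an assumption.

```php
<?php
// Assumption: the library file is reachable through the include path.
require_once 'simple_html_dom.php';

$html = new simple_html_dom();
// Hypothetical sample markup: three links with different href values.
$html->load(
    '<ul>'
    . '<li><a href="http://example.com/a.pdf" title="Report">A</a></li>'
    . '<li><a href="/b.html">B</a></li>'
    . '<li><a href="http://example.org/c.html">C</a></li>'
    . '</ul>'
);

$withTitle = $html->find('a[title]');          // links that have a title attribute
$absolute  = $html->find('a[href^=http]');     // href starts with "http"
$pdfLinks  = $html->find('a[href$=pdf]');      // href ends with "pdf"
$example   = $html->find('a[href*=example]');  // href contains "example"

$html->clear();
```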
Accessing element attributes
Element attributes are exposed as properties of the element object:
<?php
// Assign the href value of the anchor $a to the variable $link
$link = $a->href;
?>
Or:
<?php
$link = $html->find('a', 0)->href;
?>
Each object has four basic properties:
tag - returns the tag name
innertext - returns the innerHTML
outertext - returns the outerHTML
plaintext - returns the text with the HTML tags removed
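The difference between these properties is easiest to see on a small element. The snippet below is only a sketch: the sample markup is made up, and the include path is an assumption.

```php
<?php
// Assumption: the library file is reachable through the include path.
require_once 'simple_html_dom.php';

$html = new simple_html_dom();
// Hypothetical sample markup: a div wrapping one bold word.
$html->load('<div id="box"><b>Hello</b></div>');

$div = $html->find('div', 0);

$tag   = $div->tag;        // name of the element's tag
$inner = $div->innertext;  // markup inside the element
$outer = $div->outertext;  // the element including its own tags
$plain = $div->plaintext;  // text content with the tags stripped

$html->clear();
```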
Editing elements
Editing an element's attributes works much like reading them:
<?php
// Assign a new value to the href of the anchor $a
$a->href = 'http://www.cnphp.info';
// Remove the href attribute
$a->href = null;
// Check whether the href attribute exists
if (isset($a->href)) {
    // Code
}
?>
The parser has no dedicated methods for adding or deleting elements, but you can use outertext as a workaround:
<?php
// Wrap an element
$e->outertext = '<div class="wrap">' . $e->outertext . '</div>';
// Delete an element
$e->outertext = '';
// Append an element after it
$e->outertext = $e->outertext . '<div>foo</div>';
// Insert an element before it
$e->outertext = '<div>foo</div>' . $e->outertext;
?>
Saving the modified HTML DOM document is also very simple:
<?php
// Dump the DOM tree back into a string
$doc = $html->save();
// Output
echo $doc;
?>
How to avoid excessive memory consumption by the parser
At the beginning of this article I mentioned that Simple HTML DOM Parser can consume too much memory. If a PHP script uses too much memory, the website may stop responding or run into other serious problems. The solution is simple: once you have loaded an HTML document and finished with it, remember to clear the object by calling its clear() method. Don't take the problem too seriously, though. If you load only two or three documents, it makes little difference whether you clean up or not. When you load five or more documents, you really should free the memory each time you finish with one. ^_^
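A cleanup loop along these lines might look as follows. This is a sketch under assumptions: the documents are hypothetical inline strings rather than real pages, and the include path is assumed. The library's documentation attributes the leak to circular references in PHP 5, which is why an explicit clear() is recommended.

```php
<?php
// Assumption: the library file is reachable through the include path.
require_once 'simple_html_dom.php';

// Hypothetical input: in practice these strings would come from files or URLs.
$documents = array(
    '<html><head><title>One</title></head><body></body></html>',
    '<html><head><title>Two</title></head><body></body></html>',
);

$titles = array();
foreach ($documents as $source) {
    $html = new simple_html_dom();
    $html->load($source);

    // Copy out the data you need as plain strings before cleaning up.
    $node = $html->find('title', 0);
    $titles[] = $node ? $node->plaintext : null;

    // Free the parser's internal structures before loading the next document.
    $html->clear();
    unset($html);
}
```

Extracting plain strings first matters because element objects should not be used after their document has been cleared.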