Parsing an HTML document tree in PHP has long been awkward. Simple HTML DOM Parser handles the problem well: this PHP class lets you parse HTML documents and manipulate their elements.
Download: https://github.com/samacs/simple_html_dom
The parser not only helps us check HTML documents, it also copes with HTML that does not comply with W3C standards. It uses jQuery-like element selectors to locate elements by id, class, and tag, and it provides functions for adding, deleting, and modifying nodes in the document tree. Of course, even such a powerful HTML DOM parser is not perfect: you need to watch its memory consumption carefully. But don't worry. Later in this article I will show how to avoid excessive memory consumption.
Getting started
After uploading the class file, you can load a document in three ways:
Load html documents from URLs
Load html documents from strings
Load html documents from files
The code is as follows:
<?php
// Load the class file (adjust the path to wherever you uploaded it)
require_once 'simple_html_dom.php';
// Create a DOM instance
$html = new simple_html_dom();
// Load from a URL
$html->load_file('http://www.jb51.net');
// Load from a string
$html->load('Loading html documents from strings');
// Load from a file
$html->load_file('path/file/test.html');
?>
If you load an HTML document from a string, you must first download it from the Internet. We recommend using cURL to fetch the HTML document and then loading it into the DOM.
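As a rough sketch of the cURL approach recommended above, the helper below downloads a page into a string that can then be passed to load(). The function name fetch_html and the timeout value are illustrative assumptions, not part of the parser.

```php
<?php
// Hypothetical helper (not part of Simple HTML DOM): download a document with cURL.
function fetch_html(string $url, int $timeoutSeconds = 10): string
{
    $ch = curl_init($url);
    if ($ch === false) {
        throw new RuntimeException("Could not initialise cURL for $url");
    }
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,             // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,             // follow HTTP redirects
        CURLOPT_CONNECTTIMEOUT => $timeoutSeconds,  // give up if no connection is made
        CURLOPT_TIMEOUT        => $timeoutSeconds,  // give up if the transfer takes too long
    ]);
    $body  = curl_exec($ch);
    $error = curl_error($ch);
    if ($body === false) {
        throw new RuntimeException("cURL request failed: $error");
    }
    return $body;
}
```

With the class file included, the result could then be loaded with something like $html->load(fetch_html('http://www.jb51.net'));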
Search for html elements
You can use the find function to find elements in an HTML document. Called with a selector only, it returns an array of element objects; called with a selector and an index, it returns a single element object. We use the functions of the HTML DOM parsing class to access these objects. The following is an example:
The code is as follows:
<?php
// Find all hyperlink elements in the HTML document
$a = $html->find('a');
// Find the Nth hyperlink in the document (zero-based; here the first). If it is not found, null is returned.
$a = $html->find('a', 0);
// Find the p element whose id is main
$main = $html->find('p[id=main]', 0);
// Find all p elements that have an id attribute
$ps = $html->find('p[id]');
// Find all elements that have an id attribute
$ps = $html->find('[id]');
?>
You can also use jQuery-like selectors to locate elements:
The code is as follows:
<?php
// Find the element whose id is 'iner'
$ret = $html->find('#iner');
// Find all elements with class="foo"
$ret = $html->find('.foo');
// Find several kinds of HTML tags at once
$ret = $html->find('a, img');
// Selectors can also be combined like this
$ret = $html->find('a[title], img[title]');
?>
The parser also supports searching for nested elements.
The code is as follows:
<?php
// Find all li items inside ul lists
$ret = $html->find('ul li');
// Find the li items with class="selected" inside ul lists
$ret = $html->find('ul li.selected');
?>
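Since find() returns an array of element objects, the results are usually consumed in a loop. The sketch below assumes $html is an already loaded simple_html_dom object. The helper name collect_attribute is hypothetical, and the code relies only on find() and the attribute access described in this article.

```php
<?php
// Hypothetical helper: gather one attribute from every element matching a selector.
// $html is assumed to be a loaded simple_html_dom object
// (any object with a compatible find() method would work).
function collect_attribute($html, string $selector, string $attribute): array
{
    $values = [];
    foreach ($html->find($selector) as $element) {
        // isset() is false when the element lacks the attribute, so it is skipped.
        if (isset($element->$attribute)) {
            $values[] = $element->$attribute;
        }
    }
    return $values;
}
```

For example, collect_attribute($html, 'a[title]', 'href') would return the link targets of all anchors that carry a title attribute.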
If you find selectors awkward, you can use the built-in functions to reach an element's parent, children, and siblings.
The code is as follows:
<?php
// Returns the parent element
$e->parent();
// Returns the array of child elements
$e->children();
// Returns the child element at the given index
$e->children(0);
// Returns the first child element
$e->first_child();
// Returns the last child element
$e->last_child();
// Returns the previous sibling element
$e->prev_sibling();
// Returns the next sibling element
$e->next_sibling();
?>
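These functions can be chained into simple tree walks. As a minimal sketch, the hypothetical helper below collects the tag names of every sibling that follows an element. It assumes $e is an element object returned by find() and that next_sibling() returns null once there are no more siblings.

```php
<?php
// Hypothetical helper: list the tag names of all siblings that follow $e.
// Assumes next_sibling() returns null after the last sibling.
function following_sibling_tags($e): array
{
    $tags = [];
    for ($node = $e->next_sibling(); $node !== null; $node = $node->next_sibling()) {
        $tags[] = $node->tag;
    }
    return $tags;
}
```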
Element attribute operations
Attribute selectors support a few simple pattern-matching operators:
[attribute] - selects HTML elements that have the specified attribute
[attribute=value] - selects HTML elements whose attribute equals the specified value
[attribute!=value] - selects HTML elements whose attribute does not equal the specified value
[attribute^=value] - selects HTML elements whose attribute starts with the specified value
[attribute$=value] - selects HTML elements whose attribute ends with the specified value
[attribute*=value] - selects HTML elements whose attribute contains the specified value
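To make the meaning of each operator concrete, the following sketch reimplements the comparisons with plain PHP string functions. It illustrates the semantics only and is not the parser's own code. The name attribute_matches is hypothetical, and treating a missing attribute as "no match" for every operator is an assumption of this sketch.

```php
<?php
// Illustration only: what each attribute operator tests, using plain string functions.
// $actual is the attribute's value, or null when the element lacks the attribute.
function attribute_matches(?string $actual, string $operator, string $expected = ''): bool
{
    if ($actual === null) {
        return false; // assumption: a missing attribute never matches
    }
    switch ($operator) {
        case '':   // [attribute]
            return true;
        case '=':  // [attribute=value]
            return $actual === $expected;
        case '!=': // [attribute!=value]
            return $actual !== $expected;
        case '^=': // [attribute^=value]: starts with
            return strncmp($actual, $expected, strlen($expected)) === 0;
        case '$=': // [attribute$=value]: ends with
            return $expected === '' || substr($actual, -strlen($expected)) === $expected;
        case '*=': // [attribute*=value]: contains
            return $expected === '' || strpos($actual, $expected) !== false;
        default:
            throw new InvalidArgumentException("Unknown operator: $operator");
    }
}
```

Under these rules a selector such as a[href^=http] matches an anchor whose href is 'http://www.jb51.net', while img[src$=.png] matches only images whose src ends in '.png'.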
Reading element attributes in the parser
Element attributes in the DOM are exposed as object properties:
The code is as follows:
<?php
// In this example, the href of the anchor $a is assigned to the $link variable
$link = $a->href;
?>
Or:
The code is as follows:
<?php
$link = $html->find('a', 0)->href;
?>
Each element object has four basic properties:
tag - returns the HTML tag name
innertext - returns the innerHTML
outertext - returns the outerHTML
plaintext - returns the text inside the HTML tag
Editing elements in the parser
Editing element attributes works much like reading them:
The code is as follows:
<?php
// Assign a new value to the href of the anchor $a
$a->href = 'http://www.jb51.net';
// Remove the href attribute
$a->href = null;
// Check whether the anchor has an href attribute
if (isset($a->href)) {
    // code
}
?>
The parser has no dedicated methods for adding or deleting elements, but you can work around this by assigning to outertext:
The code is as follows:
<?php
// Wrap an element (a <div> wrapper is assumed here for illustration)
$e->outertext = '<div>' . $e->outertext . '</div>';
// Delete an element
$e->outertext = '';
// Append an element after it (the <div> tag is assumed for illustration)
$e->outertext = $e->outertext . '<div>Foo</div>';
// Insert an element before it
$e->outertext = '<div>Foo</div>' . $e->outertext;
?>
Saving the modified HTML DOM document is also very simple:
The code is as follows:
<?php
// Dump the modified DOM tree back into a string
$doc = $html->save();
// Output
echo $doc;
?>
How to avoid excessive memory consumption by the parser
At the beginning of this article I mentioned that Simple HTML DOM Parser can consume too much memory. If a PHP script uses too much memory, the website may stop responding or run into other serious problems. The solution is simple: once the parser has loaded an HTML document and you have finished with it, remember to clear the object. Don't take the problem too seriously, though. If you load only two or three documents, it makes little difference whether you clean up or not. When you load five or more documents, you really should free the memory as soon as you finish with each one. ^_^
The code is as follows:
<?php
// Release the memory held by the parser once you are done with the document
$html->clear();
unset($html);
?>
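When many documents are processed in a row, the cleanup is easy to forget on error paths. One possible pattern, sketched below, puts clear() in a finally block. The helper name with_parsed_document and its callable parameters are assumptions made for illustration. In real use the loader would typically be the library's str_get_html function.

```php
<?php
// Hypothetical helper: parse a document, run $work on it, and always release the parser.
// $loader is assumed to return an object with a clear() method (e.g. 'str_get_html'),
// or false when the markup could not be loaded.
function with_parsed_document(string $markup, callable $loader, callable $work)
{
    $html = $loader($markup);
    if (!is_object($html)) {
        throw new RuntimeException('The document could not be loaded');
    }
    try {
        return $work($html);
    } finally {
        // Runs on success and on exceptions alike, so memory is freed either way.
        $html->clear();
        unset($html);
    }
}
```

With the class file included, counting the links in a page might look like with_parsed_document($markup, 'str_get_html', function ($html) { return count($html->find('a')); }). The callback should return plain values such as strings or counts, since element objects should not be relied on after clear() has run.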