Download: https://github.com/samacs/simple_html_dom
This parser does more than help us validate HTML documents; it can also parse HTML that is not standards-compliant. It locates elements with jQuery-like selectors (by id, class, tag, and so on) and also lets you add, delete, and modify nodes in the document tree. Of course, even such a powerful HTML DOM parser is not perfect: you need to be careful about its memory consumption. Don't worry, though; at the end of this article I'll show you how to avoid consuming too much memory.
Getting started
After uploading the class file, there are three ways to load a document with this class:
Loading an HTML document from a URL
Loading an HTML document from a string
Loading an HTML document from a file
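<?php
// Assumes the class file is reachable on the include path
require_once 'simple_html_dom.php';
// Create a new DOM instance
$html = new simple_html_dom();
// Load from a URL
$html->load_file('http://www.jb51.net');
// Load from a string (placeholder markup for illustration)
$html->load('<html><body>Hello!</body></html>');
// Load from a file
$html->load_file('path/file/test.html');
?>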
To load an HTML document from a string, you must first download it from the network yourself. cURL is recommended for fetching the HTML document, which you then load into the DOM.
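As a minimal sketch of that cURL step, the helper below downloads a page body and returns it as a string. The function name `fetch_html`, the timeout values, and the choice to treat HTTP status codes of 400 and above as failures are assumptions made for illustration; they are not part of the parser.

```php
<?php
/**
 * Fetch a page body with cURL (hypothetical helper, not part of simple_html_dom).
 * Returns the body as a string, or null on a transport error or an HTTP status >= 400.
 */
function fetch_html($url)
{
    $ch = curl_init($url);
    if ($ch === false) {
        return null;
    }
    curl_setopt_array($ch, array(
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow redirects
        CURLOPT_CONNECTTIMEOUT => 5,    // seconds; illustrative value
        CURLOPT_TIMEOUT        => 15,   // seconds; illustrative value
    ));
    $body   = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    if ($body === false || $status >= 400) {
        return null;
    }
    return $body;
}
```

When the result is not null, it can be passed to `$html->load()` on a `simple_html_dom` instance, exactly like any other HTML string.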
Finding HTML elements
You can use the find function to find elements in an HTML document. Without an index it returns an array of element objects; with an index it returns a single element object. You access these objects through the properties and methods of the parser's classes. Here are a few examples:
<?php
// Find all hyperlink elements in the HTML document
$a = $html->find('a');
// Find the (N)th hyperlink in the document (zero-based); returns null if it is not found
$a = $html->find('a', 0);
// Find the div element whose id is main
$main = $html->find('div[id=main]', 0);
// Find all div elements that have an id attribute
$divs = $html->find('div[id]');
// Find all elements that have an id attribute
$divs = $html->find('[id]');
?>
You can also use jQuery-like selectors to locate elements:
<?php
// Find the element with id="container"
$ret = $html->find('#container');
// Find all elements with class="foo"
$ret = $html->find('.foo');
// Find several kinds of HTML tags at once
$ret = $html->find('a, img');
// Selectors can be combined with attribute filters as well
$ret = $html->find('a[title], img[title]');
?>
The parser also supports descendant selectors:
<?php
// Find all li items inside ul lists
$ret = $html->find('ul li');
// Find the li items with class="selected" inside ul lists
$ret = $html->find('ul li.selected');
?>
If you find this cumbersome, you can use built-in functions to locate an element's parent, children, and siblings:
<?php
// Returns the parent element
$e->parent();
// Returns an array of child elements
$e->children();
// Returns the child element at the given index
$e->children(0);
// Returns the first child element
$e->first_child();
// Returns the last child element
$e->last_child();
// Returns the previous sibling element
$e->prev_sibling();
// Returns the next sibling element
$e->next_sibling();
?>
Attribute selectors
Attribute selectors support a few simple pattern-matching operators:
[attribute] - selects HTML elements that have the attribute
[attribute=value] - selects HTML elements whose attribute equals the value
[attribute!=value] - selects HTML elements whose attribute does not equal the value
[attribute^=value] - selects HTML elements whose attribute starts with the value
[attribute$=value] - selects HTML elements whose attribute ends with the value
[attribute*=value] - selects HTML elements whose attribute contains the value
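The operators in the list above can be illustrated with a small standalone function. This is only a sketch of the matching rules, not the parser's own implementation: the name `attribute_matches` is hypothetical, and treating a missing attribute (null) as a non-match for every operator, including `!=`, is an assumption of this example.

```php
<?php
/**
 * Illustrates the attribute selector operators (hypothetical helper).
 * $actual is the attribute value, or null when the element lacks the attribute.
 */
function attribute_matches(?string $actual, string $operator = '', string $expected = ''): bool
{
    if ($actual === null) {
        return false; // assumption: a missing attribute never matches
    }
    switch ($operator) {
        case '':   // [attribute]
            return true;
        case '=':  // [attribute=value]
            return $actual === $expected;
        case '!=': // [attribute!=value]
            return $actual !== $expected;
        case '^=': // [attribute^=value]
            return strncmp($actual, $expected, strlen($expected)) === 0;
        case '$=': // [attribute$=value]
            return $expected === '' || substr($actual, -strlen($expected)) === $expected;
        case '*=': // [attribute*=value]
            return $expected === '' || strpos($actual, $expected) !== false;
        default:
            throw new InvalidArgumentException('Unknown operator: ' . $operator);
    }
}
```

For example, `attribute_matches('photo.png', '$=', '.png')` is true, which mirrors what a selector such as `img[src$=.png]` is meant to select.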
Accessing element attributes in the parser
Element attributes are also accessible as properties of the DOM objects:
<?php
// In this example, the href value of the anchor $a is assigned to the $link variable
$link = $a->href;
?>
Or:
<?php
$link = $html->find('a', 0)->href;
?>
Each object has 4 basic properties:
tag - returns the HTML tag name
innertext - returns the inner HTML
outertext - returns the outer HTML
plaintext - returns the text inside the HTML tag
Editing elements in the parser
Editing element attributes works much like reading them:
<?php
// Assign a new href value to the anchor $a
$a->href = 'http://www.jb51.net';
// Remove the href attribute
$a->href = null;
// Check whether the href attribute exists
if (isset($a->href)) {
    // code
}
?>
The parser has no dedicated methods for adding or removing elements, but you can work around this as follows:
<?php
// Wrap an element
$e->outertext = '<div class="wrap">' . $e->outertext . '</div>';
// Delete an element
$e->outertext = '';
// Append an element after this one
$e->outertext = $e->outertext . '<div>foo</div>';
// Insert an element before this one
$e->outertext = '<div>foo</div>' . $e->outertext;
?>
Saving the modified HTML DOM document is also very simple:
<?php
$doc = $html;
// Output the document
echo $doc;
?>
How to avoid the parser consuming too much memory
At the beginning of this article, I mentioned that the Simple HTML DOM parser can consume a lot of memory. If a PHP script uses too much memory, it can cause serious problems, such as the website no longer responding. The fix is simple: once the parser has loaded an HTML document and you have finished using it, remember to clear the object. Of course, don't take the problem too seriously. If you only load 2 or 3 documents, clearing makes little difference. But once you load 5, 10, or more documents, clearing the memory after each one is simply being responsible to yourself. ^_^
<?php
$html->clear();
?>
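When many documents are processed in a loop, the clean-up can be done once per iteration. The sketch below assumes the library file sits next to the script and uses the library's `file_get_html()` helper; the function name `collect_links` is hypothetical and exists only for this example.

```php
<?php
// Assumed location of the library file; adjust the path to your project.
if (is_file(__DIR__ . '/simple_html_dom.php')) {
    require_once __DIR__ . '/simple_html_dom.php';
}

/**
 * Collect every link target from a list of HTML sources (hypothetical helper),
 * releasing each parsed document before the next one is loaded.
 */
function collect_links(array $sources)
{
    $links = array();
    foreach ($sources as $source) {
        $html = file_get_html($source);
        if (!$html) {
            continue; // skip sources that could not be loaded
        }
        foreach ($html->find('a') as $a) {
            $links[] = $a->href;
        }
        $html->clear(); // release the parser's internal references
        unset($html);   // drop our own reference as well
    }
    return $links;
}
```

Because only the extracted strings are kept, no parsed document outlives its loop iteration, so peak memory should stay close to the cost of a single document.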