Parsing HTML in PHP with the simple_html_dom class library (detailed guide)

Download: https://github.com/samacs/simple_html_dom

The parser does more than help validate HTML documents; it can also parse HTML that does not conform to W3C standards. It uses jQuery-like selectors to locate elements by id, class, or tag, and it lets you add, delete, and modify nodes in the document tree. Such a powerful HTML DOM parser is not perfect, though: you need to watch its memory consumption while using it. Don't worry; the end of this article explains how to avoid consuming too much memory.
Getting started
After uploading the class file, there are three ways to load a document with this class:
Load an HTML document from a URL
Load an HTML document from a string
Load an HTML document from a file
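<?php
// Load the class file (adjust the path to wherever you uploaded it)
require_once 'simple_html_dom.php';

// Create a new DOM instance
$html = new simple_html_dom();

// Load from a URL
$html->load_file('http://www.jb51.net');

// Load from a string ($markup holds HTML text obtained elsewhere)
$html->load($markup);

// Load from a file
$html->load_file('path/file/test.html');
?>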

If you load an HTML document from a string, you need to download it from the network first. It is recommended that you use cURL to fetch the HTML document and then load it into the DOM.
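As a rough sketch of that cURL step, the helper below downloads a page and returns its body as a string. The function name `fetch_html` and its timeout parameter are hypothetical and are not part of simple_html_dom; it assumes PHP's cURL extension is enabled.

```php
<?php
// Hypothetical helper (not part of simple_html_dom): download a page with
// cURL and return its body, or null if the request fails.
function fetch_html(string $url, int $timeoutSeconds = 10): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,            // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,            // follow HTTP redirects
        CURLOPT_TIMEOUT        => $timeoutSeconds, // give up after this many seconds
    ]);
    $body   = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);

    // Treat transport errors and HTTP error statuses as failures
    if ($body === false || $status >= 400) {
        return null;
    }
    return $body;
}
```

When the result is not null, it can be passed to the parser with `$html->load($body)`, as in the string-loading example above.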
Finding HTML elements
You can use the find function to look up elements in an HTML document. It returns an array of element objects. The parser class provides methods and properties for working with these objects; here are a few examples:
<?php
// Find all hyperlink elements in the HTML document
$a = $html->find('a');

// Find the first (index 0) hyperlink in the document; returns null if it is not found
$a = $html->find('a', 0);

// Find the div element whose id is "main"
$main = $html->find('div[id=main]', 0);

// Find all div elements that have an id attribute
$divs = $html->find('div[id]');

// Find all elements that have an id attribute
$divs = $html->find('[id]');
?>

You can also use jQuery-like selectors to locate elements:
<?php
// Find the element with id="container"
$ret = $html->find('#container');

// Find all elements with class="foo"
$ret = $html->find('.foo');

// Find several kinds of HTML tags at once
$ret = $html->find('a, img');

// Find all links and images that have a title attribute
$ret = $html->find('a[title], img[title]');
?>

The parser also supports descendant selectors:
<?php
// Find all li items inside ul lists
$ret = $html->find('ul li');

// Find the li items with class="selected" inside ul lists
$ret = $html->find('ul li.selected');
?>

If that feels cumbersome, you can use built-in methods to reach an element's parent, children, and siblings ($e is an element returned by find):
<?php
// Returns the parent element
$e->parent();

// Returns an array of child elements
$e->children();

// Returns the child element at the given index
$e->children(0);

// Returns the first child element
$e->first_child();

// Returns the last child element
$e->last_child();

// Returns the previous sibling element
$e->prev_sibling();

// Returns the next sibling element
$e->next_sibling();
?>

Working with element attributes
Attribute selectors support a few simple pattern-matching operators:
[attribute] - selects elements that have the attribute
[attribute=value] - selects elements whose attribute equals the value
[attribute!=value] - selects elements whose attribute does not equal the value
[attribute^=value] - selects elements whose attribute begins with the value
[attribute$=value] - selects elements whose attribute ends with the value
[attribute*=value] - selects elements whose attribute contains the value
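To make the operators above concrete, here is a small plain-PHP illustration of what each comparison means for an attribute value that is present. The function `attribute_matches` is hypothetical and uses simple string comparison; it is not the library's internal implementation, which may differ in details such as case sensitivity.

```php
<?php
// Illustration only: the meaning of each attribute operator, expressed as a
// plain string comparison. Not the library's own matching code.
function attribute_matches(string $actual, string $operator, string $expected): bool
{
    switch ($operator) {
        case '=':   // attribute equals the value
            return $actual === $expected;
        case '!=':  // attribute does not equal the value
            return $actual !== $expected;
        case '^=':  // attribute begins with the value
            return strncmp($actual, $expected, strlen($expected)) === 0;
        case '$=':  // attribute ends with the value
            return $expected === '' || substr($actual, -strlen($expected)) === $expected;
        case '*=':  // attribute contains the value
            return $expected === '' || strpos($actual, $expected) !== false;
        default:
            throw new InvalidArgumentException("Unknown operator: $operator");
    }
}
```

For instance, an `href` of `http://www.jb51.net` satisfies `[href^=http]` and `[href*=jb51]`, while a `src` of `photo.png` satisfies `[src$=.png]`.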
Reading element attributes in the parser
Element attributes are exposed as object properties in the DOM:
<?php
// Here the href value of the anchor $a is assigned to the $link variable
$link = $a->href;
?>

Or:
<?php
$link = $html->find('a', 0)->href;
?>

Each element object has four basic properties:
tag - returns the tag name
innertext - returns the inner HTML
outertext - returns the outer HTML
plaintext - returns the text inside the HTML tag
Editing elements in the parser
Editing an element's attributes works much like reading them:
<?php
// Assign a new value to the href of the anchor $a
$a->href = 'http://www.jb51.net';

// Remove the href attribute
$a->href = null;

// Check whether the href attribute exists
if (isset($a->href)) {
    // code
}
?>

There is no specific method in the parser to add or remove elements, but you can work around it:
<?php
// Wrap an element
$e->outertext = '<div class="wrap">' . $e->outertext . '</div>';

// Delete an element
$e->outertext = '';

// Append an element after it
$e->outertext = $e->outertext . '<div>foo</div>';

// Insert an element before it
$e->outertext = '<div>foo</div>' . $e->outertext;
?>

Saving the modified HTML DOM document is also very simple:
<?php
$doc = $html;
// Output the modified document
echo $doc;
?>

How to prevent the parser from consuming too much memory
At the beginning of this article, I mentioned that the Simple HTML DOM parser can consume too much memory. If a PHP script takes up too much memory, it can cause the website to stop responding, along with other serious problems. The solution is simple: after the parser has loaded an HTML document and you have finished using it, remember to clear the object. Of course, don't take the problem too seriously. If you only load 2 or 3 documents, clearing or not clearing makes little difference. When you load 5, 10, or more documents, clear the memory after you finish with each one. ^_^
<?php
// Release the memory held by the parsed document
$html->clear();
?>
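When many documents are processed in a row, the clean-up can be built into the loop so that it is never forgotten. The sketch below is one possible arrangement, not a feature of the library: `process_documents` is a hypothetical helper, `$load` is any callable that returns a parsed document object with a clear method, and `$handle` extracts whatever you need from it.

```php
<?php
// Hypothetical helper: parse each source, extract a result, and always clear
// the parsed document before moving on to the next one.
function process_documents(array $sources, callable $load, callable $handle): array
{
    $results = [];
    foreach ($sources as $source) {
        $html = $load($source);
        if (!$html) {
            continue; // loading failed, so there is nothing to clean up
        }
        try {
            $results[$source] = $handle($html);
        } finally {
            $html->clear(); // release the parser's memory, even if $handle throws
            unset($html);
        }
    }
    return $results;
}
```

With the library loaded, a call might look like `process_documents($urls, 'file_get_html', function ($html) { return count($html->find('a')); })`, assuming the `file_get_html` helper function that ships with common versions of the library is available.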
