PHP Simple HTML DOM Parser

Last Update:2015-09-12 Source: Internet

Author: User

Tags php class


Parsing HTML document trees with PHP has always been a challenge. Simple HTML DOM Parser handles this problem nicely. It is a PHP class (PHP 5 or later) that parses an HTML document and lets you manipulate its elements.

The parser is not limited to valid HTML: it can also parse non-compliant HTML documents. It locates elements with jQuery-like selectors (by id, class, tag, and so on) and also lets you add, delete, and modify nodes in the document tree. Such a powerful HTML DOM parser is not perfect, though, and you need to watch its memory consumption. Don't worry: at the end of this article, I'll show you how to avoid consuming too much memory.

How to use PHP HTML parsing

After uploading the class file, there are three ways to load a document with this class:

Loading an HTML document from a URL

Loading an HTML document from a string

Loading an HTML document from a file

<?php
//Create a new DOM instance
$html = new simple_html_dom();
//Load from URL
$html->load_file('http://www.cnphp.info/php-simple-html-dom-parser-intro.html');
//Load from string
$html->load('<html><body>Load HTML document from string demonstration</body></html>');
//Load from file
$html->load_file('path/file/test.html');
?>

If you want to load a remote document from a string, you need to download it from the network first. It is recommended that you use cURL to fetch the HTML document and then load it into the DOM.
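The download-then-parse approach could look roughly like the sketch below. It assumes the cURL extension is enabled and that simple_html_dom.php has already been included; `fetch_html()` and `load_remote_html()` are hypothetical helper names, not part of the library, and the timeout value is arbitrary.

```php
<?php
// Assumption: simple_html_dom.php is already included and the cURL extension is enabled.

// Hypothetical helper: download a page with cURL and return its body as a string.
function fetch_html($url)
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow HTTP redirects
        CURLOPT_TIMEOUT        => 10,   // give up after 10 seconds
    ]);
    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException('Download failed: ' . curl_error($ch));
    }
    return $body;
}

// Hypothetical helper: fetch a page and load the markup into the parser from a string.
function load_remote_html($url)
{
    $html = new simple_html_dom();
    $html->load(fetch_html($url));
    return $html;
}
```

A call such as `$html = load_remote_html('http://www.cnphp.info/php-simple-html-dom-parser-intro.html');` would then give a document that can be queried like any other.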

Finding HTML elements

You can use the find function to locate elements in an HTML document. Without an index, it returns an array of element objects, which you then work with through the methods the parser class provides. Here are a few examples:

<?php
//Find all hyperlink elements in the HTML document
$a = $html->find('a');
//Find the (n)th hyperlink in the document (zero-based); returns null if not found
$a = $html->find('a', 0);
//Find the div element whose id is main
$main = $html->find('div[id=main]', 0);
//Find all div elements that have an id attribute
$divs = $html->find('div[id]');
//Find all elements that have an id attribute
$divs = $html->find('[id]');
?>

You can also use jQuery-like selectors to locate elements:

<?php
//Find the element with id="container"
$ret = $html->find('#container');
//Find all elements with class="foo"
$ret = $html->find('.foo');
//Find multiple HTML tags
$ret = $html->find('a, img');
//Tag and attribute selectors can be combined in the same way
$ret = $html->find('a[title], img[title]');
?>

The parser also supports selecting descendant elements:

<?php
//Find all li items in ul lists
$ret = $html->find('ul li');
//Find the li items with class="selected" in ul lists
$ret = $html->find('ul li.selected');
?>

If you find this troublesome, you can use built-in functions to easily locate the parent, children, and siblings of an element:

<?php
//Returns the parent element
$e->parent();
//Returns an array of child elements
$e->children();
//Returns the specified child element by index
$e->children(0);
//Returns the first child element
$e->first_child();
//Returns the last child element
$e->last_child();
//Returns the previous sibling element
$e->prev_sibling();
//Returns the next sibling element
$e->next_sibling();
?>

Element Property Manipulation

Attribute selectors support a few simple, regular-expression-like operators:

[attribute] - selects HTML elements that have the attribute

[attribute=value] - selects HTML elements whose attribute equals the specified value

[attribute!=value] - selects HTML elements whose attribute does not equal the specified value

[attribute^=value] - selects HTML elements whose attribute starts with the specified value

[attribute$=value] - selects HTML elements whose attribute ends with the specified value

[attribute*=value] - selects HTML elements whose attribute contains the specified value
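As a rough illustration of the operators listed above, the sketch below pairs each one with an example selector and counts the matches. The tag names, attribute names, and values are illustrative only, and `count_matches()` is a hypothetical helper that works on a document already loaded with the parser.

```php
<?php
// Assumption: $html is a document already loaded with the parser.

// One example selector per operator (tag and attribute names are illustrative).
$selectors = [
    'a[title]',      // attribute is present
    'div[id=main]',  // attribute equals the value
    'div[id!=main]', // attribute does not equal the value
    'a[href^=http]', // attribute starts with the value
    'img[src$=png]', // attribute ends with the value
    'a[href*=php]',  // attribute contains the value
];

// Hypothetical helper: count how many elements each selector matches.
function count_matches($html, array $selectors)
{
    $counts = [];
    foreach ($selectors as $selector) {
        // find() without an index returns an array of element objects
        $counts[$selector] = count($html->find($selector));
    }
    return $counts;
}
```

Calling `count_matches($html, $selectors)` returns an array keyed by selector, which makes it easy to compare what each operator picks up on a given page.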

Accessing element attributes in the parser

Element attributes are exposed as properties of the element object:

<?php
//In this example, assign the href value of the anchor $a to the $link variable
$link = $a->href;
?>

Or:

<?php $link = $html->find('a', 0)->href; ?>

Each object has 4 base object properties:

tag - returns the tag name of the element

innertext - returns the inner HTML of the element

outertext - returns the outer HTML of the element

plaintext - returns the text inside the element, without tags
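A minimal sketch of reading these four properties from one element follows. `describe_element()` is a hypothetical helper, and `$e` is assumed to be an element object returned by find.

```php
<?php
// Assumption: $e is an element object returned by find(), e.g. $html->find('div', 0).

// Hypothetical helper: collect the four base properties of an element.
function describe_element($e)
{
    return [
        'tag'       => $e->tag,       // tag name of the element
        'innertext' => $e->innertext, // markup inside the element
        'outertext' => $e->outertext, // markup including the element's own tags
        'plaintext' => $e->plaintext, // text content with the tags stripped
    ];
}
```

For a document such as `<div>foo <b>bar</b></div>`, the div element should report the tag `div`, the innertext `foo <b>bar</b>`, the whole markup as outertext, and `foo bar` as plaintext.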

Editing elements in the parser

Editing element attributes works much like reading them:

<?php
//Assign a new value to the href attribute of $a
$a->href = 'http://www.cnphp.info';
//Remove the href attribute
$a->href = null;
//Check whether the href attribute exists
if (isset($a->href)) {
    // code
}
?>

The parser has no dedicated methods for adding or removing elements, but you can work around this by assigning to outertext:

<?php
//Wrap the element
$e->outertext = '<div class="wrap">' . $e->outertext . '</div>';
//Delete the element
$e->outertext = '';
//Append an element after it
$e->outertext = $e->outertext . '<div>foo</div>';
//Insert an element before it
$e->outertext = '<div>foo</div>' . $e->outertext;
?>

Saving the modified HTML DOM document is also very simple:

<?php
$doc = $html;
// output
echo $doc;
?>
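To tie editing and saving together, here is a small sketch that rewrites every hyperlink and returns the modified markup. `rewrite_links()` is a hypothetical helper; it relies on the parser's save method, which returns the document as a string when called without arguments.

```php
<?php
// Assumption: $html is a document already loaded with the parser.

// Hypothetical helper: point every hyperlink at $newHref and return the resulting markup.
function rewrite_links($html, $newHref)
{
    foreach ($html->find('a') as $a) {
        $a->href = $newHref; // assigning to an attribute edits the DOM in place
    }
    return $html->save();    // with no argument, save() returns the document as a string
}
```

Passing a file path instead, as in `$html->save('result.html')`, should also write the markup to that file; the file name here is only an example.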

How to avoid the parser consuming too much memory

At the beginning of this article, I mentioned that Simple HTML DOM Parser can consume a lot of memory. If a PHP script takes up too much memory, it can cause serious problems, such as the website no longer responding. The workaround is simple: once the parser has loaded an HTML document and you have finished with it, remember to clean up the object. There is no need to overstate the problem, though. If you only load two or three documents, cleaning up makes little difference. If you load five, ten, or more documents, clear the memory each time you finish with one; that part is your responsibility. ^_^

<?php
$html->clear();
?>
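When many documents are processed in a loop, the cleanup could be arranged as in the sketch below. `collect_titles()` is a hypothetical helper; it assumes simple_html_dom.php is included so that the library's `file_get_html()` loader is available, and it accepts a different loader callable only to keep the example easy to try out.

```php
<?php
// Assumption: simple_html_dom.php is already included, so file_get_html() is available.

// Hypothetical helper: read the <title> of each page and release the
// parser's memory after every document.
function collect_titles(array $sources, $loader = 'file_get_html')
{
    $titles = [];
    foreach ($sources as $source) {
        $html = call_user_func($loader, $source);
        if (!$html) {
            continue; // skip documents that could not be loaded
        }
        $title = $html->find('title', 0);
        $titles[$source] = $title ? $title->plaintext : null;

        $html->clear(); // let the parser release its internal references
        unset($html);   // drop our own reference before loading the next document
    }
    return $titles;
}
```

The point of the design is that clear is called inside the loop, once per document, rather than once at the end of the script.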

Original: http://www.cnphp.info/php-simple-html-dom-parser-intro.html


This article is an English translation of an article originally published in Chinese on aliyun.com and is provided for information purposes only.
