Simple HTML DOM: a simple PHP class library for parsing HTML

Parsing an HTML document tree in PHP has always been a challenge. Simple HTML DOM parser solves this problem well: you can use this PHP class to parse HTML documents and manipulate the elements in them (requires PHP 5 or later).

Download: https://github.com/samacs/simple_html_dom

The parser does more than validate HTML documents: it can also parse documents that do not conform to W3C standards. It uses jQuery-like selectors to locate elements by id, class, or tag, and it lets you add, delete, and modify nodes in the document tree. It is not perfect, however: you need to watch its memory consumption. The last section of this article explains how to keep memory use under control.
Getting started
After uploading the class file, there are three ways to load a document with this class:
Load an HTML document from a URL
Load an HTML document from a string
Load an HTML document from a file

The code is as follows:

<?php
// Include the class file (adjust the path to wherever you placed it)
include 'simple_html_dom.php';

// Create a new DOM instance
$html = new simple_html_dom();

// Load from a URL
$html->load_file('http://www.jb51.net');

// Load from a string
$html->load('<html><body>Demo: loading an HTML document from a string</body></html>');

// Load from a file
$html->load_file('path/file/test.html');
?>


If you want to load a remote HTML document from a string, you must first download it from the network yourself. It is recommended that you use cURL to fetch the HTML and then load it into the DOM.
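A minimal sketch of the cURL approach, assuming the PHP cURL extension is installed. The helper name `fetch_html` and its timeout parameter are illustrative and not part of the parser. It returns the page body as a string, or null when the transfer fails. It does not check the HTTP status code, so an error page is returned like any other body.

```php
<?php
// Hypothetical helper: download a page body with cURL.
// Returns the body as a string, or null if the transfer failed.
function fetch_html($url, $timeoutSeconds = 10)
{
    $ch = curl_init($url);
    if ($ch === false) {
        return null;
    }
    curl_setopt_array($ch, array(
        CURLOPT_RETURNTRANSFER => true,            // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,            // follow HTTP redirects
        CURLOPT_CONNECTTIMEOUT => $timeoutSeconds, // give up if the connection stalls
        CURLOPT_TIMEOUT        => $timeoutSeconds, // cap the total transfer time
    ));
    $body = curl_exec($ch);

    // curl_exec() returns false on failure (DNS error, timeout, ...)
    return is_string($body) ? $body : null;
}

// Usage with the parser (requires the class file to be included first):
// $body = fetch_html('http://www.jb51.net');
// if ($body !== null) {
//     $html = new simple_html_dom();
//     $html->load($body);
// }
```

Checking for null before calling load() keeps a failed download from being parsed as an empty document.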
Finding HTML elements
Use the find function to look up elements in an HTML document. Without an index it returns an array of element objects; with an index it returns a single element object. You then work with these objects through the methods and properties the parser provides. Here are a few examples:

The code is as follows:

<?php
// Find all hyperlink elements in the document
$a = $html->find('a');

// Find the Nth hyperlink (zero-based); returns null if it is not found
$a = $html->find('a', 0);

// Find the div element whose id is main
$main = $html->find('div[id=main]', 0);

// Find all div elements that have an id attribute
$divs = $html->find('div[id]');

// Find all elements that have an id attribute
$elements = $html->find('[id]');
?>


You can also locate elements with jQuery-like selectors:

The code is as follows:

<?php
// Find the element with id="container"
$ret = $html->find('#container');

// Find all elements with class="foo"
$ret = $html->find('.foo');

// Find several HTML tags at once
$ret = $html->find('a, img');

// Attribute filters can be combined the same way:
// all a and img elements that have a title attribute
$ret = $html->find('a[title], img[title]');
?>


The parser also supports looking up descendant elements:

The code is as follows:

<?php
// Find all li items inside ul lists
$ret = $html->find('ul li');

// Find the li items with class="selected" inside ul lists
$ret = $html->find('ul li.selected');
?>


If writing selectors feels like a hassle, you can use the built-in functions to reach an element's parent, children, and adjacent elements:

The code is as follows:

<?php
// $e is an element object returned by find()

// Returns the parent element
$e->parent();

// Returns an array of child elements
$e->children();

// Returns the child element at the given index
$e->children(0);

// Returns the first child element
$e->first_child();

// Returns the last child element
$e->last_child();

// Returns the previous adjacent element
$e->prev_sibling();

// Returns the next adjacent element
$e->next_sibling();
?>


Attribute selectors
Attribute selectors accept a few simple, regular-expression-like operators:
[attribute] - selects elements that have the attribute
[attribute=value] - selects elements whose attribute equals the value
[attribute!=value] - selects elements whose attribute does not equal the value
[attribute^=value] - selects elements whose attribute starts with the value
[attribute$=value] - selects elements whose attribute ends with the value
[attribute*=value] - selects elements whose attribute contains the value
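The meaning of each operator can be sketched with plain string functions. This is only an illustration of the semantics listed above, using case-sensitive comparisons. The function name `attr_matches` is hypothetical, and the parser's own matching may differ in details such as case sensitivity.

```php
<?php
// Illustration only: mirrors what each selector operator means.
// This is NOT the parser's internal implementation.
function attr_matches($actual, $operator, $expected)
{
    switch ($operator) {
        case '=':  // exact match
            return $actual === $expected;
        case '!=': // anything except an exact match
            return $actual !== $expected;
        case '^=': // attribute value starts with $expected
            return strncmp($actual, $expected, strlen($expected)) === 0;
        case '$=': // attribute value ends with $expected
            return $expected === ''
                || substr($actual, -strlen($expected)) === $expected;
        case '*=': // attribute value contains $expected
            return $expected === ''
                || strpos($actual, $expected) !== false;
        default:
            throw new InvalidArgumentException("Unsupported operator: $operator");
    }
}
```

For example, a selector such as img[src$=.png] keeps an image whose src is photo.png, because attr_matches('photo.png', '$=', '.png') is true.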
Reading element attributes in the parser
An element's attributes are available as properties of its object:

The code is as follows:

<?php
// Assign the href value of the anchor $a to the $link variable
$link = $a->href;
?>


Or:

The code is as follows:

<?php
// Read the href of the first anchor in the document
$link = $html->find('a', 0)->href;
?>


Each element object has 4 basic properties:
tag - returns the tag name of the element
innertext - returns the inner HTML of the element
outertext - returns the outer HTML of the element
plaintext - returns the text inside the element, with tags stripped
Editing elements in the parser
Editing element attributes works much like reading them:

The code is as follows:

<?php
// Assign a new value to the href of the anchor $a
$a->href = 'http://www.jb51.net';

// Remove the href attribute
$a->href = null;

// Check whether the href attribute exists
if (isset($a->href)) {
    // Code
}
?>


The parser has no dedicated methods for adding or removing elements, but you can work around this by assigning to outertext:

The code is as follows:

<?php
// Wrap the element in another element
$e->outertext = '<div class="wrap">' . $e->outertext . '</div>';

// Delete the element
$e->outertext = '';

// Append a new element after the element
$e->outertext = $e->outertext . '<div>foo</div>';

// Insert a new element before the element
$e->outertext = '<div>foo</div>' . $e->outertext;
?>


Getting the modified HTML document back out is also very simple:

The code is as follows:

<?php
$doc = $html;

// Output: echoing the object converts it to its HTML string
echo $doc;
?>


How to prevent the parser from consuming too much memory
At the beginning of this article, I mentioned that Simple HTML DOM parser can consume a lot of memory. If a PHP script takes up too much memory, it can cause serious problems, such as the website no longer responding. The solution is simple: once the parser has loaded an HTML document and you have finished using it, remember to clear the object. Of course, there is no need to take the problem too seriously. If you load only 2 or 3 documents, clearing makes little difference. When you load 5, 10, or more documents, clear the memory each time you finish with one.

The code is as follows:

<?php
$html->clear();
?>
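When many documents are processed in a loop, the cleanup can be wrapped so it is never forgotten. The sketch below is one possible pattern and needs PHP 5.5 or later for try/finally. The helper name `each_document` and its two callbacks are hypothetical and not part of the parser. It only assumes that the loader returns an object with a clear() method.

```php
<?php
/**
 * Hypothetical helper (not part of the parser): load each source, hand the
 * DOM to a callback, and always release it before moving to the next one.
 *
 * $load  - callable that turns a source into a DOM object with a clear() method
 * $visit - callable that receives the DOM object and the source it came from
 */
function each_document(array $sources, callable $load, callable $visit)
{
    foreach ($sources as $source) {
        $dom = $load($source);
        if (!is_object($dom)) {
            continue; // loading failed; there is nothing to clean up
        }
        try {
            $visit($dom, $source);
        } finally {
            $dom->clear(); // release the memory held by the parsed document
            unset($dom);   // drop our own reference as well
        }
    }
}
```

With the parser, $load could be a small closure that creates a simple_html_dom instance, calls load_file() on the source, and returns the instance. The finally block means clear() still runs if the callback throws an exception.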
