Scraping web pages in PHP with DOMDocument

Source: Internet
Author: User
Tags: xpath

Scraping is something a lot of companies do all the time: it is a quick way to get data that others worked hard to produce. It may not be ethical, but it cannot really be stopped!


Common ways to scrape pages in PHP are:

    1. Regular expressions.

    2. DOM objects.

    3. String functions.


This article only covers a few issues that come up when scraping with DOM objects:


PHP has DOM objects designed specifically for handling HTML and XML files, which is very handy.

$dom = new DOMDocument('1.0', 'GBK'); // create the DOM object
@$dom->loadHTMLFile($url);            // load the HTML content at the given URL
$xpath = new DOMXPath($dom);          // create the DOMXPath object

The DOMXPath object supports XPath path expressions; see http://www.w3school.com.cn/xpath/

Like jQuery selectors, XPath path expressions make it easy to find the node you want and extract its content, and XPath offers many more ways of selecting than jQuery does. It also supports a variety of functions.
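To make the comparison with jQuery selectors concrete, here is a minimal sketch. The markup, the `news` id, and the `item` class are made up for illustration; a real scraper would load a fetched page instead.

```php
<?php
// Hypothetical sample markup; a real scraper would fetch this from a URL.
$html = <<<HTML
<html><head><title>Demo</title></head>
<body>
  <div id="news">
    <a class="item" href="/a/1.html">First</a>
    <a class="item" href="/a/2.html">Second</a>
    <a href="/about.html">About</a>
  </div>
</body></html>
HTML;

$dom = new DOMDocument();
@$dom->loadHTML($html);          // @ hides warnings about imperfect markup
$xpath = new DOMXPath($dom);

// Path expression with predicates, roughly what "#news a.item" does in jQuery
$links = $xpath->query('//div[@id="news"]/a[@class="item"]');

// XPath functions such as contains() can be used inside predicates
$second = $xpath->query('//a[contains(@href, "2.html")]');

echo $links->length, "\n";
echo $second->item(0)->nodeValue, "\n";
```

The first query selects only the two links carrying the `item` class; the second uses the `contains()` function to match on part of the `href` value.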


Note:

@$dom->loadHTMLFile($url); // load the HTML content at the given URL

It is best to put the @ operator in front of this call, because loading and parsing real HTML almost always produces some warnings. For example, the parser expects every & in the page to be escaped as &amp; and every HTML entity to end with ; before it parses cleanly. Pages you scrape cannot be expected to meet requirements like these.
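If you would rather see what the parser complained about than silence it with @, PHP's libxml functions can collect the messages instead. This is an alternative to the @ approach described above, not what the original article uses; the sloppy markup is invented for the example.

```php
<?php
// Deliberately sloppy markup: a bare "&" and unclosed <p> tags.
$html = '<html><body><p>Tom & Jerry<p>next</body></html>';

// Collect libxml parse errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$loaded = $dom->loadHTML($html);

$errors = libxml_get_errors();          // array of LibXMLError objects
libxml_clear_errors();
libxml_use_internal_errors($previous);  // restore the previous setting

foreach ($errors as $error) {
    echo trim($error->message), ' (line ', $error->line, ")\n";
}

// The document is still usable despite any reported problems.
$paragraphs = $dom->getElementsByTagName('p');
echo $paragraphs->length, "\n";
```

Which messages are reported depends on the installed libxml version, so the sketch only prints them and does not rely on their exact wording.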


The most troublesome part of scraping is handling Chinese characters. With regular expressions, multibyte matching is convenient but error-prone: when a pattern contains literal characters (Chinese in particular), you must make sure the character set of the scraped content is the same as the character set of your own source code, otherwise the match easily fails.
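One way to satisfy that requirement is to convert the scraped text to the encoding of your source file before matching. The sketch below assumes the source file is saved as UTF-8, the page is GBK, and the mbstring extension is available; the sample heading is made up.

```php
<?php
// Simulate a page fetched in GBK; in a real scraper these bytes come from the network.
$gbkHtml = mb_convert_encoding('<h1>新闻标题</h1>', 'GBK', 'UTF-8');

// This source file is assumed to be UTF-8, so the literal text in the pattern is UTF-8.
$pattern = '/<h1>(新闻.*?)<\/h1>/u';

// Convert the scraped text to the pattern's encoding before matching.
$utf8Html = mb_convert_encoding($gbkHtml, 'UTF-8', 'GBK');

$matched = preg_match($pattern, $utf8Html, $m);
echo $matched ? $m[1] : 'no match', "\n";

// Matching the raw GBK bytes fails: with the /u modifier the subject
// must be valid UTF-8, and these bytes are not.
$rawResult = @preg_match($pattern, $gbkHtml);
var_dump($rawResult);
```

Converting once, right after fetching, keeps every later pattern and string comparison in a single encoding.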

Parsing HTML with DOMDocument is also prone to problems with Chinese text, which usually comes out garbled. The main cause is HTML that does not follow the standard structure.

Garbled text is a character set problem. The character set is normally declared in the HTML head:

<meta http-equiv="Content-Type" content="text/html; charset=gb2312">

When this tag is missing, DOMDocument parses the HTML using its default encoding, which corrupts the Chinese text.


So when scraping with DOMDocument, do not load the HTML directly with

@$dom->loadHTMLFile($url); // load the HTML content at the given URL

because if the fetched HTML does not declare a character set, the parsed result will be unusable.

It is better to use:

$ch = curl_init($url);                          // create the connection
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the content instead of printing it
curl_setopt($ch, CURLOPT_TIMEOUT, 10);          // set the timeout in seconds
$html = curl_exec($ch);                         // run the request and get the content
if ($err = curl_error($ch)) {                   // check for an error
    die($err);
} else {
    // check whether the Content-Type meta tag is present
    if (stripos($html, '"Content-type"') === false && stripos($html, 'content="text/html;') === false) {
        $meta = '

Be sure to check which character set the scraped page actually uses, and declare that one.
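The listing above breaks off at the `$meta` assignment. As a minimal sketch of the idea it describes (declare a character set when the page has none, then parse the string), something like the following could work. The helper name `loadHtmlWithCharset` and the detection pattern are assumptions made for illustration, and the cURL fetch is left out so the example runs offline.

```php
<?php
/**
 * Hypothetical helper: parse $html, declaring $charset first
 * if the page itself does not declare one.
 */
function loadHtmlWithCharset(string $html, string $charset): DOMDocument
{
    // Look for an existing charset declaration in a meta tag.
    if (!preg_match('/<meta[^>]+charset\s*=/i', $html)) {
        $meta = '<meta http-equiv="Content-Type" content="text/html; charset=' . $charset . '">';
        $html = $meta . $html;
    }

    $dom = new DOMDocument();
    @$dom->loadHTML($html);   // @ hides warnings about imperfect markup
    return $dom;
}

// A UTF-8 page without any charset declaration (sample data).
$page = '<html><head><title>t</title></head><body><p>中文内容</p></body></html>';

$dom = loadHtmlWithCharset($page, 'utf-8');
$text = $dom->getElementsByTagName('p')->item(0)->nodeValue;
echo $text, "\n";
```

For a page served as GB2312, the second argument would be `'gb2312'`. Whichever value is passed has to match the bytes actually received, which is why the character set needs checking per site.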

DOMXPath has two methods for working with the nodes of a document:

query and evaluate

query: evaluates the given XPath expression and returns a DOMNodeList object containing the matching nodes. It returns false if the expression is invalid.

evaluate: evaluates the given XPath expression and returns a typed result (for example a number or a string) when possible, otherwise a DOMNodeList object containing the matching nodes. It returns false if the expression is invalid.


Both methods fetch nodes, but their return values differ somewhat.
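A short sketch of the difference in return values, using made-up markup:

```php
<?php
$dom = new DOMDocument();
@$dom->loadHTML('<html><body><h1>Title</h1><ul><li>a</li><li>b</li></ul></body></html>');
$xpath = new DOMXPath($dom);

// query() returns a DOMNodeList for node-set expressions.
$nodes = $xpath->query('//li');

// evaluate() returns a typed result when the expression yields one.
$count = $xpath->evaluate('count(//li)');   // float
$title = $xpath->evaluate('string(//h1)');  // string

// For a plain node-set expression, evaluate() also returns a DOMNodeList.
$same = $xpath->evaluate('//li');

// A malformed expression makes query() return false (the warning is silenced with @).
$bad = @$xpath->query('//li[');

var_dump($nodes->length, $count, $title, $same->length, $bad);
```

In practice, query is enough for pulling out nodes, and evaluate is useful when you want a count or a string directly from an XPath function.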

The DOMNodeList object has one property and one method:

/* Property: the number of nodes in the list */
readonly public int $length;
/* Method: returns the node at the given index */
DOMNode DOMNodeList::item(int $index)


Use the item method to get a node as a DOMElement object.

You can then call getAttribute to read the node's attribute values, or use the nodeValue property to read the node's content.
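Putting `length`, `item`, `getAttribute`, and `nodeValue` together, a loop over a query result might look like this. The links are sample data, and the `$rows` array is just one possible way to collect the values.

```php
<?php
$html = '<html><body>'
      . '<a href="/a.html" title="A">First</a>'
      . '<a href="/b.html">Second</a>'
      . '</body></html>';

$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$links = $xpath->query('//a');
$rows = [];

for ($i = 0; $i < $links->length; $i++) {
    $node = $links->item($i);
    // Only element nodes have attributes.
    if ($node instanceof DOMElement) {
        $rows[] = [
            'href'  => $node->getAttribute('href'),
            'title' => $node->getAttribute('title'), // empty string when the attribute is absent
            'text'  => $node->nodeValue,
        ];
    }
}

print_r($rows);
```

A DOMNodeList can also be traversed with `foreach`, which avoids the index bookkeeping.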

This article is from the "gangbusters" blog; please keep this source attribution: http://php2012web.blog.51cto.com/5585213/1619260
