Scraping is something a lot of companies do: it is a quick way to get data that others worked hard to collect. It may not be ethical, but it cannot really be prevented.
Common ways to scrape with PHP are:
Regular expressions.
DOM objects.
String functions.
Here are a few notes on scraping with DOM objects:
PHP has DOM objects specifically designed to handle HTML or XML files, which is very handy.
$dom = new DOMDocument('1.0', 'GBK'); // create the DOM object
@$dom->loadHTMLFile($url); // load the HTML content at the given URL
$xpath = new DOMXPath($dom); // create the DOMXPath object
The DOMXPath object supports XPath path expressions; see http://www.w3school.com.cn/xpath/
XPath path expressions, like jQuery selectors, make it easy to find the matching nodes and extract their content, and XPath is considerably more expressive than jQuery selectors. It also supports a variety of functions.
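As a minimal sketch of what such an expression looks like in practice, the following combines an attribute predicate with the XPath starts-with() function. The sample markup, class names, and variable names are made up for illustration and are not from any real page.

```php
<?php
// Hypothetical sample page; the markup and class names are illustrative only.
$html = '<html><body>'
      . '<div class="news"><a href="/a/1.html">First</a><a href="/b/2.html">Second</a></div>'
      . '<div class="ads"><a href="/a/3.html">Ad</a></div>'
      . '</body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Select links that sit directly inside <div class="news">
// and whose href attribute starts with "/a/".
$nodes = $xpath->query('//div[@class="news"]/a[starts-with(@href, "/a/")]');

// Collect the text content of each matching link.
$titles = [];
foreach ($nodes as $node) {
    $titles[] = $node->nodeValue;
}
```

Here $titles ends up holding only the text of the first link, because the second link fails the starts-with() test and the third is outside the "news" block.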
Note:
@$dom->loadHTMLFile($url); // load the HTML content at the given URL
This call is best prefixed with the @ symbol, because parsing real-world HTML almost always produces some warnings. For example, a bare & in the page has to be escaped as &amp; and every HTML entity has to end with a semicolon before the parser accepts the page without complaint. Pages being scraped cannot be expected to meet such requirements.
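As an alternative to the @ operator (not the approach used in this article), libxml's own error buffer can be switched on so that parse problems are collected rather than printed. This is only a sketch; the sample markup is hypothetical, and which errors get reported depends on the libxml2 version in use.

```php
<?php
// Hypothetical sample: a bare "&" and an entity without its closing ";".
$html = '<html><body><p>Tom & Jerry&nbsp</p></body></html>';

// Collect libxml errors in an internal buffer instead of raising PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

// Array of LibXMLError objects; it may be empty, depending on the libxml2 version.
$errors = libxml_get_errors();
libxml_clear_errors();

// Restore whatever setting was active before.
libxml_use_internal_errors($previous);

// The document is still parsed and usable despite the sloppy markup.
$text = $dom->getElementsByTagName('p')->item(0)->nodeValue;
```

The advantage over @ is that the messages stay available in $errors for logging, while the page is parsed just the same.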
The most troublesome part of scraping is handling Chinese characters. With regular expressions, multibyte matching is convenient but error-prone: when a pattern contains multibyte text, the character set of the scraped page and the character set of the current source code must be the same, otherwise the match is likely to fail.
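One way to keep the two character sets aligned is to convert the scraped page to UTF-8 before matching. The sketch below assumes the page is GBK-encoded and that the iconv extension is available; the sample bytes and variable names are hypothetical.

```php
<?php
// Hypothetical sample: a <title> holding the two-character word for "Chinese",
// given as raw GBK bytes so that this file's own encoding does not matter.
$gbkHtml = "<title>\xD6\xD0\xCE\xC4</title>";

// Convert the page to UTF-8 first (GBK as the source encoding is an assumption).
$utf8Html = iconv('GBK', 'UTF-8', $gbkHtml);

// The /u modifier makes PCRE treat both pattern and subject as UTF-8;
// \p{Han} matches any Chinese (Han) character.
$found = preg_match('/<title>(\p{Han}+)<\/title>/u', $utf8Html, $m);
$title = ($found === 1) ? $m[1] : null;
```

Running the same pattern against the unconverted $gbkHtml would not match, because the raw GBK bytes are not valid UTF-8.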
Parsing HTML with DOMDocument is also prone to problems with Chinese text, usually in the form of garbled characters. The main cause is non-standard HTML structure.
Garbled text is a character set problem. The character set is normally declared in the HTML head:
<meta http-equiv="Content-Type" content="text/html; charset=gb2312">
When this tag is missing, DOMDocument parses the HTML with its default encoding, which corrupts the Chinese text.
So when scraping with DOMDocument, do not load the HTML directly with
@$dom->loadHTMLFile($url); // load the HTML content at the given URL
because if the scraped HTML does not declare a character set, the entire parsed content will be unusable.
It is better to use:
$ch = curl_init($url); // create the connection
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the content instead of printing it
curl_setopt($ch, CURLOPT_TIMEOUT, 10); // set the timeout in seconds
$html = curl_exec($ch); // perform the request and get the content
if ($err = curl_error($ch)) { // check for an error
    die($err);
} else {
    // check whether this tag exists
    if (stripos($html, '"Content-Type"') === false && stripos($html, 'content="text/html;') === false) {
        $meta = '
The character set in this tag must correspond to the character set of the scraped page, so check it carefully.
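The approach can be sketched end to end as follows. This is an illustration under stated assumptions, not the original listing: the helper names ensureCharsetMeta, fetchHtml and loadDom are hypothetical, the check looks for any "charset=" declaration rather than the exact strings tested above, and the charset passed in is assumed to be known to match the scraped page.

```php
<?php
// Hypothetical helper: prepend a charset declaration when the page has none.
// $charset must match the real encoding of the scraped page (assumed known).
function ensureCharsetMeta(string $html, string $charset): string
{
    // Case-insensitive search for an existing charset declaration.
    if (stripos($html, 'charset=') !== false) {
        return $html;
    }
    $meta = '<meta http-equiv="Content-Type" content="text/html; charset=' . $charset . '">';
    return $meta . $html;
}

// Hypothetical helper: fetch a page with cURL and return its body.
function fetchHtml(string $url): string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body as a string
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);          // give up after 10 seconds
    $html = curl_exec($ch);
    if ($html === false) {
        throw new RuntimeException(curl_error($ch));
    }
    return $html;
}

// Hypothetical helper: parse HTML after making sure a charset is declared.
function loadDom(string $html, string $charset): DOMDocument
{
    $dom = new DOMDocument();
    // @ suppresses the warnings that sloppy real-world markup tends to trigger.
    @$dom->loadHTML(ensureCharsetMeta($html, $charset));
    return $dom;
}
```

A call such as loadDom(fetchHtml($url), 'gb2312') would then stand in for the direct loadHTMLFile call, with 'gb2312' replaced by whatever encoding the target page really uses.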
DOMXPath has two methods for working with nodes:
query and evaluate
query: Executes the given XPath expression and returns a DOMNodeList object containing all matching nodes. It returns false if the expression is invalid.
evaluate: Evaluates the given XPath expression and returns a typed result (for example a number or a string) where possible, or a DOMNodeList object containing all matching nodes. It returns false if the expression is invalid.
Both methods fetch nodes, but they differ in their return values.
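The difference in return values can be seen in a short sketch like the one below; the sample markup and variable names are made up for illustration.

```php
<?php
$dom = new DOMDocument();
$dom->loadHTML('<html><body><h1>Title</h1><ul><li>a</li><li>b</li><li>c</li></ul></body></html>');
$xpath = new DOMXPath($dom);

// query() always yields a DOMNodeList for a valid expression.
$list = $xpath->query('//li');

// evaluate() returns a typed value when the expression produces one:
$count = $xpath->evaluate('count(//li)');  // a float
$title = $xpath->evaluate('string(//h1)'); // a string

// For a plain node-selecting expression, evaluate() also returns a DOMNodeList.
$nodes = $xpath->evaluate('//li');
```

So evaluate is the one to reach for when the expression uses XPath functions such as count() or string(), while query is enough for selecting nodes.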
The DOMNodeList object has one property and one method:
/* Property: the number of nodes in the list */
readonly public int $length;
/* Method: gets the node at the given index */
DOMNode DOMNodeList::item(int $index)
Use the item method to get a node as a DOMElement object.
You can then use getAttribute to read the node's attribute values, or the nodeValue property to read the node's content.
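Put together, walking a node list might look like the following sketch. The sample links and the $rows structure are hypothetical and only serve to show length, item, getAttribute and nodeValue working together.

```php
<?php
$dom = new DOMDocument();
$dom->loadHTML('<html><body><a href="/one.html">One</a><a href="/two.html">Two</a></body></html>');
$xpath = new DOMXPath($dom);
$links = $xpath->query('//a');

$rows = [];
// length gives the number of nodes; item($i) returns the node at index $i.
for ($i = 0; $i < $links->length; $i++) {
    $node = $links->item($i);
    // getAttribute exists only on element nodes, so check the type first.
    if ($node instanceof DOMElement) {
        $rows[] = [
            'href' => $node->getAttribute('href'), // attribute value
            'text' => $node->nodeValue,            // text content of the node
        ];
    }
}
```

A foreach loop over $links works as well; the indexed form is used here only because it shows the length property and the item method directly.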
This article is from the "gangbusters" blog; please keep this source link when reposting: http://php2012web.blog.51cto.com/5585213/1619260
Scraping with PHP's DOMDocument