Why can't data be captured using curl or file_get_contents?
This post was last edited by xroha on 2014-12-15 at 09:49:56.
Take Baidu Jingyan (Baidu Experience) as an example, such as http://jingyan.baidu.com/article/00a07f38441c3782d028dc04.html.
Viewing the page source directly in the browser shows the article data.
However, neither curl nor file_get_contents can retrieve the article content.
Why? I have forged the IP address and the referer, but the content still cannot be captured. What does Baidu use to prevent data capture?
The following code is used:
function fcontents($url, $timeout = 5, $referer = "") {
    $ch = curl_init();
    $header = array(
        'User-Agent: Mozilla/5.0 (Windows NT 5.2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/6666',
        'X-FORWARDED-FOR: 154.125.25.15',
        'client-IP: 154.125.25.15'
    );
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_HTTPHEADER, $header); // Forge the client IP headers
    curl_setopt($ch, CURLOPT_REFERER, "http://www.baidu.com/"); // Forge the referer
    $result = curl_exec($ch);
    curl_close($ch);
    return $result;
}
$html = fcontents('http://jingyan.baidu.com/article/00a07f38441c3782d028dc04.html');
echo $html;
------ Solution ----------------------
curl only fetches the HTML that the server returns for this URL. Content that the page loads dynamically afterwards is not filled in by a plain fetch.
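One way to check this is to look at what the server actually sends back before any script runs in a browser. The sketch below is only a diagnostic aid and is not from the original answer: the helper names inspect_response() and body_contains() are hypothetical, and following redirects is an assumption about what is useful to see.

```php
<?php
// Hypothetical helper: fetch a URL and report what the server returned.
function inspect_response($url, $timeout = 5) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1); // assumption: follow redirects, if any
    $body = curl_exec($ch);
    $info = array(
        'error'        => curl_error($ch), // empty string when the transfer succeeded
        'status'       => curl_getinfo($ch, CURLINFO_HTTP_CODE),
        'final_url'    => curl_getinfo($ch, CURLINFO_EFFECTIVE_URL),
        'content_type' => curl_getinfo($ch, CURLINFO_CONTENT_TYPE),
        'length'       => ($body === false) ? 0 : strlen($body),
        'body'         => $body,
    );
    curl_close($ch);
    return $info;
}

// Hypothetical helper: true if the raw response already contains the given text.
function body_contains($body, $marker) {
    return is_string($body) && $marker !== '' && strpos($body, $marker) !== false;
}
```

Calling inspect_response() with the article URL and then body_contains($info['body'], ...) with a phrase that is visible in the browser shows whether that text is present in the raw HTML. If it is missing while the status is 200, the content is probably filled in later or the server answered this client differently.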
------ Solution ----------------------
Why are you not sending a cookie? Get the cookie first:
$url = "http://jingyan.baidu.com/article/00a07f38441c3782d028dc04.html";
$cookie_jar = dirname(__FILE__) . "/jy.cookie";
/* Get the cookie */
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie_jar);
curl_exec($ch);
curl_close($ch);
Then send the request again with the cookie:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie_jar);
curl_setopt($ch, CURLOPT_HEADER, 0);
$res = curl_exec($ch);
curl_close($ch);
echo $res;
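The two steps above can also share a single curl handle, so that a cookie set by the first response is sent back with the second request. This is a minimal sketch of that idea rather than the answerer's code: cookie_options() and fetch_with_cookie() are hypothetical names, and whether the site accepts the result is not something the sketch can guarantee.

```php
<?php
// Hypothetical helper: curl options that read and write the same cookie file.
function cookie_options($url, $cookie_jar) {
    return array(
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => 1,
        CURLOPT_HEADER         => 0,
        CURLOPT_COOKIEJAR      => $cookie_jar, // cookies are written here when the handle is closed
        CURLOPT_COOKIEFILE     => $cookie_jar, // cookies are read from here; also enables the cookie engine
    );
}

// Hypothetical helper: request the page twice on one handle.
function fetch_with_cookie($url, $cookie_jar) {
    $ch = curl_init();
    curl_setopt_array($ch, cookie_options($url, $cookie_jar));
    curl_exec($ch);        // first request: the server may set a cookie
    $res = curl_exec($ch); // second request: the handle sends that cookie back
    curl_close($ch);
    return $res;
}
```

With these helpers, echo fetch_with_cookie($url, $cookie_jar); would replace both snippets, using the same $url and $cookie_jar values as above.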