Abstract: This article introduces web content crawling with PHP. It uses the PHP cURL extension to fetch web pages, and also shows how to read the response headers, set cookies, and handle 302 redirects.
One, cURL installation
When building PHP from source, add the cURL option when running configure:
cd php
./configure --with-curl
After installation, you can use the php -m command to check whether the cURL extension is available.
php -m | grep -i curl
You can also use phpinfo() to check whether the cURL extension is available.
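The same check can be made from inside a script. The following is a minimal sketch that relies only on the standard extension_loaded and function_exists functions; the helper name hasCurl is hypothetical and not part of the original article.

```php
<?php
// Hypothetical helper: reports whether the cURL extension is loaded.
function hasCurl()
{
    return extension_loaded('curl') && function_exists('curl_init');
}

echo hasCurl() ? "cURL is available\n" : "cURL is missing\n";
```

A crawler script could call such a check once at startup and stop with a clear message instead of failing on the first curl_init call.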
Two, fetching web content
cURL supports many network protocols, such as HTTP, HTTPS and FTP. Ordinary web pages use HTTP, and pages with higher security requirements use HTTPS. HTTPS encrypts the transmitted data: keys are exchanged using public-key cryptography and the content is then encrypted, so a page served over HTTPS is encrypted along the entire link. For example, Baidu uses HTTPS, so the keywords you enter are encrypted in transit; even a network operator that can capture all of the traffic cannot read its content. HTTPS also has a drawback: encryption and decryption cost computation time, so HTTPS sites are somewhat slower, and many websites still use HTTP.
HTTP requests most commonly use two methods, GET and POST. POST is typically used for form submissions and for sending large amounts of data such as files. GET is used to retrieve web page data or to submit a small amount of data. This article mainly covers fetching web page data with GET; POST with cURL will be explained in detail in a later article.
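Submitting a small amount of data with GET means encoding it into the query string of the URL. The sketch below illustrates this with the standard http_build_query function; the helper name buildGetUrl, the example.com address and the parameter names are assumptions made for illustration.

```php
<?php
// Hypothetical helper: append query parameters to a base URL for a GET request.
function buildGetUrl($base, array $params)
{
    if (empty($params)) {
        return $base;
    }
    // Use "&" if the base URL already has a query string, otherwise start one with "?".
    $separator = (strpos($base, '?') === false) ? '?' : '&';
    // http_build_query URL-encodes names and values (a space becomes "+").
    return $base . $separator . http_build_query($params);
}

echo buildGetUrl('http://www.example.com/search', array('q' => 'php curl', 'page' => 2)), "\n";
// prints http://www.example.com/search?q=php+curl&page=2
```

The resulting string can be passed to CURLOPT_URL unchanged.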
Let's look at how a browser works. Open Chrome, press F12 to enter developer mode, and switch the toolbar to the Network tab; Chrome then shows the transfer information for each file.
To load a web page, the browser first downloads the HTML file and then downloads the JS, CSS, images and other resource files needed for rendering. A crawler usually only needs to fetch the HTML file, whose downloaded content can be viewed in the Chrome tool.
Three, PHP implementation
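<?php
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "www.qq.com");
// Return the page content from curl_exec() instead of printing it
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
$html = curl_exec($ch);
curl_close($ch);
var_dump($html);
?>
This is the basic setup for fetching a page and returning its content as a string.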
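A basic fetch gives no indication of why a request failed, because curl_exec simply returns false. The sketch below adds a timeout and reports the HTTP status code and the cURL error message. The helper name fetchHtml and the shape of the returned array are assumptions; the cURL calls used (curl_getinfo with CURLINFO_HTTP_CODE, curl_error) are standard.

```php
<?php
// Hypothetical helper: fetch a page and report failures instead of returning false silently.
function fetchHtml($url, $timeoutSeconds = 30)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);      // return the body as a string
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeoutSeconds);  // give up after this many seconds

    $body = curl_exec($ch);
    $result = array(
        'body'   => $body === false ? null : $body,
        'status' => (int) curl_getinfo($ch, CURLINFO_HTTP_CODE),
        'error'  => $body === false ? curl_error($ch) : null,
    );
    curl_close($ch);
    return $result;
}
```

Calling fetchHtml("www.qq.com") would return an array whose error entry is null on success; on failure the body entry is null and the error entry holds the cURL message.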
Four, getting the HTTP header and setting cookies
Some websites use cookies. When a crawler's requests carry no cookies, the site can easily identify it as a "bot" and refuse service. Debugging www.sogou.com in Chrome shows that the cookies are carried in the header information of the page. We therefore need two steps: (1) get the cookies from the HTTP response header; (2) add the cookies when sending a request.
The response header contains Set-Cookie fields.
Refresh the page and view the header information again: the request now includes the cookie information.
Get Cookies
<?php
$url = "www.sogou.com";
$setcookie = array();
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_NOBODY, true); // request the headers only
curl_setopt($ch, CURLOPT_HEADERFUNCTION, function ($ch, $str) use (&$setcookie) {
    // The first parameter is the cURL resource; the second is a single header line
    if (strpos($str, ':') !== false) { // skip the status line and blank lines
        list($name, $value) = array_map('trim', explode(':', $str, 2));
        $name = strtolower($name);
        if ('set-cookie' == $name) {
            $setcookie[] = $value;
        }
    }
    return strlen($str);
});
curl_exec($ch);
curl_close($ch);
$cookie = array();
foreach ($setcookie as $c) {
    $tmp = explode(";", $c);
    $cookie[] = $tmp[0]; // keep only name=value and drop attributes such as path
}
$cookiestr = "Cookie: " . implode("; ", $cookie);
echo $cookiestr;
?>
Returned result:
Cookie: abtest=0|1433425917|v17; iploc=cn1100; Suid=3295cb6f1220920a00000000557057fd
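The header-parsing step can be tried without any network access by feeding it sample header lines. The sketch below is a standalone illustration: the helper name buildCookieHeader and the sample headers are made up, and only the parsing logic mirrors the step described above.

```php
<?php
// Hypothetical helper: turn raw response header lines into a "Cookie: ..." request header.
function buildCookieHeader(array $headerLines)
{
    $pairs = array();
    foreach ($headerLines as $line) {
        // Skip the status line and blank lines, which contain no "name: value" pair.
        if (strpos($line, ':') === false) {
            continue;
        }
        list($name, $value) = array_map('trim', explode(':', $line, 2));
        if (strtolower($name) === 'set-cookie') {
            // Keep only "name=value"; drop attributes such as path and expires.
            $parts = explode(';', $value);
            $pairs[] = trim($parts[0]);
        }
    }
    return 'Cookie: ' . implode('; ', $pairs);
}

// Made-up sample headers, in the form cURL passes them to a header callback.
$sample = array(
    "HTTP/1.1 200 OK\r\n",
    "Content-Type: text/html\r\n",
    "Set-Cookie: a=1; path=/\r\n",
    "Set-Cookie: b=2; expires=Thu, 01 Jan 2026 00:00:00 GMT\r\n",
    "\r\n",
);
echo buildCookieHeader($sample), "\n"; // prints Cookie: a=1; b=2
```

Splitting on the first colon only (the limit of 2 passed to explode) matters here, because cookie attributes such as expires contain colons of their own.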
Set cookies
<?php
// $cookiestr is the "Cookie: ..." string built in the previous step
$url = "www.sogou.com";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
$headers = array($cookiestr);
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
$html = curl_exec($ch);
curl_close($ch);
var_dump($html);
?>
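As an alternative to copying the cookies by hand, cURL can store and resend them itself through a cookie jar file. This is a different technique from the two-step approach above, sketched here with the standard CURLOPT_COOKIEFILE and CURLOPT_COOKIEJAR options; the helper name fetchWithCookieJar and the jar path argument are assumptions.

```php
<?php
// Hypothetical helper: fetch a URL while persisting cookies in a jar file.
function fetchWithCookieJar($url, $jarPath)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_COOKIEFILE, $jarPath); // send cookies saved by earlier requests
    curl_setopt($ch, CURLOPT_COOKIEJAR, $jarPath);  // save received cookies when the handle is closed or freed
    $html = curl_exec($ch);
    curl_close($ch);
    return $html;
}
```

Calling the function twice with the same jar path (for example one created with tempnam(sys_get_temp_dir(), "cookie")) would make the second request carry the cookies set by the first.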
Five, handling 302 redirects
When you search for a keyword on Baidu, the links in the results are Baidu's encrypted links, which reach the real URL only after two redirects. (Baidu encrypts the results to prevent 360 from crawling them.)
We can read the Location field in the response header to find the real address:
<?php
$url = "https://www.baidu.com/link?url=B34apzbjz-cgloxsg4-nvihmtvs0tcvefts6apcasojt1a0h9offpprwk4jpnyggaqe29qputrdpueu3liz2m7gw7dqlmi5ytlhlova3v_Vy23dooriusyv9zr_ci8rg&wd=&eqid=c89cf372000002cc0000000255705961&ie=utf-8";
$location = null;
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_NOBODY, true); // request the headers only
curl_setopt($ch, CURLOPT_HEADERFUNCTION, function ($ch, $str) use (&$location) {
    // The first parameter is the cURL resource; the second is a single header line
    if (strpos($str, ':') !== false) { // skip the status line and blank lines
        list($name, $value) = array_map('trim', explode(':', $str, 2));
        $name = strtolower($name);
        if ('location' == $name) {
            $location = $value;
            return 0; // returning 0 aborts the transfer once the address is found
        }
    }
    return strlen($str);
});
curl_exec($ch);
curl_close($ch);
echo $location;
?>
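Because the real address may sit behind more than one redirect, the Location lookup can be repeated until no further redirect is returned. The sketch below does this with curl_getinfo and CURLINFO_REDIRECT_URL instead of a header callback; the helper name resolveRedirects and the limit of five hops are assumptions.

```php
<?php
// Hypothetical helper: follow Location headers by hand, up to $maxHops redirects.
function resolveRedirects($url, $maxHops = 5)
{
    for ($hop = 0; $hop < $maxHops; $hop++) {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_NOBODY, true);          // request the headers only
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 30);
        curl_exec($ch);
        // With CURLOPT_FOLLOWLOCATION off, this is the URL cURL would request next.
        $next = curl_getinfo($ch, CURLINFO_REDIRECT_URL);
        curl_close($ch);
        if (!$next) {
            break; // no further redirect (or the request failed): keep the current URL
        }
        $url = $next;
    }
    return $url;
}
```

The hop limit keeps the loop from running forever if two addresses redirect to each other. Note that CURLOPT_NOBODY sends a HEAD request, which some servers answer differently from GET.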
There is another way to handle 302 redirects: allow cURL to follow the redirect to the new address, and use output buffering (the ob_* functions) to capture what cURL prints. The code is as follows:
<?php
function getContents($url)
{
    $header = array("Referer: http://www.baidu.com/");
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);
    curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1); // follow redirects and fetch the final page
    ob_start(); // capture the output that curl_exec() prints
    curl_exec($ch);
    $contents = ob_get_contents();
    ob_end_clean();
    curl_close($ch);
    return $contents;
}
$url = "https://www.baidu.com/link?url=B34apzbjz-cgloxsg4-nvihmtvs0tcvefts6apcasojt1a0h9offpprwk4jpnyggaqe29qputrdpueu3liz2m7gw7dqlmi5ytlhlova3v_Vy23dooriusyv9zr_ci8rg&wd=&eqid=c89cf372000002cc0000000255705961&ie=utf-8";
$contents = getContents($url);
echo $contents;
?>
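The same result can be obtained without output buffering by asking cURL to return the body directly. The following sketch is a variant, not the approach shown above: it combines CURLOPT_RETURNTRANSFER with CURLOPT_FOLLOWLOCATION, caps the number of redirects, and reports the final address through CURLINFO_EFFECTIVE_URL. The helper name getFinalPage and the default limit are assumptions.

```php
<?php
// Hypothetical helper: follow redirects and return both the final URL and the body.
function getFinalPage($url, $maxRedirects = 5)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);      // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);      // follow Location headers
    curl_setopt($ch, CURLOPT_MAXREDIRS, $maxRedirects);  // guard against redirect loops
    $body = curl_exec($ch);
    $finalUrl = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL); // address of the last request made
    curl_close($ch);
    return array('url' => $finalUrl, 'body' => $body);
}
```

The body entry is false if the transfer failed, so a caller should check it before using the content.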
PHP Crawler Technology (I)