PHP Crawler Technology (I)

Source: Internet
Author: User
Tags: curl, explode, setcookie

Abstract: This article introduces techniques for crawling web content with PHP. It uses the PHP cURL extension to fetch web pages, read response headers, set cookies, and handle 302 redirects.

One, Installing cURL

When building PHP from source, add the cURL option when running configure:

cd php

./configure --with-curl

After installation, use the php -m command to check whether the cURL extension is available:

php -m | grep -i curl

You can also call phpinfo() to check whether the cURL extension is available.
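The same check can be made from inside a script. The following is a minimal sketch that relies on PHP's built-in extension_loaded() and curl_version(); the helper name curlSupportSummary() is made up for this example.

```php
<?php
// Hypothetical helper: reports whether the cURL extension is loaded and,
// if so, which libcurl version and protocols it offers.
function curlSupportSummary(): array
{
    if (!extension_loaded('curl')) {
        return array('loaded' => false, 'version' => null, 'protocols' => array());
    }
    $info = curl_version(); // built-in: returns version details as an array
    return array(
        'loaded'    => true,
        'version'   => $info['version'],
        'protocols' => $info['protocols'],
    );
}

print_r(curlSupportSummary());
```

If 'loaded' is false, PHP was built without --with-curl or the extension is not enabled in php.ini.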

Two, Fetching web content

cURL supports many network protocols, such as HTTP, HTTPS and FTP. Ordinary web pages use HTTP, while pages with higher security requirements use HTTPS. HTTPS exchanges keys using public-key cryptography and encrypts the transmitted content, so a page served over HTTPS is encrypted along the entire link. Baidu, for example, uses HTTPS: the keywords you enter are encrypted in transit, so even a network operator that captures all the traffic cannot read its contents. HTTPS also has a drawback: encryption and decryption cost computation time, so HTTPS sites tend to be somewhat slower, and at the time of writing most websites still used HTTP.

The two HTTP methods used most often are GET and POST. POST is typically used for form submissions and for sending large amounts of data, such as files. GET is used to retrieve page data or to submit a small amount of data. This article mainly covers fetching page data with the GET method; cURL POST techniques will be explained in detail in a later article.
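To make the GET case concrete: a small amount of data is submitted with GET by appending it to the URL as a query string. The sketch below uses PHP's built-in http_build_query(); the helper name buildGetUrl(), the example.com URL and the parameter names are assumptions for illustration only.

```php
<?php
// Hypothetical helper: append query parameters to a base URL for a GET request.
function buildGetUrl(string $baseUrl, array $params): string
{
    if (empty($params)) {
        return $baseUrl;
    }
    // Use "&" if the URL already has a query string, "?" otherwise.
    $separator = (strpos($baseUrl, '?') === false) ? '?' : '&';
    // Pass "&" explicitly so the result does not depend on php.ini settings.
    return $baseUrl . $separator . http_build_query($params, '', '&');
}

echo buildGetUrl('http://example.com/search', array('q' => 'php curl', 'page' => 2)), "\n";
// prints "http://example.com/search?q=php+curl&page=2"
```

The resulting string can be passed to CURLOPT_URL like any other address.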

Let's look at how a browser works. Open Chrome, press F12 to enter developer mode, and switch to the Network tab. There you can view the transfer information for each file.

To load a web page, the browser first downloads the HTML file and then downloads the JS, CSS, images and other resource files needed to render it. Data crawling usually only needs the HTML file, that is, the content of the downloaded HTML document as shown in the Chrome tool.

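Three, PHP implementation

<?php
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "www.qq.com");
// Return the response as a string instead of printing it directly
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($ch);
curl_close($ch);
var_dump($html);
?>

These are the basic settings: CURLOPT_RETURNTRANSFER makes curl_exec() return the page content as a string instead of printing it.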
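In practice a request can fail (DNS error, timeout, refused connection), in which case curl_exec() returns false. The following is a minimal sketch of the same GET request with timeouts and error reporting; the helper name fetchHtml() and the timeout values are assumptions, not part of the original example.

```php
<?php
// Hypothetical helper: fetch a page with GET; returns null when the request fails.
function fetchHtml(string $url, int $timeoutSeconds = 10): ?string
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);   // return the body as a string
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);      // seconds allowed for connecting
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeoutSeconds); // seconds allowed for the whole request
    $html = curl_exec($ch);
    if ($html === false) {
        // curl_error() describes what went wrong
        error_log('cURL error: ' . curl_error($ch));
        return null;
    }
    // The handle is released when $ch goes out of scope.
    return $html;
}

$html = fetchHtml('www.qq.com');
echo $html === null ? "request failed\n" : strlen($html) . " bytes received\n";
```

Returning null rather than false keeps the failure case distinct from an empty but successful response.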

Four, Reading HTTP headers and setting cookies

Some websites use cookies. When a crawler's requests carry no cookies, the site can easily identify it as a bot and refuse to serve it. Debugging www.sogou.com in Chrome shows that the cookies are delivered in the response headers of the page. We therefore need two steps: (1) read the cookies from the HTTP response headers, and (2) send those cookies along with subsequent requests.

The response headers contain Set-Cookie fields.

After refreshing the page, the request headers include the cookie information.

Get Cookies

<?php
$url = "www.sogou.com";
$setcookie = array();

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_NOBODY, true); // request the headers only, not the body
curl_setopt($ch, CURLOPT_HEADERFUNCTION, function ($ch, $str) use (&$setcookie) {
    // The first parameter is the cURL handle; the second is one raw header line.
    // array_pad() covers lines without a colon, such as the status line.
    list($name, $value) = array_pad(array_map('trim', explode(':', $str, 2)), 2, '');
    $name = strtolower($name);
    if ('set-cookie' == $name) {
        $setcookie[] = $value;
    }
    return strlen($str); // cURL expects the number of bytes handled
});
curl_exec($ch);
curl_close($ch);

// Keep only the leading "name=value" part of each Set-Cookie value
$cookie = array();
foreach ($setcookie as $c) {
    $tmp = explode(";", $c);
    $cookie[] = $tmp[0];
}
$cookiestr = "Cookie: " . implode("; ", $cookie);
echo $cookiestr;
?>

Result:

Cookie: abtest=0|1433425917|v17; iploc=cn1100; Suid=3295cb6f1220920a00000000557057fd
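The conversion from Set-Cookie values to a Cookie request header can be isolated in a small pure function, which makes it easy to check without network access. This is a sketch only: the helper name buildCookieHeader() is hypothetical and the sample values are made up, not real Sogou cookies.

```php
<?php
// Hypothetical helper: turn raw Set-Cookie header values into one Cookie header line.
function buildCookieHeader(array $setCookieValues): string
{
    $pairs = array();
    foreach ($setCookieValues as $value) {
        // Only the first "name=value" segment is sent back;
        // attributes such as path, expires or HttpOnly are dropped.
        $parts = explode(';', $value);
        $pair = trim($parts[0]);
        if ($pair !== '') {
            $pairs[] = $pair;
        }
    }
    return 'Cookie: ' . implode('; ', $pairs);
}

// Made-up sample values for illustration.
echo buildCookieHeader(array(
    'sid=abc123; path=/; HttpOnly',
    'lang=en; expires=Thu, 01 Jan 2026 00:00:00 GMT',
)), "\n";
// prints "Cookie: sid=abc123; lang=en"
```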

Set cookies

<?php
// $cookiestr is the "Cookie: ..." string built in the previous step
$url = "www.sogou.com";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$headers = array($cookiestr);
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers); // send the cookies with the request
$html = curl_exec($ch);
curl_close($ch);
var_dump($html);
?>
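As an alternative to building the Cookie header by hand, cURL's own cookie engine can remember cookies between requests made on the same handle. The following is a sketch under that assumption: per the PHP manual, passing an empty string to CURLOPT_COOKIEFILE enables cookie handling without loading a file. The helper name fetchWithCookieEngine() is made up for this example.

```php
<?php
// Hypothetical helper: request the URL twice on one handle so that cookies
// set by the first response are sent back with the second request.
function fetchWithCookieEngine(string $url): ?string
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    curl_setopt($ch, CURLOPT_COOKIEFILE, ''); // empty string: in-memory cookie engine
    if (curl_exec($ch) === false) {           // first request: cookies are stored
        return null;
    }
    $html = curl_exec($ch);                   // second request: cookies are sent
    return $html === false ? null : $html;
}

$html = fetchWithCookieEngine('www.sogou.com');
echo $html === null ? "request failed\n" : strlen($html) . " bytes received\n";
```

This trades the explicit header handling shown above for less code; the manual approach remains useful when the cookies need to be inspected or stored elsewhere.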

Five, Handling 302 redirects

When you search for a keyword on Baidu, each link in the results is an encrypted Baidu link, and the real URL is reached only after two redirects. (Baidu encrypts the result links to keep 360 from crawling them.)

We can read the Location field in the response headers to find the real address:

<?php
$url = "https://www.baidu.com/link?url=B34apzbjz-cgloxsg4-nvihmtvs0tcvefts6apcasojt1a0h9offpprwk4jpnyggaqe29qputrdpueu3liz2m7gw7dqlmi5ytlhlova3v_Vy23dooriusyv9zr_ci8rg&wd=&eqid=c89cf372000002cc0000000255705961&ie=utf-8";
$location = null;

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_NOBODY, true); // request the headers only, not the body
curl_setopt($ch, CURLOPT_HEADERFUNCTION, function ($ch, $str) use (&$location) {
    // The first parameter is the cURL handle; the second is one raw header line.
    // array_pad() covers lines without a colon, such as the status line.
    list($name, $value) = array_pad(array_map('trim', explode(':', $str, 2)), 2, '');
    $name = strtolower($name);
    if ('location' == $name) {
        $location = $value;
        return 0; // a length other than strlen($str) aborts the transfer
    }
    return strlen($str);
});
curl_exec($ch);
curl_close($ch);
echo $location;
?>
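Because the real address may sit behind more than one redirect, the Location lookup can be repeated until no further redirect is reported. The sketch below uses curl_getinfo() with CURLINFO_REDIRECT_URL instead of a header callback; the helper name resolveRedirects(), the hop limit and the example URL are assumptions, and some servers may answer header-only (HEAD) requests differently from normal GET requests.

```php
<?php
// Hypothetical helper: follow redirects by hand, one hop at a time,
// and return the last URL reached.
function resolveRedirects(string $url, int $maxHops = 5): string
{
    for ($i = 0; $i < $maxHops; $i++) {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_NOBODY, true);          // headers only
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // do not print anything
        curl_setopt($ch, CURLOPT_TIMEOUT, 10);
        curl_exec($ch);
        // With CURLOPT_FOLLOWLOCATION off, this is the URL the server redirects to.
        $next = curl_getinfo($ch, CURLINFO_REDIRECT_URL);
        if (!$next) {
            break; // no redirect (or the request failed): stop here
        }
        $url = $next;
    }
    return $url;
}

echo resolveRedirects('http://www.baidu.com/'), "\n";
```

The hop limit guards against redirect loops.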

There is another way to handle 302 redirects: allow cURL to follow the redirect to the new address, and capture its output with PHP's output buffering (the ob_* functions). The code is as follows:

<?php
function getContents($url)
{
    $header = array("Referer: http://www.baidu.com/");
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);
    curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1); // whether to follow redirects and fetch the final page
    // Without CURLOPT_RETURNTRANSFER, curl_exec() prints the response,
    // so output buffering is used to capture it.
    ob_start();
    curl_exec($ch);
    $contents = ob_get_contents();
    ob_end_clean();
    curl_close($ch);
    return $contents;
}

$url = "https://www.baidu.com/link?url=B34apzbjz-cgloxsg4-nvihmtvs0tcvefts6apcasojt1a0h9offpprwk4jpnyggaqe29qputrdpueu3liz2m7gw7dqlmi5ytlhlova3v_Vy23dooriusyv9zr_ci8rg&wd=&eqid=c89cf372000002cc0000000255705961&ie=utf-8";
$contents = getContents($url);
echo $contents;
?>
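Output buffering is not the only way to capture the page after a redirect. A variant sketch, assuming CURLOPT_RETURNTRANSFER is acceptable in place of the ob_* functions, also limits the number of redirects and reports the final address. The helper name fetchFollowingRedirects(), the redirect limit and the example URL are assumptions for illustration.

```php
<?php
// Hypothetical helper: let cURL follow redirects and return the body,
// the final URL and the HTTP status code.
function fetchFollowingRedirects(string $url, int $maxRedirects = 5): array
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);      // follow Location headers
    curl_setopt($ch, CURLOPT_MAXREDIRS, $maxRedirects);  // stop after this many hops
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);      // return the body as a string
    $body = curl_exec($ch);
    return array(
        'body'      => $body === false ? null : $body,
        'final_url' => curl_getinfo($ch, CURLINFO_EFFECTIVE_URL),
        'status'    => curl_getinfo($ch, CURLINFO_HTTP_CODE),
    );
}

$result = fetchFollowingRedirects('http://www.baidu.com/');
echo $result['status'], ' ', $result['final_url'], "\n";
```

CURLINFO_EFFECTIVE_URL gives the address cURL ended up at, which is the "real" URL when the starting point was a redirecting link.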
