Learning PHP cURL through crawler examples

Source: Internet
Author: User
Tags: learn php
We often need to fetch resources from a website, and that is a job for a crawler. At its core, a crawler simulates HTTP requests with cURL and then parses the returned data. This article walks through a few simple crawlers to introduce PHP's cURL functions.

First, some commonly used functions:

curl_init(): initialize a cURL session
curl_setopt(): set a transfer option on the session
curl_exec(): perform the request
curl_close(): close the cURL session

These four do most of the work. Two more help with error handling:

curl_errno(): return the last error code (PHP defines constants for many of these codes)
curl_error(): return a string describing the most recent error in the current session
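As a minimal sketch of how these functions fit together, the helper below wraps one request and reports errors instead of printing them. The name fetch_url() and its return shape are assumptions made for illustration; they are not part of the cURL extension.

```php
<?php
/**
 * Fetch a URL with cURL and report errors instead of printing them.
 * fetch_url() is a hypothetical helper, not a built-in function.
 *
 * @return array [body or null, error code, error message]
 */
function fetch_url(string $url, int $timeout = 10): array
{
    $ch = curl_init();                               // start a session
    curl_setopt($ch, CURLOPT_URL, $url);             // target URL
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return the body instead of printing it
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);     // give up after $timeout seconds

    $body  = curl_exec($ch);                         // false on failure
    $errno = curl_errno($ch);                        // 0 means no error
    $error = curl_error($ch);                        // empty string when there is no error
    curl_close($ch);                                 // end the session

    return [$body === false ? null : $body, $errno, $error];
}
```

A call such as `[$body, $errno, $error] = fetch_url('http://www.baidu.com');` would then leave the page in $body when $errno is 0, and a readable message in $error otherwise.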


Let's go straight to the examples; the explanations are in the code comments.


1. Download a web page and output it with "Baidu" replaced by "Diaosi"

<?php
/**
 * Example: download a web page and output it with "Baidu" replaced by "Diaosi".
 */
$curlobj = curl_init();                                      // initialize
curl_setopt($curlobj, CURLOPT_URL, "http://www.baidu.com");  // set the URL of the page to fetch
curl_setopt($curlobj, CURLOPT_RETURNTRANSFER, true);         // return the result instead of printing it
$output = curl_exec($curlobj);                               // execute
curl_close($curlobj);                                        // close cURL
echo str_replace("Baidu", "Diaosi", $output);
?>


2. Query the current weather in Beijing by calling a web service

<?php
/**
 * Example: query the current weather in Beijing by calling a web service.
 */
$data = 'theCityName=Beijing';
$curlobj = curl_init();
curl_setopt($curlobj, CURLOPT_URL, "http://www.webxml.com.cn/WebServices/WeatherWebService.asmx/getWeatherbyCityName");
curl_setopt($curlobj, CURLOPT_HEADER, 0);
curl_setopt($curlobj, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curlobj, CURLOPT_POST, 1);
curl_setopt($curlobj, CURLOPT_POSTFIELDS, $data);
curl_setopt($curlobj, CURLOPT_HTTPHEADER, array(
    "Content-Type: application/x-www-form-urlencoded; charset=utf-8",
    "Content-Length: " . strlen($data),
));
$rtn = curl_exec($curlobj);
if (!curl_errno($curlobj)) {
    // $info = curl_getinfo($curlobj);
    // print_r($info);
    echo $rtn;
} else {
    echo 'Curl error: ' . curl_error($curlobj);
}
curl_close($curlobj);
?>
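The POST request above needs a URL-encoded body and headers that match it. The sketch below builds both from an array with http_build_query(). The helper name build_form_request() is hypothetical, and the field name and non-ASCII value are assumptions chosen to show the percent-encoding.

```php
<?php
/**
 * Build the body and headers for an application/x-www-form-urlencoded POST.
 * build_form_request() is a hypothetical helper used only for illustration.
 *
 * @param array $fields form fields as name => value
 * @return array ['body' => string, 'headers' => string[]]
 */
function build_form_request(array $fields): array
{
    // http_build_query() percent-encodes names and values and joins them with "&".
    $body = http_build_query($fields);

    return [
        'body'    => $body,
        'headers' => [
            'Content-Type: application/x-www-form-urlencoded; charset=utf-8',
            'Content-Length: ' . strlen($body), // length in bytes, not characters
        ],
    ];
}

$request = build_form_request(['theCityName' => '北京']);
echo $request['body'], "\n";
print_r($request['headers']);
```

Here the body comes out as theCityName=%E5%8C%97%E4%BA%AC, which is 30 bytes long. The two parts can be passed to CURLOPT_POSTFIELDS and CURLOPT_HTTPHEADER respectively.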


3. Log in to a site and fetch a page that requires login

<?php
/**
 * Example: log in to a site, then fetch a page that requires login.
 */
// $data = array('username' => 'promonkey', 'password' => '1q2w3e', 'remember' => 1);
$data = 'username=zjzhoufy@126.com&password=1q2w3e&remember=1';
$curlobj = curl_init();                                                 // initialize
curl_setopt($curlobj, CURLOPT_URL, "http://www.imooc.com/user/login");  // set the URL of the page to fetch
curl_setopt($curlobj, CURLOPT_RETURNTRANSFER, true);                    // return the result instead of printing it

// Cookie-related settings; these must be in place before any session starts
date_default_timezone_set('PRC');                    // when using cookies, set the time zone first
curl_setopt($curlobj, CURLOPT_COOKIESESSION, true);
curl_setopt($curlobj, CURLOPT_COOKIEFILE, '');       // an empty string enables the in-memory cookie engine, so the login cookie is sent with the next request
curl_setopt($curlobj, CURLOPT_HEADER, 0);
curl_setopt($curlobj, CURLOPT_FOLLOWLOCATION, 1);    // let cURL follow redirects

curl_setopt($curlobj, CURLOPT_POST, 1);
curl_setopt($curlobj, CURLOPT_POSTFIELDS, $data);
curl_setopt($curlobj, CURLOPT_HTTPHEADER, array(
    "Content-Type: application/x-www-form-urlencoded; charset=utf-8",
    "Content-Length: " . strlen($data),
));
curl_exec($curlobj);                                 // execute the login request

curl_setopt($curlobj, CURLOPT_URL, "http://www.imooc.com/space/index");
curl_setopt($curlobj, CURLOPT_POST, 0);
curl_setopt($curlobj, CURLOPT_HTTPHEADER, array("Content-type: text/xml"));
$output = curl_exec($curlobj);                       // fetch the page behind the login
curl_close($curlobj);                                // close cURL
echo $output;
?>
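The login flow only works if the handle sends the cookie it received from the login request along with the next request. One possible way to arrange that, and to keep the cookies across script runs, is sketched below. The function session_curl_options() and the cookie file name are hypothetical; the CURLOPT_* constants are standard cURL options.

```php
<?php
/**
 * Options that let one cURL handle keep cookies between requests and runs.
 * session_curl_options() and the cookie file path are hypothetical.
 *
 * @return array options suitable for curl_setopt_array()
 */
function session_curl_options(string $cookieFile): array
{
    return [
        CURLOPT_RETURNTRANSFER => true,        // return the response instead of printing it
        CURLOPT_COOKIEFILE     => $cookieFile, // read cookies from here; also turns on the cookie engine
        CURLOPT_COOKIEJAR      => $cookieFile, // write cookies here when the handle is released
        CURLOPT_FOLLOWLOCATION => true,        // follow the redirect that often follows a login
        CURLOPT_MAXREDIRS      => 5,           // but not indefinitely
    ];
}

$ch = curl_init();
curl_setopt_array($ch, session_curl_options(sys_get_temp_dir() . '/crawler_cookies.txt'));
// Next: set CURLOPT_URL and the POST fields, call curl_exec($ch) for the login,
// then change CURLOPT_URL and call curl_exec($ch) again for the protected page.
curl_close($ch);
```

Keeping the options in one array makes it easy to reuse the same settings for every handle that should share the session.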


4. Log in, download the personal space page, and follow redirects with a custom function

<?php
/**
 * Example: log in to a site and download the personal space page,
 * following redirects with a custom function.
 */
$data = 'username=demo_peter@126.com&password=123qwe&remember=1';
$curlobj = curl_init();                                                 // initialize
curl_setopt($curlobj, CURLOPT_URL, "http://www.imooc.com/user/login");  // set the URL of the page to fetch
curl_setopt($curlobj, CURLOPT_RETURNTRANSFER, true);                    // return the result instead of printing it

// Cookie-related settings; these must be in place before any session starts
date_default_timezone_set('PRC');                    // when using cookies, set the time zone first
curl_setopt($curlobj, CURLOPT_COOKIESESSION, true);
curl_setopt($curlobj, CURLOPT_COOKIEFILE, '');       // an empty string enables the in-memory cookie engine
curl_setopt($curlobj, CURLOPT_HEADER, 0);
// The next line is commented out: on older PHP versions this option requires
// safe mode and open_basedir to be turned off, which weakens server security.
// curl_setopt($curlobj, CURLOPT_FOLLOWLOCATION, 1);

curl_setopt($curlobj, CURLOPT_POST, 1);
curl_setopt($curlobj, CURLOPT_POSTFIELDS, $data);
curl_setopt($curlobj, CURLOPT_HTTPHEADER, array(
    "Content-Type: application/x-www-form-urlencoded; charset=utf-8",
    "Content-Length: " . strlen($data),
));
curl_exec($curlobj);                                 // execute the login request

curl_setopt($curlobj, CURLOPT_URL, "http://www.imooc.com/space/index");
curl_setopt($curlobj, CURLOPT_POST, 0);
curl_setopt($curlobj, CURLOPT_HTTPHEADER, array("Content-type: text/xml"));
$output = curl_redir_exec($curlobj);                 // execute, following redirects manually
curl_close($curlobj);                                // close cURL
echo $output;

/**
 * Custom implementation of redirect following.
 */
function curl_redir_exec($ch, $debug = "")
{
    static $curl_loops = 0;
    static $curl_max_loops = 20;

    if ($curl_loops++ >= $curl_max_loops) {
        $curl_loops = 0;
        return false;
    }
    curl_setopt($ch, CURLOPT_HEADER, true);          // include headers so the redirect target can be read
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $data = curl_exec($ch);

    // Split the response into headers and body
    $h_len = curl_getinfo($ch, CURLINFO_HEADER_SIZE);
    $header = substr($data, 0, $h_len);
    $data = substr($data, $h_len);                   // the body starts right after the headers

    $http_code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    if ($http_code == 301 || $http_code == 302) {
        $matches = array();
        preg_match('/Location:(.*?)\n/i', $header, $matches);
        $url = @parse_url(trim(array_pop($matches)));
        // print_r($url);                            // uncomment to inspect the parsed redirect target
        if (!$url) {
            // Couldn't process the URL to redirect to
            $curl_loops = 0;
            return $data;
        }
        // Fill in the parts missing from the Location header using the last URL
        $last_url = parse_url(curl_getinfo($ch, CURLINFO_EFFECTIVE_URL));
        if (!isset($url['scheme'])) $url['scheme'] = $last_url['scheme'];
        if (!isset($url['host']))   $url['host']   = $last_url['host'];
        if (!isset($url['path']))   $url['path']   = $last_url['path'];
        $new_url = $url['scheme'] . '://' . $url['host'] . $url['path']
                 . (isset($url['query']) ? '?' . $url['query'] : '');
        curl_setopt($ch, CURLOPT_URL, $new_url);
        return curl_redir_exec($ch);
    } else {
        $curl_loops = 0;
        return $data;
    }
}
?>
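The two steps at the heart of the custom redirect function, splitting the raw response and resolving the Location header against the previous URL, can be isolated as pure functions and checked without a network connection. This is a sketch under stated assumptions: the names split_response() and redirect_target() are hypothetical, the previous URL is assumed to be absolute, and ports and relative paths such as "next.html" are deliberately not handled.

```php
<?php
/**
 * Split a raw response (fetched with CURLOPT_HEADER enabled) into headers and body.
 * $headerSize is the value of curl_getinfo($ch, CURLINFO_HEADER_SIZE).
 */
function split_response(string $raw, int $headerSize): array
{
    return [substr($raw, 0, $headerSize), substr($raw, $headerSize)];
}

/**
 * Resolve the Location header of a redirect against the URL that produced it.
 * Returns null when there is no usable Location header.
 */
function redirect_target(string $headers, string $lastUrl): ?string
{
    if (!preg_match('/^Location:\s*(.*?)\s*$/mi', $headers, $m)) {
        return null;
    }
    $next = parse_url($m[1]);
    $last = parse_url($lastUrl);
    if ($next === false || $last === false) {
        return null;
    }
    // Fill in whatever the Location header left out from the previous URL.
    $scheme = $next['scheme'] ?? $last['scheme'];
    $host   = $next['host']   ?? $last['host'];
    $path   = $next['path']   ?? ($last['path'] ?? '/');
    $query  = isset($next['query']) ? '?' . $next['query'] : '';

    return $scheme . '://' . $host . $path . $query;
}

$headers = "HTTP/1.1 302 Found\r\nLocation: /space/index\r\n\r\n";
[$head, $body] = split_response($headers . '<html></html>', strlen($headers));
echo redirect_target($head, 'http://www.imooc.com/user/login'), "\n";
```

The header size reported by cURL already includes the blank line that ends the headers, so the body begins exactly at that offset.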


5. Upload a local file to an FTP server

<?php
/**
 * Example: upload a local file to an FTP server.
 */
$curlobj = curl_init();
$localfile = 'ftp01.php';
$fp = fopen($localfile, 'r');
curl_setopt($curlobj, CURLOPT_URL, "ftp://192.168.1.100/ftp01_uploaded.php");
curl_setopt($curlobj, CURLOPT_HEADER, 0);
curl_setopt($curlobj, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curlobj, CURLOPT_TIMEOUT, 300);                  // time out after 300 s
curl_setopt($curlobj, CURLOPT_USERPWD, "peter.zhou:123456");  // FTP user name:password
// Upload and download differ mainly in the following three options
curl_setopt($curlobj, CURLOPT_UPLOAD, 1);
curl_setopt($curlobj, CURLOPT_INFILE, $fp);
curl_setopt($curlobj, CURLOPT_INFILESIZE, filesize($localfile));
$rtn = curl_exec($curlobj);
fclose($fp);
if (!curl_errno($curlobj)) {
    echo "Uploaded successfully.";
} else {
    echo 'Curl error: ' . curl_error($curlobj);
}
curl_close($curlobj);
?>


6. Download an HTTPS resource

<?php
/**
 * Example: download an HTTPS resource.
 */
$curlobj = curl_init();                                      // initialize
curl_setopt($curlobj, CURLOPT_URL, "https://ajax.aspnetcdn.com/ajax/jquery.validate/1.12.0/jquery.validate.js"); // set the URL to fetch
curl_setopt($curlobj, CURLOPT_RETURNTRANSFER, true);         // return the result instead of printing it

// HTTPS settings
date_default_timezone_set('PRC');                            // when using cookies, set the time zone first
curl_setopt($curlobj, CURLOPT_SSL_VERIFYPEER, 0);            // skip verification of the certificate's issuer (insecure; for testing only)
curl_setopt($curlobj, CURLOPT_SSL_VERIFYHOST, 2);            // check that the certificate matches the host name
$output = curl_exec($curlobj);                               // execute
curl_close($curlobj);                                        // close cURL
echo $output;
?>


Simulating HTTP requests with native PHP

Using cURL just to simulate a simple HTTP request can be overkill; PHP can do this on its own.

Suppose a PHP program on the server needs to send a POST or GET request, for example to post an array to another address. cURL makes this easy, but what if the cURL library is not available? PHP has a built-in function for this: stream_context_create().

The code shows it best:

$data = array(
    'foo'  => 'bar',
    'baz'  => 'boom',
    'site' => 'www.nowamagic.net',
    'name' => 'Nowa Magic',
);
$postdata = http_build_query($data);   // encode the array once, as a URL-encoded string
$options = array(
    'http' => array(
        'method'  => 'POST',
        'header'  => 'Content-type: application/x-www-form-urlencoded',
        'content' => $postdata,
        // 'timeout' => 60 * 60,       // timeout in seconds
    ),
);
$url = "http://www.nowamagic.net/test2.php";
$context = stream_context_create($options);
$result = file_get_contents($url, false, $context);
echo $result;

The code for http://www.nowamagic.net/test2.php is:

$data = $_POST;
echo '<pre>';
print_r($data);
echo '</pre>';

The result is:

Array
(
    [foo] => bar
    [baz] => boom
    [site] => www.nowamagic.net
    [name] => Nowa Magic
)


A few key points:


The program above uses the http_build_query() function to build the URL-encoded query string.


stream_context_create() creates a stream context: a set of options applied when a stream is opened, such as using POST, going through a proxy, or sending extra headers. One more example:

// $username, $password and $message are assumed to be defined by the caller
$context = stream_context_create(array(
    'http' => array(
        'method'  => 'POST',
        'header'  => sprintf("Authorization: Basic %s\r\n", base64_encode($username . ':' . $password))
                   . "Content-type: application/x-www-form-urlencoded\r\n",
        'content' => http_build_query(array('status' => $message)),
        'timeout' => 5,
    ),
));
$ret = file_get_contents('http://twitter.com/statuses/update.xml', false, $context);


The context created by stream_context_create() can be used with streams and with filesystem functions. It is most useful with functions such as file_get_contents(), file_put_contents(), and readfile(), which operate on a file name directly rather than on a file handle. Adding headers is only part of what stream_context_create() can do; it can also set a proxy, a timeout, and more. For many tasks this makes PHP's built-in web access not much weaker than cURL.
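To make the header, timeout, and proxy options concrete, here is a minimal sketch that assembles a context for a form POST and prints the resulting options. The function post_context() and the proxy address in the comment are hypothetical; the option names ('method', 'header', 'content', 'timeout', 'proxy', 'request_fulluri') are standard HTTP context options.

```php
<?php
/**
 * Build a stream context for a form POST.
 * post_context() is a hypothetical helper used only for illustration.
 *
 * @return resource a stream context
 */
function post_context(array $fields, float $timeout = 5.0, ?string $proxy = null)
{
    $http = [
        'method'  => 'POST',
        'header'  => "Content-Type: application/x-www-form-urlencoded\r\n",
        'content' => http_build_query($fields),
        'timeout' => $timeout,            // read timeout in seconds
    ];
    if ($proxy !== null) {
        $http['proxy'] = $proxy;          // for example 'tcp://proxy.example.com:8080'
        $http['request_fulluri'] = true;  // send the full URL in the request line, as most proxies expect
    }
    return stream_context_create(['http' => $http]);
}

$context = post_context(['foo' => 'bar', 'baz' => 'boom'], 10.0);
print_r(stream_context_get_options($context));
```

The returned context can be passed as the third argument of file_get_contents() or the fourth argument of fopen().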


In short, stream_context_create() creates and returns a stream context. Its options, such as the timeout, proxy server, request method, and headers, control how fopen(), file_get_contents(), and similar functions behave.


stream_context_create() also solves the problem of file_get_contents() timeouts, through the timeout option:

$opts = array(
    'http' => array(
        'method'  => 'GET',
        'timeout' => 60,
    ),
);
// Create a stream context
$context = stream_context_create($opts);
$html = file_get_contents('http://www.nowamagic.net', false, $context);
// With fopen(), output all remaining data at the file pointer:
// fpassthru($fp); // call before fclose()
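When a request times out or fails, file_get_contents() returns false, so the result should be checked before use. For responses that do arrive, the status code can be read from the response header lines. The helper below is a sketch: http_status() is a hypothetical name, and it expects an array of header lines like the one PHP places in $http_response_header after an HTTP request.

```php
<?php
/**
 * Extract the final HTTP status code from a list of response header lines.
 * http_status() is a hypothetical helper used only for illustration.
 */
function http_status(array $headerLines): ?int
{
    $status = null;
    foreach ($headerLines as $line) {
        // Each response in a redirect chain contributes its own status line; keep the last one.
        if (preg_match('#^HTTP/\d+(?:\.\d+)?\s+(\d{3})#', $line, $m)) {
            $status = (int) $m[1];
        }
    }
    return $status;
}

// Typical use (needs network access, so it is left as a comment):
// $html = @file_get_contents('http://www.nowamagic.net', false, $context);
// if ($html === false) { /* the request timed out or failed */ }
// $code = http_status($http_response_header ?? array());
```

Returning null when no status line is present lets the caller tell "no response at all" apart from an HTTP error status.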
