Introduction to crawling web pages with PHP cURL, and an integrated method for crawling Taobao pages with cURL (PHP example)

Source: Internet
Author: User
Tags curl http cookie http post php class

PHP's cURL extension can be used to fetch web pages and analyze their data. It is simple and easy to use. This article introduces its functions without describing each one in full detail; the code speaks for itself:

Only a few of the main functions are kept here. Implementing a simulated login may require capturing the session, and the pages before and after the login then need the form parameters supplied.

The main job of libcurl is to connect to and communicate with different servers over different protocols; it is effectively a wrapper around sockets.

PHP supports libcurl, which lets you connect to and communicate with different servers over different protocols. libcurl currently supports the HTTP, HTTPS, FTP, Gopher, Telnet, DICT, FILE, and LDAP protocols. libcurl also supports HTTPS certificates, HTTP POST, HTTP PUT, FTP uploads (which can also be done with PHP's FTP extension), HTTP form-based uploads, proxies, cookies, and user authentication.

To use the cURL functions you need to install the curl package. PHP requires curl 7.0.2-beta or later; with an older version, PHP's cURL support will not work.

To use PHP's cURL support, you must compile PHP with the --with-curl[=DIR] parameter, where DIR is the directory that contains the library and header files.

These functions were added in PHP 4.0.2.

Once PHP has been compiled with cURL support, you can use the cURL functions. The basic idea is to initialize a cURL session with curl_init(), set all your options with curl_setopt(), run the session with curl_exec(), and finally end it with curl_close(). Here is an example that retrieves the PHP home page and stores it in a file.

Example 1. Using PHP's cURL module to retrieve the PHP home page

<?php
$ch = curl_init("http://www.php.net/");
$fp = fopen("php_homepage.txt", "w");
curl_setopt($ch, CURLOPT_FILE, $fp);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_exec($ch);
curl_close($ch);
fclose($fp);
?>

Function list

curl_init - initializes a cURL session

curl_setopt - sets an option for a cURL transfer

curl_exec - executes a cURL session

curl_close - closes a cURL session

curl_version - returns the current cURL version

* Installing the cURL extension

PHP ships with php_curl.dll in the ext directory; this DLL also provides SSL and zlib support.

Find the line extension=php_curl.dll in php.ini and remove the semicolon that comments it out.

Set extension_dir to your PHP ext directory (e.g. C:/php/ext).

Copy libeay32.dll, ssleay32.dll, php5ts.dll, and php_curl.dll from the ext directory to the System32 directory, then restart Apache.
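One way to confirm that the installation steps took effect is to ask PHP whether the extension is loaded. The following is a minimal sketch; the helper name curl_is_available() is an illustrative assumption and not part of PHP.

```php
<?php
// Hypothetical helper: reports whether the cURL extension is usable.
function curl_is_available()
{
    // extension_loaded() checks the list of loaded extensions;
    // function_exists() confirms that curl_init() is actually defined.
    return extension_loaded('curl') && function_exists('curl_init');
}

if (curl_is_available()) {
    echo "cURL extension is loaded\n";
} else {
    echo "cURL extension is missing; check extension_dir and php.ini\n";
}
```

If the second message appears, the extension line in php.ini is probably still commented out or extension_dir points at the wrong directory.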

* curl_init

curl_init -- Initializes a cURL session

Description

int curl_init ([string url])

The curl_init() function initializes a new session and returns a cURL handle for use with the curl_setopt(), curl_exec(), and curl_close() functions. If the optional url parameter is supplied, the CURLOPT_URL option is set to its value. Otherwise you can set it manually with the curl_setopt() function.

Example 1. Initializing a new cURL session and retrieving a web page

<?php
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://www.zend.com/");
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_exec($ch);
curl_close($ch);
?>

See also: curl_close(), curl_setopt()

* curl_setopt

curl_setopt -- Sets an option for a cURL transfer

Description

bool curl_setopt (int ch, string option, mixed value)

The curl_setopt() function sets an option for a cURL session. The option parameter is the option you want to set, and value is the value to give that option.

The value should be a long integer for the following options (specified in the option parameter):

CURLOPT_INFILESIZE: When you upload a file to a remote site, this option tells PHP the size of the file you are uploading.

CURLOPT_VERBOSE: Set this option to a non-zero value if you want cURL to report everything that happens.

CURLOPT_HEADER: Set this option to a non-zero value if you want the header to be included in the output.

CURLOPT_NOPROGRESS: Set this option to a non-zero value if you do not want PHP to display a progress meter for cURL transfers.

Note: PHP automatically sets this option to a non-zero value; you should change it only for debugging purposes.

CURLOPT_NOBODY: Set this option to a non-zero value if you do not want the body to be included in the output.

CURLOPT_FAILONERROR: Set this option to a non-zero value if you want PHP to fail silently when the HTTP status code returned is 400 or greater. The default behavior is to return the page normally, ignoring the code.

CURLOPT_UPLOAD: Set this option to a non-zero value if you want PHP to prepare for an upload.

CURLOPT_POST: Set this option to a non-zero value if you want PHP to do a regular HTTP POST. This POST is the ordinary application/x-www-form-urlencoded kind, most commonly used by HTML forms.

CURLOPT_FTPLISTONLY: Set this option to a non-zero value and PHP will list only the names in an FTP directory.

CURLOPT_FTPAPPEND: Set this option to a non-zero value and PHP will append to the remote file instead of overwriting it.

CURLOPT_NETRC: Set this option to a non-zero value and PHP will look in your ~/.netrc file for the username and password of the remote site you are connecting to.

CURLOPT_FOLLOWLOCATION: Set this option to a non-zero value to follow any "Location:" header that the server sends as part of the HTTP header (note that this is recursive: PHP will follow as many "Location:" headers as it is sent).

CURLOPT_PUT: Set this option to a non-zero value to upload a file with HTTP PUT. The file to upload must be set with the CURLOPT_INFILE and CURLOPT_INFILESIZE options.

CURLOPT_MUTE: Set this option to a non-zero value and PHP will be completely silent with regard to the cURL functions.

CURLOPT_TIMEOUT: Pass a long integer giving the maximum time, in seconds, that the cURL functions are allowed to take.

CURLOPT_LOW_SPEED_LIMIT: Pass a long integer giving the transfer speed, in bytes per second, below which the transfer is considered too slow.

CURLOPT_LOW_SPEED_TIME: Pass a long integer giving the time, in seconds, that the transfer may stay below CURLOPT_LOW_SPEED_LIMIT before PHP considers it too slow and aborts.

CURLOPT_RESUME_FROM: Pass a long integer containing the byte offset at which you want the transfer to start.

CURLOPT_SSLVERSION: Pass a long integer containing the SSL version to use. By default PHP tries to determine this by itself, although in some cases you must set it manually.

CURLOPT_TIMECONDITION: Pass a long integer specifying how CURLOPT_TIMEVALUE is treated. You can set this parameter to CURL_TIMECOND_IFMODSINCE or CURL_TIMECOND_IFUNMODSINCE. This applies to HTTP only.

CURLOPT_TIMEVALUE: Pass a time as the number of seconds since 1970-01-01. The time is used as specified by the CURLOPT_TIMECONDITION option, or as CURL_TIMECOND_IFMODSINCE by default.
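Several of the integer options above are commonly combined to keep a crawl from hanging on a slow server. The sketch below is one possible combination, not a recommended configuration: the helper name fetch_with_limits() and the specific limits (100 bytes per second for 5 seconds) are assumptions chosen for illustration. It also uses CURLOPT_RETURNTRANSFER, which is not in the list above but is used by the class later in this article.

```php
<?php
// Hypothetical helper: fetch a URL with time and speed limits applied.
function fetch_with_limits($url, $timeoutSeconds = 10)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);        // return the body as a string
    curl_setopt($ch, CURLOPT_HEADER, 0);                // body only, no headers
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);        // follow "Location:" redirects
    curl_setopt($ch, CURLOPT_FAILONERROR, 1);           // treat HTTP error codes as failures
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeoutSeconds); // overall limit in seconds
    curl_setopt($ch, CURLOPT_LOW_SPEED_LIMIT, 100);     // assumed threshold: 100 bytes/s
    curl_setopt($ch, CURLOPT_LOW_SPEED_TIME, 5);        // assumed window: 5 seconds
    $body = curl_exec($ch);                             // string on success, false on failure
    curl_close($ch);
    return $body;
}
```

A call such as fetch_with_limits("http://www.php.net/", 15) returns the page body as a string, or false if the transfer fails or exceeds the limits.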

The value should be a string for the following options:

CURLOPT_URL: The URL you want PHP to retrieve. You can also set this option when initializing a session with the curl_init() function.

CURLOPT_USERPWD: Pass a string formatted as [username]:[password] for PHP to use for the connection.

CURLOPT_PROXYUSERPWD: Pass a string formatted as [username]:[password] for the connection to the HTTP proxy.

CURLOPT_RANGE: Pass the range you want to retrieve. It should be in "X-Y" format, where X or Y may be omitted. HTTP transfers also support several intervals separated by commas, as in X-Y,N-M.

CURLOPT_POSTFIELDS: Pass a string containing all of the data to send in an HTTP "POST" operation.

CURLOPT_REFERER: Pass a string containing the "Referer" header to use in the HTTP request.

CURLOPT_USERAGENT: Pass a string containing the "User-Agent" header to use in the HTTP request.

CURLOPT_FTPPORT: Pass a string used to obtain the IP address for the FTP "PORT" instruction. The PORT instruction tells the remote server to connect to the IP address we specify. The string can be an IP address, a host name, a network interface name (under Unix), or '-' (to use the system default IP address).

CURLOPT_COOKIE: Pass a string containing the content of the cookie to set in the HTTP header.

CURLOPT_SSLCERT: Pass a string containing the file name of a PEM-formatted certificate.

CURLOPT_SSLCERTPASSWD: Pass a string containing the password required to use the CURLOPT_SSLCERT certificate.

CURLOPT_COOKIEFILE: Pass a string containing the name of the file that holds the cookie data. The cookie file can be in Netscape format, or simply HTTP-style headers dumped into a file.

CURLOPT_CUSTOMREQUEST: Pass a string to be used instead of GET or HEAD when making an HTTP request. This is useful for DELETE or other, more obscure, HTTP requests.

Note: Do not do this without first making sure that your server supports the command.
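The string options above are typically set together when submitting a form. The following is a minimal sketch under stated assumptions: build_post_options() and post_form() are hypothetical helper names, and curl_setopt_array() and http_build_query() are standard PHP functions that are not described in this article.

```php
<?php
// Hypothetical helper: option array for a form-style POST request.
function build_post_options($url, array $fields, $referer, $userAgent)
{
    return array(
        CURLOPT_URL            => $url,
        CURLOPT_POST           => 1,
        // A string body is sent as application/x-www-form-urlencoded.
        CURLOPT_POSTFIELDS     => http_build_query($fields),
        CURLOPT_REFERER        => $referer,
        CURLOPT_USERAGENT      => $userAgent,
        CURLOPT_RETURNTRANSFER => 1,
    );
}

// Hypothetical helper: apply the options and run the request.
function post_form($url, array $fields, $referer, $userAgent)
{
    $ch = curl_init();
    curl_setopt_array($ch, build_post_options($url, $fields, $referer, $userAgent));
    $response = curl_exec($ch); // string on success, false on failure
    curl_close($ch);
    return $response;
}
```

Keeping the option array in its own function makes it possible to inspect what will be sent before any request is made.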

The following options expect a file descriptor obtained with the fopen() function:

CURLOPT_FILE: The file where the output of the transfer is placed; the default is STDOUT.

CURLOPT_INFILE: The file from which the input of the transfer is read.

CURLOPT_WRITEHEADER: The file to which the header part of the output is written.

CURLOPT_STDERR: The file to which errors are written instead of STDERR.
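As an illustration of the file-handle options, the sketch below streams a transfer directly into a local file through CURLOPT_FILE. The helper name save_url_to_file() is an assumption made for this example.

```php
<?php
// Hypothetical helper: stream a URL straight into a local file.
function save_url_to_file($url, $destination)
{
    $fp = fopen($destination, 'w');
    if ($fp === false) {
        return false;
    }
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_FILE, $fp);  // write the body to $fp instead of stdout
    curl_setopt($ch, CURLOPT_HEADER, 0);  // keep headers out of the file
    $ok = curl_exec($ch);                 // true on success, false on failure
    curl_close($ch);
    fclose($fp);                          // close the file only after the transfer ends
    return $ok;
}
```

Because the body never passes through a PHP string, this approach suits large downloads better than collecting the response in memory.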

* curl_exec

curl_exec -- Executes a cURL session

Description

bool curl_exec (int ch)

This function should be called after you have initialized a cURL session and set all of its options. Its only purpose is to execute the predefined cURL session (given by the ch parameter).
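Since curl_exec() reports failure through its return value, it is usually worth checking that value before using the result. The sketch below shows one way to do so; exec_checked() is a hypothetical helper, while curl_errno() and curl_error() are standard cURL functions that are not covered in this article.

```php
<?php
// Hypothetical helper: run a request and report failures explicitly.
function exec_checked($url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $body = curl_exec($ch); // string on success, false on failure
    $result = array(
        'ok'    => $body !== false,
        'body'  => $body === false ? '' : $body,
        'errno' => curl_errno($ch), // 0 when no error occurred
        'error' => curl_error($ch), // empty string when no error occurred
    );
    curl_close($ch);
    return $result;
}
```

The error details must be read before the handle is closed, which is why the result array is built ahead of curl_close().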

* curl_close

curl_close -- Closes a cURL session

Description

void curl_close (int ch)

This function closes a cURL session and frees all of its resources. The cURL handle (the ch parameter) is also deleted.

* curl_version

curl_version -- Returns the current cURL version

Description

string curl_version (void)

The curl_version() function returns a string containing the cURL version.
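The signature above describes the original PHP 4 behavior. In later PHP releases curl_version() returns an associative array (with a 'version' key, among others) instead of a string, so code that must run on both can normalize the result. A minimal sketch, with curl_version_string() as a hypothetical helper name:

```php
<?php
// Hypothetical helper: return the libcurl version as a plain string.
function curl_version_string()
{
    $info = curl_version();
    // Later PHP releases return an array with a 'version' key;
    // the string form described in this article is handled as a fallback.
    return is_array($info) ? $info['version'] : (string) $info;
}

echo 'libcurl version: ', curl_version_string(), "\n";
```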

<?php
class multihttprequest
{
    public $urls = array();
    public $curlopt_header = 0;
    public $cookie_file = '';
    public $collect_save_file = '';
    public $start_timestamp = '';
    public $end_timestamp = '';
    private $log_handle = '';
    private $collect_save_handle = '';
    private $db_conn = false;
    private $pre_break_goods_id = ''; // goods ID at the last forced exit
    private $per_break_brand_id = ''; // brand_id reached by the last update
    private $main_log_id = '';        // primary-table log ID of this update
    private $start_time = '';
    public $login_session = '';
    public $date_char = '';
    private $mode = '';
    private $sql_log_handle = '';

    function __construct($upgrade_date = '', $force_upt = false)
    {
        $this->mysql_init();
    }

    private function mysql_init()
    {
        $db_name = 'dbname';
        $db_user = 'name';
        $db_pass = 'pass';
        $db_host = 'localhost';
        // mysqli stands in for the mysql_* functions, which were removed in PHP 7;
        // the fourth argument selects the database.
        $db_conn = mysqli_connect($db_host, $db_user, $db_pass, $db_name);
        if (!$db_conn) {
            echo 'Database connection failed!';
            exit;
        }
        $this->db_conn = $db_conn;
    }

    public function init_login()
    {
        // Step 1: simulate the login
        $target_url = 'http://www.test.com/login.jsp';
        // Data submitted by POST
        $post_fields = array(
            'username' => 'Cho Yashu Taobao',
            'password' => 'joarshow.taobao.com',
            't_url'    => '',
            'Submit2'  => 'login',
        );
        // File in which the login cookie is saved
        $cookie_file = dirname(__FILE__) . '/cookie_' . time() . '.txt';
        $this->cookie_file = $cookie_file;
        // Log in and save the cookie
        $ch = curl_init($target_url);
        curl_setopt($ch, CURLOPT_HEADER, 1);
        curl_setopt($ch, CURLOPT_COOKIESESSION, 1);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_POST, 1);
        curl_setopt($ch, CURLOPT_POSTFIELDS, $post_fields);
        curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie_file);
        $login_contents = curl_exec($ch);
        curl_close($ch);
    }

    /**
     * Test
     *
     * @param string $test_url
     */
    public function get_one_file($test_url)
    {
        $ch = curl_init($test_url);
        curl_setopt($ch, CURLOPT_HEADER, 0);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_COOKIE, $this->login_session);
        curl_setopt($ch, CURLOPT_REFERER, 'http://www.test.com/welcome.shtml');
        curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)');
        $contents = curl_exec($ch);
        curl_close($ch);
        return $contents;
    }

    public function point_url_brand($url)
    {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_HEADER, 0);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_COOKIE, $this->login_session);
        curl_setopt($ch, CURLOPT_REFERER, 'http://www.test.com/product.shtml');
        curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)');
        $contents = curl_exec($ch);
        // Debug output; left enabled, the exit would make the lines below unreachable.
        // echo htmlspecialchars($contents); exit;
        curl_close($ch);
        return $contents;
    }
}
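The class above saves the login cookie with CURLOPT_COOKIEJAR, but its later requests send $this->login_session through CURLOPT_COOKIE, and the code that fills that property is not shown. One alternative, sketched here under assumptions rather than taken from the original class, is to have later requests read the same cookie file through CURLOPT_COOKIEFILE. The helper names login_options(), session_options(), and run_request() are hypothetical.

```php
<?php
// Hypothetical helper: options for the login POST; cookies are written to $cookieFile.
function login_options($loginUrl, array $postFields, $cookieFile)
{
    return array(
        CURLOPT_URL            => $loginUrl,
        CURLOPT_POST           => 1,
        CURLOPT_POSTFIELDS     => http_build_query($postFields),
        CURLOPT_RETURNTRANSFER => 1,
        CURLOPT_COOKIEJAR      => $cookieFile, // written when the handle is released
    );
}

// Hypothetical helper: options for a later request that sends the saved cookies.
function session_options($url, $cookieFile)
{
    return array(
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => 1,
        CURLOPT_COOKIEFILE     => $cookieFile, // read the cookies saved at login
    );
}

// Hypothetical helper: run one request from an option array.
function run_request(array $options)
{
    $ch = curl_init();
    curl_setopt_array($ch, $options);
    $body = curl_exec($ch); // string on success, false on failure
    curl_close($ch);
    return $body;
}
```

Under these assumptions, a crawl would call run_request(login_options(...)) once and then run_request(session_options(...)) for each page that requires the logged-in session.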

An integrated method for crawling Taobao pages with cURL

The code is as follows:

/**
 * Crawl the HTML code of a Taobao page
 * @param string $url Address
 * @return string|false
 */
function gettaobaohtml($url)
{
    if (empty($url)) {
        return false;
    }
    $ch = curl_init();
    // Set the URL
    curl_setopt($ch, CURLOPT_URL, $url);
    // Set the browser-specific headers
    curl_setopt($ch, CURLOPT_HTTPHEADER, array(
        "User-Agent: {Mozilla/5.0 (Windows NT 6.1; WOW64; rv:26.0) Gecko/20100101 Firefox/26.0}",
        "Accept: {text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8}",
        "Accept-Language: {zh-cn,zh;q=0.8,en-us;q=0.5,en;q=0.3}",
        "Cookie: {CQ=CCP%3D1; Cna=a7suczomstecaxgg9icf4atx; t=671b2069c7e8ac444da66d664a397a5f; tracknick=%5cu4f0d%5cu6653%5cu8f8901; _tb_token_=ndiu1vcuzfd0; COOKIE2=C54709FFBE04A5CCB80283C34D6B00FA; pnm_cku822=128wsmpac%2ffs4kgnn%2byfhzduo4u2nc0zh9cas4%3d%7cwucljkhqr873boifqcmecsw%3d%7cwmekrlv%2b3d9a6xwaidnwnqoswxwaxugvqhzhxalh%7cx0Ylbx78nur2b2dhoxniqzenqqr35tbzbfq5vooi0b6ghza3u1kr%7cxkdilogCr878ZK9I%2b%2fe3qjad3lfjjaazra%3d%3d%7cxuemwmr2s%2btuqk8ipp5tngwfujqwonccmcxihta0frygtjgfa4j6%7cxmyk7f8liovh3hmupzxkiau%2fjw%3d%3d}",
    ));
    // Keep the page body in the output (0 means the body is not dropped)
    curl_setopt($ch, CURLOPT_NOBODY, 0);
    // Do not include the HTTP header in the output
    curl_setopt($ch, CURLOPT_HEADER, 0);
    // Returning the result instead of printing it is disabled; output buffering captures it below
    // curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    ob_start();
    curl_exec($ch);
    $html = ob_get_contents();
    ob_end_clean();
    curl_close($ch);
    return $html;
}
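The function above captures the page by wrapping curl_exec() in output buffering. An equivalent result can usually be obtained by enabling CURLOPT_RETURNTRANSFER, the option that is commented out in that code. The sketch below shows this variant; get_html() and its $headers parameter are assumptions for illustration, not part of the original method.

```php
<?php
// Hypothetical variant: fetch a page and return its HTML as a string.
function get_html($url, array $headers = array())
{
    if (empty($url)) {
        return false;
    }
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    if ($headers) {
        // Optional request headers such as User-Agent, Accept, or Cookie.
        curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
    }
    curl_setopt($ch, CURLOPT_HEADER, 0);         // body only, no response headers
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1); // follow redirects
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the body instead of printing it
    $html = curl_exec($ch);                      // string on success, false on failure
    curl_close($ch);
    return $html;
}
```

With this variant the ob_start(), ob_get_contents(), and ob_end_clean() calls are no longer needed, and a failed transfer is visible as a false return value.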
