First, write a simple function that crawls a page:
<?php
// Crawl a specified page and return its content as a string.
// $URL:         address of the page to crawl
// $User_agent:  User-Agent string to send, such as "Baiduspider" or "Googlebot"
// $Referer_url: value to send in the Referer header
function GetSources($URL, $User_agent = '', $Referer_url = '')
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $URL);
    curl_setopt($ch, CURLOPT_USERAGENT, $User_agent);
    curl_setopt($ch, CURLOPT_REFERER, $Referer_url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1); // follow redirects
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the page instead of printing it
    $MySources = curl_exec($ch);
    curl_close($ch);
    return $MySources;
}

$URL = "http://www.jb51.net"; // address of the page to fetch
$User_agent = "Baiduspider+(+http://www.baidu.com/search/spider.htm)";
$Referer_url = 'http://www.jb51.net/';
echo GetSources($URL, $User_agent, $Referer_url);
?>
The cURL function library in PHP (Client URL Library functions):
curl_close - closes a cURL session;
curl_copy_handle - copies a cURL handle along with all of its options;
curl_errno - returns the error number of the last operation in the current session;
curl_error - returns a string containing the last error message for the current session;
curl_exec - executes a cURL session;
curl_getinfo - gets information about a cURL handle;
curl_init - initializes a cURL session;
curl_multi_add_handle - adds an individual cURL handle to a cURL batch handle;
curl_multi_close - closes a batch handle;
curl_multi_exec - runs the transfers of a cURL batch handle;
curl_multi_getcontent - returns the content of a cURL handle if CURLOPT_RETURNTRANSFER is set;
curl_multi_info_read - gets information about the current transfers;
curl_multi_init - initializes a cURL batch handle;
curl_multi_remove_handle - removes a handle from a cURL batch handle;
curl_multi_select - waits for activity on any connection of a cURL batch handle;
curl_setopt_array - sets several options for a cURL session from an array;
curl_setopt - sets an option for a cURL session;
curl_version - gets cURL version information;
The curl_init() function initializes a cURL session. Its only argument is optional and specifies a URL;
The curl_exec() function executes a cURL session. Its only argument is the handle returned by curl_init();
The curl_close() function closes a cURL session. Its only argument is the handle returned by curl_init();
PHP code
<?php
$ch = curl_init("http://blog.huangchao.org/");
curl_exec($ch);
curl_close($ch);
?>
The curl_version() function returns cURL version information as an array. Older PHP versions accepted an optional age parameter; it was deprecated in PHP 7.4 and removed in PHP 8.0, so the function is normally called without arguments;
PHP code
<?php
print_r(curl_version());
?>
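The array returned by curl_version() can also be read key by key. The following is a minimal sketch, assuming the version, ssl_version, and protocols keys listed in the PHP manual; the helper name describeCurl is made up for this illustration:

```php
<?php
// Hypothetical helper: summarizes the installed cURL build as a string.
function describeCurl()
{
    $info = curl_version();

    // 'version' holds the libcurl version string.
    $summary = "cURL version: " . $info['version'] . "\n";

    // 'ssl_version' is empty when cURL was built without SSL support.
    $ssl = !empty($info['ssl_version']) ? $info['ssl_version'] : 'none';
    $summary .= "SSL library: " . $ssl . "\n";

    // 'protocols' is an array of supported protocol names.
    $https = in_array('https', $info['protocols']) ? 'yes' : 'no';
    $summary .= "Supports HTTPS: " . $https . "\n";

    return $summary;
}

echo describeCurl();
```

Checking the protocols entry this way is a quick test of whether HTTPS requests can work at all on a given installation.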
The curl_getinfo() function gets information about a cURL handle. It takes two parameters: the first is the cURL handle, and the optional second one is one of the constants listed below. When the second parameter is omitted, an associative array with all values is returned:
PHP code
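<?php
$ch = curl_init("http://blog.huangchao.org/");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // keep the page body out of the output
curl_exec($ch); // most fields are only filled in after the transfer has run
print_r(curl_getinfo($ch));
curl_close($ch);
?>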
The optional constants include:
CURLINFO_EFFECTIVE_URL: the last effective URL;
CURLINFO_HTTP_CODE: the last HTTP status code received;
CURLINFO_FILETIME: the remote time of the retrieved document; the return value is -1 if it cannot be obtained;
CURLINFO_TOTAL_TIME: the total time, in seconds, taken by the last transfer;
CURLINFO_NAMELOOKUP_TIME: the time, in seconds, until name resolution was complete;
CURLINFO_CONNECT_TIME: the time, in seconds, taken to establish the connection;
CURLINFO_PRETRANSFER_TIME: the time, in seconds, from the start until just before the transfer begins;
CURLINFO_STARTTRANSFER_TIME: the time, in seconds, from the start until the first byte is about to be transferred;
CURLINFO_REDIRECT_TIME: the time, in seconds, taken by all redirection steps before the final transaction starts;
CURLINFO_SIZE_UPLOAD: the total number of bytes uploaded;
CURLINFO_SIZE_DOWNLOAD: the total number of bytes downloaded;
CURLINFO_SPEED_DOWNLOAD: the average download speed;
CURLINFO_SPEED_UPLOAD: the average upload speed;
CURLINFO_HEADER_SIZE: the total size of all headers received;
CURLINFO_HEADER_OUT: the request string that was sent; for this to work, CURLINFO_HEADER_OUT must first be enabled on the handle with curl_setopt();
CURLINFO_REQUEST_SIZE: the total size of the requests issued, currently only for HTTP requests;
CURLINFO_SSL_VERIFYRESULT: the result of the SSL certificate verification requested by setting CURLOPT_SSL_VERIFYPEER;
CURLINFO_CONTENT_LENGTH_DOWNLOAD: the content length of the download, read from the Content-Length: field;
CURLINFO_CONTENT_LENGTH_UPLOAD: the specified size of the upload;
CURLINFO_CONTENT_TYPE: the Content-Type: of the downloaded content; NULL indicates that the server did not send a valid Content-Type: header;
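Passing one of these constants as the second argument returns just that value. The following is a minimal sketch rather than a definitive implementation; fetchStatus is a hypothetical helper, the timeout values are arbitrary, and the URL in the usage comment is a placeholder:

```php
<?php
// Hypothetical helper: performs a request and reports a few transfer details.
function fetchStatus($url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // do not echo the response
    curl_setopt($ch, CURLOPT_NOBODY, true);         // the headers are enough here
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);    // arbitrary values for the sketch
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    $ok = curl_exec($ch) !== false;

    // Each call asks for a single value instead of the whole array.
    $status = array(
        'ok'         => $ok,
        'http_code'  => curl_getinfo($ch, CURLINFO_HTTP_CODE),
        'total_time' => curl_getinfo($ch, CURLINFO_TOTAL_TIME),
        'type'       => curl_getinfo($ch, CURLINFO_CONTENT_TYPE),
    );
    curl_close($ch);
    return $status;
}

// Usage (placeholder URL):
// print_r(fetchStatus("http://www.example.com/"));
```

When the transfer fails before any response arrives, http_code stays 0, so checking both ok and http_code gives a clearer picture than either one alone.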
The curl_setopt() function sets a single option for a cURL session. The curl_setopt_array() function sets several options at once from an array;
PHP code
<?php
$ch = curl_init();
$fp = fopen("example_homepage.txt", "w");
curl_setopt($ch, CURLOPT_FILE, $fp); // write the response to the file
$options = array(
    CURLOPT_URL => 'http://www.baidu.com/',
    CURLOPT_HEADER => false
);
curl_setopt_array($ch, $options);
curl_exec($ch);
curl_close($ch);
fclose($fp);
?>
The options that can be set are:
CURLOPT_AUTOREFERER: when enabled, automatically sets the Referer: field in requests that follow a Location: redirect;
CURLOPT_BINARYTRANSFER: when enabled, the raw data is returned when CURLOPT_RETURNTRANSFER is used;
CURLOPT_COOKIESESSION: when enabled, marks this as a new cookie session, so cURL ignores any session cookies loaded from a previous session. By default cURL stores and loads all cookies, whether or not they are session cookies. A session cookie is a cookie without an expiry date that is meant to exist only for the current session;
CURLOPT_CRLF: when enabled, converts Unix newlines to CRLF newlines on transfers;
CURLOPT_DNS_USE_GLOBAL_CACHE: when enabled, uses a global DNS cache. This option is not thread-safe and is enabled by default;
CURLOPT_FAILONERROR: when enabled, the request fails if the HTTP status code returned is greater than or equal to 400. The default behavior is to return the page normally and ignore the status code;
CURLOPT_FILETIME: when enabled, attempts to retrieve the modification date of the remote document. The value is returned via the CURLINFO_FILETIME option of the curl_getinfo() function;
CURLOPT_FOLLOWLOCATION: when enabled, follows any "Location:" header that the server sends in its response. This is recursive: cURL follows every "Location:" header it is sent unless CURLOPT_MAXREDIRS is set to limit the number of redirects;
CURLOPT_FORBID_REUSE: when enabled, forces the connection to close once the transfer has finished, so it cannot be reused;
CURLOPT_FRESH_CONNECT: when enabled, forces the use of a new connection instead of a cached one;
CURLOPT_FTP_USE_EPRT: TRUE to use EPRT (and LPRT) when doing active FTP downloads. Use FALSE to disable EPRT and LPRT and use PORT only; added in PHP 5.0.0;
CURLOPT_FTP_USE_EPSV: TRUE to first try an EPSV command for FTP transfers before reverting back to PASV. Set to FALSE to disable EPSV;
CURLOPT_FTPAPPEND: TRUE to append to the remote file instead of overwriting it;
CURLOPT_FTPASCII: an alias of CURLOPT_TRANSFERTEXT; use that instead;
CURLOPT_FTPLISTONLY: TRUE to only list the names of an FTP directory;
CURLOPT_HEADER: when enabled, includes the response header in the output;
CURLOPT_HTTPGET: when enabled, resets the HTTP request method to GET. Since GET is the default, this is only needed if the request method has been changed;
CURLOPT_HTTPPROXYTUNNEL: when enabled, tunnels through the given HTTP proxy;
CURLOPT_MUTE: when enabled, the cURL functions are completely silent;
CURLOPT_NETRC: when enabled, scans the ~/.netrc file after the connection is established to find the username and password for the remote site;
CURLOPT_NOBODY: when enabled, excludes the body from the output; the request method is then set to HEAD;
CURLOPT_NOPROGRESS: when enabled, turns off the progress meter for cURL transfers; this option is enabled by default;
CURLOPT_NOSIGNAL: when enabled, ignores any cURL function that causes a signal to be sent to the PHP process. This is turned on by default in multi-threaded SAPIs;
CURLOPT_POST: when enabled, sends a regular HTTP POST of type application/x-www-form-urlencoded, just like an HTML form submission;
CURLOPT_PUT: when enabled, uploads a file with HTTP PUT; the file must be set with CURLOPT_INFILE and CURLOPT_INFILESIZE;
CURLOPT_RETURNTRANSFER: when enabled, curl_exec() returns the transfer as a string instead of outputting it directly;
CURLOPT_SSL_VERIFYPEER: FALSE to stop cURL from verifying the peer's certificate. Alternate certificates to verify against can be specified with the CURLOPT_CAINFO option, or a certificate directory can be specified with the CURLOPT_CAPATH option. CURLOPT_SSL_VERIFYHOST may also need to be TRUE or FALSE if CURLOPT_SSL_VERIFYPEER is disabled (it defaults to 2). TRUE by default as of cURL 7.10. Default bundle installed as of cURL 7.10;
CURLOPT_TRANSFERTEXT: TRUE to use ASCII mode for FTP transfers. For LDAP, it retrieves data in plain text instead of HTML. On Windows systems, it will not set STDOUT to binary mode;
CURLOPT_UNRESTRICTED_AUTH: when enabled, keeps sending the username and password when following locations (using CURLOPT_FOLLOWLOCATION), even if the hostname has changed;
CURLOPT_UPLOAD: when enabled, prepares the session for an upload;
CURLOPT_VERBOSE: when enabled, outputs verbose information, written to STDERR or to the file specified with CURLOPT_STDERR;
CURLOPT_BUFFERSIZE: the size of the buffer to use for each read; there is no guarantee that this request will be fulfilled;
CURLOPT_CLOSEPOLICY: either CURLCLOSEPOLICY_LEAST_RECENTLY_USED or CURLCLOSEPOLICY_OLDEST; there are three other values, but cURL does not support them yet;
CURLOPT_CONNECTTIMEOUT: the number of seconds to wait while trying to connect; use 0 to wait indefinitely;
CURLOPT_DNS_CACHE_TIMEOUT: the number of seconds to keep DNS entries in memory; the default is 120 seconds;
CURLOPT_FTPSSLAUTH: the FTP authentication method (when it is activated): CURLFTPAUTH_SSL (try SSL first), CURLFTPAUTH_TLS (try TLS first), or CURLFTPAUTH_DEFAULT (let cURL decide);
CURLOPT_HTTP_VERSION: the HTTP protocol version used by cURL: CURL_HTTP_VERSION_NONE (let cURL decide), CURL_HTTP_VERSION_1_0 (HTTP/1.0), or CURL_HTTP_VERSION_1_1 (HTTP/1.1);
CURLOPT_HTTPAUTH: the HTTP authentication method(s) to use. The possible values are CURLAUTH_BASIC, CURLAUTH_DIGEST, CURLAUTH_GSSNEGOTIATE, CURLAUTH_NTLM, CURLAUTH_ANY, and CURLAUTH_ANYSAFE. Several values can be combined with the "|" operator, in which case cURL asks the server which methods it supports and picks the best one. CURLAUTH_ANY is equivalent to CURLAUTH_BASIC | CURLAUTH_DIGEST | CURLAUTH_GSSNEGOTIATE | CURLAUTH_NTLM, and CURLAUTH_ANYSAFE is equivalent to CURLAUTH_DIGEST | CURLAUTH_GSSNEGOTIATE | CURLAUTH_NTLM;
CURLOPT_INFILESIZE: the expected size, in bytes, of the file being uploaded;
CURLOPT_LOW_SPEED_LIMIT: the transfer speed, in bytes per second, that the transfer must stay below for CURLOPT_LOW_SPEED_TIME seconds before PHP considers it too slow and aborts it;
CURLOPT_LOW_SPEED_TIME: the number of seconds the transfer speed must stay below CURLOPT_LOW_SPEED_LIMIT before PHP considers the transfer too slow and aborts it;
CURLOPT_MAXCONNECTS: the maximum number of persistent connections allowed; when the limit is reached, CURLOPT_CLOSEPOLICY determines which connection to close;
CURLOPT_MAXREDIRS: the maximum number of HTTP redirects to follow; use this option together with CURLOPT_FOLLOWLOCATION;
CURLOPT_PORT: an alternative port number to connect to;
CURLOPT_PROXYAUTH: the HTTP authentication method(s) to use for the proxy connection. Use the same bitmasks as described in CURLOPT_HTTPAUTH. For proxy authentication, only CURLAUTH_BASIC and CURLAUTH_NTLM are currently supported;
CURLOPT_PROXYPORT: the port number of the proxy to connect to. This port number can also be set in CURLOPT_PROXY;
CURLOPT_PROXYTYPE: either CURLPROXY_HTTP (default) or CURLPROXY_SOCKS5;
CURLOPT_RESUME_FROM: the offset, in bytes, to resume a transfer from (used for resumable transfers);
CURLOPT_SSL_VERIFYHOST: 1 to check the existence of a common name in the SSL peer certificate; 2 to check the existence of a common name and also verify that it matches the hostname;
CURLOPT_SSLVERSION: the SSL version (2 or 3) to use. By default PHP will try to determine this itself, although in some cases this must be set manually;
CURLOPT_TIMECONDITION: how CURLOPT_TIMEVALUE is treated. Use CURL_TIMECOND_IFMODSINCE to return the page only if it has been modified since the time specified in CURLOPT_TIMEVALUE; if it has not been modified, a "304 Not Modified" header is returned, assuming CURLOPT_HEADER is TRUE. Use CURL_TIMECOND_IFUNMODSINCE for the reverse effect. The default is CURL_TIMECOND_IFMODSINCE;
CURLOPT_TIMEOUT: the maximum number of seconds that cURL functions are allowed to execute;
CURLOPT_TIMEVALUE: the timestamp used by CURLOPT_TIMECONDITION; by default it is used with CURL_TIMECOND_IFMODSINCE;
CURLOPT_CAINFO: the name of a file holding one or more certificates to verify the peer with. This only makes sense when used in combination with CURLOPT_SSL_VERIFYPEER.
CURLOPT_CAPATH: a directory that holds multiple CA certificates. Use this option alongside CURLOPT_SSL_VERIFYPEER.
CURLOPT_COOKIE: the contents of the "Cookie:" header to be used in the HTTP request.
CURLOPT_COOKIEFILE: the name of the file that contains the cookie data, which can be in Netscape format or be plain HTTP-style headers.
CURLOPT_COOKIEJAR: the name of the file in which all cookies are saved when the connection is closed.
CURLOPT_CUSTOMREQUEST: a custom request method to use instead of "GET" or "HEAD" when doing an HTTP request. This is useful for doing "DELETE" or other, more obscure HTTP requests. Valid values are things like "GET", "POST", "CONNECT" and so on; i.e. do not enter a whole HTTP request line here. For instance, entering "GET /index.html HTTP/1.0\r\n\r\n" would be incorrect.
Note: don't do this without making sure the server supports the custom request method.
CURLOPT_EGDSOCKET: like CURLOPT_RANDOM_FILE, except a filename to an Entropy Gathering Daemon socket.
CURLOPT_ENCODING: the contents of the "Accept-Encoding:" header. The supported encodings are "identity", "deflate", and "gzip". If set to an empty string, a header containing all supported encoding types is sent.
CURLOPT_FTPPORT: the value which will be used to get the IP address to use for the FTP "PORT" instruction. The "PORT" instruction tells the remote server to connect to our specified IP address. The string may be a plain IP address, a hostname, a network interface name (under Unix), or just a plain '-' to use the system's default IP address.
CURLOPT_INTERFACE: the name of the outgoing network interface to use, which can be an interface name, an IP address, or a host name.
CURLOPT_KRB4LEVEL: the KRB4 (Kerberos 4) security level, which can be one of "clear", "safe", "confidential", or "private". The default value is "private". Setting this option to NULL disables KRB4 security. Currently KRB4 security only works with FTP transfers.
CURLOPT_POSTFIELDS: the full data to send in an HTTP "POST" operation. To send a file, prefix the filename with @ (this prefix was deprecated in PHP 5.5 in favor of CURLFile).
CURLOPT_PROXY: the HTTP proxy to tunnel requests through.
CURLOPT_PROXYUSERPWD: a username and password, formatted as "[username]:[password]", to use for the connection to the proxy server.
CURLOPT_RANDOM_FILE: the name of a file used to seed the random number generator for SSL.
CURLOPT_RANGE: the range(s) of data to retrieve, in the form "X-Y". HTTP transfers also support several ranges separated by commas, such as "X-Y,N-M".
CURLOPT_REFERER: the contents of the "Referer:" header to be used in the HTTP request.
CURLOPT_SSL_CIPHER_LIST: a list of ciphers to use for SSL. For example, RC4-SHA and TLSv1 are valid cipher lists.
CURLOPT_SSLCERT: the name of a file containing a PEM formatted certificate.
CURLOPT_SSLCERTPASSWD: the password required to use the CURLOPT_SSLCERT certificate.
CURLOPT_SSLCERTTYPE: the format of the certificate. Supported formats are "PEM" (default), "DER", and "ENG".
CURLOPT_SSLENGINE: the identifier for the crypto engine of the private SSL key specified in CURLOPT_SSLKEY.
CURLOPT_SSLENGINE_DEFAULT: the identifier for the crypto engine used for asymmetric crypto operations.
CURLOPT_SSLKEY: the name of a file containing a private SSL key.
CURLOPT_SSLKEYPASSWD: the secret password needed to use the private SSL key specified in CURLOPT_SSLKEY.
Note: since this option contains a sensitive password, remember to keep the PHP script that contains it safe.
CURLOPT_SSLKEYTYPE: the key type of the private SSL key specified in CURLOPT_SSLKEY. Supported key types are "PEM" (default), "DER", and "ENG".
CURLOPT_URL: the URL to fetch; it can also be set when initializing the session with curl_init().
CURLOPT_USERAGENT: the contents of the "User-Agent:" header to be used in the HTTP request.
CURLOPT_USERPWD: a username and password, formatted as "[username]:[password]", to use for the connection.
CURLOPT_HTTP200ALIASES: an array of HTTP 200 responses that will be treated as valid responses and not as errors.
CURLOPT_HTTPHEADER: an array of HTTP header fields to send with the request.
CURLOPT_POSTQUOTE: an array of FTP commands to execute on the server after the FTP request has been performed.
CURLOPT_QUOTE: an array of FTP commands to execute on the server prior to the FTP request.
CURLOPT_FILE: the file that the transfer is written to. The value is a stream resource, and the default is STDOUT (the browser).
CURLOPT_INFILE: the file that the transfer is read from when uploading. The value is a stream resource.
CURLOPT_STDERR: an alternative location to write errors to instead of the default STDERR. The value is a stream resource.
CURLOPT_WRITEHEADER: the file that the header part of the transfer is written to. The value is a stream resource.
CURLOPT_HEADERFUNCTION: a callback function that takes two parameters: the first is the cURL handle, and the second is a string with the header data to be written. The header data must be written by this callback, which returns the number of bytes written.
CURLOPT_PASSWDFUNCTION: a callback function that takes three parameters: the first is the cURL handle, the second is a password prompt, and the third is the maximum password length allowed. It returns the password.
CURLOPT_READFUNCTION: a callback function that takes three parameters: the first is the cURL handle, the second is the stream resource passed through CURLOPT_INFILE, and the third is the maximum amount of data to read. The callback returns a string no longer than the requested amount, or an empty string to signal EOF.
CURLOPT_WRITEFUNCTION: a callback function that takes two parameters: the first is the cURL handle, and the second is a string with the data to be written. The data must be saved by this callback, which must return the exact number of bytes written; otherwise the transfer is aborted with an error.
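The callback options are easier to follow in code. The sketch below registers a header callback and a write callback on a handle; the ResponseCollector class and the attachCollector function are names invented for this illustration, and the timeout values are arbitrary. Both callbacks return the number of bytes they handled, as the descriptions above require:

```php
<?php
// Hypothetical collector: stores the headers and body handed over by cURL.
class ResponseCollector
{
    public $headers = array();
    public $body = '';

    // CURLOPT_HEADERFUNCTION callback: receives one header line per call.
    public function onHeader($ch, $line)
    {
        $parts = explode(':', $line, 2);
        if (count($parts) === 2) {
            // Store "Name: value" lines; the status line has no colon and is skipped.
            $this->headers[strtolower(trim($parts[0]))] = trim($parts[1]);
        }
        return strlen($line); // number of bytes handled
    }

    // CURLOPT_WRITEFUNCTION callback: receives the body in chunks.
    public function onBody($ch, $chunk)
    {
        $this->body .= $chunk;
        return strlen($chunk); // any other value aborts the transfer
    }
}

// Hypothetical helper: registers the collector's callbacks on a handle.
function attachCollector($ch, ResponseCollector $collector)
{
    return curl_setopt_array($ch, array(
        CURLOPT_HEADERFUNCTION => array($collector, 'onHeader'),
        CURLOPT_WRITEFUNCTION  => array($collector, 'onBody'),
        CURLOPT_CONNECTTIMEOUT => 5,
        CURLOPT_TIMEOUT        => 15,
    ));
}

$collector = new ResponseCollector();
$ch = curl_init();
attachCollector($ch, $collector);
// After curl_setopt($ch, CURLOPT_URL, ...) and curl_exec($ch), the response
// would be available in $collector->headers and $collector->body.
curl_close($ch);
```

Keeping the callbacks on an object avoids global variables, since the collected data lives on the collector rather than in the surrounding script.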
The curl_copy_handle() function copies a cURL handle along with all of its options:
PHP code
<?php
$ch = curl_init("http://qzone.myqq.us/");
$another = curl_copy_handle($ch); // the copy keeps the URL set on $ch
curl_exec($another);
curl_close($another);
curl_close($ch);
?>
The curl_error() function returns a string containing the last error message for the current session, or an empty string if no error occurred.
The curl_errno() function returns the error number of the last operation in the current session, or 0 if no error occurred.
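These two functions are typically checked right after curl_exec() and before the handle is closed. The following is a minimal sketch under those assumptions; fetchOrExplain is a hypothetical helper, the timeout values are arbitrary, and the URL in the usage comment is a placeholder:

```php
<?php
// Hypothetical helper: fetches a URL and reports failures instead of hiding them.
function fetchOrExplain($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5); // arbitrary values for the sketch
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);

    $body = curl_exec($ch);

    // Read the error state before closing the handle.
    $errno = curl_errno($ch); // 0 means no error occurred
    $error = curl_error($ch); // empty string when there was no error
    curl_close($ch);

    if ($body === false) {
        return array('ok' => false, 'errno' => $errno, 'error' => $error);
    }
    return array('ok' => true, 'body' => $body);
}

// Usage (placeholder URL):
// $result = fetchOrExplain("http://www.example.com/");
// echo $result['ok'] ? $result['body'] : "cURL error {$result['errno']}: {$result['error']}";
```

Comparing the return value of curl_exec() with false using === matters here, because an empty response body is a valid result and should not be mistaken for a failure.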
The curl_multi_init() function initializes a cURL batch handle.
The curl_multi_add_handle() function adds an individual cURL handle to a cURL batch handle. It takes two parameters: the first is the cURL batch handle, and the second is the individual cURL handle.
The curl_multi_exec() function runs the transfers of a cURL batch handle. It takes two parameters: the first is the batch handle, and the second is passed by reference and receives the number of individual cURL handles that are still being processed.
The curl_multi_remove_handle() function removes a handle from a cURL batch handle. It takes two parameters: the first is the cURL batch handle, and the second is the individual cURL handle.
The curl_multi_close() function closes a batch handle.
PHP code
<?php
$ch1 = curl_init();
$ch2 = curl_init();
curl_setopt($ch1, CURLOPT_URL, "http://blog.huangchao.org/");
curl_setopt($ch1, CURLOPT_HEADER, 0);
curl_setopt($ch2, CURLOPT_URL, "http://test.huangchao.org/");
curl_setopt($ch2, CURLOPT_HEADER, 0);
$mh = curl_multi_init();
curl_multi_add_handle($mh, $ch1);
curl_multi_add_handle($mh, $ch2);
// Keep driving the transfers until none of them is still running.
$flag = null;
do {
    curl_multi_exec($mh, $flag);
} while ($flag > 0);
curl_multi_remove_handle($mh, $ch1);
curl_multi_remove_handle($mh, $ch2);
curl_multi_close($mh);
?>
The curl_multi_getcontent() function returns the content of a cURL handle as a string when CURLOPT_RETURNTRANSFER is set on it.
The curl_multi_info_read() function gets information about the transfers in the batch handle, such as which ones have finished and with what result.
The curl_multi_select() function waits until there is activity on any of the connections that belong to the batch handle.
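The three functions above are usually combined in a single loop. The following is a sketch of one possible arrangement, not a definitive implementation; fetchAll is a hypothetical helper, the timeout values are arbitrary, and the URLs in the usage comment are placeholders:

```php
<?php
// Hypothetical helper: fetches several URLs in parallel. The result keeps the
// keys of the input array and holds either the body or the cURL error code.
function fetchAll(array $urls)
{
    $mh = curl_multi_init();
    $handles = array();
    foreach ($urls as $key => $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // needed for curl_multi_getcontent()
        curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);    // arbitrary values for the sketch
        curl_setopt($ch, CURLOPT_TIMEOUT, 10);
        curl_multi_add_handle($mh, $ch);
        $handles[$key] = $ch;
    }

    $results = array();
    $running = 0;
    do {
        curl_multi_exec($mh, $running);

        // Collect the outcome of every transfer that has finished so far.
        while ($info = curl_multi_info_read($mh)) {
            $key = array_search($info['handle'], $handles, true);
            if ($info['result'] === CURLE_OK) {
                $body = curl_multi_getcontent($info['handle']);
                $results[$key] = array('ok' => true, 'body' => $body);
            } else {
                $results[$key] = array('ok' => false, 'errno' => $info['result']);
            }
        }

        // Wait for activity instead of spinning; back off briefly if select fails.
        if ($running > 0 && curl_multi_select($mh, 1.0) === -1) {
            usleep(100000);
        }
    } while ($running > 0);

    foreach ($handles as $ch) {
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);
    return $results;
}

// Usage (placeholder URLs):
// print_r(fetchAll(array('a' => "http://www.example.com/", 'b' => "http://www.example.org/")));
```

Calling curl_multi_select() between calls to curl_multi_exec() lets the script sleep until a connection has data, which avoids the busy loop of repeatedly calling curl_multi_exec() on its own.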