First, write a simple function that fetches a page.
The code is as follows:
<?php
// Fetch the specified page and return its content as a string.
// $url         - address of the page to crawl
// $user_agent  - User-Agent to send, such as "Baiduspider" or "Googlebot"
// $referer_url - value to send in the Referer header
function getSources($url, $user_agent = '', $referer_url = '')
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);
    curl_setopt($ch, CURLOPT_REFERER, $referer_url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $mySources = curl_exec($ch);
    curl_close($ch);
    return $mySources;
}

$url = "http://www.jb51.net"; // page to fetch
$user_agent = "Baiduspider+(+http://www.baidu.com/search/spider.htm)";
$referer_url = 'http://www.jb51.net/';
echo getSources($url, $user_agent, $referer_url);
?>
cURL functions in PHP (Client URL Library)
curl_close - closes a cURL session;
curl_copy_handle - copies a cURL handle together with all of its settings;
curl_errno - returns the error number of the last operation in the current session;
curl_error - returns a string describing the last error in the current session;
curl_exec - executes a cURL session;
curl_getinfo - gets information about a cURL handle;
curl_init - initializes a cURL session;
curl_multi_add_handle - adds an individual cURL handle to a cURL multi (batch) handle;
curl_multi_close - closes a multi handle;
curl_multi_exec - runs the transfers of a cURL multi handle;
curl_multi_getcontent - returns the content fetched by a handle as a string;
curl_multi_info_read - gets information about the transfers currently being processed;
curl_multi_init - initializes a cURL multi handle;
curl_multi_remove_handle - removes a handle from a cURL multi handle;
curl_multi_select - gets all the sockets associated with the cURL extension, which can then be "selected";
curl_setopt_array - sets several options for a cURL session from an array;
curl_setopt - sets one option for a cURL session;
curl_version - gets cURL version information;
curl_init() initializes a cURL session. Its only parameter is optional and is the URL to fetch;
curl_exec() executes a cURL session. Its only parameter is the handle returned by curl_init();
curl_close() closes a cURL session. Its only parameter is the handle returned by curl_init();
PHP code
The code is as follows:
<?php
$ch = curl_init("http://blog.huangchao.org/");
curl_exec($ch);   // prints the page directly, since CURLOPT_RETURNTRANSFER is not set
curl_close($ch);
?>
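The three calls above do no error checking. As a minimal sketch, the same init, exec, close sequence can check each return value, since both curl_init() and curl_exec() return false on failure. The helper name fetchOrNull is hypothetical and not part of the cURL extension.

```php
<?php
// Hypothetical helper: run the init -> exec -> close sequence with checks.
// Returns the response body as a string, or null if any step fails.
function fetchOrNull($url)
{
    $ch = curl_init($url);      // the URL may be passed directly to curl_init()
    if ($ch === false) {
        return null;            // the session could not be created
    }
    // Return the body from curl_exec() instead of printing it
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $body = curl_exec($ch);     // string on success, false on failure
    curl_close($ch);
    return ($body === false) ? null : $body;
}
```

A call such as fetchOrNull("http://blog.huangchao.org/") would then return the page as a string, or null if the request failed.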
curl_version() returns version information about cURL. It accepts one optional parameter whose purpose the author is not sure of (it has been removed in newer PHP versions); it can simply be omitted;
PHP code
<?php
print_r(curl_version());
?>
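Rather than dumping the whole array, individual entries can be read by key. The short sketch below assumes the 'version', 'ssl_version' and 'protocols' keys, which curl_version() normally provides; the printed values depend on the local build.

```php
<?php
// curl_version() returns an associative array describing the linked libcurl.
$info = curl_version();

echo "libcurl version: ", $info['version'], "\n";      // version string of libcurl
echo "SSL library:     ", $info['ssl_version'], "\n";  // may be empty without SSL support
// 'protocols' is an array of the protocol names this build supports
echo "Supports https:  ", in_array('https', $info['protocols']) ? "yes" : "no", "\n";
```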
curl_getinfo() gets information about a cURL handle. It has two parameters: the first is the cURL handle, and the second, which is optional, is one of the constants listed below. If the second parameter is omitted, all of the information is returned as an array:
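PHP code
The code is as follows:
<?php
$ch = curl_init("http://blog.huangchao.org/");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // do not print the page itself
curl_exec($ch);                              // most values are only filled in after a transfer
print_r(curl_getinfo($ch));
curl_close($ch);
?>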
The optional constants are:
CURLINFO_EFFECTIVE_URL: the last effective URL;
CURLINFO_HTTP_CODE: the last HTTP status code received;
CURLINFO_FILETIME: the remote modification time of the retrieved document; if it cannot be determined, the return value is -1;
CURLINFO_TOTAL_TIME: the total time taken by the last transfer;
CURLINFO_NAMELOOKUP_TIME: the time taken by name resolution;
CURLINFO_CONNECT_TIME: the time taken to establish the connection;
CURLINFO_PRETRANSFER_TIME: the time from the start until just before the transfer begins;
CURLINFO_STARTTRANSFER_TIME: the time from the start until the first byte is about to be transferred;
CURLINFO_REDIRECT_TIME: the time taken by all redirection steps before the final transfer begins;
CURLINFO_SIZE_UPLOAD: the total number of bytes uploaded;
CURLINFO_SIZE_DOWNLOAD: the total number of bytes downloaded;
CURLINFO_SPEED_DOWNLOAD: the average download speed;
CURLINFO_SPEED_UPLOAD: the average upload speed;
CURLINFO_HEADER_SIZE: the total size of the headers received;
CURLINFO_HEADER_OUT: the request string that was sent;
CURLINFO_REQUEST_SIZE: the total size of the requests issued (currently only for HTTP requests);
CURLINFO_SSL_VERIFYRESULT: the result of the SSL certificate verification requested by setting CURLOPT_SSL_VERIFYPEER;
CURLINFO_CONTENT_LENGTH_DOWNLOAD: the length of the download, read from the "Content-Length:" field;
CURLINFO_CONTENT_LENGTH_UPLOAD: the specified size of the upload;
CURLINFO_CONTENT_TYPE: the "Content-Type" of the downloaded content; NULL indicates that the server did not send a valid "Content-Type:" header;
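To read single values instead of the whole array, pass one of the constants above as the second argument. The following is a sketch only: the helper name fetchStats and the choice of fields are assumptions made for illustration.

```php
<?php
// Hypothetical helper: fetch a URL and return a few transfer statistics.
function fetchStats($url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // keep the body out of the output
    $body = curl_exec($ch);

    // curl_getinfo() must be called before the handle is closed
    $stats = array(
        'ok'           => ($body !== false),
        'http_code'    => curl_getinfo($ch, CURLINFO_HTTP_CODE),     // last HTTP status code
        'total_time'   => curl_getinfo($ch, CURLINFO_TOTAL_TIME),    // seconds
        'content_type' => curl_getinfo($ch, CURLINFO_CONTENT_TYPE),  // may be null
        'size'         => curl_getinfo($ch, CURLINFO_SIZE_DOWNLOAD), // bytes downloaded
    );
    curl_close($ch);
    return $stats;
}
```

For example, print_r(fetchStats("http://blog.huangchao.org/")) would show the status code and timing of that request.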
curl_setopt() sets a single option for a cURL session. curl_setopt_array() sets several options for a cURL session at once, passed as an array of option => value pairs;
PHP code
The code is as follows:
<?php
$ch = curl_init();
$fp = fopen("example_homepage.txt", "w");
curl_setopt($ch, CURLOPT_FILE, $fp);   // write the response to the file
$options = array(
    CURLOPT_URL    => 'http://www.baidu.com/',
    CURLOPT_HEADER => false
);
curl_setopt_array($ch, $options);
curl_exec($ch);
curl_close($ch);
fclose($fp);
?>
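As a further sketch of the array form, the options for a form-style POST request can be collected in one array. The helper name preparePost, the 10 second timeout and the URL used with it are illustrative assumptions; http_build_query() is a standard PHP function that encodes the fields.

```php
<?php
// Hypothetical helper: prepare (but do not run) a form-style POST request.
// Returns the configured cURL handle, or false if an option was rejected.
function preparePost($url, array $fields)
{
    $ch = curl_init();
    $options = array(
        CURLOPT_URL            => $url,
        CURLOPT_POST           => true,                      // regular form POST
        CURLOPT_POSTFIELDS     => http_build_query($fields), // "a=1&b=2" style body
        CURLOPT_RETURNTRANSFER => true,                      // return the response as a string
        CURLOPT_TIMEOUT        => 10,                        // assumed limit, in seconds
    );
    // curl_setopt_array() returns false as soon as one option cannot be set
    if (!curl_setopt_array($ch, $options)) {
        curl_close($ch);
        return false;
    }
    return $ch;
}
```

The returned handle is then passed to curl_exec() and curl_close() as usual.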
The options that can be set are:
CURLOPT_AUTOREFERER: when enabled, the Referer field is set automatically in requests that follow a "Location:" redirect;
CURLOPT_BINARYTRANSFER: when enabled, the raw output is returned when CURLOPT_RETURNTRANSFER is used;
CURLOPT_COOKIESESSION: when enabled, this is treated as a new cookie session, and cURL ignores the session cookies loaded from a previous session. By default cURL stores and loads all cookies, whether or not they are session cookies. A session cookie is a cookie without an expiry date, meant to live only for the current session;
CURLOPT_CRLF: when enabled, Unix newlines are converted to CRLF (carriage return plus line feed) on transfer;
CURLOPT_DNS_USE_GLOBAL_CACHE: when enabled, a global DNS cache is used. This option is not thread-safe and is enabled by default;
CURLOPT_FAILONERROR: when enabled, the request fails verbosely if the HTTP status code returned is greater than or equal to 400. The default behavior is to return the page normally and ignore the code;
CURLOPT_FILETIME: when enabled, cURL attempts to retrieve the modification time of the remote document. The value can then be read with the CURLINFO_FILETIME option of curl_getinfo();
CURLOPT_FOLLOWLOCATION: when enabled, any "Location:" header returned by the server is followed. This is recursive; use CURLOPT_MAXREDIRS to limit the number of redirects followed;
CURLOPT_FORBID_REUSE: when enabled, the connection is closed explicitly after the transfer and cannot be reused;
CURLOPT_FRESH_CONNECT: when enabled, a new connection is forced instead of using a cached one;
CURLOPT_FTP_USE_EPRT: TRUE to use EPRT (and LPRT) when doing active FTP downloads. Use FALSE to disable EPRT and LPRT and use PORT only. Added in PHP 5.0.0;
CURLOPT_FTP_USE_EPSV: TRUE to first try an EPSV command for FTP transfers before reverting back to PASV. Set to FALSE to disable EPSV;
CURLOPT_FTPAPPEND: TRUE to append to the remote file instead of overwriting it;
CURLOPT_FTPASCII: an alias of CURLOPT_TRANSFERTEXT. Use that instead;
CURLOPT_FTPLISTONLY: TRUE to list only the names of an FTP directory;
CURLOPT_HEADER: when enabled, the response headers are included in the output;
CURLOPT_HTTPGET: when enabled, the HTTP request method is reset to GET. Since GET is the default, this is only needed if the method has been changed;
CURLOPT_HTTPPROXYTUNNEL: when enabled, the request is tunneled through the given HTTP proxy;
CURLOPT_MUTE: when enabled, the cURL functions are completely silent (removed in newer cURL versions; CURLOPT_RETURNTRANSFER can be used instead);
CURLOPT_NETRC: when enabled, the ~/.netrc file is scanned after the connection is established to find the user name and password for the remote site;
CURLOPT_NOBODY: when enabled, the body is excluded from the output;
CURLOPT_NOPROGRESS: when enabled, the progress meter for cURL transfers is turned off. PHP sets this option to TRUE by default;
CURLOPT_NOSIGNAL: when enabled, any cURL function that would send a signal to the PHP process is ignored. This is turned on by default in multi-threaded SAPIs so that the timeout options can still be used;
CURLOPT_POST: when enabled, a regular HTTP POST is sent, of type application/x-www-form-urlencoded, as is commonly used by HTML forms;
CURLOPT_PUT: when enabled, a file is sent with HTTP PUT. The file must be set with both CURLOPT_INFILE and CURLOPT_INFILESIZE;
CURLOPT_RETURNTRANSFER: when enabled, the result of the transfer is returned as a string by curl_exec() instead of being output directly;
CURLOPT_SSL_VERIFYPEER: FALSE to stop cURL from verifying the peer's certificate. Alternate certificates to verify against can be specified with the CURLOPT_CAINFO option, or a certificate directory can be specified with the CURLOPT_CAPATH option. CURLOPT_SSL_VERIFYHOST may also need to be TRUE or FALSE if CURLOPT_SSL_VERIFYPEER is disabled (it defaults to 2). TRUE by default as of cURL 7.10. Default bundle installed as of cURL 7.10;
CURLOPT_TRANSFERTEXT: TRUE to use ASCII mode for FTP transfers. For LDAP, it retrieves data in plain text instead of HTML. On Windows systems, it will not set STDOUT to binary mode;
CURLOPT_UNRESTRICTED_AUTH: when enabled, the user name and password keep being sent when following locations (with CURLOPT_FOLLOWLOCATION), even if the host name has changed;
CURLOPT_UPLOAD: when enabled, cURL prepares for an upload;
CURLOPT_VERBOSE: when enabled, verbose information is output, written to STDERR or to the file specified with CURLOPT_STDERR;
CURLOPT_BUFFERSIZE: the size of the buffer to use for each read. There is no guarantee that this request will be fulfilled;
CURLOPT_CLOSEPOLICY: either CURLCLOSEPOLICY_LEAST_RECENTLY_USED or CURLCLOSEPOLICY_OLDEST. There are three other CURLCLOSEPOLICY_ constants, but cURL does not support them yet;
CURLOPT_CONNECTTIMEOUT: the number of seconds to wait while trying to connect. Use 0 to wait indefinitely;
CURLOPT_DNS_CACHE_TIMEOUT: the number of seconds to keep DNS entries in memory. The default is 120 seconds;
CURLOPT_FTPSSLAUTH: the FTP authentication method (when it is activated): CURLFTPAUTH_SSL (try SSL first), CURLFTPAUTH_TLS (try TLS first), or CURLFTPAUTH_DEFAULT (let cURL decide);
CURLOPT_HTTP_VERSION: the HTTP version for cURL to use: CURL_HTTP_VERSION_NONE (let cURL decide), CURL_HTTP_VERSION_1_0 (HTTP/1.0), or CURL_HTTP_VERSION_1_1 (HTTP/1.1);
CURLOPT_HTTPAUTH: the HTTP authentication method(s) to use. The options are CURLAUTH_BASIC, CURLAUTH_DIGEST, CURLAUTH_GSSNEGOTIATE, CURLAUTH_NTLM, CURLAUTH_ANY and CURLAUTH_ANYSAFE. The bitwise "|" (or) operator can be used to combine several methods; cURL then asks the server which methods it supports and picks the best one. CURLAUTH_ANY is equivalent to CURLAUTH_BASIC | CURLAUTH_DIGEST | CURLAUTH_GSSNEGOTIATE | CURLAUTH_NTLM, and CURLAUTH_ANYSAFE is equivalent to CURLAUTH_DIGEST | CURLAUTH_GSSNEGOTIATE | CURLAUTH_NTLM;
CURLOPT_INFILESIZE: the expected size, in bytes, of the file being uploaded;
CURLOPT_LOW_SPEED_LIMIT: the transfer speed, in bytes per second, that the transfer must stay below for CURLOPT_LOW_SPEED_TIME seconds before PHP considers it too slow and aborts it;
CURLOPT_LOW_SPEED_TIME: the number of seconds the transfer speed must stay below CURLOPT_LOW_SPEED_LIMIT before PHP considers the transfer too slow and aborts it;
CURLOPT_MAXCONNECTS: the maximum number of persistent connections allowed. When the limit is reached, CURLOPT_CLOSEPOLICY determines which connection is closed;
CURLOPT_MAXREDIRS: the maximum number of HTTP redirects to follow. Use this option together with CURLOPT_FOLLOWLOCATION;
CURLOPT_PORT: an alternative port number to connect to;
CURLOPT_PROXYAUTH: the HTTP authentication method(s) to use for the proxy connection. Use the same bitmasks as described in CURLOPT_HTTPAUTH. For proxy authentication, only CURLAUTH_BASIC and CURLAUTH_NTLM are currently supported;
CURLOPT_PROXYPORT: the port number of the proxy to connect to. This port number can also be set in CURLOPT_PROXY;
CURLOPT_PROXYTYPE: either CURLPROXY_HTTP (default) or CURLPROXY_SOCKS5;
CURLOPT_RESUME_FROM: the offset, in bytes, from which to resume a transfer;
CURLOPT_SSL_VERIFYHOST: 1 to check the existence of a common name in the SSL peer certificate; 2 to check the existence of a common name and also verify that it matches the host name provided;
CURLOPT_SSLVERSION: the SSL version (2 or 3) to use. By default PHP will try to determine this itself, although in some cases it must be set manually;
CURLOPT_TIMECONDITION: how CURLOPT_TIMEVALUE is treated. Use CURL_TIMECOND_IFMODSINCE to return the page only if it has been modified since the time specified in CURLOPT_TIMEVALUE; if it has not been modified and CURLOPT_HEADER is TRUE, a "304 Not Modified" header is returned. Use CURL_TIMECOND_IFUNMODSINCE for the reverse effect. The default is CURL_TIMECOND_IFMODSINCE;
CURLOPT_TIMEOUT: the maximum number of seconds that cURL functions are allowed to execute;
CURLOPT_TIMEVALUE: the timestamp used by CURLOPT_TIMECONDITION. By default, CURL_TIMECOND_IFMODSINCE is used;
CURLOPT_CAINFO: the name of a file holding one or more certificates to verify the peer with. This only makes sense when used in combination with CURLOPT_SSL_VERIFYPEER.
CURLOPT_CAPATH: a directory that holds multiple CA certificates. Use this option alongside CURLOPT_SSL_VERIFYPEER.
CURLOPT_COOKIE: the contents of the "Cookie:" header to be used in the HTTP request.
CURLOPT_COOKIEFILE: the name of the file containing the cookie data. The file can be in Netscape format or contain plain HTTP-style headers.
CURLOPT_COOKIEJAR: the name of a file in which the cookies are saved when the connection is closed.
CURLOPT_CUSTOMREQUEST: a custom request method to use instead of "GET" or "HEAD" when doing an HTTP request. This is useful for doing "DELETE" or other, more obscure HTTP requests. Valid values are things like "GET", "POST", "CONNECT" and so on; i.e. do not enter a whole HTTP request line here. For instance, entering "GET /index.html HTTP/1.0\r\n\r\n" would be incorrect.
Note: do not do this without first making sure that the server supports the custom request method.
CURLOPT_EGDSOCKET: like CURLOPT_RANDOM_FILE, except that it is a filename to an Entropy Gathering Daemon socket.
CURLOPT_ENCODING: the contents of the "Accept-Encoding:" header. The supported encodings are "identity", "deflate" and "gzip". If set to an empty string, a header listing all supported encodings is sent.
CURLOPT_FTPPORT: the value which will be used to get the IP address to use for the FTP "PORT" instruction. The "PORT" instruction tells the remote server to connect to our specified IP address. The string may be a plain IP address, a host name, a network interface name (under Unix), or just a plain "-" to use the system's default IP address.
CURLOPT_INTERFACE: the name of the outgoing network interface to use. This can be an interface name, an IP address or a host name.
CURLOPT_KRB4LEVEL: the KRB4 (Kerberos 4) security level. It can be any of the following values: "clear", "safe", "confidential", "private". If the string does not match one of these, "private" is used. Setting this option to NULL disables KRB4 security. Currently KRB4 security only works with FTP transfers.
CURLOPT_POSTFIELDS: the full data to send in an HTTP "POST" operation. To post a file, prefix the filename with @ (in newer PHP versions, the CURLFile class is used instead).
CURLOPT_PROXY: the HTTP proxy through which requests are tunneled.
CURLOPT_PROXYUSERPWD: the user name and password, in the format "[username]:[password]", to use for the connection to the proxy server.
CURLOPT_RANDOM_FILE: the name of a file used to seed the random number generator for SSL.
CURLOPT_RANGE: the range(s) of data to retrieve, in the format "X-Y", where X or Y is optional. HTTP transfers also support several ranges, separated by commas, in the format "X-Y,N-M".
CURLOPT_REFERER: the contents of the "Referer:" header to be used in the HTTP request.
CURLOPT_SSL_CIPHER_LIST: a list of ciphers to use for SSL. For example, RC4-SHA and TLSv1 are valid cipher lists.
CURLOPT_SSLCERT: the name of a file containing a certificate in PEM format.
CURLOPT_SSLCERTPASSWD: the password required to use the CURLOPT_SSLCERT certificate.
CURLOPT_SSLCERTTYPE: the format of the certificate. Supported formats are "PEM" (default), "DER", and "ENG".
CURLOPT_SSLENGINE: the identifier for the crypto engine of the private SSL key specified in CURLOPT_SSLKEY.
CURLOPT_SSLENGINE_DEFAULT: the identifier for the crypto engine used for asymmetric crypto operations.
CURLOPT_SSLKEY: the name of a file containing a private SSL key.
CURLOPT_SSLKEYPASSWD: the secret password needed to use the private SSL key specified in CURLOPT_SSLKEY.
Note: since this option contains a sensitive password, remember to keep the PHP script it is contained within safe.
CURLOPT_SSLKEYTYPE: the key type of the private SSL key specified in CURLOPT_SSLKEY. Supported key types are "PEM" (default), "DER", and "ENG".
CURLOPT_URL: the URL to fetch. This can also be set when initializing a session with curl_init().
CURLOPT_USERAGENT: the contents of the "User-Agent:" header to be used in the HTTP request.
CURLOPT_USERPWD: the user name and password to use for the connection, in the format "[username]:[password]".
CURLOPT_HTTP200ALIASES: an array of HTTP 200 responses that will be treated as valid responses rather than as errors.
CURLOPT_HTTPHEADER: an array of HTTP header fields to set.
CURLOPT_POSTQUOTE: an array of FTP commands to execute on the server after the FTP request has been performed.
CURLOPT_QUOTE: an array of FTP commands to execute on the server prior to the FTP request.
CURLOPT_FILE: the file that the transfer should be written to. The value is a stream resource, and the default is STDOUT (the browser window).
CURLOPT_INFILE: the file that the transfer should be read from when uploading. The value is a stream resource.
CURLOPT_STDERR: an alternative location to output errors to instead of STDERR. The value is a stream resource.
CURLOPT_WRITEHEADER: the file that the header part of the transfer is written to. The value is a stream resource.
CURLOPT_HEADERFUNCTION: a callback function with two parameters: the first is the cURL handle, and the second is a string with the header data to be written. When this callback is used, writing the header data is its responsibility. It must return the number of bytes written.
CURLOPT_PASSWDFUNCTION: a callback function with three parameters: the first is the cURL handle, the second is a string containing a password prompt, and the third is the maximum password length. It returns a string containing the password.
CURLOPT_READFUNCTION: a callback function with two parameters: the first is the cURL handle, and the second is the data to be read. When this callback is used, reading the data is its responsibility. It returns the number of bytes read, or 0 to signal EOF. (Newer PHP documentation describes this callback as receiving the cURL handle, the stream given to CURLOPT_INFILE and the maximum number of bytes to read, and as returning the data as a string.)
CURLOPT_WRITEFUNCTION: a callback function with two parameters: the first is the cURL handle, and the second is a string with the data to be written. When this callback is used, writing the data is its responsibility. It must return the exact number of bytes written.
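The callback options can be illustrated with CURLOPT_WRITEFUNCTION. This is a minimal sketch: the helper name fetchCounting is hypothetical, and the anonymous function requires PHP 5.3 or later. The callback receives each chunk of the response and must return the number of bytes it handled.

```php
<?php
// Hypothetical helper: download a URL through a CURLOPT_WRITEFUNCTION callback
// and report how many bytes were received.
function fetchCounting($url)
{
    $received = '';
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_WRITEFUNCTION, function ($ch, $chunk) use (&$received) {
        $received .= $chunk;    // "write" the chunk by appending it to a buffer
        return strlen($chunk);  // must report exactly how many bytes were handled
    });
    $ok = curl_exec($ch);       // true or false here; the callback receives the data
    curl_close($ch);
    return $ok ? array('bytes' => strlen($received), 'body' => $received) : false;
}
```

If the callback returns a number different from the length of the chunk, cURL treats it as a write error and aborts the transfer.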
curl_copy_handle() copies a cURL handle together with all of its settings:
PHP code
The code is as follows:
<?php
$ch = curl_init("http://qzone.myqq.us/");
$another = curl_copy_handle($ch);   // same URL and options as $ch
curl_exec($another);
curl_close($another);
?>
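One use for copying is to configure a handle once and reuse its settings for several URLs. The sketch below assumes a hypothetical helper name and an invented User-Agent string; only the URL differs between the copies.

```php
<?php
// Hypothetical helper: configure one "template" handle, then copy it per URL.
function fetchAllWithSharedOptions(array $urls)
{
    $template = curl_init();
    curl_setopt($template, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($template, CURLOPT_USERAGENT, 'example-agent/1.0'); // illustrative value

    $results = array();
    foreach ($urls as $url) {
        $copy = curl_copy_handle($template);   // inherits the options set above
        curl_setopt($copy, CURLOPT_URL, $url); // only the URL differs per copy
        $results[$url] = curl_exec($copy);     // string on success, false on failure
        curl_close($copy);
    }
    curl_close($template);
    return $results;
}
```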
curl_error() returns a string describing the last error in the current session.
curl_errno() returns the error number of the last error in the current session.
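A minimal sketch of how the two functions are typically used after curl_exec() follows. The helper name fetchWithErrorInfo and the timeout values are assumptions made for illustration.

```php
<?php
// Hypothetical helper: fetch a URL and report cURL errors instead of hiding them.
function fetchWithErrorInfo($url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5); // assumed limits, in seconds
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    $body = curl_exec($ch);

    // Read the error state before closing the handle
    $result = array(
        'body'  => $body,            // false if the transfer failed
        'errno' => curl_errno($ch),  // 0 means no error
        'error' => curl_error($ch),  // empty string when there is no error
    );
    curl_close($ch);
    return $result;
}
```

A caller can then test $result['errno'] !== 0 and log $result['error'] before deciding whether to retry.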
curl_multi_init() initializes a cURL multi (batch) handle.
curl_multi_add_handle() adds an individual cURL handle to a cURL multi handle. It has two parameters: the first is the multi handle, and the second is the individual cURL handle.
curl_multi_exec() runs the transfers of a cURL multi handle. It has two parameters: the first is the multi handle, and the second is passed by reference and receives the number of individual cURL handles that are still being processed.
curl_multi_remove_handle() removes a handle from a cURL multi handle. It has two parameters: the first is the multi handle, and the second is the individual cURL handle.
curl_multi_close() closes a multi handle.
PHP code
The code is as follows:
<?php
$ch1 = curl_init();
$ch2 = curl_init();
curl_setopt($ch1, CURLOPT_URL, "http://blog.huangchao.org/");
curl_setopt($ch1, CURLOPT_HEADER, 0);
curl_setopt($ch2, CURLOPT_URL, "http://test.huangchao.org/");
curl_setopt($ch2, CURLOPT_HEADER, 0);
$mh = curl_multi_init();
curl_multi_add_handle($mh, $ch1);
curl_multi_add_handle($mh, $ch2);
// Keep running until no transfers remain; $flag receives the number still active
do {
    curl_multi_exec($mh, $flag);
} while ($flag > 0);
curl_multi_remove_handle($mh, $ch1);
curl_multi_remove_handle($mh, $ch2);
curl_multi_close($mh);
?>
curl_multi_getcontent() returns the content fetched by a cURL handle as a string, provided CURLOPT_RETURNTRANSFER is set on that handle.
curl_multi_info_read() gets information about the transfers currently being processed by a multi handle.
curl_multi_select() gets all the sockets associated with the cURL extension, which can then be "selected".
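The last three functions can be combined with the multi loop shown earlier. The following is a sketch only: the helper name fetchParallel and the 1 second select timeout are assumptions. It waits with curl_multi_select() instead of looping continuously, counts failed transfers with curl_multi_info_read(), and collects each body with curl_multi_getcontent().

```php
<?php
// Hypothetical helper: fetch several URLs in parallel.
// Returns array('bodies' => [...], 'failures' => number of failed transfers).
function fetchParallel(array $urls)
{
    $mh = curl_multi_init();
    $handles = array();
    foreach ($urls as $key => $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // needed for curl_multi_getcontent()
        curl_multi_add_handle($mh, $ch);
        $handles[$key] = $ch;
    }

    do {
        // Older libcurl versions may ask to be called again immediately
        do {
            $status = curl_multi_exec($mh, $running);
        } while ($status === CURLM_CALL_MULTI_PERFORM);
        // Wait for socket activity; on -1 pause briefly instead of spinning
        if ($running > 0 && curl_multi_select($mh, 1.0) === -1) {
            usleep(1000);
        }
    } while ($running > 0 && $status === CURLM_OK);

    // Each finished transfer reports a result code through the multi handle
    $failures = 0;
    while ($info = curl_multi_info_read($mh)) {
        if ($info['result'] !== CURLE_OK) {
            $failures++;
        }
    }

    $bodies = array();
    foreach ($handles as $key => $ch) {
        $bodies[$key] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);
    return array('bodies' => $bodies, 'failures' => $failures);
}
```

Called with the two URLs from the example above, it would return both pages under the keys 0 and 1.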