First, write a simple page capture function.
function GetSources($Url, $User_Agent = '', $Referer_Url = '') // fetch the specified page
{
    // $Url: the URL of the page to fetch
    // $User_Agent: the User-Agent string to send, such as "baiduspider" or "googlebot"
    // $Referer_Url: the Referer URL to send
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $Url);
    curl_setopt($ch, CURLOPT_USERAGENT, $User_Agent);
    curl_setopt($ch, CURLOPT_REFERER, $Referer_Url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $MySources = curl_exec($ch);
    curl_close($ch);
    return $MySources;
}
$Url = "http://www.baidu.com";
$User_Agent = "baiduspider+(+http://www.baidu.com/search/spider.htm)";
$Referer_Url = 'http://www.chinaz.com/';
Calling GetSources($Url, $User_Agent, $Referer_Url) then returns the source of the fetched page as a string.
cURL library functions in PHP
curl_close - closes a cURL session;
curl_copy_handle - copies a cURL handle along with all of its options;
curl_errno - returns the error number of the last operation in the current session;
curl_error - returns a string containing the last error message of the current session;
curl_exec - executes a cURL session;
curl_getinfo - gets information about the transfer of a cURL handle;
curl_init - initializes a cURL session;
curl_multi_add_handle - adds an individual cURL handle to a cURL batch (multi) handle;
curl_multi_close - closes a batch handle;
curl_multi_exec - runs the sub-connections of a batch handle;
curl_multi_getcontent - returns the content of a handle if CURLOPT_RETURNTRANSFER is set;
curl_multi_info_read - gets information about the current transfers;
curl_multi_init - initializes a cURL batch handle;
curl_multi_remove_handle - removes a handle from a cURL batch handle;
curl_multi_select - waits for activity on any connection of a batch handle;
curl_setopt_array - sets several options for a cURL session from an array;
curl_setopt - sets one option for a cURL session;
curl_version - gets cURL version information;
The curl_init() function initializes a cURL session. Its only parameter is optional and specifies a URL;
The curl_exec() function executes a cURL session. Its only parameter is the handle returned by curl_init();
The curl_close() function closes a cURL session. Its only parameter is the handle returned by curl_init();
PHP code:
$ch = curl_init("http://www.BkJia.com/");
curl_exec($ch);
curl_close($ch);
The curl_version() function returns version information about cURL as an array. Older PHP versions accepted one optional parameter ($age); it was deprecated in PHP 7.4 and removed in PHP 8.0, so the function is normally called without arguments;
PHP code:
print_r(curl_version());
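Because curl_version() returns an associative array, individual fields can be read directly instead of dumping the whole array. A minimal sketch, assuming the cURL extension is loaded; the helper name supports_protocol() is made up for this illustration:

```php
<?php
// Read selected fields from the array returned by curl_version().
$info = curl_version();

echo "libcurl version: ", $info['version'], "\n";
echo "SSL library:     ", $info['ssl_version'], "\n";

// Hypothetical helper: check whether this libcurl build supports a protocol.
function supports_protocol(string $name): bool
{
    $info = curl_version();
    // 'protocols' is a list of lowercase protocol names such as "http" or "ftp".
    return in_array(strtolower($name), $info['protocols'], true);
}

echo "HTTPS supported: ", supports_protocol('https') ? "yes" : "no", "\n";
```

Checking the protocol list first can explain why a request to an https:// or ftp:// URL fails on a particular server.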
The curl_getinfo() function returns information about the last transfer of a cURL handle. It takes two parameters: the first is the cURL handle, and the optional second is one of the constants listed below. When the second parameter is omitted, an associative array with all values is returned:
PHP code:
$ch = curl_init("http://www.BkJia.com/");
curl_exec($ch); // perform the transfer so that there is information to report
print_r(curl_getinfo($ch));
Optional constants include:
CURLINFO_EFFECTIVE_URL: the last effective URL;
CURLINFO_HTTP_CODE: the last HTTP status code received;
CURLINFO_FILETIME: the remote modification time of the retrieved document; if it is unknown, -1 is returned;
CURLINFO_TOTAL_TIME: the total time in seconds of the last transfer;
CURLINFO_NAMELOOKUP_TIME: the time in seconds taken by name resolution;
CURLINFO_CONNECT_TIME: the time in seconds taken to establish the connection;
CURLINFO_PRETRANSFER_TIME: the time in seconds from the start until just before the transfer begins;
CURLINFO_STARTTRANSFER_TIME: the time in seconds until the first byte is about to be transferred;
CURLINFO_REDIRECT_TIME: the time in seconds spent on all redirection steps before the final transaction started;
CURLINFO_SIZE_UPLOAD: the total number of bytes uploaded;
CURLINFO_SIZE_DOWNLOAD: the total number of bytes downloaded;
CURLINFO_SPEED_DOWNLOAD: the average download speed;
CURLINFO_SPEED_UPLOAD: the average upload speed;
CURLINFO_HEADER_SIZE: the total size of all headers received;
CURLINFO_HEADER_OUT: the request string that was sent; this only works if CURLINFO_HEADER_OUT was enabled beforehand with curl_setopt();
CURLINFO_REQUEST_SIZE: the total size of the issued requests, currently only for HTTP requests;
CURLINFO_SSL_VERIFYRESULT: the result of the SSL certificate verification requested by setting CURLOPT_SSL_VERIFYPEER;
CURLINFO_CONTENT_LENGTH_DOWNLOAD: the content length of the download, read from the "Content-Length:" field;
CURLINFO_CONTENT_LENGTH_UPLOAD: the specified size of the upload;
CURLINFO_CONTENT_TYPE: the "Content-Type:" value of the downloaded document; NULL indicates that the server did not send a valid "Content-Type:" header;
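Passing one of these constants as the second argument returns just that value. The following is a minimal sketch rather than a definitive implementation; the function name transfer_summary() and the choice of fields are assumptions made for illustration:

```php
<?php
// Hypothetical helper: run a transfer and collect a few curl_getinfo() values.
function transfer_summary(string $url): array
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // keep the body out of the output
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // so the effective URL may differ
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    curl_exec($ch);

    // Ask for one value at a time by passing a CURLINFO_* constant.
    $summary = [
        'url'          => curl_getinfo($ch, CURLINFO_EFFECTIVE_URL),
        'http_code'    => curl_getinfo($ch, CURLINFO_HTTP_CODE),
        'total_time'   => curl_getinfo($ch, CURLINFO_TOTAL_TIME),
        'content_type' => curl_getinfo($ch, CURLINFO_CONTENT_TYPE),
    ];
    curl_close($ch);
    return $summary;
}
```

For example, print_r(transfer_summary('http://www.baidu.com/')) would print the final URL after redirects, the status code, the elapsed time in seconds, and the content type reported by the server.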
The curl_setopt() function sets a single option on a cURL handle. The curl_setopt_array() function sets several options at once from an array;
PHP code:
$ch = curl_init();
$fp = fopen("example_homepage.txt", "w");
curl_setopt($ch, CURLOPT_FILE, $fp);
$options = array(
    CURLOPT_URL => 'http://www.baidu.com/',
    CURLOPT_HEADER => false
);
curl_setopt_array($ch, $options);
curl_exec($ch);
curl_close($ch);
fclose($fp);
Configurable parameters include:
CURLOPT_AUTOREFERER: when enabled, the "Referer:" field is set automatically in requests that follow a "Location:" redirect;
CURLOPT_BINARYTRANSFER: when enabled together with CURLOPT_RETURNTRANSFER, the raw (binary) output is returned;
CURLOPT_COOKIESESSION: when enabled, this transfer is marked as a new cookie "session", and cURL ignores any session cookies loaded from a previous session. By default, cURL stores and loads all cookies, whether or not they are session cookies. Session cookies are cookies without an expiry date, meant to live only for the current session;
CURLOPT_CRLF: when enabled, Unix line breaks (LF) are converted to CRLF line breaks on transfers;
CURLOPT_DNS_USE_GLOBAL_CACHE: when enabled, a global DNS cache is used. This option is not thread-safe and is enabled by default;
CURLOPT_FAILONERROR: when enabled, the transfer fails verbosely if the HTTP status code is greater than or equal to 400. The default behavior is to return the page normally and ignore the code;
CURLOPT_FILETIME: when enabled, cURL tries to retrieve the modification time of the remote document. The value can be read with the CURLINFO_FILETIME option of curl_getinfo();
CURLOPT_FOLLOWLOCATION: when enabled, cURL follows any "Location:" header that the server sends. This is recursive: cURL follows as many "Location:" headers as it receives, unless CURLOPT_MAXREDIRS is set to limit them;
CURLOPT_FORBID_REUSE: when enabled, the connection is closed explicitly once the transfer completes and is not reused;
CURLOPT_FRESH_CONNECT: when enabled, a new connection is used instead of a cached one;
CURLOPT_FTP_USE_EPRT: TRUE to use EPRT (and LPRT) when doing active FTP downloads. Use FALSE to disable EPRT and LPRT and use PORT only; Added in PHP 5.0.0.
CURLOPT_FTP_USE_EPSV: TRUE to first try an EPSV command for FTP transfers before reverting back to PASV. Set to FALSE to disable EPSV;
CURLOPT_FTPAPPEND: TRUE to append to the remote file instead of overwriting it;
CURLOPT_FTPASCII: An alias of CURLOPT_TRANSFERTEXT. Use that instead;
CURLOPT_FTPLISTONLY: TRUE to only list the names of an FTP directory;
CURLOPT_HEADER: when enabled, the response headers are included in the output;
CURLOPT_HTTPGET: when enabled, the HTTP request method is reset to GET. Since GET is the default, this is only needed if the request method has been changed;
CURLOPT_HTTPPROXYTUNNEL: when enabled, the transfer is tunneled through the given HTTP proxy;
CURLOPT_MUTE: when enabled, the cURL functions are completely silent;
CURLOPT_NETRC: when enabled, cURL scans the ~/.netrc file after the connection is established to find the user name and password for the remote site;
CURLOPT_NOBODY: when enabled, the body is excluded from the output and the request method is set to HEAD;
CURLOPT_NOPROGRESS: when enabled, the progress meter for cURL transfers is disabled. PHP sets this option to true by default;
CURLOPT_NOSIGNAL: when enabled, any cURL function that causes a signal to be sent to the PHP process is ignored. This is turned on by default in multi-threaded SAPIs so that timeout options can still be used;
CURLOPT_POST: when enabled, a regular HTTP POST is sent, of the type application/x-www-form-urlencoded, as commonly used by HTML forms;
CURLOPT_PUT: when enabled, a file is uploaded with HTTP PUT. The file must be set with CURLOPT_INFILE and CURLOPT_INFILESIZE;
CURLOPT_RETURNTRANSFER: when enabled, curl_exec() returns the transfer as a string instead of outputting it directly;
CURLOPT_SSL_VERIFYPEER: FALSE to stop cURL from verifying the peer's certificate. Alternate certificates to verify against can be specified with the CURLOPT_CAINFO option, or a certificate directory can be specified with the CURLOPT_CAPATH option. CURLOPT_SSL_VERIFYHOST may also need to be TRUE or FALSE if CURLOPT_SSL_VERIFYPEER is disabled (it defaults to 2). TRUE by default as of cURL 7.10. Default bundle installed as of cURL 7.10;
CURLOPT_TRANSFERTEXT: TRUE to use ASCII mode for FTP transfers. For LDAP, it retrieves data in plain text instead of HTML. On Windows systems, it will not set STDOUT to binary mode;
CURLOPT_UNRESTRICTED_AUTH: when enabled, the user name and password keep being sent when following locations (with CURLOPT_FOLLOWLOCATION), even if the host name has changed;
CURLOPT_UPLOAD: when enabled, cURL prepares for an upload;
CURLOPT_VERBOSE: when enabled, verbose information is output, written to STDERR or to the file specified with CURLOPT_STDERR;
CURLOPT_BUFFERSIZE: the size of the buffer to use for each read. There is no guarantee that this request will be honored;
CURLOPT_CLOSEPOLICY: either CURLCLOSEPOLICY_LEAST_RECENTLY_USED or CURLCLOSEPOLICY_OLDEST. There are three other CURLCLOSEPOLICY_ constants, but cURL does not support them yet;
CURLOPT_CONNECTTIMEOUT: the number of seconds to wait while trying to connect. Set it to 0 to wait indefinitely;
CURLOPT_DNS_CACHE_TIMEOUT: the number of seconds to keep DNS entries in memory. The default value is 120 seconds;
CURLOPT_FTPSSLAUTH: the FTP authentication method (when activated): CURLFTPAUTH_SSL (try SSL first), CURLFTPAUTH_TLS (try TLS first), or CURLFTPAUTH_DEFAULT (let cURL decide);
CURLOPT_HTTP_VERSION: the HTTP version for cURL to use: CURL_HTTP_VERSION_NONE (let cURL decide), CURL_HTTP_VERSION_1_0 (HTTP/1.0), or CURL_HTTP_VERSION_1_1 (HTTP/1.1);
CURLOPT_HTTPAUTH: the HTTP authentication method(s) to use. The options are CURLAUTH_BASIC, CURLAUTH_DIGEST, CURLAUTH_GSSNEGOTIATE, CURLAUTH_NTLM, CURLAUTH_ANY, and CURLAUTH_ANYSAFE. Several methods can be combined with the bitwise "|" (or) operator; cURL then polls the server to see which methods it supports and picks the best one. CURLAUTH_ANY is an alias for CURLAUTH_BASIC | CURLAUTH_DIGEST | CURLAUTH_GSSNEGOTIATE | CURLAUTH_NTLM, and CURLAUTH_ANYSAFE is an alias for CURLAUTH_DIGEST | CURLAUTH_GSSNEGOTIATE | CURLAUTH_NTLM;
CURLOPT_INFILESIZE: set the size of the uploaded file;
CURLOPT_LOW_SPEED_LIMIT: the transfer speed, in bytes per second, that the transfer should be below during CURLOPT_LOW_SPEED_TIME seconds for PHP to consider the transfer too slow and abort it;
CURLOPT_LOW_SPEED_TIME: the number of seconds the transfer should be below CURLOPT_LOW_SPEED_LIMIT for PHP to consider the transfer too slow and abort it;
CURLOPT_MAXCONNECTS: maximum number of connections allowed. if the maximum number is exceeded, CURLOPT_CLOSEPOLICY is used to determine which connections should be stopped;
CURLOPT_MAXREDIRS: specifies the maximum number of HTTP redirects. this option is used with CURLOPT_FOLLOWLOCATION;
CURLOPT_PORT: an alternative port number to connect to;
CURLOPT_PROXYAUTH: The HTTP authentication method(s) to use for the proxy connection. Use the same bitmasks as described in CURLOPT_HTTPAUTH. For proxy authentication, only CURLAUTH_BASIC and CURLAUTH_NTLM are currently supported.
CURLOPT_PROXYPORT: The port number of the proxy to connect to. This port number can also be set in CURLOPT_PROXY.
CURLOPT_PROXYTYPE: Either CURLPROXY_HTTP (default) or CURLPROXY_SOCKS5.
CURLOPT_RESUME_FROM: the offset, in bytes, to resume a transfer from (used for resumable transfers).
CURLOPT_SSL_VERIFYHOST: 1 to check the existence of a common name in the SSL peer certificate. 2 to check the existence of a common name and also verify that it matches the hostname provided.
CURLOPT_SSLVERSION: The SSL version (2 or 3) to use. By default PHP will try to determine this itself, although in some cases this must be set manually.
CURLOPT_TIMECONDITION: how CURLOPT_TIMEVALUE is treated. Use CURL_TIMECOND_IFMODSINCE to return the page only if it has been modified since the time specified in CURLOPT_TIMEVALUE; if it has not been modified, a "304 Not Modified" header is returned, assuming CURLOPT_HEADER is true. Use CURL_TIMECOND_IFUNMODSINCE for the reverse effect. The default value is CURL_TIMECOND_IFMODSINCE.
CURLOPT_TIMEOUT: the maximum number of seconds that cURL functions are allowed to execute.
CURLOPT_TIMEVALUE: the time in seconds since January 1, 1970, used by CURLOPT_TIMECONDITION. By default, CURL_TIMECOND_IFMODSINCE is used.
CURLOPT_CAINFO: The name of a file holding one or more certificates to verify the peer with. This only makes sense when used in combination with CURLOPT_SSL_VERIFYPEER.
CURLOPT_CAPATH: A directory that holds multiple CA certificates. Use this option alongside CURLOPT_SSL_VERIFYPEER.
CURLOPT_COOKIE: the contents of the "Cookie:" header to be sent in the HTTP request.
CURLOPT_COOKIEFILE: the name of the file containing cookie data. The cookie file can be in Netscape format, or just plain HTTP-style headers dumped into a file.
CURLOPT_COOKIEJAR: the name of a file to save all internal cookies to when the handle is closed.
CURLOPT_CUSTOMREQUEST: a custom request method to use instead of "GET" or "HEAD" when doing an HTTP request. This is useful for doing "DELETE" or other, more obscure HTTP requests. Valid values are things like "GET", "POST", "CONNECT" and so on; i.e. do not enter a whole HTTP request line here. For instance, entering "GET /index.html HTTP/1.0\r\n\r\n" would be incorrect.
Note: Don't do this without making sure the server supports the custom request method first.
CURLOPT_EGDSOCKET: like CURLOPT_RANDOM_FILE, except a filename to an Entropy Gathering Daemon socket.
CURLOPT_ENCODING: the contents of the "Accept-Encoding:" header. Supported encodings are "identity", "deflate", and "gzip". If it is set to an empty string, a header containing all supported encoding types is sent.
CURLOPT_FTPPORT: the value which will be used to get the IP address to use for the FTP "PORT" instruction. The "PORT" instruction tells the remote server to connect to our specified IP address. The string may be a plain IP address, a hostname, a network interface name (under Unix), or just a plain '-' to use the system's default IP address.
CURLOPT_INTERFACE: the name of the outgoing network interface to use. It can be an interface name, an IP address, or a host name.
CURLOPT_KRB4LEVEL: the KRB4 (Kerberos 4) security level. It can be one of the following values: "clear", "safe", "confidential", "private". The default value is "private". If it is set to NULL, KRB4 is disabled. Currently, KRB4 can only be used for FTP transfers.
CURLOPT_POSTFIELDS: the full data to send in an HTTP "POST" operation. To post a file, older PHP versions prefix the file name with @ and use the full path; in PHP 5.5 and later, the CURLFile class should be used instead.
CURLOPT_PROXY: sets the HTTP proxy server
CURLOPT_PROXYUSERPWD: the user name and password to use for the connection to the proxy, in the format [username]:[password].
CURLOPT_RANDOM_FILE: the name of a file used to seed the random number generator for SSL.
CURLOPT_RANGE: the range(s) of data to retrieve, in the format "X-Y", where X or Y may be omitted. HTTP transfers also support several ranges separated by commas, in the format "X-Y,N-M".
CURLOPT_REFERER: set the value of "Referer:" in the header.
CURLOPT_SSL_CIPHER_LIST: A list of ciphers to use for SSL. For example, RC4-SHA and TLSv1 are valid cipher lists.
CURLOPT_SSLCERT: the name of a file containing a PEM-formatted certificate.
CURLOPT_SSLCERTPASSWD: the password required to use the CURLOPT_SSLCERT certificate.
CURLOPT_SSLCERTTYPE: The format of the certificate. Supported formats are "PEM" (default), "DER", and "ENG".
CURLOPT_SSLENGINE: The identifier for the crypto engine of the private SSL key specified in CURLOPT_SSLKEY.
CURLOPT_SSLENGINE_DEFAULT: The identifier for the crypto engine used for asymmetric crypto operations.
CURLOPT_SSLKEY: The name of a file containing a private SSL key.
CURLOPT_SSLKEYPASSWD: The secret password needed to use the private SSL key specified in CURLOPT_SSLKEY.
Note: Since this option contains a sensitive password, remember to keep the PHP script it is contained within safe.
CURLOPT_SSLKEYTYPE: The key type of the private SSL key specified in CURLOPT_SSLKEY. Supported key types are "PEM" (default), "DER", and "ENG".
CURLOPT_URL: the URL to fetch. It can also be set when initializing a session with curl_init().
CURLOPT_USERAGENT: the contents of the "User-Agent:" header to be used in an HTTP request.
CURLOPT_USERPWD: the user name and password to use for the connection, in the format [username]:[password].
CURLOPT_HTTP200ALIASES: an array of HTTP 200 responses that will be treated as valid responses and not as errors.
CURLOPT_HTTPHEADER: an array of HTTP header fields to set.
CURLOPT_POSTQUOTE: An array of FTP commands to execute on the server after the FTP request has been performed.
CURLOPT_QUOTE: An array of FTP commands to execute on the server prior to the FTP request.
CURLOPT_FILE: the file that the transfer should be written to. The value is a stream resource; the default is STDOUT (the browser window).
CURLOPT_INFILE: the file that the transfer should be read from when uploading. The value is a stream resource.
CURLOPT_STDERR: an alternative location to output errors to instead of STDERR. The value is a stream resource.
CURLOPT_WRITEHEADER: the file that the header part of the transfer is written to. The value is a stream resource.
CURLOPT_HEADERFUNCTION: a callback function with two parameters: the first is the cURL handle, and the second is a string with the header data to be written. The header data must be written by this callback, which must return the number of bytes written.
CURLOPT_PASSWDFUNCTION: a callback function with three parameters: the first is the cURL handle, the second is a string containing a password prompt, and the third is the maximum password length. The callback returns the password.
CURLOPT_READFUNCTION: a callback function with three parameters: the first is the cURL handle, the second is the stream resource given to cURL through CURLOPT_INFILE, and the third is the maximum amount of data to read. The callback must return a string no longer than the amount requested, and an empty string to signal EOF.
CURLOPT_WRITEFUNCTION: a callback function with two parameters: the first is the cURL handle, and the second is a string with the data to be written. The data must be saved by this callback, which must return the exact number of bytes written.
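Several of these options are typically combined in one request. The sketch below shows one possible way to send a form-style POST with timeouts and a redirect limit; the helper names build_post_options() and post_form(), the timeout values, and the example header are assumptions for illustration, not part of the original article:

```php
<?php
// Hypothetical helper: build the option array for a form-style POST request.
function build_post_options(string $url, array $fields, int $timeout = 10): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_POST           => true,                      // send a regular HTTP POST
        CURLOPT_POSTFIELDS     => http_build_query($fields), // urlencoded request body
        CURLOPT_RETURNTRANSFER => true,                      // return the body as a string
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_MAXREDIRS      => 5,                         // limit the redirect chain
        CURLOPT_CONNECTTIMEOUT => 5,                         // seconds to wait for the connection
        CURLOPT_TIMEOUT        => $timeout,                  // seconds for the whole transfer
        CURLOPT_HTTPHEADER     => ['Accept: text/html'],     // extra request header
    ];
}

// Hypothetical helper: send the POST and return the body, or false on failure.
function post_form(string $url, array $fields)
{
    $ch = curl_init();
    curl_setopt_array($ch, build_post_options($url, $fields));
    $body = curl_exec($ch);
    curl_close($ch);
    return $body;
}
```

Building the option array in its own function keeps the request settings in one place, so they can be inspected or reused with curl_setopt_array() on other handles.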
The curl_copy_handle() function copies a cURL handle along with all of its options.
PHP code:
$ch = curl_init("http://qzone.myqq.us/");
$another = curl_copy_handle($ch);
curl_exec($another);
curl_close($another);
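One use for curl_copy_handle() is to prepare a base handle with shared options and derive one copy per URL. A minimal sketch under that assumption; the helper handle_for(), the User-Agent value, and the example.com URLs are placeholders:

```php
<?php
// Base handle carrying the options that every request shares.
$base = curl_init();
curl_setopt_array($base, [
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_TIMEOUT        => 10,
    CURLOPT_USERAGENT      => 'example-agent/1.0', // placeholder value
]);

// Hypothetical helper: copy the base handle and point the copy at $url.
function handle_for($base, string $url)
{
    $copy = curl_copy_handle($base);
    curl_setopt($copy, CURLOPT_URL, $url);
    return $copy;
}

$first  = handle_for($base, 'http://www.example.com/a');
$second = handle_for($base, 'http://www.example.com/b');
// Each copy can now be passed to curl_exec() on its own; setting the URL
// on a copy leaves $base and the other copy unchanged.
```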
The curl_error() function returns a string containing the last error message of the current session.
The curl_errno() function returns the error number of the last operation in the current session (0 if no error occurred).
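Both functions must be called before the handle is closed. The following sketch shows one way to check a transfer for failure; the function name fetch_with_error_check() and the shape of the returned array are assumptions for illustration:

```php
<?php
// Hypothetical helper: fetch $url and report failures via curl_errno()/curl_error().
function fetch_with_error_check(string $url): array
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);

    $body = curl_exec($ch); // false on failure because CURLOPT_RETURNTRANSFER is set

    // Read the error state before closing the handle.
    $result = [
        'ok'    => $body !== false,
        'body'  => $body === false ? null : $body,
        'errno' => curl_errno($ch), // 0 means no error
        'error' => curl_error($ch), // empty string means no error
    ];
    curl_close($ch);
    return $result;
}
```

A caller can then test $result['ok'] and log $result['errno'] and $result['error'] when the transfer failed, instead of treating false as page content.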
The curl_multi_init() function initializes a cURL batch (multi) handle.
The curl_multi_add_handle() function adds an individual cURL handle to a batch handle. It takes two parameters: the first is the batch handle, and the second is the individual cURL handle.
The curl_multi_exec() function runs the sub-connections of a batch handle. It takes two parameters: the first is the batch handle, and the second is passed by reference and receives the number of individual handles that are still being processed.
The curl_multi_remove_handle() function removes an individual handle from a batch handle. It takes two parameters: the first is the batch handle, and the second is the individual cURL handle.
The curl_multi_close() function closes a batch handle.
PHP code:
$ch1 = curl_init();
$ch2 = curl_init();
curl_setopt($ch1, CURLOPT_URL, "http://www.BkJia.com/");
curl_setopt($ch1, CURLOPT_HEADER, 0);
curl_setopt($ch2, CURLOPT_URL, "http://test.huangchao.org/");
curl_setopt($ch2, CURLOPT_HEADER, 0);
$mh = curl_multi_init();
curl_multi_add_handle($mh, $ch1);
curl_multi_add_handle($mh, $ch2);
do {
    curl_multi_exec($mh, $flag); // $flag receives the number of handles still running
} while ($flag > 0);
curl_multi_remove_handle($mh, $ch1);
curl_multi_remove_handle($mh, $ch2);
curl_multi_close($mh);
The curl_multi_getcontent() function returns the content of a handle as a string when CURLOPT_RETURNTRANSFER is set.
The curl_multi_info_read() function gets information about the current transfers.
The curl_multi_select() function waits for activity on any connection of a batch handle.
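These three functions are usually used together with curl_multi_exec(). The sketch below is one possible way to fetch several URLs in parallel without busy-waiting; the function name fetch_all(), the timeout values, and the choice to return null for failed transfers are assumptions for illustration:

```php
<?php
// Hypothetical helper: fetch several URLs in parallel.
// Returns the bodies under the same keys as $urls; failed transfers map to null.
function fetch_all(array $urls, int $timeout = 10): array
{
    $mh = curl_multi_init();
    $handles = [];
    foreach ($urls as $key => $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // needed for curl_multi_getcontent()
        curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);
        curl_multi_add_handle($mh, $ch);
        $handles[$key] = $ch;
    }

    // Drive the transfers; curl_multi_select() blocks until there is activity.
    do {
        $status = curl_multi_exec($mh, $running);
        if ($running > 0 && curl_multi_select($mh, 1.0) === -1) {
            usleep(100000); // select failed: wait briefly instead of spinning
        }
    } while ($running > 0 && ($status === CURLM_OK || $status === CURLM_CALL_MULTI_PERFORM));

    // Read the per-transfer result codes.
    $codes = [];
    while (($info = curl_multi_info_read($mh)) !== false) {
        foreach ($handles as $key => $ch) {
            if ($info['handle'] === $ch) {
                $codes[$key] = $info['result']; // CURLE_OK on success
            }
        }
    }

    // Collect the bodies and release the handles.
    $results = [];
    foreach ($handles as $key => $ch) {
        $ok = isset($codes[$key]) && $codes[$key] === CURLE_OK;
        $results[$key] = $ok ? curl_multi_getcontent($ch) : null;
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);
    return $results;
}
```

Compared with calling curl_multi_exec() in a tight loop, waiting in curl_multi_select() keeps CPU usage low while the transfers are in progress.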
There are many parameters, and most of them are useful. Once you have mastered them along with regular expressions, you will be well equipped for collecting pages.