Implementing Web Page Scraping with PHP cURL

Source: Internet
Author: User

Reasons to Choose Curl


Here is a plain comparison of cURL and file_get_contents, quoted from elsewhere:
file_get_contents is essentially a convenience wrapper around the built-in file functions such as file_exists, fopen, fread, and fclose. It is primarily meant for local files, and support for network files was added for the same convenience.
cURL is a library dedicated to network communication. It offers many options for handling different environments (timeouts, headers, cookies, POST data), so it is generally more robust for network requests than file_get_contents.

How to use

1. Enable cURL support

A default PHP installation on Windows often has cURL support disabled. To enable it, open php.ini, find the line ;extension=php_curl.dll, remove the leading semicolon, and restart the web server.

2. Fetch data with cURL

// Initialize a cURL handle
$curl = curl_init();
// Set the URL you want to fetch
curl_setopt($curl, CURLOPT_URL, 'http://www.111cn.net');
// Include the response headers in the output
curl_setopt($curl, CURLOPT_HEADER, 1);
// Return the result as a string instead of printing it to the screen
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
// Run cURL and request the web page
$data = curl_exec($curl);
// Close the cURL handle
curl_close($curl);
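
The steps above do not check whether the request succeeded. As a rough sketch, they could be wrapped in a small function that returns null when curl_exec fails and reports the reason through curl_error. The name fetch_page, its parameters, and the 10-second default are assumptions made for illustration, not part of the original example.

```php
<?php
/**
 * Fetch a URL and return the response body, or null on failure.
 * fetch_page() is a hypothetical helper name used only for this sketch.
 */
function fetch_page(string $url, int $timeoutSeconds = 10, ?string &$error = null): ?string
{
    $curl = curl_init();
    curl_setopt($curl, CURLOPT_URL, $url);
    curl_setopt($curl, CURLOPT_HEADER, 0);          // body only, no response headers
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);  // return the body as a string
    curl_setopt($curl, CURLOPT_TIMEOUT, $timeoutSeconds);

    $data = curl_exec($curl);
    if ($data === false) {
        // curl_exec() returns false on failure; curl_error() explains why.
        $error = curl_error($curl);
        return null;
    }
    // The handle is released when $curl goes out of scope.
    return $data;
}
```

A caller would then write something like `$data = fetch_page('http://www.111cn.net', 10, $err);` and inspect `$err` whenever `$data` is null.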

3. Find the key data with a regular expression

// $data is the value returned by curl_exec, i.e. the content of the fetched page
preg_match_all('/<li class="item">(.*?)<\/li>/', $data, $out, PREG_SET_ORDER);
foreach ($out as $key => $value) {
    // $value is an array holding the whole matched string and each captured group
    echo 'Whole match: ' . $value[0] . "\n";
    echo 'Captured group: ' . $value[1] . "\n";
}
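
To try the matching step without making a network request, the same pattern can be run against a hard-coded string. The sample markup below is invented for illustration and simply stands in for the $data returned by curl_exec. The s modifier is an addition so that "." also matches line breaks inside an item.

```php
<?php
// Sample markup standing in for the $data returned by curl_exec().
$data = '<ul><li class="item">First</li><li class="item">Second</li></ul>';

// The non-greedy (.*?) stops at the nearest </li>.
preg_match_all('/<li class="item">(.*?)<\/li>/s', $data, $out, PREG_SET_ORDER);

$items = array();
foreach ($out as $value) {
    // $value[0] is the whole match, $value[1] is the first captured group.
    $items[] = $value[1];
}

print_r($items);
```

Here $items ends up holding only the text inside each list item, while $out[0][0] still contains the first item together with its surrounding tags.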

Tips

1. Timeout settings

curl_setopt($ch, option, value) can set several timeout-related options:
CURLOPT_TIMEOUT: the maximum number of seconds cURL is allowed to run.
CURLOPT_TIMEOUT_MS: the maximum number of milliseconds cURL is allowed to run. (Added in cURL 7.16.2; available from PHP 5.2.3.)
CURLOPT_CONNECTTIMEOUT: the number of seconds to wait while trying to connect. If set to 0, wait indefinitely.
CURLOPT_CONNECTTIMEOUT_MS: the number of milliseconds to wait while trying to connect. If set to 0, wait indefinitely. (Added in cURL 7.16.2; available from PHP 5.2.3.)

CURLOPT_DNS_CACHE_TIMEOUT: the number of seconds to keep DNS entries in memory. The default is 120 seconds.

curl_setopt($ch, CURLOPT_TIMEOUT, 60); // A timeout in whole seconds is all you need here

curl_setopt($ch, CURLOPT_NOSIGNAL, 1); // Note: this must be set for a millisecond timeout to work
curl_setopt($ch, CURLOPT_TIMEOUT_MS, 200); // Timeout in milliseconds; added in cURL 7.16.2, available from PHP 5.2.3
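
The timeout options listed above can also be applied in one call with curl_setopt_array. The following is only a sketch: apply_timeouts is a hypothetical helper name, and the 500 ms and 2000 ms values are arbitrary examples rather than recommendations.

```php
<?php
// apply_timeouts() is a hypothetical helper; the values passed in are illustrative.
function apply_timeouts($ch, int $connectMs, int $totalMs): bool
{
    // curl_setopt_array() returns true only if every option was accepted.
    return curl_setopt_array($ch, array(
        CURLOPT_NOSIGNAL          => 1,           // set alongside millisecond timeouts
        CURLOPT_CONNECTTIMEOUT_MS => $connectMs,  // limit for establishing the connection
        CURLOPT_TIMEOUT_MS        => $totalMs,    // limit for the whole transfer
    ));
}

$ch = curl_init('http://www.111cn.net');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$ok = apply_timeouts($ch, 500, 2000);

// A later curl_exec($ch) that exceeds either limit returns false,
// and curl_errno($ch) reports 28 (operation timed out).
```

Keeping the connect limit shorter than the total limit lets an unreachable host fail quickly while still giving a slow but responsive server time to finish.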

2. Submit data with POST and keep cookies

The following example, quoted from elsewhere, is worth studying:
// Simulated login to a Discuz forum with cURL, written for Discuz 7.0

!extension_loaded('curl') && die('The curl extension is not loaded.');

$discuz_url = 'http://www.111cn.net'; // Forum address
$login_url = $discuz_url . '/logging.php?action=login'; // Login page address
$get_url = $discuz_url . '/my.php?item=threads'; // "My posts" page

$post_fields = array();
// The following two items do not need to be modified
$post_fields['loginfield'] = 'username';
$post_fields['loginsubmit'] = 'true';
// The user name and password must be filled in
$post_fields['username'] = 'lxvoip';
$post_fields['password'] = '88888888';
// Security question
$post_fields['questionid'] = 0;
$post_fields['answer'] = '';
// @todo Verification code
$post_fields['seccodeverify'] = '';

// Get the formhash value from the login form
$ch = curl_init($login_url);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$contents = curl_exec($ch);
curl_close($ch);
preg_match('/<input\s*type="hidden"\s*name="formhash"\s*value="(.*?)"\s*\/>/i', $contents, $matches);
if (!empty($matches)) {
    $formhash = $matches[1];
    // Submit the hash with the login form, as a browser would
    $post_fields['formhash'] = $formhash;
} else {
    die('Could not find the formhash.');
}

// POST the data and save the cookies
// $cookie_file = dirname(__FILE__) . '/cookie.txt'; // Alternative: a fixed file next to the script
$cookie_file = tempnam(sys_get_temp_dir(), 'cookie');
$ch = curl_init($login_url);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, $post_fields);
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie_file); // Cookies are written here when the handle is closed
curl_exec($ch);
curl_close($ch);

// Send the saved cookies to fetch a page that is only visible after login
$ch = curl_init($get_url);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // Return the page as a string so it can be dumped below
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie_file);
$contents = curl_exec($ch);
curl_close($ch);

var_dump($contents);
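
One detail worth knowing about the POST step: when CURLOPT_POSTFIELDS receives an array, PHP sends the request as multipart/form-data, whereas a URL-encoded string is sent as application/x-www-form-urlencoded, which is what an ordinary HTML login form submits. If a site rejects the array form, encoding the fields with http_build_query may be worth trying. The sketch below only shows the encoding step and reuses a few field values from the example above.

```php
<?php
$post_fields = array(
    'loginfield' => 'username',
    'username'   => 'lxvoip',
    'password'   => '88888888',
    'questionid' => 0,
);

// Encode the fields the way a browser encodes a normal form submission.
$body = http_build_query($post_fields);
echo $body, "\n";

// With a handle $ch created by curl_init($login_url), the string would be passed as:
// curl_setopt($ch, CURLOPT_POSTFIELDS, $body);
```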

Here are some lessons learned from using cURL for scraping.


Encoding conversion

First look at the page source to find out which encoding the target site uses, then convert the content with the mb_convert_encoding function.
Usage:

// The source string is $str

// The original encoding is known to be GBK; convert it to UTF-8
$str = mb_convert_encoding($str, "UTF-8", "GBK");

// The original encoding is unknown; let "auto" detect it and convert to UTF-8
$str = mb_convert_encoding($str, "UTF-8", "auto");
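
Checking the page source by hand can be partly automated by reading the charset declared in the page's meta tag. This is a sketch under stated assumptions: to_utf8 is a hypothetical helper name, the regular expression covers only the common meta tag forms, and the declared charset must be one that the mbstring extension recognizes (otherwise mb_convert_encoding raises an error).

```php
<?php
// to_utf8() is a hypothetical helper. It reads the charset declared in the
// page's <meta> tag and falls back to $fallback when none is declared.
function to_utf8(string $html, string $fallback = 'GBK'): string
{
    $charset = $fallback;
    if (preg_match('/<meta[^>]+charset\s*=\s*["\']?([\w-]+)/i', $html, $m)) {
        $charset = $m[1];
    }
    if (strcasecmp($charset, 'UTF-8') === 0) {
        return $html; // already UTF-8, nothing to convert
    }
    return mb_convert_encoding($html, 'UTF-8', $charset);
}

// "\xD6\xD0\xCE\xC4" is the GBK byte sequence for the two characters 中文.
$gbkPage = '<meta charset="gbk"><p>' . "\xD6\xD0\xCE\xC4" . '</p>';
$utf8Page = to_utf8($gbkPage);
echo $utf8Page, "\n";
```

Naming the source encoding explicitly tends to be more predictable than "auto", because automatic detection depends on the mbstring detection order configured on the server.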

3. To avoid mismatches caused by line breaks, spaces, and similar variations, first strip the line breaks, spaces, and tabs from the fetched source

Method 1: replace with str_replace
$contents = str_replace("\r\n", "", $contents); // Remove Windows line breaks
$contents = str_replace("\n", "", $contents); // Remove line breaks
$contents = str_replace("\t", "", $contents); // Remove tabs
$contents = str_replace(" ", "", $contents); // Remove spaces

Method 2: replace with a regular expression

$contents = preg_replace("/[\r\n\t ]+/", "", $contents);
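
Stripping spaces also removes the spaces inside tags and between words, so any pattern applied afterwards has to be written without them. The short sketch below, using an invented sample string, shows the effect and a gentler variant that removes only line breaks and tabs.

```php
<?php
$contents = "<p class=\"content\">\r\n\tHello world\r\n</p>";

// Remove line breaks, tabs, and spaces, as in Method 2.
$stripped = preg_replace('/[\r\n\t ]+/', '', $contents);
echo $stripped, "\n";

// Variant: keep spaces, remove only line breaks and tabs.
$compact = preg_replace('/[\r\n\t]+/', '', $contents);
echo $compact, "\n";
```

After full stripping the tag reads `<pclass="content">`, so a later pattern has to match that spelling. The variant keeps `<p class="content">` and the space in "Hello world" intact.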

4. Use preg_match_all with a regular expression to find the code fragments you want to extract

Function signature:

int preg_match_all(string pattern, string subject, array matches [, int flags])
pattern is the regular expression.
subject is the original text to search.
matches is an array that receives the results.
flags selects how the results are stored:
    PREG_PATTERN_ORDER; // The default. A two-dimensional array: $arr1[0] is the array of all full matches (boundaries included), and $arr1[1] is the array of all strings captured by the first group (boundaries removed)
    PREG_SET_ORDER; // A two-dimensional array ordered by match: $arr2[0][0] is the first full match (boundaries included), $arr2[0][1] is the first match with the boundaries removed, and so on for each following match
    PREG_OFFSET_CAPTURE; // A three-dimensional array: $arr3[0][0][0] is the first full match (boundaries included) and $arr3[0][0][1] is its byte offset in the subject; $arr3[1][0][0] is the first match with the boundaries removed and $arr3[1][0][1] is the byte offset of that captured text
 
// Practical example (spaces were already stripped from $contents in step 3, hence <pclass)
preg_match_all('/<pclass="content">(.*?)<\/p>/', $contents, $out, PREG_SET_ORDER);
// $out receives all the matched elements
// $out[0][0] is the whole first match, including <pclass="content"> and </p>
// $out[0][1] contains only the text matched by the (.*?) group in parentheses

// So the captured text of the n-th match is
$out[n-1][1]

// If the regular expression contains several groups in parentheses, the m-th group of the n-th match is
$out[n-1][m]
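
The difference between the flags is easiest to see by running one pattern three times on the same short string. The subject string and variable names below are made up for the comparison.

```php
<?php
$subject = '<b>one</b><b>two</b>';
$pattern = '/<b>(.*?)<\/b>/';

preg_match_all($pattern, $subject, $byPattern, PREG_PATTERN_ORDER);
preg_match_all($pattern, $subject, $bySet, PREG_SET_ORDER);
preg_match_all($pattern, $subject, $withOffsets, PREG_SET_ORDER | PREG_OFFSET_CAPTURE);

print_r($byPattern[1]);      // group 1 of every match
print_r($bySet[1]);          // full match and group 1 of the second match
print_r($withOffsets[1][1]); // captured text and byte offset of group 1 in the second match
```

With PREG_PATTERN_ORDER the first index selects the group, with PREG_SET_ORDER it selects the match, and PREG_OFFSET_CAPTURE replaces each string with a pair of the string and its byte offset in the subject.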

5. After extracting the text you want, remove the HTML tags. PHP's built-in strip_tags function does this easily

Example:
$result = strip_tags($out[0][1]);
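
A small illustration of strip_tags on an invented fragment follows. The second argument, a list of tags to keep, is optional. Note that strip_tags leaves HTML entities untouched, so html_entity_decode is shown as a possible follow-up step.

```php
<?php
$fragment = 'Price: <b>42</b> <a href="/buy">buy now</a><br />';

// Remove every tag and keep only the text.
$text = strip_tags($fragment);

// Keep <b> tags and remove everything else.
$keepBold = strip_tags($fragment, '<b>');

// Entities such as &amp; are not decoded by strip_tags.
$decoded = html_entity_decode(strip_tags('<i>Tom &amp; Jerry</i>'), ENT_QUOTES, 'UTF-8');

echo $text, "\n", $keepBold, "\n", $decoded, "\n";
```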

The steps above only cover downloading and extracting the data. In practice the extracted content still has to be saved to a database, which takes only a simple PHP function that inserts each item.
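
The article does not include the storage function itself, so the following is only one possible sketch of that step. It assumes the PDO extension with the SQLite driver and uses an in-memory database so that it runs without any setup. The names save_items and scraped_items are hypothetical, and a real project would point the PDO connection at its own database.

```php
<?php
// save_items() and the scraped_items table are hypothetical names for this sketch.
function save_items(PDO $pdo, array $items): int
{
    $pdo->exec('CREATE TABLE IF NOT EXISTS scraped_items (id INTEGER PRIMARY KEY, content TEXT NOT NULL)');

    // A prepared statement keeps scraped text from being interpreted as SQL.
    $stmt = $pdo->prepare('INSERT INTO scraped_items (content) VALUES (:content)');
    $saved = 0;
    foreach ($items as $item) {
        $stmt->execute(array(':content' => $item));
        $saved++;
    }
    return $saved;
}

$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$count = save_items($pdo, array('First', 'Second'));
$rows = $pdo->query('SELECT content FROM scraped_items ORDER BY id')->fetchAll(PDO::FETCH_COLUMN);
print_r($rows);
```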
