My program lets users enter URLs from other sites so that it can fetch the resources. Before fetching, I need to know the size of each resource; otherwise an overly large resource takes too long and wastes bandwidth. I found that HTTP has a HEAD method, which returns only the headers for a resource. How do I make curl fetch only the HTTP headers without downloading the whole body?
Also, is Content-Length guaranteed to be present in the response headers? It is the only way I know to get the size of a resource. If it is missing, I would like a fallback: set a maximum download size in curl and have the transfer fail with an error once that size is exceeded. Does curl have an option for this?
Finally, how does a server support the HEAD method?
Reply content:
Actually, curl has supported the HEAD method for a long time.
Just add this line to your code and the HEAD method is selected automatically:
curl_setopt($ch, CURLOPT_NOBODY, true);
If you want to read Content-Length, you only need to read it from the handle after curl_exec:
$size = curl_getinfo($ch, CURLINFO_CONTENT_LENGTH_DOWNLOAD);
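The two steps above can be sketched together as follows, assuming PHP's curl extension is loaded. The function names `normalize_content_length` and `head_content_length` are made up for illustration, and following redirects is an assumption about what the caller wants. libcurl reports -1 when the length is unknown, so the helper maps that to null.

```php
<?php
// Hypothetical helper: curl_getinfo() reports -1 when no Content-Length was sent.
function normalize_content_length($raw): ?int
{
    $n = (int) $raw;
    return $n >= 0 ? $n : null;
}

// Hypothetical helper: issue a HEAD request and return the declared size, or null.
function head_content_length(string $url): ?int
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_NOBODY, true);          // switches the request to HEAD
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // do not echo anything
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // assumption: redirects are followed
    $ok = curl_exec($ch);
    $raw = curl_getinfo($ch, CURLINFO_CONTENT_LENGTH_DOWNLOAD);
    return $ok === false ? null : normalize_content_length($raw);
}
```

A null result means either the request failed or the server did not declare a size, so the caller still needs a fallback.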
Note that although most servers support HEAD, not all of them do; some servers disable it in their configuration to deter crawlers. Content-Length is also not a required header. So if the value is present and exceeds your maximum, you can return an error right away; if it is missing, or within the maximum, you still have to judge by the size of the content actually downloaded.
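The rule above can be written down as a small decision function. This is only a sketch: the name `precheck_size` and the two return strings are invented for illustration.

```php
<?php
// Hypothetical helper; $declared is null when the server sent no Content-Length.
function precheck_size(?int $declared, int $maxSize): string
{
    if ($declared !== null && $declared > $maxSize) {
        return 'reject'; // the declared size already exceeds the limit
    }
    // Missing or acceptable value: the real size must still be checked while downloading.
    return 'count_while_downloading';
}
```

Note that a small declared value is never treated as proof; it only means the request is allowed to proceed.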
As for a maximum download length, I have not seen such a setting, but there is a better solution to this problem: use the two callbacks CURLOPT_HEADERFUNCTION and CURLOPT_WRITEFUNCTION. Then a single request is enough to do all the checks, and the transfer can be aborted at any time.
$size = 0;
$max_size = 123456;

curl_setopt($ch, CURLOPT_HEADERFUNCTION, function ($ch, $str) use ($max_size) {
    // The first parameter is the curl handle; the second is one header line.
    $parts = explode(':', $str, 2);
    if (count($parts) === 2) { // the status line and the blank line have no colon
        $name = strtolower(trim($parts[0]));
        $value = trim($parts[1]);
        // Check the declared size
        if ('content-length' === $name && (int) $value > $max_size) {
            return 0; // aborts the transfer
        }
    }
    return strlen($str); // must return the number of bytes handled to continue
});

// When there is no Content-Length, count the bytes as they arrive
curl_setopt($ch, CURLOPT_WRITEFUNCTION, function ($ch, $str) use (&$size, $max_size) {
    $len = strlen($str);
    $size += $len;
    if ($size > $max_size) {
        return 0; // aborts the transfer
    }
    return $len;
});
Why use curl at all? Just open a connection with fsockopen and send a HEAD request yourself.
However, a HEAD response does not necessarily include the size of the resource; that does not seem to be guaranteed.
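A rough sketch of the fsockopen approach, for plain HTTP on port 80 only. The helper names `build_head_request` and `fetch_head` and the default timeout are assumptions made for illustration; HTTPS, redirects, and error reporting are left out.

```php
<?php
// Hypothetical helper: build a raw HEAD request.
function build_head_request(string $host, string $path = '/'): string
{
    return "HEAD {$path} HTTP/1.1\r\n"
         . "Host: {$host}\r\n"
         . "Connection: close\r\n\r\n";
}

// Hypothetical helper: send the request and return the raw response headers, or null.
function fetch_head(string $host, string $path = '/', int $port = 80, float $timeout = 5.0): ?string
{
    $fp = @fsockopen($host, $port, $errno, $errstr, $timeout);
    if ($fp === false) {
        return null; // connection failed
    }
    stream_set_timeout($fp, (int) $timeout); // bound the read as well as the connect
    fwrite($fp, build_head_request($host, $path));
    $response = stream_get_contents($fp);
    fclose($fp);
    return $response === false ? null : $response;
}
```

The returned string still has to be searched for Content-Length, and, as noted above, the header may simply be absent.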
curl_setopt($curl, CURLOPT_HEADER, true);
With this option, the result returned by curl_exec also includes the HTTP response headers, from which the Content-Length value can be extracted.
HTTP/1.1 200 OK
Server: Apache
Content-Type: text/html
Content-Encoding: gzip
Content-Length: 26395
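One way to pull the value out of such a header block is a case-insensitive, per-line regular expression. The function name `extract_content_length` is hypothetical; if redirects were followed and several header blocks are present, this sketch returns the first match.

```php
<?php
// Hypothetical helper: return Content-Length from a raw header block, or null if absent.
function extract_content_length(string $rawHeaders): ?int
{
    // "m" makes ^ and $ match per line; "i" because header names are case-insensitive.
    if (preg_match('/^Content-Length:\s*(\d+)\s*$/mi', $rawHeaders, $m)) {
        return (int) $m[1];
    }
    return null;
}
```

When CURLOPT_HEADER is on, the header block can be separated from the body with `substr($response, 0, curl_getinfo($curl, CURLINFO_HEADER_SIZE))` before it is passed to the helper.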
This length value is unreliable; a script on the server can set it to anything.
Setting a maximum fetch size is the right approach. The remote server cannot be trusted, and the Content-Length it reports is not necessarily the real size. To prevent abuse, you still have to enforce a size limit yourself.
You can also add one more check: if a domain often returns a Content-Length that does not match the actual content, give it a lower reputation. When a user submits a fetch request for a low-reputation domain, it can be deferred or processed at low priority.
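The reputation idea could be tracked with something as simple as a per-domain mismatch counter. Everything in this sketch is an assumption made for illustration: the class name `DomainReputation`, the in-memory storage, and the 50% threshold.

```php
<?php
// Hypothetical class: counts how often a domain's declared size differs from the real one.
final class DomainReputation
{
    /** @var array<string, array{total: int, mismatch: int}> */
    private array $stats = [];

    public function record(string $domain, ?int $declared, int $actual): void
    {
        $s = $this->stats[$domain] ?? ['total' => 0, 'mismatch' => 0];
        $s['total']++;
        if ($declared !== null && $declared !== $actual) {
            $s['mismatch']++; // only a declared-but-wrong size counts against the domain
        }
        $this->stats[$domain] = $s;
    }

    public function isLowReputation(string $domain, float $threshold = 0.5): bool
    {
        $s = $this->stats[$domain] ?? null;
        if ($s === null) {
            return false; // unknown domains start with a clean record
        }
        return $s['mismatch'] / $s['total'] > $threshold;
    }
}
```

A scheduler could then consult `isLowReputation()` before queueing a fetch and push flagged domains to a low-priority queue.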
Add a maximum execution time as well; curl can enforce timeouts.
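For example, the connect time and the total transfer time can be bounded with two options. The helper name `timeout_options` and the default values are illustrative assumptions; the sketch requires the curl extension for the constants.

```php
<?php
// Hypothetical helper: options that bound connect time and total transfer time.
function timeout_options(int $connectSeconds = 5, int $totalSeconds = 30): array
{
    return [
        CURLOPT_CONNECTTIMEOUT => $connectSeconds, // give up if no connection by then
        CURLOPT_TIMEOUT        => $totalSeconds,   // hard cap on the whole request
    ];
}
```

The array can be applied to an existing handle with `curl_setopt_array($ch, timeout_options());`.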