Garbled text caused by a character-encoding mismatch in crawled content can be fixed with a conversion such as $content = iconv("GBK", "UTF-8//IGNORE", $content);. What we discuss here is a different problem: how to crawl pages that are served with gzip enabled. How can you tell? If the response headers include Content-Encoding: gzip, the content is gzip-compressed. You can use Firebug to check whether a page has gzip turned on. Below is the header information Firebug shows for my blog, which has gzip enabled.
Request headers (raw)
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Encoding: gzip, deflate
Accept-Language: zh-cn,zh;q=0.8,en-us;q=0.5,en;q=0.3
Connection: keep-alive
Cookie: __utma=225240837.787252530.1317310581.1335406161.1335411401.1537; __utmz=225240837.1326850415.887.3.utmcsr=google|utmccn=(Organic)|utmcmd=organic|utmctr=%e4%bb%bb%e4%bd%95%e9%A1%b9%e7%9b%ae%e9%83%bd%e4%b8%8d%e4%bc%9a%e9%82%a3%e4%b9%88%e7%ae%80%e5%8d%95%20site%3awww.nowamagic.net; PHPSESSID=888MJ4425P8S0M7S0FRRE3OVC7; __utmc=225240837; __utmb=225240837.1.10.1335411401
Host: www.nowamagic.net
User-Agent: Mozilla/5.0 (Windows NT 5.1; rv:12.0) Gecko/20100101 Firefox/12.0
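The same check can be done in code instead of in Firebug. The following is a minimal sketch, assuming the response headers are available as an array of raw header lines (the shape PHP's $http_response_header has after an HTTP file_get_contents() call). The helper name is_gzip_response is hypothetical and not part of the original article.

```php
<?php
// Hypothetical helper: returns true when any raw header line declares
// Content-Encoding: gzip. Header names are matched case-insensitively.
function is_gzip_response(array $headers)
{
    foreach ($headers as $line) {
        if (preg_match('/^content-encoding:\s*(.*)$/i', trim($line), $m)) {
            // The value may list several codings, e.g. "gzip, identity".
            $codings = array_map('trim', explode(',', strtolower($m[1])));
            if (in_array('gzip', $codings, true)) {
                return true;
            }
        }
    }
    return false;
}

// Sample header lines standing in for a real response.
$sample = array(
    'HTTP/1.1 200 OK',
    'Content-Type: text/html; charset=utf-8',
    'Content-Encoding: gzip',
);
var_dump(is_gzip_response($sample));
```

Note that the Accept-Encoding line in a request only says what the client can handle; it is the Content-Encoding line in the response that tells you the body is compressed.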
Here are some solutions:
1. Use the built-in zlib library
If the server has the zlib extension installed, the following code easily solves the garbled-output problem.
$data = file_get_contents("compress.zlib://" . $url);
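To see what the compress.zlib:// wrapper does without depending on a remote server, the sketch below writes a gzip-compressed temporary file and reads it back through the wrapper. The temporary file is only a stand-in for $url, and the example assumes the zlib extension is loaded.

```php
<?php
// Create a gzip-compressed file to stand in for a gzip-encoded page.
$path = tempnam(sys_get_temp_dir(), 'gz');
file_put_contents($path, gzencode('<html>hello</html>'));

// A plain read returns the compressed bytes, which start with the
// gzip magic number 0x1f 0x8b.
$raw = file_get_contents($path);
var_dump(substr($raw, 0, 2) === "\x1f\x8b");

// Reading through compress.zlib:// decompresses transparently.
$data = file_get_contents('compress.zlib://' . $path);
var_dump($data);

unlink($path);
```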
2. Use cURL instead of file_get_contents
function curl_get($url, $gzip = false) {
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 10);
    // The key is here: cURL sends Accept-Encoding: gzip and decodes the response.
    if ($gzip) curl_setopt($curl, CURLOPT_ENCODING, "gzip");
    $content = curl_exec($curl);
    curl_close($curl);
    return $content;
}
3. Use a gzip decompression function
// PHP 5.4 and later ship a built-in gzdecode(); define this fallback only when it is missing.
if (!function_exists('gzdecode')) {
    function gzdecode($data) {
        $len = strlen($data);
        // 10-byte header + 8-byte footer is the minimum size of a gzip stream.
        if ($len < 18 || strcmp(substr($data, 0, 2), "\x1f\x8b")) {
            return null; // Not GZIP format (see RFC 1952)
        }
        $method = ord(substr($data, 2, 1)); // Compression method
        $flags = ord(substr($data, 3, 1)); // Flags
        if (($flags & 31) != $flags) {
            // Reserved bits are set, which RFC 1952 does not allow
            return null;
        }
        // Note: $mtime may be negative (PHP integer limitations)
        $mtime = unpack("V", substr($data, 4, 4));
        $mtime = $mtime[1];
        $xfl = substr($data, 8, 1);
        $os = substr($data, 9, 1);
        $headerlen = 10;
        $extralen = 0;
        $extra = "";
        if ($flags & 4) {
            // 2-byte length-prefixed EXTRA data in header
            if ($len - $headerlen - 2 < 8) {
                return false; // Invalid format
            }
            $extralen = unpack("v", substr($data, $headerlen, 2));
            $extralen = $extralen[1];
            if ($len - $headerlen - 2 - $extralen < 8) {
                return false; // Invalid format
            }
            $extra = substr($data, $headerlen + 2, $extralen);
            $headerlen += 2 + $extralen;
        }
        $filenamelen = 0;
        $filename = "";
        if ($flags & 8) {
            // C-style string file NAME data in header
            if ($len - $headerlen - 1 < 8) {
                return false; // Invalid format
            }
            $filenamelen = strpos(substr($data, $headerlen), chr(0));
            if ($filenamelen === false || $len - $headerlen - $filenamelen - 1 < 8) {
                return false; // Invalid format
            }
            $filename = substr($data, $headerlen, $filenamelen);
            $headerlen += $filenamelen + 1;
        }
        $commentlen = 0;
        $comment = "";
        if ($flags & 16) {
            // C-style string COMMENT data in header
            if ($len - $headerlen - 1 < 8) {
                return false; // Invalid format
            }
            $commentlen = strpos(substr($data, $headerlen), chr(0));
            if ($commentlen === false || $len - $headerlen - $commentlen - 1 < 8) {
                return false; // Invalid header format
            }
            $comment = substr($data, $headerlen, $commentlen);
            $headerlen += $commentlen + 1;
        }
        $headercrc = "";
        if ($flags & 2) {
            // FHCRC (bit 1 in RFC 1952): the 2 lowest-order bytes of the header CRC32 are present
            if ($len - $headerlen - 2 < 8) {
                return false; // Invalid format
            }
            $calccrc = crc32(substr($data, 0, $headerlen)) & 0xffff;
            $headercrc = unpack("v", substr($data, $headerlen, 2));
            $headercrc = $headercrc[1];
            if ($headercrc != $calccrc) {
                return false; // Bad header CRC
            }
            $headerlen += 2;
        }
        // GZIP footer: these may be negative due to PHP's integer limitations
        $datacrc = unpack("V", substr($data, -8, 4));
        $datacrc = $datacrc[1];
        $isize = unpack("V", substr($data, -4));
        $isize = $isize[1];
        // Perform the decompression:
        $bodylen = $len - $headerlen - 8;
        if ($bodylen < 1) {
            // This should never happen; it would be an implementation bug
            return null;
        }
        $body = substr($data, $headerlen, $bodylen);
        $data = "";
        if ($bodylen > 0) {
            switch ($method) {
                case 8:
                    // Currently the only supported compression method:
                    $data = gzinflate($body);
                    break;
                default:
                    // Unknown compression method
                    return false;
            }
        } else {
            // I'm not sure if zero-byte body content is allowed.
            // Allow it for now and do nothing.
        }
        // Verify decompressed size and CRC32.
        // Note: this may fail with large data sizes, depending on how
        // PHP's integer limitations affect strlen(), since $isize
        // may be negative for large sizes.
        if ($isize != strlen($data) || crc32($data) != $datacrc) {
            // Bad format! Length or CRC doesn't match!
            return false;
        }
        return $data;
    }
}
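The fixed part of the header that the function above walks through is 10 bytes long in RFC 1952. As an illustration only, the sketch below reads those fields in a single unpack() call. The helper name parse_gzip_header is hypothetical, and gzencode() is used just to produce sample input.

```php
<?php
// Hypothetical helper: decode the fixed 10-byte gzip header (RFC 1952).
// Returns false when the data is too short or the magic number is wrong.
function parse_gzip_header($data)
{
    if (strlen($data) < 10) {
        return false;
    }
    // n = 16-bit big-endian, C = unsigned byte, V = 32-bit little-endian.
    $h = unpack('nmagic/Cmethod/Cflags/Vmtime/Cxfl/Cos', substr($data, 0, 10));
    if ($h['magic'] !== 0x1f8b) {
        return false;
    }
    return $h;
}

$header = parse_gzip_header(gzencode('sample body'));
printf("method=%d flags=%d\n", $header['method'], $header['flags']);
```

The optional fields (extra data, file name, comment, header CRC) follow these 10 bytes only when the corresponding bit in flags is set, which is why the full function has to advance its header length step by step.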
Usage:
$html = file_get_contents('http://www.jb51.net/');
$html = gzdecode($html);
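One caveat with this usage: if the server did not actually compress the response, passing the body to gzdecode() fails. Below is a minimal sketch of a guard, assuming the gzip magic bytes \x1f\x8b at the start of the body are a good enough signal. The helper name decode_if_gzip is hypothetical, and gzencode() only supplies sample data.

```php
<?php
// Hypothetical helper: decompress only when the body starts with the gzip
// magic number; otherwise return the body unchanged.
function decode_if_gzip($body)
{
    if (strlen($body) >= 18 && substr($body, 0, 2) === "\x1f\x8b") {
        $decoded = gzdecode($body);
        // Fall back to the original bytes if decompression fails.
        return ($decoded === false || $decoded === null) ? $body : $decoded;
    }
    return $body;
}

$plain = '<html>not compressed</html>';
$packed = gzencode('<html>compressed</html>');

var_dump(decode_if_gzip($plain));
var_dump(decode_if_gzip($packed));
```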
With the three methods introduced above, you should be able to solve most garbled-output problems caused by gzip when crawling.