PHP projects often need to crawl pages whose content arrives in various encodings and compression formats. Two common problems come up during page crawling:
1. Inconsistent page encoding: the local code uses UTF-8 while the fetched page is GBK, which produces garbled characters.
2. Some websites gzip-compress their pages, so fetching them directly returns compressed binary data and the captured result is unusable. Both problems are illustrated in the short sketch below.
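As a quick, self-contained illustration of the two fixes (a minimal sketch only, not the final class further down; the URL is a placeholder):

<?php
$url = 'http://www.test.com/1.html';

// Problem 2: gzip. The first two bytes of the response reveal gzip compression
// (magic bytes 0x1f 0x8b); prepending the compress.zlib:// wrapper lets PHP
// decompress the stream transparently.
$fp  = fopen($url, 'rb');
$bin = fread($fp, 2);
fclose($fp);
if ($bin === "\x1f\x8b") {
    $url = 'compress.zlib://' . $url;
}
$html = file_get_contents($url);

// Problem 1: encoding. Detect the page encoding and convert it to UTF-8.
$encoding = mb_detect_encoding($html, array('UTF-8', 'GBK', 'GB2312', 'BIG5'));
$html = mb_convert_encoding($html, 'UTF-8', $encoding ?: 'GBK');

The class below wraps the same two ideas, plus a Snoopy fallback, into reusable methods.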
After searching for related solutions online and testing them locally, I sorted the relevant functions into a small class for later reuse. The code is also on GitHub: https://github.com/lock-upme/Spider
A brief description of the main program: first try file_get_contents to fetch the page; if that fails, fall back to Snoopy; finally, convert the page encoding.
For the program, see:
/**
 * Capture page content
 *
 * $obj = new Spider();
 * $result = $obj->spider('http://www.test.com/1.html');
 *
 * @author lock
 */
class Spider {

    /**
     * Capture page content
     *
     * @param string $url
     * @return string|false
     */
    public function spider($url) {
        set_time_limit(10);
        $result = self::fileGetContents($url);
        if (empty($result)) {
            $result = self::snoopy($url);
        }
        if (empty($result)) {
            return false;
        }
        $result = self::array_iconv($result);
        if (empty($result)) {
            return false;
        }
        $result = str_replace("\n", "", $result);
        return $result;
    }

    /**
     * Get page content with file_get_contents
     *
     * @param string $url
     * @return string
     */
    public function fileGetContents($url) {
        // Read only the first 2 bytes: 1f 8b in hex (31 139 in decimal) means gzip is enabled.
        $file = @fopen($url, 'rb');
        $bin  = @fread($file, 2);
        @fclose($file);
        $strInfo  = @unpack('C2chars', $bin);
        $typeCode = intval($strInfo['chars1'] . $strInfo['chars2']);
        // Prepend the compress.zlib:// wrapper so PHP decompresses the gzip stream transparently.
        $url = ($typeCode == 31139) ? 'compress.zlib://' . $url : $url;
        return @file_get_contents($url);
    }

    /**
     * Get page content with Snoopy
     *
     * @param string $url
     * @return string
     */
    public function snoopy($url) {
        require_once 'Snoopy.class.php';
        $snoopy = new Snoopy;
        $snoopy->agent = 'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.146 Safari/537.36';
        $snoopy->_fp_timeout = 10;
        // urlSimplify() extracts the domain to use as the referer; see the full class on GitHub.
        $urlSplit = self::urlSimplify($url);
        $snoopy->referer = $urlSplit['domain'];
        $snoopy->fetch($url);
        return $snoopy->results;
    }

    /**
     * Convert the encoding of a string or array (adapted from the web)
     *
     * @param array|string $data   data to convert
     * @param string       $output target encoding
     * @return array|string the converted data
     */
    public function array_iconv($data, $output = 'utf-8') {
        $encodeArr = array('UTF-8', 'ASCII', 'GBK', 'GB2312', 'BIG5', 'JIS', 'eucjp-win', 'sjis-win', 'EUC-JP');
        if (!is_array($data)) {
            $encoded = mb_detect_encoding($data, $encodeArr);
            if (empty($encoded)) {
                $encoded = 'utf-8';
            }
            return @mb_convert_encoding($data, $output, $encoded);
        }
        // Convert both keys and values recursively.
        foreach ($data as $key => $val) {
            $key = self::array_iconv($key, $output);
            $data[$key] = self::array_iconv($val, $output);
        }
        return $data;
    }
}
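Note that snoopy() calls self::urlSimplify(), which is not shown above; the real helper lives in the full class in the GitHub repository. A minimal sketch of what it might look like, assuming only that the 'domain' key used above must hold the scheme plus host:

    /**
     * Split a URL into parts (sketch only; the actual helper is in the GitHub repo).
     *
     * @param string $url
     * @return array e.g. array('scheme' => 'http', 'domain' => 'http://www.test.com', 'path' => '/1.html')
     */
    public function urlSimplify($url) {
        $parts  = parse_url($url);
        $scheme = isset($parts['scheme']) ? $parts['scheme'] : 'http';
        $host   = isset($parts['host']) ? $parts['host'] : '';
        return array(
            'scheme' => $scheme,
            'domain' => $scheme . '://' . $host,
            'path'   => isset($parts['path']) ? $parts['path'] : '/',
        );
    }

Usage stays as shown in the class docblock: instantiate the class and call spider() with the target URL; it returns the UTF-8 page content, or false if the page could not be fetched.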