A PHP page-fetching program that handles different encodings and compression

Source: Internet
Author: User
PHP projects often need to fetch pages that use different character encodings or arrive compressed. Two problems commonly come up when crawling such pages.

1. The page encoding is inconsistent: the local code is UTF-8 but the fetched page is GBK, which produces garbled characters.

2. Some websites compress their pages with gzip, so the fetched result comes back as compressed bytes rather than readable HTML.
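Both problems can be reproduced locally without fetching anything. The following is a minimal sketch, assuming the mbstring and zlib extensions are enabled; the sample strings and variable names are illustrative and do not come from the original program.

```php
<?php
// Problem 1: GBK bytes are not valid UTF-8 until they are converted.
// The two characters "中文" encoded as GBK are the four bytes D6 D0 CE C4.
$gbk = "\xD6\xD0\xCE\xC4";
$isUtf8Before = mb_check_encoding($gbk, 'UTF-8');   // invalid as UTF-8
$utf8 = mb_convert_encoding($gbk, 'UTF-8', 'GBK');  // convert GBK -> UTF-8
$isUtf8After = mb_check_encoding($utf8, 'UTF-8');   // valid after conversion

// Problem 2: a gzip-compressed body starts with the magic bytes 1f 8b
// and has to be decompressed before it can be read as HTML.
$html = '<html><body>hello</body></html>';
$gzipped = gzencode($html);
$looksGzipped = strncmp($gzipped, "\x1f\x8b", 2) === 0;
$restored = $looksGzipped ? gzdecode($gzipped) : $gzipped;
```

Printing `$gbk` directly in a UTF-8 terminal shows the garbled characters described in problem 1, while `$utf8` displays correctly.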

After searching for solutions online and testing them locally, I collected the relevant functions for later use. The code is also on GitHub: https://github.com/lock-upme/Spider

A brief description of the main program:

First, the program fetches the page with file_get_contents. If that returns nothing, it falls back to Snoopy. Whichever method succeeds, the result is then converted to the target encoding.
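The fallback flow can be sketched on its own, separately from the real fetchers. This is an illustration rather than the author's implementation: `fetchWithFallback` and the two closures are hypothetical stand-ins for file_get_contents and Snoopy, and the fallback is assumed to return GBK bytes.

```php
<?php
/**
 * Try each fetcher in order and return the first non-empty body.
 * $fetchers is a hypothetical list of callables taking a URL and
 * returning a string body, or false on failure.
 */
function fetchWithFallback($url, array $fetchers)
{
    foreach ($fetchers as $fetch) {
        $body = $fetch($url);
        if (is_string($body) && $body !== '') {
            return $body;
        }
    }
    return false;
}

// Stand-in for a primary fetcher that fails.
$primary = function ($url) {
    return false;
};
// Stand-in for a fallback fetcher that returns GBK-encoded bytes.
$fallback = function ($url) {
    return "\xD6\xD0\xCE\xC4";
};

$raw = fetchWithFallback('http://www.test.com/1.html', array($primary, $fallback));
// Convert only when something was actually fetched.
$page = ($raw === false) ? false : mb_convert_encoding($raw, 'UTF-8', 'GBK');
```

Passing the fetchers as callables keeps the ordering explicit and makes the flow testable without network access.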

The program:

    /**
     * Capture page content.
     *
     * $obj = new Spider();
     * $result = $obj->spider('http://www.test.com/1.html');
     *
     * @author lock
     */
    class Spider
    {
        // An explicit constructor stops PHP versions before 8.0 from treating
        // spider() as an old-style constructor because it matches the class name.
        public function __construct()
        {
        }

        /**
         * Capture page content.
         *
         * @param string $url
         * @return string|false
         */
        public function spider($url)
        {
            set_time_limit(10);
            $result = $this->fileGetContents($url);
            if (empty($result)) {
                $result = $this->snoopy($url);
            }
            if (empty($result)) {
                return false;
            }
            $result = $this->array_iconv($result);
            if (empty($result)) {
                return false;
            }
            $result = str_replace("\n", "", $result);
            return $result;
        }

        /**
         * Get page content with file_get_contents.
         *
         * @param string $url
         * @return string|false
         */
        public function fileGetContents($url)
        {
            // Read only the first 2 bytes: gzip data starts with 1f 8b (hex),
            // which is 31 139 in decimal.
            $file = @fopen($url, 'rb');
            if ($file === false) {
                return false;
            }
            $bin = @fread($file, 2);
            @fclose($file);
            $strInfo = (is_string($bin) && strlen($bin) === 2) ? unpack('C2chars', $bin) : false;
            $typeCode = $strInfo ? intval($strInfo['chars1'] . $strInfo['chars2']) : 0;
            // Read through the zlib stream wrapper when the page is gzipped.
            $url = ($typeCode == 31139) ? 'compress.zlib://' . $url : $url;
            return @file_get_contents($url);
        }

        /**
         * Get page content with Snoopy.
         *
         * @param string $url
         * @return string
         */
        public function snoopy($url)
        {
            require_once 'Snoopy.class.php';
            $snoopy = new Snoopy;
            $snoopy->agent = 'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.146 Safari/537.36';
            $snoopy->_fp_timeout = 10;
            // urlSimplify() is not included in this listing.
            $urlSplit = $this->urlSimplify($url);
            $snoopy->referer = $urlSplit['domain'];
            $snoopy->fetch($url);
            return $snoopy->results;
        }

        /**
         * Convert the encoding of data fetched from the network.
         *
         * @param array|string $data   data to convert
         * @param string       $output target encoding
         * @return array|string the converted data
         */
        public function array_iconv($data, $output = 'utf-8')
        {
            if (is_array($data)) {
                // Convert keys and values recursively, detecting the encoding per string.
                $converted = array();
                foreach ($data as $key => $val) {
                    $key = $this->array_iconv($key, $output);
                    $converted[$key] = $this->array_iconv($val, $output);
                }
                return $converted;
            }
            if (!is_string($data)) {
                return $data;
            }
            $encodeArr = array('UTF-8', 'ASCII', 'GBK', 'GB2312', 'BIG5', 'JIS', 'eucjp-win', 'sjis-win', 'EUC-JP');
            $encoded = mb_detect_encoding($data, $encodeArr);
            if (empty($encoded)) {
                $encoded = 'UTF-8';
            }
            return @mb_convert_encoding($data, $output, $encoded);
        }
    }
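The gzip check used in fileGetContents can be tried without a network connection. The sketch below assumes the zlib extension is enabled and uses a temporary local file in place of a remote page; the file name prefix and variable names are illustrative.

```php
<?php
// Write a gzip-compressed "page" to a temporary file.
$html = '<html><body>compressed page</body></html>';
$path = tempnam(sys_get_temp_dir(), 'spider');
file_put_contents($path, gzencode($html));

// Read the first two bytes and compare them with the gzip magic number 1f 8b.
$handle = fopen($path, 'rb');
$magic = '';
if ($handle !== false) {
    $magic = fread($handle, 2);
    fclose($handle);
}
$isGzip = ($magic === "\x1f\x8b");

// Prefix the path with the zlib stream wrapper so reads are decompressed.
$source = $isGzip ? 'compress.zlib://' . $path : $path;
$content = file_get_contents($source);

unlink($path);
```

Comparing the two raw bytes directly is equivalent to the unpack-and-concatenate check (31 followed by 139) in the class.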
