The file_get_contents function is the key to the data collection below, so let's first look at its syntax:
string file_get_contents ( string $filename [, bool $use_include_path = false [, resource $context [, int $offset = -1 [, int $maxlen ]]]] )
Like file(), file_get_contents() reads a file, except that it returns the content as a single string rather than an array of lines. It reads up to $maxlen bytes starting at the position given by $offset. On failure, file_get_contents() returns FALSE.
The file_get_contents() function is the preferred way to read the contents of a file into a string. If the operating system supports it, memory-mapping techniques are used to enhance performance.
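Before moving on, the $offset and $maxlen parameters are easy to demonstrate against a local file. A minimal sketch (the temporary file and its contents are made up for illustration):

```php
<?php
// Write a known string to a temporary file, then read a slice of it.
$tmp = tempnam(sys_get_temp_dir(), 'fgc');
file_put_contents($tmp, "Hello, 111cn.net!");

// Read 5 bytes starting at byte offset 7.
$slice = file_get_contents($tmp, false, null, 7, 5);
echo $slice . "\n";   // prints "111cn"

// On failure file_get_contents() returns false, so always check:
$missing = @file_get_contents($tmp . '.does-not-exist');
var_dump($missing);   // bool(false)

unlink($tmp);
?>
```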
Example
The code is as follows:

<?php
$homepage = file_get_contents('http://www.111cn.net/');
echo $homepage;
?>
Now $homepage holds the content of the page we collected. That's enough introduction; let's begin.
Example
The code is as follows:

<?php
// Fetch the raw page source
function fetch_urlpage_contents($url) {
    $c = file_get_contents($url);
    return $c;
}

// Get the content matched between $begin and $end
function fetch_match_contents($begin, $end, $c) {
    $begin = change_match_string($begin);
    $end   = change_match_string($end);
    // i = case-insensitive, s = let "." match newlines as well
    $p = "#{$begin}(.*){$end}#is";
    if (preg_match($p, $c, $rs)) {
        return $rs[1];
    }
    return "";
}

// Escape regular-expression metacharacters
// (note: this is only a simple escape)
function change_match_string($str) {
    $old = array("/", "$");
    $new = array('\/', '\$');
    $str = str_replace($old, $new, $str);
    return $str;
}

// Collect the web page
function pick($url, $ft, $th) {
    $c = fetch_urlpage_contents($url);
    foreach ($ft as $key => $value) {
        $rs[$key] = fetch_match_contents($value["begin"], $value["end"], $c);
        if (isset($th[$key]) && is_array($th[$key])) {
            foreach ($th[$key] as $old => $new) {
                $rs[$key] = str_replace($old, $new, $rs[$key]);
            }
        }
    }
    return $rs;
}

$url = "http://www.111cn.net";            // the address to collect
$ft["title"]["begin"] = "<title>";        // start marker of the capture
$ft["title"]["end"]   = "</title>";       // end marker of the capture
$th["title"]["Zhongshan"] = "Guangdong";  // replacement within the captured part
$ft["body"]["begin"] = "<body>";          // start marker of the capture
$ft["body"]["end"]   = "</body>";         // end marker of the capture
$th["body"]["Zhongshan"] = "Guangdong";   // replacement within the captured part
$rs = pick($url, $ft, $th);               // start collecting
echo $rs["title"];
echo $rs["body"];
?>
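The capture logic above can be exercised without any network access. A minimal sketch, re-declaring the two helpers so the snippet stands alone (the sample markup is made up for illustration):

```php
<?php
// Escape regular-expression metacharacters (same simple escape as above)
function change_match_string($str) {
    $old = array("/", "$");
    $new = array('\/', '\$');
    return str_replace($old, $new, $str);
}

// Capture the content between $begin and $end
function fetch_match_contents($begin, $end, $c) {
    $begin = change_match_string($begin);
    $end   = change_match_string($end);
    $p = "#{$begin}(.*){$end}#is";
    if (preg_match($p, $c, $rs)) {
        return $rs[1];
    }
    return "";
}

// Sample in-memory HTML string instead of a live page
$html = "<html><head><title>Zhongshan news</title></head></html>";
$title = fetch_match_contents("<title>", "</title>", $html);
echo str_replace("Zhongshan", "Guangdong", $title); // prints "Guangdong news"
?>
```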
The following code is modified from the previous example; it is designed to extract all hyperlinks, e-mail addresses, or other specific content from a web page.
The code is as follows:

<?php
// Fetch the raw page source
function fetch_urlpage_contents($url) {
    $c = file_get_contents($url);
    return $c;
}

// Get all content matched between $begin and $end
function fetch_match_contents($begin, $end, $c) {
    $begin = change_match_string($begin);
    $end   = change_match_string($end);
    // i = case-insensitive, U = ungreedy (forbids greedy matching)
    $p = "#{$begin}(.*){$end}#iU";
    if (preg_match_all($p, $c, $rs)) {
        return $rs;
    }
    return "";
}

// Escape regular-expression metacharacters
// (note: this is only a simple escape)
function change_match_string($str) {
    $old = array("/", "$", "?");
    $new = array('\/', '\$', '\?');
    $str = str_replace($old, $new, $str);
    return $str;
}

// Collect the web page
function pick($url, $ft, $th) {
    $c = fetch_urlpage_contents($url);
    foreach ($ft as $key => $value) {
        $rs[$key] = fetch_match_contents($value["begin"], $value["end"], $c);
        if (isset($th[$key]) && is_array($th[$key])) {
            foreach ($th[$key] as $old => $new) {
                $rs[$key] = str_replace($old, $new, $rs[$key]);
            }
        }
    }
    return $rs;
}

$url = "http://www.111cn.net";  // the address to collect
$th = array();                  // no replacements this time
$ft["a"]["begin"] = '<a ';      // start marker of the capture
$ft["a"]["end"]   = '>';        // end marker of the capture
$rs = pick($url, $ft, $th);     // start collecting
print_r($rs["a"]);
?>
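The non-greedy multi-match idea is again easiest to see against an in-memory string. A minimal sketch (the HTML fragment is made up for illustration): pull every `<a ...>` tag out of a fragment without fetching anything.

```php
<?php
// Sample markup with two links
$html = '<p><a href="http://www.111cn.net">home</a> and '
      . '<a href="/about">about</a></p>';

// #...#iU : i = case-insensitive, U = ungreedy, so each match stops
// at the first ">" rather than swallowing the rest of the line.
preg_match_all('#<a (.*)>#iU', $html, $rs);

print_r($rs[1]);
// Array ( [0] => href="http://www.111cn.net" [1] => href="/about" )
?>
```

Without the U modifier, (.*) would run from the first `<a ` to the last `>` on the line, merging both links into one bogus match.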
A small hint: requests made with file_get_contents are very easy for the target site to detect and block, so we can use curl to imitate a real user visiting the site, which works much better than the approach above. file_get_contents() is also slightly less efficient and fails fairly often, whereas curl is very efficient and supports concurrent transfers, but it requires the curl extension to be enabled. The steps to enable the curl extension are:
1. Copy the three files php_curl.dll, libeay32.dll, and ssleay32.dll from the PHP folder into the system32 directory;
2. In php.ini (in the C:\Windows directory), remove the semicolon in front of extension=php_curl.dll;
3. Restart Apache or IIS.
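After restarting, it is worth confirming that the extension really loaded before calling any curl_* function. A minimal check:

```php
<?php
// extension_loaded() returns true only if the curl extension is active.
if (extension_loaded('curl')) {
    echo "curl is enabled\n";
} else {
    echo "curl is NOT enabled - check php.ini\n";
}
?>
```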
A simple page-fetching function that can also forge the Referer and User-Agent headers:
The code is as follows:

<?php
// Fetch a specified page
// $Url         : the page address to fetch
// $User_agent  : the User-Agent to send, e.g. "Baiduspider" or "Googlebot"
// $Referer_url : the Referer to send
function getsources($Url, $User_agent = '', $Referer_url = '') {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $Url);
    curl_setopt($ch, CURLOPT_USERAGENT, $User_agent);
    curl_setopt($ch, CURLOPT_REFERER, $Referer_url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $MySources = curl_exec($ch);
    curl_close($ch);
    return $MySources;
}

$Url = "http://www.111cn.net"; // the address whose content we want to fetch
$User_agent = "Baiduspider+(+http://www.baidu.com/search/spider.htm)";
$Referer_url = 'http://www.111cn.net/';
echo getsources($Url, $User_agent, $Referer_url);
?>
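The "supports multithreading" remark earlier refers to curl's multi interface, which drives several transfers concurrently within one process. A minimal sketch, assuming the curl extension is enabled (the multi_get helper name is my own, not part of curl):

```php
<?php
// Fetch several URLs concurrently with curl's multi interface.
function multi_get(array $urls) {
    $mh = curl_multi_init();
    $handles = array();
    foreach ($urls as $key => $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
        curl_multi_add_handle($mh, $ch);
        $handles[$key] = $ch;
    }
    // Drive all transfers until every handle has finished.
    do {
        $status = curl_multi_exec($mh, $running);
        if ($running) {
            curl_multi_select($mh); // wait for activity instead of busy-looping
        }
    } while ($running && $status === CURLM_OK);
    // Collect the bodies and release the handles.
    $results = array();
    foreach ($handles as $key => $ch) {
        $results[$key] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);
    return $results;
}

// Usage (network access required):
// $pages = multi_get(array('http://www.111cn.net/', 'http://www.111cn.net/phper/'));
?>
```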