Translator: limodou
Fetching and parsing a file is very simple. This tutorial shows how to do it step by step through an example. Let's get started!
First, we have to decide which URL to fetch. It can be set in the script or passed in through $QUERY_STRING ($_SERVER['QUERY_STRING'] in current PHP versions). For simplicity, let's set the variable directly in the script.
<?php
$url = 'http://www.php.net';
?>
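If the URL should come from the query string instead, a minimal sketch might look like the following. The helper name `pick_url`, the `url` parameter name, and the fallback default are assumptions made for illustration and are not part of the original tutorial. It uses parse_str() and filter_var() to reject anything that is not an http or https URL.

```php
<?php
// Hypothetical helper: choose the URL to fetch from a raw query string,
// falling back to a default when the parameter is missing or invalid.
function pick_url(string $query_string, string $default): string
{
    parse_str($query_string, $params);   // "url=..." becomes ['url' => '...']
    $candidate = $params['url'] ?? '';
    if (!is_string($candidate)) {        // e.g. "url[]=x" yields an array
        return $default;
    }

    // Accept only syntactically valid http:// or https:// URLs.
    $is_url  = filter_var($candidate, FILTER_VALIDATE_URL) !== false;
    $is_http = preg_match('#^https?://#i', $candidate) === 1;

    return ($is_url && $is_http) ? $candidate : $default;
}

// On the command line QUERY_STRING is unset, so the default is used.
$url = pick_url($_SERVER['QUERY_STRING'] ?? '', 'http://www.php.net');
```

Fetching a user-supplied URL lets visitors make the server request arbitrary addresses, so real code would usually also restrict the allowed hosts.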
Step two: fetch the specified file and store it in an array with the file() function.
<?php
$url = 'http://www.php.net';
$lines_array = file($url);
?>
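As a small illustration of what file() returns, the sketch below reads a temporary local file instead of a remote URL so that it can run offline. The helper name `read_lines` and the sample text are made up for the example. Reading an http:// URL this way additionally requires the allow_url_fopen setting to be enabled.

```php
<?php
// Hypothetical wrapper around file() that fails loudly instead of returning false.
function read_lines(string $source): array
{
    $lines = @file($source);
    if ($lines === false) {
        throw new RuntimeException("Could not read $source");
    }
    return $lines;
}

// Demo: a local temp file stands in for the remote page.
$tmp = tempnam(sys_get_temp_dir(), 'demo');
file_put_contents($tmp, "<html>\n<head>\n</head>\n");

$lines_array = read_lines($tmp);
unlink($tmp);

var_dump(count($lines_array));   // number of lines read
var_dump($lines_array[0]);       // each element keeps its trailing newline
```

Checking for `false` matters because file() returns it on failure, and passing that on to later steps would produce confusing errors further down.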
Now the file is in the array, one line per element. However, the text we want to parse may not all be on one line. To solve this, we can simply convert the array $lines_array into a string with the implode(X, Y) function. X is the separator placed between the elements: if you later want to use explode() to split the string back into an array, setting X to "|", "!", or a similar separator may be better. Here X is an empty string, because file() keeps the newline at the end of each line, so nothing extra is needed between the elements. Y is the other required parameter: the array you want implode() to process.
<?php
$url = 'http://www.php.net';
$lines_array = file($url);
$lines_string = implode('', $lines_array);
?>
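To make the role of the separator concrete, here is a short sketch with hard-coded sample data (the arrays are invented for the example). It shows the empty-string join used in this tutorial next to a join with a visible separator that explode() can reverse.

```php
<?php
// Lines as file() would return them: each keeps its trailing newline.
$lines_array = ["<head>\n", "<title>Demo</title>\n", "</head>\n"];

// Empty separator: the newlines already separate the lines.
$lines_string = implode('', $lines_array);

// A visible separator makes the join reversible with explode().
$joined   = implode('|', ['a', 'b', 'c']);
$restored = explode('|', $joined);

var_dump($lines_string, $joined, $restored);
```

When only the joined string is needed, file_get_contents($url) returns the whole file as one string in a single step.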
Now the fetching is done, and it is time for the parsing. For this example, we want everything between <head> and </head>. To parse the string, we also need something called a regular expression.
<?php
$url = 'http://www.php.net';
$lines_array = file($url);
$lines_string = implode('', $lines_array);
eregi("<head>(.*)</head>", $lines_string, $head);
?>
Let's take a look at the code. As you can see, the eregi() function is called in the following format:
eregi("<head>(.*)</head>", $lines_string, $head);
"(.*)" means everything, so the pattern can be read as "match everything between <head> and </head>". $lines_string is the string being parsed, and $head is the array in which the result is stored.
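Finally, we can output the data. Because there is only one <head>...</head> pair in the page, there is only one match, and it holds what we want: $head[0] is the whole matched text including the tags, and $head[1] is the part captured by (.*). Let's print it out.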
<?php
$url = 'http://www.php.net';
$lines_array = file($url);
$lines_string = implode('', $lines_array);
eregi("<head>(.*)</head>", $lines_string, $head);
echo $head[0];
?>
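Note that eregi() was deprecated in PHP 5.3 and removed in PHP 7.0, so it is only available on older versions. A rough equivalent with preg_match() might look like the sketch below, assuming the target is the text between <head> and </head>. The function name `extract_head` and the inline sample page are assumptions for the example. The `i` flag makes the match case-insensitive like eregi(), the `s` flag lets `.` match newlines, and `.*?` stops at the first closing tag.

```php
<?php
// Hypothetical helper: return the text between <head> and </head>, or null.
function extract_head(string $html): ?string
{
    // i: case-insensitive, s: "." also matches newlines.
    if (preg_match('#<head>(.*?)</head>#is', $html, $head) === 1) {
        return $head[1];   // $head[0] would include the tags themselves
    }
    return null;
}

// Sample page standing in for the fetched $lines_string.
$page = "<html>\n<HEAD>\n<title>Demo</title>\n</HEAD>\n<body></body>\n</html>";
echo extract_head($page);
```

Using `#` as the pattern delimiter avoids having to escape the slash in the closing tag.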
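That is all the code. The longer script below builds on the same ideas: it fetches a series of index pages, collects the links they contain, and pulls the multimedia entries out of the linked pages.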
<?php
// Fetch every index page, collect the URLs found in it, and save them to a file.
// $host_url is forwarded to get_content_url(), which takes the host prefix as its
// first argument; the original call passed only the page content.
function get_index($save_file, $prefix = "index_", $host_url = "") {
    $count = 68;
    $i = 1;
    if (file_exists($save_file)) @unlink($save_file);
    $fp = fopen($save_file, "a+") or die("Open " . $save_file . " failed");
    while ($i < $count) {
        $url = $prefix . $i . ".htm";
        echo "Get " . $url . "...";
        $url_str = get_content_url($host_url, get_url($url));
        echo "OK\n";
        fwrite($fp, $url_str);
        ++$i;
    }
    fclose($fp);
}

// Obtain the target multimedia objects
function get_object($url_file, $save_file, $split = "|--:**:--|") {
    if (!file_exists($url_file)) die($url_file . " does not exist");
    $file_arr = file($url_file);
    if (!is_array($file_arr) || empty($file_arr)) die($url_file . " has no content");
    $url_arr = array_unique($file_arr);
    if (file_exists($save_file)) @unlink($save_file);
    $fp = fopen($save_file, "a+") or die("Open save file " . $save_file . " failed");
    foreach ($url_arr as $url) {
        $url = trim($url);   // file() keeps the trailing newline of each line
        if (empty($url)) continue;
        echo "Get " . $url . "...";
        $html_str = get_url($url);
        // Debug output from the original; the exit stopped the script after the first URL.
        // echo $html_str;
        // echo $url;
        // exit;
        $obj_str = get_content_object($html_str);
        echo "OK\n";
        fwrite($fp, $obj_str);
    }
    fclose($fp);
}

// Retrieve file content by traversing a directory
function get_dir($save_file, $dir) {
    $dp = opendir($dir);
    if (file_exists($save_file)) @unlink($save_file);
    $fp = fopen($save_file, "a+") or die("Open save file " . $save_file . " failed");
    // Compare with !== so that a file named "0" does not end the loop early
    while (($file = readdir($dp)) !== false) {
        if ($file != "." && $file != "..") {
            echo "Read file " . $file . "...";
            $file_content = file_get_contents($dir . $file);
            $obj_str = get_content_object($file_content);
            echo "OK\n";
            fwrite($fp, $obj_str);
        }
    }
    closedir($dp);
    fclose($fp);
}

// Obtain the content of the specified URL
function get_url($url) {
    $reg = '/^http:\/\/[^\/].+$/';
    if (!preg_match($reg, $url)) die($url . " invalid");
    $fp = fopen($url, "r") or die("Open URL: " . $url . " failed.");
    $content = "";
    while ($fc = fread($fp, 8192)) {
        $content .= $fc;
    }
    fclose($fp);
    if (empty($content)) {
        die("Get URL: " . $url . " content failed.");
    }
    return $content;
}

// Use a socket to obtain the specified web page
function get_content_by_socket($url, $host) {
    $fp = fsockopen($host, 80) or die("Open " . $url . " failed");
    $header = "GET /" . $url . " HTTP/1.1\r\n";
    $header .= "Accept: */*\r\n";
    $header .= "Accept-Language: zh-cn\r\n";
    $header .= "Accept-Encoding: gzip, deflate\r\n";
    $header .= "User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; Maxthon; InfoPath.1; .NET CLR 2.0.50727)\r\n";
    $header .= "Host: " . $host . "\r\n";
    $header .= "Connection: Keep-Alive\r\n";
    // $header .= "Cookie: cnzz02=2; rtime=1; ltime=1148456424859; cnzz_eid=56601755-\r\n\r\n";
    $header .= "Connection: Close\r\n\r\n";
    fwrite($fp, $header);
    $contents = "";
    while (!feof($fp)) {
        $contents .= fgets($fp, 8192);
    }
    fclose($fp);
    return $contents;
}

// Obtain the URLs in the specified content
function get_content_url($host_url, $file_contents) {
    // $reg = '/^(#|javascript.*?|ftp:\/\/.+|http:\/\/.+|.*?href.*?|play.*?|index.*?|.*?asp)+$/i';
    // $reg = '/^(down.*?\.html|\d+_\d+\.htm.*?)$/i';
    $rex = "/([hH][rR][eE][fF])\s*=\s*['\"]*([^>'\"\s]+)[\"'>]*\s*/i";
    $reg = '/^(down.*?\.html)$/i';
    preg_match_all($rex, $file_contents, $r);
    $result = ""; // array();
    foreach ($r as $c) {
        if (is_array($c)) {
            foreach ($c as $d) {
                if (preg_match($reg, $d)) {
                    $result .= $host_url . $d . "\n";
                }
            }
        }
    }
    return $result;
}

// Obtain the multimedia file in the specified content
function get_content_object($str, $split = "|--:**:--|") {
    $regx = "/href\s*=\s*['\"]*([^>'\"\s]+)[\"'>]*\s*(.*?<\/b>)/i";
    preg_match_all($regx, $str, $result);
    // Return an empty string when nothing matched, so callers can still fwrite() it
    if (count($result) != 3 || empty($result[1])) return "";
    $result[2] = str_replace("Multimedia:", "", $result[2]);
    // The search string of this call is empty in the source text
    $result[2] = str_replace("", "", $result[2]);
    return $result[1][0] . $split . $result[2][0] . "\n";
}
?>
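The link collection done by get_content_url() can be tried offline on a small HTML fragment. The sketch below repeats the same two-step idea (collect the href values, then keep those that match a filter) in a standalone function. The name `filter_links`, the sample HTML, and the example.com host are assumptions for illustration. Unlike the script above, it tests only the captured href values rather than every capture group, returns an array instead of a newline-joined string, and drops duplicates.

```php
<?php
// Hypothetical standalone version of the href-collecting step:
// returns absolute URLs for every link whose target matches $filter.
function filter_links(string $html, string $host_url, string $filter): array
{
    // Capture the value of each href attribute (quoted or unquoted).
    preg_match_all('/href\s*=\s*[\'"]?([^>\'"\s]+)/i', $html, $matches);

    $links = [];
    foreach ($matches[1] as $target) {
        if (preg_match($filter, $target) === 1) {
            $links[] = $host_url . $target;
        }
    }
    // Remove duplicates and reindex from zero.
    return array_values(array_unique($links));
}

$html = '<a href="down_1.html">one</a> <a HREF=\'about.html\'>about</a> '
      . '<a href=down_2.html>two</a> <a href="down_1.html">again</a>';

print_r(filter_links($html, 'http://example.com/', '/^down.*?\.html$/i'));
```

Regular expressions are enough for simple, predictable pages like this one; for arbitrary HTML, PHP's DOMDocument class is usually a more robust way to read attributes.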