Capturing and analyzing web pages with PHP

Source: Internet
Author: User

Translator: limodou
Capturing and analyzing a web page is simple. This tutorial walks through an example step by step. Let's get started!

First, we need to decide which URL to capture. It can be set in the script or passed in through $QUERY_STRING. For simplicity, let's set the variable directly in the script.

<?php $url = 'http://www.php.net'; ?>
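If the URL is taken from the query string instead, it is worth validating before use. The following is a minimal sketch, not part of the original tutorial: the helper name pick_url() and the parameter name "url" are hypothetical, and it assumes PHP 7 or later.

```php
<?php
// Hypothetical helper: pick the target URL from a raw query string,
// falling back to a default when the parameter is missing or invalid.
function pick_url(string $queryString, string $default): string
{
    parse_str($queryString, $params);      // "url=..." becomes ['url' => '...']
    $candidate = $params['url'] ?? '';
    if (!is_string($candidate)) {          // e.g. "url[]=x" yields an array
        return $default;
    }
    // Accept only syntactically valid URLs...
    if (filter_var($candidate, FILTER_VALIDATE_URL) === false) {
        return $default;
    }
    // ...and only the http and https schemes.
    $scheme = strtolower((string) parse_url($candidate, PHP_URL_SCHEME));
    if (!in_array($scheme, ['http', 'https'], true)) {
        return $default;
    }
    return $candidate;
}

// The "url" parameter name and the default value are assumptions.
$url = pick_url($_SERVER['QUERY_STRING'] ?? '', 'http://www.php.net');
echo $url, "\n";
```

Restricting the scheme keeps a request such as ?url=file:///etc/passwd from reaching file().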

Step 2: capture the specified file and store it in an array with the file() function.

<?php
$url = 'http://www.php.net';
$lines_array = file($url);
?>
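To see what file() returns without touching the network, the sketch below reads a small temporary file instead of a URL. The sample content is made up for illustration. Reading a URL this way additionally requires the allow_url_fopen setting to be enabled.

```php
<?php
// Write a small sample file so the example runs without network access.
$path = tempnam(sys_get_temp_dir(), 'cap');
file_put_contents($path, "<html>\n<head><title>Demo</title></head>\n</html>\n");

// file() returns one array element per line, with the newline kept,
// or false on failure.
$lines_array = file($path);
if ($lines_array === false) {
    die("Could not read $path");
}
echo count($lines_array), "\n";
var_dump($lines_array[0]);

// FILE_IGNORE_NEW_LINES strips the trailing newline from each element.
$trimmed = file($path, FILE_IGNORE_NEW_LINES);
var_dump($trimmed[0]);

unlink($path);
```

Checking for false matters more with URLs than with local files, since a remote host can be unreachable at any time.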

Now the file is available in the array. However, the text we want to analyze may not all be on one line. To handle this, we can simply convert the array $lines_array into a single string with the implode(x, y) function. If you later want to split the string apart again with explode(), setting x to "|", "!", or a similar separator may be better. For our purpose, though, x should be an empty string: file() keeps the line ending on every element, so nothing needs to be added between them. y is the other required parameter; it is the array you want implode() to process.

<?php
$url = 'http://www.php.net';
$lines_array = file($url);
$lines_string = implode('', $lines_array);
?>
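The effect of the separator can be sketched with a hand-written array standing in for the result of file(). The sample values are illustrative only.

```php
<?php
// Stand-in for what file() would return: each element keeps its newline.
$lines_array = ["<head>\n", "<title>Demo</title>\n", "</head>\n"];

// Empty separator: the pieces are concatenated exactly as they were read.
$lines_string = implode('', $lines_array);
echo $lines_string;

// A visible separator makes the join reversible with explode().
$joined = implode('|', ['a', 'b', 'c']);
$parts  = explode('|', $joined);
echo $joined, "\n";
print_r($parts);
```

The separator only needs to be a character that cannot occur in the data; for HTML joined back into one document, the empty string is the natural choice.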

Now the capture is finished and the analysis can begin. For this example, we want everything between <head> and </head>. To analyze the string, we also need something called a regular expression.

<?php
$url = 'http://www.php.net';
$lines_array = file($url);
$lines_string = implode('', $lines_array);
// eregi() needs PHP 5; it was removed in PHP 7.0.
eregi("<head>(.*)</head>", $lines_string, $head);
?>

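Let's take a look at the code. As you can see, the eregi() function is called in the following format:

eregi("<head>(.*)</head>", $lines_string, $head);

"(.*)" matches everything, so the pattern can be read as "capture whatever lies between <head> and </head>". $lines_string is the string being analyzed, and $head is the array in which the result is stored. Note that eregi() was deprecated in PHP 5.3 and removed in PHP 7.0; on current versions the equivalent is preg_match() with the i modifier.
Finally, we can output the data. Because there is only one instance of <head> ... </head>, we can safely assume that $head[0] holds the text we want ($head[1] holds only the part captured by the parentheses). Let's print it out.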

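<?php
$url = 'http://www.php.net';
$lines_array = file($url);
$lines_string = implode('', $lines_array);
// eregi() needs PHP 5; it was removed in PHP 7.0.
eregi("<head>(.*)</head>", $lines_string, $head);
echo $head[0];
?>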
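On PHP 7 and later, where eregi() no longer exists, the same extraction can be sketched with preg_match(). This is one possible equivalent rather than the tutorial's own code: it assumes the target is the content of the <head> element, and the sample string is invented so the example runs offline.

```php
<?php
// Sample input standing in for $lines_string (hypothetical content).
$lines_string = "<html><HEAD>\n<title>Demo</title>\n</HEAD><body></body></html>";

// i: case-insensitive matching, as eregi() did.
// s: let "." match newlines, since the head section spans several lines.
// The lazy (.*?) stops at the first closing tag instead of the last.
$found = preg_match('/<head>(.*?)<\/head>/is', $lines_string, $head);

if ($found === 1) {
    echo $head[0], "\n";   // whole match, tags included
    echo $head[1], "\n";   // only the text captured by (.*?)
} else {
    echo "No head section found\n";
}
```

preg_match() returns 1 on a match, 0 on no match, and false on error, so comparing against 1 covers both failure cases.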

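That completes the example. The longer script that follows fetches pages and uses regular expressions to pull links and media entries out of them.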

<?php
// Get the URLs found on every index page and save them to a file.
// $host_url is passed on to get_content_url(), which prepends it to each link.
function get_index($save_file, $prefix = "index_", $host_url = "") {
    $count = 68;
    $i = 1;
    if (file_exists($save_file)) @unlink($save_file);
    $fp = fopen($save_file, "a+") or die("open " . $save_file . " failed");
    while ($i < $count) {
        $url = $prefix . $i . ".htm";
        echo "get " . $url . "...";
        $url_str = get_content_url($host_url, get_url($url));
        echo " OK\n";
        fwrite($fp, $url_str);
        ++$i;
    }
    fclose($fp);
}

// Obtain the target multimedia objects from a list of URLs.
function get_object($url_file, $save_file, $split = "|--:**:--|") {
    if (!file_exists($url_file)) die($url_file . " not exist");
    $file_arr = file($url_file);
    if (!is_array($file_arr) || empty($file_arr)) die($url_file . " not content");
    $url_arr = array_unique($file_arr);
    if (file_exists($save_file)) @unlink($save_file);
    $fp = fopen($save_file, "a+") or die("open save file " . $save_file . " failed");
    foreach ($url_arr as $url) {
        $url = trim($url); // file() keeps the newline on each entry
        if (empty($url)) continue;
        echo "get " . $url . "...";
        $html_str = get_url($url);
        // Debugging aid: uncomment to dump the first page and stop.
        // echo $html_str; echo $url; exit;
        $obj_str = get_content_object($html_str);
        echo " OK\n";
        fwrite($fp, $obj_str);
    }
    fclose($fp);
}

// Retrieve file content by traversing a directory.
// $dir is expected to end with a directory separator.
function get_dir($save_file, $dir) {
    $dp = opendir($dir);
    if (file_exists($save_file)) @unlink($save_file);
    $fp = fopen($save_file, "a+") or die("open save file " . $save_file . " failed");
    while (($file = readdir($dp)) !== false) {
        if ($file != "." && $file != "..") {
            echo "read file " . $file . "...";
            $file_content = file_get_contents($dir . $file);
            $obj_str = get_content_object($file_content);
            echo " OK\n";
            fwrite($fp, $obj_str);
        }
    }
    closedir($dp);
    fclose($fp);
}

// Obtain the content of the specified URL.
function get_url($url) {
    $reg = '/^http:\/\/[^\/].+$/';
    if (!preg_match($reg, $url)) die($url . " invalid");
    $fp = fopen($url, "r") or die("open url: " . $url . " failed.");
    $content = "";
    while ($fc = fread($fp, 8192)) {
        $content .= $fc;
    }
    fclose($fp);
    if (empty($content)) {
        die("get url: " . $url . " content failed.");
    }
    return $content;
}

// Use a socket to obtain the specified web page (headers included).
function get_content_by_socket($url, $host) {
    $fp = fsockopen($host, 80) or die("open " . $url . " failed");
    $header = "GET /" . $url . " HTTP/1.1\r\n";
    $header .= "Accept: */*\r\n";
    $header .= "Accept-Language: zh-cn\r\n";
    $header .= "Accept-Encoding: gzip, deflate\r\n";
    $header .= "User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; Maxthon; InfoPath.1; .NET CLR 2.0.50727)\r\n";
    $header .= "Host: " . $host . "\r\n";
    //$header .= "Cookie: cnzz02=2; rtime=1; ltime=1148456424859; cnzz_eid=56601755-\r\n\r\n";
    // Close the connection after the response so that feof() becomes true.
    $header .= "Connection: Close\r\n\r\n";
    fwrite($fp, $header);
    $contents = "";
    while (!feof($fp)) {
        $contents .= fgets($fp, 8192);
    }
    fclose($fp);
    return $contents;
}

// Obtain the URLs in the specified content.
function get_content_url($host_url, $file_contents) {
    //$reg = '/^(#|javascript.*?|ftp:\/\/.+|http:\/\/.+|.*?href.*?|play.*?|index.*?|.*?asp)+$/i';
    //$reg = '/^(down.*?\.html|\d+_\d+\.htm.*?)$/i';
    $rex = "/([hH][rR][eE][Ff])\s*=\s*['\"]*([^>'\"\s]+)[\"'>]*\s*/i";
    $reg = '/^(down.*?\.html)$/i';
    preg_match_all($rex, $file_contents, $r);
    $result = ""; //array();
    foreach ($r as $c) {
        if (is_array($c)) {
            foreach ($c as $d) {
                if (preg_match($reg, $d)) { $result .= $host_url . $d . "\n"; }
            }
        }
    }
    return $result;
}

// Obtain the multimedia file in the specified content.
function get_content_object($str, $split = "|--:**:--|") {
    $regx = "/href\s*=\s*['\"]*([^>'\"\s]+)[\"'>]*\s*(<b>.*?<\/b>)/i";
    preg_match_all($regx, $str, $result);
    if (count($result) == 3) {
        $result[2] = str_replace("<b>Multimedia:", "", $result[2]);
        $result[2] = str_replace("</b>", "", $result[2]);
        $result = $result[1][0] . $split . $result[2][0] . "\n";
    }
    return $result;
}
?>
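The link extraction in get_content_url() can be tried in isolation. The sketch below uses a simplified pattern of the same shape as $rex on an invented HTML fragment, then applies the script's "down...html" filter. The sample markup and the use of preg_grep() are assumptions for illustration, not part of the script.

```php
<?php
// Invented sample markup: double-quoted, single-quoted and unquoted hrefs.
$html = '<a href="down1.html">One</a> <a HREF=\'page2.htm\'>Two</a> <a href=down3.html>Three</a>';

// Same shape as $rex in the script: optional quotes around the value.
$rex = "/href\\s*=\\s*['\"]*([^>'\"\\s]+)/i";
preg_match_all($rex, $html, $r);

// With the default flags, $r[0] holds the full matches and
// $r[1] holds the captured href values.
$links = $r[1];

// Keep only the names matching the script's filter.
$reg = '/^down.*?\.html$/i';
$wanted = array_values(preg_grep($reg, $links));
print_r($wanted);
```

Because preg_match_all() groups results by capture group, reading $r[1] directly avoids looping over every sub-array as the script does. For anything beyond simple pages, an HTML parser such as DOMDocument is usually more robust than a regular expression.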



