PHP web page capture and analysis methods

Source: Internet
Author: User

This article describes how to capture and analyze web pages in PHP. The details are as follows:

Capturing and analyzing a file is simple. This tutorial walks through it step by step with an example. Let's get started!

First, we must decide which URL to capture. It can be set in the script or passed in through $QUERY_STRING ($_SERVER['QUERY_STRING'] in current PHP versions). For simplicity, let's set the variable directly in the script.

<?php
$url = 'http://www.php.net';
?>
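If the URL is taken from the query string instead, it should be checked before use. The following is a minimal sketch under assumptions: the helper name `url_from_query` and the parameter name `url` are hypothetical and do not come from the original article.

```php
<?php
// Hypothetical helper: read the target URL from a query string such as
// "url=http://www.php.net". Falls back to $default when the parameter is
// missing, is not a string, or is not an http(s) URL.
function url_from_query(string $queryString, string $default): string
{
    parse_str($queryString, $params);   // decodes into an associative array
    $candidate = $params['url'] ?? '';
    if (!is_string($candidate)) {
        return $default;                // e.g. "url[]=..." yields an array
    }
    $isUrl  = filter_var($candidate, FILTER_VALIDATE_URL) !== false;
    $isHttp = preg_match('#^https?://#i', $candidate) === 1;
    return ($isUrl && $isHttp) ? $candidate : $default;
}

// On the command line QUERY_STRING is not set, so the default is used.
$url = url_from_query($_SERVER['QUERY_STRING'] ?? '', 'http://www.php.net');
echo $url, "\n";
```

Restricting the scheme to http and https keeps the script from being pointed at local files or other stream wrappers.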

Step 2: capture the specified file and store it in an array using the file() function.

<?php
$url = 'http://www.php.net';
$lines_array = file($url);
?>
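To see what file() returns, here is a small sketch that reads a local temporary file rather than a URL, so it runs offline. The sample content is an assumption made for illustration. Reading a URL with file() additionally requires the allow_url_fopen setting to be enabled.

```php
<?php
// Write a small stand-in "page" to a temporary file.
$path = tempnam(sys_get_temp_dir(), 'page');
file_put_contents($path, "<html>\n<body>\nHello\n</body>\n</html>\n");

// By default every array element keeps its trailing newline.
$lines_array = file($path);

// With FILE_IGNORE_NEW_LINES the newlines are stripped.
$trimmed = file($path, FILE_IGNORE_NEW_LINES);

echo count($lines_array), " lines read\n";
var_dump($lines_array[0] === "<html>\n");
var_dump($trimmed[0] === "<html>");

unlink($path);
```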

Now the file is available in the array. However, the text we want to analyze may not all be on one line. To solve this, we can convert the array $lines_array into a single string with the implode(x, y) function. If you plan to split the string back into an array later with explode(), setting x to "|", "!", or a similar separator may be better. For our purpose, though, it is best to set x to an empty string, because each array element already ends with its own newline. y is the other required parameter: the array that you want implode() to process.

<?php
$url = 'http://www.php.net';
$lines_array = file($url);
$lines_string = implode('', $lines_array);
?>
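The difference between the two separator choices can be sketched with a small hand-written array standing in for the output of file(). The sample lines are assumptions for illustration.

```php
<?php
// Stand-in for what file() would return: each element ends with a newline.
$lines_array = ["<p>first\n", "second</p>\n"];

// Joining with an empty string restores the original text exactly.
$lines_string = implode('', $lines_array);

// Joining with a marker keeps the element boundaries recoverable,
// provided the marker never occurs in the text itself.
$joined = implode('|', $lines_array);
$parts  = explode('|', $joined);

var_dump($lines_string === "<p>first\nsecond</p>\n");
var_dump($parts === $lines_array);
```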

Now the capture is finished, and we can move on to the analysis. For the purpose of this example, we want to get everything from

<?php
$url = 'http://www.php.net';
$lines_array = file($url);
$lines_string = implode('', $lines_array);
eregi("

Let's take a look at the code. As you can see, the eregi() function is called in the following format. Note that eregi() was deprecated in PHP 5.3 and removed in PHP 7.0; preg_match() with the i modifier is its replacement.

eregi("

"(.*)" matches everything, which can be read as "capture whatever lies between

Finally, we can output the data. Because there is only one instance between

<?php
$url = 'http://www.php.net';
$lines_array = file($url);
$lines_string = implode('', $lines_array);
eregi("

This is all the code.
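The same kind of capture can be sketched with preg_match(), which works on current PHP versions. This is an illustration under assumptions, not the article's exact pattern: the helper name `between_tags`, the `title` tag, and the sample HTML are all hypothetical.

```php
<?php
// Hypothetical helper: return the text between <tag> and </tag>,
// or null when the tag pair is not found.
function between_tags(string $html, string $tag): ?string
{
    $t = preg_quote($tag, '/');
    // i: case-insensitive matching, as eregi() did.
    // s: let "." match newlines, since the page is one multi-line string.
    // (.*?) is non-greedy, so the match stops at the first closing tag.
    $pattern = '/<' . $t . '\b[^>]*>(.*?)<\/' . $t . '>/is';
    if (preg_match($pattern, $html, $matches) === 1) {
        return $matches[1];   // [0] is the whole match, [1] the captured group
    }
    return null;
}

$sample = "<html><head><TITLE>PHP:\nHypertext Preprocessor</TITLE></head></html>";
echo between_tags($sample, 'title'), "\n";
```

The capture group is read from index 1 so that the surrounding tags are excluded. Index 0 would return the tags as well.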

<?php
// collect the URLs found in every index page and save them to a file
function get_index($save_file, $prefix = "index_") {
    $count = 68;
    $i = 1;
    if (file_exists($save_file)) @unlink($save_file);
    $fp = fopen($save_file, "a+") or die("Open " . $save_file . " failed");
    while ($i < $count) {
        $url = $prefix . $i . ".htm";
        echo "Get " . $url . "...";
        // get_content_url() takes ($host_url, $file_contents); no host prefix
        // is available here, so the links are stored as found
        $url_str = get_content_url("", get_url($url));
        echo "OK\n";
        fwrite($fp, $url_str);
        ++$i;
    }
    fclose($fp);
}

// obtain the target multimedia objects
function get_object($url_file, $save_file, $split = "|--:**:--|") {
    if (!file_exists($url_file)) die($url_file . " not exist");
    $file_arr = file($url_file);
    if (!is_array($file_arr) || empty($file_arr)) die($url_file . " not content");
    $url_arr = array_unique($file_arr);
    if (file_exists($save_file)) @unlink($save_file);
    $fp = fopen($save_file, "a+") or die("Open save file " . $save_file . " failed");
    foreach ($url_arr as $url) {
        $url = trim($url);   // file() keeps the trailing newline on each line
        if (empty($url)) continue;
        echo "Get " . $url . "...";
        $html_str = get_url($url);
        $obj_str = get_content_object($html_str);
        echo "OK\n";
        fwrite($fp, $obj_str);
    }
    fclose($fp);
}

// retrieve the content of every file in a directory
function get_dir($save_file, $dir) {
    $dp = opendir($dir);
    if (file_exists($save_file)) @unlink($save_file);
    $fp = fopen($save_file, "a+") or die("Open save file " . $save_file . " failed");
    while (($file = readdir($dp)) !== false) {
        if ($file != "." && $file != "..") {
            echo "Read file " . $file . "...";
            $file_content = file_get_contents($dir . $file);
            $obj_str = get_content_object($file_content);
            echo "OK\n";
            fwrite($fp, $obj_str);
        }
    }
    closedir($dp);
    fclose($fp);
}

// obtain the content of the specified URL
function get_url($url) {
    $reg = '/^http:\/\/[^\/].+$/';
    if (!preg_match($reg, $url)) die($url . " invalid");
    $fp = fopen($url, "r") or die("Open url: " . $url . " failed.");
    $content = "";
    while ($fc = fread($fp, 8192)) {
        $content .= $fc;
    }
    fclose($fp);
    if (empty($content)) {
        die("Get url: " . $url . " content failed.");
    }
    return $content;
}

// use a socket to obtain the specified web page
function get_content_by_socket($url, $host) {
    $fp = fsockopen($host, 80) or die("Open " . $url . " failed");
    $header = "GET /" . $url . " HTTP/1.1\r\n";
    $header .= "Accept: */*\r\n";
    $header .= "Accept-Language: zh-cn\r\n";
    $header .= "Accept-Encoding: gzip, deflate\r\n";
    $header .= "User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; Maxthon; InfoPath.1; .NET CLR 2.0.50727)\r\n";
    $header .= "Host: " . $host . "\r\n";
    //$header .= "Cookie: cnzz02=2; rtime=1; ltime=1148456424859; cnzz_eid=56601755-\r\n";
    // a blank line ends the request headers
    $header .= "Connection: Close\r\n\r\n";
    fwrite($fp, $header);
    $contents = "";
    while (!feof($fp)) {
        $contents .= fgets($fp, 8192);
    }
    fclose($fp);
    return $contents;
}

// get the URLs in the specified content
function get_content_url($host_url, $file_contents) {
    //$reg = '/^(#|javascript.*?|ftp:\/\/.+|http:\/\/.+|.*?href.*?|play.*?|index.*?|.*?asp)+$/i';
    //$reg = '/^(down.*?\.html|\d+_\d+\.htm.*?)$/i';
    $rex = "/([hH][rR][eE][Ff])\s*=\s*['\"]*([^>'\"\s]+)[\"'>]*\s*/i";
    $reg = '/^(down.*?\.html)$/i';
    preg_match_all($rex, $file_contents, $r);
    $result = ""; //array();
    foreach ($r as $c) {
        if (is_array($c)) {
            foreach ($c as $d) {
                if (preg_match($reg, $d)) {
                    $result .= $host_url . $d . "\n";
                }
            }
        }
    }
    return $result;
}

// extract the first link and its label from the specified content
function get_content_object($str, $split = "|--:**:--|") {
    $regx = "/href\s*=\s*['\"]*([^>'\"\s]+)[\"'>]*\s*(.*?<\/b>)/i";
    preg_match_all($regx, $str, $result);
    if (count($result) == 3 && !empty($result[1])) {
        $result[2] = str_replace("Multimedia:", "", $result[2]);
        $result[2] = str_replace("</b>", "", $result[2]);
        return $result[1][0] . $split . $result[2][0] . "\n";
    }
    return "";
}
?>
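The link-extraction step in get_content_url() can be tried on its own with a simplified sketch. The helper name `extract_links`, the sample HTML, and the base URL `http://example.com/` are assumptions for illustration. Unlike the listing, the sketch returns an array and removes duplicate links.

```php
<?php
// Hypothetical helper: collect href values that match $filter and
// prefix each one with $base.
function extract_links(string $html, string $base, string $filter): array
{
    // Capture the href value, whether or not it is quoted.
    $rex = '/href\s*=\s*[\'"]?([^>\'"\s]+)/i';
    preg_match_all($rex, $html, $m);   // $m[1] holds the captured values
    $links = [];
    foreach (array_unique($m[1]) as $href) {
        if (preg_match($filter, $href) === 1) {
            $links[] = $base . $href;
        }
    }
    return $links;
}

$html = '<a href="down/1.html">one</a> <A HREF=\'down/2.html\'>two</A> '
      . '<a href="about.htm">about</a> <a href="down/1.html">again</a>';
$links = extract_links($html, 'http://example.com/', '/^down.*?\.html$/i');
print_r($links);
```

Regular expressions are adequate for simple, predictable markup like this. For arbitrary pages, an HTML parser such as PHP's DOMDocument is generally more robust.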
