Introduction to Snoopy, a PHP class for fetching web pages

Source: Internet
Author: User
Snoopy is a PHP class that imitates the functions of a web browser: it can fetch web page content and submit forms. Official website: http://snoopy.sourceforge.net/

Some features of Snoopy:

  • Fetch the content of a web page: fetch()
  • Fetch the text content of a web page (HTML tags removed): fetchtext()
  • Fetch the links and forms of a web page: fetchlinks(), fetchform()
  • Supports proxy hosts
  • Supports basic user name/password authentication
  • Supports setting the user_agent, referer, cookies and header content
  • Supports browser redirects, with control over the redirect depth
  • Expands links found on a page into fully qualified URLs (default)
  • Submits data and retrieves the returned value
  • Supports following HTML frames
  • Supports passing cookies on redirects

PHP 4 or later is required. Because Snoopy is a plain PHP class, it needs no extension, which makes it a good choice when the server does not support cURL.

Class methods

1. fetch($URI)

This method fetches the content of a web page. $URI is the URL of the page to fetch. The result is stored in $this->results.

If the page uses frames, Snoopy follows each frame, collects the frames in an array, and stores that array in $this->results.

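<?php
include "Snoopy.class.php";
$snoopy = new Snoopy;
$url = "http://www.example.com/"; // placeholder: the URL in the original listing was lost
$snoopy->fetch($url);   // fetch the whole page
echo $snoopy->results;  // display the result
?>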

2. fetchtext($URI)

This method is similar to fetch(). The only difference is that it removes HTML tags and other irrelevant data and returns only the text content of the page.

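<?php
include "Snoopy.class.php";
$snoopy = new Snoopy;
$url = "http://www.example.com/"; // placeholder: the URL in the original listing was lost
$snoopy->fetchtext($url);  // fetch the text content only
echo $snoopy->results;     // display the result
?>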
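What "removing HTML tags and other irrelevant data" means can be pictured with a small stand-alone function. This is only a sketch, not Snoopy's own implementation: the name html_to_text is hypothetical, and it assumes that script and style blocks count as irrelevant data. It uses typed parameters, so it needs PHP 7 or later.

```php
<?php
// Hypothetical helper, not part of Snoopy: reduce an HTML document to plain text.
function html_to_text(string $html): string
{
    // Drop script and style blocks entirely, including their contents.
    $html = preg_replace('#<(script|style)\b[^>]*>.*?</\1>#is', ' ', $html);
    // Replace every remaining tag with a space so adjacent words stay separated.
    $text = preg_replace('/<[^>]*>/', ' ', $html);
    // Decode entities such as &amp; into the characters they stand for.
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');
    // Collapse runs of whitespace into single spaces.
    return trim(preg_replace('/\s+/', ' ', $text));
}

echo html_to_text('<h1>Hello</h1><script>var x = 1;</script><p>Tom &amp; Jerry</p>'), "\n";
```

The function works on a string that has already been downloaded, so it can be tried without any network access.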

3. fetchform($URI)

This method is similar to fetch(). The only difference is that it removes HTML tags and other irrelevant data and returns only the form content of the page.

 

4. fetchlinks($URI)

This method is similar to fetch(). The only difference is that it removes HTML tags and other irrelevant data and returns only the links in the page. By default, relative links are completed automatically and converted to full URLs.
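The completion of relative links can be illustrated with a simplified resolver. This is a sketch under assumptions, not Snoopy's own code: the function name expand_link is hypothetical, the example URLs are placeholders, and "../" segments are deliberately left unresolved to keep the example short.

```php
<?php
// Hypothetical helper, not part of Snoopy: turn a link found on a page
// into a complete URL, given the URL of the page it was found on.
function expand_link(string $link, string $pageUrl): string
{
    // A link that already carries a scheme (http:, mailto:, ...) is left untouched.
    if (preg_match('#^[a-z][a-z0-9+.-]*:#i', $link)) {
        return $link;
    }
    $parts = parse_url($pageUrl);
    // Protocol-relative link such as //cdn.example.com/a.js: reuse the page's scheme.
    if (strpos($link, '//') === 0) {
        return $parts['scheme'] . ':' . $link;
    }
    $root = $parts['scheme'] . '://' . $parts['host']
          . (isset($parts['port']) ? ':' . $parts['port'] : '');
    // Root-relative link such as /images/taoav.gif.
    if (strpos($link, '/') === 0) {
        return $root . $link;
    }
    // Document-relative link: resolve against the directory of the page.
    $path = $parts['path'] ?? '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);
    return $root . $dir . $link;
}

echo expand_link('/images/taoav.gif', 'http://www.example.com/news/index.html'), "\n";
echo expand_link('page2.html', 'http://www.example.com/news/index.html'), "\n";
```

A production resolver would also normalize "./" and "../" segments and handle query-only links; those cases are outside this sketch.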

 

5. submit($URI, $formvars)

This method submits a form to the URL given in $URI. $formvars is an array holding the form parameters.
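To see what an array of form parameters looks like once it is encoded for sending, PHP's built-in http_build_query() is handy. This is an illustration only: the field names below are hypothetical, and Snoopy builds its request body internally rather than through this call.

```php
<?php
// Hypothetical form fields; the real names depend on the target form.
$formvars = [
    'username' => 'joe',
    'comment'  => 'hello world & more',
];

// Encode the array as an application/x-www-form-urlencoded body.
// The separator is passed explicitly so the result does not depend on php.ini.
$body = http_build_query($formvars, '', '&');
echo $body, "\n";

// parse_str() reverses the encoding, which is roughly what the receiving server does.
parse_str($body, $decoded);
```

Each key and value is URL-encoded separately, so characters such as "&" inside a value cannot be mistaken for field separators.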

 

6. submittext($URI, $formvars)

This method is similar to submit(). The only difference is that it removes HTML tags and other irrelevant data and returns only the text content of the page that comes back after the form is submitted (for example, after a login).

 

7. submitlinks($URI, $formvars)

This method is similar to submit(). The only difference is that it removes HTML tags and other irrelevant data and returns only the links in the page. By default, relative links are completed automatically and converted to full URLs.

Class attributes (default values in parentheses)

  • $host: the host to connect to
  • $port: the port to connect to
  • $proxy_host: the proxy host to use, if any
  • $proxy_port: the proxy port to use, if any
  • $agent: the User-Agent string to masquerade as (Snoopy v0.1)
  • $referer: the referer information, if any
  • $cookies: the cookies, if any
  • $rawheaders: other header information, if any
  • $maxredirs: maximum number of redirects, 0 = none allowed (5)
  • $offsiteok: whether to allow off-site redirects (true)
  • $expandlinks: whether to expand all links to full URLs (true)
  • $user: user name for authentication, if any
  • $pass: password for authentication, if any
  • $accept: HTTP accept types (image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, */*)
  • $error: the error message, if any
  • $response_code: the response code returned by the server
  • $headers: the headers returned by the server
  • $maxlength: maximum length of the returned data
  • $read_timeout: timeout for read operations (requires PHP 4 Beta 4+); set to 0 for no timeout
  • $timed_out: true if a read operation timed out (requires PHP 4 Beta 4+)
  • $maxframes: maximum number of frames to follow
  • $status: the HTTP status of the fetch
  • $temp_dir: temporary directory that the web server can write to (/tmp)
  • $curl_path: path to the cURL binary; set to false if no cURL binary is available
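
How $maxredirs and $offsiteok work together can be pictured with a small decision function. This is a sketch under assumptions, not Snoopy's own code: the name should_follow_redirect and its parameters are hypothetical, and it assumes that a redirect is followed only while the current depth is below $maxredirs and, when $offsiteok is false, only if the host stays the same.

```php
<?php
// Hypothetical helper, not part of Snoopy: decide whether a redirect should be followed.
function should_follow_redirect(string $from, string $to, int $depth, int $maxredirs, bool $offsiteok): bool
{
    // With $maxredirs = 0 no redirect is ever followed.
    if ($depth >= $maxredirs) {
        return false;
    }
    if ($offsiteok) {
        return true;
    }
    // Off-site redirects are not allowed: compare the host names case-insensitively.
    $fromHost = (string) parse_url($from, PHP_URL_HOST);
    $toHost   = (string) parse_url($to, PHP_URL_HOST);
    return $fromHost !== '' && strcasecmp($fromHost, $toHost) === 0;
}

var_dump(should_follow_redirect('http://a.example/x', 'http://a.example/y', 0, 5, false));
var_dump(should_follow_redirect('http://a.example/x', 'http://b.example/y', 0, 5, false));
```

The host names a.example and b.example are placeholders chosen for the illustration.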
Demo

<?php
include "Snoopy.class.php";
$snoopy = new Snoopy;

$snoopy->proxy_host = "my.proxy.host";  // a host name, not a URL
$snoopy->proxy_port = "80";

$snoopy->agent = "(compatible; MSIE 4.01; MSN 2.5; AOL 4.0; Windows 98)";
$snoopy->referer = "http://www.4wei.cn";

$snoopy->cookies["SessionID"] = "238472834723489";  // quoted: a trailing "l" is not valid PHP
$snoopy->cookies["favoriteColor"] = "RED";

$snoopy->rawheaders["Pragma"] = "no-cache";

$snoopy->maxredirs = 2;
$snoopy->offsiteok = false;
$snoopy->expandlinks = false;

$snoopy->user = "joe";
$snoopy->pass = "bloe";

if ($snoopy->fetchtext("http://www.4wei.cn")) {
    echo "<pre>" . htmlspecialchars($snoopy->results) . "</pre>\n";
} else {
    echo "error fetching document: " . $snoopy->error . "\n";
}
?>

Obtain the content of a specified url:

 
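<?php
include "Snoopy.class.php";
$snoopy = new Snoopy;
$url = "http://www.example.com/"; // placeholder: the URL in the original listing was lost
$snoopy->fetch($url);   // fetch the whole page
echo $snoopy->results;  // display the result

// Alternatives:
// $snoopy->fetchtext($url);   // fetch the text content (HTML removed)
// $snoopy->fetchlinks($url);  // fetch the links
// $snoopy->fetchform($url);   // fetch the forms
?>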

Form submission:

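<?php
include "Snoopy.class.php";
$snoopy = new Snoopy;
// Placeholder form data and target: the values in the original listing were lost.
$formvars["username"] = "admin";
$formvars["pwd"] = "admin";
$action = "http://www.example.com/login.php";
$snoopy->submit($action, $formvars);  // $formvars is the array of submitted fields
echo $snoopy->results;                // the response returned after the form is submitted

// Alternatives:
// $snoopy->submittext($action, $formvars);   // return only the text, with HTML removed
// $snoopy->submitlinks($action, $formvars);  // return only the links
?>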

Once you can submit forms, much more becomes possible. Next, we disguise the IP address and the browser:

<?php
include "Snoopy.class.php";
$snoopy = new Snoopy;
// Placeholder form data and target: the values in the original listing were lost.
$formvars["username"] = "admin";
$formvars["pwd"] = "admin";
$action = "http://www.example.com/login.php";
$snoopy->cookies["PHPSESSID"] = 'fc0000b1918bd522cc863f000090e6fff7';  // disguise the session ID
$snoopy->agent = "(compatible; MSIE 4.01; MSN 2.5; AOL 4.0; Windows 98)";  // disguise the browser
$snoopy->referer = "http://www.4wei.cn";              // disguise the source page (HTTP referer)
$snoopy->rawheaders["Pragma"] = "no-cache";            // cache-related HTTP header
$snoopy->rawheaders["X_FORWARDED_FOR"] = "127.0.0.101";  // disguise the IP address
$snoopy->submit($action, $formvars);
echo $snoopy->results;
?>

So the session, the browser and the IP address can all be disguised, which opens up many possibilities. For example, a vote that is protected by a verification code and an IP check can be submitted repeatedly.

PS: the disguised IP address here is really just an HTTP header. A server that reads the client address from REMOTE_ADDR cannot be fooled this way, but a server that reads it from an HTTP header (for example, code that tries to see through proxies) will accept whatever address you supply.
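
The difference can be seen from the server's side with two small functions. This is a sketch with hypothetical names (ip_from_connection, ip_from_headers); $server stands in for PHP's $_SERVER array, and the address 203.0.113.7 is a made-up value for the real connection.

```php
<?php
// Hypothetical helpers showing two ways a server might determine the client IP.

function ip_from_connection(array $server): string
{
    // REMOTE_ADDR comes from the TCP connection itself, not from request headers.
    return $server['REMOTE_ADDR'] ?? '';
}

function ip_from_headers(array $server): string
{
    // HTTP_X_FORWARDED_FOR is filled from a client-supplied header, so it can be forged.
    if (!empty($server['HTTP_X_FORWARDED_FOR'])) {
        // The header may hold a comma-separated chain; take the first entry.
        $chain = explode(',', $server['HTTP_X_FORWARDED_FOR']);
        return trim($chain[0]);
    }
    return $server['REMOTE_ADDR'] ?? '';
}

// Simulated request: the connection comes from 203.0.113.7, the header is forged.
$request = [
    'REMOTE_ADDR'          => '203.0.113.7',
    'HTTP_X_FORWARDED_FOR' => '127.0.0.101',
];
echo ip_from_connection($request), "\n";
echo ip_from_headers($request), "\n";
```

A server that must not be fooled should rely on the connection address and treat forwarded-for headers as untrusted unless they come from a proxy it controls.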

A brief note on how the verification code is handled: first view the page in an ordinary browser, find the session ID that the verification code belongs to, and note both the session ID and the code value. Then have Snoopy send that same session ID together with the code.

Principle: because the session ID is the same, the verification code the server expects is the same as the one noted on the first visit.

Sometimes more needs to be configured, and Snoopy has that covered as well:

<?php
include "Snoopy.class.php";
$snoopy = new Snoopy;
$snoopy->proxy_host = "my.proxy.host";  // a host name, not a URL
$snoopy->proxy_port = "8080";           // use a proxy
$snoopy->maxredirs = 2;                 // maximum number of redirects
$snoopy->expandlinks = true;            // expand links to full URLs; often needed when collecting pages
// For example, a link such as /images/taoav.gif is expanded to its full URL.
$snoopy->maxframes = 5;                 // maximum number of frames to follow
// Note: when a page with frames is fetched, $snoopy->results is an array.
// $snoopy->error holds the error message, if any.
?>

A complete example:

<?php
/**
 * You need the snoopy.class.php from
 * http://snoopy.sourceforge.net/
 */
include("snoopy.class.php");

$snoopy = new Snoopy;

// need a proxy?
//$snoopy->proxy_host = "my.proxy.host";
//$snoopy->proxy_port = "8080";

// set browser and referer:
$snoopy->agent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)";
$snoopy->referer = "http://www.jonasjohn.de/";

// set some cookies:
$snoopy->cookies["SessionID"] = '238472834723489';
$snoopy->cookies["favoriteColor"] = "blue";

// set a raw header:
$snoopy->rawheaders["Pragma"] = "no-cache";

// set some internal variables:
$snoopy->maxredirs = 2;
$snoopy->offsiteok = false;
$snoopy->expandlinks = false;

// set username and password (optional)
//$snoopy->user = "joe";
//$snoopy->pass = "bloe";

// fetch the text of the website www.google.com:
if ($snoopy->fetchtext("http://www.google.com")) {
    // other methods: fetch, fetchform, fetchlinks, submittext and submitlinks

    // response code:
    print "response code: " . $snoopy->response_code . "<br>\n";

    // print the headers (foreach replaces each(), which was removed in PHP 8):
    print "Headers:<br>";
    foreach ($snoopy->headers as $key => $val) {
        print $key . ": " . $val . "<br>\n";
    }
    print "<br>\n";

    // print the text of the website:
    print htmlspecialchars($snoopy->results) . "\n";
} else {
    print "Snoopy: error while fetching document: " . $snoopy->error . "\n";
}
?>

Use the Snoopy class to complete a simple image collection:

 
<?php
include "Snoopy.class.php";

$snoopy = new Snoopy;
$sourceURL = "http://www.example.com/";  // placeholder: the source URL in the original listing was lost
$snoopy->fetchlinks($sourceURL);  // fetch the links on the page
$a = $snoopy->results;            // the array of links
$re = '/\d+\.html$/';             // keep only links that end in digits followed by .html

// Filter the links; the last matching link wins, as in the original listing.
$aa = null;
foreach ($a as $tmp) {
    if (preg_match($re, $tmp)) {
        $aa = $tmp;
    }
}
if ($aa !== null) {
    getImgURL($aa);
}

function getImgURL($siteName)
{
    $snoopy = new Snoopy();
    $snoopy->fetch($siteName);
    $fileContent = $snoopy->results;  // content of the selected page
    // The image pattern in the original listing was lost. This reconstruction captures
    // the image URL without its extension (group 1) and the extension (group 2).
    $reTag = '/<img[^>]+src="(http:\/\/[^"]+)\.(jpg|png|gif|jpeg)"/i';
    if (preg_match_all($reTag, $fileContent, $matchResult)) {
        for ($i = 0, $len = count($matchResult[1]); $i < $len; ++$i) {
            saveImgURL($matchResult[1][$i], $matchResult[2][$i]);
        }
    }
}

function saveImgURL($name, $suffix)
{
    $url = $name . "." . $suffix;
    echo "requested image address: " . $url . "<br>";
    $imgSavePath = "E:/123/images/";  // directory to save images in
    $imgId = mt_rand();               // random file name
    // Sort the images into different folders by type.
    if ($suffix == "gif") {
        $imgSavePath .= "emotion";
    } else {
        $imgSavePath .= "topic";
    }
    $imgSavePath .= "/" . $imgId . "." . $suffix;  // assemble the full file name
    if (is_file($imgSavePath)) {                   // delete an existing file of that name
        unlink($imgSavePath);
        echo "<p>The file " . $imgSavePath . " already exists and will be deleted.</p>";
    }
    $imgFile = file_get_contents($url);                 // read the remote file
    $flag = file_put_contents($imgSavePath, $imgFile);  // write it to the local disk
    if ($flag) {
        echo "<p>The file " . $imgSavePath . " is saved successfully.</p>";
    }
}
?>
