Introduction to Snoopy and its usage

Source: Internet
Author: User
This article introduces Snoopy and how to use it. It may be a useful reference if you are interested in PHP.

Snoopy is a PHP class that simulates a web browser: it can fetch web page content and submit forms. To run Snoopy, the server needs PHP 4 or later with PCRE (Perl Compatible Regular Expressions) support, which a basic LAMP setup provides.

Official Snoopy website: http://snoopy.sourceforge.net/

I. Some features of Snoopy:

1. Fetches the content of a web page (fetch)
2. Fetches the text of a web page with the HTML tags removed (fetchtext)
3. Fetches the links and forms of a web page (fetchlinks, fetchform)
4. Supports proxy hosts
5. Supports basic username/password authentication
6. Lets you set the user agent, referer, cookies, and header content
7. Follows browser redirects and lets you control the redirect depth
8. Expands the links in a page into fully qualified URLs (default)
9. Submits data and retrieves the returned result
10. Follows HTML frames
11. Passes cookies on redirects

Snoopy requires PHP 4 or later. Because it is a plain PHP class, it needs no extension support, so it can also be used on servers that do not support curl.

II. Class methods:

fetch($URI)
Fetches the content of a web page. $URI is the URL of the page to fetch. The result is stored in $this->results. If the page uses frames, Snoopy follows each frame, stores the frames in an array, and saves that array to $this->results.

fetchtext($URI)
Similar to fetch(). The only difference is that it removes HTML tags and other irrelevant data and returns only the text content of the page.

fetchform($URI)
Similar to fetch(). The only difference is that it removes HTML tags and other irrelevant data and returns only the form content of the page.

fetchlinks($URI)
Similar to fetch(). The only difference is that it removes HTML tags and other irrelevant data and returns only the links on the page. By default, relative links are automatically expanded into complete URLs.

submit($URI, $formvars)
Submits a form to the address specified by $URI. $formvars is an array that holds the form parameters.

submittext($URI, $formvars)
Similar to submit(). The only difference is that it removes HTML tags and other irrelevant data and returns only the text content of the page that comes back (for example, the page shown after login).

submitlinks($URI, $formvars)
Similar to submit(). The only difference is that it removes HTML tags and other irrelevant data and returns only the links on the page. By default, relative links are automatically expanded into complete URLs.

III. Class attributes (the default value is in parentheses):

$host: the host to connect to
$port: the port to connect to
$proxy_host: the proxy host to use, if any
$proxy_port: the proxy port to use, if any
$agent: the user agent to masquerade as (Snoopy v0.1)
$referer: referer information to pass, if any
$cookies: cookies to pass, if any
$rawheaders: other header information to pass, if any
$maxredirs: maximum number of redirects to allow; 0 = none allowed (5)
$offsiteok: whether or not to allow redirects off-site (true)
$expandlinks: whether or not to expand all links into complete addresses (true)
$user: authentication username, if any
$pass: authentication password, if any
$accept: HTTP accept types (image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, */*)
$error: where errors are reported, if any
$response_code: response code returned by the server
$headers: headers returned by the server
$maxlength: maximum length of the returned data
$read_timeout: timeout for read operations (requires PHP 4 Beta 4+); set to 0 for no timeout
$timed_out: true if a read operation timed out (requires PHP 4 Beta 4+)
$maxframes: maximum number of frames to follow
$status: HTTP status of the fetch
$temp_dir: temporary directory that the web server can write to (/tmp)
$curl_path: path to the cURL binary; set to false if there is no cURL binary

IV. Demo:

```php
include "Snoopy.class.php";
$snoopy = new Snoopy;

$snoopy->proxy_host = "www.phpoac.com";
$snoopy->proxy_port = "8080";

$snoopy->agent = "(compatible; MSIE 4.01; MSN 2.5; AOL 4.0; Windows 98)";
$snoopy->referer = "http://www.phpoac.com/";

// Cookie values are passed as strings
$snoopy->cookies["SessionID"] = "238472842523489";
$snoopy->cookies["favoriteColor"] = "RED";

$snoopy->rawheaders["Pragma"] = "no-cache";

$snoopy->maxredirs = 2;
$snoopy->offsiteok = false;
$snoopy->expandlinks = false;

$snoopy->user = "joe";
$snoopy->pass = "bloe";

if ($snoopy->fetchtext("http://www.phpoac.com")) {
    echo "<pre>" . htmlspecialchars($snoopy->results) . "</pre>\n";
} else {
    echo "error fetching document: " . $snoopy->error . "\n";
}
```

Snoopy sample: collecting posts from the phpoac forum

```php
<?php
// The opening of this script is missing from the source. It is assumed to
// include Snoopy.class.php, create $snoopy, and define $submit_url and
// $submit_vars (the login form address and its fields) before this point.
$snoopy->submit($submit_url, $submit_vars);

if ($snoopy->results) {
    // Get the link addresses
    $snoopy->fetchlinks("http://www.phpoac.com/bbs");
    $url = array();
    $url = $snoopy->results;
    // print_r($url);
    foreach ($url as $key => $value) {
        // Keep only board addresses such as
        // http://www.phpoac.com/bbs/forumdisplay.php?fid=156&sid=VfcqTR
        if (!preg_match("/^(http:\/\/www\.phpoac\.com\/bbs\/forumdisplay\.php\?fid=)[0-9]*&sid=[a-zA-Z]{6}/i", $value)) {
            unset($url[$key]);
        }
    }
    // print_r($url);

    // $url now holds the board addresses. Loop over them; for this test,
    // only the first page of the first board is collected.
    $i = 0;
    foreach ($url as $key => $value) {
        if ($i >= 1) {
            // test limit
            break;
        } else {
            // Visit the board and extract the post link addresses. In a real
            // run, also extract the paging data, then extract the post data
            // page by page.
            $snoopy = new Snoopy();
            $snoopy->fetchlinks($value);
            $tie = array();
            $tie[$i] = $snoopy->results;
            // print_r($tie);

            // Filter the array
            foreach ($tie[$i] as $key => $value) {
                // Keep only post addresses such as
                // http://www.phpoac.com/bbs/viewthread.php?tid=68127&extra=page%3D1&page=1&sid=iBLZfK
                if (!preg_match("/^(http:\/\/www\.phpoac\.com\/bbs\/viewthread\.php\?tid=)[0-9]*&extra=page\%3D1&page=[0-9]*&sid=[a-zA-Z]{6}/i", $value)) {
                    unset($tie[$i][$key]);
                }
            }
            // print_r($tie[$i]);

            // Group the array: put the pages of the same post into one array
            $left = ''; // common left part of the address
            $j = 0;
            $page = array();
            foreach ($tie[$i] as $key => $value) {
                // NOTE: the offset and length arguments of substr() are
                // missing from the source here and in the comparison below.
                $left = substr($value,);
                $m = 0;
                foreach ($tie[$i] as $pkey => $pvalue) {
                    // Rebuild the array
                    if (substr($pvalue,) == $left) {
                        $page[$j][$m] = $pvalue;
                        $m++;
                    }
                }
                $j++;
            }

            // Remove repeated items: start
            // $page = array_unique($page); only works on one-dimensional arrays
            $paget[0] = $page[0];
            $nums = count($page);
            for ($n = 1; $n < $nums; $n++) {
                $paget[$n] = array_diff($page[$n], $page[$n - 1]);
            }
            // Remove repeated items: end

            // Remove empty array values
            unset($page);
            $page = array(); // redefine the page array
            $page = array_filter($paget);
            // print_r($page);

            $u = 0;
            $title = array();
            $content = array();
            $temp = '';
            $tt = array();
            foreach ($page as $key => $value) {
                // Outer loop: one post
                if (is_array($value)) {
                    foreach ($value as $k1 => $v1) {
                        // Page loop: the N pages of one post
                        $snoopy = new Snoopy();
                        $snoopy->fetch($v1);
                        $temp = $snoopy->results;
                        // Read the title. The opening tag is missing from the
                        // source; <h2> is inferred from the closing tag.
                        if (!preg_match_all("/<h2>(.*)<\/h2>/i", $temp, $tt)) {
                            echo "no title";
                            exit;
                        } else {
                            $title[$u] = $tt[1][1];
                        }
                        unset($tt);
                        // Read the content. The opening tag is missing from
                        // the source; <p> is inferred from the closing tag.
                        if (!preg_match_all("/<p>(.*)<\/p>/i", $temp, $tt)) {
                            print_r($tt);
                            echo "no content1";
                            exit;
                        } else {
                            foreach ($tt[1] as $c => $c2) {
                                $content[$u] .= $c2;
                            }
                        }
                    }
                } else {
                    // Fetch the page content directly
                    $snoopy = new Snoopy();
                    $snoopy->fetch($value);
                    $temp = $snoopy->results;
                    // Read the title (opening tag inferred, as above)
                    if (!preg_match_all("/<h2>(.*)<\/h2>/i", $temp, $tt)) {
                        echo "no title";
                        exit;
                    } else {
                        $title[$u] = $tt[1][1];
                    }
                    unset($tt);
                    // Read the content (opening tag inferred, as above)
                    if (!preg_match_all("/<p>(.*)<\/p>/i", $temp, $tt)) {
                        echo "no content2";
                        exit;
                    } else {
                        foreach ($tt[1] as $c => $c2) {
                            $content[$u] .= $c2;
                        }
                    }
                }
                $u++;
            }
            print_r($content);
        }
        $i++;
    }
} else {
    echo "login failed";
    exit;
}
?>
```
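The link-filtering step of the forum sample can be tried on its own, without Snoopy or network access. The following is a minimal sketch, assuming PHP 7 or later. The function name filter_board_links and the test links are hypothetical and not part of Snoopy; the pattern follows the sample's board pattern, except that it requires at least one digit after fid= and uses # as the delimiter so that the slashes need no escaping.

```php
<?php
/**
 * Keep only the links that look like board addresses, for example
 * http://www.phpoac.com/bbs/forumdisplay.php?fid=156&sid=VfcqTR
 * (hypothetical helper; the input stands in for $snoopy->results
 * after a fetchlinks() call).
 */
function filter_board_links(array $links): array
{
    $pattern = '#^http://www\.phpoac\.com/bbs/forumdisplay\.php\?fid=[0-9]+&sid=[a-zA-Z]{6}#i';

    // array_filter() keeps the original keys; array_values() reindexes from 0.
    return array_values(array_filter($links, function ($link) use ($pattern) {
        return preg_match($pattern, $link) === 1;
    }));
}

// Made-up input for illustration only.
$links = array(
    'http://www.phpoac.com/bbs/forumdisplay.php?fid=156&sid=VfcqTR',
    'http://www.phpoac.com/bbs/index.php',
    'http://example.com/forumdisplay.php?fid=1&sid=abcdef',
);
$boards = filter_board_links($links);
print_r($boards);
```

Returning a new, reindexed array instead of calling unset() inside the loop leaves the input untouched, which makes the step easier to check in isolation.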

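The forum sample groups the pages of one post by comparing a common left part of each address and then removes repeats with array_diff(). One alternative, sketched here under assumptions and not taken from the original script, is to group by the tid query parameter with PHP's parse_url() and parse_str(). The function name group_pages_by_thread and the test links are hypothetical; PHP 7 or later is assumed.

```php
<?php
/**
 * Group post page addresses by their tid parameter and drop duplicates
 * (hypothetical helper, not part of Snoopy).
 */
function group_pages_by_thread(array $links): array
{
    $threads = array();
    foreach ($links as $link) {
        $query = parse_url($link, PHP_URL_QUERY);
        if (!is_string($query)) {
            continue; // no query string, so this is not a post page
        }
        parse_str($query, $params);
        if (!isset($params['tid'])) {
            continue; // no post id in the address
        }
        // Using the address as the array key removes duplicates per post.
        $threads[$params['tid']][$link] = true;
    }
    // Turn each set of keys back into a plain list of addresses.
    return array_map('array_keys', $threads);
}

// Made-up input for illustration only.
$links = array(
    'http://www.phpoac.com/bbs/viewthread.php?tid=68127&extra=page%3D1&page=1&sid=iBLZfK',
    'http://www.phpoac.com/bbs/viewthread.php?tid=68127&extra=page%3D1&page=2&sid=iBLZfK',
    'http://www.phpoac.com/bbs/viewthread.php?tid=68127&extra=page%3D1&page=1&sid=iBLZfK',
    'http://www.phpoac.com/bbs/viewthread.php?tid=70001&extra=page%3D1&page=1&sid=iBLZfK',
    'http://www.phpoac.com/bbs/index.php',
);
$grouped = group_pages_by_thread($links);
print_r($grouped);
```

Grouping by a parsed parameter does not depend on the length of the address prefix, so it keeps working if the host name or path changes.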
This concludes the introduction to Snoopy and its usage. I hope it helps readers who are interested in PHP.
