1 Fetch the content of a web page: fetch
2 Fetch the text of a page (HTML tags stripped): fetchtext
3 Fetch the links and forms of a page: fetchlinks, fetchform
4 Support for proxy hosts
5 Support for basic username/password authentication
6 Support for setting the user agent, referer, cookies and header content
7 Support for browser redirects, with control over the redirect depth
8 Expansion of fetched links into fully qualified URLs (default)
9 Submitting data and retrieving the returned result
10 Support for following HTML frames
11 Support for passing cookies on redirects
Snoopy only requires PHP 4 or later. Because it is a plain PHP class that needs no extensions, it is a good choice when the server does not support curl.
Class methods:
fetch($URI)
-----------
This is the method used to fetch the content of a web page.
The $URI parameter is the URL of the page to fetch.
The fetched result is stored in $this->results.
If the page uses frames, Snoopy follows each frame and stores the results as an array in $this->results.
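Because $this->results may hold either a single string or an array of frames, calling code may want to treat both cases the same way. The following is a minimal sketch under that assumption; the helper name results_as_list is hypothetical and not part of Snoopy, and the sample strings stand in for real fetch results.

```php
<?php
// Hypothetical helper (not part of Snoopy): always return a list of page
// bodies, whether the fetch produced one string or an array of frames.
function results_as_list($results)
{
    if (is_array($results)) {
        return array_values($results);
    }
    return array($results);
}

// Stand-in values; in real use you would pass $snoopy->results.
$single = results_as_list("<html>one page</html>");
$frames = results_as_list(array("<html>frame A</html>", "<html>frame B</html>"));

// Print how many bodies each call produced.
echo count($single), " ", count($frames), "\n";
```

With this in place, a loop over the returned list works for framed and unframed pages alike.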
fetchtext($URI)
---------------
This method is similar to fetch(), except that it strips the HTML tags and other irrelevant data and returns only the text content of the page.
fetchform($URI)
---------------
This method is similar to fetch(), except that it strips the HTML tags and other irrelevant data and returns only the form elements of the page.
fetchlinks($URI)
----------------
This method is similar to fetch(), except that it strips the HTML tags and other irrelevant data and returns only the links in the page.
By default, relative links are automatically expanded into full URLs.
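To illustrate what expanding a relative link into a full URL involves, here is a simplified sketch. The function expand_link is hypothetical and is not Snoopy's own implementation; it handles only absolute, protocol-relative, root-relative and plain document-relative links, and does not resolve "../" segments. The page path /news/index.html is an assumed example.

```php
<?php
// Hypothetical helper: a simplified illustration of link expansion.
// It does not resolve "../" segments and is not Snoopy's own code.
function expand_link($base, $link)
{
    // Already absolute: the link starts with a scheme such as "http:".
    if (preg_match('#^[a-z][a-z0-9+.-]*:#i', $link)) {
        return $link;
    }

    $parts = parse_url($base);
    $root = $parts['scheme'] . '://' . $parts['host'];
    if (isset($parts['port'])) {
        $root .= ':' . $parts['port'];
    }

    // Protocol-relative link: reuse the scheme of the base URL.
    if (substr($link, 0, 2) === '//') {
        return $parts['scheme'] . ':' . $link;
    }

    // Root-relative link: append to scheme and host.
    if (substr($link, 0, 1) === '/') {
        return $root . $link;
    }

    // Document-relative link: resolve against the directory of the base path.
    $path = isset($parts['path']) ? $parts['path'] : '/';
    $dir = substr($path, 0, strrpos($path, '/') + 1);
    return $root . $dir . $link;
}

$base = "http://www.jb51.net/news/index.html";
echo expand_link($base, "/images/taoav.gif"), "\n";
echo expand_link($base, "page2.html"), "\n";
```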
submit($URI, $formvars)
-----------------------
This method submits a form to the URL specified by $URI. $formvars is an array holding the form variables to send.
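As an illustration of what a simple form submission carries, the sketch below encodes a $formvars array into a URL-encoded request body with PHP's built-in http_build_query() (available from PHP 5). This only shows the usual wire format of a basic form under that assumption; it is not Snoopy's internal code.

```php
<?php
// The same kind of array that would be passed to submit() as $formvars.
$formvars = array(
    "username" => "admin",
    "pwd"      => "admin",
);

// http_build_query() URL-encodes each key and value and joins the
// resulting key=value pairs with the given separator.
$body = http_build_query($formvars, '', '&');

echo $body, "\n";
```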
submittext($URI, $formvars)
---------------------------
This method is similar to submit(). The only difference is that it strips the HTML tags and other irrelevant data and returns only the text content of the page that results from the submission.
submitlinks($URI)
-----------------
This method is similar to submit(), except that it strips the HTML tags and other irrelevant data and returns only the links in the resulting page.
By default, relative links are automatically expanded into full URLs.
Class properties: (default values in parentheses)
$host           the host to connect to
$port           the port to connect to
$proxy_host     the proxy host to use, if any
$proxy_port     the proxy port to use, if any
$agent          the user agent to masquerade as (Snoopy v0.1)
$referer        referer information to send, if any
$cookies        cookies to send, if any
$rawheaders     other header information to send, if any
$maxredirs      maximum number of redirects, 0 = none allowed (5)
$offsiteok      whether or not to allow off-site redirects (true)
$expandlinks    whether or not to expand links into full URLs (true)
$user           authentication user name, if any
$pass           authentication password, if any
$accept         HTTP accept types (image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, */*)
$error          where errors are reported, if any
$response_code  response code returned from the server
$headers        headers returned from the server
$maxlength      maximum length of the returned data
$read_timeout   timeout for read operations (requires PHP 4 Beta 4+)
                set to 0 for no timeout
$timed_out      true if a read operation timed out (requires PHP 4 Beta 4+)
$maxframes      maximum number of frames to follow
$status         HTTP status of the fetch
$temp_dir       temporary directory that the web server can write to (/tmp)
$curl_path      path to the curl binary, set to false if there is no curl binary
Here are some code snippets:
1. Get the specified URL content
<?php
$url = "http://www.jb51.net";
include("snoopy.php");
$snoopy = new Snoopy;
$snoopy->fetch($url); // fetch the whole page
echo $snoopy->results; // print the result
// Alternatively, use one of the following:
// $snoopy->fetchtext($url);  // text content only (HTML stripped)
// $snoopy->fetchlinks($url); // links only
// $snoopy->fetchform($url);  // forms only
?>
2. Form submission
<?php
include("snoopy.php");
$snoopy = new Snoopy;
$formvars["username"] = "admin";
$formvars["pwd"] = "admin";
$action = "http://www.jb51.net"; // address the form is submitted to
$snoopy->submit($action, $formvars); // $formvars is the array of submitted values
echo $snoopy->results; // print the result returned after the form is submitted
// Alternatively, use one of the following:
// $snoopy->submittext($action, $formvars);  // only the text, stripped of HTML, is returned
// $snoopy->submitlinks($action, $formvars); // only the links are returned
?>
Once you can submit a form, you can do a lot more. Next, we will spoof the IP address and the browser.
3. Spoofing
<?php
$formvars["username"] = "admin";
$formvars["pwd"] = "admin";
$action = "http://www.jb51.net";
include "snoopy.php";
$snoopy = new Snoopy;
$snoopy->cookies["PHPSESSID"] = "FC106B1918BD522CC863F36890E6FFF7"; // spoof the session ID
$snoopy->agent = "(compatible; MSIE 4.01; MSN 2.5; AOL 4.0; Windows 98)"; // spoof the browser
$snoopy->referer = "http://www.jb51.net"; // spoof the referring page (HTTP_REFERER)
$snoopy->rawheaders["Pragma"] = "no-cache"; // cache-related HTTP header
$snoopy->rawheaders["x_forwarded_for"] = "127.0.0.101"; // spoof the IP reported in the HTTP header
$snoopy->submit($action, $formvars);
echo $snoopy->results;
?>
So we can spoof the session, the browser and the IP address, which makes a lot of things possible.
For example, on a poll that is protected by a verification code and an IP check, you could keep voting.
PS: What is spoofed here is only an HTTP header, not the IP address itself. Code that obtains the IP from REMOTE_ADDR is therefore not fooled.
Only code that reads the IP from HTTP headers (an approach meant to see through proxies) can be given an IP of your choosing.
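The difference between the two ways of reading the client IP can be sketched from the server's side. The function client_ip below is hypothetical and for illustration only; the array $request stands in for PHP's $_SERVER, and 203.0.113.7 is an assumed documentation address playing the role of the real connection address.

```php
<?php
// Hypothetical server-side helper, for illustration only.
// $server stands in for PHP's $_SERVER superglobal.
function client_ip(array $server, $trustForwardedHeader)
{
    if ($trustForwardedHeader && !empty($server['HTTP_X_FORWARDED_FOR'])) {
        // Header-based lookup: the client fully controls this value.
        $list = explode(',', $server['HTTP_X_FORWARDED_FOR']);
        return trim($list[0]);
    }
    // Address of the connecting peer, filled in by the web server itself.
    return $server['REMOTE_ADDR'];
}

// Simulated request carrying a forged forwarding header.
$request = array(
    'REMOTE_ADDR'          => '203.0.113.7',
    'HTTP_X_FORWARDED_FOR' => '127.0.0.101',
);

echo client_ip($request, true), "\n";  // value taken from the header
echo client_ip($request, false), "\n"; // value taken from the connection
```

A site that relies on the header sees whatever the client sent, while a site that relies on REMOTE_ADDR sees the address the connection actually came from.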
As for the verification code, in brief:
First, open the page in a normal browser and find the corresponding session ID.
Note down both the session ID and the verification code value.
Then use Snoopy to send them.
Principle: because the session ID is the same, the verification code the server expects is the same as the one entered the first time.
4. Sometimes we need to configure even more, and Snoopy has thought of that for us
<?php
include "snoopy.php";
$snoopy = new Snoopy;
$snoopy->proxy_host = "www.jb51.net"; // proxy host name (without the http:// prefix)
$snoopy->proxy_port = "8080"; // proxy port
$snoopy->maxredirs = 2; // maximum number of redirects
$snoopy->expandlinks = true; // expand links into full URLs; often used when scraping
// For example, a link to /images/taoav.gif becomes http://www.jb51.net/images/taoav.gif
$snoopy->maxframes = 5; // maximum number of frames to follow
// Note: when frames are followed, $snoopy->results is an array
echo $snoopy->error; // print the error message, if any
?>