PHP scraping tool: hands-on experience with Snoopy

Source: Internet
Author: User
Snoopy is a PHP class that simulates a web browser: it can retrieve the contents of a web page and submit forms. To run correctly, Snoopy requires PHP 4 or later with PCRE (Perl Compatible Regular Expressions) support, which any basic LAMP setup provides.

First, some features of Snoopy:

1. Fetches the content of a web page: fetch

2. Fetches the text content of a web page (with HTML tags removed): fetchtext

3. Fetches the links and forms of a web page: fetchlinks, fetchform

4. Supports proxy hosts

5. Supports basic username/password authentication

6. Supports setting the user agent, referer, cookies and header content

7. Supports browser redirects, with control over the redirect depth

8. Expands links in a page into fully qualified URLs (the default)

9. Submits data and gets the return value

10. Supports following HTML frames

11. Supports passing cookies on redirects

Snoopy requires PHP 4 or later. Because it is a plain PHP class, it needs no extension support, which makes it the best choice when the server does not support cURL.

Second, class methods:

fetch($URI)
-----------

This is the method used to fetch the contents of a web page. The $URI parameter is the URL of the page to fetch. The result of the fetch is stored in $this->results. If the page uses frames, Snoopy follows each frame and stores the contents of every frame as an array in $this->results.
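As a minimal sketch of that behaviour, the helper below wraps fetch() and always returns an array of page bodies, whether or not the page used frames. The function name fetch_bodies is hypothetical and not part of Snoopy. It relies only on the fetch() method and the results and error properties described in this article, and it assumes the caller has already included Snoopy.class.php.

```php
<?php
// Hypothetical helper (not part of Snoopy): fetch a URL and return the
// body or bodies as an array, so callers need not care about frames.
function fetch_bodies($snoopy, $url)
{
    if (!$snoopy->fetch($url)) {
        // fetch() reported a failure; the reason is in $snoopy->error.
        throw new RuntimeException("Error fetching document: " . $snoopy->error);
    }
    // With frames, results is already an array of frame contents;
    // otherwise it is a single string, which is wrapped in an array here.
    return is_array($snoopy->results) ? $snoopy->results : array($snoopy->results);
}
```

With a real object this would be called as fetch_bodies(new Snoopy, "http://www.example.com/"), where the URL is only a placeholder.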

fetchtext($URI)
---------------

This method is similar to fetch(), except that it strips HTML tags and other irrelevant data and returns only the text content of the page.

fetchform($URI)
---------------

This method is similar to fetch(), except that it strips HTML tags and other irrelevant data and returns only the form elements of the page.

fetchlinks($URI)
----------------

This method is similar to fetch(), except that it strips HTML tags and other irrelevant data and returns only the links in the page.
By default, relative links are automatically expanded into fully qualified URLs.
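Since fetchlinks() leaves an array of URLs in the results property, a common next step is to keep only the links that match a pattern. The sketch below does that with PHP's preg_grep(). The function name keep_matching_links and the example.com URLs are illustrative assumptions, and no Snoopy call is made, so the block runs on its own.

```php
<?php
// Hypothetical helper: keep only the links that match a regular expression.
// $links stands in for the array that fetchlinks() leaves in $snoopy->results.
function keep_matching_links(array $links, $pattern)
{
    // preg_grep() preserves the original keys, so re-index for a clean list.
    return array_values(preg_grep($pattern, $links));
}

// Made-up sample data, for illustration only.
$links = array(
    "http://www.example.com/bbs/forumdisplay.php?fid=156",
    "http://www.example.com/about.html",
    "http://www.example.com/bbs/forumdisplay.php?fid=7",
);
$boards = keep_matching_links($links, '/forumdisplay\.php\?fid=[0-9]+/i');
print_r($boards);
```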

submit($URI, $formvars)
-----------------------

This method submits a form to the address specified by $URI. $formvars is an array holding the form variables to send.
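To make the shape of $formvars concrete, here is a small sketch of a form submission. The helper name submit_form and the field names in the usage note are hypothetical; real field names depend on the form being submitted. The helper assumes only the submit() method and the results and error properties described in this article.

```php
<?php
// Hypothetical helper (not part of Snoopy): submit form fields and
// return the response body, or throw if the submission fails.
function submit_form($snoopy, $url, array $formvars)
{
    // $formvars maps each form field name to the value to send.
    if (!$snoopy->submit($url, $formvars)) {
        throw new RuntimeException("Error submitting form: " . $snoopy->error);
    }
    return $snoopy->results;
}
```

A call might look like submit_form($snoopy, $submit_url, array("username" => "joe", "password" => "secret")), where both field names are made up for illustration.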

submittext($URI, $formvars)
---------------------------

This method is similar to submit(). The only difference is that it strips HTML tags and other irrelevant data and returns only the text content of the page that comes back after the form is submitted (for example, after logging in).

submitlinks($URI)
-----------------

This method is similar to submit(). The only difference is that it strips HTML tags and other irrelevant data and returns only the links in the page.
By default, relative links are automatically expanded into fully qualified URLs.

Third, class properties (default values in parentheses):

$host the host to connect to
$port the port to connect to
$proxy_host the proxy host to use, if any
$proxy_port the proxy host port to use, if any
$agent the user agent to masquerade as (Snoopy v0.1)
$referer referer information to send, if any
$cookies cookies to send, if any
$rawheaders other header information to send, if any
$maxredirs maximum number of redirects, 0 = not allowed (5)
$offsiteok whether or not to allow redirects off-site (true)
$expandlinks whether to expand links into fully qualified URLs (true)
$user authentication user name, if any
$pass authentication password, if any
$accept HTTP accept types (image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, */*)
$error where errors are reported, if any
$response_code the response code returned from the server
$headers the header information returned from the server
$maxlength maximum length of the returned data
$read_timeout timeout for read operations (requires PHP 4 Beta 4+); set to 0 for no timeout
$timed_out true if a read operation timed out (requires PHP 4 Beta 4+)
$maxframes maximum number of frames to follow
$status the HTTP status of the fetch
$temp_dir temporary directory that the web server can write to (/tmp)
$curl_path path to the cURL binary; set to false if there is no cURL binary
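These properties are plain public members, so configuring a request is a matter of assigning them before calling a fetch or submit method. The sketch below copies a settings array onto an object and rejects names outside a short whitelist. The helper name apply_settings and the choice of whitelisted names are assumptions made for illustration. The test data uses a stdClass stand-in, so the block runs without Snoopy.

```php
<?php
// Hypothetical helper (not part of Snoopy): copy selected settings onto a
// Snoopy-like object, accepting only property names listed in this article.
function apply_settings($snoopy, array $settings)
{
    $allowed = array(
        "proxy_host", "proxy_port", "agent", "referer", "maxredirs",
        "offsiteok", "expandlinks", "user", "pass", "read_timeout", "maxframes",
    );
    foreach ($settings as $name => $value) {
        if (!in_array($name, $allowed, true)) {
            throw new InvalidArgumentException("Unknown setting: $name");
        }
        $snoopy->$name = $value;
    }
    return $snoopy;
}

// stdClass stands in for a real Snoopy object here.
$configured = apply_settings(new stdClass, array(
    "maxredirs" => 2,       // follow at most two redirects
    "offsiteok" => false,   // do not follow redirects to other sites
    "read_timeout" => 10,   // give up on reads after 10 seconds
));
print_r($configured);
```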

Fourth, a demo:

<?php
include "Snoopy.class.php";
$snoopy = new Snoopy;

$snoopy->proxy_host = "www.baidu.com";
$snoopy->proxy_port = "8080";

$snoopy->agent = "(compatible; MSIE 4.01; MSN 2.5; AOL 4.0; Windows 98)";
$snoopy->referer = "http://www.baidu.com/";

$snoopy->cookies["SessionID"] = "238472834723489";
$snoopy->cookies["FavoriteColor"] = "RED";

$snoopy->rawheaders["Pragma"] = "no-cache";

$snoopy->maxredirs = 2;
$snoopy->offsiteok = false;
$snoopy->expandlinks = false;

$snoopy->user = "Joe";
$snoopy->pass = "Bloe";

if ($snoopy->fetchtext("http://www.baidu.com")) {
    echo "<pre>" . htmlspecialchars($snoopy->results) . "</pre>\n";
} else {
    echo "Error fetching document: " . $snoopy->error . "\n";
}
?>

Example: capturing phpchina with Snoopy

<?php
include "Snoopy.class.php";
$snoopy = new Snoopy;

// $submit_url (the login form address) and $submit_vars (the login form
// fields) must be set here; their values are not given in this article.
$snoopy->submit($submit_url, $submit_vars);

if ($snoopy->results) {
    // Get the link addresses
    $snoopy->fetchlinks("http://www.phpchina.com/bbs");
    $url = $snoopy->results;
    print_r($url);
    foreach ($url as $key => $value) {
        // Keep only forum board addresses such as
        // http://www.phpchina.com/bbs/forumdisplay.php?fid=156&sid=VfcqTR
        if (!preg_match("/^(http:\/\/www\.phpchina\.com\/bbs\/forumdisplay\.php\?fid=)[0-9]*&sid=[a-zA-Z]{6}/i", $value)) {
            unset($url[$key]);
        }
    }
    // print_r($url);

    // With the board array in hand, fetch the first page of each board
    $i = 0;
    foreach ($url as $key => $value) {
        if ($i >= 1) { // test limit: only process one board
            break;
        } else {
            // Visit the board and extract the post links. A real run would
            // first extract the paging data, then the posts on each page.
            $snoopy = new Snoopy();
            $snoopy->fetchlinks($value);
            $tie = array();
            $tie[$i] = $snoopy->results;
            print_r($tie);

            // Filter the array
            foreach ($tie[$i] as $key => $value) {
                // Keep only post addresses such as
                // http://www.phpchina.com/bbs/viewthread.php?tid=68127&extra=page%3D1&page=1&sid=iBLZfK
                if (!preg_match("/^(http:\/\/www\.phpchina\.com\/bbs\/viewthread\.php\?tid=)[0-9]*&extra=page\%3D1&page=[0-9]*&sid=[a-zA-Z]{6}/i", $value)) {
                    unset($tie[$i][$key]);
                }
            }
            // print_r($tie[$i]);

            // Regroup the array so that the pages of one post sit together
            $left = ''; // the common left-hand part of the address
            $j = 0;
            $page = array();
            foreach ($tie[$i] as $key => $value) {
                $left = substr($value, 0, 52);
                $m = 0;
                foreach ($tie[$i] as $pkey => $pvalue) {
                    if (substr($pvalue, 0, 52) == $left) {
                        $page[$j][$m] = $pvalue;
                        $m++;
                    }
                }
                $j++;
            }

            // Remove duplicates: start
            // $page = array_unique($page); // only works for one-dimensional arrays
            $paget = array();
            $paget[0] = $page[0];
            $nums = count($page);
            for ($n = 1; $n < $nums; $n++) {
                $paget[$n] = array_diff($page[$n], $page[$n - 1]);
            }
            // Remove duplicates: end

            // Remove empty entries and redefine the page array
            unset($page);
            $page = array_filter($paget);
            print_r($page);

            $u = 0;
            $title = array();
            $content = array();
            $temp = '';
            $tt = array();
            foreach ($page as $key => $value) { // outer loop: one post
                $content[$u] = ''; // start empty so that .= can append to it
                if (is_array($value)) {
                    foreach ($value as $k1 => $v1) { // inner loop: the pages of one post
                        $snoopy = new Snoopy();
                        $snoopy->fetch($v1);
                        $temp = $snoopy->results;

                        // Read the title. The opening tag is missing from the
                        // source; a plain <h2> is assumed here.
                        if (!preg_match_all("/<h2>(.*)<\/h2>/i", $temp, $tt)) {
                            echo "no title";
                            exit;
                        } else {
                            $title[$u] = $tt[1][1];
                        }
                        unset($tt);

                        // Read the content. The opening <div> tag is missing
                        // from the source; <div[^>]*> is an assumption.
                        if (!preg_match_all("/<div[^>]*>(.*)<\/div>/i", $temp, $tt)) {
                            print_r($tt);
                            echo "No Content1";
                            exit;
                        } else {
                            foreach ($tt[1] as $c => $c2) {
                                $content[$u] .= $c2;
                            }
                        }
                    }
                } else {
                    // Single page: fetch the content directly
                    $snoopy = new Snoopy();
                    $snoopy->fetch($value);
                    $temp = $snoopy->results;

                    // Read the title (same assumed <h2> pattern as above)
                    if (!preg_match_all("/<h2>(.*)<\/h2>/i", $temp, $tt)) {
                        echo "no title";
                        exit;
                    } else {
                        $title[$u] = $tt[1][1];
                    }
                    unset($tt);

                    // Read the content (same assumed <div> pattern as above)
                    if (!preg_match_all("/<div[^>]*>(.*)<\/div>/i", $temp, $tt)) {
                        echo "No Content2";
                        exit;
                    } else {
                        foreach ($tt[1] as $c => $c2) {
                            $content[$u] .= $c2;
                        }
                    }
                }
                $u++;
            }
            print_r($content);
        }
        $i++;
    }
} else {
    echo "Login failed";
    exit;
}
?>
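The demo groups the pages of one post by comparing the first 52 characters of each URL and then removes duplicates with array_diff(). As an alternative sketch (a different technique from the author's, not a transcription of it), the pages can be grouped by the tid query parameter using parse_url() and parse_str(), which does not depend on a fixed prefix length. The function name group_by_thread and the example.com URLs are illustrative assumptions.

```php
<?php
// Hypothetical helper: group thread-page URLs by their "tid" query
// parameter and drop duplicate URLs inside each group.
function group_by_thread(array $urls)
{
    $groups = array();
    foreach ($urls as $url) {
        $query = parse_url($url, PHP_URL_QUERY);
        if (!is_string($query)) {
            continue; // no query string, so no thread id
        }
        parse_str($query, $params);
        if (!isset($params["tid"])) {
            continue; // not a thread URL
        }
        $groups[$params["tid"]][] = $url;
    }
    foreach ($groups as $tid => $pages) {
        $groups[$tid] = array_values(array_unique($pages));
    }
    return $groups;
}

// Made-up sample data, for illustration only.
$urls = array(
    "http://www.example.com/bbs/viewthread.php?tid=68127&page=1",
    "http://www.example.com/bbs/viewthread.php?tid=68127&page=2",
    "http://www.example.com/bbs/viewthread.php?tid=68127&page=1",
    "http://www.example.com/bbs/viewthread.php?tid=70001&page=1",
);
$threads = group_by_thread($urls);
print_r($threads);
```

Each key of the returned array is a thread id, and each value is the list of distinct page URLs for that thread.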
