Using Snoopy, a PHP data-fetching class

Source: Internet
Author: User
Tags php class response code
A detailed guide to Snoopy, a PHP scraping tool
Snoopy is a PHP class that simulates a web browser: it can retrieve the contents of a web page and submit forms. Snoopy requires PHP 4 or later with PCRE (Perl Compatible Regular Expressions) support; a basic LAMP setup provides both.
First, some features of Snoopy:
1. Fetch the content of a web page: fetch
2. Fetch the text of a web page (with HTML tags removed): fetchtext
3. Fetch the links and forms of a web page: fetchlinks, fetchform
4. Support for proxy hosts
5. Support for basic username/password authentication
6. Support for setting the user agent, referer, cookies, and header content
7. Support for browser redirects, with control over the redirect depth
8. Expand links in a web page into fully qualified URLs (the default)
9. Submit data and get the return value
10. Support for following HTML frames
11. Support for passing cookies on redirects
Snoopy requires PHP 4 or later. Because it is a plain PHP class, it needs no extension support, which makes it the best choice when the server does not support curl.
Second, class methods:
fetch($URI)
-----------
This method fetches the contents of a web page. The $URI parameter is the URL of the page to fetch. The result of the fetch is stored in $this->results. If the page is a frameset, Snoopy follows each frame and stores the results as an array in $this->results.
fetchtext($URI)
---------------
This method is similar to fetch(), except that it removes HTML tags and other unrelated data and returns only the text content of the page.
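To make "removes HTML tags and other unrelated data" concrete, here is a minimal sketch of that kind of reduction using only PHP built-ins. The function name html_to_text is hypothetical, and this is not Snoopy's own implementation, which may treat markup differently. It assumes PHP 7 or later.

```php
<?php
// Hypothetical helper (not part of Snoopy): reduce an HTML string to plain text.
function html_to_text(string $html): string
{
    // Drop script and style blocks entirely, including their contents.
    $html = preg_replace('#<(script|style)\b[^>]*>.*?</\1>#is', '', $html);
    // Remove the remaining tags, keeping the text between them.
    $text = strip_tags($html);
    // Turn entities such as &amp; back into characters.
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');
    // Collapse runs of whitespace into single spaces.
    return trim(preg_replace('/\s+/', ' ', $text));
}

$sample = "<h1>News</h1>\n<script>var a = 1;</script>\n<p>Hello &amp; <b>welcome</b></p>";
echo html_to_text($sample), "\n"; // prints: News Hello & welcome
```

With Snoopy itself, the equivalent text would simply be read from $snoopy->results after a call to fetchtext().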
fetchform($URI)
---------------
This method is similar to fetch(), except that it discards everything else and returns only the form content of the page.
fetchlinks($URI)
----------------
This method is similar to fetch(), except that it removes HTML tags and other unrelated data and returns only the links in the page.
By default, relative links are automatically expanded into full URLs.
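The following sketch illustrates what expanding a relative link into a full URL involves. The helper expand_link and the example.com addresses are assumptions made for illustration; this is not Snoopy's internal code. It covers only the simple cases (already absolute, root-relative, and same-directory links) and does not resolve "../" segments or protocol-relative links. It assumes PHP 7 or later.

```php
<?php
// Hypothetical helper (not part of Snoopy): resolve $link against $base.
function expand_link(string $base, string $link): string
{
    // Already absolute (has a scheme): leave untouched.
    if (preg_match('#^[a-z][a-z0-9+.-]*://#i', $link)) {
        return $link;
    }
    $parts = parse_url($base);
    $root = $parts['scheme'] . '://' . $parts['host']
          . (isset($parts['port']) ? ':' . $parts['port'] : '');
    // Root-relative link: append to scheme and host.
    if (isset($link[0]) && $link[0] === '/') {
        return $root . $link;
    }
    // Relative link: append to the directory part of the base path.
    $path = $parts['path'] ?? '/';
    $dir = substr($path, 0, strrpos($path, '/') + 1);
    return $root . $dir . $link;
}

echo expand_link('http://www.example.com/bbs/index.php', 'forumdisplay.php?fid=1'), "\n";
// prints: http://www.example.com/bbs/forumdisplay.php?fid=1
```

Setting $snoopy->expandlinks = false turns the expansion off, in which case links are returned as they appear in the page.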
submit($URI, $formvars)
-----------------------
This method submits a form to the URL specified by $URI. $formvars is an array holding the form parameters.
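As a rough illustration of what an array of form parameters becomes on the wire, the sketch below encodes one as a URL-encoded request body with PHP's built-in http_build_query. The field names and values are made up, and Snoopy's own encoding may differ in detail; the point is only the general shape of the data.

```php
<?php
// Hypothetical form fields, for illustration only.
$formvars = array(
    'username'    => 'alice',
    'password'    => 'p@ss word',
    'loginsubmit' => 'submit',
);

// Encode the array as an application/x-www-form-urlencoded string.
$body = http_build_query($formvars, '', '&');

echo $body, "\n"; // prints: username=alice&password=p%40ss+word&loginsubmit=submit
```

With Snoopy, the same array would be passed unencoded as the second argument of submit().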
submittext($URI, $formvars)
---------------------------
This method is similar to submit(), except that it removes HTML tags and other unrelated data and returns only the text of the page that comes back after the submission (for example, after logging in).
submitlinks($URI, $formvars)
----------------------------
This method is similar to submit(), except that it removes HTML tags and other unrelated data and returns only the links in the page.
By default, relative links are automatically expanded into full URLs.
Third, class properties (default values in parentheses):
$host: the host to connect to
$port: the port to connect to
$proxy_host: the proxy host to use, if any
$proxy_port: the proxy port to use, if any
$agent: the user agent to masquerade as (Snoopy v0.1)
$referer: referer information, if any
$cookies: cookies, if any
$rawheaders: other header information, if any
$maxredirs: maximum number of redirects, 0 = not allowed (5)
$offsiteok: whether to allow redirects off-site (true)
$expandlinks: whether to expand links into full addresses (true)
$user: user name for authentication, if any
$pass: password for authentication, if any
$accept: HTTP Accept types (image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, */*)
$error: error messages, if any
$response_code: response code returned from the server
$headers: headers returned from the server
$maxlength: maximum length of the returned data
$read_timeout: timeout for read operations (requires PHP 4 Beta 4+); set to 0 for no timeout
$timed_out: true if a read operation timed out (requires PHP 4 Beta 4+)
$maxframes: maximum number of frames to follow
$status: HTTP status of the fetch
$temp_dir: temporary directory that the web server can write to (/tmp)
$curl_path: path to the curl binary; set to false if no curl binary is available
Fourth, a demo:
include "Snoopy.class.php";
$snoopy = new Snoopy;
// Proxy, user agent, and referer to send with the request
$snoopy->proxy_host = "www.baidu.com";
$snoopy->proxy_port = "8080";
$snoopy->agent = "(compatible; MSIE 4.01; MSN 2.5; AOL 4.0; Windows 98)";
$snoopy->referer = "http://www.baidu.com/";
// Cookies and extra headers (the session ID is quoted so that it is a valid PHP string)
$snoopy->cookies["SessionID"] = "238472834723489l";
$snoopy->cookies["FavoriteColor"] = "RED";
$snoopy->rawheaders["Pragma"] = "no-cache";
// Redirect and link-expansion behaviour
$snoopy->maxredirs = 2;
$snoopy->offsiteok = false;
$snoopy->expandlinks = false;
// Credentials for basic authentication
$snoopy->user = "Joe";
$snoopy->pass = "Bloe";
if ($snoopy->fetchtext("http://www.baidu.com"))
{
    echo "<PRE>" . htmlspecialchars($snoopy->results) . "</PRE>\n";
}
else
    echo "Error fetching document: " . $snoopy->error . "\n";
Example: scraping PHPChina with Snoopy
<?php
// Scrape phpchina
set_time_limit(0);
require_once("Snoopy.class.php");
$snoopy = new Snoopy();
// Log in to the forum
$submit_url = "http://www.phpchina.com/bbs/logging.php?action=login";
$submit_vars["LoginMode"] = "normal";
$submit_vars["Styleid"] = "1";
$submit_vars["cookietime"] = "315360000";
$submit_vars["Loginfield"] = "username";
$submit_vars["username"] = "***"; // your user name
$submit_vars["password"] = "*****"; // your password
$submit_vars["QuestionID"] = "0";
$submit_vars["answer"] = "";
$submit_vars["loginsubmit"] = "submit";
$snoopy->submit($submit_url, $submit_vars);
if ($snoopy->results)
{
    // Get the link addresses
    $snoopy->fetchlinks("http://www.phpchina.com/bbs");
    $url = array();
    $url = $snoopy->results;
    print_r($url);
    foreach ($url as $key => $value)
    {
        // Keep only addresses such as http://www.phpchina.com/bbs/forumdisplay.php?fid=156&sid=VfcqTR, that is, forum board addresses
        if (!preg_match("/^(http:\/\/www\.phpchina\.com\/bbs\/forumdisplay\.php\?fid=)[0-9]*&sid=[a-zA-Z]{6}/i", $value))
        {
            unset($url[$key]);
        }
    }
    print_r($url);
    // We now have the array of boards; loop over it, visiting the first page of the first board
    $i = 0;
    foreach ($url as $key => $value)
    {
        if ($i >= 1)
        {
            // Limit for testing
            break;
        }
        else
        {
            // Visit the board and extract the thread links. A full run would first need to extract the thread pagination data, then fetch the thread data based on it
            $snoopy = new Snoopy();
            $snoopy->fetchlinks($value);
            $tie = array();
            $tie[$i] = $snoopy->results;
            print_r($tie);
            // Filter the array
            foreach ($tie[$i] as $key => $value)
            {
                // Keep only addresses such as http://www.phpchina.com/bbs/viewthread.php?tid=68127&amp;extra=page%3D1&amp;page=1&sid=iblzfk
                if (!preg_match("/^(http:\/\/www\.phpchina\.com\/bbs\/viewthread\.php\?tid=)[0-9]*&amp;extra=page\%3D1&amp;page=[0-9]*&sid=[a-zA-Z]{6}/i", $value))
                {
                    unset($tie[$i][$key]);
                }
            }
            print_r($tie[$i]);
            // Regroup the array so that the different pages of the same thread end up in one sub-array
            $left = ''; // the common left-hand part of the link
            $j = 0;
            $page = array();
            foreach ($tie[$i] as $key => $value)
            {
                $left = substr($value, 0, 52);
                $m = 0;
                foreach ($tie[$i] as $pkey => $pvalue)
                {
                    // Regroup the array
                    if (substr($pvalue, 0, 52) == $left)
                    {
                        $page[$j][$m] = $pvalue;
                        $m++;
                    }
                }
                $j++;
            }
            // Remove duplicates: start
            // $page = array_unique($page); // only works for one-dimensional arrays
            $paget[0] = $page[0];
            $nums = count($page);
            for ($n = 1; $n < $nums; $n++)
            {
                $paget[$n] = array_diff($page[$n], $page[$n-1]);
            }
            // Remove duplicate values from the multi-dimensional array: end
            // Remove empty values from the array
            unset($page);
            $page = array(); // redefine the page array
            $page = array_filter($paget);
            print_r($page);
            $u = 0;
            $title = array();
            $content = array();
            $temp = '';
            $tt = array();
            foreach ($page as $key => $value)
            {
                // Outer loop: one iteration per thread
                if (is_array($value))
                {
                    foreach ($value as $k1 => $v1)
                    {
                        // Inner loop: a thread has n pages
                        $snoopy = new Snoopy();
                        $snoopy->fetch($v1);
                        $temp = $snoopy->results;
                        // Read the title. The title regex is truncated in the source (only "/ survives),
                        // so $title_pattern is a placeholder: define it so that group 1 captures the title
                        if (!preg_match_all($title_pattern, $temp, $tt))
                        {
                            echo "no title";
                            exit;
                        }
                        else
                        {
                            $title[$u] = $tt[1][1];
                        }
                        unset($tt);
                        // Read the content
                        if (!preg_match_all("/<div id=\"postmessage_[0-9]{1,8}\" class=\"t_msgfont\">(.*)<\/div>/i", $temp, $tt))
                        {
                            print_r($tt);
                            echo "No Content1";
                            exit;
                        }
                        else
                        {
                            foreach ($tt[1] as $c => $c2)
                            {
                                $content[$u] .= $c2;
                            }
                        }
                    }
                }
                else
                {
                    // Single page: fetch the content directly
                    $snoopy = new Snoopy();
                    $snoopy->fetch($value);
                    $temp = $snoopy->results;
                    // Read the title. The title regex is truncated in the source (only "/ survives),
                    // so $title_pattern is a placeholder: define it so that group 1 captures the title
                    if (!preg_match_all($title_pattern, $temp, $tt))
                    {
                        echo "no title";
                        exit;
                    }
                    else
                    {
                        $title[$u] = $tt[1][1];
                    }
                    unset($tt);
                    // Read the content
                    if (!preg_match_all("/<div id=\"postmessage_[0-9]*\" class=\"t_msgfont\">(.*)<\/div>/i", $temp, $tt))
                    {
                        echo "No Content2";
                        exit;
                    }
                    else
                    {
                        foreach ($tt[1] as $c => $c2)
                        {
                            $content[$u] .= $c2;
                        }
                    }
                }
                $u++;
            }
            print_r($content);
        }
        $i++;
    }
}
else
{
    echo "Login Failed";
    exit;
}
?>
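The script above groups the pages of a thread by comparing the first 52 characters of each link and then removes duplicates with array_diff. As an alternative sketch of the same step, the links can be grouped by the tid query parameter, which does not depend on a fixed prefix length. The function group_thread_pages and the sample links are hypothetical and only mirror the URL shapes shown in the script; this is a swapped-in technique, not the author's method. It assumes PHP 7 or later.

```php
<?php
// Hypothetical helper: group viewthread.php links by thread ID and drop duplicates.
function group_thread_pages(array $links): array
{
    $threads = array();
    foreach ($links as $link) {
        // Skip anything that is not a thread link; capture the numeric tid.
        if (!preg_match('#/viewthread\.php\?tid=([0-9]+)#', $link, $m)) {
            continue;
        }
        // Using the link as an array key makes repeated links collapse into one.
        $threads[$m[1]][$link] = true;
    }
    // Turn each set of keys back into a plain list of links, keyed by tid.
    return array_map('array_keys', $threads);
}

// Made-up sample links shaped like the ones matched in the script.
$links = array(
    'http://www.phpchina.com/bbs/viewthread.php?tid=68127&extra=page%3D1&page=1',
    'http://www.phpchina.com/bbs/viewthread.php?tid=68127&extra=page%3D1&page=2',
    'http://www.phpchina.com/bbs/forumdisplay.php?fid=156',
    'http://www.phpchina.com/bbs/viewthread.php?tid=70001&extra=page%3D1&page=1',
    'http://www.phpchina.com/bbs/viewthread.php?tid=68127&extra=page%3D1&page=1',
);

print_r(group_thread_pages($links));
```

The result has one entry per thread ID, each holding the distinct page links of that thread, which is the shape the title and content loops in the script expect from $page.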