How to use a PHP server-side proxy to fetch Web content

Source: Internet
Author: User
Recently my company temporarily cut off access to the external network, leaving only the company's own site reachable. To be honest, for someone who does web development, that leaves you not knowing whether to laugh or cry...

Because I still needed to look things up, I had to write a simple PHP server-side proxy page.

The simple frame page:

<style type="text/css">
*{margin:0;padding:0;}
html,body{overflow:hidden;}
td{padding:0;vertical-align:top;}
</style>
<table width="100%" height="100%" cellspacing="0" cellpadding="0" border="0">
<tr>
<td style="height:25px;background:#d4d0c8;padding:5px 10px;">
<form method="post" action="action_get.php" target="actioncontent" style="margin:0;padding:0;">
<input type="text" id="TargetUrl" name="TargetUrl" style="width:100%;border:1px inset;margin:0;"/>
</form>
</td>
</tr>
<tr>
<td>
<iframe name="actioncontent" style="width:100%;height:100%;"></iframe>
</td>
</tr>
</table>
<script type="text/javascript">
window.onload = function () {
document.getElementById('TargetUrl').focus();
};
</script>

PHP Proxy page:

<?php
// Use Snoopy to fetch pages
include "Snoopy.class.php";
// Target URL
$url = $_REQUEST['TargetUrl'];
// Collect all the other parameters that were passed in
$par = array();
$GetPost = array_merge($_POST, $_GET);
foreach ($GetPost as $Key => $Value) {
if ($Key != 'TargetUrl') {
// The client encodes "%" a second time; undo that here
$Value = str_replace("%25", "%", $Value);
array_push($par, $Key . "=" . $Value);
}
}
// Check whether the target URL already contains "?" (that is, already has parameters)
$cc = (strpos($url, "?") !== false) ? "&" : "?";
// Reassemble the URL: append the extra parameters only if there are any
$geturl = empty($par) ? $url : $url . $cc . implode("&", $par);
// Fetch the content of the reassembled URL
$snoopy = new Snoopy;
$snoopy->fetch($geturl);
// Neutralize script code in the target content that might change the parent window's address
$org = str_replace("top.location", "top.title", $snoopy->results);
// Try to convert the target content from GBK to UTF-8
$opt = iconv("GBK", "UTF-8", $org);
// Decide whether the target content is GBK or UTF-8: a non-empty conversion result is treated as GBK
$ec = ($opt !== false && strlen($opt)) ? "GBK" : "UTF-8";

?>
<script type="text/javascript">
// Run inside a closure so that scripts in the fetched content cannot interfere
(function () {
  // Percent-encode non-Latin-1 characters as UTF-8 bytes; other characters pass through unchanged
  var easyUTF8 = function (gbk) {
    if (!gbk) { return ''; }
    var utf8 = [];
    for (var i = 0; i < gbk.length; i++) {
      var s_str = gbk.charAt(i);
      if (!(/^%u/i.test(escape(s_str)))) { utf8.push(s_str); continue; }
      var s_char = gbk.charCodeAt(i);
      // Below U+0800 the three-byte form would be invalid, so use the built-in encoder
      if (s_char < 0x800) { utf8.push(encodeURIComponent(s_str)); continue; }
      var c_char = s_char.toString(2).split('');
      // Pad to 16 bits, then split into 4 + 6 + 6 bits for the three UTF-8 bytes
      while (c_char.length < 16) { c_char.unshift('0'); }
      var a_b = [];
      a_b[0] = '1110' + c_char.splice(0, 4).join('');
      a_b[1] = '10' + c_char.splice(0, 6).join('');
      a_b[2] = '10' + c_char.splice(0, 6).join('');
      for (var n = 0; n < a_b.length; n++) {
        utf8.push('%' + parseInt(a_b[n], 2).toString(16).toUpperCase());
      }
    }
    return utf8.join('');
  };
  // Split a URL into its part before "?" (filename) and its query parameters
  var getArgs = function (sUrl) {
    var sArg = sUrl.split('?'), rv = {};
    rv.filename = sArg[0];
    if (!sArg[1]) { return rv; }
    var aArg = sArg[1].split('&'), aTmp = [];
    for (var i = 0; i < aArg.length; i++) { aTmp = aArg[i].split('='); rv[aTmp[0]] = aTmp[1]; }
    return rv;
  };
  // Create a hidden form field; IE needs name and value in the createElement call
  var createIPH = function (name, value) {
    if (!name) { return; }
    if (/msie/i.test(navigator.appVersion)) {
      return document.createElement('<input type="hidden" name="' + name + '" value="' + value + '"/>');
    } else {
      var dfi = document.createElement('input');
      dfi.type = 'hidden'; dfi.name = name; dfi.value = value;
      return dfi;
    }
  };
  // Echo the target URL into the parent window's text box
  var dtu = top.document.getElementById('TargetUrl');
  if (dtu) { dtu.value = '<?php echo $geturl; ?>'; }
  // Target URL and its domain
  var sRef = '<?php echo $url; ?>';
  var sDomain = sRef.match(/^http:\/\/[^\/]*/i)[0];
  // After the page loads, perform the following steps
  var process = function () {
    // Collect all links in the page
    var dLink = document.getElementsByTagName('a'), la = dLink.length;
    // Collect all forms in the page
    var dForm = document.getElementsByTagName('form'), lf = dForm.length;
    // Walk through all the links and rewrite their href addresses
    for (var i = 0; i < la; i++) {
      // Relative links resolve against the proxy's own host; map them back to the target domain
      var src = dLink[i].href.toString().replace(/^http:\/\/www\.w3cgroup\.com(?:\/geturl)?/i, sDomain);
      var oArgs = getArgs(src), aHref = [];
      // UTF-8 encode the parameter values
      for (var d in oArgs) {
        if (!d || d == 'filename' || !oArgs[d]) { continue; }
        aHref.push(d + '=' + encodeURIComponent(easyUTF8(oArgs[d])));
      }
      var gHref = aHref.length ? oArgs.filename + '?' + aHref.join('&') : oArgs.filename;
      // Reset the link address
      dLink[i].href = 'http://www.w3cgroup.com/geturl/action_get.php?TargetUrl=' + gHref;
    }
    // Walk through all the forms and rewrite their action addresses
    for (i = 0; i < lf; i++) {
      // Take the form action and normalize it
      var src = dForm[i].action.toString().replace(/^http:\/\/www\.w3cgroup\.com(?:\/geturl)?/i, sDomain);
      if (!(/^http/.test(src))) { src = (/^\/.*$/.test(src)) ? (sDomain + src) : (sDomain + '/' + src); }
      // Create a hidden field TargetUrl whose value is the src address computed above
      var dfi = createIPH('TargetUrl', src);
      dForm[i].appendChild(dfi);
      // Create a hidden field "ie" with the value utf-8, purely for search engines
      var dfi2 = createIPH('ie', 'utf-8');
      dForm[i].appendChild(dfi2);
      // Reset the form's submit target window
      dForm[i].target = 'actioncontent';
      // Reset the form's action address
      dForm[i].action = 'http://www.w3cgroup.com/geturl/action_get.php';
      // Reset the form's onsubmit handler so that field values are UTF-8 encoded
      dForm[i].onsubmit = function () {
        var dlms = this.elements, l = dlms.length - 1, pn = '', pt = '', pv = '';
        for (var i = 0; i < l; i++) {
          pn = dlms[i].name, pt = dlms[i].type, pv = dlms[i].value;
          if (!pn || pn == 'TargetUrl' || pn == 'ie') { continue; }
          if (pt == 'submit' || pt == 'reset' || pt == 'button') {
            dlms[i].value = encodeURIComponent(pv);
          } else {
            dlms[i].value = encodeURIComponent(easyUTF8(pv));
          }
        }
      };
    }
  };
  // Bind process to window.onload
  if (document.attachEvent) { window.attachEvent('onload', process); } else { window.addEventListener('load', process, false); }
})();
</script>
<?php
// The script above is emitted before the fetched content, so that a script error in that content cannot stop our own code from running.
// Output the fetched target page content
echo ($ec == "GBK") ? $opt : $org;
?>
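
The link rewriting performed by the script above can be tried outside the browser. The following is only a rough sketch of the split-and-reassemble step: the names `PROXY`, `splitUrl` and `toProxyUrl` and the host `proxy.example` are made up for illustration, and each value is encoded once with `encodeURIComponent` rather than with the easyUTF8 plus `encodeURIComponent` combination used in the article.

```javascript
// Hypothetical proxy address; the article's script uses its own host here.
var PROXY = 'http://proxy.example/action_get.php';

// Split "page?a=1&b=2" into { filename: 'page', args: { a: '1', b: '2' } }.
function splitUrl(url) {
  var q = url.indexOf('?');
  var result = { filename: q < 0 ? url : url.slice(0, q), args: {} };
  if (q < 0) { return result; }
  var pairs = url.slice(q + 1).split('&');
  for (var i = 0; i < pairs.length; i++) {
    if (!pairs[i]) { continue; }
    var eq = pairs[i].indexOf('=');
    var key = eq < 0 ? pairs[i] : pairs[i].slice(0, eq);
    result.args[key] = eq < 0 ? '' : pairs[i].slice(eq + 1);
  }
  return result;
}

// Re-encode every parameter value and put the target URL behind the proxy.
function toProxyUrl(target) {
  var parts = splitUrl(target), query = [];
  for (var key in parts.args) {
    if (Object.prototype.hasOwnProperty.call(parts.args, key)) {
      query.push(key + '=' + encodeURIComponent(parts.args[key]));
    }
  }
  var rebuilt = query.length ? parts.filename + '?' + query.join('&') : parts.filename;
  return PROXY + '?TargetUrl=' + rebuilt;
}

console.log(toProxyUrl('http://example.com/search?q=中文&page=2'));
```

Because the rebuilt URL is appended as is, every parameter after the first reaches the proxy as a separate request parameter, which is why the PHP page collects the remaining parameters and appends them to the target URL again.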

For this small project I wrote an important JavaScript function, easyUTF8, which makes it easy to turn the text of a GBK-encoded page into percent-encoded UTF-8 from within JavaScript.
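
The bit-splitting idea behind easyUTF8 can be shown for a single character. This is a minimal sketch, assuming the character lies in U+0800 to U+FFFF (the three-byte UTF-8 range, which covers common Chinese characters); the name `utf8PercentEncode3` is hypothetical and not part of the article's code, and it uses bit operators where the article works on a string of binary digits.

```javascript
// Hypothetical helper: percent-encode one character in U+0800..U+FFFF
// as its three UTF-8 bytes.
function utf8PercentEncode3(ch) {
  var code = ch.charCodeAt(0);
  // Reject characters outside the three-byte range, and lone surrogates.
  if (code < 0x800 || (code >= 0xD800 && code <= 0xDFFF)) {
    throw new RangeError('expects one non-surrogate character in U+0800..U+FFFF');
  }
  var bytes = [
    0xE0 | (code >> 12),          // 1110xxxx: top 4 bits
    0x80 | ((code >> 6) & 0x3F),  // 10xxxxxx: middle 6 bits
    0x80 | (code & 0x3F)          // 10xxxxxx: low 6 bits
  ];
  var out = '';
  for (var i = 0; i < bytes.length; i++) {
    // Every byte is at least 0x80, so it always prints as two hex digits.
    out += '%' + bytes[i].toString(16).toUpperCase();
  }
  return out;
}

console.log(utf8PercentEncode3('中'));                               // %E4%B8%AD
console.log(utf8PercentEncode3('中') === encodeURIComponent('中'));  // true
```

For characters in this range the result matches what the built-in `encodeURIComponent` produces, so the hand-written version is mainly useful for seeing how the 16 bits are distributed over the three bytes.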

We also handled a compatibility issue with adding fields to a form; see the createIPH function. In IE, creating a form field first and assigning its name and value afterwards does not give the result we want, so the name and value are specified in the createElement call itself. This is already described in the DHTML manual.
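
As an alternative illustration of the same workaround, the sketch below picks the creation method by trying it instead of testing the user agent string. It rests on several assumptions: `createHiddenInput` and its `doc` parameter are hypothetical names, `name` is assumed to contain no quote characters, and a browser that does not accept markup in `createElement` is assumed to throw.

```javascript
// Hypothetical helper; `doc` is the document in which the field is created.
// Assumes `name` contains no quote characters.
function createHiddenInput(doc, name, value) {
  var el = null;
  try {
    // Legacy IE accepts markup here, which fixes `name` at creation time.
    el = doc.createElement('<input type="hidden" name="' + name + '">');
  } catch (e) {
    el = null; // Browsers that reject markup throw instead.
  }
  if (!el || String(el.nodeName).toUpperCase() !== 'INPUT') {
    // Standard path: create the element, then set its properties.
    el = doc.createElement('input');
    el.type = 'hidden';
    el.name = name;
  }
  el.value = value;
  return el;
}
```

In a page it would be used as `form.appendChild(createHiddenInput(document, 'ie', 'utf-8'));`, where `form` is the form element being extended.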


