An idea and Implementation of Anti-crawling web page

Source: Internet
Author: User

 

The method that can be searched is nothing more than determining whether http_referer comes from this site and adds the site name and random string after line breaks and carriage returns on the webpage. These two methods are useless for the brothers who have a little understanding of the HTTP protocol and regular expression replacement. The following are some of the methods that I know and are commonly used on the Internet:

1. Determine http_referer
Http_referer is a segment in the HTTP header. Whether you use the Winsock/inet transfer control of VB/VC or the XMLHTTPRequest object (for Ajax) that can be used by the ASP server ), for PHP socket and Python urllib2, you can easily modify the HTTP header to directly modify the value of http_referer. The lazy way is to assign the http_referer value to your website root URL or the accessed page url. This problem does not even occur in the webbrowser control in VB ...... It's just a little too slow.

This method is only valid and may capture objects using ADODB. Stream.

2 Add a random string/site name/URL
For fixed strings, such as site names and URLs, many web page collection programs provide options to replace them, which is easy to implement. The collector is suspected of placing invisible elements and stacked keywords on the page, and may be considered cheated by search engines ~

It is not possible to replace a random string directly. However, if you use a regular expression properly in combination with the page source code, it can be cracked. For example, Match [w] {, 16}
To delete
Previous 16 random letters and digit strings

3. Add scripts on the webpage that can interrupt page running

Add a script like this to the webpage to block page execution and display. The guy who came up with this method must be bt ~ But it is a bit ridiculous. This method is useless if the page is not displayed after it is captured. I will not talk about the specific reasons. In addition, this is a bad aspect for the user experience of the website to be collected.

Bytes -------------------------------------------------------------------------------------------------

4. My methods
A collection program must be written in the previous few days. A list page with printed content is required to prevent collection. After thinking about it, I think there is only one way to disrupt the output order of the table, adopt random output, and use JS operations on the client to change the DOM order. In this way, the HTML tables output by the server are messy, but the tables can be normally displayed when the customer views and prints them. The disadvantage is that the correct order still needs to be output to the page. If the collector sorts the table content according to this rule, it can still be read.

The implementation method is as follows:

 

I = I + 1
Loop
%>

The client combination method is as follows:

 

Function reorder (sobj, aorder)
...{
VaR o = Document. getelementbyid (sobj). childnodes [0];
...{
Try ...{
VaR T = O. childnodes. length;
} Catch (e )...{
Alert (E. Description );
Return;
}
}
For (VAR I = 1; I <O. childnodes. length; I ++ )...{
VaR areorder = new array ();
VaR TR = O. childnodes [I];
For (var j = 0; tr. haschildnodes (); j ++ )...{
Areorder [aorder [I] [J] = tr. removechild (tr. childnodes [0]);
}
For (var j = 0; j <areorder. length; j ++ )...{
Tr. appendchild (areorder [J]);
}
}
}

 

Although this is not an all-in-one method, it can at least be difficult to find a large number of JS programmers ......

 

This article from the csdn blog, reproduced please indicate the source: http://blog.csdn.net/slawdan/archive/2007/04/18/1568910.aspx

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.