There are many ways to prevent content scraping (automated collection of a site's pages). This article first reviews the common anti-scraping methods, their drawbacks, and the countermeasures scrapers use against them:
First, count how many times an IP address requests pages on the site within a given period; if the rate is clearly higher than a normal person's browsing speed, deny that IP access
Disadvantages:
1. This method only works for dynamic pages, such as ASP, JSP, or PHP. For static pages, the site cannot count how many times a given IP has requested its pages.
2. This method seriously interferes with search engine indexing, because search engine spiders crawl quickly and with multiple threads. It will therefore also block spiders from indexing the site's files.
Scraper countermeasure: the scraper can only slow down its crawl rate, or give up scraping the site.
Recommendation: build a library of search engine spider IP addresses and allow only those spiders to browse the site at high speed. Building such a library is not easy either, since a search engine spider does not necessarily use a single fixed IP address.
Comments: This method is fairly effective against scraping, but it affects search engine indexing of the site.
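The per-IP counting described above can be sketched as a sliding-window counter. This is a minimal illustration rather than a production implementation: the class name `RateLimiter`, its parameters, and the in-memory storage are all assumptions, and the allowlist stands in for the spider IP library recommended above.

```python
import time
from collections import defaultdict, deque


class RateLimiter:
    """Sliding-window request counter keyed by client IP (illustrative only)."""

    def __init__(self, max_requests, window_seconds, allowlist=()):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.allowlist = set(allowlist)   # hypothetical search engine spider IPs
        self.hits = defaultdict(deque)    # ip -> timestamps of recent requests

    def allow(self, ip, now=None):
        """Return True if this request from `ip` should be served."""
        if ip in self.allowlist:
            return True                   # spiders may browse at full speed
        now = time.monotonic() if now is None else now
        recent = self.hits[ip]
        # Drop timestamps that have fallen out of the window.
        while recent and now - recent[0] >= self.window_seconds:
            recent.popleft()
        if len(recent) >= self.max_requests:
            return False                  # faster than the allowed rate
        recent.append(now)
        return True
```

A dynamic page would call `allow(client_ip)` before rendering and refuse the request when it returns False. The thresholds would need tuning against real visitor behaviour.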
Second, use JavaScript to encrypt the content pages
Disadvantages: This method works for static pages, but it seriously affects search engine indexing, because what the search engine receives is also the encrypted content.
Scraper countermeasure: scraping such a site is not recommended; if it must be scraped, the scraper also downloads the JS script that decrypts the content.
Suggestion: There are no good suggestions for improvement at present
Comments: Webmasters who rely on search engines for traffic are advised not to use this method.
Third, replace specific tags in content pages with "specific tags + hidden copyright text"
Disadvantages: This method has few drawbacks; it only increases the page file size slightly, but it is easy for scrapers to counter.
Scraper countermeasure: strip the hidden copyright text from the scraped content, or replace it with the scraper's own copyright notice.
Suggestion: There are no good suggestions for improvement at present
Comments: I feel this has little practical value; even adding randomized hidden text is superfluous.
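To make the weakness concrete, here is a small sketch of both sides of this method. The function names and the use of a `display:none` span are assumptions chosen for illustration; the original does not say how the text is hidden.

```python
import re
from html import escape


def add_hidden_copyright(html, notice, tag="p"):
    """Append a hidden copyright span after every closing `tag`."""
    hidden = '<span style="display:none">' + escape(notice) + "</span>"
    closing = "</" + tag + ">"
    return html.replace(closing, closing + hidden)


def strip_hidden_copyright(html):
    """What a scraper might do: remove every display:none span."""
    pattern = r'<span style="display:none">.*?</span>'
    return re.sub(pattern, "", html, flags=re.S)
```

A single regular expression is enough to undo the marking in this sketch, which matches the comment above that the method is easy to counter.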
Fourth, allow content to be viewed only after the user logs in
Disadvantages: This method seriously affects search engine spiders' indexing of the site
Scraper countermeasure: a countermeasure article has already been published; for details, see "How an ASP thief program uses XMLHTTP to submit forms and send cookies or sessions"
Suggestion: There are no good suggestions for improvement at present
Comments: Webmasters who rely on search engines for traffic are advised not to use this method. Still, it has some effect against ordinary scraping programs.
Fifth, use JavaScript or VBScript to do pagination
Disadvantages: Affects search engine indexing of the site
Scraper countermeasure: analyze the JavaScript or VBScript, work out its pagination rules, and build a pagination scraping page for that site.
Suggestion: There are no good suggestions for improvement at present
Comments: Anyone who knows the scripting language can work out the pagination rules
Sixth, allow pages to be viewed only through links from within the site, for example by checking Request.ServerVariables("Http_referer")
Disadvantages: Affects search engine indexing of the site
Scraper countermeasure: I do not know whether the referring page can be simulated. At present I have no corresponding scraping countermeasure for this method.
Suggestion: There are no good suggestions for improvement at present
Comments: Webmasters who rely on search engines for traffic are advised not to use this method. Still, it has some effect against ordinary scraping programs.
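A referrer check of this kind might look like the following sketch. The function name `is_internal_referer` and the host names in the usage note are hypothetical. Because the Referer header is supplied by the client, a scraper that sets the header itself would pass this check, so treat it as a hurdle for simple tools only.

```python
from urllib.parse import urlsplit


def is_internal_referer(referer, site_host):
    """Return True only if the Referer header points at `site_host`."""
    if not referer:
        return False          # direct visits and many spiders send no Referer
    try:
        host = urlsplit(referer).hostname
    except ValueError:
        return False          # malformed header value
    return host is not None and host.lower() == site_host.lower()
```

For example, `is_internal_referer("http://www.example.com/list.htm", "www.example.com")` accepts the request, while an empty header or a referrer from another host is rejected.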
As the above shows, the commonly used anti-scraping methods either have a large impact on search engine indexing or do little to prevent scraping. So is there a method that prevents scraping effectively without affecting search engine indexing? Read on!
From the scraping principles described earlier, you can see that most scraping programs work by analyzing rules, such as the naming rules of pagination files and the rules of the page code.
First, countering scraping through pagination file name rules
Most scrapers rely on analyzing pagination file name rules to scrape many pages in bulk. If others cannot work out the naming rule of your pagination files, they cannot scrape many pages of your site in bulk.
Implementation method:
I think encrypting the pagination file name with MD5 is a good approach. Some people will object that if you encrypt the pagination file name with MD5, others can follow the same rule, reproduce your encryption, and obtain your pagination file names.
My point is that when encrypting the pagination file name, you should not encrypt only the part of the file name that changes.
If i represents the page number, then do not encrypt like this: Page_name=md5(i,16) & ".htm"
It is best to append one or more characters to the page number before encrypting, such as: Page_name=md5(i & "any one or several letters",16) & ".htm"
MD5 is a one-way hash and cannot be decrypted, and all that others see in the page names is the MD5 result. They therefore cannot know which letters you appended after i, unless they brute-force the MD5, which is not very realistic.
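The same idea can be sketched in Python with the standard `hashlib` module. The suffix value `"qzx"` and the helper names are hypothetical, and taking the middle 16 hex characters is an assumption about what the two-argument `md5(x, 16)` helper in the ASP example returns.

```python
import hashlib

SECRET_SUFFIX = "qzx"  # hypothetical secret letters; never publish these


def md5_16(text):
    """Middle 16 hex characters of the MD5 digest (assumed meaning of md5(x, 16))."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()[8:24]


def page_name(i, suffix=SECRET_SUFFIX):
    """File name for page number `i`, hashed together with the secret suffix."""
    return md5_16(str(i) + suffix) + ".htm"
```

The site generates its own links with `page_name(i)`, so visitors and spiders can still follow them, but a scraper that hashes only the page number will not reproduce the names without knowing the suffix.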
Second, countering scraping through page code rules
If our content pages follow no code rules, others cannot extract the content they need from the code. So to prevent scraping at this step, we must make the code follow no regular pattern.
Implementation method:
Randomize the tags that the scraper needs in order to extract the content
1. Create multiple page templates in which the important HTML tags differ, and pick a template at random when rendering the page content. Some pages use a CSS+div layout and others a table layout. This method is somewhat laborious, since one content page needs several template pages. But scraping is itself a tedious job, and if making one more template helps prevent it, for many people that is worth the effort.
2. If the above method is too much trouble, simply randomizing the important HTML tags in the page is also enough.
The more page templates you make and the more random the HTML code, the more trouble the other side has analyzing the content code, and the harder it is to write a scraping strategy for your site. At that point most people will back off, because these people are lazy, which is why they scrape other people's website data in the first place. Moreover, most people currently use scraping programs developed by others; those who develop their own scraping programs are, after all, a minority.
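Both variants can be combined in a short sketch: pick one of several templates at random and give the wrapper a random class name. The three templates and the function names are made up for illustration; a real site would use its own layouts.

```python
import random
from html import escape

# Hypothetical templates: the same content wrapped in different tags.
TEMPLATES = [
    '<div class="{cls}"><h1>{title}</h1><div>{body}</div></div>',
    '<table class="{cls}"><tr><th>{title}</th></tr><tr><td>{body}</td></tr></table>',
    '<section id="{cls}"><h2>{title}</h2><p>{body}</p></section>',
]


def random_class(rng, length=8):
    """Random lowercase name, so class and id values carry no stable pattern."""
    return "".join(rng.choice("abcdefghijklmnopqrstuvwxyz") for _ in range(length))


def render(title, body, rng=random):
    """Render the content with a randomly chosen template and class name."""
    template = rng.choice(TEMPLATES)
    return template.format(cls=random_class(rng), title=escape(title), body=escape(body))
```

Each call to `render` may produce different markup around the same text, so an extraction rule written against one response does not necessarily match the next.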
Here are a few more simple ideas:
1. Use client-side script to display content that is important to scrapers but unimportant to search engines.
2. Split one page of data across N pages, which also makes scraping harder.
3. Use deeper link levels. Most current scraping programs can only scrape the first 3 levels of a site's content, so content that sits at a deeper link level can also avoid being scraped. However, this may inconvenience visitors. For example:
Most sites are structured as: Home----Content index pagination----content page
If changed to:
Home----Content index pagination----content page entry----content page
Note: It is best to add code to the content page entry that automatically redirects to the content page
<meta http-equiv="Refresh" content="6;url=content page (http://www.oureve.net)">
In fact, completing just the first step (encrypting the pagination file name rule) already works well against scraping. Still, I recommend using both methods together to make scraping harder, so that scrapers see the difficulty and give up.