I have written collectors myself, so I have some experience with protecting a site against collection. Since I'm writing this during work hours, each method is only briefly described.
Many anti-collection methods have to weigh their impact on search engine crawling of the site, so let's first analyze what an ordinary collector and a search engine crawler have in common and where they differ.
Similarities:
A. Both need to fetch the page's source code directly in order to work effectively;
B. Both crawl a large amount of site content many times per unit of time;
C. Viewed broadly, both will change their IP addresses;
D. Both are impatient with any encryption (verification) on your pages, such as content encrypted through JS files, content that requires a verification code to read, or content that requires login to access.
Differences:
A search engine crawler first discards all of the page's scripts, styles, and HTML tag code, then runs word segmentation, grammar analysis, and a series of other complex processes on the remaining text. A collector, by contrast, usually grabs the data it needs based on features of the HTML tags: when writing a collection rule, you specify start and end markers for the target content in order to locate it, or you write regular expressions tailored to specific pages to filter it out. Whether it uses start/end markers or regular expressions, the approach hinges on the HTML tags (analysis of the page structure).
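To make the difference concrete, here is a minimal sketch of how a collector might extract content with a regular expression. The markup and the "content" class are hypothetical examples, not taken from any particular site.

```php
<?php
// Sketch of a collector locating content by HTML structure (hypothetical markup).
// The "rule" here is: content starts after <div class="content"> and ends at </div>.
$html = '<html><body><div class="content"><p>Target text.</p></div></body></html>';
if (preg_match('/<div class="content">(.*?)<\/div>/s', $html, $m)) {
    echo strip_tags($m[1]);   // prints: Target text.
}
```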
With that in mind, here are some methods to prevent collection.
1. Limit the number of visits per IP address per unit of time
Analysis: No ordinary person can visit the same site five times in one second; only a program accesses pages that way, and the programs with this habit are search engine crawlers and annoying collectors.
Disadvantages: this is a blunt instrument that also stops search engines from indexing the site.
Applicable sites: sites that do not rely much on search engines.
What the collector will do: reduce the number of visits per unit of time, at the cost of collection efficiency.
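As a rough illustration, here is a minimal per-IP rate limiter; the 60-second window, the limit of 30 requests, and the file-based counter are all assumptions for the sketch (a production site would sooner use memcached or Redis).

```php
<?php
// Minimal sketch of per-IP rate limiting (hypothetical values, file-based counter).
$ip     = $_SERVER['REMOTE_ADDR'];
$window = 60;   // seconds per counting window (assumed)
$limit  = 30;   // max requests per window (assumed)
$file   = sys_get_temp_dir() . '/rate_' . md5($ip);

$data = ['start' => time(), 'count' => 0];
if (is_file($file)) {
    $saved = json_decode(file_get_contents($file), true);
    if ($saved && time() - $saved['start'] < $window) {
        $data = $saved;   // still inside the current window
    }
}
$data['count']++;
file_put_contents($file, json_encode($data));

if ($data['count'] > $limit) {
    header('HTTP/1.1 429 Too Many Requests');
    exit('Too many requests, slow down.');
}
```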
2. Blocking IPs
Analysis: use a back-end counter to record each visitor's IP and visit frequency, analyze the access records by hand, and block suspicious IPs.
Disadvantages: none to speak of; it just keeps the webmaster busy.
Applicable sites: all sites, provided the webmaster can tell which visitors are Google's or Baidu's robots.
What the collector will do: fight guerrilla warfare! Rotate through IP proxies, switching after each batch of requests; this lowers the collector's efficiency and speed, though (doing proxies well takes effort).
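A sketch of the back-end counter plus a manual blocklist might look like this; the log and blocklist paths are hypothetical.

```php
<?php
// Log one line per request; the webmaster reviews the log and blocks IPs by hand.
$line = sprintf("%s\t%s\t%s\n",
    date('Y-m-d H:i:s'),
    $_SERVER['REMOTE_ADDR'],
    $_SERVER['HTTP_USER_AGENT'] ?? '-');
file_put_contents('/var/log/site/visits.log', $line, FILE_APPEND | LOCK_EX);

// Manually maintained blocklist, one IP per line (hypothetical path).
$blockfile = '/var/log/site/blocked.txt';
$blocked = is_file($blockfile)
    ? file($blockfile, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES)
    : [];
if (in_array($_SERVER['REMOTE_ADDR'], $blocked, true)) {
    header('HTTP/1.1 403 Forbidden');
    exit;
}
```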
3. Encrypting page content with JS
Note: I have not used this method myself; I have only read about it elsewhere.
Analysis: no analysis needed; it kills search engine crawlers and collectors alike.
Applicable sites: sites that hate search engines and collectors equally.
What the collector will do: if you're that tough, he'll simply give up and go; he won't take your content.
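Since the author has not used this method, here is only a hypothetical illustration of the idea: the server ships the text Base64-encoded, and a small inline script decodes it in the browser, so any crawler that does not execute JS sees just an opaque blob.

```php
<?php
// Hypothetical sketch: serve the body Base64-encoded and decode it client-side.
$content = 'The article body goes here.';
$encoded = base64_encode($content);
?>
<div id="body"></div>
<script>
  // atob() undoes Base64; decodeURIComponent/escape recovers UTF-8 text.
  document.getElementById('body').innerHTML =
      decodeURIComponent(escape(atob('<?php echo $encoded; ?>')));
</script>
```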
4. Hiding a copyright notice or random junk text in the page, with the text's style defined in a CSS file
Analysis: this cannot prevent collection, but it ensures the collected copy of your content is littered with your copyright notice or junk text. Because a typical collector does not fetch your CSS files along with the page, that text loses its hiding style and shows up in the copy.
Applicable sites: all sites
What the collector will do: for the copyright text, easy enough, just replace it. For random junk words there is no general fix; he'll have to clean them out diligently by hand.
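Here is a sketch of the idea; the sprinkle_junk helper, the cp class name, and the sample HTML are hypothetical.

```php
<?php
// Sprinkle hidden junk/copyright text into the article body. The .cp class lives
// in an external CSS file (.cp { display: none; }), so a collector that grabs only
// the HTML keeps the junk text visible in its copy.
function sprinkle_junk(string $html): string {
    $junk  = '<span class="cp">Content copyright example.com, do not repost.</span>';
    $paras = explode('</p>', $html);
    foreach ($paras as $i => $p) {
        if ($p !== '' && mt_rand(0, 2) === 0) {   // roughly 1 in 3 paragraphs
            $paras[$i] = $p . $junk;
        }
    }
    return implode('</p>', $paras);
}

$articleHtml = '<p>First paragraph.</p><p>Second paragraph.</p>';
echo sprinkle_junk($articleHtml);
```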
5. Requiring user login to access site content
Analysis: search engine crawlers will not implement a login procedure for every site of this kind. I hear, though, that a collector can be built to simulate a user logging in and submitting the form on a given site.
Applicable sites: sites that strongly dislike search engines and want to block most collectors
What the collector will do: build a module that simulates the user's login form submission
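On the site side, the gate can be as simple as a session check at the top of every protected page; the session key and login URL below are assumptions.

```php
<?php
// Minimal login gate (sketch): only visitors with a logged-in session see content.
session_start();
if (empty($_SESSION['user_id'])) {          // 'user_id' is an assumed session key
    header('Location: /login.php');         // hypothetical login page
    exit;
}
// ... render the protected content below ...
```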
6. Using a scripting language to paginate (hiding the pagination)
Analysis: as before, search engine crawlers will not analyze every site's hidden pagination, which hurts the site's indexing. A collector writing a collection rule, however, has to analyze the target page's code anyway, and anyone who knows a bit of scripting will find the real link addresses behind the pagination.
Applicable sites: sites that do not depend heavily on search engines, and whose would-be collectors do not know scripting
What the collector will do: he has to analyze your page code anyway, so analyzing your pagination script along the way costs him little extra time.
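One way to hide the pagination, sketched below with hypothetical URLs: the real page address never appears in the markup and is assembled by a script at click time.

```php
<?php
// Pagination whose real links exist only in script (hypothetical article URL).
$totalPages = 5;   // assumed value for illustration
for ($i = 1; $i <= $totalPages; $i++) {
    echo '<a href="javascript:void(0)" onclick="go(' . $i . ')">' . $i . '</a> ';
}
?>
<script>
  function go(n) {
      // The real link is built here, not in the markup the collector parses.
      location.href = '/article.php?id=42&page=' + n;
  }
</script>
```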
7. Anti-leeching measures (content may only be viewed by following links within the site, e.g. checking Request.ServerVariables("Http_referer"))
Analysis: both ASP and PHP can read the request's Http_referer property and determine whether the request came from within the site, which limits collectors; but it also limits search engine crawlers and seriously hurts the indexing of whatever content sits behind the referer check.
Applicable sites: sites that do not care much about search engine indexing
What the collector will do: forge the Http_referer header; not difficult.
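The PHP counterpart of the ASP check named above would look roughly like this; the example.com domain is a placeholder. As noted, the referer is client-supplied and trivially forged.

```php
<?php
// Referer check (sketch): reject requests that did not come from within the site.
$referer = $_SERVER['HTTP_REFERER'] ?? '';
if (stripos($referer, 'example.com') === false) {   // placeholder for your own domain
    header('HTTP/1.1 403 Forbidden');
    exit('Please browse this content from within the site.');
}
```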
8. Rendering site content entirely in Flash, images, or PDFs
Analysis: search engine crawlers and collectors both handle these formats poorly, as anyone who knows a little SEO is aware.
Applicable sites: media and design sites, and sites that do not care about search engine indexing
What the collector will do: nothing; he leaves.
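For the image variant, a page can render a sentence as a PNG on the fly, as in this sketch (requires PHP's GD extension); crawlers and collectors get pixels instead of indexable text.

```php
<?php
// Serve a line of text as a PNG instead of markup (GD extension required).
header('Content-Type: image/png');
$img = imagecreatetruecolor(480, 40);
$bg  = imagecolorallocate($img, 255, 255, 255);
$ink = imagecolorallocate($img, 0, 0, 0);
imagefilledrectangle($img, 0, 0, 479, 39, $bg);
imagestring($img, 4, 10, 12, 'This sentence is an image, not text.', $ink);
imagepng($img);
imagedestroy($img);
```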
9. Randomly switching the site among different template versions
Analysis: since the collector locates the content it needs by the page structure, two template switches in a row will invalidate its collection rules; nice. And this has no effect on search engine crawlers.
Applicable sites: dynamic sites that are not concerned about the user experience.
What the collector will do: a site can hardly have more than 10 templates, so just write one rule per template and use a different collection rule for each. If there really are more than 10 templates, then since the target site goes to such lengths swapping templates, oblige it and withdraw.
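Template switching can be as simple as picking a template at random per request, as in this sketch; the template file names are hypothetical.

```php
<?php
// Pick one of several templates at random per request, so a collection rule
// written against one layout breaks on the next fetch (hypothetical file names).
$templates = ['tpl_a.php', 'tpl_b.php', 'tpl_c.php'];
$chosen = $templates[array_rand($templates)];
include __DIR__ . '/templates/' . $chosen;   // each template renders the same data
```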
10. Using dynamic, irregular HTML tags
Analysis: this one is rather perverse. An HTML tag with extra spaces inside it renders the same as one without, so <div> and <div  > are identical as far as page display goes, yet to a collector they are two different tags. If the number of spaces inside the HTML tags of each page is random, the collection rules fail. This has little impact on search engine crawlers, however.
Applicable sites: all dynamic sites willing to disregard web design standards.
What the collector will do: there are still countermeasures. HTML cleaners are plentiful nowadays; run one over the page to normalize the HTML tags first, then apply the collection rules, and the required data can still be extracted.
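The server side of this trick can be a filter run over the output, as in the sketch below; the randomize_tags helper and the sample markup are hypothetical.

```php
<?php
// Inject a random number of spaces inside opening tags before output. "<div>" and
// "<div   >" render identically but defeat rules that match exact tag strings.
function randomize_tags(string $html): string {
    return preg_replace_callback(
        '/<([a-zA-Z][a-zA-Z0-9]*)([^>]*)>/',
        function ($m) {
            $pad = str_repeat(' ', mt_rand(0, 3));   // 0-3 extra spaces
            return '<' . $m[1] . $m[2] . $pad . '>';
        },
        $html
    );
}

echo randomize_tags('<div class="post"><p>Hello</p></div>');
```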
Summary:
Trying to welcome search engine crawlers while fending off collectors is a frustrating business, because a search engine's first step is itself to collect the page content, on the same principle as a collector; so many anti-collection methods also hinder the search engine's indexing of the site. Helpless, right? The ten suggestions above cannot prevent collection completely, but several of them combined will already turn away a large share of collectors.