10 anti-collection (anti-scraping) suggestions

Source: Internet
Author: User

I write collectors (scrapers) myself, so I have some experience with website anti-collection. Since I am writing this during work hours, the various methods below are only a quick reference.
When implementing any anti-collection method, you need to consider whether it will also affect search engines' crawling of your website. So let's first analyze the similarities and differences between a typical collector and a search engine crawler.

Similarities:

A. Both need to grab the webpage's source code directly in order to work effectively;
B. Both crawl large amounts of the site's content many times within a unit of time;

C. From a macro perspective, both change IP addresses;

D. Both are impatient with whatever encryption (verification) you put on the webpage: for example, content encoded through JS files, a verification code required to view the content, or a login required to access the content.

Differences:

A search engine crawler ignores the page's scripts, styles, and HTML tag code entirely and performs syntactic analysis on the remaining text. A collector, by contrast, usually grabs the required data based on the characteristics of HTML tags: when creating a collection rule, you fill in start and end markers for the target content in order to locate it, or you write a regular expression tailored to a specific page to filter out the desired content. Both the start/end markers and the regular expression involve HTML tags (web page structure analysis).

With that in mind, let's propose some anti-collection methods.

1. Limit the number of visits per IP address per unit of time

Analysis: no ordinary person can visit the same website five times within one second unless it is a program doing the visiting, and programs with that kind of appetite come down to two candidates: search engine crawlers and annoying collectors.

Disadvantages: a one-size-fits-all rule that also hinders search engines' indexing of the website.

Applicable websites: websites that do not rely heavily on search engines

What the collector will do: reduce its number of visits per unit of time, which lowers its collection efficiency.
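
A minimal sketch of the server side of this method, in Python: a sliding-window counter per IP. The window length, hit limit, and names here are my own assumptions for illustration, not part of the original suggestion.

    import time
    from collections import defaultdict, deque

    WINDOW_SECONDS = 1.0   # the "unit time" from the analysis above
    MAX_HITS = 5           # no ordinary visitor exceeds this by hand

    _hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def allow_request(ip: str) -> bool:
        """Return False once an IP exceeds MAX_HITS within WINDOW_SECONDS."""
        now = time.monotonic()
        window = _hits[ip]
        while window and now - window[0] > WINDOW_SECONDS:
            window.popleft()  # forget requests that left the window
        if len(window) >= MAX_HITS:
            return False
        window.append(now)
        return True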

2. Block IP addresses

Analysis: use a background counter to record each visitor's IP address and access frequency, analyze the access records manually, and block suspicious IP addresses.

Disadvantages: it keeps the webmaster rather busy.

Applicable websites: all websites, provided the webmaster can tell which robots belong to Google or Baidu

What will the collector do: fight a guerrilla war! It collects through one IP proxy after another, though using proxies reduces the collector's efficiency and network speed.
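
As a sketch of the "background counter" idea, here is a small Python pass over a web server access log; the log path, log format, and threshold are assumptions, not from the article:

    import re
    from collections import Counter

    LOG_LINE = re.compile(r"^(\S+) ")  # client IP is the first field in common log format

    def suspicious_ips(log_path: str, threshold: int = 1000):
        """Return (ip, hits) pairs whose request count reaches the threshold."""
        counts = Counter()
        with open(log_path) as f:
            for line in f:
                m = LOG_LINE.match(line)
                if m:
                    counts[m.group(1)] += 1
        return [(ip, n) for ip, n in counts.most_common() if n >= threshold]

    # Review the output by hand before blocking: a busy IP may be Googlebot
    # or Baiduspider, which a reverse DNS lookup can confirm.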

3. Encrypt webpage content with JS

Note: I have never used this method myself; I have only seen it used elsewhere.

Analysis: no analysis needed; it kills search engine crawlers and collectors alike.

Applicable websites: websites that hate search engines and collectors

What the collector will do: if you are that ruthless, it will simply not bother with you.
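
For what it's worth, one common form of this trick simply escapes the page body and writes it back with document.write(unescape(...)). A hypothetical Python sketch of undoing that particular form, assuming the payload uses %XX escapes:

    import re
    from urllib.parse import unquote

    # Assumed payload shape: document.write(unescape("%3Cp%3E...%3C%2Fp%3E"));
    # JS escape() emits %XX (and %uXXXX) sequences; unquote reverses the %XX case.
    PAYLOAD = re.compile(r'document\.write\(unescape\("([^"]+)"\)\)')

    def decode_page(source: str) -> str:
        m = PAYLOAD.search(source)
        return unquote(m.group(1)) if m else source

    print(decode_page('document.write(unescape("%3Cp%3EHello%3C%2Fp%3E"))'))
    # -> <p>Hello</p>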

4. Hide website copyright notices or random junk text in the webpage, with the text's styles kept in CSS files

Analysis: although this cannot prevent collection, the collected content will be littered with your site's copyright notices or junk text, because the collector does not fetch your CSS files along with the page, so those texts are displayed without the styles that hide them.

Applicable websites: All websites

What will the collector do: copyright text is easy to handle, it just replaces it; for random junk text there is no general fix, so it has to clean up diligently.
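
As a sketch of the site side, here is one way to splice hidden junk into the article HTML; the class name, junk strings, and paragraph-based splicing are all assumptions for illustration:

    import random
    import re

    # An external CSS file must hide the class, e.g.  .cpr { display: none; }
    JUNK = ["Copyright example.com", "Do not reprint", "x9f3k"]

    def poison(article_html: str) -> str:
        """Insert a randomly chosen invisible junk span after every paragraph."""
        return re.sub(
            r"</p>",
            lambda m: '</p><span class="cpr">%s</span>' % random.choice(JUNK),
            article_html,
        )

    print(poison("<p>first</p><p>second</p>"))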

5. Require users to log in to access website content

Analysis: search engine crawlers will not build a login program for every website of this kind. I have heard that a collector can be given a module that simulates a user logging in and submitting the form for a given website.

Applicable websites: websites that hate search engines and want to block most collectors

What will the collector do: build a module that simulates the user-login form submission for that site.
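
For illustration, a minimal sketch of such a module using Python's requests library; the URLs, form field names, and credentials are placeholders that must be read off the target site's actual login form:

    import requests

    session = requests.Session()
    session.post(
        "https://example.com/login",
        data={"username": "user", "password": "secret"},
    )
    # The session keeps the login cookies, so member-only pages open normally.
    page = session.get("https://example.com/members/article/123")
    print(page.status_code)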

6. Paginate via scripting (hide the real page links)

Analysis: as said above, search engine crawlers will not analyze the hidden paging of every website, which hurts search engine indexing of the site. When writing collection rules, however, the collector has to analyze the target page's code anyway, and with some knowledge of scripts it can work out the real link address of each page.

Applicable websites: websites that do not rely heavily on search engines, and whose would-be collectors do not know scripting

What will the collector do: exactly what it would do anyway; it has to analyze your page code regardless, and analyzing your paging script on the way does not take much extra time.
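
As a sketch of what that analysis looks like, assuming a hypothetical page whose paging links call a script function instead of carrying real URLs:

    import re

    # Assumed markup:  <a href="javascript:gotoPage(2)">next</a>
    # with gotoPage(n) loading /list.php?page=n; both patterns are invented here.
    CALL = re.compile(r"javascript:gotoPage\((\d+)\)")

    def real_page_urls(html: str, base: str = "https://example.com/list.php?page="):
        """Recover the real paging URLs that the script call hides."""
        return [base + n for n in CALL.findall(html)]

    html = '<a href="javascript:gotoPage(2)">2</a><a href="javascript:gotoPage(3)">3</a>'
    print(real_page_urls(html))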

7. Anti-leech measures (content can only be reached through links on this site, checked for example via Request.ServerVariables("HTTP_REFERER"))

Analysis: ASP and PHP can read the request's HTTP_REFERER to judge whether it came from this website. This restricts the collector, but it also restricts search engine crawlers, seriously affecting the search engine's indexing of the anti-leeched content.

Applicable websites: websites that do not care much about search engine indexing

What will the collector do: faking HTTP_REFERER is not hard.
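
A sketch of the faking, using Python's requests; the URLs are placeholders:

    import requests

    # Presenting an in-site Referer defeats a check that only compares
    # HTTP_REFERER against the site's own domain.
    resp = requests.get(
        "https://example.com/article/123",
        headers={"Referer": "https://example.com/"},  # look like an on-site click
    )
    print(resp.status_code)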

8. Display the website's content entirely as Flash, images, or PDF

Analysis: search engine crawlers and collectors both handle these formats poorly; anyone familiar with SEO knows this.

Applicable websites: media and design websites that do not care about search engine indexing

What will the collector do: nothing; it leaves.

9. Have the website use different templates at random

Analysis: the collector locates the required content based on the webpage structure, so if the template changes between two fetches, the collection rule becomes invalid; that is the point. Better still, this does not affect search engine crawlers.

Applicable websites: dynamic websites that are not concerned about a consistent user experience.

What will the collector do: a website rarely has more than ten templates, so it simply writes one rule per template and applies a different collection rule to each template. If there really are more than ten, then since the target site went to such trouble to rotate templates, it gives up and withdraws.
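
A sketch of the "one rule per template" countermeasure; the template markers below are invented examples:

    import re

    RULES = {
        "tpl_a": re.compile(r'<div id="content">(.*?)</div>', re.S),
        "tpl_b": re.compile(r'<td class="main">(.*?)</td>', re.S),
    }

    def extract(html: str):
        """Try each template's rule until one matches."""
        for name, rule in RULES.items():
            m = rule.search(html)
            if m:
                return m.group(1)
        return None  # unknown template: note it and write one more rule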

10. Use dynamic, irregular HTML tags

Analysis: this one is rather extreme. An HTML tag may or may not contain extra spaces: <div > and <div> display identically on the page, but to the collector they are different markers. If the number of spaces inside the page's HTML tags is randomized on every request, the collection rule becomes invalid. However, this has little impact on search engine crawlers.

Applicable websites: all dynamic websites that are willing to ignore webpage design standards.

What will the collector do: there are still countermeasures. HTML cleaners are plentiful: clean the HTML tags first, then write the collection rules; apply the same cleaning before running the rules, and the required data can still be extracted.
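
A sketch of that cleaning step in Python, assuming the randomness is extra whitespace inside tags as described above:

    import re

    def clean_tags(html: str) -> str:
        """Collapse the random whitespace inside every tag to a canonical form."""
        return re.sub(
            r"<([^>]*?)\s*>",
            lambda m: "<" + re.sub(r"\s+", " ", m.group(1)).strip() + ">",
            html,
        )

    rule = re.compile(r'<div class="body">(.*?)</div>', re.S)
    html = '<div   class="body" >hello</div   >'
    print(rule.search(clean_tags(html)).group(1))  # -> hello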

Summary:

When you need search engines and want to block collectors at the same time, it is a frustrating position, because the first step of a search engine crawler is to collect the target webpage's content, which is exactly the collector's principle; many anti-collection methods therefore also impede the search engine's indexing of the website. What can you do? Although the ten suggestions above cannot completely prevent collection, applying several of them in combination will already turn away the majority of collectors.
