Theoretical Analysis and Ten Countermeasures Against Website Content Collection (Part 1 of 2)


Similarities between search engine crawlers and content collectors:
A. Both must fetch a page's source code directly in order to work;
B. Both request content from a site many times within a short period;
C. Seen over time, both show up under changing IP addresses;
D. Both are usually too impatient to crack any encryption or verification you put on a page: for example, content encrypted through a JS file, a verification code required before viewing, or a login required for access.

Differences:
A search engine crawler discards the scripts, styles, and HTML tags of the whole page and then performs syntactic analysis on the remaining text. A collector, by contrast, locates the data it wants through the HTML tags themselves: when writing a collection rule, the collector fills in the start and end marks that bracket the target content, or writes a regular expression tailored to the specific page to filter out the desired text. Both approaches, marks and regular expressions alike, depend on the HTML tags, i.e. on analyzing the page structure. A sketch of both extraction styles follows.
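Here is a minimal sketch of both rule styles in Python; the page fragment, the start/end marks, and the regular expression are all invented for illustration:

import re

# Hypothetical page fragment a collector might target.
html = '<div class="post"><h1>Title</h1><div id="content">Article body here.</div></div>'

# Rule style 1: start/end marks, filled in by hand after inspecting the page.
start_mark = '<div id="content">'
end_mark = '</div>'
start = html.find(start_mark) + len(start_mark)
end = html.find(end_mark, start)
print(html[start:end])            # -> Article body here.

# Rule style 2: a regular expression written for this specific page.
match = re.search(r'<div id="content">(.*?)</div>', html, re.S)
if match:
    print(match.group(1))         # -> Article body here.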

With that in mind, here are some anti-collection methods.
1. Restrict the number of visits per IP address per unit time
Analysis: no ordinary person can access the same website five times in one second; only a program behaves that way, and a visitor with that access pattern is either a search engine crawler or an unwelcome collector (a rate-limiting sketch follows this item).

Disadvantages: a one-size-fits-all measure that also prevents search engines from indexing the site.

Applicable websites: websites that do not rely heavily on search engines

What the collector will do: reduce the number of visits per unit time, which also reduces collection efficiency.
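A minimal sketch of such a limit, assuming you can hook every request before it is served (the window and threshold values are only examples):

import time
from collections import defaultdict, deque

WINDOW_SECONDS = 1.0   # the "unit time"
MAX_HITS = 5           # no ordinary person exceeds this in one second

_hits = defaultdict(deque)   # ip -> timestamps of recent requests

def allow_request(ip):
    now = time.time()
    q = _hits[ip]
    # Drop timestamps that have slid out of the window.
    while q and now - q[0] > WINDOW_SECONDS:
        q.popleft()
    if len(q) >= MAX_HITS:
        return False   # over the limit: reject, delay, or show a CAPTCHA
    q.append(now)
    return True

# A burst of six requests from one address: the sixth is refused.
for i in range(6):
    print(i + 1, allow_request("203.0.113.7"))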

2. Block IP addresses
Analysis: use a backend counter to record each visitor's IP address and access frequency, then review the access records manually and block suspicious IPs (a log-tallying sketch follows this item).

Disadvantages: labor-intensive; the webmaster has to keep reviewing logs.

Applicable websites: all websites, provided the webmaster can recognize which robots belong to Google or Baidu.

What the collector will do: fight guerrilla warfare! Collect through rotating IP proxies, changing address from one run to the next; using proxies, however, reduces the collector's efficiency and network speed.
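A minimal sketch of the tallying step, assuming a common access-log layout where the client IP is the first field (the file name and threshold are made up):

from collections import Counter

SUSPICIOUS = 1000   # hypothetical request count worth a manual look

counts = Counter()
with open("access.log") as log:          # hypothetical log file
    for line in log:
        fields = line.split()
        if fields:
            counts[fields[0]] += 1       # first field: client IP

for ip, n in counts.most_common(20):
    mark = "  <- review, consider blocking" if n > SUSPICIOUS else ""
    print(f"{ip:15} {n}{mark}")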

3. Use JS to encrypt webpage content
Note: I have never used this method myself; I have only seen it on other sites.
Analysis: little analysis is needed; it kills search engine crawlers and collectors alike (an encoding sketch follows this item).

Applicable websites: websites that hate search engines and collectors

What the collector will do: if you are that hostile, they will simply pass you by.
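Since the author has not used this method, here is only a rough sketch of the idea: the server ships the page body encoded (Base64 stands in here for whatever real scheme a site might use), and an inline script decodes it in the browser, so a collector that only reads raw HTML sees gibberish:

import base64

content = "The article text a collector would want."
encoded = base64.b64encode(content.encode("utf-8")).decode("ascii")

page = f"""<div id="body"></div>
<script>
  // Without running this script, a raw-HTML fetch sees only the encoded blob.
  document.getElementById('body').textContent = atob('{encoded}');
</script>"""
print(page)

Because a crawler that does not execute JS sees nothing readable either, search engines are shut out along with collectors, which is exactly the trade-off this method makes.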

4. Hide the website's copyright notice, or random spam text, inside webpages, with the styles for this text defined in a CSS file
Analysis: this does not prevent collection, but the collected content will be littered with your site's copyright notice or spam text. Because the collector does not fetch your CSS file along with the HTML, that text loses its hiding style and is displayed (a seeding sketch follows this item).

Applicable websites: All websites

What the collector will do: copyright text in a fixed form is easy to find and replace; there is no such shortcut for random spam text, which has to be cleaned up by hand.
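A minimal sketch of the seeding step; the class name, noise strings, and insertion odds are invented:

import random

NOISE = ["Content copyright example.com", "Originally published at example.com"]

def sprinkle(paragraphs):
    out = []
    for p in paragraphs:
        out.append(f"<p>{p}</p>")
        if random.random() < 0.5:   # insert noise at random positions
            out.append(f'<span class="wm">{random.choice(NOISE)}</span>')
    return "\n".join(out)

print(sprinkle(["First paragraph.", "Second paragraph."]))
# In the site's external style.css (which collectors skip):  .wm { display: none; }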

5. Require users to log in to access website content
Analysis: search engine crawlers will not implement a login routine for every such site. Collectors, however, can reportedly build a module that simulates a user submitting the login form for a given website (a session-gate sketch follows this item).

Applicable websites: websites that hate search engines and want to block most collectors

What the collector will do: build a module that simulates the user-login form submission.
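A minimal sketch of the gate itself, with framework details omitted; the session store and IDs are placeholders:

SESSIONS = {"abc123": "alice"}   # session_id -> user, filled in at login

def handle_request(session_id):
    if SESSIONS.get(session_id) is None:
        return "302 -> /login"    # anonymous visitors, including most crawlers, bounce
    return "200 OK: protected article body"

print(handle_request("unknown"))  # no valid cookie -> redirected to login
print(handle_request("abc123"))   # logged-in user -> content served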

6. Paginate content via scripts (hiding the real page links)
Analysis: as above, search engine crawlers will not analyze every site's hidden pagination, which hurts indexing. A collector writing collection rules, however, must analyze the target page's code anyway, and with some knowledge of scripting can work out the real link address of each page (a pagination sketch follows this item).

Applicable websites: websites that are not highly reliant on search engines, and whose would-be collectors do not know scripting.

What the collector will do: as just said, the collector analyzes your page code in any case, and picking apart your pagination script does not take much extra time.
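A rough sketch of the idea: the pager below carries no plain <a href> links in the static HTML; the links only exist after the script runs, so a crawler that does not execute JS never sees the page addresses (the URL pattern is invented):

page_html = """<div id="pager"></div>
<script>
  // Page links are built at runtime instead of written into the HTML.
  for (var i = 1; i <= 5; i++) {
    var a = document.createElement('a');
    a.href = '/list?page=' + i;   // the "real link address" hidden from naive crawlers
    a.textContent = i;
    document.getElementById('pager').appendChild(a);
  }
</script>"""
print(page_html)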
