A summary of methods for preventing web pages from being collected by search engine crawlers and web capture tools

Source: Internet
Author: User

Source: Scripting House http://www.jb51.net/yunying/28470.html

The following methods can address both the symptoms and the root cause:
1. Limit the number of requests per IP address per unit of time
Analysis: No ordinary visitor accesses the same site 5 times a second. Only a program does that, and the programs with this habit are search engine crawlers and annoying collectors.
Disadvantages: It is indiscriminate, so it also prevents search engines from indexing the site
Applicable websites: Websites that don't rely heavily on search engines
What the collector will do: Reduce the number of requests per unit of time, at the cost of lower collection efficiency
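
The per-IP limit described above can be sketched as a sliding-window counter. This is a minimal illustration, not the article's own implementation: the class name RateLimiter, the default of 5 requests per 1-second window (borrowed loosely from the figure in the analysis), and the in-memory storage are all assumptions.

```python
import time
from collections import defaultdict, deque


class RateLimiter:
    """Sliding-window limiter: at most max_requests per window_seconds per IP.

    Hypothetical helper for illustration; state is kept in memory only.
    """

    def __init__(self, max_requests=5, window_seconds=1.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of accepted requests

    def allow(self, ip, now=None):
        """Return True if a request from ip at time now is within the limit."""
        now = time.monotonic() if now is None else now
        hits = self._hits[ip]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # over the limit: reject or delay the request
        hits.append(now)
        return True
```

A collector that slows down to stay under the threshold passes this check, which is exactly the countermeasure noted above.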

2. Block IP addresses
Analysis: Use a back-end counter to record visitor IPs and access frequency, review the access records manually, and block suspicious IPs.
Disadvantages: None apparent, except that it keeps the webmaster busy
Applicable websites: All sites, provided the webmaster can tell which visitors are Google's or Baidu's robots
What the collector will do: Guerrilla warfare! Use a different IP proxy for each collection run, although going through proxies reduces the collector's efficiency and speed.
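
The counting step that feeds the manual review might look like the sketch below. It assumes an access log in which the client IP is the first whitespace-separated field of each line; the function names, the threshold, and the allowlist of known search engine robots are hypothetical.

```python
from collections import Counter


def count_requests_by_ip(log_lines):
    """Count requests per IP, assuming the IP is the first field of each line."""
    counts = Counter()
    for line in log_lines:
        fields = line.split()
        if fields:
            counts[fields[0]] += 1
    return counts


def suspicious_ips(counts, threshold, allowlist=()):
    """List IPs with more than threshold requests, skipping allowlisted robots."""
    return sorted(
        ip for ip, n in counts.items() if n > threshold and ip not in allowlist
    )
```

The output is only a candidate list. As the analysis says, a person still decides which addresses to block.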

3. Use JavaScript to encrypt page content
Note: I have not used this method myself; I have only seen it described elsewhere
Analysis: No analysis needed. It kills search engine crawlers and collectors alike
Applicable websites: Websites that strongly dislike both search engines and collectors
What the collector will do: If you are willing to go that far, he will give up and not collect from you.
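
The article gives no details of the scheme, so the following is only one possible reading: the server emits the text as a Base64 payload and a small script decodes it in the browser. Base64 is obfuscation rather than real encryption, and the function name obfuscated_block and the element id are assumptions.

```python
import base64


def obfuscated_block(text, element_id="content"):
    """Return HTML whose readable text only appears after JavaScript runs.

    element_id is assumed to be a plain identifier (letters, digits, dashes).
    """
    payload = base64.b64encode(text.encode("utf-8")).decode("ascii")
    return (
        '<div id="{id}"></div>\n'
        "<script>\n"
        "  // Decode the Base64 payload back into UTF-8 text in the browser.\n"
        '  var bytes = Uint8Array.from(atob("{payload}"), '
        "function (c) {{ return c.charCodeAt(0); }});\n"
        '  document.getElementById("{id}").textContent = '
        "new TextDecoder().decode(bytes);\n"
        "</script>"
    ).format(id=element_id, payload=payload)
```

A client that does not execute JavaScript sees only the empty div and the encoded payload, which is why crawlers and simple collectors are both shut out.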

4. Hide copyright notices or random junk text in the page, with the styles for that text defined in the CSS file
Analysis: This does not prevent collection, but the collected copy of your content ends up full of copyright notices or junk text. A typical collector does not fetch your CSS file along with the page, so the text loses the style that hides it and shows up.
Applicable websites: All sites
What the collector will do: Copyright text is easy to handle: replace it. Random junk text has no automatic fix, so it takes some diligent manual cleanup.
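
A minimal sketch of the idea, assuming the hiding rule lives in an external stylesheet that the collector does not download. The class name site-note, the CSS rule, and the function name are hypothetical.

```python
import html

HIDDEN_CLASS = "site-note"  # hypothetical class name
# This rule belongs in the external CSS file, not in the page itself.
CSS_RULE = ".site-note { display: none; }"


def with_hidden_notices(paragraphs, notice):
    """Render paragraphs as HTML, adding a CSS-hidden notice after each one."""
    parts = []
    for text in paragraphs:
        parts.append("<p>{}</p>".format(html.escape(text)))
        parts.append(
            '<span class="{}">{}</span>'.format(HIDDEN_CLASS, html.escape(notice))
        )
    return "\n".join(parts)
```

In a browser that loads the stylesheet the notices are invisible. In a copy taken without the stylesheet, every paragraph is followed by visible notice text.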

5. Require users to log in to access site content
Analysis: Search engine crawlers are not designed with a login routine for every website of this kind. Reportedly, though, a collector can be built to simulate a user logging in and submitting the form for a specific website.
Applicable websites: Websites that dislike search engines and want to block most collectors
What the collector will do: Build a module that simulates the user login and form submission

6. Use a scripting language for pagination (hidden paging)
Analysis: As before, search engine crawlers do not analyze the hidden paging of each individual website, which hurts the site's indexing. A person writing collection rules, however, has to analyze the target page's code anyway, and anyone with some scripting knowledge will find the real link addresses of the pages.
Applicable websites: Websites that are not highly dependent on search engines and whose would-be collectors don't know scripting
What the collector will do: Or rather, what the person behind the collector will do: he has to analyze your page code anyway, so he analyzes your paging script along the way, which takes little extra time.

7. Anti-hotlinking measures (content can only be viewed by following a link from one of the site's own pages, for example by checking Request.ServerVariables("HTTP_REFERER"))
Analysis: ASP and PHP can read the HTTP_REFERER value of a request to determine whether the request came from this site. This restricts collectors, but it also restricts search engine crawlers and seriously affects the indexing of the protected content.
Applicable websites: Websites that do not much care about being indexed by search engines
What the collector will do: Spoof HTTP_REFERER, which is not difficult.
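
The referer check can be sketched in Python as well as in ASP or PHP. The function name and the set of allowed host names are assumptions, and the Referer value is whatever the client chose to send.

```python
from urllib.parse import urlsplit


def is_same_site_referer(referer, allowed_hosts):
    """Return True if the Referer header points at one of the site's own hosts.

    referer is the raw header value (or None if the header is missing);
    allowed_hosts is a set of lowercase host names for this site.
    """
    if not referer:
        return False  # no Referer: direct access or a stripped header
    host = urlsplit(referer).hostname  # None for malformed values
    return host is not None and host in allowed_hosts
```

Because the header is supplied by the client, a collector that sends a Referer naming one of the allowed hosts passes this check, which matches the countermeasure above.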

8. Present the site's content entirely in Flash, images, or PDF
Analysis: Search engine crawlers and collectors handle these formats poorly, as anyone with some SEO knowledge knows
Applicable websites: Media and design sites that do not care about search engine indexing
What the collector will do: Give up and leave.

9. Randomly serve pages with different templates
Analysis: A collector locates the content it wants by the structure of the page, so when the template differs between two requests, its collection rules fail, which is good. This does not affect search engine crawlers.
Applicable websites: Dynamic sites that are not concerned about user experience
What the collector will do: A site is unlikely to have more than 10 templates, so write one rule per template and apply the matching rule to each page. If there are more than 10 templates, the target site has gone to so much trouble that the collector may as well oblige and withdraw.
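
A small sketch of random template selection. The three templates, their markup, and the function name render are made up for illustration; a real site would use its own template engine.

```python
import html
import random

# Hypothetical templates: same content, different page structure.
TEMPLATES = [
    '<div class="post"><h1>{title}</h1><div class="body">{body}</div></div>',
    "<article><header>{title}</header><section>{body}</section></article>",
    "<table><tr><th>{title}</th></tr><tr><td>{body}</td></tr></table>",
]


def render(title, body, rng=random):
    """Render the page with a template chosen at random for each request."""
    template = rng.choice(TEMPLATES)
    return template.format(title=html.escape(title), body=html.escape(body))
```

A rule such as "take the text inside div.body" works only for the first template. The countermeasure above amounts to writing one rule per entry in TEMPLATES.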

10. Use dynamic, irregular HTML tags
Analysis: This one is rather extreme. An HTML tag renders the same with or without extra spaces before the closing bracket, so <div> and <div > are identical to the browser, but to a collector they are two different tags. If the number of spaces in the HTML tags of each page is random, the collection rules stop working. This has little effect on search engine crawlers.
Applicable websites: All dynamic websites that are willing to ignore web design conventions
What the collector will do: Countermeasures exist. There are plenty of HTML cleaners: clean up the HTML tags first, then write the collection rules, and run the cleaner before applying the rules so that the required data can still be extracted.
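
Both sides of this method can be sketched with one regular expression: the site pads each tag with a random number of spaces before the closing bracket, and the collector's cleaner strips them again. The function names are hypothetical, and the pattern is a simplification that assumes attribute values contain no angle brackets.

```python
import random
import re

# Matches an opening or closing tag and captures the part before any
# trailing whitespace (group 1) and the closing ">" or "/>" (group 2).
TAG = re.compile(r"(<[A-Za-z/][^<>]*?)\s*(/?>)")


def randomize_tag_spacing(page, max_spaces=3, rng=random):
    """Site side: insert 0..max_spaces spaces before each tag's closing bracket."""
    return TAG.sub(
        lambda m: m.group(1) + " " * rng.randint(0, max_spaces) + m.group(2),
        page,
    )


def normalize_tags(page):
    """Collector side: remove whitespace before each tag's closing bracket."""
    return TAG.sub(lambda m: m.group(1) + m.group(2), page)
```

A rule that matches the literal string <div> fails on the randomized page, while the same rule applied after normalize_tags works again.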
