Website Anti-Collection Technology

Source: Internet
Author: User

Some time ago I was talking with a few friends about content collection (scraping). I had never worked in this area, but I thought I could find a way to counter it. After some research, it does seem feasible: not to prevent collection completely, but to raise its cost, to make the collected content unusable, or to force the collector to spend a lot of manual effort analyzing and filtering what was collected.

The methods below are taken from someone else's article (author: Shangai (Xiao-qi)); I have extracted the parts that deal with anti-collection.

First, anti-collection countermeasures based on pagination file name rules

Most collectors work by analyzing the naming rule of paginated files and then collecting many pages in a batch. If others cannot work out the naming rule of your pagination files, they cannot collect your site's pages in bulk.
Implementation method:
I think hashing the pagination file name with MD5 is a good approach. Some people will object: if you generate pagination file names with MD5, others can apply the same rule and reproduce your file names.

My point is that when we hash the pagination file name, we should not hash only the part of the file name that changes.
If i represents the page number, then we should not do this:
page_name = MD5(i, 16) & ".htm"

It is better to append one or more characters to the page number before hashing, for example:
page_name = MD5(i & "any one or several letters", 16) & ".htm"

Because MD5 is a one-way hash and cannot be reversed, others see only the MD5 result in the file name and cannot tell which letters you appended to i, unless they brute-force the MD5, which is not very realistic as long as the appended string is long enough and kept secret.
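
The naming scheme above can be sketched in Python as follows. This is only an illustration, not the original author's code: the function name page_name, the SECRET_SUFFIX value, and the choice to keep the first 16 hex characters (one possible reading of the 16 in MD5(i, 16)) are all assumptions.

```python
import hashlib

# Hypothetical secret suffix (an assumption for this sketch).
# Keep it long, random, and out of any publicly visible code.
SECRET_SUFFIX = "k3Vq9xT2"


def page_name(page_number, suffix=SECRET_SUFFIX, length=16):
    """Return an opaque file name for one pagination page."""
    # Hash the page number together with the secret suffix,
    # not the page number alone.
    raw = f"{page_number}{suffix}".encode("utf-8")
    digest = hashlib.md5(raw).hexdigest()  # 32 hex characters
    # Keep only the first `length` characters of the digest.
    return digest[:length] + ".htm"


if __name__ == "__main__":
    for i in range(1, 4):
        print(i, page_name(i))
```

The file names stay stable for a given suffix, so the site itself can always regenerate the link to page i, while a collector who only sees the hashes has no sequence to iterate over.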

Second, anti-collection countermeasures based on page code rules

If our content pages follow no consistent code pattern, others cannot extract the content they need from the code.
So to prevent collection at this step, we have to make the code irregular.
Implementation method:
Randomize the tags that the collector needs to extract.
1. Define multiple page templates in which the important HTML tags differ, and pick a template at random when rendering the page content. Some pages use a CSS+div layout and others a table layout. This method is somewhat troublesome, because each content page needs several templates. But collecting is itself tedious work, and if building one more template helps to deter collection, many people will find it worthwhile.
2. If the method above is too much trouble, randomizing the important HTML tags in the page is enough.

The more page templates you make and the more random the HTML code is, the more trouble the other side has analyzing the content code and the harder it is to write a collection strategy for your site. At that point most people will back off, because these people are lazy to begin with, which is why they collect other people's site data. Moreover, most people currently collect data with acquisition programs developed by others; those who develop their own collection programs are, after all, a minority.
(Acquisition programs are generally generic and allow only a limited number of parameters to be set.)
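
A minimal Python sketch of both variants might look like this. The three templates, the render function, and the random class names are hypothetical and exist only to illustrate the idea; a real site would use its own template engine.

```python
import random
import string
from html import escape

# Hypothetical templates: the same content wrapped in different markup
# (div layout, table layout, section layout).
TEMPLATES = [
    '<div class="{cls}"><h1>{title}</h1><div>{body}</div></div>',
    '<table class="{cls}"><tr><td><b>{title}</b></td></tr>'
    '<tr><td>{body}</td></tr></table>',
    '<section id="{cls}"><h2>{title}</h2><p>{body}</p></section>',
]


def random_class(rng, length=8):
    """Generate a random lowercase class or id name."""
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))


def render(title, body, rng=None):
    """Render the content with a randomly chosen template and class name."""
    rng = rng or random.Random()
    template = rng.choice(TEMPLATES)  # variant 1: random template
    return template.format(
        cls=random_class(rng),        # variant 2: random tag attribute
        title=escape(title),
        body=escape(body),
    )


if __name__ == "__main__":
    for _ in range(3):
        print(render("Article title", "Article text"))
```

A collector that relies on a fixed pattern such as a particular class name or tag pair then has to handle every template, and the random attribute values give it no stable anchor.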

Here are a few more simple ideas:
1. Display content that matters to data collectors but not to search engines with client-side script.
2. Split one page of data across N pages, which also makes collection harder.
3. Use deeper links. Most current acquisition programs can only collect the first 3 levels of a site, so content that sits deeper in the link hierarchy can also avoid being collected. However, this may make browsing inconvenient for visitors.
For example:
Most sites are structured as: home ---- content index pagination ---- content page
This can be changed to:
home ---- content index pagination ---- content page entry ---- content page
Note: it is best to add code to the content page entry that automatically redirects to the content page.
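
The three ideas could be sketched in Python roughly as below. All function names are hypothetical, and using document.write and a meta refresh tag is just one assumed way to implement "client script display" and "automatically goes to the content page".

```python
import json
from html import escape


def as_client_script(html_fragment):
    """Idea 1: emit the fragment via JavaScript instead of plain HTML."""
    # json.dumps yields a valid JavaScript string literal; escaping "</"
    # keeps a "</script>" inside the fragment from closing the tag early.
    literal = json.dumps(html_fragment).replace("</", "<\\/")
    return f"<script>document.write({literal});</script>"


def split_into_pages(paragraphs, n):
    """Idea 2: spread the paragraphs over at most n pages, in order."""
    size = max(1, -(-len(paragraphs) // n))  # ceiling division
    return [paragraphs[i:i + size] for i in range(0, len(paragraphs), size)]


def entry_page(target_url, delay_seconds=0):
    """Idea 3: an intermediate entry page that forwards to the content page."""
    url = escape(target_url, quote=True)
    return (
        "<html><head>"
        f'<meta http-equiv="refresh" content="{delay_seconds}; url={url}">'
        "</head><body>"
        f'<a href="{url}">Continue</a>'  # fallback link for the visitor
        "</body></html>"
    )


if __name__ == "__main__":
    print(as_client_script("<p>Contact: 555-0100</p>"))
    print(split_into_pages(["p1", "p2", "p3", "p4", "p5"], 2))
    print(entry_page("/content/page.htm"))
```

The automatic redirect keeps the extra level from costing visitors a click, which limits the browsing inconvenience mentioned above.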

In fact, applying just the first countermeasure (hashing the pagination file name rule) already works well against collection. Still, I recommend using both methods together, which makes collection harder still, so that collectors recognize the difficulty and give up.
