From the collection principles described earlier, you can see that most collection (scraping) programs work by analyzing rules, such as the naming rules of pagination files and the rules of the page code.
First, countermeasures against collection by pagination file name rules
Most collectors rely on analyzing the naming rules of pagination files in order to collect many pages in batches. If others cannot work out the naming rules of your pagination files, they cannot collect multiple pages of your site in bulk.
Implementation method:
I think encrypting (strictly speaking, hashing) the pagination file name with MD5 is a good approach. Some people will object here: if you encrypt the pagination file name with MD5, others can follow the same rule, simulate your encryption, and obtain your pagination file names.
What I want to point out is that when we encrypt the pagination file name, we should not encrypt only the part of the file name that changes.
If i represents the page number, then we should not encrypt like this: Page_name = md5(i, 16) & ".htm"
It is best to append one or more characters to the page number before encrypting, such as: Page_name = md5(i & "any one or several letters", 16) & ".htm"
Because MD5 cannot be reversed, and all that others see in the page name is the MD5 result, they cannot know which letters you appended after i unless they brute-force the MD5 hash, which is not very realistic.
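As a rough illustration of the naming rule above, here is a minimal Python sketch. The names `page_name` and `SECRET_SUFFIX` are hypothetical and do not come from the original text, and `md5(i, 16)` is assumed here to mean the middle 16 hexadecimal characters of the MD5 digest, which is how common 16-character MD5 helpers behave.

```python
import hashlib

# Hypothetical secret suffix (an assumption for this sketch).
# Keep it private; a longer, random value is harder to guess.
SECRET_SUFFIX = "k3Qz"


def page_name(page_number: int, suffix: str = SECRET_SUFFIX) -> str:
    """Return an opaque file name for one pagination page."""
    # Hash the page number together with the secret suffix,
    # not the page number alone.
    text = f"{page_number}{suffix}"
    digest = hashlib.md5(text.encode("utf-8")).hexdigest()
    # Assumed 16-character form: the middle 16 hex characters.
    return digest[8:24] + ".htm"
```

Calling `page_name(1)`, `page_name(2)`, and so on gives file names with no visible numeric pattern, while the site itself can still regenerate every name because it knows the suffix.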
Second, countermeasures against collection by page code rules
If our content pages have no code rules, others cannot extract the content they need from your code. So to prevent collection at this step, we must make the code follow no rules.
Implementation method:
Randomize the tags that the other party needs in order to extract the content.
1. Create multiple custom page templates in which the important HTML tags differ. When rendering the page content, select a template at random, so that some pages use a CSS+div layout and others use a table layout. This method is somewhat troublesome, since one content page needs several templates. But collecting is itself a tedious task, and if making an extra template deters collection, many people will find it worthwhile.
2. If the above method is too troublesome, simply randomizing the important HTML tags in the page is also enough.
The more page templates you make and the more random the HTML code is, the more trouble the other side has analyzing the content code, and the harder it is to write a collection strategy for your site. At this point most people will back off, because such people are lazy, which is why they collect other people's website data in the first place. Moreover, most people currently use collection programs developed by others; those who develop their own collection programs to collect data are, after all, a minority.
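The two methods above can be sketched together in Python. This is only an illustration under stated assumptions: the three templates, the function names `render_page` and `random_class_name`, and the idea of randomizing a class name are all hypothetical choices, not something the original text prescribes.

```python
import html
import random

# Hypothetical templates: the same content wrapped in different
# important tags (div layout, table layout, section layout).
TEMPLATES = [
    '<div class="{cls}"><h1>{title}</h1><div>{body}</div></div>',
    '<table class="{cls}"><tr><th>{title}</th></tr>'
    '<tr><td>{body}</td></tr></table>',
    '<section id="{cls}"><h2>{title}</h2><p>{body}</p></section>',
]


def random_class_name(rng, length=8):
    """Build a random lowercase identifier for a class or id."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    return "".join(rng.choice(letters) for _ in range(length))


def render_page(title, body, rng=None):
    """Render the content with a randomly chosen template."""
    if rng is None:
        rng = random.Random()
    template = rng.choice(TEMPLATES)
    return template.format(
        cls=random_class_name(rng),
        title=html.escape(title),
        body=html.escape(body),
    )
```

Because both the template and the class name change from page to page, an extraction rule written against one page's tags is less likely to match the next page.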
Here are some more simple ideas:
1. Display content that is important to data collectors but not important to search engines with client-side script.
2. Split one page of data across N pages; this also increases the difficulty of collecting.
3. Use deeper link levels. Most current collection programs can only collect the first 3 levels of a site, so if the content sits at a deeper link level, it can also avoid being collected. However, this may inconvenience visitors. For example:
Most sites are structured as: Home----content index pagination----content page
If changed to:
Home----content index pagination----content page entry----content page
Note: it is best to add code to the content page entry that automatically redirects to the content page:
<meta http-equiv="Refresh" content="6;url= content page (http://www.xiaoqi.net)">
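Ideas 2 and 3 above can be sketched in Python as follows. The function names `split_into_pages` and `entry_page`, the fallback link, and the choice to split by paragraphs are assumptions made for illustration; only the 6-second meta refresh comes from the text.

```python
import html
import math


def split_into_pages(paragraphs, n):
    """Split a list of paragraphs into at most n consecutive pages."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Page size is rounded up so that no more than n pages result.
    size = max(1, math.ceil(len(paragraphs) / n))
    return [paragraphs[i:i + size] for i in range(0, len(paragraphs), size)]


def entry_page(content_url, delay_seconds=6):
    """Build a content page entry that redirects to the content page."""
    safe_url = html.escape(content_url, quote=True)
    return (
        "<html><head>"
        f'<meta http-equiv="Refresh" content="{delay_seconds};url={safe_url}">'
        "</head><body>"
        # Fallback link for visitors whose browser ignores the refresh.
        f'<a href="{safe_url}">Continue to the content page</a>'
        "</body></html>"
    )
```

For example, `split_into_pages(["a", "b", "c", "d", "e"], 2)` yields two pages, and `entry_page("http://www.xiaoqi.net")` yields an intermediate page that adds one more link level before the content.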
In fact, doing just the first step (encrypting the pagination file name rules) already works well against collection. Still, I recommend using both methods at the same time to make collection harder, so that collectors back off in the face of the difficulty.