How to completely prevent your website's content from being crawled by other sites

For many webmasters, especially those who write original content, having their information scraped by other sites is deeply frustrating. Many people have studied all sorts of ways to stop bulk scraping by other sites without hurting search-engine indexing. Approaches widely believed to be effective include hex-encoding the article content (which is in fact very easy to decode), and none of them really achieve the goal. Is there a way to stay indexable by search engines while effectively preventing other sites from crawling? There is, and it is very simple to implement. Enough talk; on to the subject.

Method One: effectively blocks common crawlers while letting search engines index normally (but does not completely prevent crawling).

Take my site, a Chinese learning network, as an example. Open a list page such as http://www.xue163.com/zhanzhang/yz/yh/ and look at the links to the individual list pages. The first page is: http://www.xue163.com/zhanzhang/yz/yh/1247_1_2e8c.htm

The second page is: http://www.xue163.com/zhanzhang/yz/yh/1247_2_6EAA.htm

The other pages follow the same pattern. Do you see the difference? The http://www.xue163.com/zhanzhang/yz/yh/1247_ prefix and the page number are the same kind of thing on every page, but each URL carries an extra piece at the end, and that piece is different every time. Below I explain the principle (this applies to generic crawlers only):

1. A generic crawler works from two directions: first the list pages, then the final pages (the article pages themselves). So we must make it inconvenient for it to obtain either kind of address.

2. How does it crawl the list pages? Take the addresses above. If my first page were http://www.xue163.com/zhanzhang/yz/yh/1247_1.htm and my second page http://www.xue163.com/zhanzhang/yz/yh/1247_2.htm, and so on, then crawling would be trivial: just substitute the page number, so the fifth page would be http://www.xue163.com/zhanzhang/yz/yh/1247_5.htm. Instead, I append a parameter to each address, and that parameter is different for every column and every page, so a generic crawler cannot derive my list-page addresses from a general rule. Where does the parameter come from, and how do you keep others from discovering the rule? It is very simple: MD5. In the page, the plain address is simply replaced by the tokenized one. Here ClassID is the article category's ID, and the "xue163.com" fed into the MD5 is a string I chose at random; it could be any string. Every category and every page thus gets a different address. Concatenate three values, the salt string, the category ID, and the current page number, and run MD5 over the result. Because an MD5 digest is too long, use Left() to keep just the first few characters; the effect is the same. Of course, when generating the static pages, remember to generate them at addresses following the same rule. This simple trick alone blocks a great many crawling programs.
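To make the scheme concrete, here is a minimal sketch in TypeScript (Node.js). The author's site runs classic ASP, so the salt value, path layout, and all names below are illustrative assumptions, not the original code:

```typescript
import { createHash } from "crypto";

// Assumed salt: the author describes it as a randomly chosen string
// ("xue163.com" in his description); any string works while it stays secret.
const SALT = "xue163.com";

// Build the obfuscated list-page URL: MD5(salt + category ID + page number),
// truncated to the first few hex characters (the author's Left() trick).
function listPageUrl(classId: number, pageNo: number): string {
  const token = createHash("md5")
    .update(`${SALT}${classId}${pageNo}`)
    .digest("hex")
    .slice(0, 4);
  return `/zhanzhang/yz/yh/${classId}_${pageNo}_${token}.htm`;
}

console.log(listPageUrl(1247, 1)); // e.g. /zhanzhang/yz/yh/1247_1_<token>.htm
console.log(listPageUrl(1247, 2)); // the token differs on every page
```

A crawler that simply increments the page number now lands on dead URLs, while the static generator, which knows the salt, produces matching filenames.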

3. With the list pages handled, the final pages must also be made hard to crawl. You can obfuscate the final-page addresses the same way as in step 2, so that a generic crawler cannot work out the address of the next record. Another option is to store the generated pages under date-based directories, which likewise keeps a generic crawler from enumerating the final pages; mine, for example: http://www.xue163.com/html/2010510/5634498/ (see the sketch below).
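A quick sketch of the date-directory idea, in the same vein; the function name and the exact layout rules are assumptions inferred from the author's example URL:

```typescript
// Hypothetical sketch: store each generated article under a directory derived
// from its publish date, e.g. /html/2010510/5634498/ (year, unpadded month,
// unpadded day, then the article ID), matching the author's example.
function articlePath(articleId: number, published: Date): string {
  const y = published.getFullYear();
  const m = published.getMonth() + 1; // 1-12, no zero padding
  const d = published.getDate(); // 1-31, no zero padding
  return `/html/${y}${m}${d}/${articleId}/`;
}

console.log(articlePath(5634498, new Date(2010, 4, 10))); // /html/2010510/5634498/
```

Without a predictable sequence of paths under one directory, a generic crawler has no pattern to enumerate.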

The above is only a very simple defense against generic crawling programs. Someone skilled enough to write their own crawler is not stopped, because they can read the next-page address out of the current page and still scrape everything. Is there no remedy? There is; see Method Two.

Method Two: completely prevent crawling.

Having read Method One, you know that to eliminate crawling you must keep other people's programs from discovering your list addresses; the final pages are protected just as Method One describes. But what about the lists? If a list page carries no "next page" address, search engines cannot crawl it either; if it does carry a "next page" address, a crawler author can read my next-page address from it and still scrape the site. Here is my approach:

First, generate all the list pages according to Method One.

Second, put no next-page link address on the list page itself. Instead, use JavaScript to POST to a dynamic handler such as page.asp (ideally submitting two parameters: the current ClassID and the page number). page.asp reads the submitted parameters, derives the real list address according to Method One, and redirects to the corresponding page. The point is that a crawler's author cannot extract the next-page address from the current page. But this creates a problem: the search engine cannot get the next-page address either, so it will not index the deeper pages. The workaround is to generate, per Method One, list pages covering all the information; these list pages carry no "previous page" or "next page" addresses at all and are pure information lists. Then, on the final pages, place links to a few different list pages, varying from one final page to the next, either as hidden links or as plain visible ones (a sketch of one way to pick them appears after the diagram below). Their sole purpose is to give the search engine an entrance. The entrance the search engine would normally have followed:
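As a browser-side illustration, here is how the "next page" control might work; the article names page.asp but shows no code, so the parameter names and the function below are assumptions:

```typescript
// Hypothetical sketch: the list page renders a "next page" button wired to
// this function instead of an <a href> that a crawler could read.
function gotoPage(classId: number, pageNo: number): void {
  const form = document.createElement("form");
  form.method = "post";
  form.action = "/page.asp"; // the server-side handler the article names

  // Submit the two parameters the article suggests: category ID and page number.
  for (const [name, value] of [
    ["classid", String(classId)],
    ["page", String(pageNo)],
  ]) {
    const input = document.createElement("input");
    input.type = "hidden";
    input.name = name;
    input.value = value;
    form.appendChild(input);
  }

  document.body.appendChild(form);
  form.submit();
}
```

On the server, page.asp would recompute the MD5 token exactly as in the first sketch and redirect to the resulting address, so the real list URL never appears in the page source.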

Home page → List page → Final page

now becomes:

Home page → Final page → List page → More final pages → More list pages → More final pages

In actual operation, list pages arranged this way get crawled by search engines more effectively than ordinary list pages do.
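For completeness, one way the per-article entrance links might be chosen; how many list pages to link, and which ones, is entirely an assumption on my part:

```typescript
import { listPageUrl } from "./listPageUrl"; // the helper from the first sketch

// Hypothetical sketch: each final page links to a small, article-dependent set
// of list pages, so every list page stays reachable by the search engine while
// no single page reveals the whole sequence.
function entranceLinks(articleId: number, classId: number, pageCount: number): string[] {
  const picks = [1, (articleId % pageCount) + 1, ((articleId * 7) % pageCount) + 1];
  return [...new Set(picks)].map((p) => listPageUrl(classId, p));
}
```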

The principle is really simple: deny the crawler any address it can derive, so that it cannot batch-crawl the corresponding lists of information. Method Two is a little more involved, but in my tests it not only improved search-engine indexing but also effectively prevented others from scraping. If my explanation is unclear, webmasters are welcome to discuss it with me on QQ: 20127430. Reprinting is welcome. Author: Study Net (xue163.com).
