Evading the search engine's watchful eye

Source: Internet
Author: User
Keywords: search engine, robot


Why would we want to do the opposite of everyone else?

If you are a webmaster, you probably do everything you can to get your site found by search engines and ranked well. Sometimes, though, you have not submitted your site to any search engine and still find, inexplicably, that it turns up in search results. You may be happy for the whole world to see some of your pages, but there is other content you do not want discovered and indexed. You could require users to log in, but that does not keep search engines out: once one of your inner pages is in a search engine, anyone who finds it there can open it without a password. Simple encryption is often easily broken. What about using a database? That consumes valuable site space, and for some simple sites it is not an option at all. So what can you do? A search engine is not an unreasonable burglar that forces its way in. How do you keep it outside the door?

Explore the principles of search engines

First, we need to know how a search engine works. A web search engine consists of three main parts: the web robot (the key to this whole article), the index database, and the query service. Pages found by a web robot are indexed in the search engine's database, and anyone using the query client can then find your page. The part to study here is therefore the web robot. The index database and the query service are not analyzed in detail.

A web robot is simply a program. It examines the hypertext structure and URL links of large numbers of websites and recursively retrieves all of a site's content. These programs are also called "spiders", "web wanderers", "web worms", or web crawlers. Every large search engine runs dedicated robots to collect this information, and high-performance robots search the Internet automatically. A typical robot looks at a page and extracts keywords and page information such as headings, the title shown in the browser, and words commonly used in searches. It then follows every link on that page to look for more information, and so on until no links remain. To cover the whole Internet quickly, robots usually use preemptive multithreading: a robot indexes a page reached through a URL, then starts a new thread for each new link it finds and indexes from that new starting point. Once the collected information is indexed, users can search it. You may wonder whether this goes on forever. It does not: robots need rest too. They are sent out on a schedule and stop when their work period ends, which is why a newly finished page is not added to a search engine's index right away. That covers the basic working principle of a web search engine. The next task is to rein the robot in, so that it does not walk through every door and charge down every road it sees.
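The link-following behaviour described above can be sketched in a few lines of Python. This is only an illustration, not how any real search engine is built: it is single-threaded rather than multithreaded, and the "web" is a hypothetical in-memory dictionary (LINKS) instead of real pages fetched over the network. The function name crawl is likewise made up for this sketch.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the links found on that page.
LINKS = {
    "/index.html": ["/about.html", "/news.html"],
    "/about.html": ["/index.html"],
    "/news.html": ["/news/today.html", "/about.html"],
    "/news/today.html": [],
}


def crawl(start, links):
    """Return the pages in the order a breadth-first robot would index them."""
    seen = {start}            # pages already discovered
    queue = deque([start])    # pages waiting to be visited
    indexed = []
    while queue:
        url = queue.popleft()
        indexed.append(url)   # "index" the page
        for link in links.get(url, []):
            if link not in seen:   # skip known pages so the crawl terminates
                seen.add(link)
                queue.append(link)
    return indexed


print(crawl("/index.html", LINKS))
```

The set of already-seen pages is what stops the robot from looping forever when pages link back to each other.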

Evading the search engine's watchful eye

Search engine developers have, for their part, left webmasters and page authors a way to limit what the robots do:

When a robot visits a website, it first checks whether the owner agrees to let it in, like a stranger calling at a big mansion. If the owner refuses, the robot leaves quietly; if the owner agrees, the robot learns which rooms it may enter. In practice the robot first checks whether the file /robots.txt exists on the site. If the file cannot be found, the robot walks straight in and looks for whatever it needs. If the file is found, the robot determines the scope of its access from the file's contents. An empty file is equivalent to a missing one, and the robot proceeds just as boldly. Remember that robots.txt must be placed in the root directory of the site.
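As a rough sketch of that decision, the following Python uses the standard library's urllib.robotparser. The helper names robots_url and may_visit, the robot name "examplebot", and the example.com address are assumptions made for illustration; nothing is fetched from the network, and the robots.txt text is passed in directly (None stands for "file not found").

```python
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser


def robots_url(site_url):
    """Build the address of robots.txt in the root directory of the site."""
    return urljoin(site_url, "/robots.txt")


def may_visit(robots_text, path, robot_name="examplebot"):
    """Decide whether robot_name may fetch path.

    robots_text is the content of robots.txt, or None if the site has no
    such file. A missing file and an empty file both allow everything.
    """
    if robots_text is None:
        return True
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(robot_name, path)


print(robots_url("http://example.com/docs/page.html"))
print(may_visit(None, "/page.html"))                          # no robots.txt
print(may_visit("", "/page.html"))                            # empty robots.txt
print(may_visit("User-agent: *\nDisallow: /\n", "/page.html"))  # everything blocked
```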

A record in the robots.txt file usually starts with one or more User-agent lines, followed by a number of Disallow lines, as follows:

User-agent:

The value of this field names the search engine robot that the record applies to; different search engines use different robot names. If "robots.txt" contains several User-agent records, several robots are restricted by the protocol, and the file needs at least one User-agent record to restrict any robot at all. If the value is set to *, the record applies to any robot, and the file may contain only one "User-agent: *" record.

Disallow:

The value of this field is a URL that robots should not access. It can be a complete path or a partial one: any URL that begins with the Disallow value will not be accessed by the robot. For example, "Disallow: /hacker" blocks search engine access to both /hacker.html and /hacker/index.html, while "Disallow: /hacker/" lets the robot access /hacker.html but not /hacker/index.html. An empty value ("Disallow:") means that every part of the site may be accessed. Each record in "/robots.txt" must contain at least one Disallow line.
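The prefix rule can be checked with Python's standard urllib.robotparser module. This is a minimal sketch: the helper name allowed and the robot name "examplebot" are made up for illustration, and each call builds a one-record robots.txt that applies to every robot.

```python
from urllib.robotparser import RobotFileParser


def allowed(disallow_value, path):
    """Return True if a robot may fetch path under a single Disallow rule."""
    parser = RobotFileParser()
    parser.parse(["User-agent: *", f"Disallow: {disallow_value}"])
    return parser.can_fetch("examplebot", path)


# "Disallow: /hacker" blocks every URL that starts with /hacker
print(allowed("/hacker", "/hacker.html"))         # False
print(allowed("/hacker", "/hacker/index.html"))   # False

# "Disallow: /hacker/" blocks only URLs inside the /hacker/ directory
print(allowed("/hacker/", "/hacker.html"))        # True
print(allowed("/hacker/", "/hacker/index.html"))  # False

# An empty Disallow value allows everything
print(allowed("", "/hacker/index.html"))          # True
```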

Here are some examples of robots.txt. Save any of them as robots.txt, upload it to the root directory of the site, and the pages it covers are hidden from the search engines:

Example 1. Prohibit all search engines from accessing any part of the site:

User-agent: *
Disallow: /

Example 2. Allow all robots to access everything:

User-agent: *
Disallow:

Example 3. Block one particular search engine:

User-agent: badbot
Disallow: /

Example 4. Allow only one particular search engine:

User-agent: baiduspider
Disallow:

User-agent: *
Disallow: /
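One way to convince yourself that Example 4 behaves as intended is to feed it to Python's standard urllib.robotparser and ask what each robot may fetch. This is a sketch only: the helper name can_fetch, the robot name "otherbot", and the path /index.html are made up for the test, and a real search engine's own parser may differ in details.

```python
from urllib.robotparser import RobotFileParser

# Example 4: one record for baiduspider, one catch-all record for the rest.
EXAMPLE_4 = """\
User-agent: baiduspider
Disallow:

User-agent: *
Disallow: /
"""


def can_fetch(robots_text, robot_name, path):
    """Return True if robot_name may fetch path under robots_text."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(robot_name, path)


print(can_fetch(EXAMPLE_4, "baiduspider", "/index.html"))  # the named robot
print(can_fetch(EXAMPLE_4, "otherbot", "/index.html"))     # any other robot
```

A robot uses the record that names it when there is one and falls back to the "*" record otherwise, which is why the named robot gets in while every other robot is turned away.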

Example 5. A simple example:

In this example, the site closes three directories to search engines, that is, search engines will not access these three directories. Note that each directory must be declared on its own line and cannot be combined as "Disallow: /cgi-bin/ /bbs/". The * after "User-agent:" has a special meaning, "any robot", but it is not a general wildcard, so records such as "Disallow: /bbs/*" or "Disallow: *.gif" cannot appear in the file.

User-agent: *
Disallow: /cgi-bin/
Disallow: /bbs/
Disallow: /~private/
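If the list of restricted directories changes often, the file can be generated instead of typed by hand. The sketch below is one possible way to do that; the function name build_robots_txt is hypothetical, and it simply writes one Disallow line per directory and rejects wildcard patterns, following the rules described in this example.

```python
def build_robots_txt(directories, robot_name="*"):
    """Return robots.txt text with one Disallow line per directory."""
    lines = [f"User-agent: {robot_name}"]
    for directory in directories:
        # Wildcards such as /bbs/* or *.gif are not allowed in Disallow lines.
        if "*" in directory:
            raise ValueError(f"wildcards are not allowed: {directory!r}")
        lines.append(f"Disallow: {directory}")
    return "\n".join(lines) + "\n"


print(build_robots_txt(["/bbs/", "/~private/"]))
```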

Conclusion: once this is set up, do the restricted pages disappear from search engines immediately? No. As mentioned earlier, robots are sent out on a schedule, so once a page is recorded in the index database, the change only takes effect at the database's next update. A quicker way is to ask the search engine to remove your pages, but that also takes a few days. For a very important page, simply rename its directory or file.

Do not link to pages you want kept confidential from any of your public pages. As described in the section on how robots work, a robot starts from all the links on a page and keeps following them to find more information.

At this point you may feel that your confidential pages are safe. But think again: robots.txt is a plain text file that anyone can download over HTTP or FTP, so people with bad intentions can use it to find clues. The solution is to use Disallow only to restrict directories, and to give the confidential pages inside those directories unusual file names rather than names like index.html, which are as easy to guess as a weak password. A file name along the lines of d3gey32.html makes the page much safer.
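Hard-to-guess file names of this kind can be produced with Python's standard secrets module. This is just one possible approach, and the function name random_page_name and its parameters are made up for illustration:

```python
import secrets


def random_page_name(extension=".html", n_bytes=8):
    """Return a hard-to-guess file name such as '3f9c...e1.html'."""
    # token_hex(n_bytes) yields 2 * n_bytes random hexadecimal characters.
    return secrets.token_hex(n_bytes) + extension


print(random_page_name())
```

An unusual name only makes the page harder to guess; it does not restrict access to anyone who learns the address.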

Finally, if you are still not reassured, add password verification to the confidential pages as extra insurance, and you can rest easy.
