A search engine uses a program called a robot (also known as a spider) to automatically visit webpages on the Internet and collect page information.
You can create a plain-text file named robots.txt on your website to declare which parts of the site you do not want robots to visit. In this way, some or all of the site's content can be kept out of search engine indexes, or a given search engine can be limited to indexing only the content you specify. The robots.txt file must be placed in the root directory of the website.
When a search robot (often called a search spider) visits a site, it first checks whether the site's root directory contains robots.txt. If the file exists, the robot determines the scope of its access from the file's contents; if it does not exist, the robot simply crawls along the site's links.
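For illustration, this check can be reproduced with Python's standard urllib.robotparser module. The sketch below is only an example of the behaviour described above; the domain and the robot name "MyBot" are hypothetical.

from urllib import robotparser

# A well-behaved crawler first fetches /robots.txt from the site root.
rp = robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()  # if the file does not exist, everything is treated as allowed

# Ask whether the robot may fetch a given URL before crawling it.
print(rp.can_fetch("MyBot", "https://www.example.com/private/index.html"))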
Robots.txt file format:
The robots.txt file contains one or more records separated by blank lines (terminated by CR, CR/LF, or LF). Each record has the following format:
"<Field >:< optionalspace> <value> <optionalspace> ".
In this file, "#" can be used for comments, following the same convention as in UNIX. A record usually begins with one or more User-Agent lines, followed by several Disallow lines. The details are as follows:
User-Agent:
The value of this field is the name of the search engine robot to which the record applies. If multiple User-Agent records appear in the robots.txt file, the protocol restricts each of the named robots; at least one User-Agent record is required. If the value is set to *, the protocol applies to all robots, and only one record of the form "User-Agent: *" may appear in the file.
Disallow:
The value of this field describes a URL that robots should not visit. The URL can be a complete path or a prefix; any URL that begins with the value of a Disallow field will not be visited by the robot. For example, "Disallow: /help" blocks search engine access to both /help.html and /help/index.html, while "Disallow: /help/" allows the robot to access /help.html but not /help/index.html. An empty Disallow record means that every part of the website may be accessed, and at least one Disallow record is required in the "/robots.txt" file. If "/robots.txt" is an empty file, the website is open to all search engine robots.
Example of robots.txt File Usage:
Example 1. Prohibit all search engines from accessing any part of the website:
User-Agent: *
Disallow: /
Example 2. Allow all robots full access (or simply create an empty "/robots.txt" file):
User-Agent: *
Disallow:
Example 3. Block a specific search engine robot (here called BadBot):
User-Agent: BadBot
Disallow: /
Example 4. Allow a specific search engine (here baiduspider) and block all others:
User-Agent: baiduspider
Disallow:
User-Agent: *
Disallow: /
Example 5. In this example, the website has three directories that search engines should not access. Note that each directory must be declared on its own line rather than combined as "Disallow: /cgi-bin/ /tmp/". Also note that "*" after User-Agent has the special meaning "any robot", and wildcards are not supported in the basic standard, so lines such as "Disallow: /tmp/*" or "Disallow: *.gif" cannot appear in this file.
User-Agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~joe/
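As a quick sanity check of Example 5, the rules can be fed to Python's standard urllib.robotparser. This is only an illustrative sketch; the robot name "AnyBot" and the test paths are made up.

from urllib import robotparser

rules = """\
User-Agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~joe/
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("AnyBot", "/cgi-bin/test.cgi"))  # False: inside a blocked directory
print(rp.can_fetch("AnyBot", "/index.html"))        # True: not covered by any Disallow line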
Special robot parameters:
1. Google
Allowing only Googlebot:
If you want to block all robots other than Googlebot from accessing your pages, you can use the following syntax:
User-Agent: *
Disallow: /
User-Agent: googlebot
Disallow:
Googlebot follows the lines that refer to it specifically, rather than the lines that apply to all robots.
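This group-selection behaviour can also be observed with Python's standard parser. The sketch below is only an illustration; the robot name "OtherBot" is hypothetical.

from urllib import robotparser

rules = """\
User-Agent: *
Disallow: /

User-Agent: googlebot
Disallow:
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("googlebot", "/page.html"))  # True: the googlebot group applies
print(rp.can_fetch("OtherBot", "/page.html"))   # False: falls back to the "*" group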
"Allow" extension:
Googlebot recognizes an extension of the robots.txt standard known as "Allow". Other search engines may not recognize this extension, so check with any other search engines you are interested in. The "Allow" line works exactly like the "Disallow" line: simply list the directories or pages you want to allow.
You can also use "Disallow" and "Allow" together. For example, to block every page in a subdirectory except one, you can use the following entries:
User-Agent: googlebot
Disallow:/folder1/
Allow:/folder1/myfile.html
These entries block every page in the folder1 directory except myfile.html.
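When both an Allow and a Disallow rule match the same URL, Google applies the more specific (longer) rule. The following is a simplified sketch of that precedence logic written for this article, not Google's actual implementation, and it handles only plain path prefixes.

def most_specific_rule(path, rules):
    """Decide whether `path` is allowed, letting the longest matching
    prefix rule win. `rules` is a list of (directive, prefix) tuples."""
    verdict = True   # with no matching rule, the path is allowed
    longest = -1
    for directive, prefix in rules:
        if path.startswith(prefix) and len(prefix) > longest:
            longest = len(prefix)
            verdict = (directive == "Allow")
    return verdict

rules = [("Disallow", "/folder1/"), ("Allow", "/folder1/myfile.html")]
print(most_specific_rule("/folder1/other.html", rules))   # False: blocked by Disallow
print(most_specific_rule("/folder1/myfile.html", rules))  # True: the longer Allow rule wins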
If you want to block Googlebot but allow another Google robot (such as Googlebot-Mobile), you can use the "Allow" rule to grant that robot access. For example:
User-Agent: googlebot
Disallow: /
User-Agent: googlebot-mobile
Allow: /
Use * to match character sequences:
You can use an asterisk (*) to match a sequence of characters. For example, to block access to all subdirectories whose names begin with "private", you can use the following entries:
User-Agent: googlebot
Disallow: /private*/
To block access to all URLs that include a question mark (?), you can use the following entries:
User-Agent: *
Disallow: /*?*
Use $ to match the end of the URL:
You can use the $ character to specify the end of a URL. For example, to block any URL that ends in .asp, you can use the following entries:
User-Agent: googlebot
Disallow: /*.asp$
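As an illustration of how these patterns behave, the small sketch below translates a robots.txt path pattern with "*" and "$" into a regular expression. It is a simplified model written for this article, not any search engine's actual matcher.

import re

def pattern_to_regex(pattern):
    """Build a regex for a robots.txt path pattern: '*' matches any
    sequence of characters, and a trailing '$' anchors the end of the URL."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))

print(bool(pattern_to_regex("/private*/").match("/private1/files/")))  # True: blocked subdirectory
print(bool(pattern_to_regex("/*.asp$").match("/catalog/page.asp")))    # True: ends with .asp
print(bool(pattern_to_regex("/*.asp$").match("/page.aspx")))           # False: does not end with .asp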
You can use this pattern in combination with the Allow directive. For example, if a ? indicates a session ID, you may want to exclude all URLs that contain one so that Googlebot does not crawl duplicate pages. However, URLs ending with a ? may be the version of the page you do want included. In this case, you can set up the robots.txt file as follows:
User-Agent: *
Allow: /*?$
Disallow: /*?
The "Disallow: /*?" line blocks any URL that includes a ? (specifically, any URL that begins with your domain name, followed by any string, followed by a question mark, followed by any string).
The "Allow: /*?$" line allows any URL that ends in a ? (specifically, any URL that begins with your domain name, followed by any string, followed by a question mark, with nothing after the question mark).
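Continuing the regex sketch above, the two patterns can be checked against a couple of made-up URLs to see which rule matches which URL:

# Reuses pattern_to_regex from the earlier sketch; the URLs are hypothetical.
allow_rule = pattern_to_regex("/*?$")   # matches URLs that end with a "?"
block_rule = pattern_to_regex("/*?")    # matches URLs containing a "?" anywhere

for url in ["/page.asp?", "/page.asp?sessionid=123"]:
    print(url, bool(allow_rule.match(url)), bool(block_rule.match(url)))
# /page.asp?               -> both match; the more specific Allow rule wins, so it may be crawled
# /page.asp?sessionid=123  -> only the Disallow pattern matches, so it is blocked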
Sitemap (website map):
Sitemaps are supported by including a link to the sitemap file directly in the robots.txt file.
Like this:
Sitemap: http://www.etcis.com/sitemap.xml
The search engine companies that currently support this are Google, Yahoo, Ask, and MSN.
However, I recommend also submitting your sitemap through Google Sitemaps, which provides many features for analyzing the status of your links.
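For illustration, here is a minimal Python sketch that pulls Sitemap entries out of a robots.txt body; since Python 3.8 the standard urllib.robotparser also exposes them through its site_maps() method. The rules other than the Sitemap line are made up.

robots_txt = """\
User-Agent: *
Disallow: /tmp/
Sitemap: http://www.etcis.com/sitemap.xml
"""

# Collect every "Sitemap:" line, keeping only the URL part.
sitemaps = [line.split(":", 1)[1].strip()
            for line in robots_txt.splitlines()
            if line.lower().startswith("sitemap:")]
print(sitemaps)  # ['http://www.etcis.com/sitemap.xml']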
Benefits of robots.txt:
1. It gives the spider clear guidance, provided, of course, that the file actually exists on the website. For websites without robots.txt, the spider's request for it is answered with the site's 404 page, which is not a plain-text file; this causes trouble for the spider when it indexes the website and affects how the search engine indexes the site's pages.
2. robots.txt can stop unnecessary robots from consuming valuable server bandwidth, such as email retrievers, which are meaningless to most websites, and image strippers, which make little sense for most non-graphical websites yet consume a great deal of bandwidth.
3. robots.txt can keep search engines from indexing temporary files on the website.
4. For a website with many pages, configuring robots.txt is even more important, because such sites often face enormous pressure from search engine spiders: a flood of spider requests that, if left uncontrolled, can even affect normal access to the website.
5. Likewise, if the website contains duplicate content, robots.txt can be used to keep some of those pages from being crawled and indexed. This prevents the site from being penalized by search engines for duplicate content and keeps its ranking from being affected.
Risks Caused by robots.txt and solutions:
1. Everything that has benefits also has drawbacks, and robots.txt brings some risk: it reveals the website's directory structure and the location of private data to attackers. Although this is not a serious problem when the web server's security measures are properly configured, it does lower the bar for malicious attacks.
For example, if the private data on the website is accessed through www.yourdomain.com/private/index.html, the configuration in robots.txt may be as follows:
User-Agent: *
Disallow:/private/
In this case, an attacker only needs to download robots.txt to learn where the content you want to hide is located, and can then enter www.yourdomain.com/private/ in a browser to access the content we do not want to disclose. Two methods are generally used to address this:
Set access permissions on the /private/ content, for example with password protection, so that attackers cannot reach it.
Another method is to rename the main file index.html to something such as abc-protect.html, so that the address of the content becomes www.yourdomain.com/private/abc-protect.html, which cannot easily be guessed.
2. If robots.txt is not set up correctly, the search engine may delete all of the data it has already indexed.
User-Agent: *
Disallow: /
The directives above block all search engines from indexing the site.
At present, the vast majority of search engine robots comply with the rules of robots.txt. Support for the robots meta tag is still limited, but it is gradually increasing. Google, for example, fully supports it, and has also added an "ARCHIVE" directive to control whether Google keeps a snapshot of a page. For example:
<meta name="googlebot" content="index,follow,noarchive">
This tells Googlebot to index the page and follow the links on it, but not to keep a snapshot of the page on Google.
-----Example of the robots.txt file ---
User-Agent: *
Disallow: /default.aspx?langtypeid
Disallow: /_newimg/
Disallow: /bin/
Disallow: /test.aspx
Disallow: /*langtypeid*
Disallow: /adv/
Allow: /adv/adv.html