1. What is robots.txt?
robots.txt is a plain-text file that acts as an agreement between a website and search engine crawlers. When a search engine spider visits a site, it first checks whether a robots.txt file exists in the site's root directory.
If the file exists, the spider decides what it may access according to the file's contents; if not, the spider simply crawls along the site's links. robots.txt must be placed in the root directory of the site.
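Because the file always sits at the root, its address can be derived from any page URL on the same site. The following is a minimal sketch of that lookup; the function name `robots_txt_url` and the example.com addresses are illustrative assumptions, not part of any crawler's actual code.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep only the scheme and host; drop the page's path, query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("https://www.example.com/blog/post.html?id=1"))
# https://www.example.com/robots.txt
```

A file placed anywhere else, for example in /blog/robots.txt, would not be found by this lookup, which is why the root directory matters.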
2. robots.txt Syntax
1) Allow all search engines to access all parts of the site
Robots.txt is written as follows:
User-agent: *
Disallow:
Or
User-agent: *
Allow: /
Note: Capitalize the first letter of each directive, use a half-width (ASCII) colon rather than a full-width one, and leave a space after the colon. Take care to get these details right.
2) Prohibit all search engines from accessing any part of the site
Robots.txt is written as follows:
User-agent: *
Disallow: /
3) Block spiders from certain directories only, for example to keep the admin, css, and images directories from being indexed
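The two extremes above can be checked with Python's standard `urllib.robotparser` module. This is a small sketch rather than a description of how any particular search engine evaluates the file; the helper `build_parser`, the crawler name "AnyBot", and the example.com URL are made up for illustration.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content given as a string (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


allow_all = build_parser("User-agent: *\nDisallow:\n")
block_all = build_parser("User-agent: *\nDisallow: /\n")

url = "https://www.example.com/news/today.html"
print(allow_all.can_fetch("AnyBot", url))  # True
print(block_all.can_fetch("AnyBot", url))  # False
```

An empty `Disallow:` value restricts nothing, while `Disallow: /` matches every path on the site.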
Robots.txt is written as follows:
User-agent: *
Disallow: /css/
Disallow: /admin/
Disallow: /images/
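A quick way to check directory rules like these is Python's standard `urllib.robotparser`, which matches a rule as a plain prefix of the URL path. The sketch below compares a rule written with a trailing slash against the same rule without one; the helper `is_allowed`, the crawler name "AnyBot", and the sample paths are assumptions chosen for illustration.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, path: str) -> bool:
    """Return True if a generic crawler may fetch the given path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("AnyBot", "https://www.example.com" + path)


with_slash = "User-agent: *\nDisallow: /images/\n"
without_slash = "User-agent: *\nDisallow: /images\n"

# Columns: path, allowed with "/images/", allowed with "/images"
for path in ("/images/logo.png", "/images.html", "/images-old/a.png", "/index.html"):
    print(path, is_allowed(with_slash, path), is_allowed(without_slash, path))
```

With the trailing slash only the folder's contents are blocked; without it, /images.html and /images-old/a.png are blocked as well, because they also begin with /images.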
Note: Whether the path ends with a slash matters. "Disallow: /images/" (with the slash) blocks crawling of the entire images folder, while "Disallow: /images" (no slash) blocks every path that begins with /images.
4) Block a folder, /templets, while still allowing one file inside it, /templets/main, to be crawled
Robots.txt is written as follows:
User-agent: *
Disallow: /templets
Allow: /templets/main
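When an Allow rule and a Disallow rule both match a URL, RFC 9309 and Google's documentation resolve the conflict in favor of the most specific (longest) matching path, with Allow winning a tie. The sketch below illustrates that precedence rule, assuming the Allow line carries the file's full path, /templets/main. The function `is_allowed` and its rule format are hypothetical, and wildcards are deliberately left out.

```python
def is_allowed(rules: list[tuple[str, str]], path: str) -> bool:
    """rules is a list of ("allow" | "disallow", path_prefix) pairs."""
    best_len = -1
    allowed = True  # a path matched by no rule may be crawled
    for kind, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # empty values restrict nothing; non-matching rules are ignored
        is_allow = kind == "allow"
        # The longer prefix wins; on equal length, Allow wins.
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


rules = [("disallow", "/templets"), ("allow", "/templets/main")]
print(is_allowed(rules, "/templets/main"))        # True
print(is_allowed(rules, "/templets/other.html"))  # False
print(is_allowed(rules, "/index.html"))           # True
```

Not every parser follows this rule: Python's `urllib.robotparser`, for instance, applies the first matching line in file order, so placing the Allow line before the Disallow line is the safer ordering.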
5) Prohibit access to all URLs with the ".php" suffix under the /html/ directory (including subdirectories)
Robots.txt is written as follows:
User-agent: *
Disallow: /html/*.php
6) Allow access only to files with a certain suffix in a directory, using "$"
Robots.txt is written as follows:
User-agent: *
Allow: /*.html$
Disallow: /
7) Disable indexing of all dynamic pages on the site
For example, the rule here blocks every URL that contains "?", such as index.php?id=1
Robots.txt is written as follows:
User-agent: *
Disallow: /*?*
8) Prohibit search engines from crawling all images on the site (if your site uses other image suffixes, add them here directly)
Sometimes, to save server resources, we need to stop search engines from indexing the site's images. Besides blocking a folder directly with a rule such as "Disallow: /images/", you can also block image files by their suffix.
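The "*" and "$" characters used in the last three examples can be understood by translating a pattern into a regular expression: "*" stands for any run of characters and a trailing "$" anchors the match at the end of the URL. The sketch below is one possible translation, not the matcher of any specific search engine; `pattern_to_regex`, `matches`, and the sample paths are hypothetical.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into an anchored regular expression."""
    anchored_end = pattern.endswith("$")
    if anchored_end:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with "any characters".
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored_end else ""))


def matches(pattern: str, path: str) -> bool:
    """Return True if the pattern matches the path (including any query string)."""
    return pattern_to_regex(pattern).search(path) is not None


print(matches("/html/*.php", "/html/sub/page.php"))  # True
print(matches("/html/*.php", "/news/page.php"))      # False
print(matches("/*.html$", "/about.html"))            # True
print(matches("/*.html$", "/about.html?from=home"))  # False
print(matches("/*?*", "/index.php?id=1"))            # True
print(matches("/*?*", "/index.php"))                 # False
```

Without the trailing "$" a pattern still behaves as a prefix, so "/html/*.php" also matches /html/page.php?id=1.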
Robots.txt is written as follows:
User-agent: *
Disallow: /*.jpg$
Disallow: /*.jpeg$
Disallow: /*.gif$
Disallow: /*.png$
Disallow: /*.bmp$
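If the site uses further image suffixes, the rule list can be generated rather than typed by hand. The following sketch assumes the "/*.suffix$" pattern form; the function `image_block_rules` is hypothetical, and "webp" is added only as an example of an extra suffix.

```python
def image_block_rules(extensions: list[str]) -> str:
    """Build robots.txt content that blocks URLs ending in the given suffixes."""
    lines = ["User-agent: *"]
    for ext in extensions:
        # "/*" matches any path; "$" pins the suffix to the end of the URL.
        lines.append(f"Disallow: /*.{ext.lstrip('.').lower()}$")
    return "\n".join(lines) + "\n"


print(image_block_rules(["jpg", "jpeg", "gif", "png", "bmp", "webp"]))
```

Each suffix becomes one Disallow line, so adding or removing a format is a one-word change in the list.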
Points to note when writing robots.txt
1. Capitalize the first letter of each directive, use a half-width (ASCII) colon rather than a full-width one, and leave a space after the colon. Take care to get these details right.
2. The slash "/" stands for the entire site
3. If a space is added after the "/", the entire website is blocked
4. Do not block content that should be crawled normally
5. Changes take effect anywhere from a few days to two months later
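The formatting points in item 1 lend themselves to an automated check before the file is uploaded. The sketch below flags full-width colons, missing spaces after the colon, and directive names that are not capitalized as shown in this article. The function `lint_robots_txt` is hypothetical, and it knows only the three directives used here, so other directives would be reported as unexpected.

```python
# Directive names as written in this article; anything else is reported.
KNOWN = {"User-agent", "Allow", "Disallow"}


def lint_robots_txt(text: str) -> list[str]:
    """Return a list of human-readable formatting problems."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # blank lines and comments are fine
        if "：" in stripped:
            problems.append(f"line {number}: full-width colon, use ':'")
            continue
        name, sep, value = stripped.partition(":")
        if not sep:
            problems.append(f"line {number}: missing colon")
            continue
        if name not in KNOWN:
            problems.append(f"line {number}: unexpected directive name {name!r}")
        if value and not value.startswith(" "):
            problems.append(f"line {number}: no space after the colon")
    return problems


sample = "User-agent: *\ndisallow:/admin/\nAllow：/main\n"
for problem in lint_robots_txt(sample):
    print(problem)
```

Running it on the sample reports three problems: the lowercase directive name and missing space on line 2, and the full-width colon on line 3.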
In the following case, the line of gray text shows that robots.txt is taking effect; only the site's address has been indexed: