Writing original content every day and publishing it to your own website: have you thought about what the purpose of this is? It is, of course, to do well in search engines. We all know that search engines value original content highly and index it quickly; conversely, if your site is filled with large amounts of copied content, I can only tell you, with regret, that the site's prospects are slim. However, this also reminds us to make active use of...
(how to view a website's robots file should need no explanation); once you see this, you will understand! The reason /pages is highlighted in the site's source code above is so that everyone can compare it with the screenshot of its robots file, which contains Disallow: /pages. The links in the source code point to files inside that pages folder; the friendship links are placed precisely in those files, and the...
1. robots is a file that must be placed in the root directory of the website; it tells spiders which directories can be crawled and which files cannot. Its full name is "robots.txt".
2. How to Write robots:
① Prohibit all search engine spiders from crawling certain directories:
User-agent: *
Disallow: /directory1/
Disallow: /directory2/
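To see how a compliant spider would actually apply rules like these, here is a minimal sketch using Python's standard urllib.robotparser; the directory names are just the placeholders from the example above, not paths from any real site.

from urllib.robotparser import RobotFileParser

# Hypothetical rules matching the example above.
rules = """\
User-agent: *
Disallow: /directory1/
Disallow: /directory2/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# A compliant crawler asks before fetching each URL.
print(parser.can_fetch("*", "/directory1/page.html"))  # False: blocked
print(parser.can_fetch("*", "/public/page.html"))      # True: allowed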
choose e-books and then use software to export their text; such articles are mostly original content, so this is a good tip.
3. File specification
I once discussed a robots question with a fellow webmaster:
Disallow: /kaifengseo/
Disallow: /kaifengseo
What's the difference between these two?
There certainly is a difference. The first line, with the trailing slash, stops spiders from crawling everything under the kaifengseo folder; the second line, without it, blocks the folder itself and also any URL whose path merely begins with /kaifengseo, such as /kaifengseo.html.
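To convince yourself of this, you can feed both rules to Python's standard urllib.robotparser, which uses the same prefix matching; the kaifengseo paths below are simply the ones from the question.

from urllib.robotparser import RobotFileParser

def blocked(rule, path):
    # Returns True if the given Disallow rule blocks the path for all user agents.
    parser = RobotFileParser()
    parser.parse(["User-agent: *", rule])
    return not parser.can_fetch("*", path)

# With the trailing slash, only the contents of the folder are blocked.
print(blocked("Disallow: /kaifengseo/", "/kaifengseo/page.html"))  # True
print(blocked("Disallow: /kaifengseo/", "/kaifengseo.html"))       # False

# Without the slash, any path that merely starts with /kaifengseo is blocked.
print(blocked("Disallow: /kaifengseo", "/kaifengseo.html"))        # True
print(blocked("Disallow: /kaifengseo", "/kaifengseo/page.html"))   # True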
commands given with -e are executed after all commands in .wgetrc, and therefore override the same configuration items in .wgetrc. robots=off is used here because, by default, wget obeys the site's robots.txt; if robots.txt contains "User-agent: * Disallow: /", wget cannot mirror or download the directory, so the -e robots=off parameter is used to bypass it.
// The following two statements should, in actual use, be generated dynamically from a database, etc.
Response.Write("User-agent: * \n");
Response.Write("Disallow: /news/231.html \n");
// Reference the content of a static robots file, which stores the blocked entries that do not change.
Response.WriteFile("~/static-robots.txt");
Response.Flush();
}
public bool IsReusable { get { return false; } }
}
The generic handler im...
The pseudo-directive User-agent identifies a specific web spider/robot/crawler. For example, "User-agent: Googlebot" refers to the GoogleBot web crawler, while "User-agent: *" refers to all web spiders, robots, and crawlers:
User-agent: *
The role of the pseudo-directive Disallow is to specify which resources are off limits to spiders, robots, and crawlers. In the pr...
Hosts file.
Robots.txt must be placed in the root directory of the website and the file name must be in lowercase. The following is an example:
# Robots.txt file start
# Exclude files from all robots:
User-agent: *
Disallow: /security/
Disallow: /admin/
Disallow: /admin.htm
# End robots.txt File
Forbidden access statements
reasons.
// The following two statements should, in actual use, be generated dynamically from a database, etc.
Response.Write("User-agent: * \n");
Response.Write("Disallow: /news/231.html \n");
// Reference the content of a static robots file, which stores the blocked entries that do not change.
Response.WriteFile("~/static-robots.txt");
Response.Flush();
}
public bool IsReusable { get { return false; } }
}
The...
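The ASP.NET generic handler above is the article's own approach; purely for illustration, here is a rough equivalent sketched in Python with the standard http.server module. The port, the blocked_urls_from_db placeholder, and the static-robots.txt path are all assumptions, not something from the original article.

from http.server import BaseHTTPRequestHandler, HTTPServer

def blocked_urls_from_db():
    # Placeholder for a real database query returning the paths to block.
    return ["/news/231.html"]

class RobotsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/robots.txt":
            self.send_error(404)
            return
        lines = ["User-agent: *"]
        lines += ["Disallow: %s" % p for p in blocked_urls_from_db()]
        try:
            # The unchanging rules are kept in a static file, as in the handler above.
            with open("static-robots.txt", encoding="utf-8") as f:
                lines.append(f.read())
        except FileNotFoundError:
            pass
        body = "\n".join(lines).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("", 8000), RobotsHandler).serve_forever()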
If certain second-level domains are not worth keeping and no longer need to be accessible, either 301-redirect each of those domains to the primary domain, or bind the second-level domains to new directories or subdirectories and then use robots to block crawling of those directories. If a second-level domain still needs to be used, you will have to sacrifice some time and handle that domain on its own...
/**
 * Description: search crawler tool
 * @author you forever
 */
public class SearchCrawler implements Runnable {
    /* disallowListCache caches the URLs that robots do not allow to be searched. The robots protocol
     * places a robots.txt file in the root directory of a web site, specifying which pages of the site
     * are restricted from searching. The search program should skip these areas during its search.
     * The following is an example of robots.txt:
     * # robots.txt for http://somehost.com/
     * User-agent: *
     * Disall...
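The disallowListCache idea in this crawler can also be sketched in Python: cache one parsed robots.txt per host so it is fetched only once, and consult it before every download. This is only an illustrative sketch, not the article's crawler; somehost.com is just the host from the snippet's own example.

from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

class RobotsCache:
    # Caches the parsed robots.txt rules per host, as the crawler's disallowListCache does.

    def __init__(self, user_agent="*"):
        self.user_agent = user_agent
        self._cache = {}  # host -> RobotFileParser

    def allowed(self, url):
        host = urlparse(url).netloc
        parser = self._cache.get(host)
        if parser is None:
            parser = RobotFileParser("http://%s/robots.txt" % host)
            parser.read()  # fetch and parse robots.txt once per host
            self._cache[host] = parser
        return parser.can_fetch(self.user_agent, url)

# Usage: skip any URL that the host's robots.txt restricts.
# cache = RobotsCache()
# if cache.allowed("http://somehost.com/some/page.html"):
#     ...download and index the page...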
Create a symbolic link so that you can access the PHP documentation via http://localhost/phpdocs/.
28. Create a file called /usr/local/apache/htdocs/robots.txt so that search engines do not index the phplib, phpMyAdmin, and PHP files. The newly created file should include the following lines:
# robots.txt for (hostname)
User-agent: *
Disallow: /phpdocs/
Disallow: /php/
Disallow: /phpmyadmin/
ln -s \
/usr/local/src/php-3.0.11/doc/manual.html \
/usr/local/src/php-3.0.11/doc/index.html
Create a symbolic link so that most Web browsers will automatically display the PHP file start page correctly.
ln -s \
/usr/local/src/php-3.0.11/doc \
/usr/local/apache/htdocs/phpdocs
Create a symbolic link so that you can access the PHP document via http://localhost/phpdocs/.
28. Create a file called /usr/local/apache/htdocs/robots.txt to prevent search engines from indexing the phplib, phpMyAdmin, and PHP files; the newly created file should include the following lines:
# robots.txt for (hostname)
User-agent: *
Disallow: /phpdocs/
Disallow: /php/
Disallow: /phpmyadmin/
2.7 Installing phplib
Before you read Chapter 15, "Processing Concurrent Access"...
address; it is best to keep it within two levels, for example http://www.***.cn/rh/4.htm for "how to lose weight" has at most two directory levels. Two directory levels are enough for small and medium-sized websites; a portal site can consider a three-level directory structure.
Page title: whether the content is collected or written by yourself, never use a title the search engine already has. After writing a title, search for it on Baidu; if it already exists, keep adjusting it until a search shows it does not exist at all.
Article content...
robots.txt file to prevent search engines from reading important directories or files. For example:
Allow all robots to access all files:
User-agent: *
Disallow:
The following configuration does not allow search engines to index any directory:
User-agent: *
Disallow: /
You can also prevent search engines from indexing a specified directory. The following example prevents search engines from indexing the infosec directory and its subdirectories:
User-agent: *
Disallow: /infosec/
perform a local test or a third-level-domain test on the server. After the test is complete, add the content: at least fill the pages with content before resolving the domain name and going online, so that Baidu's spider can crawl plenty of content. This greatly increases the chance of being indexed. 3. robots.txt: due to a mistake in the robots.txt configuration, robots has been set to prohibit crawlers from crawling, or the...
contents of the file.
robots.txt format
The file contains one or more records separated by blank lines (terminated by CR, CR/NL, or NL). The format of each record is "<field>:<optional space><value><optional space>". You can use # for comments in this file. The records in this file typically start with one or more User-agent lines, followed by a number of Disallow and Allow lines, as detailed below.
User-agent: the value of this field describes the name of the search engine robot...
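To make the record format concrete, here is a small sketch using Python's standard urllib.robotparser, with two records separated by a blank line: one for a made-up robot called ExampleBot and one for all other robots. Note that urllib.robotparser applies rules in file order, so the more specific Allow line is placed before the broader Disallow line.

from urllib.robotparser import RobotFileParser

# The robot name and paths here are invented purely to illustrate the format.
robots_txt = """\
User-agent: ExampleBot
Allow: /private/docs/
Disallow: /private/

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

print(parser.can_fetch("ExampleBot", "/private/secret.html"))   # False: its record disallows /private/
print(parser.can_fetch("ExampleBot", "/private/docs/a.html"))   # True: the Allow line matches first
print(parser.can_fetch("OtherBot", "/anything.html"))           # False: falls under the * record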
Use the wget tool to crawl web pages and pictures (including CSS, JS, HTML, and image files):
wget -e robots=off -w 1 -xq -np -nH -pk -m -t 1 -P "$PATH" "$URL"
robots=off is used here because, by default, wget obeys the site's robots.txt; if robots.txt contains "User-agent: * Disallow: /", wget cannot mirror or download the directory. The -e option executes an additional .wgetrc-style command.
the Weaving Dream (DedeCMS) program. Sometimes we cannot find the dynamic links described above, because the webmaster may have applied pseudo-static rewriting, so those links appear static. In that case we can look at the site's robots file: the robots file of a DedeCMS site is generally the DedeCMS default. In addition, we can also check whether the site's member pages use links such as member/; some sites' member center is...