Designing Robot Signposts for Web Sites

Source: Internet
Author: User

The Internet keeps growing more popular, and the WWW is at the height of its fame. Publishing corporate information and conducting e-commerce on the Internet has gone from a novelty to the norm. As a webmaster, you may be familiar with HTML, JavaScript, Java, and ActiveX, but do you know what a Web robot is? Do you know what a Web robot has to do with the home page you are designing?




Wanderers on the Internet: Web robots




Sometimes you will find, for no apparent reason, that the content of your home page has been indexed by a search engine, even though you have never had any contact with it. This is the work of a Web robot. A Web robot is a program that traverses the hypertext structure of a large number of Internet URLs and recursively retrieves all the content of a Web site. These programs are sometimes called "spiders", "Web wanderers", "Web worms", or "Web crawlers". Some well-known search engine sites run dedicated Web robot programs to collect information, such as Lycos, WebCrawler, and AltaVista, as well as Chinese search engine sites such as Polaris, NetEase, and Goyoyo.




A Web robot is like an uninvited guest. Whether or not you mind, it stays loyal to its owner, tirelessly roaming the World Wide Web, and it will of course visit your home page, retrieve its content, and generate records in the format it needs. You may be happy for the world to know some of your home page content, but there may be other content you do not want seen or indexed. Must you simply let the robot run rampant through your home page space, or can you direct and control where it goes? The answer, of course, is that you can. Once you have read this article, you can act like a traffic officer and put up signposts that tell a Web robot how to retrieve your home page: which parts may be retrieved and which may not be accessed.




In fact, Web robots can understand you.




Do not assume that Web robots roam about unorganized and unrestrained. Many Web robot programs give Web site administrators and Web content authors two ways to limit where a robot goes:




1. The Robots Exclusion Protocol




The administrator of a Web site can create a specially formatted file on the site to indicate which parts of the site robots may access. The file is placed in the root directory of the site, i.e. http://.../robots.txt




2. The Robots META tag




A Web page author can use a special HTML META tag to indicate whether a Web page may be indexed or parsed for links.




These methods work with most Web robots, but whether they are implemented depends on each robot's developers, so they are not guaranteed to be effective for every robot. If you really need to protect your content, consider other protection methods such as password protection.




Using the Robots Exclusion Protocol




When a robot visits a Web site, such as http://www.sti.net.cn/, it first checks for the file http://www.sti.net.cn/robots.txt. If the file exists, the robot parses it according to the following record format:




User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~joe/




The robot uses these records to determine whether it should retrieve the site's files. The records are meant only for Web robots; ordinary visitors will probably never see this file, so do not be tempted to add HTML statements such as <img src=*> or empty greetings such as "How do you do? Where are you from?".
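The check described above can be sketched with Python's standard urllib.robotparser module. This is only an illustration, not how any particular robot is implemented: the robot name "ExampleBot" and the page paths are made up, and the record is fed in as a string rather than downloaded from the site.

```python
from urllib.robotparser import RobotFileParser

# The example record, as a robot would receive it from /robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~joe/
"""

SITE = "http://www.sti.net.cn"


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt text without making a network request."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# "ExampleBot" is a hypothetical robot name; the paths are illustrative.
for path in ("/index.html", "/cgi-bin/search", "/tmp/cache.html"):
    allowed = parser.can_fetch("ExampleBot", SITE + path)
    print(path, "allowed" if allowed else "disallowed")
```

A real robot would call set_url() and read() on the parser to download the file first; parse() is used here so the sketch runs offline.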




A site can have only one "/robots.txt" file, and its file name must be entirely lowercase. Each "Disallow" line in a record gives a URL prefix that you do not want robots to access. Each prefix must be on its own line; a line such as "Disallow: /cgi-bin/ /tmp/" is not allowed. A record also must not contain blank lines, because a blank line marks the boundary between records.




The User-agent line gives the name of the robot or other agent. In the User-agent line, '*' has a special meaning: all robots.
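To make the record rules concrete, here is a minimal hand-written sketch of how a robot might split a robots.txt file into records. The function name parse_records and the dictionary layout are assumptions made for this illustration; real robots may parse the file differently.

```python
def parse_records(text: str) -> list:
    """Split robots.txt text into records separated by blank lines.

    Each record is a dict with the keys "user-agent" and "disallow",
    each holding the list of values seen for that field.
    """
    records = []
    current = None
    for raw in text.splitlines():
        if not raw.strip():
            current = None  # a blank line closes the current record
            continue
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line or ":" not in line:
            continue  # comment-only or malformed line
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field not in ("user-agent", "disallow"):
            continue  # this sketch ignores any other field
        if current is None:
            current = {"user-agent": [], "disallow": []}
            records.append(current)
        current[field].append(value)
    return records


SAMPLE = """\
# two records, separated by one blank line
User-agent: webcrawler
Disallow:

User-agent: *
Disallow: /tmp/
Disallow: /private/
"""

for record in parse_records(SAMPLE):
    print(record)
```

Because a blank line always starts a new record in this sketch, a stray blank line between two Disallow lines would leave the second one in a record with no User-agent line, which is why records must not contain blank lines.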




Here are a few examples of robots.txt:




Exclude all robots from the entire server:




User-agent: *
Disallow: /




Allow all robots to access the entire site:
User-agent: *
Disallow:
Alternatively, create an empty "/robots.txt" file.




Exclude all robots from part of the server:
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /private/




Exclude a specific robot:
User-agent: badbot
Disallow: /




Allow only one robot to visit:
User-agent: webcrawler
Disallow:

User-agent: *
Disallow: /
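How a per-robot record takes precedence over the '*' record can be checked with Python's standard urllib.robotparser module. This is a hedged sketch: the agent strings "webcrawler/3.0" and "OtherBot/1.0" and the site URL are invented for the illustration, and other robots may match names differently.

```python
from urllib.robotparser import RobotFileParser

# One record for a named robot, one for everyone else,
# separated by a blank line.
ONLY_WEBCRAWLER = """\
User-agent: webcrawler
Disallow:

User-agent: *
Disallow: /
"""

rules = RobotFileParser()
rules.parse(ONLY_WEBCRAWLER.splitlines())

PAGE = "http://www.example.com/index.html"  # hypothetical URL

# The named record applies to webcrawler; every other robot falls
# back to the '*' record, which disallows the whole site.
for agent in ("webcrawler/3.0", "OtherBot/1.0"):
    verdict = "may fetch" if rules.can_fetch(agent, PAGE) else "must skip"
    print(agent, verdict, PAGE)
```

Swapping the two Disallow values gives the opposite policy: the named robot is excluded and all others are admitted.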




Finally, here is the robots.txt from the http://www.w3.org/ site:
# for use by search.w3.org
User-agent: w3crobot/1
Disallow:

User-agent: *
Disallow: /member/ # This are restricted to the
Disallow: /team/ # This are restricted to the consortium only
Disallow: /tands/member # This are restricted to the
Disallow: /tands/team # This are restricted to the consortium only
Disallow: /project
Disallow: /systems
Disallow: Self-Help
Disallow: /team




Using the Robots META tag method




The Robots META tag lets the author of an HTML page indicate whether the page may be indexed and whether it may be used to find more linked files. At present only some robots implement this feature.




The format of the Robots META tag is:




<meta name="robots" content="noindex, nofollow">
Like any other META tag, it should be placed in the HEAD section of the HTML file:
<html>
<head>
<meta name="robots" content="noindex,nofollow">
<meta name="description" content="This page ...">
<title>...</title>
</head>
<body>
...




The directives in the Robots META tag are separated by commas. The available directives are [NO]INDEX and [NO]FOLLOW. The INDEX directive indicates whether an indexing robot may index the page, and the FOLLOW directive indicates whether a robot may follow the links on the page. The defaults are INDEX and FOLLOW. For example:




<meta name="robots" content="index,follow">
<meta name="robots" content="noindex,follow">
<meta name="robots" content="index,nofollow">
<meta name="robots" content="noindex,nofollow">
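As a rough illustration of how a robot might read these directives, the sketch below uses Python's standard html.parser module. The class name RobotsMetaParser and the helper robots_directives are hypothetical names chosen for this example; actual robots may interpret the tag differently.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect INDEX/FOLLOW permissions from a Robots META tag."""

    def __init__(self):
        super().__init__()
        # Defaults when no directive is present: INDEX and FOLLOW.
        self.index = True
        self.follow = True

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() != "robots":
            return
        # Directives are comma separated and not case sensitive.
        for directive in attributes.get("content", "").split(","):
            directive = directive.strip().lower()
            if directive in ("index", "noindex"):
                self.index = directive == "index"
            elif directive in ("follow", "nofollow"):
                self.follow = directive == "follow"


def robots_directives(html: str) -> tuple:
    """Return (may_index, may_follow) for an HTML document."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.index, parser.follow


PAGE = """<html>
<head>
<meta name="robots" content="noindex,follow">
<title>...</title>
</head>
<body>...</body>
</html>"""

may_index, may_follow = robots_directives(PAGE)
print("index:", may_index, "follow:", may_follow)
```

A page with no Robots META tag keeps the defaults, so robots_directives returns (True, True) for it.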




A good Web site administrator should take robot management into account, so that robots serve the site's home pages without compromising the security of its pages.



