Designing road signs for Web robots on your home page

The Internet is getting cooler by the day, and the reputation of the World Wide Web stands at its zenith. Publishing corporate information on the Internet and conducting e-commerce has evolved from a novelty into everyday practice. As a webmaster, you may be familiar with HTML, JavaScript, Java, and ActiveX, but do you know what a Web robot is, or what Web robots have to do with the home page you are designing?

Wanderers on the Internet: the Web robot

Sometimes you will discover, to your surprise, that the content of your home page has been indexed by a search engine even though you have never had any contact with that site. This is the work of a Web robot. A Web robot is a program that can traverse the hypertext structure of a large number of Internet URLs, recursively retrieving the entire contents of each Web site. These programs are sometimes called spiders, Web wanderers, Web worms, or Web crawlers. Well-known search engine sites run dedicated Web robot programs to collect information, such as Lycos, WebCrawler, and AltaVista, as well as Chinese search engine sites such as Polaris, NetEase, and GoYoYo.

A Web robot is like an uninvited guest: whether you care or not, it will loyally carry out its owner's duty, trudging tirelessly through the space of the World Wide Web, and of course it will visit your home page, retrieve its content, and generate whatever record format it needs. Perhaps you are happy for the whole world to see parts of your home page, but there may be other content you do not want examined and indexed. Must you simply let robots run wild through your home page space, or can you direct and control their whereabouts? The answer is certainly yes. After reading this article, you will be able to act like a traffic officer, setting out signposts that tell Web robots how to retrieve your home page: what may be retrieved and what may not be accessed.

In fact, a Web robot can understand you

Do not think that Web robots roam around without organization or restraint. Many Web robot programs give Web site administrators and Web content authors two ways to limit a robot's whereabouts:

1. The Robots Exclusion Protocol

The administrator of a Web site can create a specially formatted file on the site to state which parts of the site robots may visit. The file is placed in the root directory of the site, i.e. http://.../robots.txt.

2. The Robots META tag

A Web page author can use a special HTML META tag to indicate whether a page may be indexed, analyzed, or have its links followed.

These methods work with most Web robots, but whether a given robot actually implements them is up to its developer, and they are not guaranteed to be effective against every robot. If you desperately need to protect your content, consider additional safeguards, such as adding passwords.

Using the Robots Exclusion Protocol

When a robot visits a Web site, say http://www.sti.net.cn/, it first checks for the file http://www.sti.net.cn/robots.txt. If the file exists, the robot parses it according to the following record format:

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~joe/


Based on these records, the robot decides whether it should retrieve the site's files. These records are meant for Web robots only; ordinary visitors will probably never see the file, so do not take it into your head to pad it with HTML-like statements or fake greetings such as "How do you do? Where are you from?"
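
Incidentally, a robot never has to guess where this file lives: for any page on a site there is exactly one well-known robots.txt URL at the site root. Here is a minimal sketch of deriving it with Python's standard library (the page path below is hypothetical, for illustration only):

from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url):
    # Every page on a host shares a single robots.txt at the site root.
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

# The page path is hypothetical, for illustration only.
print(robots_url("http://www.sti.net.cn/docs/page.html"))
# -> http://www.sti.net.cn/robots.txt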

There can be only one "/robots.txt" file on a site, and every letter of the file name must be lowercase. Each "Disallow" line in a robot record names one URL prefix that you do not want robots to access, and each URL must appear on a line of its own; statements such as "Disallow: /cgi-bin/ /tmp/" are wrong. A record also must not contain blank lines, because a blank line is the marker that separates one record from the next.

The User-agent line names the robot or other agent a record applies to. In a User-agent line, '*' has a special meaning: it stands for all robots.
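
You do not need to trace this matching by hand. The sketch below shows how a well-behaved robot might consult the records above, using Python's standard-library urllib.robotparser; the paths being checked are hypothetical:

from urllib import robotparser

# Sketch: how a well-behaved robot consults robots.txt before fetching.
rp = robotparser.RobotFileParser()
rp.set_url("http://www.sti.net.cn/robots.txt")  # the file at the site root
rp.read()                                       # fetch and parse the records

# can_fetch() applies the User-agent and Disallow records described above.
# The paths below are hypothetical, for illustration only.
for path in ("/index.html", "/cgi-bin/search", "/tmp/scratch.html", "/~joe/home.html"):
    allowed = rp.can_fetch("*", "http://www.sti.net.cn" + path)
    print(path, "is", "allowed" if allowed else "disallowed")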

Here are a few examples of robots.txt files:

Reject all robots on the entire server:
User-agent: *
Disallow: /

Allow all robots to access the entire site:
User-agent: *
Disallow:
Or create an empty "/robots.txt" file.

Allow all robots access to everything except parts of the server:
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /private/

To reject a specific robot:
User-agent: badbot
Disallow: /


Allow only one robot access:
User-agent: webcrawler
Disallow:

User-agent: *
Disallow: /
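
If you want to convince yourself that a set of records behaves as intended before publishing it, urllib.robotparser can also parse the lines in memory. This sketch checks the "only one robot" example above (the page URL is made up):

from urllib import robotparser

# Sketch: verify in memory that the records above admit only webcrawler.
rules = [
    "User-agent: webcrawler",
    "Disallow:",
    "",
    "User-agent: *",
    "Disallow: /",
]

rp = robotparser.RobotFileParser()
rp.parse(rules)

# The page URL is made up, for illustration only.
print(rp.can_fetch("webcrawler", "http://example.com/page.html"))  # True
print(rp.can_fetch("badbot", "http://example.com/page.html"))      # False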


Finally, here is the robots.txt from the http://www.w3.org/ site:
# For use by search.w3.org
User-agent: w3crobot/1
Disallow:

User-agent: *
Disallow: /member/      # These are restricted to the consortium only
Disallow: /team/        # These are restricted to the consortium only
Disallow: /tands/member # These are restricted to the consortium only
Disallow: /tands/team   # These are restricted to the consortium only
Disallow: /project
Disallow: /systems
Disallow: /web
Disallow: /team


Using the Robots META tag

The Robots META tag allows the author of an HTML page to indicate whether the page may be indexed and whether its links may be followed to find more files. At present only some robots implement this feature.

The format of the Robots META tag is:
<meta name= "ROBOTS" content= "NOINDEX, NOFOLLOW" >
Like any other meta tag, it should be placed in the head area of the HTML file:
<meta name= "Robots" content= "Noindex,nofollow" >
<meta name= "description" content= "This page ..." >
<title>...</title>
<body>
...

The directives in a Robots META tag are separated by commas; the available directives are [NO]INDEX and [NO]FOLLOW. The INDEX directive indicates whether an indexing robot may index the page; the FOLLOW directive indicates whether a robot may follow the links on the page. The defaults are INDEX and FOLLOW. For example:
<meta name= "Robots" content= "Index,follow" >
<meta name= "Robots" content= "Noindex,follow" >
<meta name= "Robots" content= "Index,nofollow" >
<meta name= "Robots" content= "Noindex,nofollow" >

A good Web site administrator takes robot management into account, putting robots to work for the site's pages without compromising the security of those pages.


