Sitemap format details

Source: Internet
Author: User
Overview

The Sitemaps protocol lets you tell search engines which URLs on your website are available for crawling. In its simplest form, a Sitemap that uses the Sitemaps protocol is an XML file that lists the URLs of a site. The protocol is highly scalable, so it can accommodate sites of any size. It also lets webmasters include additional information about each URL (when it was last updated, how often it changes, and how important it is relative to other URLs on the site) so that search engines can crawl the site more intelligently.

Sitemaps are particularly useful when users cannot reach all areas of a website through a browsable interface. (Generally, this is the case when users cannot reach certain pages or areas of a site by following links.) For example, any site where some pages can only be reached through a search form will benefit from creating a Sitemap and submitting it to search engines.

This document describes the format of Sitemaps files and explains where to post them so that search engines can retrieve them.

Note that the Sitemaps protocol supplements, rather than replaces, the crawl-based mechanisms that search engines already use to discover URLs. By submitting a Sitemap (or multiple Sitemaps) to a search engine, you help that engine crawl your site more thoroughly.

Using this protocol does not guarantee that your web pages will be included in search indexes. (Note that using this protocol will not affect how Google ranks your pages.)

Sitemaps 0.84 is offered under the terms of the Attribution-ShareAlike Creative Commons License.

XML Sitemaps format

The Sitemaps protocol format consists of XML tags. All data values in a Sitemap must be entity-escaped. The file itself must be UTF-8 encoded.

The following example shows a Sitemap that contains a single URL and uses all the optional tags. The optional tags are <lastmod>, <changefreq>, and <priority>.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.google.com/schemas/sitemap/0.84">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2005-01-01</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>

A Sitemap should:

  • Begin with an opening <urlset> tag and end with a closing </urlset> tag.
  • Include a <url> entry for each URL, as a parent tag.
  • Include a <loc> child entry for each <url> parent tag.
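
The three rules above can be sketched in Python with the standard library's xml.etree.ElementTree module. This is a minimal sketch, not a reference implementation: the function name build_sitemap and the dictionary layout of each entry are assumptions made for illustration, and the namespace string is the one used in the example in this document.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the Sitemaps 0.84 example in this document.
SITEMAP_NS = "http://www.google.com/schemas/sitemap/0.84"


def build_sitemap(entries):
    """Return Sitemap XML as UTF-8 bytes.

    `entries` is a list of dicts (a hypothetical layout) with a required
    "loc" key and optional "lastmod", "changefreq", and "priority" keys.
    """
    # Writing xmlns as a plain attribute is a shortcut that keeps the
    # output free of namespace prefixes.
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for entry in entries:
        url = ET.SubElement(urlset, "url")          # one <url> per URL
        ET.SubElement(url, "loc").text = entry["loc"]  # required child
        for tag in ("lastmod", "changefreq", "priority"):
            if tag in entry:
                ET.SubElement(url, tag).text = str(entry[tag])
    # ElementTree entity-escapes text values when serializing.
    return ET.tostring(urlset, encoding="UTF-8", xml_declaration=True)
```

Because the serializer escapes text values itself, URLs should be passed in unescaped form; escaping them beforehand would produce double-escaped output.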

XML tag Definition

The available XML tags are described below.

<urlset>
Required. Encapsulates the file and references the current protocol standard.
<url>
Required. Parent tag for each URL entry. The remaining tags are children of this tag.
<loc>
Required. The URL of the page. The URL must begin with the protocol (for example, http) and end with a trailing slash if your web server requires one. The value must be less than 2048 characters long.
<lastmod>
Optional. The date the file was last modified. The date should be in W3C Datetime format. This format allows you to omit the time portion, if desired, and use YYYY-MM-DD.
<changefreq>
Optional

How frequently the page is likely to change. This value gives search engines general information and may not correlate exactly with how often they crawl the page. Valid values:

  • always
  • hourly
  • daily
  • weekly
  • monthly
  • yearly
  • never

The value "always" should be used to describe documents that change each time they are accessed. The value "never" should be used to describe archived URLs.

Note that the value of this tag is considered a hint, not a command. Although search engine crawlers consider this information when making decisions, they may crawl pages marked "hourly" less frequently than once an hour, and they may crawl pages marked "yearly" more frequently than once a year. Crawlers may also periodically crawl pages marked "never" so that they can handle unexpected changes to those pages.

<priority>
Optional

The priority of this URL relative to other URLs on your site. Valid values range from 0.0 to 1.0. This value has no effect on how your pages compare with pages on other sites. It only tells search engines which of your pages you consider most important, so that they can order the crawl of your pages in the way you prefer.

The default priority of a page is 0.5.

Note that the priority you assign to a page has no effect on the ranking of your URLs in a search engine's result pages. Search engines use this information when selecting between URLs on the same site, so you can use this tag to increase the likelihood that your more important pages are present in a search index.

Also note that assigning a high priority to all the URLs on your site will not help you. Because the priority is relative, it is only used to select between URLs on your own site. The priority of your pages is not compared with the priority of pages on other sites.
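
The constraints on the two hint tags can be checked before a Sitemap is written. The following is a minimal sketch, assuming a hypothetical helper named check_hints; the valid changefreq values and the priority range are the ones listed above.

```python
# Valid <changefreq> values, lowercase as they appear in Sitemap files.
VALID_CHANGEFREQ = {"always", "hourly", "daily", "weekly",
                    "monthly", "yearly", "never"}
DEFAULT_PRIORITY = 0.5  # used by search engines when <priority> is omitted


def check_hints(changefreq=None, priority=None):
    """Return a list of problems found in the optional hint values.

    An empty list means both values (when given) are acceptable.
    `priority` is expected to be a number or a numeric string.
    """
    problems = []
    if changefreq is not None and changefreq not in VALID_CHANGEFREQ:
        problems.append("invalid changefreq: %r" % (changefreq,))
    if priority is not None and not 0.0 <= float(priority) <= 1.0:
        problems.append("priority out of range: %r" % (priority,))
    return problems
```

For example, check_hints("monthly", 0.8) returns an empty list, while check_hints("Monthly") reports a problem because the comparison is case-sensitive.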

Entity escape

Your Sitemaps file must be UTF-8 encoded (you can generally do this when you save the file). As with all XML files, any data values (including URLs) must use entity escape codes for the characters listed in the following table.

Character        Symbol   Escape code
Ampersand        &        &amp;
Single quote     '        &apos;
Double quote     "        &quot;
Greater than     >        &gt;
Less than        <        &lt;

In addition, all URLs (including the URL of your Sitemap) must be URL-escaped and encoded so that the web server on which they are located can read them. However, if you use any sort of script, tool, or log file to generate your URLs (anything other than typing them in by hand), this is usually already done for you. If you submit your Sitemap and receive an error saying that Google cannot find some of your URLs, check that your URLs follow the RFC-3986 standard for URIs, the RFC-3987 standard for IRIs, and the XML standard.

The following is an example of a URL that uses a non-ASCII character (ü) as well as a character that requires entity escaping (&):

http://www.example.com/ümlat.html&q=name

The following is the same URL, ISO-8859-1 encoded (for hosting on a server that uses that encoding) and URL-escaped:

http://www.example.com/%FCmlat.html&q=name

The following is the same URL, UTF-8 encoded (for hosting on a server that uses that encoding) and URL-escaped:

http://www.example.com/%C3%BCmlat.html&q=name

The following is the same URL, URL-escaped and also entity-escaped:

http://www.example.com/%C3%BCmlat.html&amp;q=name
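
The two escaping steps can be reproduced with Python's standard library: urllib.parse.quote for URL escaping and xml.sax.saxutils.escape for entity escaping. This is a sketch under one assumption: the set of characters left untouched by quote (the SAFE constant) was chosen here so that the delimiters in the example URL survive, and other URLs may need a different set.

```python
from urllib.parse import quote
from xml.sax.saxutils import escape

raw_url = "http://www.example.com/ümlat.html&q=name"

# Characters that quote() leaves as-is. This set is an assumption chosen
# so that the URL delimiters in the example are not percent-encoded.
SAFE = ":/?&="

# Step 1: URL-escape, using the encoding of the hosting server.
utf8_escaped = quote(raw_url, safe=SAFE, encoding="utf-8")
latin1_escaped = quote(raw_url, safe=SAFE, encoding="iso-8859-1")

# Step 2: entity-escape the URL-escaped value for use inside XML.
entity_escaped = escape(utf8_escaped)

print(latin1_escaped)
print(utf8_escaped)
print(entity_escaped)
```

The order matters: entity escaping comes last, so that the ampersand introduced by the escape code is not itself percent-encoded.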

XML Sitemaps example

The following example shows a Sitemap in XML format. The Sitemap in the example contains a small number of URLs, each identified by a <loc> XML tag. In this example, a different set of optional parameters is provided for each URL.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.google.com/schemas/sitemap/0.84">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2005-01-01</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>http://www.example.com/catalog?item=12&amp;desc=vacation_hawaii</loc>
    <changefreq>weekly</changefreq>
  </url>
  <url>
    <loc>http://www.example.com/catalog?item=73&amp;desc=vacation_new_zealand</loc>
    <lastmod>2004-12-23</lastmod>
    <changefreq>weekly</changefreq>
  </url>
  <url>
    <loc>http://www.example.com/catalog?item=74&amp;desc=vacation_newfoundland</loc>
    <lastmod>2004-12-23T18:00:15+00:00</lastmod>
    <priority>0.3</priority>
  </url>
  <url>
    <loc>http://www.example.com/catalog?item=83&amp;desc=vacation_usa</loc>
    <lastmod>2004-11-23</lastmod>
  </url>
</urlset>

You can compress your Sitemaps files with gzip. Compression reduces bandwidth requirements. Note that the uncompressed Sitemaps file must not exceed 10 MB.
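
Since the size limit applies to the uncompressed file, the check belongs before compression. A minimal sketch using Python's gzip module follows; the function name compress_sitemap is an assumption, and the byte count is 10 MB expressed as 10 × 1024 × 1024.

```python
import gzip

MAX_UNCOMPRESSED_BYTES = 10 * 1024 * 1024  # 10 MB = 10,485,760 bytes


def compress_sitemap(xml_bytes):
    """Gzip a Sitemap, refusing input that exceeds the uncompressed limit."""
    if len(xml_bytes) > MAX_UNCOMPRESSED_BYTES:
        raise ValueError("uncompressed Sitemap exceeds 10 MB")
    return gzip.compress(xml_bytes)
```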

Using Sitemaps index files (to group multiple Sitemaps files)

You can provide multiple Sitemaps files, but each Sitemaps file that you provide must contain no more than 50,000 URLs and must be no larger than 10 MB (10,485,760 bytes) when uncompressed. These limits help ensure that your web server is not overloaded by serving very large files.

To list more than 50,000 URLs, you must create multiple Sitemaps files. If you anticipate that your Sitemap will grow beyond 50,000 URLs or 10 MB, consider creating multiple Sitemaps files. If you do provide multiple Sitemaps, you can list them in a Sitemaps index file. A Sitemaps index file can list no more than 1,000 Sitemaps.
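
Splitting a long URL list to respect these limits can be sketched as follows. The helper name split_urls is an assumption for illustration. Only the URL count is checked here; the 10 MB size limit would have to be checked separately on the generated files.

```python
MAX_URLS_PER_SITEMAP = 50000      # limit per Sitemaps file
MAX_SITEMAPS_PER_INDEX = 1000     # limit per Sitemaps index file


def split_urls(urls, chunk_size=MAX_URLS_PER_SITEMAP):
    """Split a list of URLs into chunks that each fit in one Sitemaps file."""
    chunks = [urls[i:i + chunk_size] for i in range(0, len(urls), chunk_size)]
    if len(chunks) > MAX_SITEMAPS_PER_INDEX:
        raise ValueError("too many Sitemaps for a single index file")
    return chunks
```

Each resulting chunk would be written to its own Sitemaps file, and the files would then be listed in one Sitemaps index file.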

The XML format of a Sitemaps index file is very similar to that of a Sitemaps file. The Sitemaps index file uses the following XML tags:

  • loc
  • lastmod
  • sitemap
  • sitemapindex

Note: A Sitemaps index file can only specify Sitemaps that are located on the same site as the index file. For example, http://www.yoursite.com/sitemap_index.xml can include Sitemaps on http://www.yoursite.com but not on http://www.example.com or http://yourhost.yoursite.com. As with Sitemaps, your Sitemaps index file must be UTF-8 encoded.

XML Sitemaps index example

The following example shows a Sitemaps index in XML format. The Sitemaps index lists two Sitemaps:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.google.com/schemas/sitemap/0.84">
  <sitemap>
    <loc>http://www.example.com/sitemap1.xml.gz</loc>
    <lastmod>2004-10-01T18:23:17+00:00</lastmod>
  </sitemap>
  <sitemap>
    <loc>http://www.example.com/sitemap2.xml.gz</loc>
    <lastmod>2005-01-01</lastmod>
  </sitemap>
</sitemapindex>

Note: As with all values in XML files, Sitemap URLs must be entity-escaped.

Sitemaps index XML tag Definition

  • The <loc> tag is required and identifies the location of the Sitemap.

  • The <lastmod> tag is optional and indicates when the corresponding Sitemap file was modified. It does not correspond to the time that any of the pages listed in that Sitemap were changed. The value of the lastmod tag should be in W3C Datetime format.

    By providing the last-modified timestamp, you enable search engine crawlers to retrieve only a subset of the Sitemaps in the index. That is, a crawler may retrieve only the Sitemaps that were modified after a certain date. This incremental Sitemap fetching mechanism allows new URLs on very large sites to be discovered quickly.

  • The <sitemap> tag encapsulates information about an individual Sitemap.

  • The <sitemapindex> tag encapsulates information about all the Sitemaps in the file.
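
A Sitemaps index file built from these four tags can be generated with xml.etree.ElementTree. This is a minimal sketch: the function name build_sitemap_index and the (loc, lastmod) pair layout are assumptions made for illustration, and the namespace is the one used in the index example in this document.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the Sitemaps 0.84 index example in this document.
SITEMAP_NS = "http://www.google.com/schemas/sitemap/0.84"


def build_sitemap_index(sitemaps):
    """Return Sitemaps index XML as UTF-8 bytes.

    `sitemaps` is a list of (loc, lastmod) pairs (a hypothetical layout);
    lastmod is a W3C Datetime string, or None to omit the optional tag.
    """
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for loc, lastmod in sitemaps:
        node = ET.SubElement(index, "sitemap")   # one <sitemap> per file
        ET.SubElement(node, "loc").text = loc    # required
        if lastmod is not None:
            ET.SubElement(node, "lastmod").text = lastmod  # optional
    return ET.tostring(index, encoding="UTF-8", xml_declaration=True)
```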

Location of the Sitemaps File

The location of a Sitemaps file determines the set of URLs that can be included in that Sitemap. A Sitemaps file located at http://example.com/catalog/sitemap.gz can include any URL starting with http://example.com/catalog/ but cannot include URLs starting with http://example.com/images/.

If you have permission to change http://example.org/path/sitemap.gz, it is assumed that you also have permission to provide information for URLs with the prefix http://example.org/path/. Examples of URLs considered valid in http://example.com/catalog/sitemap.gz include:

http://example.com/catalog/show?item=23
http://example.com/catalog/show?item=233&user=3453

URLs not considered valid in http://example.com/catalog/sitemap.gz include:

http://example.com/image/show?item=23
http://example.com/image/show?item=233&user=3453
https://example.com/catalog/page1.html

URLs that are not considered valid are dropped from further consideration. It is strongly recommended that you place your Sitemap in the root directory of your web server. For example, if your web server is at example.com, your Sitemaps index file should be located at http://example.com/sitemap.gz. In some cases, you may need to produce different Sitemaps for different paths, for example when security permissions in your organization restrict write access to particular directories.
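
The location rule can be approximated with a prefix check on the parsed URLs. The following is a sketch only: the function name allowed_in_sitemap is an assumption, and the comparison is a plain string match on scheme, host, and directory, which is one reasonable reading of the rule rather than a statement of how any search engine implements it.

```python
from urllib.parse import urlsplit


def allowed_in_sitemap(sitemap_url, page_url):
    """Return True if page_url may be listed in the Sitemap at sitemap_url.

    The page must share the Sitemap's scheme and host, and its path must
    start with the directory that contains the Sitemaps file.
    """
    sitemap, page = urlsplit(sitemap_url), urlsplit(page_url)
    # "/catalog/sitemap.gz" -> "/catalog/"; "/sitemap.gz" -> "/"
    directory = sitemap.path.rsplit("/", 1)[0] + "/"
    same_origin = (sitemap.scheme, sitemap.netloc) == (page.scheme, page.netloc)
    return same_origin and page.path.startswith(directory)
```

Applied to the examples above, the two /catalog/ URLs pass, the /image/ URLs fail on the path, and the https URL fails on the scheme.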

Validating your Sitemaps

Google uses an XML schema to define the elements and attributes that can appear in your Sitemaps file. You can download this schema from the following links:

For Sitemaps: http://www.google.com/schemas/sitemap/0.84/sitemap.xsd
For Sitemaps index files: http://www.google.com/schemas/sitemap/0.84/siteindex.xsd

A number of tools are available to help you validate the structure of your Sitemap against this schema. You can find a list of XML-related tools at each of the following locations:

http://www.w3.org/XML/Schema#Tools
http://www.xml.com/pub/a/2000/12/13/schematools.html

To validate your Sitemaps or Sitemaps index file against a schema, the XML file needs additional headers. If you use the Sitemap Generator, these headers are already included. If you use a different tool to create your Sitemap, the headers in your XML file should look like the following examples.

Sitemaps:

<?xml version='1.0' encoding='UTF-8'?>
<urlset xmlns="http://www.google.com/schemas/sitemap/0.84"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://www.google.com/schemas/sitemap/0.84
                            http://www.google.com/schemas/sitemap/0.84/sitemap.xsd">
  <url>
    ...
  </url>
</urlset>

Sitemaps index file:

<?xml version='1.0' encoding='UTF-8'?>
<sitemapindex xmlns="http://www.google.com/schemas/sitemap/0.84"
              xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
              xsi:schemaLocation="http://www.google.com/schemas/sitemap/0.84
                                  http://www.google.com/schemas/sitemap/0.84/siteindex.xsd">
  <sitemap>
    ...
  </sitemap>
</sitemapindex>
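
Full validation requires a schema-aware tool such as those listed above. As a lighter first pass, the following sketch checks only well-formedness and the basic structural rules (a <urlset> root and a <loc> in every <url>) using Python's standard library. It is not a substitute for schema validation, and the function name basic_sitemap_problems is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace of the Sitemaps 0.84 schema, in ElementTree's {uri}tag notation.
NS = "{http://www.google.com/schemas/sitemap/0.84}"


def basic_sitemap_problems(xml_bytes):
    """Return a list of structural problems; an empty list means none found."""
    try:
        root = ET.fromstring(xml_bytes)
    except ET.ParseError as exc:
        return ["not well-formed: %s" % exc]
    problems = []
    if root.tag != NS + "urlset":
        problems.append("root element is not <urlset> in the 0.84 namespace")
    for number, url in enumerate(root.findall(NS + "url"), start=1):
        loc = url.find(NS + "loc")
        if loc is None or not (loc.text or "").strip():
            problems.append("<url> entry %d has no <loc>" % number)
    return problems
```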

 

FAQs

Problem: How do I represent URLs in a Sitemap?

As with all XML files, any data values (including URLs) must use entity escape codes for the following characters: ampersand (&), single quote ('), double quote ("), less than (<), and greater than (>). You should also make sure that all URLs follow the RFC-3986 standard for URIs, the RFC-3987 standard for IRIs, and the XML standard. If you use a script to generate your URLs, you can generally URL-escape them as part of that script. You will still need to entity-escape them. For example, the following Python session entity-escapes http://www.example.com/view?widget=3&count>2

$ python
Python 2.2.2 (#1, Feb 24 2003, 19:13:11)
>>> import xml.sax.saxutils
>>> xml.sax.saxutils.escape("http://www.example.com/view?widget=3&count>2")

The URL obtained from the preceding example is:

http://www.example.com/view?widget=3&amp;count&gt;2

Problem: Does it matter which character encoding I use to generate my Sitemaps files?

Yes. Your Sitemaps files must be UTF-8 encoded.

Problem: How do I specify the time?

Use W3C Datetime encoding for the lastmod timestamp and all other dates and times in this protocol. For example, 2004-09-22T14:12:14+00:00.

This encoding allows you to omit the time portion of the ISO 8601 format; for example, 2004-09-22 is also valid. However, if your site changes frequently, you are encouraged to include the time portion so that crawlers have more complete information about your site.
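
Both forms can be produced with Python's datetime module, whose isoformat output for a timezone-aware value matches the timestamp shape used in this document. A short sketch follows; the variable names are assumptions for illustration.

```python
from datetime import datetime, timezone

# A timezone-aware moment; the offset is required for the full form.
moment = datetime(2004, 9, 22, 14, 12, 14, tzinfo=timezone.utc)

full_timestamp = moment.isoformat()      # date, time, and UTC offset
date_only = moment.date().isoformat()    # time portion omitted

print(full_timestamp)
print(date_only)
```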

Problem: How do I calculate the lastmod date?

For static files, this is the actual file update date. You can use the UNIX date command to obtain this date:

$ date --iso-8601=seconds -u -r /home/foo/www/bar.html
>> 2004-10-26T08:56:39+00:00

For many dynamic URLs, you may be able to compute a lastmod date easily, based on when the underlying data was changed or on an approximation derived from periodic updates (if applicable). Even an approximate date or timestamp can help crawlers avoid crawling URLs that have not changed. This reduces the bandwidth and CPU requirements of your web servers.
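
For static files, the same value can be computed in Python from the file's modification time. This is a minimal sketch assuming a hypothetical helper named lastmod_for_file; it reports the time in UTC, like the date command above.

```python
import os
from datetime import datetime, timezone


def lastmod_for_file(path):
    """Return the file's modification time as a W3C Datetime string in UTC."""
    mtime = os.path.getmtime(path)  # seconds since the epoch
    stamp = datetime.fromtimestamp(mtime, tz=timezone.utc)
    # Drop fractional seconds so the output has whole-second precision.
    return stamp.replace(microsecond=0).isoformat()
```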

Problem: Where do I place my Sitemap?

It is strongly recommended that you place your Sitemap in the root directory of your web server, that is, at http://example.com/sitemap.xml.gz.

In some cases, you may want to produce different Sitemaps for different paths on your site, for example when security permissions in your organization restrict write access to particular directories.

It is assumed that if you have permission to upload http://example.com/path/sitemap.xml.gz, you also have permission to report metadata under http://example.com/path/.

Problem: How big can my Sitemap be?

A Sitemap must be no larger than 10 MB (10,485,760 bytes) when uncompressed and can contain up to 50,000 URLs. This means that if your site contains more than 50,000 URLs or your Sitemap is larger than 10 MB, you must create multiple Sitemaps files and use a Sitemaps index file. You should also use a Sitemaps index file if your site is currently small but you plan to grow beyond 50,000 URLs or a file size of 10 MB.

Problem: My site has tens of millions of URLs. Can I somehow submit only those that have changed recently?

You can list the URLs that change frequently in a small number of Sitemaps, and then use the lastmod tag in your Sitemaps index file to identify those Sitemaps files. Search engines can then incrementally crawl only the Sitemaps that have changed.
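
From the consumer's side, incremental fetching amounts to reading the index and keeping only the entries whose lastmod is later than the previous visit. The sketch below illustrates that idea and is not a description of how any particular search engine works. The function names are assumptions, date-only values are treated as midnight UTC (also an assumption), and datetime.fromisoformat is used on the understanding that it accepts the two timestamp shapes shown in this document.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Namespace of the Sitemaps 0.84 schema, in ElementTree's {uri}tag notation.
NS = "{http://www.google.com/schemas/sitemap/0.84}"


def parse_w3c_datetime(text):
    """Parse a date-only or full W3C Datetime value as used in this document."""
    stamp = datetime.fromisoformat(text.strip())
    if stamp.tzinfo is None:
        # Date-only values carry no offset; treat them as midnight UTC.
        stamp = stamp.replace(tzinfo=timezone.utc)
    return stamp


def sitemaps_modified_since(index_xml, cutoff):
    """Return the <loc> of each listed Sitemap modified after `cutoff`.

    Entries without <lastmod> are always returned, since their age is unknown.
    `cutoff` must be a timezone-aware datetime.
    """
    changed = []
    for node in ET.fromstring(index_xml).findall(NS + "sitemap"):
        loc = node.findtext(NS + "loc")
        lastmod = node.findtext(NS + "lastmod")
        if lastmod is None or parse_w3c_datetime(lastmod) > cutoff:
            changed.append(loc)
    return changed
```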

Problem: What do I do after I create my Sitemap?

After creating your Sitemap, you must tell search engines where it is located. Once notified, a search engine can retrieve your Sitemap and make its URLs available to its crawlers.

Problem: Do URLs in the Sitemap need to be completely specified?

Yes. You need to include the protocol (for example, http). You also need to include a trailing slash if your web server requires one. For example, http://www.google.com/ is a valid URL for a Sitemap, whereas www.google.com is not.
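
A quick way to catch incompletely specified URLs before they reach a Sitemap is to parse them and confirm that a protocol and a host are present. This is a sketch: the function name is_fully_specified is an assumption, and restricting the protocol to http and https is also an assumption made for illustration.

```python
from urllib.parse import urlsplit


def is_fully_specified(url):
    """Return True if the URL includes both a protocol and a host name."""
    parts = urlsplit(url)
    # Without a leading "http://", urlsplit treats the whole string as a
    # path, so both scheme and netloc come back empty.
    return parts.scheme in ("http", "https") and bool(parts.netloc)
```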

Problem: My site has both "http" and "https" versions of its URLs. Do I need to list both?

No. Please list only one version of each URL in your Sitemaps. Including multiple versions of URLs may result in incomplete crawling of your site.

Problem: The URLs on my site contain session IDs. Do I need to remove them?

Yes. Including session IDs in URLs may result in incomplete and redundant crawling of your site.
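
When session IDs are carried as query parameters, they can be removed while the Sitemap is generated. The following is a sketch under that assumption. The parameter names in SESSION_PARAMS are hypothetical examples and must be replaced with whatever names your site actually uses; session IDs embedded in the path are not handled.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical session parameter names; adjust to match your own site.
SESSION_PARAMS = {"sessionid", "sid", "phpsessid"}


def strip_session_id(url):
    """Return the URL with any session query parameters removed."""
    parts = urlsplit(url)
    kept = [(key, value)
            for key, value in parse_qsl(parts.query, keep_blank_values=True)
            if key.lower() not in SESSION_PARAMS]
    # urlencode re-encodes the remaining parameters in their original order.
    return urlunsplit(parts._replace(query=urlencode(kept)))
```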

Problem: Does the position of a URL in a Sitemap influence its use?

No. The position of a URL in a Sitemap does not affect how search engines use or regard that URL.

Problem: Some of the pages on our site use frames. Should we include the frameset URLs or the URLs of the frame contents?

Please include both types of URLs.

Problem: Can I compress my Sitemaps with zip, or do I need to use gzip?

Please use gzip to compress your Sitemaps.

Problem: Will the "priority" hint in the XML Sitemap change the ranking of my pages in search results?

No. The "priority" hint in your Sitemap only indicates the importance of a particular URL relative to other URLs on your own site.

Problem: Is there an XML schema that I can validate my XML Sitemap against?

An XML schema for Sitemaps files is available at http://www.google.com/schemas/sitemap/0.84/sitemap.xsd, and a schema for Sitemaps index files is available at http://www.google.com/schemas/sitemap/0.84/siteindex.xsd.
