I never paid much attention to Google's sitemaps. They are certainly a convenient way for a search engine to be lazy, since it no longer has to work hard to crawl every page. But getting webmasters to actively submit their content takes a lot of pull; without it, the format can hardly become a standard. Creating a sitemap is also fairly troublesome, and it is not easy for the average webmaster to learn. In fact, Google's sitemaps have not been very popular, especially in China.
However, Google, Microsoft, and Yahoo recently agreed to adopt a unified sitemaps standard. Sitemaps, which used to be limited to Google's webmaster tools, will now be accepted by the other two search engines as well, and more search engines may adopt the protocol in the future. It seems webmasters will need sitemaps after all. Although Baidu dominates in China, the other three cannot be ignored, so we have to do the work that needs doing.
Google currently uses version 0.84 of the sitemap protocol, and the three companies will jointly adopt version 0.9. The changes should be small, so the difference can be ignored for now.
Summary of creating a sitemap:
1. Create a sitemap file:
http://www.google.com/support/webmasters/bin/answer.py?answer=34654&hl=zh_CN
A sitemap is an XML file. The format is very simple, and you can write one by hand as long as the syntax is correct. If the site has a lot of content, however, writing it manually is impractical, so we need an automatic sitemap generator. Google provides one:
http://www.google.com/webmasters/sitemaps/sitemap_generator
There are also many third-party generators, some of them even online, which sound attractive, but I am too lazy to try them one by one. I feel more comfortable using the official one.
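To show how small the format is, here is a minimal sketch that writes a sitemap by hand in Python instead of using the generator. It assumes the version 0.9 namespace published at sitemaps.org; the function name build_sitemap and the about.html page are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace of the 0.9 protocol (assumed here; version 0.84 used a different one).
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a sitemap document, as a string, listing the given URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        # ElementTree escapes characters such as "&" in the text for us.
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical pages on the example site.
    print(build_sitemap([
        "http://www.mysite.com/",
        "http://www.mysite.com/about.html",
    ]))
```

Only the required loc element is written here; optional elements such as lastmod are left out to keep the sketch short.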
First, write a configuration file, mysite_config.xml, containing:
<site
  base_url="http://www.mysite.com/"
  store_into="/www/Site1/root/sitemap.xml.gz"
  verbose="1"
>
Then run the generator script, which is written in Python. Running it from the command line is simple: python sitemap_gen.py --config=mysite_config.xml
This generates sitemap.xml.gz in the /www/Site1/root/ directory.
gzip -d sitemap.xml.gz
Decompressing the file produces sitemap.xml in the root directory of the site. After the generator has written the file, it also notifies Google that your sitemap has been updated.
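If gzip is not available on the machine, the decompression step can also be done from Python. This is only a sketch with a made-up helper name, gunzip. Unlike gzip -d, it leaves the .gz file in place.

```python
import gzip
import shutil


def gunzip(src, dst):
    """Decompress the gzip file `src` into `dst`, keeping `src` untouched."""
    with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)


if __name__ == "__main__":
    # Paths taken from the configuration above.
    gunzip("/www/Site1/root/sitemap.xml.gz", "/www/Site1/root/sitemap.xml")
```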
Now you can try opening http://www.mysite.com/sitemap.xml. Once the file has been created successfully, submit it to Google (https://www.google.com/webmasters/tools/), or put a link to it on the home page.
2. Notes on writing the configuration file:
First, a static website is very simple. Specify the directory path, and the generator is smart enough to traverse that directory:
<directory
  path="/var/www/docroot"
  url="http://www.example.com/"
  default_file="index.html"
/>
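The idea behind the directory node can be sketched in a few lines of Python: walk the local directory and map each file to a URL under the base address, with the default file standing for its directory. This illustrates the idea only and is not the generator's actual code. The function name urls_from_directory is made up, and the sketch does not percent-encode unusual characters in file names.

```python
import os


def urls_from_directory(path, url, default_file="index.html"):
    """Map every file under `path` to a URL under `url`."""
    base = url.rstrip("/") + "/"
    found = []
    for root, _dirs, files in os.walk(path):
        rel_dir = os.path.relpath(root, path)
        dir_parts = [] if rel_dir == "." else rel_dir.split(os.sep)
        for name in files:
            if name == default_file:
                # The default file is represented by its directory URL.
                suffix = "/".join(dir_parts)
                if suffix:
                    suffix += "/"
            else:
                suffix = "/".join(dir_parts + [name])
            found.append(base + suffix)
    return sorted(found)


if __name__ == "__main__":
    # Values taken from the directory node above.
    for u in urls_from_directory("/var/www/docroot", "http://www.example.com/"):
        print(u)
```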
If your site consists of dynamic pages, the generator cannot discover every address (such as http://yoursite.com?ArticleID=234), because the tool runs on the command line and traverses the directory locally; it does not access the server over HTTP. So in the configuration file you have to point it at the Apache access logs to obtain the dynamic URLs. (If some dynamic pages on the site have never been visited, does that mean they will not be added to the sitemap? I don't know.)
Find the following part:
<!-- ** MODIFY or DELETE **
  "accesslog" nodes tell the script to scan webserver log files to
  extract URLs on your site. Both Common Logfile Format (Apache's default
  logfile) and Extended Logfile Format (IIS's default logfile) can be read.
  Required attributes:
    path     - path to the file
  Optional attributes:
    encoding - encoding of the file if not US-ASCII
-->
<accesslog path="/etc/httpd/logs/mysite-access.log" encoding="UTF-8" />
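To make the log-scanning idea concrete, here is a rough sketch of pulling URLs out of Common Log Format lines. It is not the generator's own parser. The regular expression, the restriction to successful GET requests, and the name urls_from_access_log are all assumptions made for illustration.

```python
import re

# Matches the request and status fields of a Common Log Format line, e.g.
#   "GET /article?ArticleID=234 HTTP/1.0" 200
REQUEST_RE = re.compile(r'"GET (\S+) [^"]*" (\d{3})\b')


def urls_from_access_log(lines, base_url):
    """Collect the URLs of successful GET requests found in the log lines."""
    base = base_url.rstrip("/")
    urls = set()
    for line in lines:
        match = REQUEST_RE.search(line)
        if match and match.group(2) == "200":
            urls.add(base + match.group(1))
    return sorted(urls)


if __name__ == "__main__":
    # Log path taken from the accesslog node above.
    with open("/etc/httpd/logs/mysite-access.log", encoding="utf-8") as log:
        for u in urls_from_access_log(log, "http://yoursite.com"):
            print(u)
```

In this sketch, a page that never shows up in the log never reaches the result, and duplicate requests collapse into a single URL.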
You can also use filters to exclude pages that you do not want Google to index:
<filter action="drop" type="wildcard" pattern="*private*" />
<filter action="drop" type="regexp" pattern="/\.[^/]*" />
drop means the URL is left out of the sitemap. The first rule uses a wildcard: any URL containing the string "private" is excluded. The second rule uses a regular expression: hidden files and directories on *nix (names that start with a dot, such as .abc) are not listed.
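The effect of the two rules can be tried out with Python's standard fnmatch and re modules. This is a sketch of the matching logic only, handling just the drop action. Whether the generator matches the regular expression anywhere in the URL, as assumed here, is not confirmed by its documentation as quoted above, and is_dropped is a made-up name.

```python
import fnmatch
import re

# The two rules from the configuration: (action, type, pattern).
FILTERS = [
    ("drop", "wildcard", "*private*"),
    ("drop", "regexp", r"/\.[^/]*"),
]


def is_dropped(url, filters=FILTERS):
    """Return True if any drop rule matches the URL."""
    for action, kind, pattern in filters:
        if kind == "wildcard":
            matched = fnmatch.fnmatchcase(url, pattern)
        else:
            # Assumption: the pattern may match anywhere in the URL.
            matched = re.search(pattern, url) is not None
        if matched and action == "drop":
            return True
    return False


if __name__ == "__main__":
    for u in (
        "http://www.mysite.com/index.html",
        "http://www.mysite.com/private/notes.html",
        "http://www.mysite.com/.abc",
    ):
        print(u, "dropped" if is_dropped(u) else "kept")
```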