Note: Declare language information first in XHTML and HTML pages

Source: Internet
Author: User

When I started with XHTML 1.1, I did not know what to put in xml:lang. I wanted to mark my pages as Chinese, but should the value be en, zh-cn/zh-CN, gb2312/gbk/gb18030, or utf8? When I have a problem I usually search Google in Chinese first, but this time I could not find the answer. I almost believed it when I saw some authoritative websites using gb2312, but my experience of setting the language on Linux told me intuitively that this was wrong. So I narrowed my Google search to the W3C site and found "Tutorial: Using language information in XHTML, HTML and CSS (DRAFT)". After reading it I finally got past my misunderstanding, and I would like to share what I learned.

This is again a translation, but the article is long and contains much that we do not need, so this time I have translated only part of it, hoping that is enough to explain the problem clearly.

Declaring the language of documents and text

Why declare a language

Information about the language of a document is important for accessibility tools such as screen readers, which is reason enough to declare it from the outset. These programs need to know whether they can generate output from the text, or whether they need to switch to a different language mode.

Language markup is also useful for applying appropriate style changes. For example, you may need to change the font for different scripts, or generate different quotation marks depending on the language.

Some browsers use language information to pick suitable fonts for Simplified Chinese, Traditional Chinese, Japanese and Korean. In a page encoded in Unicode, these languages may share the same code point for an ideographic character, yet readers of each language may expect small differences in how the character is drawn. The following illustration shows the effect of changing only the language tag on text in Mozilla:
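
As a rough sketch of the quotation-mark case, the CSS :lang() selector can choose quote characters from the declared language. The page content, the title and the particular quote characters here are illustrative assumptions, and browser support for :lang() and the quotes property varies:

```html
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <title>Language-specific quotation marks</title>
  <style>
    /* Quote characters for q elements, selected by the language of the content */
    q:lang(en) { quotes: "“" "”"; }
    q:lang(fr) { quotes: "«" "»"; }
  </style>
</head>
<body>
  <!-- Inherits lang="en" from the html element -->
  <p>She said <q>hello</q>.</p>
  <!-- Overrides the language for this paragraph only -->
  <p lang="fr">Elle a dit <q>bonjour</q>.</p>
</body>
</html>
```

The style rules never name a specific paragraph; they depend only on the language information in the markup.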

[Figure: the same ideographic character drawn with different shapes under different language tags]

Language markup also lets you extract the elements in a given language with a script. For example, you can use the XSLT lang() function to extract the text in a given language from a file, or to apply language-specific styles during XSL-FO conversion.

In many cases, when you first develop content, you may not be aware of the importance of these applications. Language information is generally very easy to add as content is created, but it can be troublesome to retrofit later.

In addition, some applications of language tagging are still in early development or missing, but you should add language information to your content now so that you can reap the benefits as the technology matures.

Always declare a language for a document in the html tag

HTML documents should always declare the language of the document, which can be done by adding the lang attribute to the html tag. For example, a document in Canadian French is declared by setting lang to fr-CA on the html tag.
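
A minimal sketch of such a declaration follows. Only the lang="fr-CA" attribute comes from the text; the doctype, title and body content are placeholders assumed for illustration:

```html
<!DOCTYPE html>
<!-- lang on the root element sets the default language for the whole page -->
<html lang="fr-CA">
<head>
  <meta charset="utf-8">
  <title>Exemple</title>
</head>
<body>
  <p>Bonjour tout le monde.</p>
</body>
</html>
```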

Later, we'll talk more specifically about specifying values for language attributes.

When you serve XHTML as text/html, you should use both the lang attribute and the xml:lang attribute on the html element. The xml:lang attribute is the standard way to identify language information in XML. For XHTML 1.0 served as text/html, the previous example would therefore carry both attributes with the same value.

The xml:lang attribute is not useful when the file is handled as HTML, but it takes over from the lang attribute whenever the document is processed as XML, for example by a script or validator.

If you serve XHTML as XML (for example, with a MIME type such as application/xhtml+xml), or serve XHTML 1.1, you no longer need the lang attribute, because it belongs to HTML rather than XML. The xml:lang attribute alone is sufficient.
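
The following is one plausible form of that markup, assuming the XHTML 1.0 Strict doctype; the title and body text are invented placeholders. The point to notice is that lang and xml:lang carry the same value:

```html
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<!-- Served as text/html: lang is read by HTML user agents, xml:lang by XML tools -->
<html xmlns="http://www.w3.org/1999/xhtml" lang="fr-CA" xml:lang="fr-CA">
<head>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
  <title>Exemple</title>
</head>
<body>
  <p>Bonjour tout le monde.</p>
</body>
</html>
```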
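
For comparison, here is a sketch of the same kind of page as XHTML 1.1, assuming it is delivered as application/xhtml+xml. The content is a made-up placeholder; only xml:lang is used, both on the root element and on the inline element:

```html
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
  "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<!-- No lang attribute: xml:lang alone identifies the language -->
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="fr-CA">
<head>
  <title>Exemple</title>
</head>
<body>
  <p>Le mot anglais pour <em>chat</em> est <em xml:lang="en">cat</em>.</p>
</body>
</html>
```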

Always declare language changes in the text

Where text is in a different language from the main language of the content, you should indicate the language of that text. The method is the same as described in the section on declaring the language in the html tag:
<p>The French for <em>cat</em> is <em lang="fr">chat</em>.</p>

The lang attribute can be used on any HTML element other than applet, base, basefont, br, frame, frameset, iframe, param, and script.

Also, when serving XHTML 1.0 as text/html, use the two attributes together, for example:

<p>The title in Chinese is <span lang="zh-CN"
xml:lang="zh-CN">中国科学院文献情报中心</span>.</p>

Note that in the last example no tag surrounds the Chinese text to which we could attach language information, so we introduce a span element for the purpose. (Please check the source code of this section. Translator's note.)

If you serve XHTML as XML, as described in the previous section, you should use only the xml:lang attribute.

Specify the value of the language attribute

Using RFC 3066 rules

RFC 3066 is a standard that defines how language tags are used to identify languages.

A language tag consists of a primary subtag followed by zero or more additional subtags, separated by hyphens.

The primary subtag represents a language (with two exceptions, i and x, discussed below), and any following subtags refine it. Subsequent subtags generally represent a country, dialect or writing system.

For example, the tag en-GB says that the document is not only in English but specifically in British English, as opposed to American English.

Subtags are not case sensitive. They consist of letters and digits (A to Z, a to z, 0 to 9) and are no more than 8 characters long.

Note that the HTML specification still recommends RFC 1766 for identifying languages. RFC 3066 is an update of RFC 1766 that supersedes it, and an erratum to the HTML specification is planned, so you should use RFC 3066 regardless of what the HTML specification currently says.
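
A small sketch of a page tagged as British English follows. The sample sentence and title are assumptions made up for illustration; the en-US tag on the inline element shows the same primary subtag combined with a different country subtag:

```html
<!DOCTYPE html>
<!-- Primary subtag "en" (English) refined by the country subtag "GB" -->
<html lang="en-GB">
<head>
  <meta charset="utf-8">
  <title>Spelling example</title>
</head>
<body>
  <p>The British spelling is <em>colour</em>; the American spelling is
     <em lang="en-US">color</em>.</p>
</body>
</html>
```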

Primary subtag

All primary subtags must be 1, 2 or 3 letters long. All 2-letter subtags are language codes from ISO 639 part 1, and all 3-letter subtags are language codes from ISO 639 part 2; these standards define codes for the representation of languages. The 1-letter subtags are the i and x prefixes, which we describe later.

Although the codes are not case sensitive, they are conventionally written in lowercase.

Also note that where ISO offers both a 2-letter and a 3-letter code for a language, you should choose the 2-letter code. This ensures that each language is identified by a single code as far as possible, and that existing 2-letter codes (based on RFC 1766, which did not allow 3-letter codes) do not have to be changed. Likewise, the question of which 3-letter code to choose does not arise, since the few languages that have two different 3-letter codes also have 2-letter codes.

Additional subtags

Additional subtags can indicate a geographical region, dialect, writing system, or other refinement of the primary (language) subtag. The primary subtag can be followed by any number of subtags, although more than one is uncommon.

RFC 3066 says that any 2-letter subtag in the second position is an ISO 3166 country code. There are no rules for subtags in the third or later positions.

The 2-letter ISO code used to represent the country is usually capitalized, but it is only a convention.
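
As an illustration of these conventions, the fragment below tags three paragraphs with a lowercase primary subtag and, where present, an uppercase ISO 3166 country code. The tags are standard ones; the paragraph text is a placeholder assumed for the example:

```html
<!-- Primary subtag only: French, no region given -->
<p lang="fr">Bonjour.</p>
<!-- Primary subtag plus country code: French as used in Canada -->
<p lang="fr-CA">Bonjour.</p>
<!-- Primary subtag plus country code: Portuguese as used in Brazil -->
<p lang="pt-BR">Bom dia.</p>
```

Because subtags are not case sensitive, fr-ca and FR-CA identify the same language as fr-CA; the mixed-case form is simply the usual way of writing it.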

Special primary subtags

RFC 3066 defines some cases in which a tag need not start with an ISO language code.

Language tags starting with i are reserved for IANA-registered language tags. Some examples:

* i-mingo
* i-klingon
* i-tao

Language tags starting with x provide a mechanism for user-defined language tags. The subtag in the second position must be more than one letter long and must not be one of the following reserved subtags: AA, QM-QZ, XA-XZ, and ZZ.

Of course, these mechanisms should not be used when a suitable 2-letter or 3-letter ISO code is available. Their use should be limited, to prevent confusion and preserve interoperability.

IANA-registered language tags

IANA language tags can be registered through the email submission procedure described in RFC 3066. These tags can have a subtag of 3 to 8 letters in the second position.

Registering a tag with IANA is better than using a user-defined tag, because IANA tags are visible to others, which minimizes the possibility of confusion. On the other hand, an IANA tag is deprecated once the ISO standards approve a new code for the same language. Deprecated IANA tags include no-bok (Norwegian "Book Language"; use nb from ISO 639), i-navajo (Navajo; use nv from ISO 639), i-lux (Luxembourgish; use lb from ISO 639), and many more. For this reason, IANA-registered tags should appear only temporarily, to fill gaps in the ISO codes.

Although the i prefix is reserved for IANA tags, not all IANA tags start with it. For example, many Chinese dialects have been registered with IANA, including zh-guoyu (Mandarin; ha, why not Putonghua?), zh-hakka (Hakka), zh-min (Min), zh-min-nan (Minnan), zh-wuu (Wu) and so on.

Registered IANA tags also let you specify Traditional or Simplified Chinese. In the past you had to use zh-CN (mainland China) for Simplified Chinese and zh-TW (Taiwan, China) for Traditional Chinese, but you could not guarantee that other people would recognize or follow this practice. For example, some people use zh-HK to represent Traditional Chinese. Now IANA has registered the tags zh-Hans and zh-Hant to specify Simplified Chinese and Traditional Chinese respectively. The following two paragraphs illustrate the use of these two tags:

<p lang="zh-Hans" xml:lang="zh-Hans">当世界需要沟通时，请用Unicode！</p>

<p lang="zh-Hant" xml:lang="zh-Hant">當世界需要溝通時，請用統一碼（Unicode）！</p>

Other points about language tags

Although RFC 3066 language tags work well most of the time, some issues remain:

* More codes are needed than ISO currently provides, to cover the world's nearly 6000 languages.
* Codes that express a common region are not yet covered. For example, there is still no code for pan-Latin-American Spanish, which many organizations need for creating Spanish content.
* The relationship between language tag values and locales is still unclear. A locale is a combination of a language and a geographic region, and is often used in software to set date and time formats.
* Sometimes it is necessary to distinguish the writing system used for a language. For example, Mongolian may be written in the Mongolian or the Cyrillic script, and Serbo-Croatian may be written in the Latin or the Cyrillic script.

Staff from ISO TC37, SIL and the W3C are now working on solutions to these problems.

In the meantime, remember that you can always register the language tags you need with IANA.
