How do you specify the encoding of a page, and do you know how the browser works out which encoding to use?
First, a very simple example: take a simple HTML page and see how each browser treats it:
| Browser | Display Encoding | Notes |
| --- | --- | --- |
| IE6 | UTF-8 | |
| IE8 | UTF-8 | |
| IE9 | GB2312 | System default character set |
| Firefox3.5 | GBK2312 | System default character set |
| Firefox4.0 | ISO-8859-1 | Western European languages, the default encoding for English |
| Chrome | GBK | System default character set |
| Opera | Chinese (automatic detection) | Should be GB2312, too. |
As the table shows, browsers parse a page differently when it does not declare its encoding by any means. Of course, for the simplest pages the choice of encoding makes no difference (provided it is a superset of ASCII), but the table is enough to show how important it is to set the encoding correctly.
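To make the effect concrete, here is a minimal Python sketch of what happens when the same bytes are read under different assumed encodings. The sample text and the helper name `decode_as` are illustrative assumptions, not part of the original test page.

```python
# Sketch: one byte stream, decoded under several assumed encodings.
text = "编码 encoding"
page_bytes = text.encode("utf-8")  # the bytes the server actually sends


def decode_as(data: bytes, assumed: str) -> str:
    """Decode bytes the way a browser might if it assumed `assumed`."""
    # errors="replace" stands in for a browser showing U+FFFD
    # instead of refusing to render the page.
    return data.decode(assumed, errors="replace")


# Only the correct guess reproduces the original text.
results = {guess: decode_as(page_bytes, guess)
           for guess in ("utf-8", "gbk", "iso-8859-1")}

# A pure-ASCII page survives any ASCII-superset guess unchanged.
ascii_page = b"<p>hello</p>"
ascii_results = {guess: decode_as(ascii_page, guess)
                 for guess in ("utf-8", "gbk", "iso-8859-1")}
```

Only the `utf-8` entry of `results` matches the original text, while every entry of `ascii_results` is identical, which is why the problem stays hidden on ASCII-only pages.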
Encoding declaration
HTML4 and HTML5 each devote a section to how the encoding is declared; see the relevant sections of the HTML4 and HTML5 standards for the full text.
The source of this article: Http://www.otakustay.com/learning-html5-charset
First, what is an encoding? An encoding tells the browser (or user agent) which algorithm to use to parse the byte stream into the actual, correct content. In the HTML standard, an encoding is identified by an alias. The aliases come from the IANA registry, and only an encoding that appears in that list can be recognized by the browser. So if UTF-8 is written as UTF8, the browser may ignore the declaration entirely. In addition, encoding aliases are case insensitive.
In HTML4, there are 3 ways to specify the encoding of a page. In order of priority, they are:
- The charset parameter that follows the Content-Type field in the HTTP header.
- A <meta http-equiv="Content-Type"> tag.
- For some external resources, such as a JS file loaded by a <script> tag, the charset attribute on the tag.
This priority order is unsurprising. Note, however, that when the encoding is declared through a <meta http-equiv="Content-Type"> tag and the browser, on reaching the tag, finds that the encoding it has been using does not match the declared one, it goes back to the beginning and re-parses the page. Part of the page is therefore parsed twice, so if you declare the encoding with a tag, write that tag as early as possible. A best practice is to place it before any other tag. Google PageSpeed also has a corresponding note on this.
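The priority between the HTTP header and the <meta> tag can be sketched in a few lines of Python. This is only an illustration of the ordering described above: `charset_from_content_type` and `pick_page_encoding` are hypothetical helper names, and the standard library's `email.message.Message` is used here merely as a convenient parser for Content-Type values.

```python
from email.message import Message
from typing import Optional


def charset_from_content_type(value: Optional[str]) -> Optional[str]:
    """Return the charset parameter of a Content-Type value, if any."""
    if not value:
        return None
    msg = Message()
    msg["Content-Type"] = value
    # get_content_charset() returns the lower-cased charset, or None.
    return msg.get_content_charset()


def pick_page_encoding(http_content_type: Optional[str],
                       meta_content: Optional[str]) -> Optional[str]:
    """HTTP header first; fall back to the <meta> content attribute."""
    return (charset_from_content_type(http_content_type)
            or charset_from_content_type(meta_content))
```

For example, `pick_page_encoding("text/html; charset=GBK", "text/html; charset=utf-8")` picks the header's value, and the <meta> value is only used when the header carries no charset.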
The evolution of the times
But as time went on, developers gradually discovered something. Just as with the simplified DOCTYPE declaration, browsers do not strictly follow the standard when they read the encoding from a <meta> tag. After all, the encoding of the page has to be determined before the tokenizer stage of HTML parsing, so the browser cannot wait until the DOM tree has been built, pick out the <meta> element, take its http-equiv and content attributes, and only then determine the encoding.
In reality, the browser does something very simple to read the encoding declared by a <meta> tag:
- Confirm that this is a <meta> tag, identified by the "<" character followed by the string "meta", based on the HTML parsing state machine.
- Search the string that follows (there is no concept of a tag here, just a string) for the substring "charset".
- Read forward, skipping all whitespace characters, to the first meaningful character C.
- If C is not the "=" character, go back to the second step and keep searching.
- If C is the "=" character, continue.
- Skip all whitespace characters, single quotes, double quotes and the like, then scan forward until a character that should not appear in the value is reached (a single quote, double quote, whitespace character, the end of the tag, and so on), and take the scanned string S.
- Parse S to get the encoding alias.
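The steps above can be sketched in Python roughly as follows. This is a simplified illustration of the scan as described in this article, not the exact algorithm of any browser or of the standard; the function name `sniff_meta_charset` and the terminator set are assumptions made for the example.

```python
from typing import Optional

WHITESPACE = " \t\r\n\f"
QUOTES = "\"'"
# Characters assumed to end the value (an illustrative choice).
TERMINATORS = WHITESPACE + QUOTES + ">;/"


def sniff_meta_charset(html: str) -> Optional[str]:
    """Return the lower-cased encoding alias found in a <meta> tag."""
    lowered = html.lower()  # aliases and tag names are case insensitive
    tag_start = lowered.find("<meta")           # step 1: "<" + "meta"
    while tag_start != -1:
        tag_end = lowered.find(">", tag_start)
        if tag_end == -1:
            tag_end = len(lowered)
        pos = tag_start
        while True:
            # Step 2: plain substring search, no attribute parsing.
            pos = lowered.find("charset", pos, tag_end)
            if pos == -1:
                break
            pos += len("charset")
            # Step 3: skip whitespace to the first meaningful character.
            while pos < tag_end and lowered[pos] in WHITESPACE:
                pos += 1
            # Step 4: not "=", so keep looking for another "charset".
            if pos >= tag_end or lowered[pos] != "=":
                continue
            pos += 1                             # step 5: consume "="
            # Step 6: skip whitespace and quotes, then scan the value.
            while pos < tag_end and lowered[pos] in WHITESPACE + QUOTES:
                pos += 1
            start = pos
            while pos < tag_end and lowered[pos] not in TERMINATORS:
                pos += 1
            if pos > start:
                return lowered[start:pos]        # step 7: the alias
        tag_start = lowered.find("<meta", tag_start + 1)
    return None
```

Because the scan is purely string based, it never looks at the http-equiv attribute at all, which is why so many different spellings of the declaration end up working.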
From the algorithm above, it is easy to see that all of the following forms actually let the browser identify the encoding correctly:
<meta http-equiv="Cotnent-Type" content="text/html; charset=utf-8" />
<meta charset="utf-8" />
<meta charset=utf-8 />
- ...and many other quirky variants.
So, as history moved on, one day the browser vendors sat down together and began to discuss the issue. In the end they were surprised to find that their implementations were very similar (perhaps because they had been written with reference to one another), so they decided to turn this behavior into a standard. Finally, after a long discussion, HTML5's widely loved way of declaring the encoding was born. In HTML5 it is called the "meta-charset element", and its simplest form is as follows:
<meta charset=utf-8>
Of course, this is HTML syntax. If you follow XHTML and find XHTML more rigorous, writing <meta charset="utf-8" /> is no problem either.
The specific algorithm for obtaining the encoding, described above, is also documented in detail in the HTML5 standard.
In the HTML5 era, the standard once again corrected and refined the encoding declaration. The main differences are:
- HTML5 allows a BOM to be used to determine the encoding, but only supports the UTF-16 BOM (that is, U+FEFF), and does not state what priority a BOM-specified encoding has.
- HTML5 adds the meta charset tag.
- HTML5 specifies that a page with no declared encoding uses ASCII as its encoding, whereas HTML4 lets the browser choose according to the environment it runs in.
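As an illustration of BOM-based detection, the following Python sketch checks only for the two UTF-16 byte order marks, matching the description above. The function names and the fallback parameter are assumptions for the example; the BOM constants come from the standard `codecs` module.

```python
import codecs
from typing import Optional


def sniff_utf16_bom(data: bytes) -> Optional[str]:
    """Return the UTF-16 variant indicated by a leading BOM, if any."""
    if data.startswith(codecs.BOM_UTF16_BE):   # b"\xfe\xff"
        return "utf-16-be"
    if data.startswith(codecs.BOM_UTF16_LE):   # b"\xff\xfe"
        return "utf-16-le"
    return None


def decode_page(data: bytes, fallback: str = "ascii") -> str:
    """Decode using the BOM when present, else the given fallback."""
    encoding = sniff_utf16_bom(data)
    if encoding is None:
        return data.decode(fallback, errors="replace")
    # U+FEFF occupies exactly two bytes in UTF-16; drop it before decoding.
    return data[2:].decode(encoding, errors="replace")
```

Note that U+FEFF is a single code point; it is the byte order in which it is stored (FE FF or FF FE) that tells the reader whether the file is big endian or little endian.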
Other details
In addition to the basic ways of declaring the encoding, the standard contains a number of details worth noting:
- If you use a <meta> tag to declare the encoding, the encoding must be an ASCII superset. Put simply, an ASCII superset is an encoding in which the ASCII characters keep their ASCII byte values.
- HTML5 strongly recommends using UTF-8.
- The standard advises against character sets such as UTF-32, jis_c6226-1983, jis_x0212-1990, hz-gb-2312 and Johab, and prohibits the use of CESU-8, UTF-7, BOCU-1 and SCSU. In practice, however, browsers can at least recognize UTF-7.
- Developers who want to adhere strictly to XHTML should use the XML declaration to specify the encoding, that is, <?xml version="1.0" encoding="UTF-8" standalone="no" ?>. However, this affects how the DOCTYPE is handled in IE6, so developers might as well compromise on this point and simply use the HTML way of declaring the encoding.
- This article is worth reading about the priority of the various code declarations in the real world, as well as some other details that need attention.
Best practices
- Specify the encoding using the HTTP header whenever possible.
- Use UTF-8 wherever possible, or at least use one encoding consistently for all resources across the whole site.
- If you want to use UTF-16, add a BOM to the file to indicate whether it is little endian or big endian.
- If you use a <meta> tag to specify the encoding, you can use the http-equiv form, but place the tag as early as possible, at least before any non-ASCII characters.
- For externally linked scripts, if you cannot be sure they use the same encoding as the page, add the charset attribute.
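As a small illustration of the first two practices, here is a minimal WSGI application in Python that declares UTF-8 in the HTTP header and also places a <meta> declaration before any non-ASCII character. The page content and the name `app` are assumptions for the example; any server or framework that lets you set the Content-Type header works the same way.

```python
# The <meta> declaration comes before the first non-ASCII character.
PAGE = (
    "<!DOCTYPE html>\n"
    "<meta charset=\"utf-8\">\n"
    "<title>编码</title>\n"
    "<p>你好</p>\n"
)


def app(environ, start_response):
    """Serve PAGE as UTF-8 and say so in the HTTP header."""
    body = PAGE.encode("utf-8")
    headers = [
        # The charset parameter here has the highest priority.
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]
```

To try it locally, the standard library's `wsgiref.simple_server.make_server("", 8000, app).serve_forever()` is enough.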
HTML5 page encoding