HTML5 Standard Study: Character Encoding

Last Update:2016-03-28 Source: Internet

Author: User


Every front-end engineer has probably run into garbled text at some point. However solid your fundamentals, in day-to-day work you will occasionally end up sharing a few cups of tea with "Brother Garbled". As a front-end engineer, how do you specify the encoding of a page? And do you know how the browser determines that encoding?

First, a very simple example: take the simplest possible HTML page and see how each browser treats it:

<!DOCTYPE html>

This is the simplest possible HTML. It has no <body> content, and because the file is opened directly from the local disk, no server sends an encoding declaration either. Here is the page encoding each browser reports:

 
  
   
    
Browser	Display Encoding	Notes
IE6	UTF-8
IE8	UTF-8
IE9	GB2312	System default character set
Firefox 3.5	GB2312	System default character set
Firefox 4.0	ISO-8859-1	Western European; the default encoding for English
Chrome	GBK	System default character set
Opera	Chinese auto-detection	Probably GB2312 as well
    
   
 
As the table shows, browsers disagree about how to decode a page that declares no encoding by any means. For a page this simple the choice makes no difference, as long as the encoding is a superset of ASCII, but the table is enough to show why setting the encoding correctly matters.
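To make the point concrete, here is a minimal Python sketch that decodes the same bytes under three of the encodings from the table above. The sample string "编码" is an assumption chosen for illustration and does not come from the original page. ASCII-only bytes come out identical everywhere, while non-ASCII bytes only survive when they are decoded with the encoding they were written in.

```python
# Bytes of the tiny page, plus a short non-ASCII string (hypothetical sample text).
ascii_page = b"<!DOCTYPE html>"
text = "编码"  # "encoding" in Chinese

# Three encodings taken from the browser table above.
candidates = ["utf-8", "gbk", "iso-8859-1"]

# Pure ASCII bytes decode to the same string under every ASCII superset.
decoded_ascii = {enc: ascii_page.decode(enc) for enc in candidates}

# Non-ASCII text saved as UTF-8 only round-trips when it is read back as UTF-8.
utf8_bytes = text.encode("utf-8")
decoded_text = {
    enc: utf8_bytes.decode(enc, errors="replace") for enc in candidates
}

for enc in candidates:
    print(enc, repr(decoded_ascii[enc]), repr(decoded_text[enc]))
```

Only the "utf-8" row prints the original text; the other two rows show the kind of garbling a wrong guess produces.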
Encoding declaration
HTML4 and HTML5 each devote a section to how the encoding is declared.
First, what is an encoding? An encoding tells the browser (or user agent) which algorithm to use to turn the byte stream into the correct content. In the HTML standard, an encoding is referred to by an alias. The aliases come from the IANA definitions, and only encodings that appear in that list are meant to be recognized by browsers. So if UTF-8 is written as UTF8, the browser may ignore it completely. Encoding aliases are also case-insensitive.
In HTML4 there are three ways to specify the encoding of a page. In order of priority they are:
 
  
   
   1. The charset parameter of the Content-Type field in the HTTP header. 
   2. A <meta http-equiv="Content-Type"> tag. 
   3. For some external resources, such as a JS file loaded by a <script> tag, the charset attribute on the tag that loads them. 
   
 
There is nothing controversial here. Note, however, that when the encoding is declared with a <meta http-equiv="Content-Type"> tag and the browser reaches that tag only to find that the encoding it has been using does not match the declaration, it goes back to the start and re-parses the page. Part of the page is therefore parsed twice, so if you declare the encoding with a tag, write that tag as early as possible. A best practice is to put the <meta> tag before any other tag. Google PageSpeed documents this as well.
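The HTML4 priority order can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `charset_from_content_type` and `resolve_encoding` are hypothetical, and the header parsing leans on the standard library's `email.message.Message` rather than on anything a browser actually does.

```python
from email.message import Message
from typing import Optional


def charset_from_content_type(header_value: str) -> Optional[str]:
    """Return the charset parameter of a Content-Type header value, if any."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the lower-cased charset, or None if absent.
    return msg.get_content_charset()


def resolve_encoding(
    content_type: Optional[str],
    meta_charset: Optional[str],
    attribute_charset: Optional[str] = None,
) -> Optional[str]:
    """Pick the first declared encoding in HTML4 priority order."""
    header_charset = (
        charset_from_content_type(content_type) if content_type else None
    )
    # 1. HTTP header, 2. <meta> declaration, 3. charset attribute on the tag.
    for candidate in (header_charset, meta_charset, attribute_charset):
        if candidate:
            return candidate.lower()
    return None


# The HTTP header wins even when the <meta> tag disagrees.
print(resolve_encoding("text/html; charset=gbk", "utf-8"))
```

Returning `None` when nothing is declared leaves the fallback decision to the caller, which mirrors the situation in the browser table at the top of the article.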
Evolution over time
But as time went on, developers gradually noticed something. Just as with the simplest DOCTYPE declaration, browsers do not strictly follow the standard when they read the encoding from a <meta> tag. The encoding has to be settled before the tokenizer stage of HTML parsing, so the browser cannot wait until the DOM tree is built, pick out the <meta> element, read its http-equiv and content attributes, and only then determine the encoding.
In reality, browsers read the encoding from a <meta> tag in a much simpler way:
 
   
    
    1. Confirm that this is a <meta> tag. Following the HTML parsing state machine, this is decided by a "<" character followed by the string "meta". 
    2. Search the string that follows (there is no notion of a tag here, only a string) for the substring "charset". 
    3. Read forward, skipping all whitespace, to the first meaningful character C. 
       If C is not "=", go back to step 2 and keep searching. 
       If C is "=", continue with the next step. 
    4. Skip any whitespace, single quotes, and double quotes, then scan forward until a character that cannot be part of the value appears (a single quote, a double quote, whitespace, the end of the tag, and so on). The string scanned in between is s. 
    5. Parse s to get the encoding alias. 
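The steps above can be sketched in Python roughly as follows. This is a simplified illustration, not the full prescan algorithm from the HTML standard: the function names and the exact set of terminating characters are assumptions, and a ">" inside an attribute value would cut the tag short.

```python
from typing import Optional

WHITESPACE = " \t\n\r\f"
QUOTES = "\"'"
# Characters treated as ending the value in this sketch (an assumption).
TERMINATORS = WHITESPACE + QUOTES + ";/>"


def charset_in(text: str) -> Optional[str]:
    """Find 'charset', require '=', then cut out the value that follows."""
    lower = text.lower()
    i = 0
    while True:
        i = lower.find("charset", i)  # search for the substring
        if i == -1:
            return None
        i += len("charset")
        while i < len(text) and text[i] in WHITESPACE:
            i += 1  # skip whitespace to the first meaningful character
        if i >= len(text) or text[i] != "=":
            continue  # not "=": keep searching for the next "charset"
        i += 1
        while i < len(text) and text[i] in WHITESPACE + QUOTES:
            i += 1  # skip whitespace and opening quotes
        j = i
        while j < len(text) and text[j] not in TERMINATORS:
            j += 1  # scan to the first character that cannot be in the value
        return text[i:j].lower() or None


def sniff_meta_charset(markup: str) -> Optional[str]:
    """Return the first charset found in text that looks like a <meta> tag."""
    lower = markup.lower()
    pos = 0
    while True:
        start = lower.find("<meta", pos)  # "<" followed by "meta"
        if start == -1:
            return None
        end = lower.find(">", start)
        if end == -1:
            end = len(markup)
        found = charset_in(markup[start:end])
        if found:
            return found
        pos = end


print(sniff_meta_charset('<meta charset="utf-8" />'))
```

Because the scan works on plain strings, the http-equiv attribute is never inspected at all, which is why so many different spellings of the declaration end up working.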
    
  
From this algorithm it is easy to see that each of the following forms lets the browser identify the encoding correctly:
 
   
    
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> 
    <meta charset="utf-8" /> 
    <meta charset=utf-8 /> 
    ... and many other odd variants. 
    
  
So, as history moved on, the browser vendors one day sat down together to discuss the issue... To their surprise, their implementations turned out to be very similar (perhaps because they had borrowed from one another), so they decided to turn this behavior into a standard... After a long discussion, HTML5's much-loved way of declaring the encoding was born. In HTML5 it is called the "meta charset element", and its simplest form is:
<meta charset=utf-8>
Of course, this is HTML syntax. If you follow XHTML and find it more rigorous, <meta charset="utf-8" /> works just as well.
The algorithm described above for obtaining the encoding is also documented in detail in the standard.
In the HTML5 era the standard once again revised and refined the encoding declaration. The main differences are:
 
   
    
    HTML5 allows a byte order mark (BOM, the character U+FEFF at the very start of the file) to determine the encoding, but it is not clear from the standard what priority the BOM has relative to the other declarations. 
    HTML5 adds the <meta charset> tag. 
    HTML5 requires a page whose encoding is not given by a BOM or by the HTTP header to use an ASCII-compatible encoding and to declare it with a <meta> tag, whereas HTML4 lets the browser choose according to the environment it runs in. 
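Checking for a BOM amounts to comparing the first bytes of the file with a few fixed signatures. The Python sketch below uses the constants from the standard `codecs` module. The function name `encoding_from_bom` is hypothetical, and including the UTF-8 signature is an assumption of this sketch; drop that entry to check only the two UTF-16 byte orders.

```python
import codecs
from typing import Optional

# Signatures checked in this sketch: UTF-8 plus both UTF-16 byte orders.
BOM_SIGNATURES = (
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
)


def encoding_from_bom(data: bytes) -> Optional[str]:
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for signature, name in BOM_SIGNATURES:
        if data.startswith(signature):
            return name
    return None


# U+FEFF written as UTF-16 little-endian produces the bytes FF FE.
sample = "\ufeff<!DOCTYPE html>".encode("utf-16-le")
print(encoding_from_bom(sample))
```

The same character U+FEFF yields different byte sequences in each encoding, which is what lets a reader tell the byte order apart before decoding anything else.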
    
  
Other details
Besides the basic ways of declaring the encoding, the standard contains a number of details worth noting:
 
   
    
    If you declare the encoding with a <meta> tag, the encoding must be a superset of ASCII. Loosely speaking, an ASCII superset is an encoding in which the bytes 0x00 to 0x7F mean the same as in ASCII. 
    HTML5 strongly recommends UTF-8. 
    The standard advises against character sets such as UTF-32, JIS_C6226-1983, JIS_X0212-1990, HZ-GB-2312, and JOHAB, and prohibits CESU-8, UTF-7, BOCU-1, and SCSU. In practice, though, browsers can at least recognize UTF-7. 
    Developers who want to adhere strictly to XHTML should specify the encoding with the XML declaration, that is, <?xml version="1.0" encoding="UTF-8" standalone="no" ?> . But in IE6 this interferes with the DOCTYPE, so developers are better off compromising on this point and using the HTML form of the declaration. 
    For the priority of the various encoding declarations in the real world, and for other details that need attention, this article is worth reading. 
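Whether an encoding is an ASCII superset can be checked mechanically: decode the 128 ASCII byte values and see whether the ASCII characters come back unchanged. The Python sketch below does that with Python's own codecs, which is an assumption of convenience; a browser's decoder for the same alias may differ in edge cases. The function name `is_ascii_superset` is hypothetical.

```python
# The 128 ASCII characters, used as the reference result.
ASCII_BYTES = bytes(range(128))
ASCII_TEXT = ASCII_BYTES.decode("ascii")


def is_ascii_superset(encoding: str) -> bool:
    """True if bytes 0x00-0x7F decode to the same characters as in ASCII."""
    try:
        return ASCII_BYTES.decode(encoding) == ASCII_TEXT
    except UnicodeDecodeError:
        return False


for name in ("utf-8", "gbk", "iso-8859-1", "utf-16"):
    print(name, is_ascii_superset(name))
```

This property is what makes a <meta> declaration readable at all: the browser can find the ASCII string "charset" before it knows which encoding the rest of the page uses, which is not possible for an encoding such as UTF-16.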
    
Best practices
    Specify the encoding in the HTTP header whenever possible. 
    Use UTF-8 wherever possible, or at least use one encoding for every resource across the whole site. 
    If you want to use UTF-16, add a BOM to the file to indicate whether it is little-endian or big-endian. 
    If you specify the encoding with a <meta> tag, the http-equiv form is fine, but place the tag as early as possible, at least before any non-ASCII character. 
    For linked external scripts, add the charset attribute if you cannot be sure they use the same encoding as the page. 
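As a rough illustration of the first, second, and fourth practices together, the Python sketch below builds a response whose HTTP header and <meta> tag both declare UTF-8, with the tag placed ahead of any non-ASCII text. The function name `build_response` and the sample strings are hypothetical, and the title and body are assumed to be already HTML-escaped.

```python
from typing import Dict, Tuple

ENCODING = "utf-8"


def build_response(title: str, body_text: str) -> Tuple[Dict[str, str], bytes]:
    """Return (headers, payload) for a page that declares UTF-8 twice."""
    # The <meta> tag comes first in <head>, before any non-ASCII text.
    html = (
        "<!DOCTYPE html>\n"
        "<html>\n<head>\n"
        f'<meta charset="{ENCODING}">\n'
        f"<title>{title}</title>\n"
        "</head>\n"
        f"<body>{body_text}</body>\n"
        "</html>\n"
    )
    payload = html.encode(ENCODING)
    headers = {
        # The HTTP header carries the same charset as the <meta> tag.
        "Content-Type": f"text/html; charset={ENCODING}",
        "Content-Length": str(len(payload)),
    }
    return headers, payload


headers, payload = build_response("编码", "你好")
print(headers["Content-Type"])
```

Keeping the encoding in one constant means the header, the tag, and the bytes actually sent cannot drift apart.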
    
  

This article is an English version of an article originally written in Chinese on aliyun.com and is provided for information purposes only.
