HTML5 UTF-8 Chinese Characters

Source: Internet
Author: User


<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>title of HTML5</title>
</head>
<body>
<p>Contents of HTML5! Hello</p>
</body>
</html>

I wrote this in Notepad, but after saving it, the page unexpectedly rendered as garbled text. Changing the declared charset to GB2312 made the Chinese display correctly.


<!DOCTYPE html>
<html>
<head>
<meta charset="GB2312">
<title>title of HTML5</title>
</head>
<body>
<p>Contents of HTML5! Hello</p>
</body>
</html>  

But standards are standards, and I still wanted to use UTF-8. It finally turned out that the code was not the problem; Notepad was. <meta charset="UTF-8"> only tells the browser to interpret the file as UTF-8; the document's actual encoding is determined by the choice you make when saving. If you save as ANSI and then have the browser interpret the file as UTF-8, it is bound to be garbled. Notepad's default save format is ANSI, so you have to change it to UTF-8 when saving. Anyone else writing HTML in Notepad should watch out for the same thing.
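A small Python sketch (not part of the original post) illustrates the mismatch. It assumes, as on a Chinese-language Windows system, that Notepad's "ANSI" means code page 936 (GBK), and uses "你好" as a stand-in for the page's Chinese text:

```python
# "ANSI" in Notepad on Chinese Windows means the default code page,
# CP936/GBK: an assumption this sketch makes explicit.
html = '<p>Contents of HTML5! 你好</p>'

saved = html.encode('gbk')  # the bytes Notepad writes when saving as ANSI

# A browser honoring <meta charset="UTF-8"> decodes those bytes as UTF-8,
# and the Chinese comes out as replacement characters (mojibake):
as_utf8 = saved.decode('utf-8', errors='replace')
print(as_utf8)

# Decoded with the encoding the file was actually saved in, it is intact:
print(saved.decode('gbk'))
```

Saving the file as UTF-8 in the first place (so that `saved = html.encode('utf-8')`) makes the declared and actual encodings agree, which is exactly the Notepad fix described above.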


The difference between UTF-8, GBK, and GB2312



UTF-8 (Unicode Transformation Format, 8-bit) allows a BOM but usually does not include one. It is a multi-byte encoding designed for international text: English characters use 8 bits (one byte), while Chinese characters use 24 bits (three bytes). UTF-8 covers the characters every country needs, so it is highly portable: UTF-8 text displays correctly on any browser that supports the UTF-8 character set, in any country. For example, a UTF-8 page can display Chinese in a foreign user's English IE without the user having to download IE's Chinese language support pack.
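As a quick sanity check of the byte counts claimed above, here is a minimal Python snippet (using "汉" as the sample character):

```python
# ASCII letters occupy one byte in UTF-8; common Chinese characters three.
ascii_len = len('A'.encode('utf-8'))
han_bytes = '汉'.encode('utf-8')

print(ascii_len)        # 1
print(len(han_bytes))   # 3
print(han_bytes.hex())  # e6b189
```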



GBK is a national standard that extends GB2312 and remains compatible with it. GBK encodes Chinese characters in two bytes; to distinguish them from single-byte ASCII, the highest bit of the lead byte is set to 1. GBK covers all Chinese characters. As a national rather than international code it is less portable than UTF-8, but UTF-8 text occupies more database space than GBK.



GBK, GB2312, and UTF-8 must go through Unicode to convert from one to the other:



GBK, GB2312 -> Unicode -> UTF-8



UTF-8 -> Unicode -> GBK, GB2312
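In Python terms, the two conversion paths above are a decode followed by an encode; a sketch with "汉" (GBK code BA BA, UTF-8 code E6 B1 89):

```python
gbk_bytes = b'\xba\xba'                  # '汉' in GBK

# GBK -> Unicode -> UTF-8
text = gbk_bytes.decode('gbk')           # GBK to Unicode (a Python str)
utf8_bytes = text.encode('utf-8')        # Unicode to UTF-8
print(utf8_bytes.hex())                  # e6b189

# UTF-8 -> Unicode -> GBK (the reverse path)
round_trip = utf8_bytes.decode('utf-8').encode('gbk')
print(round_trip == gbk_bytes)           # True
```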



For a website or forum, if there are many English characters, UTF-8 is recommended to save space. However, many forum plug-ins currently only support GBK.

Simply put, Unicode, GBK, and Big5 are code values, while UTF-8 and UTF-16 are representations of those values. The former three are mutually incompatible: the same Chinese character has a completely different code value in each. For example, the Unicode value of "汉" (han) is not the same as its GBK value; suppose the Unicode value were A040 and the GBK value B030. UTF-8 is purely a representation of the Unicode value, so converting GBK to UTF-8 means first converting to Unicode and then to UTF-8, and that is all there is to it.



See the following article for details.



A talk on Unicode encoding, briefly explaining UCS, UTF, BMP, BOM, and other terms



This is a fun read written by a programmer for programmers. "Fun" here means it makes some previously unclear concepts relatively easy to understand and levels up your knowledge, rather like leveling up in an RPG. The motivation for putting this article together was two questions:



Question one:
Using Save As in Windows Notepad, you can convert between four encodings: GBK, Unicode, Unicode big endian, and UTF-8. For the same TXT file, how does Windows tell which encoding was used?



I had noticed earlier that TXT files saved as Unicode, Unicode big endian, or UTF-8 have a few extra bytes at the beginning: FF FE (Unicode), FE FF (Unicode big endian), and EF BB BF (UTF-8). But what standard defines these marks?



Question two:
Recently I came across ConvertUTF.c on the Internet, which implements mutual conversion among the three encodings UTF-32, UTF-16, and UTF-8. I already understood the encodings Unicode (UCS-2), GBK, and UTF-8, but this program left me somewhat confused; I could not recall what relationship UTF-16 has to UCS-2.
After looking up the relevant material, I finally cleared up these questions and picked up some Unicode details along the way. I am writing this up for friends who have similar questions. I have tried to keep the article easy to follow, but it does assume the reader knows what a byte is and what hexadecimal is.



0. Big endian and little endian
Big endian and little endian are the two ways a CPU can order the bytes of a multibyte number. For example, the Unicode encoding of the character "汉" (han) is 6C49. When writing it to a file, should 6C be written first, or 49? If 6C is written first, that is big endian; if 49 is written first, that is little endian.
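A short Python illustration of the two byte orders, using the same 6C49 example:

```python
import struct

cp = 0x6C49  # Unicode code point of '汉'

big = struct.pack('>H', cp)     # big endian: 6C written first
little = struct.pack('<H', cp)  # little endian: 49 written first

print(big.hex())     # 6c49
print(little.hex())  # 496c

# The UTF-16 codecs exhibit exactly the same difference:
print('汉'.encode('utf-16-be') == big)     # True
print('汉'.encode('utf-16-le') == little)  # True
```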



The word "endian" comes from Gulliver's Travels. The civil war in Lilliput was fought over whether eggs should be cracked open at the big end (big-endian) or the little end (little-endian). Six rebellions broke out over the question; one emperor lost his life and another his throne.



We generally translate "endian" as "byte order", and render big endian and little endian as the "big end" and "little end" orders.



1. Character encodings and internal codes, with a brief introduction to Chinese character encodings
Characters must be encoded before a computer can process them. The default encoding a computer uses is its internal code. Early computers used 7-bit ASCII. To handle Chinese characters, programmers designed GB2312 for Simplified Chinese and Big5 for Traditional Chinese.



GB2312 (1980) contains 7,445 characters in total: 6,763 Chinese characters and 682 other symbols. The internal code range of the Chinese character area runs from B0 to F7 in the high byte and A1 to FE in the low byte, occupying 72*94 = 6,768 code points, of which 5 (D7FA-D7FE) are vacant.



GB2312 supports too few Chinese characters. The 1995 Chinese character extension specification GBK 1.0 contains 21,886 symbols, divided into a Chinese character area and a graphic symbol area; the Chinese character area contains 21,003 characters.



From ASCII through GB2312 to GBK, these encodings are backward compatible: the same character always has the same encoding in all of them, and each later standard supports more characters. In these encodings, English and Chinese can be handled uniformly; a byte whose highest bit is not 0 marks the start of a Chinese character code. In programmers' terms, GB2312 and GBK belong to the double-byte character sets (DBCS).



GB18030 (2000) is the official national standard that replaces GBK 1.0. It contains 27,484 Chinese characters as well as Tibetan, Mongolian, Uyghur, and other major minority scripts. In terms of Chinese characters, GB18030 adds the 6,582 characters of CJK Extension A (Unicode 0x3400-0x4DB5) to the 20,902 characters of GB13000.1, for a total of 27,484.



CJK stands for China, Japan, and Korea. To save code points, Unicode assigns unified codes to the characters shared by the three languages. GB13000.1 is the Chinese version of ISO/IEC 10646-1 and is equivalent to Unicode 1.1.



GB18030 has single-byte, double-byte, and four-byte forms; its single-byte and double-byte forms are fully compatible with GBK. The four-byte code points include the 6,582 Chinese characters of CJK Extension A. For example, UCS 0x3400 is encoded in GB18030 as 81 39 EF 30, and UCS 0x3401 as 81 39 EF 31.



Microsoft has provided a GB18030 upgrade package, but it only supplies a new font, SimSun-18030, covering the 6,582 new characters of CJK Extension A; it does not change the internal code. The internal code of Windows is still GBK.



Here are some details:



The original GB2312 standard actually defines location (zone-position) codes; to get from a location code to the internal code, add A0 to the high byte and to the low byte respectively.



For any character encoding, the order of its coding units is fixed by the encoding scheme and is not affected by endianness. For example, the coding unit of GBK is the byte, and a Chinese character is represented by two bytes whose order is fixed and unaffected by CPU byte order. The coding unit of UTF-16 is the 16-bit word; the order of the words is fixed by the scheme, but the byte order within each word is affected by endianness. UTF-16 is introduced below.



In GB2312 the highest bit of both bytes is 1, but that condition only yields 128*128 = 16,384 code points, so in GBK and GB18030 the highest bit of the low byte is not necessarily 1. This does not affect parsing a DBCS character stream: while reading one, whenever you encounter a byte whose high bit is 1, you simply treat it and the next byte as one double-byte code, without worrying about what the low byte's high bit is.
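The parsing rule just described can be sketched in a few lines of Python (a simplified illustration; real GBK decoders also validate the trail byte):

```python
def split_dbcs(stream: bytes):
    """Split a GBK-style DBCS byte stream into single- and double-byte codes."""
    units, i = [], 0
    while i < len(stream):
        if stream[i] & 0x80:              # high bit 1: lead byte of a double-byte code
            units.append(stream[i:i + 2])
            i += 2
        else:                             # high bit 0: a plain ASCII byte
            units.append(stream[i:i + 1])
            i += 1
    return units

# '汉' (BA BA), 'A' (41), '字' (D7 D6) mixed in one GBK stream:
print(split_dbcs(b'\xba\xbaA\xd7\xd6'))
```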



2. Unicode, UCS, and UTF
The encodings mentioned above, from ASCII and GB2312 through GBK to GB18030, are backward compatible with one another. Unicode is compatible only with ASCII (more precisely, with ISO-8859-1) and is incompatible with the GB codes. For example, the Unicode encoding of "汉" is 6C49, while its GB code is BABA.



Unicode is also a character encoding method, but it was designed by international organizations to accommodate all of the world's writing systems. The formal name of the scheme is the "Universal Multiple-Octet Coded Character Set", abbreviated UCS. UCS can also be read as an abbreviation of "Unicode Character Set".



According to Wikipedia (http://zh.wikipedia.org/wiki/), two organizations historically attempted to design a universal character set independently: the International Organization for Standardization (ISO) and an association of software manufacturers (unicode.org). ISO developed the ISO 10646 project, while the Unicode consortium developed the Unicode project.



Around 1991, both sides recognized that the world did not need two incompatible character sets, so they merged their work and cooperated on creating a single code table. From Unicode 2.0 onward, the Unicode project has used the same character repertoire and code points as ISO 10646-1.



Both projects still exist and publish their standards independently. At the time this article was written, the Unicode consortium's current version was Unicode 4.1.0 (2005), and the latest ISO standard was ISO 10646-3:2003.



UCS only specifies how characters are encoded; it does not specify how the codes are transmitted or stored. For example, the UCS code of "汉" is 6C49. I could transmit and store it as the four ASCII digits "6C49", or encode it in UTF-8 as the three consecutive bytes E6 B1 89. The key is that both communicating parties agree on the scheme. UTF-8, UTF-7, and UTF-16 are all widely accepted schemes. A particular benefit of UTF-8 is that it is fully compatible with ASCII. UTF is the abbreviation of "UCS Transformation Format".



The IETF's RFC 2781 and RFC 3629 describe the UTF-16 and UTF-8 encodings, respectively, in the RFCs' consistently clear, crisp, yet rigorous style. I can never quite remember that IETF stands for Internet Engineering Task Force, but the RFCs the IETF maintains are the basis of all the specifications on the Internet.



2.1. Internal codes and code pages
The Windows kernel now supports the Unicode character set, so at the kernel level Windows can support all of the world's written languages. However, because a large number of existing programs and documents use encodings for particular languages, such as GBK, Windows cannot drop support for the existing encodings and switch entirely to Unicode.



Windows uses code pages to adapt to different countries and regions. A code page can be understood as the internal code mentioned earlier. The code page corresponding to GBK is CP936.



Microsoft has also defined a code page for GB18030: CP54936. But because GB18030 includes four-byte codes, while a Windows code page only supports single-byte and double-byte encodings, this code page cannot really be used.



3. UCS-2, UCS-4, and the BMP
UCS comes in two forms: UCS-2 and UCS-4. As the names imply, UCS-2 encodes with two bytes and UCS-4 with four bytes (in fact only 31 bits are used; the highest bit must be 0). Let's do some simple math:



UCS-2 has 2^16 = 65,536 code points; UCS-4 has 2^31 = 2,147,483,648.



UCS-4 is divided into 2^7 = 128 groups according to the highest byte (whose top bit is 0). Each group is divided into 256 planes according to the next byte. Each plane is divided into 256 rows according to the third byte, and each row contains 256 cells. Cells in the same row differ only in the last byte.
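The group/plane/row/cell decomposition is just the four bytes of the UCS-4 value; a small Python helper (an illustration, not part of the original article) makes that concrete:

```python
def ucs4_parts(cp: int):
    """Split a 31-bit UCS-4 code point into (group, plane, row, cell)."""
    return (cp >> 24) & 0x7F, (cp >> 16) & 0xFF, (cp >> 8) & 0xFF, cp & 0xFF

# '汉' (0x00006C49) sits in group 0, plane 0 (the BMP),
# at row 0x6C, cell 0x49.
print(ucs4_parts(0x00006C49))
```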



Plane 0 of group 0 is called the Basic Multilingual Plane, or BMP. In other words, the UCS-4 code points whose two high bytes are 0 make up the BMP.



Removing the two leading zero bytes from a BMP code of UCS-4 yields the corresponding UCS-2 code; adding two zero bytes in front of a UCS-2 code yields the corresponding BMP code of UCS-4. The UCS specification current at the time allocated no characters outside of the BMP.



4. UTF encodings



UTF-8 encodes UCS in 8-bit units. The mapping from UCS-2 to UTF-8 is as follows:



UCS-2 encoding (hex)    UTF-8 byte stream (binary)
0000-007F               0xxxxxxx
0080-07FF               110xxxxx 10xxxxxx
0800-FFFF               1110xxxx 10xxxxxx 10xxxxxx



For example, the Unicode encoding of "汉" is 6C49. Since 6C49 falls between 0800 and FFFF, the three-byte template is used: 1110xxxx 10xxxxxx 10xxxxxx. Writing 6C49 in binary gives 0110 110001 001001; substituting these bits, from left to right, for the x's in the template yields 11100110 10110001 10001001, i.e. E6 B1 89.
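The template substitution above can be written directly in Python; this sketch covers only the three UCS-2 rows of the table:

```python
def utf8_encode_bmp(cp: int) -> bytes:
    """Encode a UCS-2 code point into UTF-8 using the three templates above."""
    if cp <= 0x7F:                                   # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:                                  # 110xxxxx 10xxxxxx
        return bytes([0xC0 | cp >> 6, 0x80 | cp & 0x3F])
    # 1110xxxx 10xxxxxx 10xxxxxx
    return bytes([0xE0 | cp >> 12, 0x80 | cp >> 6 & 0x3F, 0x80 | cp & 0x3F])

print(utf8_encode_bmp(0x6C49).hex())  # e6b189, matching the worked example
```

Comparing the result with Python's built-in codec (`'汉'.encode('utf-8')`) is an easy way to confirm the hand calculation.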



Readers can use Notepad to test whether our encoding is correct. Note that UltraEdit automatically converts UTF-8 encoded text files to UTF-16 when opening them, which can cause confusion; this option can be turned off in its settings. A better tool for the purpose is Hex Workshop.



UTF-16 encodes UCS in 16-bit units. For UCS codes below 0x10000, the UTF-16 encoding is simply the corresponding 16-bit unsigned integer. An algorithm is defined for UCS codes of 0x10000 and above, but since the UCS-2 actually in use, i.e. the BMP of UCS-4, lies entirely below 0x10000, UTF-16 and UCS-2 can for now be considered essentially the same thing. The difference is that UCS-2 is only an encoding scheme, whereas UTF-16 is used for actual transmission, so it has to face the question of byte order.



5. UTF byte order and the BOM
UTF-8 uses the byte as its coding unit, so it has no byte-order problem. UTF-16 uses two bytes as its coding unit, so before interpreting UTF-16 text, the byte order of its coding units must be established. For example, the Unicode encoding of "奎" (kui) is 594E and that of "乙" (yi) is 4E59. If we receive the UTF-16 byte stream 59 4E, is it "奎" or "乙"?
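The ambiguity is easy to reproduce in Python, where the byte order is chosen explicitly by codec name:

```python
stream = b'\x59\x4e'  # the UTF-16 byte stream from the example

as_be = stream.decode('utf-16-be')  # read as big endian
as_le = stream.decode('utf-16-le')  # read as little endian

print(hex(ord(as_be)))  # 0x594e, i.e. '奎' (kui)
print(hex(ord(as_le)))  # 0x4e59, i.e. '乙' (yi)
```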



The method the Unicode specification recommends for marking byte order is the BOM. BOM here is not the BOM of "Bill of Materials" but the Byte Order Mark. The BOM is a rather clever idea:



The UCS encoding includes a character called ZERO WIDTH NO-BREAK SPACE, whose code is FEFF. FFFE, by contrast, is not a character in UCS and should never appear in actual transmission. The UCS specification recommends transmitting the character ZERO WIDTH NO-BREAK SPACE before transmitting a byte stream.



Thus, if the receiver receives FEFF, the byte stream is big-endian; if it receives FFFE, the byte stream is little-endian. This is why the character ZERO WIDTH NO-BREAK SPACE is also called the BOM.



UTF-8 does not need a BOM to indicate byte order, but a BOM can be used to indicate the encoding. The UTF-8 encoding of the character ZERO WIDTH NO-BREAK SPACE is EF BB BF (readers can verify this with the encoding method described above). So if the receiver gets a byte stream beginning with EF BB BF, it knows the stream is UTF-8 encoded.
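Putting the marks from this section together, a BOM sniffer amounts to a few prefix comparisons. Python's `codecs` module ships the BOM constants, so a minimal sketch is:

```python
import codecs

def sniff_bom(data: bytes) -> str:
    """Guess a text file's encoding from its leading BOM, if any."""
    if data.startswith(codecs.BOM_UTF8):      # EF BB BF
        return 'utf-8'
    if data.startswith(codecs.BOM_UTF16_BE):  # FE FF
        return 'utf-16-be'
    if data.startswith(codecs.BOM_UTF16_LE):  # FF FE
        return 'utf-16-le'
    return 'unknown'

print(sniff_bom(b'\xef\xbb\xbf<html>'))  # utf-8
print(sniff_bom(b'\xfe\xff\x6c\x49'))    # utf-16-be
print(sniff_bom(b'\xff\xfe\x49\x6c'))    # utf-16-le
```

This is essentially what Notepad does when it reopens the files from question one.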



Windows uses a BOM to mark the way a text file is encoded.



6. Further references
The main reference for this article is "A Short Overview of ISO-IEC 10646 and Unicode" (http://www.nada.kth.se/i18n/ucs/unicode-iso10646-oview.html).



I also found two other materials that looked good, but since I had already answered my initial questions, I did not read them:



"Understanding Unicode: A General Introduction to the Unicode Standard" (http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&item_id=IWS-Chapter04a)
"Character Set Encoding Basics: Understanding Character Set Encodings and Legacy Encodings" (http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&item_id=IWS-Chapter03)
I have written software packages for converting among UTF-8, UCS-2, and GBK, including versions that use the Windows API and versions that do not. If I find time later, I will tidy them up and post them on my personal homepage (http://fmddlmyy.home4u.china.com).



I had worked out all the questions before starting to write this article, so I thought I could finish it quickly. In the end, weighing the wording and verifying the details took a long time; it was written from 1:30 to 9:00. I hope readers benefit from it.



Appendix 1: On location codes, GB2312, and code pages



Some friends had questions about this sentence in the article:
"The original GB2312 standard actually defines location codes; to get from a location code to the internal code, add A0 to the high byte and to the low byte respectively."



Let me explain in more detail:



"The original GB2312 standard" refers to the national standard of the People's Republic of China, GB 2312-80, "Code of Chinese Graphic Character Set for Information Interchange, Primary Set". This standard encodes each Chinese character and symbol with two numbers: the first is called the zone and the second the position, which is why it is also called the location (zone-position) code. Zones 1-9 hold Chinese symbols, zones 16-55 first-level Chinese characters, and zones 56-87 second-level Chinese characters. Windows still ships a location-code input method: typing 1601 produces "啊" (a). (This input method can automatically recognize both the hexadecimal GB2312 code and the decimal location code; that is, typing B0A1 also produces "啊".)



The internal code is the character encoding used inside the operating system. The internal codes of early operating systems were language dependent. Now that Windows supports Unicode internally and adapts to the various languages through code pages, the concept of an "internal code" has become blurred. Microsoft generally calls the encoding specified by the default code page the internal code.



There is no official definition of "internal code", and "code page" is just the term one company, Microsoft, happens to use. As programmers, as long as we know what these things are, there is no need to scrutinize the terminology.



A so-called code page is the character encoding for a particular language's text. For example, the code page of GBK is CP936, the code page of Big5 is CP950, and the code page of GB2312 is CP20936.



Windows has the notion of a default code page: the encoding used to interpret characters when no encoding is specified. For example, suppose Windows Notepad opens a text file whose contents are the byte stream BA BA D7 D6. How should Windows interpret it?



Should the stream be interpreted as Unicode, as GBK, as Big5, or according to ISO-8859-1? Interpreted as GBK, it yields the two characters "汉字" ("Chinese characters"). Under other encodings the corresponding characters may not exist, or the wrong characters may be found. "Wrong" here means characters inconsistent with the text author's intent; that is what produces garbled output.
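The example byte stream can be fed through Python's codecs to see each outcome (ISO-8859-1 is "latin-1" to Python):

```python
raw = b'\xba\xba\xd7\xd6'  # the byte stream from the example above

print(raw.decode('gbk'))      # the author's intent: '汉字'
print(raw.decode('latin-1'))  # four "wrong" accented characters: garbled text

# Interpreted as UTF-8 the stream is not even decodable,
# since 0xBA cannot be a leading byte:
try:
    raw.decode('utf-8')
except UnicodeDecodeError:
    print('not valid UTF-8')
```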



The answer is that Windows interprets the byte stream in a text file according to the current default code page, which can be set via Regional Options in Control Panel. The "ANSI" option in Notepad's Save As dialog actually means saving in the encoding of the default code page.



The internal code of Windows is Unicode, so technically it can support multiple code pages at the same time. As long as a file declares the encoding it uses, and the user has installed the corresponding code page, Windows can display it correctly. In an HTML file, for example, you can specify the charset.



Some HTML authors, especially English-speaking ones, assume that everyone in the world uses English and do not specify a charset in the file. If such an author uses characters in the range 0x80-0xFF, and a Chinese Windows system interprets them according to the default GBK code page, garbled text appears. The fix is simply to add a charset declaration to the HTML file, for example: <meta http-equiv="Content-Type" content="text/html; charset=iso8859-1">



As long as the code page the author originally used is compatible with ISO-8859-1, no garbled characters will appear.



Back to location codes: the location code of "啊" is 1601, which in hexadecimal is 0x10, 0x01. This conflicts with the computer's widely used ASCII encoding. To remain compatible with ASCII's 00-7F range, A0 is added to both the high byte and the low byte of the location code, so the code of "啊" becomes B0A1. The encoding obtained by adding the two A0s is also called the GB2312 code, although the original GB2312 standard does not mention this at all.
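The arithmetic is easy to verify in Python, whose gb2312 codec works directly on the internal code:

```python
zone, pos = 16, 1                         # location code 1601 for '啊'

inner = bytes([zone + 0xA0, pos + 0xA0])  # add A0 to each byte
print(inner.hex())                        # b0a1

print(inner.decode('gb2312'))             # '啊'
```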


