Character encoding: a detailed explanation of the difference and connection between Unicode and UTF-8

Source: Internet
Author: User
Tags: lowercase, character encoding

As we have already said, strings are also a data type, but what makes strings special is that they also involve an encoding problem.

Because a computer can only process numbers, text must first be converted to numbers before it can be handled. The earliest computers were designed with 8 bits (bit) as one byte, so the largest integer a single byte can represent is 255 (binary 11111111 = decimal 255); larger integers require more bytes. For example, the largest integer two bytes can represent is 65535, and the largest integer four bytes can represent is 4294967295.
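
As a rough illustration in Python (a minimal sketch; the variable names are only for demonstration):

    # Largest unsigned integers representable in 1, 2, and 4 bytes.
    one_byte_max = 2 ** 8 - 1     # 255
    two_byte_max = 2 ** 16 - 1    # 65535
    four_byte_max = 2 ** 32 - 1   # 4294967295
    print(one_byte_max, two_byte_max, four_byte_max)
    print(bin(one_byte_max))      # 0b11111111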

Since computers were invented by Americans, only 127 characters were encoded into the computer at first: upper- and lowercase English letters, digits, and some symbols. This encoding table is called ASCII; for example, uppercase A is encoded as 65 and lowercase z as 122.
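
In Python, for example, ord() and chr() convert between a character and its numeric code; a minimal sketch of the values just mentioned:

    # ord() returns a character's code, chr() converts a code back to a character.
    print(ord('A'))   # 65
    print(ord('z'))   # 122
    print(chr(65))    # 'A'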

But one byte is clearly not enough for Chinese, which needs at least two bytes per character and must not conflict with ASCII, so China developed the GB2312 encoding for Chinese characters.
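
As a small, hedged illustration in Python (it assumes the gb2312 codec that ships with CPython):

    # A Chinese character under GB2312 takes two bytes, both outside the ASCII range.
    data = '中'.encode('gb2312')
    print(data)       # b'\xd6\xd0'
    print(len(data))  # 2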

As you can imagine, there are hundreds of languages around the world: Japan encodes Japanese in Shift_JIS, South Korea encodes Korean in EUC-KR, and every country has its own standard. Conflicts are inevitable, and the result is that text mixing several languages is displayed as garbled characters.
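
The garbling is easy to reproduce in Python: bytes written under one national standard are misread under another (a sketch; the sample text and codec names are only illustrative):

    # Encode Chinese text with GB2312, then wrongly decode the bytes as Shift_JIS.
    data = '中文'.encode('gb2312')
    print(data.decode('shift_jis', errors='replace'))  # prints meaningless characters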

As a result, Unicode emerged. Unicode unifies all languages into a single character set, so the garbling problem disappears.

The Unicode standard is still evolving, but the most common form represents a character in two bytes (four bytes if a very rare character is used). Unicode is directly supported by modern operating systems and most programming languages.
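
"Two bytes per character, four for rare ones" corresponds roughly to the UTF-16 form of Unicode; Python can show the byte counts directly (a sketch assuming the standard utf-16-le codec):

    # A common character fits in two bytes of UTF-16; a rare one (an emoji) needs four.
    print(len('中'.encode('utf-16-le')))   # 2
    print(len('😀'.encode('utf-16-le')))   # 4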

Now, the difference between ASCII encoding and Unicode encoding: ASCII uses 1 byte per character, whereas Unicode usually uses 2 bytes.

The letter A encoded in ASCII is decimal 65, binary 01000001;

The character 0 encoded in ASCII is decimal 48, binary 00110000; note that the character '0' and the integer 0 are different;

The Chinese character 中 is beyond the range of ASCII encoding; its Unicode encoding is decimal 20013, binary 01001110 00101101.

You can guess that if the ASCII-encoded A is encoded in Unicode, you only need to pad zeros in front, so the Unicode encoding of A is 00000000 01000001.
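
You can verify these values in Python with ord() and bin() (a minimal sketch):

    # Code points and their binary forms for the three examples above.
    print(ord('A'), bin(ord('A')))    # 65 0b1000001
    print(ord('0'), bin(ord('0')))    # 48 0b110000
    print(ord('中'), bin(ord('中')))  # 20013 0b100111000101101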

A new problem then emerged: with everything unified into Unicode, the garbling problem disappears. However, if your text is almost entirely English, Unicode encoding needs twice as much storage space as ASCII encoding, which is very uneconomical for storage and transmission.

Therefore, in the spirit of saving space, the "variable-length encoding" UTF-8 appeared, which converts Unicode characters into bytes. UTF-8 encodes a Unicode character into 1-6 bytes depending on its numeric value: common English letters are encoded in 1 byte, a Chinese character usually takes 3 bytes, and only very rare characters are encoded into 4-6 bytes. If the text you want to transfer contains a large number of English characters, UTF-8 encoding saves space:

Character | ASCII    | Unicode           | UTF-8
A         | 01000001 | 00000000 01000001 | 01000001
中        | n/a      | 01001110 00101101 | 11100100 10111000 10101101
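
Python's str.encode() reproduces the UTF-8 column of the table (a small sketch):

    # UTF-8 is variable-length: 'A' takes one byte, '中' takes three.
    print('A'.encode('utf-8'))    # b'A' (1 byte, identical to ASCII)
    print('中'.encode('utf-8'))   # b'\xe4\xb8\xad' (3 bytes)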

You can also see from the table above that UTF-8 has an added benefit: ASCII encoding can actually be viewed as a subset of UTF-8, so a large amount of legacy software that only supports ASCII can continue to work under UTF-8 encoding.
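
This backward compatibility is easy to check: bytes that are valid ASCII decode identically as UTF-8 (a sketch):

    # Pure-ASCII bytes are also valid UTF-8 and decode to the same text.
    legacy = b'Hello, world'
    print(legacy.decode('ascii') == legacy.decode('utf-8'))  # True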

Having figured out the relationship between ASCII, Unicode, and UTF-8, we can summarize how character encoding commonly works in computer systems today:

In computer memory, Unicode encoding is used uniformly; it is converted to UTF-8 when text needs to be saved to disk or transmitted.
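
In Python terms, text held in memory is a str (Unicode), while what goes to disk or over the network is bytes; encode() and decode() convert between the two (a minimal sketch):

    # str (Unicode in memory)  <->  bytes (UTF-8 on disk or on the wire)
    text = '中文abc'
    raw = text.encode('utf-8')    # for saving or transmitting
    back = raw.decode('utf-8')    # when reading it back in
    print(raw, back == text)      # b'\xe4\xb8\xad\xe6\x96\x87abc' True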

When you edit with Notepad, the UTF-8 bytes read from the file are converted to Unicode characters in memory; when you finish editing and save, the Unicode characters are converted back to UTF-8 and written to the file.
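
The same round trip written in Python (a sketch; 'note.txt' is only a hypothetical file name):

    # Write a Unicode str to disk as UTF-8, then read it back into a str.
    with open('note.txt', 'w', encoding='utf-8') as f:
        f.write('text being edited: 中文')   # encoded to UTF-8 bytes on save

    with open('note.txt', 'r', encoding='utf-8') as f:
        text = f.read()                       # decoded back into a Unicode str
    print(text)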

When you browse the web, the server converts dynamically generated Unicode content to UTF-8 and then transmits it to the browser.
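
A server performs the same conversion before sending the response; a minimal sketch using Python's built-in http.server (the handler and port are only illustrative):

    from http.server import BaseHTTPRequestHandler, HTTPServer

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            # Dynamically generated Unicode content, converted to UTF-8 bytes.
            body = '<html><body>你好, Unicode!</body></html>'.encode('utf-8')
            self.send_response(200)
            self.send_header('Content-Type', 'text/html; charset=UTF-8')
            self.send_header('Content-Length', str(len(body)))
            self.end_headers()
            self.wfile.write(body)  # the browser receives UTF-8 bytes

    # HTTPServer(('localhost', 8000), Handler).serve_forever()  # uncomment to run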

That is why the source of many web pages contains information like <meta charset="UTF-8" />, indicating that the page uses UTF-8 encoding.
