Strings and encoding
As mentioned earlier, strings are a data type like any other, but they come with a special problem: encoding.
Because a computer can only process numbers, text must be converted to numbers before it can be handled. The earliest computers were designed with 8 bits as one byte, so the largest integer a single byte can represent is 255 (binary 11111111 = decimal 255); to represent larger integers, more bytes are needed. For example, the largest integer two bytes can represent is 65535, and four bytes can represent up to 4294967295.
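These limits are easy to verify in Python, the language this tutorial uses: the largest unsigned integer that fits in n bytes is 2**(8*n) - 1.

>>> 2 ** 8 - 1    # 1 byte
255
>>> 2 ** 16 - 1   # 2 bytes
65535
>>> 2 ** 32 - 1   # 4 bytes
4294967295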
Since computers were invented in the United States, only 127 characters were encoded into computers at first: upper- and lowercase English letters, digits, and some symbols. This encoding table is called ASCII; for example, the uppercase letter A is encoded as 65, and the lowercase letter z as 122.
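You can check these values yourself with Python's built-in ord() and chr(), which convert between a character and its numeric encoding:

>>> ord('A')
65
>>> ord('z')
122
>>> chr(65)
'A'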
But one byte is clearly not enough to handle Chinese; at least two bytes are needed, and the encoding must not conflict with ASCII. So China devised the GB2312 encoding to fit Chinese characters in.
As you can imagine, there are hundreds of languages in the world. Japan encoded Japanese into Shift_JIS, Korea encoded Korean into EUC-KR, and every country had its own standard. Conflicts were inevitable, and the result was that text mixing multiple languages would display as garbled characters.
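As a minimal sketch of that clash in Python: bytes produced under one national encoding, decoded under another, come out as gibberish. Here the GB2312 bytes for '中文' are misread as Latin-1:

>>> gb = '中文'.encode('gb2312')
>>> gb
b'\xd6\xd0\xce\xc4'
>>> gb.decode('latin-1')   # wrong codec: classic garbled output
'ÖÐÎÄ'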
As a result, Unicode emerged. Unicode unifies all languages into a single encoding, so the garbling problem disappears.
The Unicode standard is still evolving, but the most common form represents a character with two bytes (four bytes for very rare characters). Modern operating systems and most programming languages support Unicode directly.
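You can observe both sizes in Python through the big-endian UTF-16 codec, which writes exactly this two-byte form (and four bytes, a surrogate pair, for rare characters; the emoji here is just one example of such a character):

>>> len('中'.encode('utf-16-be'))    # a common character: 2 bytes
2
>>> len('😀'.encode('utf-16-be'))    # a rare character outside the basic range: 4 bytes
4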
Now let's sort out the difference between ASCII and Unicode: ASCII uses one byte per character, while Unicode usually uses two.
The letter A is decimal 65 in ASCII, binary 01000001;
The character 0 is decimal 48 in ASCII, binary 00110000; note that the character '0' and the integer 0 are different;
The Chinese character 中 is beyond the range of ASCII; its Unicode encoding is decimal 20013, binary 01001110 00101101.
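These numbers can be verified with ord() and bin(); in particular, the character '0' and the integer 0 really are different things:

>>> ord('0')         # the character '0' has encoding 48...
48
>>> ord('0') == 0    # ...which is not the integer 0
False
>>> ord('中')
20013
>>> bin(ord('中'))   # leading zeros are not shown
'0b100111000101101'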
You can guess that to convert A's ASCII encoding to Unicode, you only need to pad zeros in front, so the Unicode encoding of A is 00000000 01000001.
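In Python this can be seen with the big-endian UTF-16 codec, which emits the plain two-byte Unicode value: a zero byte followed by the ASCII byte for A:

>>> 'A'.encode('utf-16-be')   # 0x00 0x41, i.e. 00000000 01000001
b'\x00A'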
A new problem then arises: if everything were unified under Unicode, garbled text would disappear. However, if your text is almost entirely English, Unicode encoding needs twice the storage of ASCII, which is wasteful for storage and transmission.
Therefore, in the spirit of economy, UTF-8 appeared: a "variable-length encoding" of Unicode. UTF-8 encodes a Unicode character into 1 to 6 bytes depending on the size of its number: common English letters are encoded in one byte, a Chinese character usually takes three bytes, and only very rare characters are encoded into 4 to 6 bytes. If the text you need to transmit contains a lot of English characters, UTF-8 encoding saves space:
character | ASCII    | Unicode           | UTF-8
A         | 01000001 | 00000000 01000001 | 01000001
中        | ×        | 01001110 00101101 | 11100100 10111000 10101101
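The byte counts in the table can be reproduced with str.encode() in Python:

>>> 'A'.encode('utf-8')     # 1 byte
b'A'
>>> '中'.encode('utf-8')    # 3 bytes
b'\xe4\xb8\xad'
>>> len('中'.encode('utf-8'))
3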
The table also reveals an added benefit of UTF-8: ASCII encoding can in fact be regarded as a subset of UTF-8, so a large amount of legacy software that supports only ASCII can keep working under UTF-8.
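A quick demonstration of this compatibility: bytes encoded as ASCII decode unchanged under UTF-8:

>>> data = 'Hello'.encode('ascii')
>>> data
b'Hello'
>>> data.decode('utf-8')   # ASCII bytes are already valid UTF-8
'Hello'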
Having clarified the relationship between ASCII, Unicode, and UTF-8, we can summarize how character encodings commonly work in computer systems today:
In computer memory, Unicode is used uniformly; when text needs to be saved to the hard disk or transmitted, it is converted to UTF-8.
When you edit with Notepad, for example, the UTF-8 text read from the file is converted to Unicode once it is in memory; when you finish editing, the Unicode text is converted back to UTF-8 and saved to the file.
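A minimal sketch of the same round trip in Python (the file name notes.txt is made up for illustration): passing encoding='utf-8' to open() makes Python do both conversions at the disk boundary, while the str object in memory is Unicode:

>>> with open('notes.txt', 'w', encoding='utf-8') as f:
...     f.write('中文')   # Unicode str in memory -> UTF-8 bytes on disk
...
2
>>> with open('notes.txt', encoding='utf-8') as f:
...     f.read()          # UTF-8 bytes on disk -> Unicode str in memory
...
'中文'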
When you browse the web, the server converts dynamically generated Unicode content to UTF-8 before transmitting it to the browser.
That is why the source code of many web pages contains a line like <meta charset="UTF-8" />, indicating that the page uses exactly the UTF-8 encoding.
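As a rough sketch of the server side (not this tutorial's own code; the port and page content are invented for illustration), a handler built on Python's standard http.server module does exactly this: build the page as a Unicode str, then encode it to UTF-8 bytes just before sending:

from http.server import BaseHTTPRequestHandler, HTTPServer

class Utf8Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # The page is an ordinary Unicode str while it is being built
        html = '<html><head><meta charset="UTF-8" /></head><body>你好</body></html>'
        body = html.encode('utf-8')   # convert to UTF-8 bytes for transmission
        self.send_response(200)
        self.send_header('Content-Type', 'text/html; charset=utf-8')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

HTTPServer(('localhost', 8000), Utf8Handler).serve_forever()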