Character encoding
String encoding problem.
Because a computer can only handle numbers, text must be converted to numbers before it can be processed. The earliest computers were designed with 8 bits to a byte, so the largest integer a single byte can represent is 255 (binary 11111111 = decimal 255).
To represent a larger integer, you must use more bytes. For example, the largest integer two bytes can represent is 65535, and the largest integer 4 bytes can represent is 4294967295.
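The byte limits above follow from one formula: n bytes hold 8*n bits, so the largest unsigned integer is 2**(8*n) - 1. A minimal sketch, where the helper name max_unsigned is made up for illustration:

```python
def max_unsigned(num_bytes):
    """Largest unsigned integer that fits in num_bytes bytes."""
    return 2 ** (8 * num_bytes) - 1

# One, two and four bytes give the three limits quoted in the text.
limits = [max_unsigned(n) for n in (1, 2, 4)]

for n, limit in zip((1, 2, 4), limits):
    print('%d byte(s): %d' % (n, limit))
```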
Since the computer was invented by Americans, only 128 characters were originally encoded (codes 0 through 127): upper- and lowercase English letters, digits, and some symbols.
This code table is called ASCII. For example, the code for capital A is 65, and the code for lowercase z is 122.
One byte is clearly not enough to handle Chinese, however; at least two bytes are needed, and the codes must not conflict with ASCII.
So China developed the GB2312 encoding to cover Chinese characters.
As you can imagine, there are hundreds of languages in the world: Japan encoded Japanese in Shift_JIS, and Korea encoded Korean in EUC-KR.
With each country defining its own standard, conflicts are inevitable, and the result is that text mixing several languages is displayed as garbled characters.
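The conflict between national standards can be reproduced in a few lines. The sketch below is only an illustration: it encodes two Chinese characters with GB2312 and then decodes the same bytes with Shift_JIS. The variable names are made up, and \u escapes are used so the snippet runs unchanged on Python 2.7 and Python 3.

```python
text = u'\u4e2d\u6587'  # the two characters meaning "Chinese"

# Encode with the Chinese national standard.
gb_bytes = text.encode('gb2312')

# Decoding with the matching table recovers the original text.
correct = gb_bytes.decode('gb2312')

# Decoding the same bytes with the Japanese table raises no error here;
# it silently yields unrelated characters, i.e. garbled text.
garbled = gb_bytes.decode('shift_jis')

print(correct == text)  # True
print(garbled == text)  # False
```

The bytes themselves carry no label saying which table produced them, which is why mixed-language text breaks.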
Unicode was created to solve this: it unifies all languages into a single code set, so the garbling problem goes away.
The Unicode standard keeps evolving, but a character is most commonly represented with two bytes (four bytes for very rare characters).
Unicode is supported directly by modern operating systems and most programming languages.
The difference between ASCII and Unicode is now clear: an ASCII code takes 1 byte, while a Unicode code usually takes 2 bytes.
The letter A in ASCII is decimal 65, binary 01000001.
The character 0 in ASCII is decimal 48, binary 00110000; note that the character '0' and the integer 0 are different.
The Chinese character 中 is beyond the ASCII range; its Unicode code is decimal 20013, binary 01001110 00101101.
As you might guess, to express the ASCII-encoded A in Unicode you only need to pad zeros in front, so the Unicode encoding of A is 00000000 01000001.
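These numbers can be checked with ord() and format(). In this sketch the helper name code_point_bits is made up, and the 16-bit width simply mirrors the two-byte form used in the text.

```python
def code_point_bits(ch):
    """Return (decimal code, 16-bit binary string) for one character."""
    code = ord(ch)
    return code, format(code, '016b')

# u'...' literals and \u escapes keep this runnable on Python 2.7 and 3.
for ch in (u'A', u'0', u'\u4e2d'):
    code, bits = code_point_bits(ch)
    print('%d %s' % (code, bits))
# 65 0000000001000001
# 48 0000000000110000
# 20013 0100111000101101
```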
A new problem then arises: if everything is unified on Unicode, the garbling problem disappears, but if the text you write is almost entirely English,
Unicode takes twice the storage space of ASCII, which is wasteful for storage and transmission.
So, in the spirit of economy, Unicode is converted to a "variable-length encoding" called UTF-8. UTF-8 encodes each Unicode character into 1 to 4 bytes depending on its numeric value (the original design allowed up to 6): common English letters take 1 byte, Chinese characters usually take 3 bytes, and only very rare characters take 4 bytes. If the text you want to transfer contains a large number of English characters, UTF-8 saves space:
Character  ASCII     Unicode            UTF-8
A          01000001  00000000 01000001  01000001
中         x         01001110 00101101  11100100 10111000 10101101
The table above also shows an added benefit of UTF-8: ASCII can be seen as a subset of UTF-8,
so a large amount of legacy software that only supports ASCII can continue to work with UTF-8.
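The UTF-8 column, the ASCII-subset property, and the space saving can all be checked with a short sketch. The helper name utf8_bits is made up, and 'utf-16-be' is used here only as a stand-in for the fixed two-bytes-per-character form described earlier; that choice is an assumption, not something the text specifies.

```python
def utf8_bits(ch):
    """Return the UTF-8 bytes of a character as space-separated bit strings."""
    return ' '.join(format(b, '08b') for b in bytearray(ch.encode('utf-8')))

print(utf8_bits(u'A'))       # 01000001
print(utf8_bits(u'\u4e2d'))  # 11100100 10111000 10101101

# Pure ASCII text produces identical bytes under ASCII and UTF-8.
english = u'hello'
same_bytes = english.encode('ascii') == english.encode('utf-8')

# Size of the same English text in UTF-8 versus two bytes per character.
utf8_size = len(english.encode('utf-8'))
two_byte_size = len(english.encode('utf-16-be'))
print('%d bytes vs %d bytes' % (utf8_size, two_byte_size))  # 5 bytes vs 10 bytes
```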
Having sorted out the relationship between ASCII, Unicode, and UTF-8, we can summarize how character encoding commonly works in computer systems today:
In memory, Unicode is used uniformly; when text needs to be saved to disk or transmitted, it is converted to UTF-8.
When you edit a file with Notepad, the UTF-8 characters read from the file are converted to Unicode in memory, and when you finish editing and save, the Unicode is converted back to UTF-8 and written to the file.
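The same load-and-save cycle can be sketched in Python. This is only an illustration of the idea, not how Notepad is implemented: io.open with an explicit encoding does the conversion at the file boundary, and the scratch file demo.txt is a made-up name in a temporary directory.

```python
import io
import os
import shutil
import tempfile

text = u'\u4e2d\u6587 ABC'  # Unicode text held in memory

# Hypothetical scratch file, created only for this demonstration.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'demo.txt')

# Saving: Unicode in memory -> UTF-8 bytes on disk.
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(text)

# What is actually stored on disk is a UTF-8 byte sequence.
with io.open(path, 'rb') as f:
    raw = f.read()

# Loading: UTF-8 bytes on disk -> Unicode in memory.
with io.open(path, 'r', encoding='utf-8') as f:
    loaded = f.read()

shutil.rmtree(tmp_dir)  # clean up the scratch directory

print(len(raw))        # 10 (3 + 3 bytes for the Chinese, 4 for ' ABC')
print(loaded == text)  # True
```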
This is why the source of many web pages contains something like <meta charset="UTF-8" />, indicating that the page is encoded in UTF-8.
Python strings
Because Python was born before the Unicode standard was released, the earliest Python supported only ASCII, and an ordinary string such as 'ABC' is ASCII-encoded inside Python.
Python provides the ord() and chr() functions to convert between a character and its numeric code:
>>> ord('A')
65
>>> chr(65)
'A'
Python later added support for Unicode. In Python 2, a Unicode string is written as u'...', for example:
>>> print u'中文'
中文
>>> u'中'
u'\u4e2d'
Writing u'中' and u'\u4e2d' is equivalent: \u is followed by the hexadecimal Unicode code. Likewise, u'A' and u'\u0041' are the same.
How do the two kinds of string convert to each other? Although the string 'xxx' is ASCII-encoded, it can also be regarded as UTF-8-encoded, whereas u'xxx' can only be Unicode.
Use the encode('utf-8') method to convert u'xxx' to the UTF-8-encoded 'xxx':
>>> u'ABC'.encode('utf-8')
'ABC'
>>> u'中文'.encode('utf-8')
'\xe4\xb8\xad\xe6\x96\x87'
For English characters, the UTF-8 value is the same as the Unicode value (though the storage size differs), whereas each Chinese Unicode character becomes 3 UTF-8 bytes. The \xe4 you see is one of those bytes: its value is 228, which has no corresponding printable letter, so the byte is displayed in hexadecimal. The len() function returns the length of a string:
>>> len(u'ABC')
3
>>> len('ABC')
3
>>> len(u'中文')
2
>>> len('\xe4\xb8\xad\xe6\x96\x87')
6
Conversely, use the decode('utf-8') method to convert a UTF-8-encoded string 'xxx' to the Unicode string u'xxx':
>>> 'abc'.decode('utf-8')
u'abc'
>>> '\xe4\xb8\xad\xe6\x96\x87'.decode('utf-8')
u'\u4e2d\u6587'
>>> print '\xe4\xb8\xad\xe6\x96\x87'.decode('utf-8')
中文
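The encode and decode steps above can be put together as a round trip. The interpreter sessions in this article are Python 2; the sketch below is an assumption-labelled variant that uses u'...' and b'...' literals so that it also runs on Python 3, where the ordinary string type is already Unicode. The variable names are illustrative.

```python
text = u'\u4e2d\u6587'  # Unicode string: 2 characters

# Unicode -> UTF-8 bytes
data = text.encode('utf-8')

# UTF-8 bytes -> Unicode
restored = data.decode('utf-8')

# len() counts characters for Unicode strings and bytes for byte strings.
print('%d characters, %d bytes' % (len(text), len(data)))  # 2 characters, 6 bytes
print(data == b'\xe4\xb8\xad\xe6\x96\x87')  # True
print(restored == text)                     # True
```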
This article is from the "Home of Linux" blog, please be sure to keep this source http://pyhton.blog.51cto.com/8763915/1688489