Character encoding

As we've already said, strings are also a data type, but what makes strings special is the question of encoding.

Because computers can only handle numbers, text must be converted to numbers before it can be processed. The earliest computers were designed with 8 bits to a byte, so the largest integer a single byte can represent is 255 (binary 11111111 = decimal 255); representing a larger integer takes more bytes. For example, the largest integer two bytes can represent is 65535, and the largest for 4 bytes is 4294967295.

Since the computer was invented by Americans, at first only 128 characters (codes 0 to 127) were encoded: uppercase and lowercase English letters, digits, and some symbols. This encoding table is called ASCII; for example, the code for capital A is 65 and the code for lowercase z is 122.

But one byte is obviously not enough for Chinese. At least two bytes are needed, and the encoding must not conflict with ASCII, so China developed the GB2312 encoding for Chinese characters.

As you can imagine, there are hundreds of languages in the world. Japan encodes Japanese in Shift_JIS, South Korea encodes Korean in EUC-KR, and each country's standard inevitably conflicts with the others. The result is garbled characters in text that mixes several languages.

Unicode emerged to solve this. Unicode unifies all languages into a single set of encodings, so the garbling problem goes away.

The Unicode standard keeps evolving, but the most common form uses two bytes per character (four bytes for very rare characters). Modern operating systems and most programming languages support Unicode directly.

Now let's sort out the difference between ASCII and Unicode: an ASCII code is 1 byte, while a Unicode code is usually 2 bytes.
The letter A in ASCII is decimal 65, binary 01000001.

The character '0' in ASCII is decimal 48, binary 00110000. Note that the character '0' and the integer 0 are different.

The Chinese character 中 is outside the ASCII range; its Unicode code is decimal 20013, binary 01001110 00101101.

You can guess that to express the ASCII-encoded A in Unicode you only need to pad it with zeros in front, so the Unicode code for A is 00000000 01000001.

A new problem arises: with everything unified under Unicode, garbled text disappears. However, if the text you write is almost entirely English, Unicode needs twice as much storage as ASCII, which is wasteful for both storage and transmission.

So, in the spirit of economy, Unicode is converted into the "variable-length" UTF-8 encoding. UTF-8 encodes a Unicode character into 1 to 4 bytes depending on the size of its number: common English letters take 1 byte, a Chinese character usually takes 3 bytes, and only very rare characters take 4 bytes. If the text you want to transfer contains a large number of English characters, UTF-8 saves space (x means the character cannot be represented):

| Character | ASCII | Unicode | UTF-8 |
| A | 01000001 | 00000000 01000001 | 01000001 |
| 中 | x | 01001110 00101101 | 11100100 10111000 10101101 |
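The bit patterns above can be checked in Python itself. This is a minimal sketch assuming Python 3, where `ord()` returns a character's Unicode code point and `str.encode()` produces its UTF-8 bytes; the helper name `bits` is made up for illustration.

```python
def bits(data: bytes) -> str:
    """Render each byte as eight binary digits (hypothetical helper)."""
    return ' '.join(format(b, '08b') for b in data)


for ch in ('A', '中'):
    code_point = ord(ch)            # Unicode code point as an integer
    utf8 = ch.encode('utf-8')       # variable-length UTF-8 bytes
    # Columns: character, decimal code point, 16-bit code point, UTF-8 bytes
    print(ch, code_point, format(code_point, '016b'), bits(utf8))
```

The letter A stays a single byte in UTF-8, while 中 grows from a two-byte code point to three UTF-8 bytes.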
Be aware of the distinction between 'abc' and b'abc': the former is a str, while the latter is bytes. The two display the same content, but each character of a bytes value occupies only one byte.

A str, which is represented in Unicode, can be encoded into bytes in a specified encoding with the encode() method, for example:

    >>> 'abc'.encode('ascii')
    b'abc'
    >>> '中文'.encode('utf-8')
    b'\xe4\xb8\xad\xe6\x96\x87'
    >>> '中文'.encode('ascii')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)

A str of pure English can be encoded to bytes with ASCII, and the content stays the same. A str containing Chinese can be encoded to bytes with UTF-8. A str containing Chinese cannot be encoded with ASCII, because Chinese characters fall outside the ASCII range, and Python raises an error.

In a bytes value, bytes that cannot be displayed as ASCII characters are shown as \x##.

Conversely, if we read a byte stream from the network or from disk, the data we get is bytes. To turn bytes into a str, use the decode() method:

    >>> b'abc'.decode('ascii')
    'abc'
    >>> b'\xe4\xb8\xad\xe6\x96\x87'.decode('utf-8')
    '中文'

To count how many characters a str contains, use the len() function:

    >>> len('ABC')
    3
    >>> len('中文')
    2

For a str, len() counts characters; for bytes, len() counts bytes:

    >>> len(b'ABC')
    3
    >>> len(b'\xe4\xb8\xad\xe6\x96\x87')
    6
    >>> len('中文'.encode('utf-8'))
    6

As you can see, one Chinese character encoded in UTF-8 typically takes 3 bytes, while one English character takes only 1 byte.

When working with strings, we often need to convert between str and bytes. To avoid garbled text, always use UTF-8 when converting between str and bytes.
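The conversions described above can be combined into one round trip. This is a minimal sketch assuming Python 3; the variable names are illustrative, and the `errors='replace'` argument is an addition not covered in the text, shown only as one way to avoid the exception at the cost of losing data.

```python
text = '中文'

data = text.encode('utf-8')        # str -> bytes
restored = data.decode('utf-8')    # bytes -> str, same text back

# len() counts characters for str and bytes for bytes
print(len(text), len(data))

# ASCII cannot represent Chinese characters, so encoding raises an error
failed = False
try:
    text.encode('ascii')
except UnicodeEncodeError:
    failed = True

# errors='replace' substitutes '?' for each character that cannot be encoded
lossy = text.encode('ascii', errors='replace')
```

Because the replaced form cannot be decoded back to the original text, it is only suitable where losing characters is acceptable.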
Because Python source code is itself a text file, a source file that contains Chinese must be saved in UTF-8. So that the Python interpreter reads the source as UTF-8, we usually write these two lines at the beginning of the file:

    #!/usr/bin/env python3
    # -*- coding: utf-8 -*-

The first comment line tells Linux/OS X that this is an executable Python program; Windows ignores it.

The second comment line tells the Python interpreter to read the source code as UTF-8; otherwise the Chinese you write in the source may come out garbled.

Declaring UTF-8 does not by itself make your .py file UTF-8 encoded. You must also make sure your text editor saves the file as UTF-8 without BOM.

If the .py file itself is UTF-8 encoded and also declares # -*- coding: utf-8 -*-, then running it from a command prompt should display Chinese correctly.

Formatting

The last common question is how to output a formatted string. We often output strings like 'Dear xxx, hello! Your bill for month xx is xx, and your balance is xx', where the xxx parts vary with the values of variables, so we need a simple way to format strings.

In Python, formatting works the same way as in C and is done with the % operator, for example:

    >>> 'Hello, %s' % 'world'
    'Hello, world'
    >>> 'Hi, %s, you have $%d.' % ('Michael', 1000000)
    'Hi, Michael, you have $1000000.'

As you may have guessed, the % operator formats the string. Inside the string, %s is replaced with a string and %d is replaced with an integer. There must be as many variables or values after the operator as there are % placeholders, in matching order. If there is only one placeholder, the parentheses can be omitted.

Common placeholders are:
| Placeholder | Replaced with |
| %d | Integer |
| %f | Floating point number |
| %s | String |
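The placeholders can be tried together in a short script. This is a minimal sketch assuming Python 3; the variable names and values are made up for illustration, and the `%.2f` precision form is an addition not covered in the text.

```python
name = 'Michael'
balance = 1000000
rate = 3.14159

# A single value does not need to be wrapped in parentheses
line1 = 'Hello, %s' % name

# Several values are passed as a tuple, in placeholder order
line2 = 'Hi, %s, you have $%d.' % (name, balance)

# %f prints six digits after the decimal point by default
line3 = 'rate: %f' % rate

# A precision such as .2 limits the digits after the decimal point
line4 = 'rate: %.2f' % rate

for line in (line1, line2, line3, line4):
    print(line)
```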