Python Basics: Character Encoding

Last Update:2017-05-06 Source: Internet

Author: User


Character encoding

As we have already said, a string is also a data type, but strings are special because they raise the problem of encoding.

Because computers can only handle numbers, text must be converted to numbers before it can be processed. The earliest computers were designed with 8 bits to a byte, so the largest integer a single byte can represent is 255 (binary 11111111 = decimal 255); representing a larger integer takes more bytes. For example, the largest integer two bytes can represent is 65535, and the largest integer four bytes can represent is 4294967295.

Since the computer was invented by Americans, at first only 128 characters (codes 0 through 127) were encoded: uppercase and lowercase English letters, digits, and some symbols. This encoding table is called ASCII. For example, the code for uppercase A is 65 and the code for lowercase z is 122.

One byte is clearly not enough for Chinese. At least two bytes are needed, and they must not conflict with ASCII, so China developed the GB2312 encoding to cover Chinese characters.

As you can imagine, there are hundreds of languages around the world. Japan encoded Japanese in Shift_JIS, South Korea encoded Korean in EUC-KR, and each country's standard inevitably conflicts with the others. The result is that text mixing several languages shows up garbled.

This is why Unicode emerged. Unicode unifies all languages into a single set of codes, so the garbling problem goes away.

The Unicode standard keeps evolving, but the most common form represents a character with two bytes (four bytes for very rare characters). Unicode is supported directly by modern operating systems and most programming languages.

Now, to spell out the difference between ASCII and Unicode: ASCII uses 1 byte per character, while Unicode usually uses 2 bytes.
The letter A in ASCII is decimal 65, binary 01000001.

The character '0' in ASCII is decimal 48, binary 00110000. Note that the character '0' and the integer 0 are different.

The Chinese character 中 is outside the ASCII range. Its Unicode code is decimal 20013, binary 01001110 00101101.

You can guess that to express the ASCII A in Unicode you only need to pad it with zeros in front, so the Unicode code for A is 00000000 01000001.

A new problem arises: unifying on Unicode makes the garbling disappear, but if your text is almost entirely English, Unicode takes twice the storage space of ASCII, which is wasteful for storage and transmission.

So, in the spirit of economy, UTF-8 appeared, a "variable-length encoding" of Unicode. UTF-8 encodes a Unicode character into 1 to 4 bytes depending on its code value (the original design allowed up to 6 bytes, but the current standard limits it to 4): common English letters take 1 byte, Chinese characters usually take 3 bytes, and only rare characters take 4 bytes. If the text you transmit contains a large amount of English, UTF-8 saves space:

character	ASCII	Unicode	UTF-8
A	01000001	00000000 01000001	01000001
中	x	01001110 00101101	11100100 10111000 10101101
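The rows of the table can be reproduced in Python. The following is only a minimal sketch: `table_row` is a hypothetical helper name, and it assumes the character's code fits in two bytes, as in the article's simplified view of Unicode.

```python
def table_row(ch):
    """Return (code, two-byte Unicode bits, UTF-8 bits) for one character.

    Hypothetical helper for illustration; assumes the code fits in 16 bits.
    """
    code = ord(ch)                                   # integer code of the character
    two_byte = format(code, "016b")                  # zero-padded to 16 binary digits
    unicode_bits = two_byte[:8] + " " + two_byte[8:]
    # Encode to UTF-8 and show each resulting byte as 8 binary digits.
    utf8_bits = " ".join(format(b, "08b") for b in ch.encode("utf-8"))
    return code, unicode_bits, utf8_bits


for ch in ("A", "中"):
    print(ch, *table_row(ch))
```

The letter A keeps its single ASCII byte in UTF-8, while 中 grows from two bytes to three.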

The table above also shows an added benefit of UTF-8: ASCII can be seen as a subset of UTF-8, so a large amount of legacy software that only supports ASCII can continue to work with UTF-8.

Having sorted out the relationship between ASCII, Unicode, and UTF-8, we can summarize how character encoding commonly works in today's computer systems:

In memory, text is held uniformly as Unicode. When it needs to be saved to disk or transmitted, it is converted to UTF-8.

When you edit with Notepad, the UTF-8 characters read from the file are converted to Unicode in memory. When you finish editing and save, Unicode is converted back to UTF-8 and written to the file.

When you browse a web page, the server converts dynamically generated Unicode content to UTF-8 before sending it to the browser. That is why the source of many web pages contains something like <meta charset="UTF-8" />, which indicates that the page uses UTF-8.

Python strings

With the annoying character encoding problem sorted out, let's look at Python strings.

In the latest version, Python 3, strings are Unicode, which means Python strings support multiple languages, for example:

  >>> print('包含中文的str')
  包含中文的str

For a single character, Python provides the ord() function to get the integer code of the character, and the chr() function to convert a code to the corresponding character:

  >>> ord('A')
  65
  >>> ord('中')
  20013
  >>> chr(66)
  'B'
  >>> chr(25991)
  '文'

If you know a character's integer code, you can also write the str in hexadecimal:

  >>> '\u4e2d\u6587'
  '中文'

The two notations are completely equivalent.

Because Python's string type is str, which is represented in memory as Unicode, one character corresponds to several bytes.
To transmit a str over the network or save it to disk, you need to convert it to bytes, a type whose unit is the byte. Python represents bytes data with single or double quotes carrying a b prefix:

  x = b'ABC'

Note the difference between 'ABC' and b'ABC': the former is a str, and although the latter shows the same content, each character of a bytes object occupies only one byte.

A str, expressed in Unicode, can be encoded into bytes with the encode() method by specifying an encoding, for example:

  >>> 'ABC'.encode('ascii')
  b'ABC'
  >>> '中文'.encode('utf-8')
  b'\xe4\xb8\xad\xe6\x96\x87'
  >>> '中文'.encode('ascii')
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)

A str of pure English can be ASCII-encoded to bytes with the same content, and a str containing Chinese can be UTF-8-encoded to bytes. A str containing Chinese cannot be ASCII-encoded, because the codes of Chinese characters exceed the ASCII range, so Python raises an error.

In bytes, any byte that cannot be displayed as an ASCII character is shown as \x##.

Conversely, if we read a byte stream from the network or from disk, the data we get is bytes. To turn bytes into str, use the decode() method:

  >>> b'ABC'.decode('ascii')
  'ABC'
  >>> b'\xe4\xb8\xad\xe6\x96\x87'.decode('utf-8')
  '中文'

To count how many characters a str contains, use the len() function:

  >>> len('ABC')
  3
  >>> len('中文')
  2

len() counts characters for a str. Applied to bytes, it counts bytes:

  >>> len(b'ABC')
  3
  >>> len(b'\xe4\xb8\xad\xe6\x96\x87')
  6
  >>> len('中文'.encode('utf-8'))
  6

As you can see, one Chinese character typically takes 3 bytes in UTF-8, while one English character takes only 1 byte.

When working with strings, we often convert between str and bytes. To avoid garbling, always stick with UTF-8 for these conversions.
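The conversions described above can be put together in one short script. This is a sketch rather than a recipe: the sample string and the variable names are made up for illustration.

```python
text = "中文ABC"                      # hypothetical sample mixing Chinese and English

data = text.encode("utf-8")          # str -> bytes, e.g. before writing to disk
restored = data.decode("utf-8")      # bytes -> str, e.g. after reading it back

print(len(text), len(data))          # number of characters vs. number of bytes
print(restored == text)              # the round trip preserves the text

# ASCII cannot represent the Chinese characters, so encoding raises an error.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)
```

The two Chinese characters account for 6 of the bytes and the three English letters for the remaining 3, which is why the byte count is larger than the character count.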
Because Python source code is itself a text file, when your source contains Chinese you must save it as UTF-8. So that the Python interpreter reads the source as UTF-8, we usually put these two lines at the top of the file:

  #!/usr/bin/env python3
  # -*- coding: utf-8 -*-

The first comment line tells Linux/OS X that this is an executable Python program; Windows ignores this comment.

The second comment line tells the Python interpreter to read the source as UTF-8; otherwise, Chinese written in the source may come out garbled. (Python 3 already reads source files as UTF-8 by default, so there the declaration mainly serves as an explicit safeguard.)

Declaring UTF-8 does not by itself make your .py file UTF-8-encoded. You must also make sure the text editor saves the file as UTF-8 without BOM.

If the .py file is saved as UTF-8 and also declares # -*- coding: utf-8 -*-, a test in the command prompt displays Chinese correctly.

Formatting

The last common question is how to output a formatted string. We often output strings like 'Dear xxx, hello! Your bill for month xx is xx, and your balance is xx', where the xxx content varies with variables, so we need a simple way to format strings.

In Python, the format syntax is the same as in C and uses %, for example:

  >>> 'Hello, %s' % 'world'
  'Hello, world'
  >>> 'Hi, %s, you have $%d.' % ('Michael', 1000000)
  'Hi, Michael, you have $1000000.'

As you may have guessed, the % operator formats the string. Inside the string, %s is replaced by a string and %d by an integer. There must be as many variables or values after the string as there are %? placeholders, in matching order. If there is only one placeholder, the parentheses can be omitted.

Common placeholders are:

%d	Integer
%f	Floating point number
%s	String
%x	hexadecimal integer
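Each of the four placeholders in the table can be tried with a one-line example. The values below are arbitrary samples chosen for illustration.

```python
name = "Michael"        # sample values, chosen only for illustration
balance = 1000000
ratio = 3.1415926

line_s = "Hi, %s" % name             # %s inserts a string
line_d = "balance: %d" % balance     # %d inserts an integer
line_f = "ratio: %f" % ratio         # %f inserts a float (six decimal places by default)
line_x = "hex: %x" % 255             # %x inserts an integer in hexadecimal

for line in (line_s, line_d, line_f, line_x):
    print(line)
```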

When formatting integers and floating-point numbers, you can also specify zero-padding and the number of integer and decimal digits:

  >>> '%2d-%02d' % (3, 1)
  ' 3-01'
  >>> '%.2f' % 3.1415926
  '3.14'

If you're not sure which placeholder to use, %s always works; it converts any data type to a string:

  >>> 'Age: %s. Gender: %s' % (25, True)
  'Age: 25. Gender: True'

What if a % inside the string should be an ordinary character? Then it needs to be escaped: %% represents a single %:

  >>> 'growth rate: %d %%' % 7
  'growth rate: 7 %'

Exercise

Xiao Ming's score rose from 72 points last year to 85 points this year. Calculate the percentage by which his score improved and display it with string formatting as 'xx.x%', keeping only 1 digit after the decimal point:

  # -*- coding: utf-8 -*-
  s1 = 72
  s2 = 85

Summary

Python 3 strings use Unicode and directly support multiple languages.

When converting between str and bytes, you need to specify the encoding. The most commonly used encoding is UTF-8. Python also supports other encodings, of course, such as encoding Unicode as GB2312:

  >>> '中文'.encode('gb2312')
  b'\xd6\xd0\xce\xc4'

but this is simply asking for trouble. Unless there is a special business requirement, remember to use only UTF-8.

When formatting strings, you can test quickly and easily in Python's interactive command line.
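For the exercise above, one possible solution is sketched below. It assumes that "percentage increase" means the gain relative to last year's score; the names `r` and `result` are chosen here for illustration.

```python
s1 = 72   # last year's score
s2 = 85   # this year's score

# Improvement relative to last year's score, expressed as a percentage.
r = (s2 - s1) / s1 * 100

# %.1f keeps one digit after the decimal point; %% prints a literal percent sign.
result = '%.1f%%' % r
print(result)
```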


This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
