Character encoding
As we have already said, strings are a data type too, but what makes strings special is that they raise the problem of encoding.
Because a computer can only process numbers, text must first be converted to numbers before it can be processed. The earliest computers were designed with 8 bits per byte, so the largest integer one byte can represent is 255 (binary 11111111 = decimal 255), and representing a larger integer takes more bytes. For example, the largest integer two bytes can represent is 65535, and the largest for 4 bytes is 4294967295.
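The byte limits above can be checked with a line of arithmetic. This is only a minimal sketch, written in Python 3 syntax; the helper name max_unsigned is made up for illustration:

```python
def max_unsigned(num_bytes):
    """Largest unsigned integer that fits in num_bytes bytes of 8 bits each."""
    return 2 ** (8 * num_bytes) - 1

# Print the limit for 1, 2 and 4 bytes.
for n in (1, 2, 4):
    print(n, max_unsigned(n))
```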
Since computers were invented by Americans, only 128 codes (0 to 127) were defined at first, covering uppercase and lowercase English letters, digits, and some symbols. This code table is called ASCII. For example, uppercase A is encoded as 65 and lowercase z as 122.
One byte is clearly not enough for Chinese: at least two bytes are needed, and the encoding must not conflict with ASCII, so China developed the GB2312 encoding for Chinese.
As you can imagine, there are hundreds of languages in the world: Japan encoded Japanese in Shift_JIS, South Korea encoded Korean in EUC-KR, and so on. With every country using its own standard, conflicts are inevitable, and text that mixes languages ends up displayed as garbage.
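To see how such a conflict produces garbage, here is a small sketch, assuming Python 3 (where ordinary strings are Unicode) and its built-in gb2312 and shift_jis codecs. The same bytes are written under one national standard and read under another:

```python
text = "中文"

# Bytes as a system using GB2312 would store them.
gb_bytes = text.encode("gb2312")

# A system that assumes Shift_JIS reads the very same bytes differently.
garbled = gb_bytes.decode("shift_jis")

print(repr(text), repr(garbled))
```

No error is raised here, which is what makes this kind of mix-up hard to notice: the bytes are valid in both encodings, they just mean different characters.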
That is why Unicode emerged. Unicode unifies all languages into one character set, so the garbled-text problem goes away.
The Unicode standard keeps evolving, but the most common form represents a character in two bytes (four bytes for very rare characters). Modern operating systems and most programming languages support Unicode directly.
Now consider the difference between ASCII and Unicode: an ASCII code is 1 byte, whereas a Unicode code is usually 2 bytes.
The letter A in ASCII is decimal 65, binary 01000001;
The character 0 in ASCII is decimal 48, binary 00110000. Note that the character '0' and the integer 0 are different;
The Chinese character 中 is outside the ASCII range. Its Unicode code is decimal 20013, binary 01001110 00101101.
As you might guess, to express the ASCII letter A in Unicode you only need to pad zeros in front, so the Unicode encoding of A is 00000000 01000001.
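The numbers quoted above can be reproduced with ord() and format(). This sketch assumes Python 3, where a plain string literal may contain Chinese characters directly:

```python
# Show each character's code in decimal and as a 16-bit binary string.
for ch in ("A", "0", "中"):
    code = ord(ch)
    print(ch, code, format(code, "016b"))

# The character '0' is not the integer 0.
zero_is_different = ("0" != 0) and (ord("0") == 48)
```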
A new problem then appears: once everything is unified into Unicode, garbled text disappears, but if the text you write is almost entirely English, Unicode needs twice as much storage space as ASCII, which is uneconomical for storage and transmission.
So, in the spirit of saving space, the "variable-length" UTF-8 encoding of Unicode appeared. UTF-8 encodes a Unicode character into 1 to 4 bytes depending on its code point (the original design allowed up to 6 bytes): common English letters take 1 byte, Chinese characters usually take 3 bytes, and only very rare characters take 4 bytes. If the text you want to transmit contains a large proportion of English characters, UTF-8 saves space:
Character  ASCII     Unicode            UTF-8
A          01000001  00000000 01000001  01000001
中         x         01001110 00101101  11100100 10111000 10101101
The table above also shows an additional benefit of UTF-8: ASCII can be regarded as a subset of UTF-8, so a large amount of legacy software that supports only ASCII can continue to work with UTF-8.
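The UTF-8 column of the table and the ASCII compatibility can both be verified in a few lines. This is a sketch assuming Python 3; the helper utf8_bits is a made-up name for illustration:

```python
def utf8_bits(ch):
    """Return the UTF-8 bytes of ch as space-separated 8-bit binary strings."""
    return " ".join(format(byte, "08b") for byte in ch.encode("utf-8"))

# One byte for an English letter, three bytes for this Chinese character.
for ch in ("A", "中"):
    print(ch, len(ch.encode("utf-8")), utf8_bits(ch))

# Pure ASCII text produces identical bytes under ASCII and UTF-8.
ascii_text = "Hello, world"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")
```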
Now that the relationship between ASCII, Unicode, and UTF-8 is clear, we can summarize how character encoding commonly works in computer systems today:
In computer memory, Unicode is used uniformly, and text is converted to UTF-8 when it needs to be saved to disk or transmitted.
When you edit with Notepad, the UTF-8 bytes read from the file are converted to Unicode characters in memory, and when you save, the Unicode is converted back to UTF-8 and written to the file.
When you browse the web, the server converts dynamically generated Unicode content to UTF-8 before sending it to the browser.
So you will see something like <meta charset="UTF-8" /> in the source of many web pages, indicating that the page is encoded in UTF-8.
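The editor case described above (Unicode in memory, UTF-8 on disk) can be imitated with a temporary file. This is a minimal sketch assuming Python 3, where open() takes an encoding argument; the file name note.txt is arbitrary:

```python
import os
import tempfile

text = "中文 and English"  # Unicode string held in memory

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "note.txt")

    # Saving: Unicode characters are encoded to UTF-8 bytes on disk.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # The raw bytes actually stored in the file.
    with open(path, "rb") as f:
        raw = f.read()

    # Loading: UTF-8 bytes are decoded back to Unicode characters.
    with open(path, "r", encoding="utf-8") as f:
        loaded = f.read()

print(len(text), len(raw))
```

The character count and the byte count differ because each Chinese character occupies three bytes on disk while each English character occupies one.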
Python's String
After figuring out the headaches of character coding, let's look at Python's support for Unicode.
Since Python was born earlier than the Unicode standard, the earliest Python supported only ASCII, and an ordinary string 'ABC' is ASCII-encoded inside Python. Python provides the ord() and chr() functions to convert between a character and its number:
>>> ord('A')
65
>>> chr(65)
'A'
Python later added support for Unicode. A Unicode string is written as u'...', for example:
>>> print u'中文'
中文
>>> u'中'
u'\u4e2d'
Writing u'中' and u'\u4e2d' is the same thing: \u is followed by the hexadecimal Unicode code. Likewise, u'A' and u'\u0041' are the same.
How do the two kinds of string convert to each other? The string 'xxx', although ASCII-encoded, can also be regarded as UTF-8 encoded, while u'xxx' can only be Unicode.
Convert u'xxx' to UTF-8 encoded 'xxx' with the encode('utf-8') method:
>>> u'abc'.encode('utf-8')
'abc'
>>> u'中文'.encode('utf-8')
'\xe4\xb8\xad\xe6\x96\x87'
After conversion, an English character has the same UTF-8 value as its Unicode value (but takes a different amount of storage), whereas one Chinese Unicode character becomes 3 UTF-8 bytes. The \xe4 you see is one of those bytes: its value is 228, which has no corresponding letter to display, so the byte is shown in hexadecimal. The len() function returns the length of a string:
>>> len(u'abc')
3
>>> len('abc')
3
>>> len(u'中文')
2
>>> len('\xe4\xb8\xad\xe6\x96\x87')
6
Conversely, convert a UTF-8 encoded string 'xxx' to the Unicode string u'xxx' with the decode('utf-8') method:
>>> 'abc'.decode('utf-8')
u'abc'
>>> '\xe4\xb8\xad\xe6\x96\x87'.decode('utf-8')
u'\u4e2d\u6587'
>>> print '\xe4\xb8\xad\xe6\x96\x87'.decode('utf-8')
中文
Since Python source code is also a text file, when your source contains Chinese you must make sure to save it as UTF-8. So that the Python interpreter reads the source as UTF-8, we usually put these two lines at the top of the file:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
The first comment line tells Linux/OS X that this is an executable Python program. Windows ignores it.
The second comment line tells the Python interpreter to read the source as UTF-8. Otherwise, Chinese text written in the source may come out garbled.
Declaring UTF-8 does not mean that your .py file is actually saved as UTF-8. You must also make sure that Notepad++ is using the "UTF-8 without BOM" encoding.
If you use Notepad++ for editing, then in addition to adding # -*- coding: utf-8 -*-, Chinese strings must be Unicode strings.
If the .py file itself is saved as UTF-8 and also declares # -*- coding: utf-8 -*-, running it from the command prompt displays Chinese correctly.
Formatting
The last common question is how to output a formatted string. We often output strings like 'Dear xxx, hello! Your charge for month xx is xx, and your balance is xx', where the xxx parts change according to variables, so we need a simple way to format strings.
In Python, the formatting syntax is the same as in C and is implemented with the % operator, for example:
>>> 'Hello, %s' % 'world'
'Hello, world'
>>> 'Hi, %s, you have $%d.' % ('Michael', 1000000)
'Hi, Michael, you have $1000000.'
As you may have guessed, the % operator is used to format strings. Inside the string, %s is replaced by a string and %d by an integer. There must be as many variables or values after the % operator as there are % placeholders, in matching order. If there is only one placeholder, the parentheses can be omitted.
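As an illustration of matching placeholders to values in order, here is a sketch of the billing message mentioned earlier. It assumes Python 3, where the % operator on strings works the same way; the function name bill_message and the sample values are invented for the example:

```python
def bill_message(name, month, charge, balance):
    """Fill four placeholders, in order, from a tuple of four values."""
    template = "Dear %s, hello! Your charge for month %d is %d, and your balance is %d."
    return template % (name, month, charge, balance)

message = bill_message("Michael", 3, 58, 120)
print(message)

# With a single placeholder the parentheses can be left out.
greeting = "Hello, %s" % "world"
```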
The common placeholders are:
- %d integer
- %f floating-point numbers
- %s string
- %x hexadecimal integer
When formatting integers and floating-point numbers, you can also specify zero padding and the number of integer and decimal digits:
>>> '%2d-%02d' % (3, 1)
' 3-01'
>>> '%.2f' % 3.1415926
'3.14'
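A few more width and padding combinations, including the %x placeholder from the list above, are sketched below. The code assumes Python 3 and the sample values are arbitrary:

```python
space_and_zero = "%2d-%02d" % (3, 1)   # width 2 padded with a space, width 2 padded with 0
zero_padded = "%05d" % 42              # width 5, padded with zeros
three_decimals = "%.3f" % 3.1415926    # rounded to three decimal places
right_aligned = "%6.2f" % 3.1415926    # total width 6, two decimal places
hexadecimal = "%x" % 255               # integer shown in lowercase hexadecimal

for value in (space_and_zero, zero_padded, three_decimals, right_aligned, hexadecimal):
    print(repr(value))
```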
If you are unsure what to use, %s always works, since it converts any data type to a string:
>>> 'Age: %s. Gender: %s' % (25, True)
'Age: 25. Gender: True'
For Unicode strings, the usage is exactly the same, but it is best to ensure that the replaced string is also a Unicode string:
>>> u'Hi, %s' % u'Michael'
u'Hi, Michael'
Sometimes the % inside a string is just an ordinary character. In that case it must be escaped: write %% to represent a single %:
>>> 'growth rate: %d%%' % 7
'growth rate: 7%'
Summary
For historical reasons, Python 2.x supports Unicode but needs two string notations in its syntax, 'xxx' and u'xxx'.
Python of course also supports other encodings, for example encoding Unicode as GB2312:
>>> u'中文'.encode('gb2312')
'\xd6\xd0\xce\xc4'
But that is asking for trouble. Unless you have a special business requirement, remember to use only Unicode and UTF-8.
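One reason mixing encodings causes trouble is that bytes can only be decoded with the encoding that produced them. A small sketch, assuming Python 3 and its built-in gb2312 codec:

```python
text = "中文"
gb = text.encode("gb2312")     # 2 bytes per character here
utf8 = text.encode("utf-8")    # 3 bytes per character here
print(len(gb), len(utf8))

# Decoding with the matching codec restores the original text.
round_trip = gb.decode("gb2312")

# Decoding GB2312 bytes as UTF-8 fails, because they are not valid UTF-8.
try:
    gb.decode("utf-8")
    decoded_as_utf8 = True
except UnicodeDecodeError:
    decoded_as_utf8 = False
```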
In Python 3.x, 'xxx' and u'xxx' are unified as Unicode strings, so writing the u prefix or not makes no difference, while a string of bytes must carry the b prefix: b'xxx'.
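The Python 3 behaviour just described can be sketched as follows. The variable names are arbitrary, and the u prefix assumes Python 3.3 or later, where it is accepted again:

```python
s = "中文"                 # str: a sequence of Unicode characters
b = s.encode("utf-8")      # bytes: the UTF-8 encoded form

print(type(s).__name__, len(s))
print(type(b).__name__, len(b))

prefix_makes_no_difference = (u"中文" == "中文")
decoded = b.decode("utf-8")
```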
When formatting strings, you can use Python's interactive command line to test it quickly and easily.