Python: Familiar and unfamiliar character encoding
Character encoding is an unavoidable problem in programming. Whether you use Python2 or Python3, or C++, Java, and so on, it is worth getting the basic concepts of character encoding straight. This article is divided into the following sections:
Basic concepts
Introduction to common character encodings
Python's default encoding
Character types in Python2
UnicodeEncodeError & UnicodeDecodeError Roots
Basic concepts
In the fields of computing and telecommunications, a character is a unit of information. It is the general term for all kinds of letters and symbols, including the letters of national scripts, punctuation marks, graphic symbols, digits, and so on. For example, a Chinese character, an English letter, and a punctuation mark are all characters.
A character set is a collection of characters. There are many character sets, and each contains a different number of characters. Common character sets include the ASCII character set, the GB2312 character set, and the Unicode character set. The ASCII character set consists of 128 characters, including printable characters (such as uppercase and lowercase English letters and Arabic numerals) and control characters (such as space and carriage return); the GB2312 character set is China's national standard Simplified Chinese character set, containing simplified Chinese characters, common symbols, digits, and so on; the Unicode character set is meant to contain every character used in every language of the world.
A character encoding maps the characters of a character set to specific binary numbers so that computers can process them. Common character encodings include ASCII, UTF-8, and GBK. In practice, "character set" and "character encoding" are often treated as synonyms; for example, ASCII names not only a set of characters but also the corresponding encoding, that is, ASCII denotes both the character set and its character encoding.
To summarize: a character is a single unit of information (for example, a letter, a Chinese character, or a punctuation mark); a character set is a collection of characters (for example, ASCII, GB2312, Unicode); and a character encoding is the rule that maps the characters of a set to binary numbers (for example, ASCII, UTF-8, GBK).
Introduction to common character encodings
Common character encodings include ASCII, GBK, Unicode, and UTF-8, among others. Here we mainly introduce ASCII, Unicode, and UTF-8.
ASCII
The computer was born in the United States, where English is used, and English text is just a combination of English letters, digits, and some common symbols.
In the 1960s, the United States developed a character encoding scheme that specifies how English letters, digits, and some common symbols are converted to binary numbers. It is known as ASCII (American Standard Code for Information Interchange).
For example, the binary representation of the uppercase letter 'A' is 01000001 (decimal 65), the binary representation of the lowercase letter 'a' is 01100001 (decimal 97), and the binary representation of the space character is 00100000 (decimal 32).
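These mappings are easy to check in Python with the built-in ord() and chr() functions; here is a small sketch (the session works the same in Python2 and Python3):

>>> ord('A'), ord('a'), ord(' ')   # decimal ASCII values
(65, 97, 32)
>>> bin(ord('A'))                  # binary form of 'A'
'0b1000001'
>>> chr(65)                        # back from the number to the character
'A'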
Unicode
ASCII defines encodings for only 128 characters, which was sufficient in the United States. But as computers spread to Europe, Asia, and the rest of the world, whose languages differ greatly, ASCII was far from enough to represent them. As a result, different countries and regions designed their own encoding schemes, such as mainland China's GB2312 and GBK, Japan's Shift_JIS, and so on.
Although every country and region can design its own encoding scheme, data exchanged between computers that use different encodings turns into all kinds of garbled text (mojibake), which is nothing short of a disaster.
What to do? The idea is simple: unify all the languages of the world under one encoding scheme. That scheme is called Unicode. It assigns a unique code to every character of every language, which makes cross-language, cross-platform text processing possible. Isn't that great?
Unicode version 1.0 was released in October 1991 and is still being revised, with new characters added in every release; at the time of writing, the latest version is 9.0.0, released on June 21, 2016.
The Unicode standard writes code points as hexadecimal numbers prefixed with U+. For example, the Unicode code point of the uppercase letter 'A' is U+0041, and that of the Chinese character '严' ("strict") is U+4E25. For a fuller table of characters and their code points, consult unicode.org or a dedicated table of Chinese character code points.
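You can look up a character's code point directly in Python by calling ord() on a unicode string; a small sketch using Python2-style unicode literals:

>>> hex(ord(u'A'))     # code point of 'A'
'0x41'
>>> hex(ord(u'严'))    # code point of the Chinese character '严'
'0x4e25'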
UTF-8
Unicode looks perfect: unification is achieved. However, Unicode as described so far has a big problem: it wastes resources.
Why? Unicode originally planned to represent every character with two bytes, and when two bytes later proved insufficient, four bytes were used. For example, the Unicode code point of the Chinese character '严' is the hexadecimal number 4E25, which is 15 bits in binary (100111000100101), so at least two bytes are needed to represent this character, while other characters may need three or four bytes, or even more.
The problem is that if ASCII characters are also stored this way, storage space is wasted. For example, the binary encoding of the uppercase letter 'A' is 01000001, which needs only one byte; if Unicode uniformly used three or four bytes per character, the leading bytes of 'A' would all be zero, which wastes storage space.
To solve this problem, people devised UTF-16, UTF-32, and UTF-8 on top of Unicode. Here we only say a few words about UTF-8.
UTF-8 (8-bit Unicode Transformation Format) is a variable-length character encoding for Unicode that uses one to four bytes per character: ASCII characters still take one byte, Arabic, Greek, and similar scripts take two bytes, commonly used Chinese characters take three bytes, and so on.
Therefore, we say that UTF-8 is one of the ways Unicode is implemented; other implementations include UTF-16 (which represents a character with two or four bytes) and UTF-32 (which represents every character with four bytes).
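The variable length of UTF-8 is easy to observe by encoding a couple of characters and counting the resulting bytes; a quick sketch in a Python2 session:

>>> len(u'A'.encode('utf-8'))    # ASCII character: 1 byte
1
>>> len(u'严'.encode('utf-8'))   # common Chinese character: 3 bytes
3
>>> u'严'.encode('utf-8')        # the actual UTF-8 bytes
'\xe4\xb8\xa5'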
Python's default encoding
The default encoding of Python2 is ASCII, while the default encoding of Python3 is UTF-8. It can be checked in the following way:
Python 2.7.11 (default, ...)
[GCC 4.2.1 Compatible Apple LLVM 7.0.2 (clang-700.1.81)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.getdefaultencoding()
'ascii'
Python 3.5.2 (default, ...)
[GCC 4.2.1 Compatible Apple LLVM 7.3.0 (clang-703.0.31)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.getdefaultencoding()
'utf-8'
Character types in Python2
There are two string-related types in Python2: str and unicode, and their common parent class is basestring. A str is a byte string whose bytes follow some encoding, by default ASCII, but possibly GBK, UTF-8, and so on; a unicode string is written in the u'...' form.
The two types are converted into each other via decode and encode: decode turns a str into a unicode, and encode turns a unicode back into a str. For example:
>>> '中文'.decode('utf-8')
u'\u4e2d\u6587'
>>> u'中文'.encode('utf-8')
'\xe4\xb8\xad\xe6\x96\x87'
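Because both types derive from basestring, an isinstance check against basestring accepts either kind of string, which is handy for functions that should take both. A small illustrative sketch:

>>> isinstance('中文', str), isinstance('中文', basestring)
(True, True)
>>> isinstance(u'中文', unicode), isinstance(u'中文', basestring)
(True, True)
>>> isinstance(u'中文', str)
False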
UnicodeEncodeError & UnicodeDecodeError Roots
Programs written in Python2 often run into UnicodeEncodeError and UnicodeDecodeError. The root cause is that when str and unicode strings are mixed in the code, Python2 falls back on the default ASCII codec to encode a unicode string (encode) or to decode a str string (decode), and that is exactly when these errors are likely to occur.
There are two common scenarios that we had better keep in mind: first, when a str and a unicode are combined in one operation, Python2 implicitly decodes the str into unicode using ASCII; second, when a unicode is passed where a str is expected, Python2 implicitly encodes the unicode into a str using ASCII.
Let's take a look at the example:
>>> s = '你好'   # str type, UTF-8 encoded ("hello")
>>> u = u'世界'   # unicode type ("world")
>>> s + u         # implicit conversion, i.e. s.decode('ascii') + u
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)
To avoid the error, we need to explicitly specify 'utf-8' for the decoding, as follows:
>>> s = '你好'   # str type, UTF-8 encoded
>>> u = u'世界'
>>>
>>> s.decode('utf-8') + u   # explicitly specify 'utf-8' for the conversion
u'\u4f60\u597d\u4e16\u754c'   # note: this is not an error, it is the unicode string u'你好世界'
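A common way to avoid scattering decode calls everywhere is a small helper that normalizes any incoming string to unicode before mixing it with other text. This is only an illustrative sketch (the helper name to_unicode and the utf-8 default are assumptions, not something from the original code):

# -*- coding: utf-8 -*-

def to_unicode(value, encoding='utf-8'):
    # Return value as unicode; decode str with the given encoding (assumed default: utf-8).
    if isinstance(value, unicode):
        return value
    return value.decode(encoding)

s = '你好'    # str, UTF-8 encoded
u = u'世界'   # unicode
print to_unicode(s) + u   # safe: both operands are unicode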
If a function or class expects an argument of type str but you pass it a unicode string, Python2 will by default try to encode the unicode into a str using ASCII, which easily raises UnicodeEncodeError.
Let's take a look at the example:
>>> u_str = u'你好'
>>> str(u_str)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
In the code above, u_str is a unicode string, but str() here can only produce a str, so Python attempts to encode it into a str using ASCII, which is equivalent to:
u_str.encode('ascii')   # u_str is a unicode string
Encoding the Chinese characters of a unicode string with ASCII is bound to fail.
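The usual fix is to encode explicitly with an encoding that can actually represent the characters, instead of relying on str(); a minimal sketch, assuming UTF-8 is the desired target encoding:

>>> u_str = u'你好'
>>> u_str.encode('utf-8')   # explicit encoding instead of str(u_str)
'\xe4\xbd\xa0\xe5\xa5\xbd'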
Now take a look at an example using raw_input; note that the prompt passed to raw_input should be a str. In the session below, the Chinese prompt is '输入你的名字: ' ("Enter your name: ") and the user types the name 小明 (xiaoming):
>>> name = raw_input('Input your name: ')
Input your name: Ethan
>>> name
'Ethan'
>>> name = raw_input('输入你的名字: ')
输入你的名字: 小明
>>> name
'\xe5\xb0\x8f\xe6\x98\x8e'
>>> type(name)
<type 'str'>
>>> name = raw_input(u'输入你的名字: ')   # will try u'输入你的名字: '.encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-5: ordinal not in range(128)
>>> name = raw_input(u'输入你的名字: '.encode('utf-8'))   # works, but name is a str, not unicode
输入你的名字: 小明
>>> name
'\xe5\xb0\x8f\xe6\x98\x8e'
>>> type(name)
<type 'str'>
>>> name = raw_input(u'输入你的名字: '.encode('utf-8')).decode('utf-8')   # recommended
输入你的名字: 小明
>>> name
u'\u5c0f\u660e'
>>> type(name)
<type 'unicode'>
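If the value read from the terminal should always be unicode, the pattern above (encode the prompt, then decode the result) can be wrapped in a helper; the helper name input_unicode and the utf-8 fallback below are assumptions for illustration, with sys.stdin.encoding used to pick up the terminal's actual encoding when available:

# -*- coding: utf-8 -*-
import sys

def input_unicode(prompt):
    # Accept a unicode prompt and return the user's input as unicode (illustrative sketch).
    encoding = sys.stdin.encoding or 'utf-8'   # assumed fallback when stdin is not a terminal
    raw = raw_input(prompt.encode(encoding))
    return raw.decode(encoding)

name = input_unicode(u'输入你的名字: ')
print type(name)   # <type 'unicode'>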
Look again at an example of redirection:
hello = u'你好'   # "hello"
print hello
Save the code above to a file hello.py. Running python hello.py in a terminal prints the text normally, but if you redirect the output to a file with python hello.py > result, you will find a UnicodeEncodeError.
This is because when the output goes to the console, print uses the console's encoding; when the output is redirected to a file, print does not know which encoding to use and falls back to the default encoding ASCII, which causes the encoding error.
The code should be changed to the following:
hello = u'你好'
print hello.encode('utf-8')
Now running python hello.py > result works without problems.
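If you would rather not call encode on every print, another commonly used approach (not from the original article, just a well-known pattern) is to wrap sys.stdout with a UTF-8 writer once at the top of the script:

# -*- coding: utf-8 -*-
import sys
import codecs

# Wrap stdout so that unicode strings are encoded as UTF-8 on the way out,
# regardless of whether the output goes to a terminal or is redirected to a file.
sys.stdout = codecs.getwriter('utf-8')(sys.stdout)

hello = u'你好'
print hello   # works even when redirected to a file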
Summary
UTF-8 is a variable-length character encoding for Unicode and is one of the ways Unicode is implemented.
The Unicode character set can be encoded with several encoding schemes, such as UTF-8, UTF-7, and UTF-16.
When an operation involves both a str and a unicode string, Python2 first decodes the str into unicode (using the default ASCII codec) and then performs the operation.
When a function or class expects an argument of type str but receives a unicode string, Python2 will by default encode the unicode into a str using ASCII.