- Unicode and UTF-8 in Python
The article on the history of character sets gives a brief explanation of the relationship between Unicode and UTF-8. To summarize: UTF-8, UTF-16, and UTF-32 are the same kind of thing and serve the same purpose, with UTF-8 the most widely used. Unicode and UTF-8, however, are not the same kind of thing: Unicode is the form of representation, and UTF-8 is a form of storage.
- Unicode is a representation (UTF-8 can be decoded into Unicode)
- UTF-8, UTF-16, and UTF-32 are storage formats (Unicode can be encoded as UTF-8)
In other words: to store text, it must be encoded as UTF-8; to work with text stored as UTF-8, it must be decoded into Unicode. Code processes text as Unicode, and files store it as UTF-8.
In [1]: name = '张三'
In [2]: print name
张三
In [3]: name
Out[3]: '\xe5\xbc\xa0\xe4\xb8\x89'  # UTF-8 encoding, the storage form
In [4]: len(name)
Out[4]: 6
In [5]: name[0:2]  # slice operation
Out[5]: '\xe5\xbc'
In [6]: print name[0:1]  # one byte of a three-byte character, not printable on its own
?
In [7]: type(name)  # the type is str
Out[7]: str
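For comparison, the same experiment can be sketched in Python 3, where a byte string is the separate bytes type. This is an illustration added here, not part of the original session, and the variable names are chosen for the example.

```python
# Python 3 sketch: the UTF-8 bytes of a two-character string.
raw = '张三'.encode('utf-8')   # storage form: 6 bytes

print(raw)          # b'\xe5\xbc\xa0\xe4\xb8\x89'
print(len(raw))     # 6, because each character takes 3 bytes in UTF-8
print(raw[0:2])     # b'\xe5\xbc', a fragment of the first character

# A slice that cuts through a character is not valid UTF-8 on its own.
try:
    raw[0:2].decode('utf-8')
    fragment_ok = True
except UnicodeDecodeError:
    fragment_ok = False
print(fragment_ok)  # False
```

Slicing bytes counts bytes, not characters, which is why the slice above cannot be decoded.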
- Using Unicode strings:
In Python 2, put a u directly in front of the string literal.
In [8]: name = u'张三'
In [9]: name
Out[9]: u'\u5f20\u4e09'  # Unicode code points
In [10]: print name
张三
In [11]: print name[0:1]
张
In [12]: name[0:1]
Out[12]: u'\u5f20'
In [13]: len(name)
Out[13]: 2
In [15]: type(name)
Out[15]: unicode  # the type is unicode
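In Python 3 every string literal is already Unicode, so no u prefix is needed. A small sketch, not from the original session, showing that length and slicing count characters and that each character maps to a code point:

```python
# Python 3 sketch: str holds Unicode characters, not bytes.
name = '张三'

print(len(name))      # 2 characters
print(name[0:1])      # 张
code_points = [hex(ord(ch)) for ch in name]
print(code_points)    # ['0x5f20', '0x4e09']

# The escaped form names the same string.
print(name == '\u5f20\u4e09')  # True
```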
Here is the key point.
- Encoding and decoding functions
To convert between Unicode and UTF-8, Python provides two built-in methods: encode() and decode().
- Encoding: encode() converts from the representation form to a storage form
- Decoding: decode() converts from a storage form to the representation form
Unicode itself is not bound to any particular encoding.
In [37]: name = u'张三'
In [38]: b_name = name.encode('utf-8')  # encode into a storage form, here UTF-8
In [39]: b_name
Out[39]: '\xe5\xbc\xa0\xe4\xb8\x89'
In [47]: type(b_name)  # the type is str
Out[47]: str
In [40]: b_name2 = name.encode('utf-16')  # can also be encoded as UTF-16
In [41]: b_name2
Out[41]: '\xff\xfe _\tN'
In [42]: b_name3 = name.encode('utf-32')  # can also be encoded as UTF-32
In [43]: b_name3
Out[43]: '\xff\xfe\x00\x00 _\x00\x00\tN\x00\x00'
In [44]: j_name = b_name.decode('utf-8')  # decode UTF-8 into Unicode
In [45]: j_name
Out[45]: u'\u5f20\u4e09'
In [46]: type(j_name)  # the type is unicode
Out[46]: unicode
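The three encodings produce different byte sequences for the same text, yet each decodes back to the same Unicode string. A Python 3 sketch of that round trip (the loop and the names are illustrative, not from the original):

```python
# Python 3 sketch: one Unicode string, several storage forms.
name = '张三'

sizes = {}
for codec in ('utf-8', 'utf-16', 'utf-32'):
    data = name.encode(codec)          # Unicode -> bytes
    sizes[codec] = len(data)
    assert data.decode(codec) == name  # bytes -> the same Unicode text

# utf-16 and utf-32 prepend a byte order mark (2 and 4 bytes respectively).
print(sizes)  # {'utf-8': 6, 'utf-16': 6, 'utf-32': 12}
```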
In summary so far: writing a Unicode string directly to a file raises an error saying that the ASCII codec cannot encode values of 128 or above. The ASCII range is 0-127, and Chinese characters fall far outside it. To understand the error: Unicode is a form of representation, so storing it requires a specific encoding. Python 2 defaults to ASCII, so it tries to store the text as ASCII; the text contains Chinese characters, which ASCII cannot represent, so the write fails:
In [47]: name = u'张三'
In [50]: with open('/tmp/test', 'w') as f:
    ...:     f.write(name)
    ...:
---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-4-0d87fa01de83> in <module>()
      1 with open('/tmp/test', 'w') as f:
----> 2     f.write(name)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
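The same failure can be reproduced in Python 3 by asking for the ASCII codec explicitly. This sketch is an illustration added here, not the original session:

```python
# Python 3 sketch: ASCII cannot represent code points above 127.
name = '张三'

try:
    name.encode('ascii')
    message = None
except UnicodeEncodeError as exc:
    message = str(exc)

print(message)

# Text made only of ASCII characters encodes without error.
print('abc'.encode('ascii'))  # b'abc'
```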
The solution is to encode the string first, as UTF-8, UTF-16, or another encoding.
In [51]: with open('/tmp/test', 'w') as f:
    ...:     f.write(name.encode('utf-8'))  # encode to UTF-8 and write to the file
    ...:
In [52]: with open('/tmp/test', 'r') as f:
    ...:     new_name = f.read()
    ...:
In [53]: new_name.decode('utf-8')  # decode UTF-8 into Unicode
Out[53]: u'\u5f20\u4e09'
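In Python 3 the closest equivalent is to open the file in binary mode and handle the bytes explicitly. A sketch, assuming a temporary file in place of /tmp/test so that it runs anywhere:

```python
import os
import tempfile

name = '张三'

# A temporary path stands in for /tmp/test (an assumption for portability).
fd, path = tempfile.mkstemp()
os.close(fd)
try:
    with open(path, 'wb') as f:          # binary mode: write bytes
        f.write(name.encode('utf-8'))    # encode at the point of writing

    with open(path, 'rb') as f:
        raw = f.read()                   # bytes come back unchanged

    restored = raw.decode('utf-8')       # decode at the point of reading
finally:
    os.remove(path)

print(raw)        # b'\xe5\xbc\xa0\xe4\xb8\x89'
print(restored)   # 张三
```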
- Differences between Python 2 and Python 3 regarding character sets
- How each version represents sequences of characters:
- Python 3 has two types that represent sequences of characters: bytes and str. Instances of bytes contain raw 8-bit values; instances of str contain Unicode characters
- Python 2 also has two types that represent sequences of characters, called str and unicode. Unlike Python 3, instances of str contain raw 8-bit values, whereas instances of unicode contain Unicode characters
1. In Python 2, str is the ordinary string type and unicode is the Unicode string type. That is, a literal with no prefix is a str, and a literal with the u prefix is a unicode.
In [15]: name = u'张三'
In [16]: type(name)
Out[16]: unicode
2. In Python 3, a string literal with no prefix is a str.
3. Python 3's str is Python 2's unicode, and Python 2's str is Python 3's bytes.
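One practical consequence in Python 3 is that str and bytes do not mix. A short sketch, added here for illustration:

```python
# Python 3 sketch: str and bytes are distinct types.
text = '张三'
data = text.encode('utf-8')

print(type(text).__name__)   # str
print(type(data).__name__)   # bytes

# Combining the two raises TypeError instead of converting silently.
try:
    text + data
    mixed = True
except TypeError:
    mixed = False
print(mixed)                 # False
```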
Python 2 has a standard library module, codecs, that handles encoding and decoding automatically. Its open function accepts an encoding parameter:
In [55]: import codecs
In [56]: name = u'张三'
In [57]: with codecs.open('/tmp/test', 'w', encoding='utf-8') as f:
    ...:     f.write(name)
    ...:
In [58]: with codecs.open('/tmp/test', 'r', encoding='utf-8') as f:
    ...:     new_name = f.read()
    ...:
In [59]: new_name
Out[59]: u'\u5f20\u4e09'
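If the same code has to run on both Python 2 and Python 3, io.open is an alternative: it accepts an encoding parameter on Python 2.6 and later, and in Python 3 it is the same function as the built-in open. A sketch, using a temporary file in place of /tmp/test:

```python
import io
import os
import tempfile

# The two characters used in the examples above, written as escapes.
# The u prefix is accepted by Python 2 and by Python 3.3+.
name = u'\u5f20\u4e09'

fd, path = tempfile.mkstemp()
os.close(fd)
try:
    # Text mode with an explicit encoding: write() takes Unicode text.
    with io.open(path, 'w', encoding='utf-8') as f:
        f.write(name)

    # Reading with the same encoding returns Unicode text again.
    with io.open(path, 'r', encoding='utf-8') as f:
        new_name = f.read()
finally:
    os.remove(path)

print(new_name == name)   # True
```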
Python 3
The built-in open function itself accepts an encoding parameter, which works the same way as in the Python 2 codecs module:
>>> name = '张三'
>>> name
'张三'
>>> with open('/tmp/test', 'w', encoding='utf-8') as f:
...     f.write(name)
...
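Reading the file back works the same way. The sketch below (Python 3, with a temporary file instead of /tmp/test) also reads the file in binary mode to show what was actually stored:

```python
import os
import tempfile

name = '张三'

fd, path = tempfile.mkstemp()
os.close(fd)
try:
    with open(path, 'w', encoding='utf-8') as f:
        written = f.write(name)      # returns the number of characters written

    with open(path, 'r', encoding='utf-8') as f:
        text = f.read()              # str, decoded automatically

    with open(path, 'rb') as f:
        raw = f.read()               # bytes, exactly as stored on disk
finally:
    os.remove(path)

print(written)   # 2
print(text)      # 张三
print(raw)       # b'\xe5\xbc\xa0\xe4\xb8\x89'
```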
# Summary
There are many ways to represent Unicode characters as binary data, and the most common encoding is UTF-8. Python 3's str and Python 2's unicode are not associated with any specific binary encoding. To convert Unicode characters to binary data, you must use the encode method; to convert binary data to Unicode characters, you must use the decode method. When programming, do the encoding and decoding at the outermost boundary of your interfaces. The core of the program should use Unicode character types and make no assumptions about character encodings.
Python 3
# In Python 3, write a helper that accepts str or bytes and always returns str:
def to_str(bytes_or_str):
    if isinstance(bytes_or_str, bytes):
        value = bytes_or_str.decode('utf-8')
    else:
        value = bytes_or_str
    return value  # Instance of str
# Also write a helper that accepts str or bytes and always returns bytes:
def to_bytes(bytes_or_str):
    if isinstance(bytes_or_str, str):
        value = bytes_or_str.encode('utf-8')
    else:
        value = bytes_or_str
    return value  # Instance of bytes
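A short usage sketch for the two helpers. They are repeated here in compact form only so that the block runs on its own:

```python
def to_str(bytes_or_str):
    """Return str, decoding UTF-8 bytes if needed."""
    if isinstance(bytes_or_str, bytes):
        return bytes_or_str.decode('utf-8')
    return bytes_or_str


def to_bytes(bytes_or_str):
    """Return bytes, encoding str as UTF-8 if needed."""
    if isinstance(bytes_or_str, str):
        return bytes_or_str.encode('utf-8')
    return bytes_or_str


raw = b'\xe5\xbc\xa0\xe4\xb8\x89'

print(to_str(raw))              # 张三
print(to_str('张三'))            # 张三 (already str, returned unchanged)
print(to_bytes('张三') == raw)   # True
print(to_bytes(raw) is raw)     # True (already bytes, returned unchanged)
```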
Python 2
# In Python 2, write a helper that accepts str or unicode and always returns unicode:
# Python 2
def to_unicode(unicode_or_str):
    if isinstance(unicode_or_str, str):
        value = unicode_or_str.decode('utf-8')
    else:
        value = unicode_or_str
    return value  # Instance of unicode
# Also write a helper that accepts str or unicode and always returns str:
# Python 2
def to_str(unicode_or_str):
    if isinstance(unicode_or_str, unicode):
        value = unicode_or_str.encode('utf-8')
    else:
        value = unicode_or_str
    return value  # Instance of str
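The summary's advice to keep encoding and decoding at the outer edges of a program can be sketched as follows. This is a Python 3 illustration; the function names greet and handle_request are hypothetical and do not come from the original article.

```python
def greet(name):
    """Core logic (hypothetical): works only with str, never with bytes."""
    return 'Hello, ' + name + '!'


def handle_request(raw_input):
    """Boundary (hypothetical): decode on the way in, encode on the way out."""
    name = raw_input.decode('utf-8')     # bytes -> str at the edge
    reply = greet(name)                  # all processing uses str
    return reply.encode('utf-8')         # str -> bytes at the edge


raw_reply = handle_request('张三'.encode('utf-8'))
print(raw_reply.decode('utf-8'))   # Hello, 张三!
```

Only handle_request knows that the bytes are UTF-8; greet would work unchanged if the boundary switched to another encoding.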
Part of this article is excerpted from Effective Python: 59 Specific Ways to Write Better Python, Item 3: Know the Differences Between bytes, str, and unicode.
Character Set Encoding and Python (II): Unicode and UTF-8