Character Set Encoding and Python (II): Unicode and UTF-8

Source: Internet
Author: User


    • Unicode and UTF-8 in Python
The article on the history of character sets briefly explains the relationship between Unicode and UTF-8. To summarize: UTF-8, UTF-16, and UTF-32 belong to one category. They serve the same purpose, and UTF-8 is the most widely used. Unicode and UTF-8 do not belong to the same category: Unicode is the form of representation, and UTF-8 is a form of storage.
    • Unicode is a representation (UTF-8 can be decoded into Unicode)
    • UTF-8, UTF-16, and UTF-32 are storage formats (Unicode can be encoded as UTF-8)
Understanding: text must be encoded, for example as UTF-8, before it is stored, and stored UTF-8 must be decoded into Unicode before it is displayed or processed. In other words, code works with Unicode, and files hold UTF-8.
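The split between representation and storage can be sketched in a few lines. This is a minimal illustration written for Python 3 (where str is already Unicode), not part of the original sessions; the variable names are chosen here for illustration only.

```python
text = "张三"  # representation: a sequence of Unicode code points

# Each character is identified by a code point, independent of any storage format.
code_points = [hex(ord(ch)) for ch in text]

stored = text.encode("utf-8")      # storage form: bytes that can go into a file
restored = stored.decode("utf-8")  # back to the representation form

print(code_points)       # ['0x5f20', '0x4e09']
print(len(text))         # 2 characters
print(len(stored))       # 6 bytes
print(restored == text)  # True
```

Two characters become six bytes because UTF-8 uses three bytes for each of these code points.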
    • Without the Unicode form (Python 2)
In [1]: name = '张三'

In [2]: print name
张三

In [3]: name
Out[3]: '\xe5\xbc\xa0\xe4\xb8\x89'  # UTF-8 encoding, the storage form

In [4]: len(name)
Out[4]: 6

In [5]: name[0:2]  # slice operation
Out[5]: '\xe5\xbc'

In [6]: print name[0:1]
?

In [7]: type(name)  # the type is str
Out[7]: str
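The session above slices a byte string in the middle of a character, which is why printing the slice gives garbage. The same effect can be reproduced in Python 3, where the storage form has its own type, bytes. This is a sketch for illustration; the variable names are not from the original.

```python
data = "张三".encode("utf-8")  # six bytes, three per character

first_char_bytes = data[0:3]  # exactly one complete character
broken = data[0:2]            # cut in the middle of a character

print(first_char_bytes.decode("utf-8"))

# A partial character cannot be decoded strictly.
try:
    broken.decode("utf-8")
except UnicodeDecodeError as exc:
    print("cannot decode a partial character:", exc.reason)

# errors="replace" substitutes U+FFFD (the replacement character) instead of raising.
replaced = broken.decode("utf-8", errors="replace")
print(replaced)
```

Slicing is only safe on the decoded text, where each index refers to a whole character.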




    • With the Unicode form:
In Python 2, prefix the string literal with u:
In [8]: name = u'张三'

In [9]: name
Out[9]: u'\u5f20\u4e09'  # Unicode code points

In [10]: print name
张三

In [11]: print name[0:1]
张

In [12]: name[0:1]
Out[12]: u'\u5f20'

In [13]: len(name)
Out[13]: 2

In [15]: type(name)
Out[15]: unicode  # the type is unicode




Here is the key point.
    • Decoding and encoding functions
To convert between Unicode and UTF-8, Python provides two built-in methods: encode() and decode().
    • Encoding: encode() converts from the representation form to a storage form
    • Decoding: decode() converts from a storage form to the representation form


Unicode is not tied to any particular encoding.
In [37]: name = u'张三'

In [38]: b_name = name.encode('utf-8')  # Unicode can be encoded into different storage forms, for example UTF-8

In [39]: b_name
Out[39]: '\xe5\xbc\xa0\xe4\xb8\x89'

In [47]: type(b_name)  # the type is str
Out[47]: str

In [40]: b_name2 = name.encode('utf-16')  # can also be encoded as UTF-16

In [41]: b_name2
Out[41]: '\xff\xfe _\tN'

In [42]: b_name3 = name.encode('utf-32')  # can also be encoded as UTF-32

In [43]: b_name3
Out[43]: '\xff\xfe\x00\x00 _\x00\x00\tN\x00\x00'

In [44]: j_name = b_name.decode('utf-8')  # decode UTF-8 into Unicode

In [45]: j_name
Out[45]: u'\u5f20\u4e09'

In [46]: type(j_name)  # the type is unicode
Out[46]: unicode
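To compare the storage forms side by side, the following Python 3 sketch encodes the same text with several codecs and checks that each one decodes back to the original. The utf-16-le entry is an addition for illustration: it is the little-endian variant that writes no byte order mark (the leading \xff\xfe seen above).

```python
text = "张三"

sizes = {}
for codec in ("utf-8", "utf-16", "utf-32", "utf-16-le"):
    encoded = text.encode(codec)
    # Whatever the storage form, decoding with the same codec restores the text.
    assert encoded.decode(codec) == text
    sizes[codec] = len(encoded)
    print(codec, len(encoded), encoded)

print(sizes)
```

The byte counts differ (the utf-16 and utf-32 codecs also prepend a byte order mark), but the decoded text is identical in every case.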




To summarize so far: writing a unicode object directly to a file fails with an error saying that the 'ascii' codec cannot encode characters whose ordinal is not in range(128). ASCII covers only the values 0-127, and Chinese characters fall far outside that range. The reason for the error: Unicode is a form of representation, so the text must be encoded with a concrete encoding before it can be stored. Python 2 uses ASCII as its default encoding, so it tries to store the text as ASCII. The text contains Chinese characters, which ASCII cannot represent, so the write fails:
In [47]: name = u'张三'

In [50]: with open('/tmp/test', 'w') as f:
    ...:     f.write(name)
    ...:
---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-4-0d87fa01de83> in <module>()
      1 with open('/tmp/test', 'w') as f:
----> 2     f.write(name)

UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
The solution is to encode the text first, as UTF-8, UTF-16, and so on.
In [51]: with open('/tmp/test', 'w') as f:
    ...:     f.write(name.encode('utf-8'))  # encode to UTF-8, then write to the file
    ...:


In [52]: with open('/tmp/test', 'r') as f:
    ...:     new_name = f.read()
    ...:

In [53]: new_name.decode('utf-8')  # decode UTF-8 into Unicode
Out[53]: u'\u5f20\u4e09'
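The same failure and the same fix can be reproduced in Python 3 by encoding explicitly. This is a sketch, not the original session: it writes to a temporary directory rather than /tmp/test, and the variable names are illustrative.

```python
import os
import tempfile

text = "张三"

# Encoding to ASCII fails for the same reason as the implicit write in Python 2.
error_info = None
try:
    text.encode("ascii")
except UnicodeEncodeError as exc:
    error_info = (exc.encoding, exc.start, exc.end)
print(error_info)  # ('ascii', 0, 2): characters 0-1 cannot be encoded

# The fix: choose an encoding that can represent the text, then write bytes.
path = os.path.join(tempfile.mkdtemp(), "test")
with open(path, "wb") as f:
    f.write(text.encode("utf-8"))

with open(path, "rb") as f:
    raw = f.read()

print(raw.decode("utf-8") == text)  # True
```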




    • Differences between Python 2 and Python 3 regarding character sets
    • Python 3 has two types that represent sequences of characters: bytes and str. Instances of bytes contain raw 8-bit values; instances of str contain Unicode characters.
    • Python 2 also has two types that represent sequences of characters: str and unicode. Unlike in Python 3, instances of str contain raw 8-bit values, whereas instances of unicode contain Unicode characters.
1. In Python 2, str is the ordinary string type and unicode is the Unicode type. In other words, a string literal with no prefix is a str, and a literal with the u prefix is a unicode:
In [15]: name = u'张三'

In [16]: type(name)
Out[16]: unicode




2. In Python 3, a string literal with no prefix is a str.
3. Python 3's str corresponds to Python 2's unicode, and Python 2's str corresponds to Python 3's bytes.
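A short Python 3 sketch of what this mapping means in practice: str and bytes are distinct types that never compare equal and cannot be mixed without an explicit encode or decode. The variable names are illustrative.

```python
text = "张三"               # str: Unicode characters
raw = text.encode("utf-8")  # bytes: raw 8-bit values

print(type(text).__name__, type(raw).__name__)  # str bytes
print(text == raw)                              # False, even though raw encodes text

# Python 3 refuses to combine the two types implicitly.
try:
    text + raw
    mixing_allowed = True
except TypeError as exc:
    mixing_allowed = False
    print("mixing is rejected:", exc)

# Conversion always goes through an explicit encoding.
print(raw.decode("utf-8") == text)              # True
```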
    • The open function
Python 2's standard library includes the codecs module, which encodes and decodes automatically. Its codecs.open function accepts an encoding parameter:
In [55]: import codecs

In [56]: name = u'张三'

In [57]: with codecs.open('/tmp/test', 'w', encoding='utf-8') as f:
    ...:     f.write(name)
    ...:

In [58]: with codecs.open('/tmp/test', 'r', encoding='utf-8') as f:
    ...:     new_name = f.read()
    ...:

In [59]: new_name
Out[59]: u'\u5f20\u4e09'
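For code that has to run under both Python 2 and Python 3, one possible approach is sketched below. It assumes a writable temporary directory and uses two standard library functions that both take an encoding argument: codecs.open and io.open (the latter is an addition here, not mentioned in the original). The text is written with escapes so that the source file itself stays pure ASCII.

```python
import codecs
import io
import os
import tempfile

name = u"\u5f20\u4e09"  # the two characters of the example name, as escapes
path = os.path.join(tempfile.mkdtemp(), "test")

# codecs.open encodes on write, so the caller passes text rather than bytes.
with codecs.open(path, "w", encoding="utf-8") as f:
    f.write(name)

# io.open decodes on read and returns text.
with io.open(path, "r", encoding="utf-8") as f:
    new_name = f.read()

print(new_name == name)  # True
```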
Python 3

The built-in open function itself accepts an encoding parameter, which works the same way as the codecs module in Python 2:
>>> name = '张三'
>>> name
'张三'
>>> with open('/tmp/test', 'w', encoding='utf-8') as f:
...     f.write(name)
...
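A self-contained version of the Python 3 round trip, as a sketch: it uses a temporary directory instead of /tmp/test so that it runs on any platform, and it also reads the text back. In text mode, write returns the number of characters written, not the number of bytes.

```python
import os
import tempfile

name = "张三"
path = os.path.join(tempfile.mkdtemp(), "test")

# Text mode plus an explicit encoding: open() encodes on write...
with open(path, "w", encoding="utf-8") as f:
    written = f.write(name)

# ...and decodes on read.
with open(path, "r", encoding="utf-8") as f:
    new_name = f.read()

print(written)                # 2 characters written
print(os.path.getsize(path))  # 6 bytes on disk
print(new_name == name)       # True
```

Passing encoding explicitly matters because the default used by open depends on the platform and locale settings.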







# Summary: There are many ways to represent Unicode characters as binary data, and the most common encoding is UTF-8. Python 3's str and Python 2's unicode are not associated with any specific binary encoding. To convert Unicode characters to binary data, you must use the encode method; to convert binary data to Unicode characters, you must use the decode method. When programming, perform encoding and decoding at the outermost interfaces of the program. The core of the program should use Unicode character types and make no assumptions about character encodings.
Python 3




# In Python 3, you need a function that accepts str or bytes and always returns str:
def to_str(bytes_or_str):
    if isinstance(bytes_or_str, bytes):
        value = bytes_or_str.decode('utf-8')
    else:
        value = bytes_or_str
    return value  # Instance of str


# You also need a function that accepts str or bytes and always returns bytes:
def to_bytes(bytes_or_str):
    if isinstance(bytes_or_str, str):
        value = bytes_or_str.encode('utf-8')
    else:
        value = bytes_or_str
    return value  # Instance of bytes




Python 2
# In Python 2, you need a function that accepts str or unicode and always returns unicode:
# Python 2
def to_unicode(unicode_or_str):
    if isinstance(unicode_or_str, str):
        value = unicode_or_str.decode('utf-8')
    else:
        value = unicode_or_str
    return value  # Instance of unicode


# You also need a function that accepts str or unicode and always returns str:
# Python 2
def to_str(unicode_or_str):
    if isinstance(unicode_or_str, unicode):
        value = unicode_or_str.encode('utf-8')
    else:
        value = unicode_or_str
    return value  # Instance of str
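The summary's advice to encode and decode only at the outermost interface can be sketched as follows in Python 3. The functions shout and handle_request are hypothetical names invented for this illustration; they stand in for a program's core logic and its I/O boundary.

```python
def shout(text):
    """Hypothetical core logic: works on str only and knows nothing about encodings."""
    return text.upper() + "!"


def handle_request(raw, encoding="utf-8"):
    """Hypothetical boundary function: bytes come in, bytes go out."""
    text = raw.decode(encoding)     # decode once, on the way in
    result = shout(text)            # the core works purely with Unicode text
    return result.encode(encoding)  # encode once, on the way out


reply = handle_request("héllo 张三".encode("utf-8"))
print(reply.decode("utf-8"))
```

Because only handle_request mentions an encoding, switching the wire format to another encoding would not touch the core logic.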




Part of this article is excerpted from Effective Python: 59 Specific Ways to Write Better Python, Item 3: Know the Differences Between bytes, str, and unicode.

