Understanding Python encoding and Unicode

I'm sure there are plenty of explanations of Unicode and Python out there, but I'm going to write my own to make the topic easier for me to understand.





Byte stream vs Unicode Object



Let's start by defining a string in Python. When you use the string type, you actually store a byte string.


[  a ][  b ][  c ] = "abc"
[ 97 ][ 98 ][ 99 ] = "abc"


In this case, "abc" is a byte string, and 97, 98, and 99 are ASCII codes. A deficiency of Python 2.x is that it treats all strings as ASCII by default. Unfortunately, ASCII is the lowest common denominator among Latin-style character sets.



ASCII uses the first 128 values (0 to 127) for its character map. Character maps such as windows-1252 and UTF-8 share those same first 128 characters. It is safe to mix string encodings as long as every byte in your strings has a value of 127 or below. However, relying on this assumption is very dangerous, as discussed below.
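To check the claim about the shared range, here is a small sketch that decodes every byte value from 0 to 127 with all three codecs and compares the results. It uses explicit b"" literals so that it should behave the same on Python 2.6+ and Python 3; the variable names are my own and purely illustrative.

```python
# Byte values of the "abc" byte string from the diagram above.
abc_values = list(bytearray(b"abc"))

# Decode each single byte 0..127 with three codecs and check they all agree.
codecs_to_check = ["ascii", "windows-1252", "utf-8"]
first_128_agree = all(
    len(set(bytes(bytearray([value])).decode(name) for name in codecs_to_check)) == 1
    for value in range(128)
)

print(abc_values)       # [97, 98, 99]
print(first_128_agree)  # True
```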



Problems start when a byte value in your string is greater than 127. Let's take a look at a string encoded with windows-1252. The windows-1252 character map is an 8-bit character map, so it has 256 slots in total. The first 128 are the same as ASCII, and the remaining 128 are assigned by windows-1252.


A windows-1252 encoded string looks like this:
[ 97 ] [ 98 ] [ 99 ] [ 150 ] = "abc–"


A windows-1252 string is still a byte string, but notice that the value of the last byte is greater than 127. If Python tries to decode this byte stream with the default ASCII codec, it raises an error. Let's see what happens when Python decodes this string:


>>> x = "abc" + chr(150)
>>> print repr(x)
'abc\x96'
>>> u"Hello" + x
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0x96 in position 3: ordinal not in range(128)
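The failure goes away once the right codec is named explicitly. The sketch below is my own illustration, not part of the original session: it uses a b"" literal and an explicit decode call so that it should run on both Python 2.6+ and Python 3, where the implicit ASCII decoding shown above no longer exists.

```python
windows_1252_bytes = b"abc\x96"  # same bytes as "abc" + chr(150) above

# Decoding as ASCII fails because 0x96 (150) is outside the range 0..127.
try:
    windows_1252_bytes.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# Decoding with the codec the bytes were written in succeeds.
text = windows_1252_bytes.decode("windows-1252")

print(ascii_failed)          # True
print(text == u"abc\u2013")  # True
```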


Let's use UTF-8 to encode another string:


A UTF-8 encoded string looks like this:
[ 97 ] [ 98 ] [ 99 ] [ 226 ] [ 128 ] [ 147 ] = "abc–"
[0x61] [0x62] [0x63] [0xe2] [0x80] [0x93] = "abc–"


If you look up the en dash in your favorite Unicode table, you will find that its code point is 8211 (0x2013). This value is greater than the ASCII maximum of 127, and greater than what a single byte can store. Because 8211 (0x2013) needs two bytes, UTF-8 must use some tricks to tell the system that storing this one character takes three bytes. Let's see what happens when Python uses the default ASCII codec on a UTF-8 encoded string that contains a byte value greater than 127.


>>> x = "abc\xe2\x80\x93"
>>> print repr(x)
'abc\xe2\x80\x93'
>>> u"Hello" + x
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 3: ordinal not in range(128)


As you can see, Python uses the ASCII codec by default. When it reaches the 4th character, Python throws an error because the byte's value, 226, is greater than 127. This is the problem with mixing encodings.
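The numbers quoted for the en dash can be checked directly. This is a small sketch with variable names of my own choosing, written to run on Python 2.6+ and Python 3 alike.

```python
en_dash = u"\u2013"

# The Unicode code point of the en dash.
code_point = ord(en_dash)

# The byte values UTF-8 uses to store that single character.
utf8_values = list(bytearray(en_dash.encode("utf-8")))

print(code_point)       # 8211
print(hex(code_point))  # 0x2013
print(utf8_values)      # [226, 128, 147]
```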



Decode Byte stream



The term "decoding" can be confusing when you start learning about Unicode in Python. You decode a byte stream into a Unicode object, and you encode a Unicode object into a byte stream.



Python needs to know how to decode a byte stream into a Unicode object. When you get a byte stream, you call its decode method to create a Unicode object from it.



It is best to decode byte streams to Unicode as early as possible.


>>> x = "abc\xe2\x80\x93"
>>> x = x.decode("utf-8")
>>> print type(x)
<type 'unicode'>
>>> y = "abc" + chr(150)
>>> y = y.decode("windows-1252")
>>> print type(y)
<type 'unicode'>
>>> print x + y
abc–abc–
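The point of the session above is that two different byte strings can decode to the same text. Here is the same idea as a sketch that should run on Python 2.6+ and Python 3; the variable names are illustrative.

```python
utf8_bytes = b"abc\xe2\x80\x93"  # "abc–" encoded as UTF-8
cp1252_bytes = b"abc\x96"        # "abc–" encoded as windows-1252

# Decode each byte string with the codec it was written in.
x = utf8_bytes.decode("utf-8")
y = cp1252_bytes.decode("windows-1252")

print(utf8_bytes == cp1252_bytes)  # False: the raw bytes differ
print(x == y)                      # True: the decoded text is identical

# Once both are Unicode objects they can be combined safely.
combined = x + y
print(len(combined))               # 8
```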


Encode Unicode as a byte stream



A Unicode object is an encoding-agnostic representation of text. You cannot simply output a Unicode object; it must be turned into a byte string before output. Python will happily do this for you, but when it does, it encodes Unicode into a byte stream using ASCII by default, and this default behavior is the cause of many headaches.


>>> u = u"abc\u2013"
>>> print u
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 3: ordinal not in range(128)
>>> print u.encode("utf-8")
abc–
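Encoding explicitly avoids relying on whatever default the interpreter picks. A minimal sketch of the two outcomes, with names of my own choosing, that should run on Python 2.6+ and Python 3:

```python
text = u"abc\u2013"

# ASCII has no slot for the en dash, so encoding to ASCII fails.
try:
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# UTF-8 can represent any Unicode character.
utf8_bytes = text.encode("utf-8")

print(ascii_failed)                      # True
print(utf8_bytes == b"abc\xe2\x80\x93")  # True
```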


Using the codecs module



The codecs module is a great help when it comes to processing byte streams. You can open a file with a specified encoding, and the content you read from that file is automatically converted to Unicode objects.



Try this:


>>> import codecs
>>> fh = codecs.open("/tmp/utf-8.txt", "w", "utf-8")
>>> fh.write(u"\u2013")
>>> fh.close()


All it does is take a Unicode object and write it to the file in UTF-8 encoding. It works in the other direction as well: when reading data from a file, codecs.open creates a file object that automatically converts the UTF-8 encoded contents into Unicode objects.
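Here is a minimal round-trip sketch of writing and then reading with codecs.open. The temporary directory is my own stand-in for the /tmp/utf-8.txt path used above, and the final binary read is only there to show what actually landed on disk. On recent Python 3 releases the built-in open with an encoding argument does the same job.

```python
import codecs
import os
import shutil
import tempfile

# A temporary path stands in for /tmp/utf-8.txt.
directory = tempfile.mkdtemp()
path = os.path.join(directory, "utf-8.txt")

# Write a Unicode object; codecs encodes it as UTF-8 on the way out.
fh = codecs.open(path, "w", "utf-8")
fh.write(u"\u2013")
fh.close()

# Read it back; codecs decodes the UTF-8 bytes on the way in.
fh = codecs.open(path, "r", "utf-8")
content = fh.read()
fh.close()

# Read the raw bytes without any decoding, for comparison.
with open(path, "rb") as raw_fh:
    raw = raw_fh.read()

shutil.rmtree(directory)

print(content == u"\u2013")    # True
print(raw == b"\xe2\x80\x93")  # True
```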



Let's take the example above a step further, this time using a urllib stream.


>>> import urllib
>>> stream = urllib.urlopen("http://www.google.com")
>>> Reader = codecs.getreader("utf-8")
>>> fh = Reader(stream)
>>> type(fh.read(1))
<type 'unicode'>
>>> Reader
<class encodings.utf_8.StreamReader at 0xa6f890>


Single-line version:


>>> fh = codecs.getreader("utf-8")(urllib.urlopen("http://www.google.com"))
>>> type(fh.read(1))
<type 'unicode'>
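The same reader can be tried without a network connection. In this sketch an io.BytesIO object is my stand-in for the urllib stream; codecs.getreader wraps any object with a read method that returns bytes.

```python
import codecs
import io

# io.BytesIO stands in for a byte stream such as the urllib response.
stream = io.BytesIO(b"abc\xe2\x80\x93")

# Wrap the byte stream in a reader that decodes UTF-8 as it reads.
reader = codecs.getreader("utf-8")(stream)
text = reader.read()

print(text == u"abc\u2013")  # True
print(len(text))             # 4
```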


You must be very careful with the codecs module. Whatever you pass to write must be a Unicode object; otherwise Python will automatically try to decode the byte string as ASCII first.


>>> x = "abc\xe2\x80\x93" # our "abc–" utf-8 string
>>> fh = codecs.open("/tmp/foo.txt", "w", "utf-8")
>>> fh.write(x)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.5/codecs.py", line 638, in write
    return self.writer.write(data)
  File "/usr/lib/python2.5/codecs.py", line 303, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 3: ordinal not in range(128)


Ugh, there Python goes again, trying to decode everything as ASCII.
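One way around this is to decode the byte string yourself before handing it to the writer. In the sketch below, io.BytesIO is my in-memory stand-in for the file, and codecs.getwriter builds the same kind of encoding writer that codecs.open uses for output.

```python
import codecs
import io

utf8_bytes = b"abc\xe2\x80\x93"  # our "abc–" UTF-8 byte string

# An in-memory byte buffer stands in for the output file.
buffer = io.BytesIO()
writer = codecs.getwriter("utf-8")(buffer)

# Decode explicitly, so the writer receives a Unicode object.
writer.write(utf8_bytes.decode("utf-8"))

print(buffer.getvalue() == utf8_bytes)  # True
```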



The issue of UTF-8 byte stream slicing



Because a UTF-8 encoded string is just a list of bytes, len() and slice operations do not work the way you would expect. Let's start with the string we used earlier.


[ 97 ] [ 98 ] [ 99 ] [ 226 ] [ 128 ] [ 147 ] = "abc–"


Next do the following:


>>> my_utf8 = "abc–"
>>> print len(my_utf8)
6


What? It looks like 4 characters, but len says 6. That is because len counts the number of bytes rather than the number of characters.


>>> print repr(my_utf8)
'abc\xe2\x80\x93'


Now let's slice this string.


>>> my_utf8[-1] # Get the last char
'\x93'


Ugh. The slice returns the last byte, not the last character.



To slice UTF-8 text properly, decode the byte stream into a Unicode object first. It can then be safely manipulated and counted.


>>> my_unicode = my_utf8.decode("utf-8")
>>> print repr(my_unicode)
u'abc\u2013'
>>> print len(my_unicode)
4
>>> print my_unicode[-1]
–
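The byte counts and character counts can be compared side by side. This sketch uses the same variable names as the session above and slices with [-1:] so that it gives the same result on Python 2.6+ and Python 3.

```python
my_utf8 = u"abc\u2013".encode("utf-8")  # byte string
my_unicode = my_utf8.decode("utf-8")    # Unicode object

# On the byte string, len and slicing operate on bytes.
print(len(my_utf8))                 # 6
print(my_utf8[-1:] == b"\x93")      # True: only the last byte of the dash

# On the Unicode object, they operate on characters.
print(len(my_unicode))              # 4
print(my_unicode[-1] == u"\u2013")  # True: the whole en dash
```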


When Python automatically encodes/decodes



In some cases, Python automatically encodes or decodes using ASCII, and throws an error when that fails.



The first case is when it tries to merge Unicode and byte strings together.


>>> u"" + u"\u2019".encode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)


The same happens when you join a list. Python automatically decodes byte strings to Unicode when the list contains both byte strings and Unicode objects.


>>> ",".join([u"This string\u2019s unicode", u"This string\u2019s utf-8".encode("utf-8")])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 11: ordinal not in range(128)


Or when formatting a byte string with both kinds of argument:


>>> "%s\n%s" % (u"This string\u2019s unicode", u"This string\u2019s utf-8".encode("utf-8"),)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 11: ordinal not in range(128)


Basically, when you mix Unicode and byte strings together, it can cause an error.



In the following example, you read a UTF-8 file as a byte string and then append some Unicode text to the same buffer. Joining the two raises a UnicodeDecodeError.


>>> buffer = []
>>> fh = open("utf-8-sample.txt")
>>> buffer.append(fh.read())
>>> fh.close()
>>> buffer.append(u"This string\u2019s unicode")
>>> print repr(buffer)
['This file\xe2\x80\x99s got utf-8 in it\n', u'This string\u2019s unicode']
>>> print "\n".join(buffer)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 9: ordinal not in range(128)


You can solve this problem by using the codecs module to load the file as Unicode.


>>> import codecs
>>> buffer = []
>>> fh = codecs.open("utf-8-sample.txt", "r", "utf-8")
>>> buffer.append(fh.read())
>>> fh.close()
>>> buffer.append(u"This string\u2019s unicode")
>>> print repr(buffer)
[u'This file\u2019s got utf-8 in it\n', u'This string\u2019s unicode']
>>> print "\n".join(buffer)
This file’s got utf-8 in it

This string’s unicode


As you can see, the stream created by codecs.open automatically converts the byte string to Unicode when the data is read.
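For byte strings that arrive from somewhere other than codecs.open, one option is to normalize everything to Unicode before joining. The helper below, to_unicode, is a hypothetical name of my own and not a standard library function; it assumes UTF-8 unless told otherwise.

```python
def to_unicode(value, encoding="utf-8"):
    """Return value as a Unicode object, decoding byte strings if needed."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# A list that mixes a Unicode object and a UTF-8 byte string.
parts = [
    u"This string\u2019s unicode",
    u"This string\u2019s utf-8".encode("utf-8"),
]

# Normalize each item first, then join with a Unicode separator.
joined = u",".join(to_unicode(part) for part in parts)

print(joined == u"This string\u2019s unicode,This string\u2019s utf-8")  # True
```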



Best practices



1. Decode first, encode last



2. Use UTF-8 encoding by default



3. Use codecs and Unicode objects to simplify processing



Decoding first means that whenever a byte stream comes in, it should be decoded to Unicode as early as possible. This prevents the problems with len() and with slicing UTF-8 byte streams.



Encoding last means that text is encoded only when it is ready for output. That output may be a file, a database, a socket, and so on. Unicode objects are encoded only after all processing is complete. Encoding last also means not letting Python encode Unicode objects for you: Python will use ASCII, and your program will crash.



Using UTF-8 by default means that, because UTF-8 can handle any Unicode character, you should prefer it over windows-1252 and ASCII.



The codecs module helps us avoid many pitfalls when dealing with streams such as files or sockets. Without the tools codecs provides, you would have to read the contents of a file as a byte stream and then decode that byte stream into a Unicode object yourself.



The codecs module lets you turn a byte stream into Unicode objects quickly, with far less hassle.
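Put together, the first two practices might look like the sketch below. The convert function and its cleanup step (strip and upper-case) are hypothetical, chosen only to show where decoding and encoding sit relative to the actual work.

```python
def convert(raw, source_encoding, target_encoding="utf-8"):
    """Decode first, process as Unicode, encode last."""
    text = raw.decode(source_encoding)   # decode as soon as the bytes arrive
    text = text.strip().upper()          # all processing happens on Unicode
    return text.encode(target_encoding)  # encode only when ready for output


# windows-1252 input in, UTF-8 output out.
result = convert(b"  abc\x96\n", "windows-1252")

print(result == b"ABC\xe2\x80\x93")  # True
```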



UTF-8 explained



This final part is a quick introduction to UTF-8. If you are a super geek, you can skip it.



With UTF-8, any byte between 128 and 255 is special. These bytes tell the system that they are part of a multi-byte sequence.


Our UTF-8 encoded string looks like this:
[ 97 ] [ 98 ] [ 99 ] [ 226 ] [ 128 ] [ 147 ] = "abc–"


The last 3 bytes are a UTF-8 multi-byte sequence. If you convert the first of these three bytes to binary, you get the following:


11100010


The leading bits 1110 (three 1s followed by a 0) tell the system that this byte starts a 3-byte sequence: 226, 128, 147.



Here is the complete byte sequence:


11100010 10000000 10010011


Then you apply the three-byte sequence to the following mask.


1110xxxx 10xxxxxx 10xxxxxx
XXXX0010 XX000000 XX010011 Remove the X's
0010       000000   010011 Collapse the numbers
00100000 00010011          Get Unicode number 0x2013, 8211 The "–"
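The masking steps above translate directly into bit operations. The function below is a minimal sketch for this one 3-byte case only (it does no validation and is not a general UTF-8 decoder); its name is my own.

```python
def decode_three_byte_sequence(first, second, third):
    """Combine the payload bits of a 3-byte UTF-8 sequence into a code point."""
    return (
        ((first & 0x0F) << 12)    # 1110xxxx: keep the low 4 bits
        | ((second & 0x3F) << 6)  # 10xxxxxx: keep the low 6 bits
        | (third & 0x3F)          # 10xxxxxx: keep the low 6 bits
    )


code_point = decode_three_byte_sequence(226, 128, 147)

print(format(226, "08b"))  # 11100010
print(code_point)          # 8211
print(hex(code_point))     # 0x2013
```

The result matches what the built-in UTF-8 codec produces for the same three bytes.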


Those are the basics of UTF-8. If you want more detail, take a look at the UTF-8 Wikipedia page.



Original link: ERIC MORITZ translator: Bole Online-Base Holy OMG

