Python encoding and Unicode

There are plenty of explanations of Unicode and Python out there, but I'm going to write my own to make the topic easier for me to understand.





Byte stream vs Unicode Object



Let's start by defining a string in Python. In Python 2.x, when you use the string type, you are actually storing a byte string.

[a] [b] [c] = "abc"
[97] [98] [99] = "abc"
In this example, the string abc is a byte string; 97, 98, and 99 are its ASCII codes. Python 2.x treats all strings as ASCII by default. Unfortunately, ASCII is only the lowest common denominator among Latin character encodings.

ASCII maps the first 128 values (0–127) to characters. Encodings like windows-1252 and UTF-8 share those same first 128 characters, so it is safe to mix encodings as long as every byte value in your string is less than 128. Making that assumption in general, however, is dangerous, as we'll see below.
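Since this article's examples are Python 2, here is a quick sketch of that ASCII overlap in Python 3 syntax (where bytes plays the role of the old byte string): a byte string whose values are all below 128 decodes to the same text under all three codecs.

```python
# A byte string made only of values < 128 means the same thing
# in ASCII, UTF-8, and windows-1252.
data = b"abc"

decoded = {data.decode(codec) for codec in ("ascii", "utf-8", "windows-1252")}
assert decoded == {"abc"}  # all three codecs agree
```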

Problems pop up when your string contains a byte value greater than 127. Let's look at a string encoded with windows-1252. Windows-1252 is an 8-bit character map, so it defines 256 characters in total: the first 128 are the same as ASCII, and the remaining 128 are other characters defined by windows-1252.

A windows-1252 encoded string looks like this:
[97] [98] [99] [150] = "abc–"
This is still a byte string, but notice that the value of the last byte is greater than 127. If Python tries to decode this byte stream with its default ASCII codec, it reports an error. Let's see what happens when Python decodes this string:

>>> x = "abc" + chr(150)
>>> print repr(x)
'abc\x96'
>>> u"Hello" + x
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0x96 in position 3: ordinal not in range(128)
Let's use UTF-8 to encode another string:

A UTF-8 encoded string looks like this:
[97] [98] [99] [226] [128] [147] = "abc–"
[0x61] [0x62] [0x63] [0xe2] [0x80] [0x93] = "abc–"
If you look up a Unicode table, you will find that the Unicode code point for the English en dash is 8211 (0x2013). That value is greater than the ASCII maximum of 127, and more than one byte can store. Because 8211 (0x2013) takes two bytes, UTF-8 has to use some tricks to tell the system that one character is stored across three bytes. Let's see what happens when Python uses the default ASCII codec on a UTF-8 encoded string containing byte values greater than 127.

>>> x = "abc\xe2\x80\x93"
>>> print repr(x)
'abc\xe2\x80\x93'
>>> u"Hello" + x
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 3: ordinal not in range(128)
As you can see, Python falls back to ASCII by default. When it reached the fourth character, Python threw an error because its value, 226, is greater than 127. This is the problem with mixed encodings.
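You can check the en dash's code point and its three-byte UTF-8 form yourself. This sketch uses Python 3 syntax, where text literals are already Unicode:

```python
dash = "\u2013"                        # the English en dash
assert ord(dash) == 0x2013 == 8211     # code point above the ASCII range
assert dash.encode("utf-8") == b"\xe2\x80\x93"  # bytes 226, 128, 147
```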

Decoding byte streams
The term "decode" can be confusing when you first learn about Python Unicode: you decode a byte stream into a Unicode object, and you encode a Unicode object into a byte stream.

Python needs to be told how to decode a byte stream into a Unicode object. When you receive a byte stream, call its decode() method to create a Unicode object from it.

You should decode byte streams into Unicode as early as possible.

>>> x = "abc\xe2\x80\x93"
>>> x = x.decode("utf-8")
>>> print type(x)
<type 'unicode'>
>>> y = "abc" + chr(150)
>>> y = y.decode("windows-1252")
>>> print type(y)
<type 'unicode'>
>>> print x + y
abc–abc–
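The same decode-early pattern looks like this in Python 3 syntax (a sketch; bytes.decode works the same way there):

```python
x = b"abc\xe2\x80\x93".decode("utf-8")              # UTF-8 byte string -> text
y = (b"abc" + bytes([150])).decode("windows-1252")  # windows-1252 byte string -> text
assert x == y == "abc\u2013"  # both now hold the same four characters
print(x + y)                  # abc–abc–
```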
Encoding Unicode into a byte stream
A Unicode object is an encoding-agnostic representation of text. You cannot simply output a Unicode object; it must be turned into a byte string first. Python is well suited to this job, but although it defaults to ASCII when encoding Unicode into a byte stream, that default causes many headaches.

>>> u = u"abc\u2013"
>>> print u
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 3: ordinal not in range(128)
>>> print u.encode("utf-8")
abc–
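The same encode step in Python 3 syntax (sketch), showing that UTF-8 round-trips cleanly:

```python
u = "abc\u2013"
encoded = u.encode("utf-8")
assert encoded == b"abc\xe2\x80\x93"   # the en dash becomes three bytes
assert encoded.decode("utf-8") == u    # decoding gets the text back
```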
Using the codecs module
The codecs module helps a lot when working with byte streams. You can open a file with a declared encoding, and what you read from it is automatically converted into Unicode objects.

  Try this:

>>> import codecs
>>> fh = codecs.open("/tmp/utf-8.txt", "w", "utf-8")
>>> fh.write(u"\u2013")
>>> fh.close()
What this does is take a Unicode object and write it to a file using UTF-8 encoding. You can use the same approach in other situations as well.
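codecs.open still exists in Python 3; here is a sketch of the same write, using a temporary file instead of a hard-coded /tmp path:

```python
import codecs
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "utf-8.txt")
with codecs.open(path, "w", "utf-8") as fh:
    fh.write(u"\u2013")            # pass in Unicode; codecs encodes it

with open(path, "rb") as fh:       # read back the raw bytes
    on_disk = fh.read()
assert on_disk == b"\xe2\x80\x93"  # the en dash stored as three UTF-8 bytes
```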

When reading data from a file, codecs.open creates a file object that automatically converts UTF-8 encoded content into Unicode objects.

Let's follow the example above, this time using a urllib stream.

>>> import urllib
>>> stream = urllib.urlopen("http://www.google.com")
>>> Reader = codecs.getreader("utf-8")
>>> fh = Reader(stream)
>>> type(fh.read(1))
<type 'unicode'>
>>> Reader
<class encodings.utf_8.StreamReader at 0xa6f890>
Single line version:

>>> fh = codecs.getreader("utf-8")(urllib.urlopen("http://www.google.com"))
>>> type(fh.read(1))
<type 'unicode'>
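The same wrapping works over any byte stream, not just urllib. This sketch feeds codecs.getreader an in-memory io.BytesIO object standing in for the network connection:

```python
import codecs
import io

stream = io.BytesIO("abc\u2013".encode("utf-8"))  # stand-in for urllib's stream
fh = codecs.getreader("utf-8")(stream)
text = fh.read()
assert isinstance(text, str)       # decoded text, not bytes
assert text == "abc\u2013"
```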
You must be careful with the codecs module, though. What you pass to write() must be a Unicode object; otherwise codecs will first try to decode your byte string as ASCII.

>>> x = "abc\xe2\x80\x93"  # our "abc–" utf-8 string
>>> fh = codecs.open("/tmp/foo.txt", "w", "utf-8")
>>> fh.write(x)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.5/codecs.py", line 638, in write
    return self.writer.write(data)
  File "/usr/lib/python2.5/codecs.py", line 303, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 3: ordinal not in range(128)
There it is again: Python fell back to decoding everything as ASCII.

The problem of slicing a UTF-8 byte stream
Because a UTF-8 encoded string is just a list of bytes, len() and slicing do not work as you might expect. Let's start with the string we used earlier.

[97] [98] [99] [226] [128] [147] = "abc–"
Next do the following:

>>> my_utf8 = "abc–"
>>> print len(my_utf8)
6
What? It looks like 4 characters, but len() says 6. That's because len() counts bytes, not characters.

>>> print repr(my_utf8)
'abc\xe2\x80\x93'
Now let's slice this string.

>>> my_utf8[-1]  # Get the last char
'\x93'
Ugh, the slice gives us the last byte, not the last character.

To slice UTF-8 properly, you should decode the byte stream into a Unicode object first. Then you can operate on it and count characters safely.

>>> my_unicode = my_utf8.decode("utf-8")
>>> print repr(my_unicode)
u'abc\u2013'
>>> print len(my_unicode)
4
>>> print my_unicode[-1]
–
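Python 3 makes the byte/character distinction explicit in the types. The same experiment there, as a sketch:

```python
raw = "abc\u2013".encode("utf-8")   # bytes: six of them
assert len(raw) == 6
assert raw[-1:] == b"\x93"          # slicing bytes gives the last byte

text = raw.decode("utf-8")          # str: four characters
assert len(text) == 4
assert text[-1] == "\u2013"         # slicing text gives the last character
```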
When Python encodes/decodes automatically
In some cases Python automatically encodes/decodes using ASCII, and throws an error when that fails.

The first case is when it tries to merge Unicode and byte strings together.

>>> u"" + u"\u2019".encode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)
The same thing happens when you join lists: Python automatically tries to decode byte strings into Unicode when a list contains both byte strings and Unicode objects.

>>> ",".join([u"This string\u2019s unicode", u"This string\u2019s utf-8".encode("utf-8")])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 11: ordinal not in range(128)
Or when trying to format a byte string:

>>> "%s\n%s" % (u"This string\u2019s unicode", u"This string\u2019s utf-8".encode("utf-8"),)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 11: ordinal not in range(128)
Basically, whenever you mix Unicode and byte strings together, you get errors.
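For contrast, Python 3 refuses to mix the two types at all, so the mistake surfaces immediately as a TypeError rather than a surprising decoding failure. A quick sketch:

```python
caught = None
try:
    "abc" + b"\xe2\x80\x93"    # text + bytes: Python 3 never guesses an encoding
except TypeError as exc:
    caught = exc
assert caught is not None      # the concatenation was refused outright
```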

In this example, you read a UTF-8 file and then append some Unicode text to the buffer. Joining the two raises a UnicodeDecodeError.

>>> buffer = []
>>> fh = open("utf-8-sample.txt")
>>> buffer.append(fh.read())
>>> fh.close()
>>> buffer.append(u"This string\u2019s unicode")
>>> print repr(buffer)
['This file\xe2\x80\x99s got utf-8 in it\n', u'This string\u2019s unicode']
>>> print "\n".join(buffer)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 9: ordinal not in range(128)
You can solve this by using the codecs module to load the file as Unicode.

>>> import codecs
>>> buffer = []
>>> fh = codecs.open("utf-8-sample.txt", "r", "utf-8")
>>> buffer.append(fh.read())
>>> fh.close()
>>> buffer.append(u"This string\u2019s unicode")
>>> print repr(buffer)
[u'This file\u2019s got utf-8 in it\n', u'This string\u2019s unicode']
>>> print "\n".join(buffer)
This file’s got utf-8 in it

This string’s unicode
As you can see, the stream created by codecs.open automatically converts byte strings to Unicode as the data is read.
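The fixed example, written as a Python 3 sketch against a temporary file so it is self-contained:

```python
import codecs
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "utf-8-sample.txt")
with codecs.open(path, "w", "utf-8") as fh:
    fh.write(u"This file\u2019s got utf-8 in it\n")

buffer = []
with codecs.open(path, "r", "utf-8") as fh:
    buffer.append(fh.read())            # already Unicode, no decode needed
buffer.append(u"This string\u2019s unicode")

joined = "\n".join(buffer)              # all-Unicode join: no error
assert joined.count("\u2019") == 2      # one ’ from the file, one appended
```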

Best Practices
1. Decode first, encode last

2. Use UTF-8 encoding by default

3. Use codecs and Unicode objects to simplify processing

Decoding first means: whenever you get byte-stream input, decode it to Unicode as early as possible. This avoids the len() and slicing problems with UTF-8 byte streams.

Encoding last means: only encode when the text is ready for output, whether that output is a file, a database, a socket, or something else. Encode Unicode objects only after processing is complete. It also means: don't let Python encode Unicode objects for you, because it will use ASCII and your program will crash.

Using UTF-8 by default means: because UTF-8 can represent any Unicode character, you are better off using it than windows-1252 or ASCII.

The codecs module saves a few steps when dealing with streams such as files or sockets. Without it, you would have to read file contents as a byte stream and then decode that byte stream into a Unicode object yourself. The codecs module does the conversion for you, saving a lot of trouble.

Interpreting UTF-8
This last part is a quick primer on how UTF-8 works; if you are a super geek, you can skip this section.

In UTF-8, any byte with a value of 128 or above is special: these bytes tell the system that they are part of a multibyte sequence.

Our UTF-8 encoded string looks like this:
[97] [98] [99] [226] [128] [147] = "abc–"
The last 3 bytes form a UTF-8 multibyte sequence. If you convert the first of these three bytes to binary, you see the following:

11100010
The three leading 1 bits tell the system that this byte starts a 3-byte sequence: 226, 128, 147.

So the complete byte sequence is:

11100010 10000000 10010011
Then you apply the mask below to the three-byte sequence.

1110xxxx 10xxxxxx 10xxxxxx
xxxx0010 xx000000 xx010011   Remove the x's
0010 000000 010011           Collapse the numbers
00100000 00010011            Get the Unicode number 0x2013, 8211: the "–"
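The masking steps above can be done in code. This sketch (Python 3 syntax, where indexing bytes yields integers) strips the 1110/10 marker bits and collapses the remaining bits into the code point:

```python
seq = bytes([226, 128, 147])   # the 3-byte UTF-8 sequence for "–"

# Keep the low 4 bits of the lead byte and the low 6 bits of each
# continuation byte, then shift them together into one number.
code_point = (
    ((seq[0] & 0b00001111) << 12)
    | ((seq[1] & 0b00111111) << 6)
    | (seq[2] & 0b00111111)
)

assert code_point == 0x2013 == 8211
assert chr(code_point) == seq.decode("utf-8") == "\u2013"
```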
This is a basic introduction to UTF-8. For more details, see the UTF-8 Wikipedia page.
