This article describes how to use Unicode in Python 2.x. Python 3 uses Unicode as its default string type, but in Python 2, which is still widely used, there are pitfalls to watch out for. For a fuller treatment, see "Unicode and Python"; here I just want to write a few things down to aid my own understanding and use.
Byte stream vs Unicode object
Let's start by defining a string in Python. When you use the str type, what is actually stored is a byte string.
[ a ][ b ][ c ] = "abc"
[ 97 ][ 98 ][ 99 ] = "abc"
In this example, the string abc is a byte string; 97, 98, and 99 are the ASCII codes. One flaw of Python 2.x is that it treats all strings as ASCII by default. Unfortunately, ASCII is only the lowest common denominator among Latin character sets.
ASCII uses the first 127 numbers for its character mapping. Character mappings like windows-1252 and UTF-8 share those same first 127 characters. It is safe to mix string encodings when every byte in your string has a value below 127. However, relying on this assumption is dangerous, as we'll see below.
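To illustrate this overlap, here is a minimal sketch (using b"" byte literals, which work on both Python 2.6+ and Python 3) showing that a pure-ASCII byte string decodes identically under all three encodings:

```python
# Every byte in "abc" is below 127, so all three codecs agree on what it means.
data = b"abc"
assert data.decode("ascii") == data.decode("utf-8") == data.decode("windows-1252") == u"abc"
```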
Problems can occur when a byte in your string has a value greater than 126. Let's look at a string encoded with windows-1252. Windows-1252 is an 8-bit character mapping, so it has 256 characters in total. The first 127 are the same as ASCII, and the rest are other characters defined by windows-1252.
A windows-1252 encoded string looks like this:
[ 97 ] [ 98 ] [ 99 ] [ 150 ] = "abc–"
This is still a byte string, but notice that the value of the last byte is greater than 126. If Python tries to decode this byte stream with its default ASCII codec, it reports an error. Let's see what happens when Python decodes this string:
>>> x = "abc" + chr(150)
>>> print repr(x)
'abc\x96'
>>> u"Hello" + x
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0x96 in position 3: ordinal not in range(128)
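The same behavior can be sketched with b"" byte literals (valid on both Python 2.6+ and Python 3): ASCII refuses byte 150, while windows-1252 happily maps it to the en dash.

```python
raw = b"abc" + b"\x96"  # byte 150 is outside ASCII's range
# ASCII cannot decode it...
try:
    raw.decode("ascii")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False
assert not decoded_ok
# ...but windows-1252 maps 0x96 to the en dash (U+2013)
assert raw.decode("windows-1252") == u"abc\u2013"
```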
Now let's look at a string encoded with UTF-8:
A UTF-8 encoded string looks like this:
[ 97 ] [ 98 ] [ 99 ] [ 226 ] [ 128 ] [ 147 ] = "abc–"
[0x61] [0x62] [0x63] [0xe2] [0x80] [0x93] = "abc–"
If you look at a Unicode code chart, you will find that the code point of the English en dash is 8211 (0x2013). This value is greater than 127, the ASCII maximum, and greater than what a single byte can hold. Because 8211 (0x2013) does not fit in one byte, UTF-8 uses a scheme to tell the system that this character needs three bytes of storage. Let's look again at what happens when Python uses its default ASCII codec to decode a UTF-8 byte string containing byte values greater than 126.
>>> x = "abc\xe2\x80\x93"
>>> print repr(x)
'abc\xe2\x80\x93'
>>> u"Hello" + x
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 3: ordinal not in range(128)
As you can see, Python defaults to ASCII. When it reaches the 4th byte, whose value is 226, greater than 126, Python throws an error. This is the problem with mixed encodings.
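The "one character, three bytes" claim is easy to verify; this sketch (b""/u"" literals, runs on Python 2.6+ and Python 3) encodes the en dash and checks the result:

```python
dash = u"\u2013"                 # the en dash, code point 8211
encoded = dash.encode("utf-8")
assert encoded == b"\xe2\x80\x93"
assert len(encoded) == 3         # one character, three bytes in UTF-8
```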
Decoding byte streams
When you first learn Unicode in Python, the term "decoding" can be confusing. You decode a byte stream into a Unicode object, and encode a Unicode object into a byte stream.
Python needs to be told how to decode a byte stream into a Unicode object. When you have a byte stream, you call its decode() method to create a Unicode object from it.
You should decode byte streams to Unicode as early as possible.
>>> x = "abc\xe2\x80\x93"
>>> x = x.decode("utf-8")
>>> print type(x)
<type 'unicode'>
>>> y = "abc" + chr(150)
>>> y = y.decode("windows-1252")
>>> print type(y)
<type 'unicode'>
>>> print x + y
abc–abc–
Encoding Unicode into a byte stream
A Unicode object is an encoding-agnostic representation of text. You cannot simply output a Unicode object; it must be turned into a byte string before output. Python is happy to do this for you, but because it defaults to ASCII when encoding Unicode into a byte stream, that default behavior causes many headaches.
>>> u = u"abc\u2013"
>>> print u
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 3: ordinal not in range(128)
>>> print u.encode("utf-8")
abc–
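Beyond picking an explicit codec, encode() also accepts an errors argument for a lossy fallback instead of an exception; a quick sketch (b""/u"" literals, Python 2.6+ and Python 3):

```python
u = u"abc\u2013"
assert u.encode("utf-8") == b"abc\xe2\x80\x93"
# The errors argument trades data for robustness: the dash becomes "?"
assert u.encode("ascii", "replace") == b"abc?"
```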
Using the codecs module
The codecs module is a great help when working with byte streams. You can open a file with a specified encoding, and what you read from it is automatically converted into Unicode objects.
Try this:
>>> import codecs
>>> fh = codecs.open("/tmp/utf-8.txt", "w", "utf-8")
>>> fh.write(u"\u2013")
>>> fh.close()
This takes a Unicode object and writes it to the file in UTF-8. You can use it in other situations too.
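The same round trip can be sketched with the io module, whose io.open has been available since Python 2.6 and behaves much like codecs.open (the temporary path here is my own choice for the example):

```python
import io
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "utf-8.txt")

# Write a Unicode object; io.open encodes it to UTF-8 on the way out
with io.open(path, "w", encoding="utf-8") as fh:
    fh.write(u"\u2013")

# Reading it back decodes the UTF-8 bytes into a Unicode object again
with io.open(path, "r", encoding="utf-8") as fh:
    text = fh.read()

assert text == u"\u2013"
```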
When reading data from a file, the file object created by codecs.open automatically converts a UTF-8 encoded file into Unicode objects.
Next, let's use a urllib stream as an example.
>>> import urllib
>>> stream = urllib.urlopen("http://www.google.com")
>>> Reader = codecs.getreader("utf-8")
>>> fh = Reader(stream)
>>> type(fh.read(1))
<type 'unicode'>
>>> Reader
<class encodings.utf_8.StreamReader at ...>
Single-line version:
>>> fh = codecs.getreader("utf-8")(urllib.urlopen("http://www.google.com"))
>>> type(fh.read(1))
<type 'unicode'>
You must be careful with the codecs module. What you pass in must be a Unicode object; otherwise codecs will try to decode the byte string as ASCII automatically.
>>> x = "abc\xe2\x80\x93"  # our "abc–" utf-8 string
>>> fh = codecs.open("/tmp/foo.txt", "w", "utf-8")
>>> fh.write(x)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.5/codecs.py", line 638, in write
    return self.writer.write(data)
  File "/usr/lib/python2.5/codecs.py", line 303, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 3: ordinal not in range(128)
Oops. There goes Python again, decoding everything with ASCII.
Slicing a UTF-8 byte stream
Because a UTF-8 encoded string is a list of bytes, len() and slicing do not work as expected. Using the string from before:
[ 97 ] [ 98 ] [ 99 ] [ 226 ] [ 128 ] [ 147 ] = "abc–"
Next we will do the following:
>>> my_utf8 = "abc–"
>>> print len(my_utf8)
6
What? It looks 4 characters long, but len() says 6. That's because len() counts bytes, not characters.
>>> print repr(my_utf8)
'abc\xe2\x80\x93'
Now let's slice this string.
>>> my_utf8[-1]  # Get the last char
'\x93'
The slice returns the last byte, not the last character.
To slice UTF-8 safely, decode the byte stream into a Unicode object first. Then you can slice and count safely.
>>> my_unicode = my_utf8.decode("utf-8")
>>> print repr(my_unicode)
u'abc\u2013'
>>> print len(my_unicode)
4
>>> print my_unicode[-1]
–
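Slicing raw bytes is worse than just miscounting: a cut in the middle of a multi-byte sequence leaves an invalid UTF-8 fragment. A sketch (b""/u"" literals, Python 2.6+ and Python 3):

```python
raw = b"abc\xe2\x80\x93"
text = raw.decode("utf-8")
assert len(raw) == 6    # bytes
assert len(text) == 4   # characters

# Cutting the bytes mid-sequence leaves an incomplete UTF-8 fragment
try:
    raw[:4].decode("utf-8")   # b"abc\xe2" ends inside the dash's 3 bytes
    truncated_ok = True
except UnicodeDecodeError:
    truncated_ok = False
assert not truncated_ok

# Slicing the decoded text is always safe
assert text[-1] == u"\u2013"
```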
When Python encodes/decodes automatically
In some situations, Python automatically falls back to ASCII for encoding/decoding and throws an error.
The first case is when it tries to combine Unicode and byte strings.
>>> u"" + u"\u2019".encode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)
The same happens when joining a list. When a list contains both byte strings and Unicode objects, Python tries to decode the byte strings into Unicode.
>>> ",".join([u"This string\u2019s unicode", u"This string\u2019s utf-8".encode("utf-8")])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 11: ordinal not in range(128)
Or when you format with a byte string:
>>> "%s\n%s" % (u"This string\u2019s unicode", u"This string\u2019s utf-8".encode("utf-8"),)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 11: ordinal not in range(128)
Basically, errors will occur when you mix Unicode and byte strings.
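The cure in every one of these cases is the same: normalize everything to Unicode before combining. A sketch (b""/u"" literals; `type(u"")` is unicode on Python 2 and str on Python 3):

```python
parts = [u"This string\u2019s unicode",
         u"This string\u2019s utf-8".encode("utf-8")]

text_type = type(u"")  # unicode on Python 2, str on Python 3
# Decode any byte strings before joining; then no implicit ASCII decode happens
clean = [p if isinstance(p, text_type) else p.decode("utf-8") for p in parts]
joined = u",".join(clean)
assert joined == u"This string\u2019s unicode,This string\u2019s utf-8"
```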
In this example, you read a UTF-8 file and then append some Unicode text to the buffer. Joining them raises a UnicodeDecodeError.
>>> buffer = []
>>> fh = open("utf-8-sample.txt")
>>> buffer.append(fh.read())
>>> fh.close()
>>> buffer.append(u"This string\u2019s unicode")
>>> print repr(buffer)
['This file\xe2\x80\x99s got utf-8 in it\n', u'This string\u2019s unicode']
>>> print "\n".join(buffer)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 9: ordinal not in range(128)
You can solve this by using the codecs module to load the file as Unicode.
>>> import codecs
>>> buffer = []
>>> fh = codecs.open("utf-8-sample.txt", "r", "utf-8")
>>> buffer.append(fh.read())
>>> fh.close()
>>> buffer.append(u"This string\u2019s unicode")
>>> print repr(buffer)
[u'This file\u2019s got utf-8 in it\n', u'This string\u2019s unicode']
>>> print "\n".join(buffer)
This file’s got utf-8 in it

This string’s unicode
As you can see, the stream created by codecs.open automatically converts byte strings into Unicode as data is read.
Best practices
1. Decode early, encode late.
2. Default to UTF-8.
3. Use codecs and Unicode objects to simplify processing.
Decoding early means: whenever you have byte-stream input, decode it into Unicode as soon as possible. This prevents the len() and slicing problems of UTF-8 byte streams.
Encoding late means: encode text into a byte stream only when you are about to output it somewhere, whether that is a file, a database, a socket, and so on. Encode Unicode objects only after you are done processing them. Encoding late also means: do not let Python encode Unicode objects for you; it will use ASCII, and your program will crash.
Defaulting to UTF-8 means: since UTF-8 can handle any Unicode character, you are better off using it instead of windows-1252 or ASCII.
The codecs module lets us avoid many pitfalls when working with streams such as files or sockets. Without the tools codecs provides, you would have to read file contents as a byte stream and then decode them into Unicode objects yourself.
The codecs module allows you to quickly convert bytes into Unicode objects, saving you a lot of trouble.
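The three practices above can be sketched as one pipeline: decode at the input boundary, work on Unicode in the middle, encode at the output boundary (function and data here are illustrative; b""/u"" literals, Python 2.6+ and Python 3):

```python
def process(raw_bytes):
    # Decode early: turn the byte stream into Unicode at the input boundary
    text = raw_bytes.decode("utf-8")
    # Work on Unicode in the middle; len() and slicing are now safe
    text = text.upper()
    # Encode late: back to bytes only at the output boundary
    return text.encode("utf-8")

# The en dash survives the round trip untouched
assert process(b"abc\xe2\x80\x93") == b"ABC\xe2\x80\x93"
```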
Explaining UTF-8
This last part gives you an entry-level understanding of UTF-8. If you are already an expert, you can skip this section.
In UTF-8, any byte with a value from 128 to 255 is special. These bytes tell the system that they are part of a multi-byte sequence.
Our UTF-8 encoded string looks like this:
[ 97 ] [ 98 ] [ 99 ] [ 226 ] [ 128 ] [ 147 ] = "abc–"
The last 3 bytes are a UTF-8 multi-byte sequence. If you convert the first of these three bytes (226) into binary, you see:
11100010
The leading bits 1110 tell the system that this byte begins a 3-byte sequence: 226, 128, 147.
The complete byte sequence:
11100010 10000000 10010011
Then you apply the following mask to the three-byte sequence. (For details, see here)
1110xxxx 10xxxxxx 10xxxxxx
XXXX0010 XX000000 XX010011  Remove the X's
    0010   000000   010011  Collapse the numbers
00100000 00010011           Get Unicode number 0x2013, 8211, the "–"
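The mask-and-collapse steps above translate directly into bit operations; this sketch recovers the code point from the three raw bytes:

```python
b1, b2, b3 = 0xE2, 0x80, 0x93
# Strip the 1110 / 10 / 10 prefixes, then concatenate the payload bits:
# 4 bits from the lead byte, 6 from each continuation byte
code_point = ((b1 & 0x0F) << 12) | ((b2 & 0x3F) << 6) | (b3 & 0x3F)
assert code_point == 0x2013 == 8211  # the en dash
```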
This is just basic knowledge of how UTF-8 works; if you want more details, see the UTF-8 Wikipedia page.