A detailed explanation of Unicode encoding in Python 2.x


I'm sure there are plenty of guides to Unicode and Python already, but I'm writing this one to make things easier for my own understanding.


byte stream vs Unicode Object

Let's first define a string in Python. When you use the string type, what is actually stored is a byte string.

[97] [98] [99] = "abc"

Here the string "abc" is a byte string: 97, 98, 99 are the ASCII codes of its characters. One disadvantage of Python 2.x is that it treats all strings as ASCII by default. Unfortunately, ASCII is the lowest common denominator among Latin character sets.

ASCII is a character map covering the first 128 values (0-127). Character maps such as windows-1252 and UTF-8 share those same first 128 characters. Mixing string encodings is safe as long as every byte in your string has a value below 128. However, relying on that assumption is dangerous, as we'll see below.
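To see this shared-prefix property directly, here is a small sketch (written with `u` and `b` prefixes so it behaves the same on Python 2 and 3): a string made only of code points below 128 encodes to identical bytes under ASCII, UTF-8, and windows-1252.

```python
# Code points below 128 encode to the same bytes in all three encodings.
text = u"abc"  # code points 97, 98, 99: all below 128
for encoding in ("ascii", "utf-8", "windows-1252"):
    assert text.encode(encoding) == b"abc"
```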

The problem occurs when your string contains a byte value greater than 127. Let's look at a string encoded with windows-1252. The windows-1252 character map is an 8-bit map, so it has 256 characters in total. The first 128 are the same as ASCII, and the remaining 128 are other characters defined by windows-1252.

A windows-1252 encoded string looks like this:

[97] [98] [99] [150] = "abc–"

This windows-1252 string is still a byte string, but notice that the value of its last byte is greater than 127. If Python tries to decode this byte stream with the default ASCII codec, it raises an error. Let's see what happens when Python decodes this string:

>>> x = "abc" + chr(150)
>>> print repr(x)
'abc\x96'
>>> u"Hello" + x
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0x96 in position 3: ordinal not in range(128)

Let's use UTF-8 to encode another string:

A UTF-8 encoded string looks like this:

[97] [98] [99] [226] [128] [147] = "abc–"
[0x61] [0x62] [0x63] [0xe2] [0x80] [0x93] = "abc–"

If you look up your familiar Unicode code chart, you will find that the English en dash corresponds to Unicode code point 8211 (0x2013). That value is greater than the maximum ASCII value of 127, and greater than what a single byte can store. Because 8211 (0x2013) needs two bytes of storage, UTF-8 has to use some tricks to tell the system that this character actually occupies three bytes. Let's see what happens when Python uses the default ASCII codec on a UTF-8 encoded string containing byte values greater than 127.
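As a quick check (again using `u`/`b` prefixes so it runs on both Python 2 and 3), we can verify that the en dash really is code point 8211 and really takes three bytes in UTF-8:

```python
dash = u"\u2013"                      # the en dash
assert ord(dash) == 8211 == 0x2013    # beyond a single byte's range
encoded = dash.encode("utf-8")
assert encoded == b"\xe2\x80\x93"     # three bytes: 226, 128, 147
assert len(encoded) == 3
```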

>>> x = "abc\xe2\x80\x93"
>>> print repr(x)
'abc\xe2\x80\x93'
>>> u"Hello" + x
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 3: ordinal not in range(128)

As you can see, Python defaults to ASCII. When it reaches the 4th character, whose value 226 is greater than 127, Python throws an error. This is the problem with mixing encodings.


decoding byte stream

The term "decoding" can be confusing when you first start learning about Python Unicode. You decode a byte stream into a Unicode object, and you encode a Unicode object into a byte stream.

Python needs to know how to decode a byte stream into a Unicode object. When you receive a byte stream, you call its decode method to create a Unicode object from it.

You should decode byte streams to Unicode as early as possible.

>>> x = "abc\xe2\x80\x93"
>>> x = x.decode("utf-8")
>>> print type(x)
<type 'unicode'>
>>> y = "abc" + chr(150)
>>> y = y.decode("windows-1252")
>>> print type(y)
<type 'unicode'>
>>> print x + y
abc–abc–
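The same round trip can be written with explicit `b`-prefixed byte literals, so it behaves identically on Python 2 and 3:

```python
# Decode byte streams into text as soon as they enter the program.
utf8_bytes = b"abc\xe2\x80\x93"       # UTF-8 bytes for "abc–"
cp1252_bytes = b"abc\x96"             # windows-1252 bytes for "abc–"
x = utf8_bytes.decode("utf-8")
y = cp1252_bytes.decode("windows-1252")
assert x == y == u"abc\u2013"         # both decode to the same text
assert x + y == u"abc\u2013abc\u2013" # mixing is now safe
```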


encode Unicode as a byte stream

A Unicode object is an encoding-agnostic representation of text. You can't simply output a Unicode object; it must be turned into a byte string before output. Python is happy to do that work for you, but since it defaults to ASCII when encoding Unicode to a byte stream, that default behavior is the cause of many headaches.

>>> u = u"abc\u2013"
>>> print u
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 3: ordinal not in range(128)
>>> print u.encode("utf-8")
abc–
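A sketch of the explicit version: always name the target encoding yourself instead of relying on the ASCII default.

```python
u = u"abc\u2013"
encoded = u.encode("utf-8")           # explicit, never the ASCII default
assert encoded == b"abc\xe2\x80\x93"
# The implicit ASCII default fails for non-ASCII characters:
try:
    u.encode("ascii")
except UnicodeEncodeError:
    pass
else:
    raise AssertionError("expected UnicodeEncodeError")
```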


Using the Codecs module

The codecs module is a great help when dealing with byte streams. You can open a file with a specified encoding, and the content you read from that file is automatically converted to Unicode objects.

Try this:

>>> import codecs
>>> fh = codecs.open("/tmp/utf-8.txt", "w", "utf-8")
>>> fh.write(u"\u2013")
>>> fh.close()

All this does is take a Unicode object and write it to the file in UTF-8 encoding. You can use the same trick in other situations as well.
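Here is a self-contained round trip through codecs.open, using a temporary file path rather than a hard-coded /tmp name so it runs anywhere:

```python
import codecs
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "utf-8.txt")
fh = codecs.open(path, "w", "utf-8")
fh.write(u"\u2013")                  # a unicode object goes in...
fh.close()
fh = codecs.open(path, "r", "utf-8")
data = fh.read()                     # ...and a unicode object comes out
fh.close()
assert data == u"\u2013"
# On disk, the file holds the raw UTF-8 bytes:
raw = open(path, "rb")
assert raw.read() == b"\xe2\x80\x93"
raw.close()
```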

When reading data from a file, codecs.open creates a file object that automatically converts the UTF-8 encoded file contents into Unicode objects.

Let's continue the example above, this time using a urllib stream.

>>> import urllib
>>> stream = urllib.urlopen("http://www.google.com")
>>> reader = codecs.getreader("utf-8")
>>> fh = reader(stream)
>>> type(fh.read(1))
<type 'unicode'>
>>> reader
<class encodings.utf_8.StreamReader at 0xa6f890>

Single-line version:

>>> fh = codecs.getreader("utf-8")(urllib.urlopen("http://www.google.com"))
>>> type(fh.read(1))
<type 'unicode'>
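If you don't want to hit the network, the same pattern works with any byte stream. Here is a sketch that uses an in-memory io.BytesIO stream as a stand-in for the urllib stream above:

```python
import codecs
import io

stream = io.BytesIO(b"abc\xe2\x80\x93")     # stand-in for urllib.urlopen(...)
fh = codecs.getreader("utf-8")(stream)
first = fh.read(1)
assert first == u"a"                        # reads come back as unicode
assert fh.read() == u"bc\u2013"             # the 3-byte dash decodes cleanly
```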

You must be careful with the codecs module. What you pass into it must be a Unicode object; otherwise, it will automatically decode the byte stream as ASCII.

>>> x = "abc\xe2\x80\x93"  # our "abc–" utf-8 string
>>> fh = codecs.open("/tmp/foo.txt", "w", "utf-8")
>>> fh.write(x)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.5/codecs.py", line 638, in write
    return self.writer.write(data)
  File "/usr/lib/python2.5/codecs.py", line 303, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 3: ordinal not in range(128)

There it goes again: Python went back to decoding everything as ASCII.
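The fix is simply to decode the byte string yourself before handing it to anything unicode-aware:

```python
raw = b"abc\xe2\x80\x93"         # UTF-8 bytes
text = raw.decode("utf-8")       # decode explicitly...
assert text == u"abc\u2013"      # ...now it is safe to pass to fh.write()
```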


The problem of slicing UTF-8 byte streams

Because a UTF-8 encoded string is a list of bytes, len() and slicing operations do not work correctly on it. First, take the string we used before.

[97] [98] [99] [226] [128] [147] = "abc–"

Next, do the following:

>>> my_utf8 = "abc–"
>>> print len(my_utf8)
6

What? It looks like 4 characters, but len() says 6. That's because len() counts bytes rather than characters.

>>> print repr(my_utf8)
'abc\xe2\x80\x93'

Now let's slice this string.

>>> my_utf8[-1]  # get the last char
'\x93'

Ouch. The slice gives us the last byte, not the last character.

To slice UTF-8 correctly, you should decode the byte stream and create a Unicode object. Then you can operate on it and count characters safely.

>>> my_unicode = my_utf8.decode("utf-8")
>>> print repr(my_unicode)
u'abc\u2013'
>>> print len(my_unicode)
4
>>> print my_unicode[-1]
–
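The same point in one self-contained sketch (with `b`/`u` prefixes so it runs under both 2.x and 3.x):

```python
my_utf8 = b"abc\xe2\x80\x93"
assert len(my_utf8) == 6            # len() counts bytes, not characters

my_unicode = my_utf8.decode("utf-8")
assert len(my_unicode) == 4         # now it counts characters
assert my_unicode[-1] == u"\u2013"  # slicing yields the en dash itself
```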


when Python automatically encodes/decodes

In some cases, Python automatically encodes/decodes using ASCII, and throws an error when that fails.

The first case is when it tries to merge Unicode and byte strings together.

>>> u "" + U "\u2019". Encode ("Utf-8")
Traceback (most recent call last):
 File "<stdin>", line 1, I n <module>
unicodedecodeerror: ' ASCII ' codec can ' t decode byte 0xe2 in position 0:  ordinal No in range (12 8)

The same thing happens when you join a list. Python automatically decodes byte strings to Unicode when the list contains both string and Unicode objects.

>>> ",". Join ([u "this string\u2019s Unicode", u "this string\u2019s utf-8". Encode ("Utf-8")])
Traceback ( Most recent called last):
 File "<stdin>", line 1, in <module>
unicodedecodeerror: ' ASCII ' codec can ' t de Code byte 0xe2 in position 11:ordinal No in range (128)

Or when you try to format a byte string:

>>> "%s\n%s"% (U "this string\u2019s Unicode" and U "this string\u2019s utf-8". Encode ("Utf-8")
Traceback ( Most recent called last):
 File "<stdin>", line 1, in <module>
unicodedecodeerror: ' ASCII ' codec can ' t de Code byte 0xe2 in position 11:ordinal No in range (128)

Basically, whenever you mix Unicode and byte strings together, you get an error.
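The safe pattern in all three cases is the same: decode the byte string yourself before mixing it with Unicode.

```python
utf8_bytes = u"this string\u2019s utf-8".encode("utf-8")
# Decode explicitly before concatenating, joining, or formatting:
merged = u"" + utf8_bytes.decode("utf-8")
joined = u",".join([u"this string\u2019s unicode",
                    utf8_bytes.decode("utf-8")])
assert u"\u2019" in merged
assert joined.count(u"\u2019") == 2
```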

In this example, you read a UTF-8 file and then append the text of a Unicode object to its contents. Joining them reports a UnicodeDecodeError.

>>> buffer = []
>>> fh = open("utf-8-sample.txt")
>>> buffer.append(fh.read())
>>> fh.close()
>>> buffer.append(u"this string\u2019s unicode")
>>> print repr(buffer)
['This file\xe2\x80\x99s got utf-8 in it\n', u'this string\u2019s unicode']
>>> print "\n".join(buffer)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 9: ordinal not in range(128)

You can solve this problem by using the codecs module to load the file as Unicode.

>>> import codecs
>>> buffer = []
>>> fh = codecs.open("utf-8-sample.txt", "r", "utf-8")
>>> buffer.append(fh.read())
>>> fh.close()
>>> print repr(buffer)
[u'This file\u2019s got utf-8 in it\n', u'this string\u2019s unicode']
>>> buffer.append(u"this string\u2019s unicode")
>>> print "\n".join(buffer)
This file’s got utf-8 in it

this string’s unicode

As you can see, the stream created by codecs.open automatically converts byte strings to Unicode as the data is read.


Best Practices

1. Decode first, encode last

2. Use UTF-8 encoding by default

3. Use codecs and Unicode objects to simplify processing

Decode first means: whenever you have byte stream input, decode it to Unicode as soon as possible. That prevents the problems with len() and with slicing UTF-8 byte streams.

Encode last means: only encode your text to a byte stream when you intend to output it somewhere (a file, a database, a socket, and so on). Encode Unicode objects only after processing is complete. Encode last also means: don't let Python encode Unicode objects for you. Python will use ASCII, and your program will crash.
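A sketch of the whole pipeline under these two rules (the upper-casing step stands in for whatever real processing your program does):

```python
raw_in = b"caf\xc3\xa9"              # bytes arriving from the outside world
text = raw_in.decode("utf-8")        # 1. decode first
text = text.upper()                  # 2. process unicode internally
raw_out = text.encode("utf-8")       # 3. encode last, at the output boundary
assert raw_out == b"CAF\xc3\x89"     # uppercase E-acute is \xc3\x89 in UTF-8
```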

Using UTF-8 by default means: since UTF-8 can handle any Unicode character, you're better off using it instead of windows-1252 or ASCII.

The codecs module saves us a lot of trouble when dealing with streams such as files or sockets. Without it, you would have to read the file contents as a byte stream and then decode that byte stream into a Unicode object yourself.

The codecs module allows you to quickly convert bytes into Unicode objects, eliminating a lot of hassle.


explain UTF-8

This final section is a quick introduction to UTF-8; if you're a super geek you can skip it.

In UTF-8, any byte with a value between 128 and 255 is special. These bytes tell the system that they are part of a multi-byte sequence.

Our UTF-8 encoded string looks like this:

[97] [98] [99] [226] [128] [147] = "abc–"

The last 3 bytes are a UTF-8 multi-byte sequence. If you convert the first of those three bytes to binary, you see the following:

11100010

The leading 1110 bits tell the system that this byte starts a 3-byte sequence: 226, 128, 147.

So the complete byte sequence is:

11100010 10000000 10010011

Then you apply the following mask to the three-byte sequence:

1110xxxx 10xxxxxx 10xxxxxx
XXXX0010 XX000000 XX010011   Remove the X's
0010 000000 010011           Collapse the numbers
00100000 00010011            Get the Unicode number 0x2013 (8211), the "–"
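The bit arithmetic above can be checked directly: the masks 0x0F and 0x3F strip the 1110 and 10 header bits, and shifting reassembles the payload bits into the code point.

```python
b1, b2, b3 = 0xE2, 0x80, 0x93            # 226, 128, 147
assert b1 >> 4 == 0b1110                 # marks the start of a 3-byte sequence
assert b2 >> 6 == b3 >> 6 == 0b10        # continuation-byte markers
code_point = ((b1 & 0x0F) << 12) | ((b2 & 0x3F) << 6) | (b3 & 0x3F)
assert code_point == 0x2013 == 8211      # the en dash
```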

This is just an introduction to the basics of UTF-8; if you want more details, see the UTF-8 wiki page.
