A detailed look at Unicode encoding in Python 2.x

I'm sure there are plenty of explanations of Unicode and Python out there already, but I'm writing this one down to make things easier for myself to understand.


Byte stream vs Unicode Object

Let's start by defining a string in Python. When you use the string type, you actually store a byte string.

 a    b    c
[97] [98] [99] = "abc"

Here "abc" is a byte string, and 97, 98, 99 are the ASCII codes of its bytes. One drawback of Python 2.x is that it treats all strings as ASCII by default. Unfortunately, ASCII is only the lowest common denominator among Latin character sets.

ASCII maps only the first 128 values (0–127) to characters. Character maps such as windows-1252 and UTF-8 share those same first 128 characters, so it is safe to mix encodings as long as every byte in your string stays below 128. Relying on that assumption is still a dangerous habit, though, as we will see below.
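For illustration (a small sketch, not from the original article): a pure-ASCII byte string decodes to the same text under any of these encodings, which is why mixing works as long as every byte stays below 128.

>>> "abc".decode("ascii") == "abc".decode("utf-8") == "abc".decode("windows-1252")
True
>>> [ord(c) for c in "abc"]   # every byte value is below 128
[97, 98, 99]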

The problem occurs when a string contains a byte value above 127. Let's look at a string encoded with windows-1252. Windows-1252 is an 8-bit character map, so it defines 256 characters in total: the first 128 are the same as ASCII, and the remaining 128 are other characters defined by windows-1252.

A windows-1252 encoded string looks like this:

 a    b    c    –
[97] [98] [99] [150] = "abc–"

This is still a byte string, but notice that the value of the last byte is greater than 127. If Python tries to decode this byte stream with the default ASCII codec, it raises an error. Let's see what happens when Python decodes this string:

>>> x = "abc" + CHR >>> print repr (x) ' abc\x96 ' >>> u "Hello" + xtraceback (most recent call L AST): File "
 
  
   
  ", line 1, in? Unicodedecodeerror: ' ASCII ' codec can ' t decode byte 0x96 in position 3:ordinal not in range (+)
 
  

Let's use UTF-8 to encode another string:

A UTF-8 encoded string looks like this:

 a     b     c                       –
[97]  [98]  [99]  [226]  [128]  [147]  = "abc–"
[0x61][0x62][0x63][0xe2] [0x80] [0x93] = "abc–"

If you look up the en dash in a Unicode code chart, you will find that its code point is 8211 (0x2013). That value is greater than the ASCII maximum of 127, and greater than what a single byte can store. Although 8211 (0x2013) fits in two bytes of raw value, UTF-8 has to use some tricks to tell the system that this one character occupies three bytes. Let's see what happens when Python applies its default ASCII codec to a UTF-8 encoded string containing byte values above 127.

>>> x = "abc\xe2\x80\x93" >>> print repr (x) ' abc\xe2\x80\x93 ' >>> u "Hello" + xtraceback (most Recent [last]: File "
 
  
   
  ", line 1, in? Unicodedecodeerror: ' ASCII ' codec can ' t decode byte 0xe2 in position 3:ordinal not in range (+)
 
  

As you can see, Python falls back to ASCII by default. When it reaches the fourth byte, whose value 226 is greater than 127, it throws an error. This is the problem with mixing encodings.
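If you want to verify where those three bytes come from, you can ask Python to do the UTF-8 encoding itself. A quick sketch, not part of the original example (the en dash is code point U+2013):

>>> u"\u2013".encode("utf-8")
'\xe2\x80\x93'
>>> len(u"\u2013".encode("utf-8"))
3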


Decode Byte stream

The term "decode" can be confusing when you first start working with Unicode in Python: you decode a byte stream into a Unicode object, and you encode a Unicode object into a byte stream.

Python needs to be told how to decode a byte stream into a Unicode object. When you receive a byte stream, you call its decode() method to create a Unicode object from it.

You'd better decode the byte stream to Unicode as soon as possible.

>>> x = "abc\xe2\x80\x93" >>> x = X.decode ("Utf-8") >>> print type (x)
 
  
   
  >> > y = "abc" + CHR >>> y = y.decode ("windows-1252") >>> print type (y) >>> print x + yabc–ab C –
 
  


Encode Unicode as a byte stream

A Unicode object is an encoding-agnostic representation of text. You cannot simply output a Unicode object; it has to be turned into a byte string before output. Python will happily do this for you, but because it encodes to ASCII by default, that default behaviour causes many headaches.

>>> u = u "abc\u2013" >>> print utraceback (most recent call last): File "
 
  
   
  ", line 1, in 
  
   
    
   unicodeencodeerror: ' ASCII ' codec can ' t encode character U ' \u2013 ' in position 3:ordinal not in range (+) >& gt;> print U.encode ("Utf-8") abc–
  
   
 
  


Using the Codecs module

The codecs module is a great help when working with byte streams. You can open a file with a specified encoding, and the content you read from it is automatically converted into Unicode objects.

Try this:

>>> import codecs
>>> fh = codecs.open("/tmp/utf-8.txt", "w", "utf-8")
>>> fh.write(u"\u2013")
>>> fh.close()

All it does is take a Unicode object and write it to the file in UTF-8. You can use it for reading as well.

Try this:

When reading data from a file, codecs.open creates a file object that automatically converts a UTF-8 encoded file into Unicode objects as you read from it.
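A minimal sketch, assuming the /tmp/utf-8.txt file we wrote above still exists:

>>> import codecs
>>> fh = codecs.open("/tmp/utf-8.txt", "r", "utf-8")
>>> text = fh.read()
>>> fh.close()
>>> type(text)
<type 'unicode'>
>>> print repr(text)
u'\u2013'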

Let's take the example a step further, this time wrapping a urllib stream.

>>> import urllib
>>> stream = urllib.urlopen("http://www.google.com")
>>> Reader = codecs.getreader("utf-8")
>>> fh = Reader(stream)
>>> type(fh.read(1))
<type 'unicode'>
>>> Reader
<class encodings.utf_8.StreamReader at 0x...>

Single-line version:

>>> fh = codecs.getreader("utf-8")(urllib.urlopen("http://www.google.com"))
>>> type(fh.read(1))
<type 'unicode'>

You must be careful with the codecs module: what you pass to it must be a Unicode object, otherwise it will try to decode the byte stream as ASCII automatically.

>>> x = "abc\xe2\x80\x93" # our "abc-" utf-8 string>>> fh = Codecs.open ("/tmp/foo.txt", "W", "Utf-8") ;>> fh.write (x) Traceback (most recent call last): File "
 
  
   
  ", line 1, in 
  
   
    
   file "/usr/lib/ python2.5/codecs.py ", line 638, in write return Self.writer.write (data) File"/usr/lib/python2.5/codecs.py ", line 303, in Write data, consumed = Self.encode (object, self.errors) Unicodedecodeerror: ' ASCII ' codec can ' t decode byte 0xe2 in Positio n 3:ordinal not in range (+)
  
   
 
  

Ugh. Once again, Python is decoding everything as ASCII behind your back.

The issue of slicing UTF-8 byte streams

Because a UTF-8 encoded string is just a list of bytes, len() and slicing do not work the way you expect. Let's start with the string we used earlier.

 a    b    c                –
[97] [98] [99] [226] [128] [147] = "abc–"

Next do the following:

>>> my_utf8 = "abc–"
>>> print len(my_utf8)
6

What? It looks like 4 characters, but len() says 6. That's because len() counts bytes, not characters.

>>> print repr(my_utf8)
'abc\xe2\x80\x93'

Now let's slice this string.

>>> my_utf8[-1]  # get the last char
'\x93'

Ugh. The slice gives us the last byte, not the last character.

To slice UTF-8 text correctly, decode the byte stream into a Unicode object first. Then it can be manipulated and counted safely.

>>> my_unicode = my_utf8.decode("utf-8")
>>> print repr(my_unicode)
u'abc\u2013'
>>> print len(my_unicode)
4
>>> print my_unicode[-1]
–


When Python automatically encodes/decodes

In some situations Python automatically encodes/decodes using ASCII, and that is where errors get thrown.

The first case is when it tries to merge Unicode and byte strings together.

>>> u "" + U "\u2019". Encode ("Utf-8") Traceback (most recent call last): File "
 
  
   
  ", line 1, in 
  
   
     C6/>unicodedecodeerror: ' ASCII ' codec can ' t decode byte 0xe2 in position 0:  ordinal No in range (+)
  
   
 
   
  

The same thing happens when you join a list: if the list contains both byte strings and Unicode objects, Python automatically tries to decode the byte strings to Unicode.

>>> ",". Join ([u "this string\u2019s Unicode", u "this string\u2019s utf-8". Encode ("Utf-8")]) Traceback (most  Recent call last): File "
 
  
   
  ", line 1, in 
  
   
    
   unicodedecodeerror: ' ASCII ' codec can ' t decode byte 0xe2 In position 11:ordinal No in range (+)
  
   
 
  

Or when you interpolate both into a format string:

>>> "%s\n%s"% (U "This string\u2019s Unicode", u "this string\u2019s utf-8". Encode ("Utf-8"), Traceback (the most Recent call last): File "
 
  
   
  ", line 1, in 
  
   
    
   unicodedecodeerror: ' ASCII ' codec can ' t decode byte 0xe2 in position 11:ordinal
  
    No in range
 
  

Basically, whenever you mix Unicode objects and byte strings, you are asking for an error.
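The fix in each of these cases is the same: decode the byte string yourself before mixing it with Unicode. A minimal sketch, based on the join example above:

>>> utf8_bytes = u"this string\u2019s utf-8".encode("utf-8")
>>> ",".join([u"this string\u2019s unicode", utf8_bytes.decode("utf-8")])
u'this string\u2019s unicode,this string\u2019s utf-8'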

In this example, you read the contents of a UTF-8 file into a list and then append some Unicode text to it. Joining the two raises a UnicodeDecodeError.

>>> buffer = []
>>> fh = open("utf-8-sample.txt")
>>> buffer.append(fh.read())
>>> fh.close()
>>> buffer.append(u"This string\u2019s unicode")
>>> print repr(buffer)
['This file\xe2\x80\x99s got utf-8 in it\n', u'This string\u2019s unicode']
>>> print "\n".join(buffer)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 9: ordinal not in range(128)

You can solve this problem by using the codecs module to load the file as Unicode.

>>> import codecs
>>> buffer = []
>>> fh = codecs.open("utf-8-sample.txt", "r", "utf-8")
>>> buffer.append(fh.read())
>>> fh.close()
>>> buffer.append(u"This string\u2019s unicode")
>>> print repr(buffer)
[u'This file\u2019s got utf-8 in it\n', u'This string\u2019s unicode']
>>> print "\n".join(buffer)
This file’s got utf-8 in it

This string’s unicode

As you can see, the stream created by codecs.open automatically converts byte strings to Unicode as the data is read.


Best practices

1. Decode first, encode last

2. Use UTF-8 encoding by default

3. Use codecs and Unicode objects to simplify processing

Decoding first means that whenever a byte stream comes in, it should be decoded to Unicode as soon as possible. This avoids the len() and slicing problems with UTF-8 byte strings.

Encoding last means that you encode text back into a byte stream only when you are about to output it somewhere: a file, a database, a socket, and so on. Unicode objects get encoded only after all processing is complete. Encoding last also means: don't let Python encode Unicode objects for you, because it will use ASCII and your program will crash.
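Put together, the decode-first / encode-last pattern looks roughly like this sketch (the upper() call just stands in for whatever processing you actually do):

>>> raw = "abc\xe2\x80\x93"        # bytes arriving from the outside world
>>> text = raw.decode("utf-8")     # decode first
>>> text = text.upper()            # work on Unicode objects
>>> out = text.encode("utf-8")     # encode last, right before output
>>> out
'ABC\xe2\x80\x93'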

Using UTF-8 by default means: since UTF-8 can represent any Unicode character, use it in preference to windows-1252 or ASCII.

The codecs module saves us a lot of legwork when dealing with streams such as files or sockets. Without it, you would have to read the file contents as a byte stream and then decode that byte stream into a Unicode object yourself.

The codecs module lets you convert a byte stream into Unicode objects on the fly, sparing you that hassle.


Explaining UTF-8

This final part gives you a basic idea of how UTF-8 works at the byte level; if you are already a super geek, feel free to skip it.

In UTF-8, any byte with a value above 127 is special: it tells the system that the byte is part of a multi-byte sequence.

Our UTF-8 encoded string looks like this:

 a    b    c                –
[97] [98] [99] [226] [128] [147] = "abc–"

The last three bytes form a UTF-8 multi-byte sequence. If you convert the first of these three bytes to binary, you see the following:

11100010

The three leading 1 bits (followed by a 0) tell the system that a 3-byte sequence starts here: 226, 128, 147.

Here is the complete byte sequence:

11100010 10000000 10010011

Then you apply the following mask to the three-byte sequence:

1110xxxx 10xxxxxx 10xxxxxx
XXXX0010 XX000000 XX010011   Remove the X's
    0010   000000   010011   Collapse the numbers
       00100000 00010011     Get the Unicode number 0x2013, 8211, the "–"
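If you want to check that arithmetic in Python, here is a small sketch that strips the marker bits by hand and recovers the code point:

>>> b = [0xe2, 0x80, 0x93]
>>> ((b[0] & 0x0f) << 12) | ((b[1] & 0x3f) << 6) | (b[2] & 0x3f)
8211
>>> hex(8211)
'0x2013'
>>> unichr(8211)
u'\u2013'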

These are just the basics of UTF-8; if you want more detail, head over to the UTF-8 Wikipedia page.
