Deep analysis of Python character encoding


Once you start down the path of programming, if you never sort out character encoding, it will haunt your entire career like a ghost, and all sorts of mysterious bugs will follow. Only by digging in with a programmer's stubbornness can you be rid of encoding problems for good. The first time I ran into one was while writing a Java web project: a string travels from the browser, through the application code, and down into the database, and anywhere along the way you can step on an encoding landmine. The second time was while learning Python, when encoding errors appeared as I was crawling web page data; my mood at the time is best summed up as "I was completely baffled."

To understand character encoding, we have to start from the origins of the computer. All data in a computer, whether text, pictures, video, or audio, is ultimately stored as a sequence of binary digits like 01010101. We are both fortunate and unfortunate: fortunate that the times have given us the computer, unfortunate that it was not invented here, so its standards were designed around American conventions, and that is the root of the encoding problems that plague us. So how did the earliest computers represent characters?

ASCII
English has a very limited character set: 26 letters (upper and lower case), 10 digits, plus punctuation and control characters. All the keys on a keyboard add up to little more than a hundred characters, so a single byte of storage is more than enough to represent an English character: one byte is 8 bits, and 8 bits can represent 256 values. So the Americans devised a character encoding standard called ASCII (American Standard Code for Information Interchange), in which each character corresponds to a number; for example, uppercase 'A' corresponds to the binary value 01000001. The original ASCII defined only 128 codes, comprising 95 printable characters and 33 control characters. It was later extended to an 8-bit variant called EASCII to cover some Western European characters. The full ASCII table can be viewed at www.ascii-code.com.
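The mapping between ASCII characters and their numeric codes can be inspected directly with Python's built-in ord() and chr() (shown here in Python 3 syntax; the functions behave the same in Python 2):

```python
# 'A' corresponds to decimal 65, i.e. binary 01000001, and fits in one byte.
print(ord('A'))                   # 65
print(format(ord('A'), '08b'))    # 01000001
print(chr(65))                    # A
print(len('A'.encode('ascii')))   # 1 byte
```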

GBK
As the times progressed, computers spread into millions of households, and Bill Gates's dream of a computer on every desktop was proudly realized. But when computers entered China they faced the problem of character encoding. Chinese is profound, and there are several thousand commonly used characters alone, far beyond the range of characters that ASCII can express. So China devised its own encoding, called GB2312, which contains 6,763 Chinese characters and is also compatible with ASCII. But GB2312 still could not meet every need: it cannot handle some rare and traditional characters. So a new encoding called GBK was later built on top of GB2312; GBK includes more than 21,000 Chinese characters and also covers Tibetan, Mongolian, Uyghur, and other major minority scripts. GBK is likewise compatible with ASCII encoding, so an English character is represented in one byte and a Chinese character in two bytes.
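This one-byte/two-byte split is easy to verify, since the gbk codec ships with Python's standard library (Python 3 syntax shown):

```python
# GBK is ASCII-compatible: ASCII characters take one byte, Chinese characters two.
print('A'.encode('gbk'))     # b'A'          -> 1 byte
print('好'.encode('gbk'))    # b'\xba\xc3'   -> 2 bytes
```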

Unicode
For handling our own Chinese text we can set up an encoding standard to suit our own needs, but computers are not used only by Americans and Chinese; Europe and other Asian countries such as Japan and Korea have their own scripts, and all the world's writing systems add up to an estimated hundreds of thousands of characters, far beyond what ASCII or even GBK can express. And why should other countries adopt the GBK standard anyway? How can such a huge character repertoire be represented? The answer was Unicode, proposed by an international standards body; its formal name is the Universal Multiple-Octet Coded Character Set, abbreviated UCS.

Unicode has two forms: UCS-2 and UCS-4. UCS-2 is a two-byte encoding, 16 bits in total, which can theoretically represent at most 65,536 characters. But 65,536 values are still far from enough to cover every character in the world, since there are nearly 100,000 CJK characters alone, so the Unicode 4.0 specification defined a set of supplementary character encodings. UCS-4 uses 4 bytes (actually only 31 bits, since the highest bit must be 0) and can in theory cover all the symbols used in all languages. Once a character's Unicode code point is assigned, it is never changed again.

But Unicode has certain limitations. When a Unicode character is transmitted over a network or finally stored on disk, it is not always necessary to spend two bytes on each character; the ASCII character "a", for example, can be represented in one byte, and using two would obviously be wasteful. The second problem is this: when a Unicode string is saved to a computer as a string of binary digits, how does the computer know whether two bytes represent one 2-byte character or two 1-byte characters? If you don't tell it, the computer can only guess. Unicode only specifies how characters are numbered; it does not specify how those numbers are transferred and stored.
For example, the Unicode code point of the character "汉" is 6C49. I can transmit and store it as the four ASCII digits "6C49", or encode it in UTF-8 as the three consecutive bytes E6 B1 89. The key is that both sides of the communication agree on the convention. So Unicode has several concrete implementations, such as UTF-8, UTF-16, and so on.
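The difference between a code point and its serialized forms can be seen directly (Python 3 syntax; utf-16-be is used so no byte-order mark is emitted):

```python
# The code point of '汉' is U+6C49; different transfer formats serialize it differently.
print(hex(ord('汉')))             # 0x6c49 - the abstract Unicode number
print('汉'.encode('utf-8'))       # b'\xe6\xb1\x89' - 3 bytes in UTF-8
print('汉'.encode('utf-16-be'))   # b'\x6c\x49'     - 2 bytes in UTF-16 (big-endian)
```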

UTF-8
UTF-8 (Unicode Transformation Format), one of the ways of implementing Unicode, is widely used on the Internet. It is a variable-length character encoding that uses 1 to 4 bytes to represent a character. English characters that can be represented in ASCII need only one byte in UTF-8, and that byte is identical to the ASCII encoding. For a multi-byte (n-byte) character, the first byte begins with n 1-bits followed by a 0-bit, and each of the following bytes begins with the two bits 10; the remaining bits are filled with the character's Unicode code point.


Take the Chinese character "好" as an example. Its Unicode code point is 597D, which falls in the range 0000 0800 - 0000 FFFF, so UTF-8 needs 3 bytes to store it. 597D in binary is 0101 1001 0111 1101; filling those bits into the pattern 1110xxxx 10xxxxxx 10xxxxxx gives 11100101 10100101 10111101, which in hexadecimal is E5 A5 BD. So the UTF-8 encoding of the Unicode character "好" (597D) is E5 A5 BD.

Character            好
Unicode (binary)     0101 100101 111101
UTF-8 pattern        1110xxxx 10xxxxxx 10xxxxxx
UTF-8 (binary)       11100101 10100101 10111101
UTF-8 (hex)          E5 A5 BD
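The bit-filling in the table above can be reproduced by hand and checked against the built-in codec; a minimal sketch in Python 3:

```python
# Manual UTF-8 encoding of '好' (U+597D), following the 3-byte pattern
# 1110xxxx 10xxxxxx 10xxxxxx.
cp = ord('好')                   # 0x597D
b1 = 0xE0 | (cp >> 12)           # 1110xxxx: top 4 bits of the code point
b2 = 0x80 | ((cp >> 6) & 0x3F)   # 10xxxxxx: middle 6 bits
b3 = 0x80 | (cp & 0x3F)          # 10xxxxxx: low 6 bits
manual = bytes([b1, b2, b3])
print(manual)                            # b'\xe5\xa5\xbd'
print(manual == '好'.encode('utf-8'))    # True
```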

Python character encoding
Now that the theory is finally out of the way, back to encoding in Python. Python was born long before Unicode, and Python 2's default encoding is ASCII:

>>> import sys
>>> sys.getdefaultencoding()
'ascii'
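For comparison, on a Python 3 interpreter the default encoding is UTF-8 rather than ASCII, which is what removes most of the pitfalls described below:

```python
# On Python 3 the interpreter-wide default encoding is UTF-8.
import sys

print(sys.getdefaultencoding())   # utf-8
```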

So if you do not explicitly specify an encoding in a Python source file that contains non-ASCII characters, a syntax error occurs:

# test.py
print "你好"

The above is the test.py script; running `python test.py` will produce the following error:

File "test.py", line 1
SyntaxError: Non-ASCII character '\xe4' in file test.py on line 1, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
In order to support non-ASCII characters in your source code, you must specify the encoding format in the first or second line of the source file:

# coding=utf-8
or:

#!/usr/bin/python
# -*- coding: utf-8 -*-


Python 2 has two string-related data types, str and unicode. Both are subclasses of basestring, which shows that str and unicode are two different kinds of string objects.

   basestring
     /    \
    /      \
  str    unicode

For the same Chinese character "好", as a str it corresponds to the UTF-8 encoded bytes '\xe5\xa5\xbd', while as a unicode its representation is u'\u597d', to which u"好" is equivalent. It should be added that whether a str is encoded in UTF-8, GBK, or some other format depends on the environment, in particular the terminal's encoding. For example, in the Windows cmd command line (which uses GBK):

# Windows terminal
>>> a = '好'
>>> type(a)
<type 'str'>
>>> a
'\xba\xc3'

In a Linux terminal (which typically uses UTF-8):

# Linux terminal
>>> a = '好'
>>> type(a)
<type 'str'>
>>> a
'\xe5\xa5\xbd'

>>> b = u'好'
>>> type(b)
<type 'unicode'>
>>> b
u'\u597d'

Whether in Python 3.x, Java, or other programming languages, Unicode is the language's default internal representation for text. When data is finally saved to a medium, different media may use different encodings: some people like UTF-8, some like GBK. That doesn't matter, as long as the platform agrees on a unified encoding standard; how it is implemented underneath is not a concern.

Converting between str and unicode
So how do you convert between str and unicode in Python? The two string types are converted with the two methods decode and encode:


# convert from str to unicode
s.decode(encoding)  =====>  <type 'str'> to <type 'unicode'>
# convert from unicode to str
u.encode(encoding)  =====>  <type 'unicode'> to <type 'str'>

>>> c = b.encode('utf-8')
>>> type(c)
<type 'str'>
>>> c
'\xe5\xa5\xbd'

>>> d = c.decode('utf-8')
>>> type(d)
<type 'unicode'>
>>> d
u'\u597d'


Here the str '\xe5\xa5\xbd' is obtained by encoding the unicode string b with the encode method, using UTF-8; conversely, the str c is decoded back into the unicode string d via the decode method.

str(s) and unicode(s)
str(s) and unicode(s) are two factory methods that return a str string object and a unicode string object respectively, and str(s) is effectively shorthand for s.encode('ascii'), where ascii is the default encoding. An experiment:

>>> s3 = u"你好"
>>> s3
u'\u4f60\u597d'
>>> str(s3)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)

The above s3 is a unicode string, and str(s3) is equivalent to executing s3.encode('ascii'). Because the two characters "你好" cannot be expressed in ASCII, an error is raised; specifying the correct encoding, s3.encode('gbk') or s3.encode('utf-8'), avoids the problem. unicode() raises a similar error:

>>> s4 = "你好"
>>> unicode(s4)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 0: ordinal not in range(128)
>>>

unicode(s4) is equivalent to s4.decode('ascii'), so the correct conversion is to specify the encoding explicitly: s4.decode('gbk') or s4.decode('utf-8').
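The same principle carries over to Python 3, where the rule can be stated more simply: bytes must be decoded with the codec they were actually produced with. A minimal sketch (the byte values match the GBK example above):

```python
# b'\xc4\xe3\xba\xc3' is '你好' encoded as GBK; decoding it with the
# matching codec recovers the original text.
raw = '你好'.encode('gbk')
print(raw)                 # b'\xc4\xe3\xba\xc3'
print(raw.decode('gbk'))   # 你好
```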

Garbled
All causes of garbled characters boil down to one thing: the encoding formats used during encoding and decoding are inconsistent. For example:

# encoding: utf-8

>>> a = '好'
>>> a
'\xe5\xa5\xbd'
>>> b = a.decode('utf-8')
>>> b
u'\u597d'
>>> c = b.encode('gbk')
>>> c
'\xba\xc3'
>>> print c   # on a UTF-8 terminal this prints garbage instead of 好

The UTF-8 encoding of the character '好' occupies 3 bytes; if those bytes are interpreted with GBK, which reads every 2 bytes as one Chinese character, the bytes are grouped wrongly and garbled text appears. So the best way to prevent mojibake is to always stick to one and the same encoding format when encoding and decoding characters.
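The mismatch can be demonstrated deterministically in Python 3, here using latin-1 as a stand-in single-byte codec that accepts any byte value: decoding UTF-8 bytes with the wrong codec silently produces garbage, and as long as no bytes were lost, reversing the two steps recovers the original text:

```python
# Classic mojibake: encode as UTF-8, decode as latin-1.
garbled = '好'.encode('utf-8').decode('latin-1')
print(garbled)   # å¥½  - three nonsense characters instead of one

# The damage is reversible because no bytes were dropped:
print(garbled.encode('latin-1').decode('utf-8'))   # 好
```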


Other Tips
For a str that contains literal Unicode escape sequences, such as:

s = 'id\u003d215903184\u0026index\u003d0\u0026st\u003d52\u0026sid'

converting it to a true unicode string requires:

s.decode('unicode-escape')

Test:

>>> s = 'id\u003d215903184\u0026index\u003d0\u0026st\u003d52\u0026sid\u003d95000\u0026i'
>>> print(type(s))
<type 'str'>
>>> s = s.decode('unicode-escape')
>>> s
u'id=215903184&index=0&st=52&sid=95000&i'
>>> print(type(s))
<type 'unicode'>
>>>

The code and concepts above are based on Python 2.x.
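For reference, the unicode-escape trick from the last section can be sketched on Python 3, where str has no .decode() method and the escaped text must be handled as bytes first:

```python
# Python 3: treat the escaped text as bytes, then decode the \uXXXX escapes.
raw = b'id\\u003d215903184\\u0026st\\u003d52'
print(raw.decode('unicode-escape'))   # id=215903184&st=52
```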
