Python character encoding

Source: Internet
Author: User
Tags: coding standards

http://www.cnblogs.com/yuanchenqi/articles/5956943.html

Talking about Python encoding is a sad story: all told, I spent a good two months repeatedly wrestling with it. Fortunately, I finally got it straight, and in the spirit of sharing I want to pass it on to everyone. If encoding still gives you headaches, then come with me and let's uncover the truth about encoding in Python!

1. What is an encoding?

The basic concept is simple. First, we start with a piece of information, a message, which exists in a form that humans can read and understand directly. I intend to call this representation "plaintext". For English speakers, English words printed on paper or displayed on a screen count as plaintext.

Second, we need to be able to turn the plaintext message into some other representation, and to turn that representation back again. The conversion from plaintext to the encoded form is called "encoding", and converting from the encoded form back to plaintext is called "decoding".
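
To make the two directions concrete, here is a minimal Python 3 sketch (my addition, not from the original article):

text = u'苑昊'                        # plaintext (unicode text)
data = text.encode('utf-8')           # encoding: plaintext -> bytes
print(data)                           # b'\xe8\x8b\x91\xe6\x98\x8a'
print(data.decode('utf-8') == text)   # decoding: bytes -> plaintext; True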

Encoding is a big problem; if it is not solved thoroughly, it will lurk like a snake in the jungle and bite you from time to time. So what exactly is an encoding?

ASCII

Remember the bottom line: all data in a computer, whether text, pictures, video, or audio, is ultimately stored as binary, something like 01010101, because the computer only understands binary digits. So the goal is clear: how do we map each symbol to a unique group of binary digits? The Americans decided to use high and low voltage levels to stand for 1 and 0. Each level is one bit, and by convention 8 bits form one byte, so a group of 8 bits can represent 256 different states. English has only 26 letters; even counting digits and special characters, 128 states are plenty, and each state is assigned to exactly one character, for example A ---> 01000001. This is the ASCII encoding.

Extended ANSI encoding

As just said, a byte has eight bits, but at first the highest bit was unused and defaulted to 0. Later, so that computers could also represent Latin-script characters, that last bit was pressed into service as well: the characters from 128 to 255 were assigned to Latin letters. At that point, the single byte was completely full!

GB2312

Then computers crossed the sea to China, and a problem arose: the computer did not know Chinese and naturally could not display it, yet every state of the byte was already occupied. But we were self-reliant and simply rewrote the table ourselves, boldly deleting all the extended Latin characters above 127. A character no greater than 127 keeps its original meaning, but two bytes that are both greater than 127, strung together, represent one Chinese character: the first byte (the high byte) runs from 0xA1 to 0xF7 and the second byte (the low byte) from 0xA1 to 0xFE. In this way about 7,000 Simplified Chinese characters could be assembled. This scheme is called "GB2312"; GB2312 is a Chinese extension of ASCII.

GBK and GB18030

But there are far too many Chinese characters, and GB2312 was not enough. So the rule was relaxed: as long as the first byte is greater than 127, it marks the start of a Chinese character, regardless of whether the following byte belongs to the extended set. The result of this expanded scheme is called the GBK standard; GBK includes everything in GB2312 and adds nearly 20,000 new characters (including Traditional Chinese characters) and symbols.

Unicode

Many other countries developed their own encoding standards as well, and none of them were mutually compatible, which caused no end of problems. So the international standards organization proposed a unified code for all of them: Unicode. Unicode represents each character with two bytes, giving 65,536 possible combinations, enough to cover every symbol in the world (even oracle-bone script).

UTF-8

If Unicode is so great, why do we also have UTF-8? Think about it: for the English-speaking world, one byte per character is completely sufficient. To store "A", 01000001 used to be enough; now, eating from Unicode's communal pot, it takes two bytes: 00000000 01000001. The waste is too serious! Out of this came a stroke of genius: UTF-8. UTF-8 (8-bit Unicode Transformation Format) is a variable-length character encoding for Unicode that represents a symbol with 1 to 4 bytes, the length varying with the symbol. When a character is in the ASCII range, it is expressed in a single byte, so UTF-8 is backward compatible with ASCII. The obvious benefit: although the data in our memory is Unicode, when data is saved to disk or sent over the network, UTF-8 is far more economical than raw Unicode! This is also why UTF-8 is the recommended encoding.

The relationship between Unicode and UTF-8 in one sentence: Unicode is the in-memory representation scheme (a specification), while UTF-8 is a scheme for how to save and transmit Unicode (an implementation). That is the difference between UTF and Unicode.
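
A small Python 3 sketch (my addition, assuming nothing beyond the standard library) makes the variable-length behaviour visible:

# Each character maps to one code point, but to 1-4 bytes in UTF-8.
for ch in ['A', 'é', '苑']:
    print(ch, hex(ord(ch)), len(ch.encode('utf-8')))
# A   0x41    1 byte  (ASCII range: one byte, hence the ASCII compatibility)
# é   0xe9    2 bytes
# 苑  0x82d1  3 bytes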
Appendix: how UTF-8 saves disk space and bandwidth
s = "I'm 苑昊"

Looked up in the Unicode character set, these characters correspond to the following code points:

I        0049
'        0027
m        006d
(space)  0020
苑       82d1
昊       660a

Each character corresponds to a hexadecimal number.
Computers only understand binary, so, stored strictly as Unicode (UCS-2), the string would be laid out like this:

I        00000000 01001001
'        00000000 00100111
m        00000000 01101101
(space)  00000000 00100000
苑       10000010 11010001
昊       01100110 00001010

This string occupies 12 bytes in total, but if you compare the English characters with their plain ASCII binary, you will notice that their first 9 bits are all 0! That wastes your hard drive and your bandwidth. What to do? UTF-8:

I        01001001
'        00100111
m        01101101
(space)  00100000
苑       11101000 10001011 10010001
昊       11100110 10011000 10001010

UTF-8 uses only 10 bytes, two fewer than Unicode. And since our programs usually contain far more English than Chinese, the space savings scale up considerably!

Remember: everything is done to save your hard drive and your bandwidth.
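
You can verify these byte counts yourself; a quick Python 3 check (my addition; utf-16-be stands in for the two-byte UCS-2 storage shown above):

s = "I'm 苑昊"
print(len(s))                       # 6 characters
print(len(s.encode('utf-16-be')))   # 12 bytes: two per character, as in UCS-2
print(len(s.encode('utf-8')))       # 10 bytes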

2. String encoding in Py2

In Py2 there are two string types: str and unicode. Note that these are just two names that Python defines; the key question is: when the program runs, what are the two kinds of data that actually occupy memory addresses?

Let's take a look:

#coding:utf8
s1 = '苑'
print type(s1)  # <type 'str'>
print repr(s1)  # '\xe8\x8b\x91'

s2 = u'苑'
print type(s2)  # <type 'unicode'>
print repr(s2)  # u'\u82d1'

The built-in function repr lets us see what is actually stored. It turns out that str holds byte data while unicode holds unicode data. So what is the connection between these two kinds of data, and how do we convert between them? That is where encode and decode come in.

s1 = u'苑'
print repr(s1)       # u'\u82d1'

b = s1.encode('utf8')
print b
print type(b)        # <type 'str'>
print repr(b)        # '\xe8\x8b\x91'

s2 = '苑昊'
u = s2.decode('utf8')
print u              # 苑昊
print type(u)        # <type 'unicode'>
print repr(u)        # u'\u82d1\u660a'

# Note:
u2 = s2.decode('gbk')
print u2             # 鑻戞槉
print len('苑昊')    # 6

Whether it is UTF-8 or GBK, each is just an encoding rule: a rule for turning unicode data into byte data. Consequently, bytes encoded with UTF-8 must be decoded with the UTF-8 rule; otherwise you get mojibake or an error.

Features of Py2 encoding:

#coding:utf8
print '苑昊'        # 苑昊
print repr('苑昊')  # '\xe8\x8b\x91\xe6\x98\x8a'

print (u"hello" + "yuan")  # helloyuan: pure ASCII, the implicit conversion succeeds
print (u'苑昊' + '最帅')   # UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6
                           # in position 0: ordinal not in range(128)

Python 2 silently hides the byte-to-unicode conversion. As long as the data is pure ASCII, every implicit conversion succeeds; but the moment a non-ASCII character sneaks into your program, the default ASCII decoding fails, producing a UnicodeDecodeError. Py2's encoding model makes processing ASCII easier, and the price is that it blows up when handling non-ASCII.
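
The usual Py2 remedy is to do the conversion yourself at the boundary instead of letting the implicit ASCII decode fire. A minimal Python 2 sketch (my addition):

#coding:utf8
s = '苑昊'             # str: utf8 bytes, matching the declared file encoding
u = s.decode('utf8')   # explicit decode: no 'ascii' codec involved
print u + u'最帅'      # unicode + unicode: safe, prints 苑昊最帅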

3. String encoding in Py3

Python 3 renamed the unicode type to str, and the old str type has been replaced by bytes.

So Py3 likewise has two data types: str and bytes. The str type holds unicode data and the bytes type holds byte data; compared with Py2, only the names have changed.

import json

s = '苑昊'
print(type(s))        # <class 'str'>
print(json.dumps(s))  # "\u82d1\u660a"

b = s.encode('utf8')
print(type(b))        # <class 'bytes'>
print(b)              # b'\xe8\x8b\x91\xe6\x98\x8a'

u = b.decode('utf8')
print(type(u))        # <class 'str'>
print(u)              # 苑昊
print(json.dumps(u))  # "\u82d1\u660a"

print(len('苑昊'))    # 2

Py3's encoding philosophy:

Probably the most important new feature of Python 3 is a much clearer distinction between text and binary data: byte strings are no longer decoded automatically. Text is always unicode, represented by the str type, and binary data is represented by the bytes type. Python 3 never mixes str and bytes implicitly, which makes the distinction particularly clean. You cannot concatenate a string with a bytes object, search for a string inside a bytes object (or vice versa), or pass a string to a function that expects bytes (or vice versa).

# print('alvin' + u'yuan')  # byte string + unicode; py2: alvinyuan
print(b'alvin' + 'yuan')    # bytes + str; py3: TypeError, can't concat bytes to str

Note: in both Py2 and Py3, unicode data corresponds directly to plaintext, and printing unicode data displays the corresponding plaintext (English and Chinese alike).

4. How files are encoded from disk to memory (******)

With all that said, we have finally arrived at the main point!

Set the program-execution question aside for a moment. Let me ask you: everyone has used a text editor; if you are not sure what counts, you have at least used Word. When we edit text in Word, whether Chinese or English, in what form is the data kept in memory before it is saved? Right: as unicode data. Why unicode? Because its name makes the boldest promise: the universal code. Any character in the world, English, Chinese, Japanese, Latin, anything, has a unique code point in it, so its compatibility is the best.

Okay, so what happens when we save the data to disk?

The answer is: a byte string, encoded in some particular way. For example UTF-8, a variable-length encoding that saves space, or the GBK encoding, a product of history, and so on. So our text-editing software has a default encoding for saving files, such as UTF-8 or GBK, and when we click Save, the editor "silently" does the encoding work for us.

And when we open the file again, the software silently does the decoding work for us, decoding the bytes back into unicode, which can then be rendered as plaintext for the user. So unicode is the form closer to the user, and bytes is the form closer to the computer.
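
A small Python 3 sketch (my addition; the file name hello.txt is just an example) that plays the editor's role, encoding on save and decoding on open:

text = u'苑昊'
with open('hello.txt', 'w', encoding='utf8') as f:   # "save": unicode -> utf8 bytes on disk
    f.write(text)
with open('hello.txt', 'r', encoding='utf8') as f:   # "open": utf8 bytes -> unicode in memory
    print(f.read())                                  # 苑昊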

What does all this have to do with executing our programs?

Let's first establish a concept: the Python interpreter is itself a piece of software, just like a text editor!

Now let's trace a .py file from creation to execution through the whole encoding process:

Open PyCharm, create a file hello.py, and write:

ret = 1 + 1
s = '苑昊'
print(s)

When we save, the hello.py file is written to disk in PyCharm's default encoding. When the file is closed and opened again, PyCharm reads the content and decodes it with that same default encoding, turning it back into unicode in memory, and we see our plaintext.

And if we click the Run button, or run the file on the command line, the Python interpreter is invoked: it opens the file and decodes the bytes on disk into unicode data, just as the editor does. The difference is that the interpreter then translates that unicode data onward into C-level code and then into a binary data stream, and finally executes it by having the operating system drive the CPU. At that point the whole process is complete.

So here is the question: our text editor has its own default way of encoding and decoding; does our interpreter have one too?

Of course: Py2 defaults to ASCII, and Py3 defaults to UTF-8. You can query it as follows:

import sys
print(sys.getdefaultencoding())

Do you remember this declaration?

#coding:utf8

Yes. If the Py2 interpreter executes a UTF-8 encoded file, it will try to decode it with its default ASCII encoding, and as soon as the program contains Chinese, decoding naturally fails. So we declare #coding:utf8 at the top of the file; this tells the interpreter: do not decode this file with your default encoding, use UTF-8 instead. The Py3 interpreter is much more convenient, since its default encoding is already UTF-8.
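
In Py3 you can even ask the standard library which encoding a source file declares. A sketch (my addition, assuming a hello.py sits in the current directory):

import tokenize

with open('hello.py', 'rb') as f:                      # read raw bytes
    encoding, _ = tokenize.detect_encoding(f.readline)
print(encoding)   # e.g. 'utf-8' when the file starts with #coding:utf8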

Note: the string encoding we discussed above describes how the data is stored in memory while the CPU executes the program; that is a separate process from the file encoding here. Don't confuse the two!

5. Common encoding problems

5.1 Garbled output under cmd

hello.py

#coding:utf8
print('苑昊')

The file itself is saved with UTF-8 encoding as well.

Think about it: why does this work fine in the IDE under both Python 2 and Python 3, while under cmd.exe Python 3 is correct but Python 2 prints mojibake?

We are executing in the Windows terminal, cmd.exe, and note that cmd.exe is itself a piece of software. When we run python2 hello.py, the Python 2 interpreter (default encoding ASCII) sees the declared UTF-8 encoding, and the file really was saved as UTF-8, so reading it poses no problem. The problem arises with print '苑昊': the interpreter side executes normally and raises no error; the printed content is simply handed to cmd.exe to display. In Py2 that content is UTF-8-encoded byte data, but cmd.exe's default decoding is GBK, so it decodes the UTF-8 bytes with GBK and naturally produces mojibake.

Py3 displays correctly because what it passes to cmd is unicode data, which cmd.exe can recognize, so the display is fine.

Once you understand the principle, there are many ways to fix it. For example:

print(u'苑昊')

With this change, Python 2 under cmd no longer has the problem.
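
If you want to see which encoding the terminal side expects, a one-line check (my addition; the values shown are typical, not guaranteed):

import sys
print(sys.stdout.encoding)   # often 'cp936' (GBK) in a Chinese-Windows cmd, 'utf-8' elsewhere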

5.2 Encoding issues in open()

Create a text file named hello, saved as UTF-8, with this content ("Yuanhao, you are the most handsome!"):

苑昊,你最帅!

Create an index.py in the same directory:

f = open('hello')
print(f.read())

Why is the result normal under Linux (苑昊,你最帅!), but mojibake such as 鑻戞槉 under Windows with the Py3 interpreter?

Because your Windows operating system was installed with GBK as its default encoding, while the Linux operating system defaults to UTF-8.

When the open function runs without an explicit encoding, the file is opened with the operating system's default encoding: Windows decodes the UTF-8 file with its default GBK, and mojibake naturally follows.

Workaround:

f = open('hello', encoding='utf8')
print(f.read())

Conversely, if your file had been saved as GBK, you would not need to specify an encoding under Windows.

Likewise, if your Windows machine works without encoding='utf8', that means its default encoding is already UTF-8, either as installed or because it has been changed to UTF-8.

Note: the open function differs between Py2 and Py3; the encoding=None parameter exists only in Py3.
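
You can check which default your OS hands to open() with one line (my addition):

import locale
print(locale.getpreferredencoding(False))   # e.g. 'cp936' (GBK) on Chinese Windows, 'UTF-8' on Linux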
