A detailed explanation of character encoding in Python

Source: Internet
Author: User

Before reading this article, you should already know why character encodings exist and what the common encodings are:

    • ASCII: 1 byte per character, supports English only
    • GB2312: 2 bytes per character, supports 6,700+ Chinese characters
    • GBK: an upgraded version of GB2312, supports 21,000+ Chinese characters
    • Shift-JIS: Japanese characters
    • ks_c_5601-1987: Korean encoding
    • TIS-620: Thai encoding

Each country's encoding maps the characters of that country to binary, but all of the encodings above share a limitation: each covers only its own national characters and has no mapping for the characters of other countries. This led to Unicode, the universal code, which maps all of the world's text to binary:

    • Unicode: 2-4 bytes per character; it already includes 136,690 characters and is still expanding ...

Unicode plays 2 roles:

    1. It directly supports all languages in the world, so each country can stop using its own old encoding and use Unicode instead (just as English serves as a universal language).
    2. Unicode contains a mapping to the encodings of every country in the world. Why that matters is explained later.

Unicode solves the correspondence between characters and binary, but using Unicode directly to represent a character wastes space. For example, representing "Python" at 2 bytes per character takes 12 bytes, twice the 6 bytes of the ASCII representation.

Because computer memory is comparatively large and strings in memory are usually not very big, content can be processed as Unicode in memory. Data that is stored or sent over a network, however, is usually much larger, and doubling its size would be intolerable.

To solve the storage and network transmission problem, the Unicode Transformation Format (UTF) was introduced: it transforms Unicode so as to save space in storage and network transmission.

    • UTF-8: uses 1, 2, 3, or 4 bytes per character. 1 byte is tried first, and a byte is added each time that is not enough, up to 4 bytes. English takes 1 byte, European languages take 2, East Asian languages take 3, and other special characters take 4.
    • UTF-16: uses 2 or 4 bytes per character; 2 bytes are preferred, otherwise 4 bytes are used.
    • UTF-32: uses 4 bytes for every character.

Summary: UTF is a family of encoding schemes designed for Unicode that saves space in storage and transmission.
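The byte counts above can be checked with a short Python 3 sketch. The helper name `encoded_sizes` and the sample characters are only illustrative; the `-le` codec variants are used so that no byte order mark is counted.

```python
def encoded_sizes(text):
    """Return the number of bytes `text` occupies in each UTF encoding."""
    # The "-le" variants are used so that no byte order mark (BOM) is counted.
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }

# English text, a European letter (U+00E9), a Chinese character, an emoji.
for sample in ["Python", "\u00e9", "路", "\U0001F600"]:
    print(ascii(sample), encoded_sizes(sample))
```

"Python" comes out as 6 bytes in UTF-8 and 12 bytes in UTF-16, which is the doubling described earlier.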

Characters stored on the hard disk

Whatever encoding is used to represent characters in memory, what is saved to the hard disk is always binary.

    ASCII encoding (United States):
        l   0b1101100
        o   0b1101111
        v   0b1110110
        e   0b1100101
    GBK encoding (China):
        老   0b11000000 0b11001111
        男   0b11000100 0b11010000
        孩   0b10111010 0b10100010
    Shift_JIS encoding (Japan):
        私   0b10001110 0b10000100
        は   0b10000010 0b11001101
    ks_c_5601-1987 encoding (Korea):
        나   0b10110011 0b10101010
        는   0b10110100 0b11000010
    TIS-620 encoding (Thailand):
        ฉัน  0b10101001 0b11010001 0b10111001
    ...

Note that whatever encoding was used to write the data to the hard disk must also be used to read it back; otherwise the text comes out garbled.
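A minimal Python 3 sketch of that rule: the same file is read back once with the encoding it was written in and once with a different one. The temporary file and the sample text are only for illustration.

```python
import os
import tempfile

text = "老男孩"

# Write the text to a temporary file using GBK.
with tempfile.NamedTemporaryFile("w", encoding="gbk", suffix=".txt", delete=False) as f:
    f.write(text)
    path = f.name

try:
    # Reading with the same encoding restores the original text.
    with open(path, encoding="gbk") as f:
        same = f.read()

    # Reading with a different encoding garbles it; errors="replace" keeps
    # the demo from raising UnicodeDecodeError on invalid byte sequences.
    with open(path, encoding="utf-8", errors="replace") as f:
        garbled = f.read()
finally:
    os.remove(path)

print(same == text)     # the round trip with matching encodings succeeds
print(garbled == text)  # the mismatched read does not reproduce the text
```

Without `errors="replace"`, the mismatched read would normally stop with a `UnicodeDecodeError` instead of returning garbled text.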

  

Encoding conversion

English is the international language, yet within their own country people still speak their own language; only when they go abroad do they have to speak English.
Encodings work the same way. Unicode and UTF-8 exist, but for historical reasons countries still make heavy use of their own encodings. For example, Windows in China still defaults to GBK, not UTF-8.

Because of this, Chinese software exported to the United States will display garbled text on an American computer, because that computer has no GBK encoding.
If you want Chinese software to display normally on an American computer, there are only 2 routes:

    1. Install the GBK encoding on the American computer.
    2. Encode your software in UTF-8.

The 1st method is almost impossible to carry out, and the 2nd is relatively simple, but it only works for newly developed software. If the software you developed earlier is GBK encoded and millions of lines of code have already been written, re-encoding it all as UTF-8 is a huge effort.

So for projects already completed in GBK, neither of the 2 options above easily makes the project display normally on an American computer. Is there no other way?
There is. Remember that one of Unicode's features is that it contains a mapping to the encodings of every country in the world. That means that even though you wrote "路飞学城" in GBK, the Unicode code points of "路飞学城" can be looked up automatically. So no matter what encoding your data is stored in, as long as your software converts it to Unicode when reading it from the hard disk into memory, it can be displayed.
Since virtually all modern systems and programming languages support Unicode, your GBK software can be placed on an American computer, loaded into memory, converted to Unicode, and the Chinese text displays normally.

The mapping table between Unicode and GBK: http://www.unicode.org/charts/

How Python 3 executes code

Before looking at actual code examples, let's go over how Python 3 executes code:

    1. The interpreter finds the code file, loads the code into memory using the encoding declared in the file header, and converts it to Unicode.
    2. The code string is interpreted according to the syntax rules.
    3. All string variables are declared as Unicode.
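Steps 1 and 3 can be observed from inside Python 3 with a small sketch. It assumes that `compile()` on raw bytes is an acceptable stand-in for the interpreter loading a file from disk; the `<demo>` file name and the variable names are illustrative only.

```python
# Source code as it would sit on disk: GBK-encoded bytes whose first line
# declares the encoding (the "file header" the interpreter reads).
source_bytes = '# -*- coding: gbk -*-\ns = "路飞学城"\n'.encode("gbk")

namespace = {}
# compile() accepts bytes and honours the coding declaration, much like the
# interpreter does when it loads a .py file.
exec(compile(source_bytes, "<demo>", "exec"), namespace)

s = namespace["s"]
print(type(s))                     # the string variable is a Unicode str
print([hex(ord(ch)) for ch in s])  # Unicode code points, not GBK bytes
```

Although the source bytes were GBK, the resulting variable holds Unicode code points, so the original file encoding no longer matters once the code is in memory.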
Encoding conversion process

Here is an actual demo: write the code below in Py3, save the file as UTF-8, then run it on Windows:

    s = '路飞学城'
    print(s)

So everything looks wonderful, and our study of encodings could end here.

But as in life, something unpleasant always hides beneath a pleasant surface. The UTF-8 code above displays normally in the Windows GBK terminal because the Python interpreter converted UTF-8 to Unicode in memory. That is only true of Python 3, though. Not every programming language uses Unicode as its in-memory default, and the notorious Python 2 is one that does not. Its default encoding is ASCII, so to write Chinese you have to declare the file encoding as GBK or UTF-8 in the file header. After that declaration, the Python 2 interpreter only uses the header encoding to interpret your code; once the code is loaded into memory it does not convert anything to Unicode for you. In other words, if your file is encoded as UTF-8, your string variables in memory are also UTF-8. What does that mean? It means that a file you encoded in UTF-8 shows garbled text on Windows.

Garbled output is the expected result here, because on Windows there are only 2 cases in which the display is not garbled:

    1. The string is GBK encoded
    2. The string is Unicode

Since Python 2 does not automatically convert the file encoding to Unicode in memory, you have to take the last resort and convert it manually yourself. When Py3 converts the file encoding to Unicode automatically, it has to call some method to do so. Those methods are decode (decoding) and encode (encoding):

    UTF-8   --> decode (decoding) --> Unicode
    Unicode --> encode (encoding) --> GBK / UTF-8
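The same chain written out in Python 3, as a minimal sketch. In Python 3, `decode` is a method of `bytes` and `encode` is a method of `str`; the variable names are illustrative.

```python
utf8_bytes = "路飞学城".encode("utf-8")    # what a UTF-8 file holds on disk

unicode_text = utf8_bytes.decode("utf-8")  # bytes -> Unicode (str)
gbk_bytes = unicode_text.encode("gbk")     # Unicode (str) -> GBK bytes

print(utf8_bytes)   # 12 bytes, 3 per Chinese character
print(gbk_bytes)    # 8 bytes, 2 per Chinese character
print(gbk_bytes.decode("gbk") == unicode_text)
```

Unicode is always the intermediate form: there is no direct UTF-8 to GBK step, only decode followed by encode.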

[Figure: decode example]

[Figure: encode example]

[Figure: the rules to remember]

How do you verify that a conversion went right?

1. Look at the data type; Python 2 has a dedicated unicode type
2. Look the character up in the Unicode mapping table

Unicode characters can be identified by their dedicated unicode type, but UTF-8 and GBK encoded characters are both str, so how do you tell which encoding a given string is in? Some say you can judge by byte length, because a Chinese character takes 3 bytes in UTF-8 and 2 bytes in GBK.

Counting bytes this way gives a rough idea of the encoding, but it never feels very rigorous.

The accurate way to verify the encoding of a character is to take its hexadecimal values and match them against the code table.

The Unicode code points of "路飞学城" are u'\u8def\u98de\u5b66\u57ce', and '\u8def' is '路'; look it up in the table.
The GBK encoding of "路飞学城" is '\xc2\xb7\xb7\xc9\xd1\xa7\xb3\xc7', 2 bytes per Chinese character. The bytes of "路" are "\xc2\xb7": 4 hexadecimal digits, exactly 2 bytes. Look "路" up in the Unicode mapping table and the entry found is G0-4237, not \xc2\xb7. They do not match.

Next check "飞", \u98de: the table gives G0-3749, which does not match \xb7\xc9 either.
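The hexadecimal values quoted above can be reproduced in Python 3 with a short sketch (the variable names are illustrative):

```python
text = "路飞学城"

# Unicode code points, written the way Python 2 shows them (\uXXXX).
code_points = ["\\u%04x" % ord(ch) for ch in text]
print(code_points)

# GBK bytes, two per Chinese character.
gbk = text.encode("gbk")
pairs = [gbk[i:i + 2].hex() for i in range(0, len(gbk), 2)]
print(gbk)
print(pairs)
```

The second list pairs up the GBK bytes per character, which makes it easier to compare them with the table entries.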

They do not match, but notice that in \xc2\xb7 and G0-4237 the 2nd digit (2) and the 4th digit (7) agree, and the same holds for "飞". Is that a coincidence?

Convert them all to binary and see:

    byte    hex digits    binary (bit weights 8421 8421)
    \xc2    C 2           1100 0010
    \xb7    B 7           1011 0111
    \xb7    B 7           1011 0111
    \xc9    C 9           1100 1001

The bytes of "路" still do not match G0-4237. But try setting the leftmost bit of each byte of \xc2\xb7 to 0: 1100 0010 becomes 0100 0010 (hex 42) and 1011 0111 becomes 0011 0111 (hex 37), and together they really are 4237. Is that another coincidence?

Not at all; it follows from how GBK represents characters. GBK was designed from the start to be compatible with ASCII: an English letter takes one byte and a Chinese character takes 2 bytes. But how do you tell whether 2 adjacent bytes represent 2 English letters or 1 Chinese character? The designers decided that if the 1st bit of each of the 2 bytes (the bit worth 128) is 1, the pair is a Chinese character. A byte whose first bit is 1 (value 128 or above) is called a high byte, so 2 high bytes next to each other must be a Chinese character. How could they be so sure? Because 0-127 already covers the characters used in English, while 128-255 is the extended ASCII table, which holds special characters that are rarely used, so the designers simply took over that range. (Strictly, the both-bytes-high rule belongs to GB2312. GBK keeps it for the GB2312 characters but also allows second bytes in the range 0x40-0x7E for its additional characters, so only the first byte is guaranteed to be a high byte.)
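The high-byte rule can be sketched as a small splitter. `split_gbk` is a hypothetical helper written for this illustration, not a library function, and it assumes the input is valid GBK: it only looks at the high bit of the first byte of each unit.

```python
def split_gbk(data):
    """Split GBK bytes into 1-byte (ASCII) and 2-byte (Chinese) units."""
    units = []
    i = 0
    while i < len(data):
        # A first byte with the high bit set (>= 0x80) starts a 2-byte character.
        width = 2 if data[i] & 0x80 else 1
        units.append(data[i:i + width])
        i += width
    return units

mixed = "Py路飞".encode("gbk")
for unit in split_gbk(mixed):
    # Show each unit with the binary form of its bytes.
    print(unit, [format(b, "08b") for b in unit])
```

The two English letters come out as single bytes starting with 0, and each Chinese character comes out as a pair of bytes starting with 1.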

Q: Then why does the 128 bit of each byte of "\xc2\xb7" have to be removed before it matches G0-4237 in the Unicode table?

The only explanation is that the Unicode mapping table writes the code with the high bit of each byte left out, whereas the bytes actually stored must have the high bit set.
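Clearing the high bit can be done with a bitwise AND against 0x7F. The sketch below uses a hypothetical helper, `gbk_to_g0`, and assumes the character belongs to the GB2312 range, where both bytes have the high bit set.

```python
def gbk_to_g0(ch):
    """Return the G0-style code for one character by clearing each byte's high bit."""
    raw = ch.encode("gbk")
    stripped = bytes(b & 0x7F for b in raw)  # set the 128 bit of every byte to 0
    return "G0-" + stripped.hex().upper()

for ch in "路飞":
    print(ascii(ch), ch.encode("gbk"), gbk_to_g0(ch))
```

For "路" and "飞" this yields G0-4237 and G0-3749, the two table entries discussed above.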

The Python bytes type

Write a string in Python 2:

    >>> s = "路飞"
    >>> print s
    路飞
    >>> s
    '\xe8\xb7\xaf\xe9\xa3\x9e'

The print shows 路飞, but evaluating the variable s directly shows a run of hexadecimal bytes. What should we call data like this? Simply binary? That works, but why is the data shown in hexadecimal rather than as 010101? Just to make it more readable for people. We call it the bytes type, that is, the byte type: a group of 8 binary bits is called a byte, and it is written in hexadecimal.

What does that mean?

To tell the truth, a Python 2 string is better described as a byte string, as the way it is stored shows. But Python 2 also has a type named bytes; is that a byte string too? Yes: in Python 2, bytes == str, so they are one and the same.

Besides these, Python 2 has a separate unicode type, which is what a string becomes after it is decoded.

    >>> s
    '\xe8\xb7\xaf\xe9\xa3\x9e'  # utf-8
    >>> s.decode('utf-8')
    u'\u8def\u98de'  # unicode: the positions in the Unicode code table
    >>> print(s.decode('utf-8'))
    路飞  # characters in unicode form


Because of the limits of what Python's founder could foresee at the beginning, he did not anticipate that Python would grow into a globally popular language, so support for the world's languages was not treated as important early on, and ASCII was rather casually chosen as the default encoding. When the calls to support Chinese, Japanese, French, and other languages grew louder, Python set out to introduce Unicode. But changing the default encoding directly to Unicode was unrealistic: a great deal of software had been developed on the old ASCII default, and changing the encoding would have thrown all of that code into disorder. So Python 2 simply added a new character type, the unicode type. If you want your Chinese text to display properly on every computer in the world, you have to store the string in memory as the unicode type:

    >>> s = "路飞"
    >>> s
    '\xe8\xb7\xaf\xe9\xa3\x9e'
    >>> s2 = s.decode("utf-8")
    >>> s2
    u'\u8def\u98de'
    >>> type(s2)
    <type 'unicode'>


By 2008 Python had been in development for nearly 20 years. Its founder, Guido van Rossum ("Uncle Turtle"), felt more and more that many things in Python had drifted away from his original intent: the language had become bloated and less concise, and some of the design was hard to make sense of. The relationship between the unicode and str types, and between str and bytes, confused a great many Python programmers.
Guido could bear it no longer. The earlier tinkering had not been enough to make Python better, so a major overhaul followed and Python 3 was born, incompatible with Python 2. Python 3 made many improvements over Python 2. One of them was finally making strings Unicode and making UTF-8 the default file encoding. This means that with Python 3, whatever encoding your program was developed in, its text can generally be displayed normally on computers around the world. That is a great thing!

Besides changing the string encoding to Unicode, Py3 also drew a clear line between str and bytes: str is characters in Unicode form, and bytes is simply binary data.

One last question: why, in Py3, does a string turn into bytes once it is encoded? Wouldn't it be nicer to print it directly as GBK characters? I think the Py3 design is deliberate here: it wants to make it clear to you that, in Py3, anything you see as characters must be Unicode, and every other encoding is displayed as bytes.
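A short Python 3 sketch of the str/bytes split described above (the variable names are illustrative):

```python
s = "路飞"            # str: Unicode text
b = s.encode("gbk")   # bytes: the GBK-encoded binary form

print(type(s))
print(type(b), b)     # shown as b'\xc2\xb7\xb7\xc9', not as characters

# bytes only become readable characters again after decoding back to str.
restored = b.decode("gbk")
print(restored == s)
```

Printing `b` never shows Chinese characters, whatever the terminal encoding is, because Python 3 does not guess which encoding a bytes object is in.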

Finally, a reminder: whenever Python runs into an encoding problem, the cause is nothing more than an encoding setting that is wrong somewhere.
Common causes of encoding errors are:

      • Default encoding for Python interpreter
      • Python source file encoding
      • Encoding used by the terminal
      • Language settings for the operating system
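Most of these settings can be inspected from Python 3 itself. The sketch below uses only standard library calls; the dictionary labels are illustrative. The source file encoding is not queried at run time: it comes from the file header and defaults to UTF-8 in Python 3.

```python
import locale
import sys

settings = {
    "interpreter default encoding": sys.getdefaultencoding(),
    "filesystem encoding": sys.getfilesystemencoding(),
    "terminal (stdout) encoding": getattr(sys.stdout, "encoding", None),
    "OS locale preferred encoding": locale.getpreferredencoding(False),
}

for name, value in settings.items():
    print(f"{name}: {value}")
```

When text is garbled, comparing these values against the encoding the data was actually written in usually shows where the mismatch is.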

Once you understand how these encodings relate to one another, you can check each of them in turn to find the error.
