Python Basics: Character Encoding


Part One: Background knowledge for understanding character encoding

1. Basic computer knowledge

2. How a text editor accesses a file (Notepad++, PyCharm, Word)

# 1. Opening an editor launches a process, which lives in memory, so content written in the editor is also held in memory and is lost after a power failure.
# 2. To save it permanently, you click the Save button: the editor flushes the data from memory to the hard disk.
# 3. A .py file we write (but have not yet executed) is no different from any other file; it is just a pile of characters.

3. How the Python interpreter executes a .py file, e.g. python test.py

# Stage 1: The Python interpreter starts, which is equivalent to launching a text editor.
# Stage 2: The Python interpreter, acting as that text editor, opens test.py and reads its contents from the hard disk into memory. (Quick review: Python is interpreted, so the interpreter only cares about the file's contents, not its suffix.)
# Stage 3: The Python interpreter executes the test.py code just loaded into memory. (PS: only in this stage, when the code actually runs, is Python syntax recognized; when execution reaches name="egon", memory is allocated to hold the string "egon".)
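A rough sketch of stages 2 and 3, assuming Python 3 and a UTF-8-encoded test.py: reading the source into memory is an ordinary file read, and only the separate execution step treats the text as Python code:

    # stage 2: read test.py from the hard disk into memory, exactly as a text editor would
    source = open('test.py', encoding='utf-8').read()
    # stage 3: only now is the loaded text recognized as Python syntax and executed
    exec(source)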

4. Similarities and differences between the Python interpreter and a text editor

# 1. Same point: to interpret and execute a file's contents, the Python interpreter must be able to read the .py file, just as a text editor does.
# 2. Different point: a text editor reads the file's contents into memory for display or editing, ignoring Python syntax, whereas the Python interpreter reads the contents into memory not to show you what the code looks like, but to execute it, so Python syntax is recognized.

Part Two: Introduction to character encoding

1. What is character encoding

A computer must be powered on to work: electricity drives it, so the characteristics of electricity determine the characteristics of the computer. Electricity has exactly two states, high voltage and low voltage (in human logic, binary 1 corresponds to high voltage and binary 0 to low voltage); the magnetic states of a disk work on the same principle. Conclusion: obviously, a computer only understands numbers, while what we humans use are characters (the result of programming in a high-level language is nothing more than writing a pile of characters into a file). How, then, can a computer read human characters? It must go through a process:

character --(translation)--> number

This process is really a standard for which specific number a given character corresponds to, and that standard is called a character encoding.
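That character-to-number correspondence can be inspected directly in Python (a quick illustration; the built-in ord() gives the number a character corresponds to, and chr() goes the other way):

    print(ord('A'))   # 65 -- the number standing for the character 'A'
    print(chr(65))    # A  -- and back from number to character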

2. The following two scenarios involve character encoding:

# 1. The contents of a Python file are a pile of characters, so saving and reading the file involve character encoding (the file not yet executed, i.e. the first two stages above).
# 2. Python's string data type is composed of characters (the file being executed, i.e. the third stage).

3. The history and classification of character encodings (background)

Computers were invented by Americans, so the earliest character encoding was ASCII, which defines only the correspondences between numbers and English letters, digits, and some special characters. It uses at most 8 bits (one byte), i.e. 2**8 = 256, so ASCII can represent at most 256 symbols.

Of course, English is enough for the programming language itself, so ASCII suffices there; but when processing data, different countries use different languages: the Japanese add Japanese to their programs, and the Chinese add Chinese.

To represent Chinese, one byte per character is simply impossible (even a primary-school student knows more than 2000 Chinese characters). There is only one solution: let one character use more than 8 bits; the more bits, the more distinct values, and thus the more Chinese characters that can be expressed.

So the Chinese set their own standard, the gb2312 encoding, which specifies the correspondences for characters including Chinese.

The Japanese set their own Shift_JIS encoding.

The South Koreans set their own EUC-KR encoding. (The Koreans also claimed that computers were invented by them and demanded the world unify on the Korean encoding, but the rest of the world did not go along.)

At this point a problem appears: if Xiao Zhou, a student fluent in 18 languages, modestly writes a document using 8 of them, then no matter which country's standard the document is stored with, it will come out garbled, because each of these standards covers only its own country's text (the correspondences between its own characters and numbers). If one country's encoding format is used, the text of the remaining languages has no correspondences and is garbled when parsed.

So a world standard was urgently needed, one that could contain all the languages of the world, and thus Unicode came into being. (The Koreans said no, but it made no difference.)

ASCII uses 1 byte (8 bits of binary) to represent one character.

Unicode commonly uses 2 bytes (16 bits of binary) to represent one character; rare characters need 4 bytes.

Examples:

The letter x, represented in ASCII, is decimal 120, binary 0111 1000.

The Chinese character 中 is beyond the range of ASCII; its Unicode encoding is decimal 20013, binary 01001110 00101101.

The letter x represented in Unicode is binary 0000 0000 0111 1000, so Unicode is compatible with ASCII, and it also accommodates all nations' characters: it is the world's standard.
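The examples above can be verified in Python 3, where ord() returns a character's Unicode code point:

    print(ord('x'), bin(ord('x')))     # 120 0b1111000
    print(ord('中'), bin(ord('中')))   # 20013 0b100111000101101
    print(chr(20013))                  # 中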

At this point the mojibake problem disappears: all documents can use Unicode. But a new problem arises: if a document is entirely English, Unicode takes twice the space of ASCII, which is very inefficient for storage and transmission.

In the spirit of saving space, a "variable-length" encoding of Unicode appeared: UTF-8. UTF-8 encodes a Unicode character into 1 to 6 bytes depending on its numeric size: common English letters take 1 byte, a Chinese character usually takes 3 bytes, and only very rare characters take 4 to 6 bytes. If the text you want to transfer contains a large amount of English, UTF-8 saves space:

character   ASCII       Unicode               UTF-8
A           01000001    00000000 01000001     01000001
中          (none)      01001110 00101101     11100100 10111000 10101101

From the table above you can also see an added benefit of UTF-8: ASCII can effectively be regarded as a subset of UTF-8, so a large amount of legacy software that supports only ASCII can keep working under UTF-8.
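A small Python 3 check of UTF-8's variable length and its ASCII compatibility (byte counts as in the table):

    print('A'.encode('utf-8'))    # b'A' -- 1 byte, identical to the ASCII encoding
    print('中'.encode('utf-8'))   # b'\xe4\xb8\xad' -- 3 bytes
    print(len('A'.encode('utf-8')), len('中'.encode('utf-8')))   # 1 3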

4. Summary: the development of character encodings can be divided into three stages (important)

# Stage one: Modern computers originated in the United States, so the earliest encoding was born with only English in mind: ASCII.
ASCII: one byte represents one character (English characters and all the other characters on a keyboard); 1 byte = 8 bits, and 8 bits can represent 2**8 = 256 distinct values, i.e. 256 characters. ASCII initially used only the last 7 bits, 127 numbers, which was fully enough for all the characters on a keyboard; later, to encode Latin letters into the ASCII table, the highest bit was occupied as well.

# Stage two: To handle both Chinese and English, the Chinese defined GBK.
GBK: 2 bytes represent one Chinese character, 1 byte represents one English character. To meet their own needs, other countries each defined their own encodings: the Japanese encoded Japanese into Shift_JIS, and the Koreans encoded Korean into EUC-KR.

# Stage three: With each country having its own standard, conflicts were inevitable; the result is that text mixing multiple languages displays as mojibake. How can this be solved?

!!! Very important !!!
Plainly speaking, the essence of the mojibake problem is the lack of a unified standard. If the whole world were unified, used only one set of written symbols, and then used one encoding for it, mojibake would no longer exist (like Qin Shihuang unifying China: one script for writing, one gauge for carriages, and all the trouble is gone). Obviously, that assumption cannot hold: many places, old systems, and old applications still use all kinds of encodings; this is the legacy of history. So we need a solution, an encoding that satisfies:
# 1. It is compatible with all the world's characters.
# 2. It has a mapping relationship with every other character encoding in the world, so it can be converted to any country's encoding.
This is Unicode (fixed-length): uniformly 2 bytes per character. Although 2**16 - 1 = 65535, Unicode can hold 1,000,000+ characters, because Unicode also stores mapping relationships with other encodings; strictly speaking, Unicode is not merely a character-encoding table. Download the PDF to view the Unicode details: https://pan.baidu.com/s/1dEV3RYp
Clearly, for text that is entirely English, Unicode doubles the storage space (the binary is ultimately stored in the storage medium in electrical or magnetic form). Hence UTF-8 (variable-length, full name Unicode Transformation Format): English characters take only 1 byte, Chinese characters take 3 bytes, and other rare characters take more bytes.
# Summary: memory uniformly uses Unicode, trading wasted space for the ability to convert to any encoding without mojibake; the hard disk can use various encodings, such as utf-8, ensuring that data stored on disk or transmitted over the network stays small, improving transmission efficiency and stability.
!!! This is the most important point !!!
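A quick Python 3 illustration of that summary: the str held in memory is Unicode, and encoding only happens on the way out to disk or network:

    s = '上'                      # held in memory as Unicode
    print(s.encode('utf-8'))      # b'\xe4\xb8\x8a' -- bytes destined for disk/network
    print(s.encode('gbk'))        # b'\xc9\xcf'     -- the same character under another encoding
    # both byte strings decode back to the very same in-memory Unicode string
    print(s.encode('utf-8').decode('utf-8') == s.encode('gbk').decode('gbk'))   # True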

Part Three: Character encoding in file editors

3.1 Text editor: Notepad++

First, two concepts to be clear about:

# 1. Flushing a file from memory to the hard disk is called, for short, saving the file.
# 2. Reading a file from the hard disk into memory is called, for short, opening the file.

Mojibake arises in two situations:

# Mojibake case one: the file is garbled already when it is saved.
When saving a file that contains text from several countries with shift_jis, the other countries' text essentially fails to be stored, because shift_jis has no correspondence for it. If we insist on saving anyway, the editor will not report an error (should the editor crash just because your encoding is wrong?), but what could not be stored and was saved by force is bound to be a mess: the file is garbled already at the saving stage. If we then open the file with shift_jis, the Japanese displays normally while the Chinese is garbled.

# Simulating the editor with the open() function (the write can be tested like this; the sample strings are the Chinese 你瞅啥 and the Japanese 何をみて):

    f = open('a.txt', 'w', encoding='shift_jis')
    f.write('你瞅啥\n何をみて\n')
    # '你瞅啥' has no correspondence in shift_jis and fails to be stored; only '何をみて\n' can be saved successfully

# Opening a.txt with any single encoding leaves the remaining two lines displayed incorrectly:

    f = open('a.txt', 'wb')
    f.write('何をみて\n'.encode('shift_jis'))
    f.write('你愁啥\n'.encode('gbk'))
    f.write('你愁啥\n'.encode('utf-8'))
    f.close()

# Mojibake case two: the file is saved without mojibake but comes out garbled when read.
If the file is saved with utf-8 encoding, which is compatible with all nations' characters, nothing is garbled on disk; but if the wrong decoding method, such as gbk, is chosen when the file is read, the mojibake appears at the reading stage. Read-stage mojibake can be fixed: just choose the correct decoding method.

!!! That is the whole analysis of mojibake !!!
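A sketch of mojibake case two in Python 3 (using a hypothetical file b.txt): the bytes on disk are fine, and only the choice of decoding determines whether the text comes back garbled:

    with open('b.txt', 'w', encoding='utf-8') as f:   # saved correctly as utf-8
        f.write('你好\n')
    print(open('b.txt', encoding='gbk').read())       # wrong decoding: mojibake such as '浣犲ソ'
    print(open('b.txt', encoding='utf-8').read())     # correct decoding: 你好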

3.2 Text editor: PyCharm

Open in utf-8 format (select reload)

# The difference between reload and convert in PyCharm: PyCharm is very powerful and offers automatic convert, i.e. re-encoding the characters into the correct format. To explore the nature of character encoding, we do not use that feature; we choose reload instead, i.e. re-read the file according to some specified encoding.

3.3 Text editor: the Python interpreter

Suppose the file test.py is saved in GBK format with the content x='林'. Then either python2 test.py or python3 test.py will raise an error (because python2 defaults to ascii and python3 defaults to utf-8), unless the beginning of the file specifies # coding:gbk.
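A minimal sketch of such a file (a hypothetical test.py saved as GBK; the first-line declaration tells the interpreter how to decode the source):

    # coding:gbk
    x = '林'
    print(x)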

3.4 Summary

      Summarize two very important points!!!

# 1. The core rule to keep in mind: whatever standard characters were encoded with, that same standard must be used to decode them. "Standard" here means the character encoding.
# 2. In memory, all characters are stored without distinction in Unicode. For example, when we open an editor and type a 你, we cannot yet say that 你 is a Chinese character; at this moment it is just a symbol, one that may be used by several countries, and depending on the input method we use its glyph style may differ. Only when we save it to the hard disk or send it over the network can we determine whether 你 is meant as Chinese or Japanese, and this is exactly the process of converting Unicode into some other encoding format.

unicode ----encode----> utf-8

utf-8 ----decode----> unicode
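In Python 3 this pair of arrows is simply:

    u = '你'                        # a Unicode str in memory
    b = u.encode('utf-8')           # unicode --encode--> utf-8 bytes: b'\xe4\xbd\xa0'
    print(b.decode('utf-8') == u)   # utf-8 --decode--> unicode: True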

# Addition: when browsing the web, the server converts dynamically generated Unicode content to UTF-8 and then transmits it to the browser; if the server-side encode format is UTF-8, what arrives in the client's memory is likewise UTF-8-encoded bytes.
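A minimal sketch of that idea, assuming a hypothetical handler that builds a page as a Unicode str:

    def handle_request():
        page = '<h1>你好</h1>'       # dynamically generated Unicode content
        body = page.encode('utf-8')   # converted to utf-8 bytes before being sent
        headers = b'Content-Type: text/html; charset=utf-8\r\n\r\n'
        return headers + body         # what the browser receives is utf-8-encoded bytes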

Part Four: Character encoding in Python

4.1 Three stages of executing a python program

python test.py (I will emphasize again: the first step of executing test.py must be to read the file's contents into memory)

The file test.py is saved in GBK format, with the following contents:

Phase one: Start the Python interpreter

Stage two: The Python interpreter now acts as a text editor, opening the file test.py and reading its contents from the hard disk into memory.

At this point, the Python interpreter reads the first line of test.py, #coding:utf-8, to decide what encoding format to use when reading the file into memory; this line sets the encoding format used by the Python interpreter itself as a piece of software. The default can be viewed with sys.getdefaultencoding(). If the Python file does not specify the header #-*-coding:utf-8-*-, the default is used: ascii in python2, utf-8 in python3.
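Checking that default (the output differs by interpreter, as described above):

    import sys
    print(sys.getdefaultencoding())   # 'ascii' under python2, 'utf-8' under python3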

Correction: since the test.py above is saved as GBK, the file header must specify the character encoding as gbk:

    # coding:gbk
    你好

    Phase three: Execute the code that has been read into memory (in Unicode encoding format); during execution, new memory space may be opened up, e.g. for x="egon".

Memory encoding uses Unicode, but that does not mean all of memory is Unicode. Before the program executes, memory really is all Unicode: for example, when a line x="egon" is read from the file, the x, the equals sign, and the quotes all have equal status, all ordinary characters, all stored in memory in Unicode format. But during execution the program requests additional memory (distinct from the memory holding the program's code) to hold Python data-type values. The Python string type involves the concept of characters too: when a line such as x="egon" is recognized by the Python interpreter as defining a string, memory space is requested to hold the string-type value. What encoding that string-type value is stored in depends on the Python interpreter, and here the string types of python2 and python3 differ.

4.2 The difference between string types in python2 and python3

4.2.1 In python2 there are two string types: str and unicode

The str type

When the Python interpreter executes code that produces a string (for example x='上'), it requests new memory and then stores '上' there encoded in the format specified at the beginning of the file.

To see x's true format in memory, put it in a list and print the list, rather than printing x directly, because print() automatically converts the encoding (we will discuss this shortly).

    # coding:gbk
    x = '上'
    y = '下'
    print([x, y])   # ['\xc9\xcf', '\xcf\xc2']
    # \x indicates hexadecimal; c9cf is 4 hexadecimal digits in total, each hex digit is 4 bits, so 4 hex digits are 16 bits, i.e. 2 bytes, which proves that GBK encodes a Chinese character in 2 bytes
    print(type(x), type(y))   # (<type 'str'>, <type 'str'>)

The key to understanding character encoding!!!

In-memory data is usually displayed in hexadecimal notation, where 2 hexadecimal digits represent one byte; for example, \xc9 is two hexadecimal digits, one byte.

GBK needs 2 bytes for a Chinese character but only 1 byte for an English character; how does it manage that?

GBK uses the first bit of each byte, i.e. the 8th (highest) bit, as a flag: if the flag bit is 1, the byte belongs to a Chinese character; if the flag bit is 0, it is an English character.

x='你a好' converted into GBK format occupies 8bit+8bit+8bit+8bit+8bit = (1+7bit) + (1+7bit) + (0+7bit) + (1+7bit) + (1+7bit)

In this way the computer reads the bytes in order from left to right:

# The flag bits of the first two bytes read consecutively are both 1, so together they form one Chinese character: 你
# The flag bit of the third byte is 0, so this 8bit represents an English character: a
# The flag bits of the last two bytes are both 1, so together they form one Chinese character: 好

That is, each byte leaves only 7 effective bits for storing the real value, and what the Unicode table records is just these effective 7 bits; the leading flag bit depends on the specific encoding. In other words, the form GBK takes in the Unicode table is:

(7bit) + (7bit) + (7bit) + (7bit) + (7bit)
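The flag bits can be inspected in Python 3 (this follows the simplified high-bit model described above; iterating over a bytes object yields integers whose highest bit is the flag):

    for byte in '你a好'.encode('gbk'):   # the gbk bytes are c4 e3 61 ba c3
        print(format(byte, '08b'), 'flag =', byte >> 7)
    # 11000100 flag = 1  \_ two flag-1 bytes read together: the Chinese character 你
    # 11100011 flag = 1  /
    # 01100001 flag = 0     a flag-0 byte on its own: the English character a
    # 10111010 flag = 1  \_ two flag-1 bytes read together: the Chinese character 好
    # 11000011 flag = 1  /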

With this translated result, we can look up the correspondence for a Chinese character in the Unicode table: https://pan.baidu.com/s/1dEV3RYp

You can see that the GBK encoding corresponding to '上' (the G0 column represents GBK) is 494f, matching our result (494f is exactly c9cf with the two flag bits stripped: 0xc9 & 0x7f = 0x49, 0xcf & 0x7f = 0x4f), and the corresponding Unicode encoding is 4e0a, so we can go gbk-->decode-->unicode

    # coding:gbk
    x = '上'.decode('gbk')
    y = '下'.decode('gbk')
    print([x, y])   # [u'\u4e0a', u'\u4e0b']

The unicode type

When the Python interpreter executes code that produces a string (for example s=u'林'), it requests new memory and then stores '林' there in Unicode format, so s can only be encoded and cannot be decoded.

    # coding:gbk
    x = u'上'   # equivalent to x = '上'.decode('gbk')
    y = u'下'   # equivalent to y = '下'.decode('gbk')
    print([x, y])   # [u'\u4e0a', u'\u4e0b']
    print(type(x), type(y))   # (<type 'unicode'>, <type 'unicode'>)

Print to Terminal

Something special must be said about print:

When the program executes, e.g.

x = '上'   # under gbk, the string is stored as \xc9\xcf

print(x)   # this step takes the new memory space that x points to (not the code's own memory space) and prints its content to the terminal. Supposedly, whatever is stored should be what gets printed, but printing \xc9\xcf would immediately confuse programmers unfamiliar with encodings. So Guido took the liberty of making print(x) use the terminal's encoding format to convert the in-memory \xc9\xcf into characters for display; this requires the terminal's encoding to be gbk, otherwise the original content will not display correctly.

For data already in Unicode format, no matter how it is printed, it will not be garbled.

Since Unicode is so good, never garbled, why is python2 still so awkward, with a separate str type? When Python was born, Unicode was not as popular as it is today. Clearly, the good thing you can see, Guido saw too: in python3, str is directly Unicode, so when we define a str there is no need to add the u prefix; it is already unicode. Pretty neat, right?

4.2.2 In python3 there are two string types: str and bytes

str is unicode

    # coding:gbk
    x = '上'   # when the program executes, no u prefix is needed; '上' is still stored in the new memory space in Unicode form
    print(type(x))   # <class 'str'>
    # x can be encoded directly into any encoding format
    print(x.encode('gbk'))   # b'\xc9\xcf'
    print(type(x.encode('gbk')))   # <class 'bytes'>

It is important to notice that in python3 the result of x.encode('gbk') is \xc9\xcf, which is exactly the value of the str type in python2; yet in python3 it is of type bytes, while in python2 it is of type str.

So I made a bold speculation: the str type of python2 is the bytes type of python3. And indeed, looking at the source of python2's str(), I found:

