Break up, speak Chinese code

Last Update:2015-04-23 Source: Internet

Author: User

Computers store all data as 0s and 1s. Two main kinds of data are stored: numbers and characters (operators and the like are not discussed for now). Storing numbers is relatively simple and causes no trouble; this article is about how characters are stored.

1 A brief history of encoding methods

1.1 ASCII

The Americans who first built computers only needed uppercase and lowercase letters, digits, and some punctuation, so they used a simple encoding, ASCII. ASCII defines 128 characters, numbered 0-127, and each one is stored in 8 bits, that is, 1 byte (a byte can hold the numbers 0-255). What is actually stored is this group of 8 bits. When software that uses ASCII is told to display the 8 bits as a character, it shows a letter. The bits are exactly the same as those of a stored number, so when you tell the software to display them as a number, it shows the number.
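The point that the same 8 bits can be shown either as a number or as a letter can be sketched as follows. This is a minimal illustration, assuming Python 3 (where encoded data is `bytes` and text is `str`), not the Python 2 used later in this article; the variable names are only for illustration.

```python
# Assumption: Python 3. One byte, two ways of displaying it.
bits = "01000001"

number = int(bits, 2)                      # read the 8 bits as a number
letter = bytes([number]).decode("ascii")   # read the same byte as an ASCII character

print(number)  # 65
print(letter)  # A
```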

1.2 GBK

But for Chinese, the 256 values of a single byte are obviously not enough to hold all Chinese characters, so a different encoding was created for the language: GBK (first GB2312, later extended to GBK), in which two bytes correspond to one Chinese character. That is, when software using GBK is told to display these bits as characters, it looks up the corresponding Chinese characters in the GBK table and displays them (this process is discussed in detail later).
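As a small check of the two-bytes-per-character claim, the sketch below encodes one Chinese character with GBK and decodes it again. It assumes Python 3 and its built-in `gbk` codec; the variable names are illustrative.

```python
# Assumption: Python 3, built-in "gbk" codec.
text = "中"

gbk_bytes = text.encode("gbk")       # character -> two bytes from the GBK table
decoded = gbk_bytes.decode("gbk")    # two bytes -> character again

print(len(gbk_bytes), gbk_bytes.hex())  # 2 d6d0
print(decoded)
```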

1.3 Unicode

But there are hundreds of languages in the world. Japan encoded Japanese in Shift_JIS, South Korea encoded Korean in EUC-KR, and every country had its own national standard. Conflicts were inevitable, and text that mixed several languages displayed as garbage. Unicode came into being to solve this. Unicode unifies all languages in a single character set, so the conflicts disappear. In its common two-byte form, a Unicode character takes two bytes (four bytes for very rare characters). Modern operating systems and most programming languages support Unicode directly. Note, however, that the default encoding of the Chinese edition of Windows 7 is GBK (at least the characters in a TXT file are GBK-encoded; the kernel may use Unicode internally, or perhaps only the input and output windows use GBK; this is set aside for now). Converting ASCII to this form is simple: pad the ASCII code with zeros in front, so the Unicode encoding of A is 00000000 01000001. That is, when software using Unicode is told to display these bits as characters, it looks up the corresponding characters in the Unicode table and displays them.
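The zero-padding of A can be reproduced with a short sketch. It assumes Python 3 and uses the `utf-16-be` codec as a stand-in for the "two bytes per character" form described above; that choice of codec is an assumption of this example.

```python
# Assumption: Python 3; "utf-16-be" stands in for the two-byte Unicode form.
code_point = ord("A")                       # Unicode number of the character A
bits16 = format(code_point, "016b")         # same number written as 16 bits
grouped = bits16[:8] + " " + bits16[8:]

utf16 = "A".encode("utf-16-be")             # two bytes: a zero byte, then the ASCII byte
zh_code_point = hex(ord("中"))              # a Chinese character has a Unicode number too

print(grouped)        # 00000000 01000001
print(utf16)          # b'\x00A'
print(zh_code_point)  # 0x4e2d
```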

1.4 UTF-8

But a new problem arises. With everything unified under Unicode the garbled text disappears, yet if your text is almost entirely English, the two-byte Unicode form needs twice as much space as ASCII, which is wasteful for storage and transmission. So, in the spirit of economy, the variable-length UTF-8 encoding of Unicode appeared. UTF-8 encodes a Unicode character in 1 to 4 bytes depending on how large its number is (the original design allowed up to 6 bytes): common English letters take 1 byte, Chinese characters usually take 3 bytes (actually more than GBK), and only very rare characters take 4 bytes. If the text you want to transmit contains a large number of English characters, encoding it with UTF-8 saves space. An additional benefit of UTF-8 is that ASCII can be seen as a subset of it, so a large amount of legacy software that only supports ASCII can continue to work with UTF-8 data. That is, when software using UTF-8 is told to display these bits as characters, it looks up the corresponding characters in the UTF-8 table and displays them.
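The byte counts mentioned above can be checked with a few lines. This is a sketch assuming Python 3; the sample characters (an English letter, a common Chinese character, and an emoji written as an escape) are chosen only for illustration.

```python
# Assumption: Python 3. "\U0001F600" is an emoji outside the two-byte range.
samples = ["A", "中", "\U0001F600"]

utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]
gbk_length = len("中".encode("gbk"))

# ASCII text produces the same bytes under ASCII and UTF-8.
same_for_ascii = "abc".encode("ascii") == "abc".encode("utf-8")

print(utf8_lengths)    # [1, 3, 4]
print(gbk_length)      # 2
print(same_for_ascii)  # True
```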

It should be emphasized that, to be compatible with all languages while staying convenient for programs that mostly handle English, UTF-8 is the universal encoding in the programming world. Files are generally stored in it, and Linux systems use UTF-8. Unfortunately, the Chinese Windows 7 we use now defaults to GBK, which means our road to programming has more bends in it than other people's.

2 Encoding, decoding, and transcoding

2.1 Encoding and decoding

Everything above is about encoding. Encoding means (1) establishing a one-to-one correspondence between characters and bit patterns, and (2) storing characters as those bit patterns. Decoding was also mentioned: it turns the bit patterns back into characters (what happens in the process is explained below).

All encoding, decoding, and transcoding is done by software, and the correspondence tables belong to the software itself, so which encodings are supported and which one is the default differ from program to program. The examples below come from writing a Python program on Win7.

2.2 Transcoding between different encodings

English letters and digits can be encoded directly in ASCII, GBK, Unicode, or UTF-8, and their ASCII encoding is identical to their UTF-8 encoding, so in effect there are three encodings to convert between. Which conversions are possible? The accompanying picture makes this simple: the four double-headed arrows on the left stand for encoding and decoding between characters and each encoding, and the two double-headed arrows on the right show that Unicode can be transcoded in both directions to and from GBK and UTF-8, while GBK and UTF-8 cannot be transcoded into each other directly.

For the four transcoding paths, Python calls going from UTF-8 or GBK to Unicode decoding, and going from Unicode to UTF-8 or GBK encoding.

Chinese characters can be encoded directly in GBK, Unicode, or UTF-8, but not in ASCII. Apart from the first double-headed arrow on the left, the process is exactly the same.
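The rule that GBK and UTF-8 are converted through Unicode, never directly, can be sketched like this. The example assumes Python 3, where the Unicode form is `str` and encoded data is `bytes`; the helper name `transcode` is hypothetical.

```python
# Assumption: Python 3. transcode() is a hypothetical helper for illustration.
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Convert encoded bytes from one encoding to another via Unicode."""
    text = data.decode(source)     # source encoding -> Unicode (decoding)
    return text.encode(target)     # Unicode -> target encoding (encoding)

gbk_data = "中文".encode("gbk")
utf8_data = transcode(gbk_data, "gbk", "utf-8")
back_again = transcode(utf8_data, "utf-8", "gbk")

print(gbk_data.hex())   # d6d0cec4
print(utf8_data.hex())  # e4b8ade69687
```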

3 Writing Python programs on Win7

This article grew out of writing a Python program that uses the wxPython library, so what follows explains some of the problems encountered along the way.

3.1 Python's support for encodings

Because Python was born earlier than the Unicode standard, the earliest Python supported only ASCII: an ordinary string 'ABC' is ASCII-encoded inside Python and is declared directly with '...'. Python later added Unicode support, and a Unicode-encoded string is declared with u'...'. (Ordinary decoded output is done directly with the print statement.)

Python actually has two string types, str and unicode, both derived from basestring. A str is encoded exactly as the source file is (the .py file itself must be encoded somehow). The default is standard ASCII, which you can change by writing a declaration on the first line: # coding: utf-8 or # coding: gbk. Only one of the two can be written, it cannot be set to Unicode, and for ASCII nothing needs to be written. It is worth mentioning that, to respect the habits of other tools, this declaration is recognized by a regular expression, so you will see it written in many other forms. The unicode type is naturally Unicode-encoded. In this way Python supports four encodings.

In addition, because a .py source file is treated as ASCII by default, adding a Chinese comment or storing Chinese characters in a variable will fail, so in general you change the source encoding on the first line, preferably to UTF-8.
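Why the declaration line can be written in several forms becomes clearer with the pattern itself. The sketch below uses the regular expression given in PEP 263 to pick the encoding name out of a comment line; the helper name `declared_encoding` is hypothetical, and the real interpreter does more than this (for example, it only looks at the first two lines).

```python
import re

# Pattern taken from PEP 263; declared_encoding() is a hypothetical helper.
CODING_RE = re.compile(r"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")

def declared_encoding(line):
    """Return the encoding named in a declaration line, or None."""
    match = CODING_RE.match(line)
    return match.group(1) if match else None

for line in ("# coding: utf-8", "# -*- coding: gbk -*-", "#coding=utf-8", "x = 1"):
    print(repr(line), "->", declared_encoding(line))
```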

3.2 Encoding and decoding when writing Python programs on Win7

The following describes the different encoding and decoding processes in detail. Encoding and decoding happen wherever there is input or output in the broad sense. Four kinds of input are covered: first, writing the text directly in the code; second, typing it in the console window (standard input); third, reading it from a Win7 TXT document (reading from disk); fourth, taking it from a text box in the window interface I am building with wxPython. Three kinds of output are covered: first, the console output window (standard output); second, writing to a Win7 TXT document (writing to disk); third, a text box in the wxPython window interface.

3.2.1 Code input, console input (standard input), and console output (standard output)

(1) Code input

When text is written directly in the code, we first create a variable, then define its data type (so that space can be set aside in memory for its data), and finally assign the string to it (in a program these three steps are of course done at once), for example s = 'ABC'. Python is dynamically typed, so the type is not declared beforehand; the quotation marks alone tell Python that s is of type str, and that the string assigned to s is encoded with the source encoding (ASCII by default). So this statement makes Python do several things (the encoding process):

1 Look up the current source encoding

2 Allocate a str-sized space in memory according to the source encoding, and encode the characters ABC into a bit pattern using the source encoding

3 Put the bit pattern into the allocated space

4 Point the variable s at the address of that space

(2) Console input (standard input)

Standard input is most often done with the built-in raw_input() function, which reads all the characters you type in the console and assigns them to the given variable as a str, that is, encoded with the source encoding (strictly, the bytes arrive in the console's encoding, which is identical for plain English input). The code is as follows:

s = raw_input("The words that'll show to you")

When this line runs, the console window prompts you for input. After you type the characters ABC and press Enter, they are assigned to the variable s, and what happens next is the same as executing s = 'ABC'.

(3) Console output (standard output)

When told to print s, Python outputs to the console window as follows (the decoding process):

1 Find the address that s points to and see that s is of type str

2 Look up the current source encoding

3 Take the bit pattern out of the str-sized space in memory

4 Decode the bit pattern with the current source encoding, that is, look up the corresponding characters in the encoding table

5 Display the characters in the output window: ABC

(4) Common errors

The three processes above are prone to the following errors:

1 Encoding Chinese characters with ASCII. This obviously fails, because ASCII does not contain Chinese characters.

2 Decoding with one encoding a bit pattern that was encoded with another, for example UTF-8 versus GBK. This obviously shows garbled text (the bytes are grouped into characters at the wrong boundaries, so content may be read that should not be; some game bugs are of a similar kind).

Ordinarily, when we write Chinese comments without declaring that the source file is UTF-8, error 1 occurs at compile time: before compiling, the whole program is encoded for saving, and that is where it goes wrong. Assigning a Chinese string to a variable fails in the same way.

Once the first line declares the source file as UTF-8, Chinese comments are no problem, assigning a Chinese string to a variable is no problem (UTF-8 encoding), and printing the variable is no problem (UTF-8 decoding). But if you then read GBK-encoded characters from an external file into a variable and print it, no error is raised, yet the output is garbled. To display it correctly you need to transcode.
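Both errors can be reproduced in a few lines. This is a sketch assuming Python 3, where the mismatch must be spelled out explicitly; in strict mode Python 3 raises an exception for the mismatched decode, so `errors="replace"` is used to show the garbled result instead.

```python
# Assumption: Python 3. Names are illustrative.
text = "中文"

# Error 1: ASCII has no Chinese characters, so encoding fails.
try:
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Error 2: bytes written as GBK but read as UTF-8.
gbk_bytes = text.encode("gbk")
try:
    gbk_bytes.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

garbled = gbk_bytes.decode("utf-8", errors="replace")  # undecodable bytes become U+FFFD
fixed = gbk_bytes.decode("gbk")                         # decoding with the right table

print(ascii_failed, strict_failed)
print(repr(garbled))
print(fixed)
```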

3.2.2 Reading data from a Win7 TXT document (reading from disk) and writing to a Win7 TXT document (writing to disk)

(1) Reading data from a Win7 TXT document (reading from disk)

Before explaining the encoding process, let us introduce simple file I/O in Python. Reading and writing files on disk is a service provided by the operating system, and modern operating systems do not allow ordinary programs to operate on the disk directly. To read or write a file, a program asks the operating system to open a file object (often called a file descriptor, usually identified by an address and a name), and then reads data from the file object (reading the file) or writes data to it (writing the file) through the interface the operating system provides. The corresponding code is as follows (shown on one line, separated by semicolons):

test1hand = open('Test1.txt', 'r'); test1 = test1hand.read(); test1hand.close()

The first statement uses open() to open Test1.txt in the current directory and returns the interface, or handle, of the file object (in effect a permission to use it; the argument 'r' requests read permission), which is assigned to the variable test1hand.

The second statement uses the handle's .read() method to read the entire contents of the file at once. Note that what is read is a bit pattern. (Strictly, the mode for reading raw bytes is 'rb'. On Unix-like systems a text file is just a binary file, so the b is optional. Windows is not Unix-like, so it seemed odd that the b was not needed here; the difference on Windows is mainly line-ending translation, which does not matter for this example.) Which encoding those bits are in is determined by the file itself; read() simply reads the bits and returns all of them, here assigning them to test1. Note again that the value assigned is a str, so test1 is of type str and the system assumes the bits are encoded with the source encoding. If the file is GBK-encoded but the source is UTF-8, that assumption is wrong. No error is raised at this point, and none is raised when decoding either, but you will see the mistake: if you print directly, the bits are decoded with the encoding the system assumes and the output is garbled. To display it correctly you need to transcode. (Note: a .read() call consumes the whole file. If it is called again between opening and closing the file, the later call finds nothing left to read and returns the empty string ''.)

The third statement calls the close() method to close the file. A file must be closed after use because file objects consume operating system resources, and the number of files the operating system can have open at the same time is limited.

So no encoding or decoding happens in this process; it operates purely on bit patterns.

Besides .read(), the commonly used methods for reading a file are .readline() and .readlines(), and there is even a dedicated linecache library. Only .readlines() is described below; the others are left for another time.

The .readlines() method is essentially the same as .read(), except that the bit pattern it returns is split into segments at line breaks and returned as a list of str, that is, the file is read line by line. Each string in the list behaves exactly as described above.
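The read described above, including the empty result of a second .read(), can be sketched as follows. The example assumes Python 3, opens the file in 'rb' mode so that the raw bytes are visible, and first creates a small GBK-encoded Test1.txt in a temporary directory so that it is self-contained; the temporary path is an assumption of this example.

```python
import os
import tempfile

# Assumption: Python 3; the file is created here only so the sketch can run.
path = os.path.join(tempfile.mkdtemp(), "Test1.txt")
with open(path, "wb") as handle:
    handle.write("中文".encode("gbk"))

with open(path, "rb") as handle:
    raw = handle.read()     # the raw bytes, no decoding involved
    rest = handle.read()    # nothing left: the first read consumed the file

text = raw.decode("gbk")    # decoding is a separate, explicit step

print(raw.hex())  # d6d0cec4
print(rest)       # b''
print(text)
```

Using `with` closes the file automatically, which covers the close() step.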
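A line-by-line variant can be sketched in the same way. It assumes Python 3 and a small two-line GBK file created on the spot; each list element is a separate run of bytes that still has to be decoded.

```python
import os
import tempfile

# Assumption: Python 3; the two-line file exists only for this sketch.
path = os.path.join(tempfile.mkdtemp(), "Test1.txt")
with open(path, "wb") as handle:
    handle.write("第一行\n第二行\n".encode("gbk"))

with open(path, "rb") as handle:
    raw_lines = handle.readlines()   # list of bytes objects, split at line breaks

lines = [raw.decode("gbk").rstrip("\n") for raw in raw_lines]

print(len(raw_lines))  # 2
print(lines)
```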

(2) Writing to a Win7 TXT document (writing to disk)

Writing data to a TXT document is similar to reading. The code is as follows (shown on one line, separated by semicolons); the argument to write() is either a literal '...' or a variable name:

test1hand = open('Test1.txt', 'w'); test1hand.write('...'); test1hand.close()

This process is either an encoding process or a pure bit-pattern operation with no encoding or decoding at all. If the argument to test1hand.write() is a literal '...', its content is encoded with the source encoding and the resulting bit pattern is written to the file. If it is a variable name, the bit pattern the variable already holds is written to the file directly.

There is one small anomaly here that I cannot fully explain. In theory .write() writes the bit pattern straight to the file, and when the text file is later opened, the operating system's default decoding is applied for display. So on my Win7 system, every time a text file is opened its bits are decoded as GBK: if they were encoded with GBK the characters display correctly, and if they were encoded with UTF-8 or Unicode they should display as garbage. This theory holds for GBK and UTF-8 data, but with a Unicode string the write itself fails: UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128). (The message suggests that Python first tries to convert the Unicode string to str with the ASCII codec before writing, which fails for Chinese characters.)
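The same kind of failure can be provoked deliberately, which may help in reading the error message. The sketch below assumes Python 3 and forces the ASCII codec with the `encoding` argument of `open()`; this is an analogue of the Python 2 behaviour described above, not a reproduction of it. Encoding explicitly before writing avoids the problem.

```python
import os
import tempfile

# Assumption: Python 3; encoding="ascii" is chosen on purpose to trigger the error.
path = os.path.join(tempfile.mkdtemp(), "Test1.txt")
text = "中文"

try:
    with open(path, "w", encoding="ascii") as handle:
        handle.write(text)           # Unicode text cannot pass through the ASCII codec
    message = None
except UnicodeEncodeError as exc:
    message = str(exc)

# Encoding explicitly and writing bytes works.
with open(path, "wb") as handle:
    handle.write(text.encode("gbk"))

with open(path, "rb") as handle:
    written = handle.read()

print(message)
print(written.hex())  # d6d0cec4
```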

3.2.3 A text box in a simple window interface generated by wxPython

A text box in the simple wxPython-generated window can show output, and its contents can be taken as input. Example code:

text_box_name.SetValue(string_or_string_variable_name)

string_variable_name = text_box_name.GetValue()

The first statement passes the data in the parentheses on the right to the text box on the left, which decodes and displays it. The second statement passes the data in the text box on the right into the string variable on the left, where it is stored.

This process is not entirely clear to me, because it involves several steps: a SetValue step, a decoding step inside the text box, and a GetValue step. From testing I know the following cases, each followed by my guess at what happens:

(1) GBK-encoded data is passed into the text box through SetValue; the text box decodes and displays it correctly; GetValue returns the text box contents as Unicode-encoded data.

Guess at the process: the GBK-encoded data enters the text box through SetValue, the text box decodes it as GBK and displays it, and GetValue takes the characters in the text box, encodes them as Unicode, and returns them.

(2) Unicode-encoded data is passed into the text box through SetValue; the text box decodes and displays it correctly; GetValue returns the text box contents as Unicode-encoded data.

Guess at the process: the Unicode-encoded data enters the text box through SetValue, the text box decodes it as Unicode and displays it, and GetValue takes the characters in the text box, encodes them as Unicode, and returns them.

(3) UTF-8-encoded data is passed into the text box through SetValue; the text box cannot decode it and raises an error directly instead of displaying garbled text.

In summary, data in any encoding can be passed in, but the text box can only decode GBK and Unicode, and what comes back is always the text box's characters encoded as Unicode. So if you pass in GBK, it displays correctly and comes back as Unicode.
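One way to sidestep case (3) is to convert everything to Unicode before it reaches the text box. The sketch below is only a possible approach, assuming Python 3; the helper `to_text` and the order in which encodings are tried are hypothetical, and no wxPython call is made so that it runs on its own. A caveat: some GBK byte sequences also happen to be valid UTF-8, so trying UTF-8 first can occasionally guess wrong.

```python
# Assumption: Python 3. to_text() is a hypothetical helper, not part of wxPython.
def to_text(data, encodings=("utf-8", "gbk")):
    """Return Unicode text for data that may already be text or encoded bytes."""
    if isinstance(data, str):
        return data                       # already Unicode, nothing to do
    for encoding in encodings:
        try:
            return data.decode(encoding)  # first encoding that decodes cleanly wins
        except UnicodeDecodeError:
            continue
    raise ValueError("could not decode data with: " + ", ".join(encodings))

from_utf8 = to_text("中文".encode("utf-8"))
from_gbk = to_text("中文".encode("gbk"))
unchanged = to_text("中文")

print(from_utf8, from_gbk, unchanged)
```

With a real text box (here a hypothetical `text_box_name`), the call would then be `text_box_name.SetValue(to_text(data))`.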

Note: while writing this article I consulted a large number of online resources and books; thanks to the programmers who shared them so selflessly. On that basis I analyzed and tested things according to my own understanding and reorganized them in more detail, in the hope of solving practical problems for everyone. Along the way I ran into a few things I could observe but not explain; these are marked in the article. I have also only been learning to program for a short time and my knowledge is limited, so the article is bound to contain oversights; corrections are welcome. In addition, this blog will try to publish one original technical article a week. Please follow along; I look forward to exchanging ideas.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
