Solving Chinese-text problems in Python (a summary of the experience of many predecessors; a must-read for beginners)

Source: Internet
Author: User
First, let me explain how I ran into the problem of Chinese input in Python. I wrote a small tool for looking up Python library functions. Because Python has built-in documentation, you can use the help function to look up the usage notes for each system function; in general, the key usage points and caveats are clearly stated there. I searched the Internet for a Chinese version of these explanations but found none, so I decided to learn from the English version of the system documentation.
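The lookup tool described above can be sketched roughly as follows. This is a minimal sketch that assumes Python 3 (the article itself uses Python 2.5); the names `doc_cache` and `lookup` are hypothetical and are not taken from the author's tool.

```python
import inspect
import pydoc

# Hypothetical in-memory store: keyword -> documentation text.
doc_cache = {}

def lookup(keyword):
    """Return the docstring for a name such as 'len' or 'os.path.join'."""
    if keyword in doc_cache:
        return doc_cache[keyword]
    # pydoc.locate resolves builtins and importable dotted paths.
    obj = pydoc.locate(keyword)
    if obj is None:
        return None  # nothing found; an entry could be added by hand instead
    info = inspect.getdoc(obj) or ""
    doc_cache[keyword] = info
    return info

print(lookup("len"))
```

A second call with the same keyword is answered from `doc_cache` without querying the documentation again.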

If you want to program with Tkinter or wxPython and want to learn the usage and attributes of the common widgets, but your English is not very good, I recommend the book "Python and Tkinter Programming". Its Appendix B and Appendix C (pages 7 to 392) introduce the common functions and almost all of the attributes; these highlights should not be missed.

The tool I mentioned above was soon finished. It looks up functions that have not been queried before and saves the keyword (key) and the query result (info), so that next time they can be viewed directly from the list; if a lookup finds nothing, the entry can be added to the list manually. It is a simple tool, and everything looked fine. But a problem arose: when reading the English info, I did not know the meaning of some words, so after looking them up I wanted to write the translations into info and save it, so that it could be opened directly from the hard disk next time. However, once Chinese characters are entered into the English info, saving fails while the Chinese characters are being encoded, and the following error appears:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u6211' in position 61: ordinal not in range(128)

Position 61 is not fixed; it is simply where the Chinese character was inserted into info. The error occurs consistently, namely whenever I try to write the modified info to a file:

The code is as follows:


fp = open('tt.txt', 'w')
fp.write(info.encode("UTF-8"))  # the error is raised here
fp.close()


These three lines look correct, yet the middle line raises the error. Is the encode call wrong? I tried many encodings, such as ANSI, UTF-8, SHIFT_JIS, GB2312 and GBK, and none of them worked, which left me confused.
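The position reported in the message is the index of the first character that the ASCII codec cannot represent. The following minimal sketch assumes Python 3, where the conversion has to be requested explicitly with `encode("ascii")`; in Python 2 the same conversion can happen implicitly. The 61 filler characters are made up so that the position matches the message above.

```python
# 61 ASCII characters followed by one Chinese character.
info = "x" * 61 + "\u6211"  # "\u6211" is the character named in the error message

try:
    info.encode("ascii")
except UnicodeEncodeError as exc:
    error_position = exc.start  # index of the first unencodable character
    error_codec = exc.encoding  # name of the codec that failed
    print(exc)

# Encoding the same text as UTF-8 succeeds.
utf8_bytes = info.encode("utf-8")
```

Moving the Chinese character to another index changes the reported position accordingly.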

Now I know what went wrong. The problem lies in the modified info string. Its data is a mix of the text I got from the system help function (the original, purely English info) and the Chinese characters I typed in by hand. When I first queried the documentation from the system, I saved the original info like this:

The code is as follows:


fp = open('tt.txt', 'w')
fp.write(info)
fp.close()


Note that this is where the mistake lies: the original info is written directly to the file. Do you know what encoding the file ends up with when written this way? Open tt.txt and check: it is ANSI. That is how the error arises.

Therefore, the conclusion is that while the data stays in memory you do not need to worry about the encoding, because it is handled automatically. However, if you want to use Chinese characters and temporarily save data or strings in a file, you must write them as UTF-8 from the very first write, like this:

The code is as follows:


fp = open('tt.txt', 'w')
fp.write(info.encode("UTF-8"))
fp.close()


This ensures that after the next read you can print and display the text directly without converting the encoding, and even using it as the text of a control causes no problem. Keep this in mind.
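The write-then-read cycle can be sketched as follows. This is a minimal sketch assuming Python 3, with the file opened in binary mode so that the explicit `encode` call from the article stays visible; the temporary directory and the sample text are hypothetical stand-ins.

```python
import os
import tempfile

# Hypothetical mixed content: English help text plus a Chinese note.
info = "help text in English, plus a note: \u6211"

# Hypothetical scratch location; the article writes 'tt.txt' in the working directory.
path = os.path.join(tempfile.mkdtemp(), "tt.txt")

# Write bytes explicitly encoded as UTF-8, from the very first write.
with open(path, "wb") as fp:
    fp.write(info.encode("utf-8"))

# Read the bytes back and decode them with the same encoding.
with open(path, "rb") as fp:
    restored = fp.read().decode("utf-8")

print(restored == info)
```

Using the same encoding on both sides is what makes the round trip lossless.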

Now that the cause has been found, let's discuss it further.

Some people say: isn't it enough to use # -*- coding: UTF-8 -*-? Actually, it is not.

According to my tests (I use IDLE, the Python 2.5.4 GUI): [1] whether or not the file starts with # -*- coding: UTF-8 -*-, and whether or not UTF-8 is set as the default encoding in the software, Chinese text passes between controls and files without problems. [2] Assignments such as info = 'Chinese' are accepted, and the data can be read back in the normal way. I think the reason is that newer versions have fixed the display and handling of Chinese, so the early situation in which Chinese could not be used no longer exists.

The code is as follows:


# coding=UTF-8
try:
    JAP = open("jap.txt", "r")
    CHN = open("chn.txt", "r")
    UTF = open("utf.txt", "w")

    jap_text = JAP.readline()
    chn_text = CHN.readline()
    # Decode to a unicode object first, then encode as UTF-8
    jap_text_utf8 = jap_text.decode("SHIFT_JIS").encode("UTF-8")
    # Not converting to UTF-8 is also possible.
    chn_text_utf8 = chn_text.decode("GB2312").encode("UTF-8")
    # Encoding names are case-insensitive; the same is true for UTF-8.
    UTF.write(jap_text_utf8)
    UTF.write(chn_text_utf8)
    UTF.close()
except IOError, e:
    print "open file error", e


This code comes from the article on processing Chinese in Python at http://www.jb51.net/article/26542.htm. Note that jap_text_utf8 and chn_text_utf8 above must both end up in the same encoding, whether that is the machine's default encoding or UTF-8; the most important thing is consistency. Once everything is unified as UTF-8, you can write it to a file and read it back. To read the data, use the usual method:

The code is as follows:


filen = open('tt.txt')
info = filen.read()
print info
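The decode-then-encode step of the conversion script shown earlier can be tried without any input files. This is a minimal sketch assuming Python 3; the sample strings are hypothetical stand-ins for the contents of jap.txt and chn.txt.

```python
import codecs

japanese = "\u65e5\u672c\u8a9e"  # sample Japanese text
chinese = "\u4e2d\u6587"         # sample Chinese text

# Stand-ins for the raw bytes that would be read from jap.txt and chn.txt.
jap_bytes = japanese.encode("shift_jis")
chn_bytes = chinese.encode("gb2312")

# Decode from the source encoding to text, then encode as UTF-8.
jap_utf8 = jap_bytes.decode("SHIFT_JIS").encode("UTF-8")
chn_utf8 = chn_bytes.decode("GB2312").encode("UTF-8")

# Codec names are matched case-insensitively.
same_codec = codecs.lookup("UTF-8").name == codecs.lookup("utf-8").name
```

After this step both byte strings share one encoding, so they can be written to the same file and read back with a single `decode("utf-8")`.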


In addition, some people use the following method for encoding conversion:

The code is as follows:


import sys
reload(sys)
sys.setdefaultencoding('utf8')

def ConvertCN(s):
    return s.encode('gb18030')

def PrintFile(filename):
    f = file(filename, 'r')
    for f_line in f.readlines():
        print ConvertCN(f_line)
    f.close()

if __name__ == "__main__":
    PrintFile('1.txt')
    print ConvertCN("\n****** press any key to exit! ******")
    print sys.stdin.readline()


According to my tests, this method does not work. If the second line (reload(sys)) is removed, the setdefaultencoding call on the third line does not work; if the second line is kept, the code from the third line onward is not executed (although no error is reported). If this method works for you, feel free to use it.
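One alternative that avoids changing the interpreter-wide default is to convert explicitly at the point where the text leaves the program. The sketch below is a different technique from the script above, not a fix for it, and it assumes Python 3 (where `sys.setdefaultencoding` does not exist); `convert_cn` is a hypothetical counterpart to ConvertCN.

```python
def convert_cn(text):
    """Encode text as GB18030 bytes, e.g. for output that expects that encoding."""
    return text.encode("gb18030")

line = "\u4e2d\u6587 test"  # hypothetical line containing Chinese characters
gb_bytes = convert_cn(line)

# Decoding with the same codec recovers the original text.
round_trip = gb_bytes.decode("gb18030")
```

Because the codec is named at each call, the result does not depend on any global setting.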
In addition, here is a deeper look at garbled Chinese text in Python and the principle of text encoding. A marker is added at the beginning of the text to indicate the encoding used, and the interpreter then translates the bytes, in fixed-size or variable-size steps according to the corresponding rules, to recover the original text; the step size and the rules are exactly what the marker at the beginning declares. Therefore, even if your text body uses a single-byte encoding, you can add a suitable marker at the beginning to tell others how to decode it. The markers at the start of the text, such as BOM_UTF_8 and the similar BOM_UTF_16, are also interesting: different encodings use different markers, so they are worth paying attention to.
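How such a marker can be used to choose a decoder is sketched below. This is a minimal sketch assuming Python 3 and the BOM constants of the standard `codecs` module (`codecs.BOM_UTF8`, `codecs.BOM_UTF16_LE`, `codecs.BOM_UTF16_BE`); the helper `sniff_bom` is hypothetical, and UTF-32 markers are deliberately left out.

```python
import codecs

# (BOM bytes, codec name) pairs; the codecs named here strip the BOM on decode.
BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

def sniff_bom(data):
    """Return a codec name if data starts with a known BOM, else None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None

# Hypothetical sample: UTF-8 bytes preceded by the UTF-8 BOM.
sample = codecs.BOM_UTF8 + "\u4e2d\u6587".encode("utf-8")
codec = sniff_bom(sample)
text = sample.decode(codec)
```

When no marker is present, `sniff_bom` returns None and the encoding has to be known by other means.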
