This article is reprinted; please credit the source when reposting.
Source: http://blog.csdn.net/mayflowers/article/details/1568852
Thanks to @Mayflowers.
1. Using Chinese in Python
Python 2 has two string types: str and unicode. Be careful to distinguish a "Unicode string" from a "Unicode object": in this article, "Unicode string" always refers to Python's unicode object. Strictly speaking, Python has no separate "Unicode string" type at all, only the unicode object. The bytes of a traditionally encoded string can be held in a str object, but at that point they are just a byte stream; they have no textual meaning until they are decoded into a unicode object. Throughout this article we test with the string "哈哈" (haha). The single character 哈 has a different representation in each encoding:

1. Unicode code point U+54C8 (UTF-16-LE bytes: C8 54)
2. UTF-8: E5 93 88
3. GBK: B9 FE
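These byte values can be verified directly. The sketch below runs under Python 3 (where str is always Unicode and bytes plays the role of Python 2's str); the escape u'\u54c8' is the character 哈:

```python
ha = u'\u54c8'  # the character 哈 as a unicode object

# One code point, three different byte representations:
assert ha.encode('utf-16-le') == b'\xc8\x54'      # UTF-16 little-endian
assert ha.encode('utf-8') == b'\xe5\x93\x88'      # UTF-8
assert ha.encode('gbk') == b'\xb9\xfe'            # GBK
```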
1.1 Windows Console

On the Windows Console, the Chinese character encoding is GBK rather than UTF-16: typing "哈哈" at the interpreter prompt produces a GBK-encoded str object s. Decoding s with decode('gbk') yields the expected unicode object.

Note that being able to print a unicode object ss on the console does not mean it can be serialized directly; writing ss straight to a file throws the same exception. When handling Chinese unicode strings, always call encode first to convert them to a concrete byte encoding before output. This holds in every environment.

Summary: in Python 2, a str object is simply a byte array. Whether its content is a legal string, and which encoding it uses (GBK, UTF-8, or raw Unicode bytes), is not something the interpreter tracks; the user must record and keep track of it. A similar caveat applies to unicode objects: the content of a unicode object is not necessarily a valid Unicode string, as we will see shortly. In short, the Windows Console supports GBK-encoded str objects and unicode objects.
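The console round trip described above can be sketched with explicit byte literals, so it also runs under Python 3 (where bytes stands in for Python 2's str):

```python
s = b'\xb9\xfe\xb9\xfe'   # "哈哈" as the GBK bytes the console produces
ss = s.decode('gbk')      # the corresponding unicode object
assert ss == u'\u54c8\u54c8'

# Serializing requires an explicit encode; you choose the target encoding.
assert ss.encode('utf-8') == b'\xe5\x93\x88\xe5\x93\x88'
assert ss.encode('gbk') == s
```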
1.2 Windows IDLE (running in the shell)

In the IDLE shell the behaviour differs from the Windows Console. For string literals without the u prefix, IDLE encodes the Chinese characters as GBK, just like the console. But for a Unicode literal such as ss = u"哈哈", IDLE still hands GBK bytes to the parser, so each GBK byte becomes a separate character of the resulting unicode object: len(ss) == 4. This creates a bizarre problem: ss cannot be displayed correctly in IDLE, and there is no way to convert it back to a sensible encoding. The cause is presumably that IDLE's localization does not support Chinese well enough. The recommendation is simple: do not write u"中文" literals in the IDLE shell, because the result is not what you want. In other words, the IDLE shell supports Chinese strings in only two forms: GBK-encoded str objects, and properly constructed unicode objects.
1.3 Running code files in IDLE

Running a file from IDLE gives different (better) results than typing into its shell. With a source file that declares its coding and contains the same statements, the output is flawless. I have not tried whether files in other encodings run correctly, but they should be fine. The same code was also tested on the Windows Console without any problem.
1.4 Windows Eclipse

Handling Chinese in Eclipse is harder, because in Eclipse the editor and the console that runs the code are different windows, and they can have different default encodings. Consider the following code:

#!/usr/bin/python
# -*- coding: utf-8 -*-
s = "哈哈"
ss = u'哈哈'
print repr(s)
print repr(ss)
print s.decode('utf-8').encode('gbk')
print ss.encode('gbk')
print s.decode('utf-8')
print ss

The first four prints run normally, and the last two throw an exception:

'\xe5\x93\x88\xe5\x93\x88'
u'\u54c8\u54c8'
哈哈
哈哈
Traceback (most recent call last):
  File "E:/workspace/Eclipse/TestPython/test/test_encoding_2.py", line 13, in <module>
    print s.decode('utf-8')
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)

That is, GBK-encoded str objects print normally, but unicode objects do not. Right-click the source file, choose "Run As" > "Run...", and look at the "Common" tab of the dialog: the default encoding of the Eclipse console is GBK, so Unicode is not supported. If you change the coding declaration in the file to GBK, you can directly print GBK-encoded str objects such as s. If you set the source file encoding to UTF-8 and also set the console encoding to UTF-8, printing should in theory work; in practice, printing a UTF-8-encoded str object garbles the last Chinese character. Still, that meets my needs; at least no exception is thrown.

BTW: the Eclipse version used here is 3.2.1.
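The root cause of that traceback can be reproduced anywhere, not just in Eclipse: print falls back to the implicit ascii codec, which cannot represent Chinese. A sketch (runs under Python 3 as well):

```python
text = u'\u54c8\u54c8'  # "哈哈" as a unicode object

# The implicit conversion the Eclipse console attempted:
failed = False
try:
    text.encode('ascii')
except UnicodeEncodeError:
    failed = True
assert failed  # 'ascii' codec can't encode characters

# The fix is an explicit encode matched to the console's encoding:
assert text.encode('gbk') == b'\xb9\xfe\xb9\xfe'            # for a GBK console
assert text.encode('utf-8') == b'\xe5\x93\x88\xe5\x93\x88'  # for a UTF-8 console
```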
1.5 Reading Chinese from a file

When you edit a file in Notepad on Windows and save it as "Unicode" (UTF-16-LE) or UTF-8, Notepad prepends a byte order mark: the two bytes \xff\xfe or the three bytes \xef\xbb\xbf, respectively. These extra bytes can cause problems when reading the file back, and different environments treat them differently. Taking the Windows Console as an example, save three versions of "哈哈" from Notepad, one per encoding.

Opening the UTF-8 file yields a UTF-8 byte string that can be decoded into a unicode object; however, the three added bytes decode into one extra Unicode character, U+FEFF. That character cannot be printed, so skip it before encoding the text for output.

Opening the "Unicode" file gives the correct string: decoding with UTF-16 consumes the extra bytes and yields the right unicode object, which can be used directly. Opening the ANSI file also works directly, since no marker bytes are added.

Conclusion: files written by Python itself round-trip without problems, but when processing a text file produced by Notepad that may be non-ANSI encoded, you must consider how to handle the marker bytes.
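Python's codecs already know about these marker bytes. A sketch of BOM-aware decoding, using only standard codec names:

```python
# A UTF-8 file saved by Notepad: BOM + "哈哈"
utf8_data = b'\xef\xbb\xbf\xe5\x93\x88\xe5\x93\x88'
assert utf8_data.decode('utf-8') == u'\ufeff\u54c8\u54c8'  # BOM survives as U+FEFF
assert utf8_data.decode('utf-8-sig') == u'\u54c8\u54c8'    # 'utf-8-sig' strips it

# A "Unicode" (UTF-16-LE) file saved by Notepad: BOM + "哈哈"
utf16_data = b'\xff\xfe\xc8\x54\xc8\x54'
assert utf16_data.decode('utf-16') == u'\u54c8\u54c8'      # 'utf-16' consumes the BOM
```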
1.6 use Chinese characters in the databaseI just got in touch with python. The database I use is MySQL. During the insert and search operations, if the character encoding used in the runtime environment is inconsistent with that in MySQL, it may cause runtime errors. Of course, as shown above, the running environment is not a key factor. The key is
Query statementEncoding method. If the query string is converted to the default character encoding of MySQL during each query operation, no problem will occur. But writing code in this way is too painful. Use the following code to connect to the database:
Self. Conn = mysqldb. Connect (use_unicode = 1, charset =
'Utf8',
** Server) What I cannot understand is that since the database uses the default encoding is UTF-8, I also use the UTF-8 when connecting, why is the obtained text content unicode encoded (UNICODE object )? Is this the setting of mysqldb library?
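MySQLdb is not part of the standard library, but the stdlib sqlite3 module shows the same pattern of the driver handing text back as decoded unicode objects, so the behaviour can be sketched without a MySQL server (the table and column names here are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE greetings (word TEXT)')
conn.execute('INSERT INTO greetings VALUES (?)', (u'\u54c8\u54c8',))

row = conn.execute('SELECT word FROM greetings').fetchone()
# The driver decodes for us: the result is a unicode text object, not bytes.
assert row[0] == u'\u54c8\u54c8'
```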
1.7 Using Chinese in XML

Similar to MySQLdb, calling the toxml method of xml.dom.minidom produces a unicode object. To output UTF-8 text instead, you can use either of the following methods:

1. Let the library encode. Passing an encoding when serializing the XML document is, I think, the best method:

xmldoc.toxml(encoding='utf-8')
xmldoc.writexml(outfile, encoding='utf-8')

2. Encode it yourself. Call toxml() and then encode the resulting document. The drawback is that this cannot produce a proper XML declaration (the encoding part in the first line of the XML document). Do not try to create the declaration with xmldoc.createProcessingInstruction: although the XML declaration looks like a processing instruction, it is not actually one. The following gets a satisfactory XML file:

print >> outfile, "<?xml version='1.0' encoding='utf-8'?>"
print >> outfile, xmldoc.toxml().encode('utf-8')[22:]

The slice in the second line strips the declaration that toxml() emits, <?xml version="1.0" ?>, whose length is 22.

Finally, as discussed above, do not assign attribute values with u'中文' literals in the IDLE shell; the resulting unicode string is wrong.
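Method 1 can be sketched end to end with the standard library (the element name root and the text 哈哈 are just placeholders):

```python
from xml.dom.minidom import Document

doc = Document()
root = doc.createElement('root')
root.appendChild(doc.createTextNode(u'\u54c8\u54c8'))  # the text 哈哈
doc.appendChild(root)

# Passing encoding= makes minidom emit UTF-8 bytes with a proper declaration.
out = doc.toxml(encoding='utf-8')
assert out.startswith(b'<?xml version="1.0" encoding="utf-8"?>')
assert b'\xe5\x93\x88\xe5\x93\x88' in out  # the UTF-8 bytes of 哈哈

# Without encoding=, toxml() returns a (unicode) text object instead.
assert doc.toxml().startswith(u'<?xml version="1.0" ?>')
```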