How to use Chinese in Python

Source: Internet
Author: User
First, check the Python version:
>>> import sys
>>> sys.version
'2.5.1 (r251:54863, Apr 2007, 08:51:08) [MSC v.1310 bit (Intel)]'

(i)
Create a file chinesetest.py with Notepad, saved in the default ANSI encoding:
s = "中文"
print s

Run it:
E:\project\python\test>python chinesetest.py
  File "chinesetest.py", line 1
SyntaxError: Non-ASCII character '\xd6' in file chinesetest.py on line 1, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details

Quietly change the file's encoding to UTF-8 and run it again:
E:\project\python\test>python chinesetest.py
  File "chinesetest.py", line 1
SyntaxError: Non-ASCII character '\xe4' in file chinesetest.py on line 1, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details

Still no luck...
Since the error message gives a URL, take a look at it. A quick read shows that if a source file contains non-ASCII characters, an encoding declaration must appear on the first or second line. Change the encoding of chinesetest.py back to ANSI and add the declaration:
# coding=gbk
s = "中文"
print s

Try again:
E:\project\python\test>python chinesetest.py
中文

It works now :)
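
The two bytes quoted in the error messages above ('\xd6' and '\xe4') can be read as the first byte of the same text under two different encodings. The following is a minimal sketch of that idea, written for Python 3 rather than the 2.5.1 interpreter used in this article, and assuming the string in chinesetest.py is the two-character text 中文 and that "ANSI" in Notepad means GBK:

```python
# Python 3 sketch; in Python 3, bytes plays the role of Python 2's str.
text = "中文"  # assumed content of the string literal

gbk_bytes = text.encode("gbk")     # bytes stored when the file is saved as ANSI (assumed GBK)
utf8_bytes = text.encode("utf-8")  # bytes stored when the file is saved as UTF-8

# The first byte of each form is what the SyntaxError reports.
print(repr(gbk_bytes[:1]))
print(repr(utf8_bytes[:1]))
```

Either way the interpreter sees a byte outside the ASCII range, which is why changing the file's encoding alone does not help and a declaration is needed.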
(ii)
Take a look at its length:
# coding=gbk
s = "中文"
print len(s)
Result: 4.
Here s is of type str, which is a byte string, so each Chinese character counts the same as two English characters and the length is 4.
Now write it like this:
# coding=gbk
s = "中文"
s1 = u"中文"
s2 = unicode(s, "gbk")  # if the encoding argument is omitted, Python decodes with its default, ASCII
s3 = s.decode("gbk")  # converting str to unicode is decoding; the unicode function works the same way
print len(s1)
print len(s2)
print len(s3)
Results:
2
2
2
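
The difference between the two lengths is the difference between counting bytes and counting characters. A small sketch in Python 3 (an assumption, since the article itself uses Python 2.5.1), where str is already Unicode and encoded data is bytes:

```python
# Python 3 sketch: len() on text counts characters, len() on bytes counts bytes.
text = "中文"

char_count = len(text)                       # characters
gbk_byte_count = len(text.encode("gbk"))     # two bytes per character in GBK
utf8_byte_count = len(text.encode("utf-8"))  # three bytes per character in UTF-8 for these two

print(char_count, gbk_byte_count, utf8_byte_count)  # prints: 2 4 6
```

So the length of an encoded string depends on the encoding, while the length of the decoded text does not.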
(iii)
Next, look at file handling.
Create a file test.txt, saved in ANSI format, with the content:
ABC中文
Read it with Python:
# coding=gbk
print open("test.txt").read()
Result: ABC中文
Change the file format to UTF-8:
Result: ABC followed by garbled characters (the UTF-8 bytes of 中文 displayed as GBK)
Clearly, the data needs to be decoded:
# coding=gbk
import codecs
print open("test.txt").read().decode("utf-8")
Result: ABC中文
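
The same experiment can be sketched in Python 3 (an assumption; the article uses Python 2.5.1). The temporary file path below is hypothetical and only stands in for test.txt:

```python
import os
import tempfile

text = "ABC中文"

# Hypothetical stand-in for test.txt, written as UTF-8 without a BOM.
path = os.path.join(tempfile.mkdtemp(), "test.txt")
with open(path, "wb") as f:
    f.write(text.encode("utf-8"))

with open(path, "rb") as f:
    raw = f.read()

# Decoding with the wrong codec gives garbled text; "replace" avoids an exception
# if some byte pair is not valid GBK.
garbled = raw.decode("gbk", "replace")
# Decoding with the codec the file was written in recovers the text.
correct = raw.decode("utf-8")

print(correct == text)  # prints: True
print(garbled == text)  # prints: False
```

The ASCII part survives either way because GBK and UTF-8 agree on ASCII bytes; only the Chinese characters are affected.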
The test.txt above was edited with EditPlus. When I instead saved it as UTF-8 with the Notepad editor that ships with Windows,
the script failed at run time:
Traceback (most recent call last):
  File "chinesetest.py", line 3, in <module>
    print open("test.txt").read().decode("utf-8")
UnicodeEncodeError: 'gbk' codec can't encode character u'\ufeff' in position 0: illegal multibyte

It turns out that some software, such as Notepad, inserts three invisible bytes (0xEF 0xBB 0xBF, the byte order mark, or BOM) at the beginning of a file when saving it as UTF-8.
So these bytes have to be removed when reading the file. The codecs module in Python defines a constant for them:
# coding=gbk
import codecs
data = open("test.txt").read()
if data[:3] == codecs.BOM_UTF8:
    data = data[3:]
print data.decode("utf-8")
Result: ABC中文
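
As a sketch of the BOM handling described above, the following Python 3 code (an assumption; the article targets Python 2.5.1) simulates the bytes Notepad would write and compares manual stripping with the standard "utf-8-sig" codec, which skips a leading BOM on its own:

```python
import codecs

# Simulated file contents: BOM followed by the UTF-8 text.
raw = codecs.BOM_UTF8 + "ABC中文".encode("utf-8")

# Option 1: strip the BOM by hand, as in the article.
data = raw[len(codecs.BOM_UTF8):] if raw.startswith(codecs.BOM_UTF8) else raw
manual = data.decode("utf-8")

# Option 2: let the "utf-8-sig" codec drop the BOM while decoding.
auto = raw.decode("utf-8-sig")

# Plain "utf-8" keeps the BOM as the character U+FEFF at position 0.
kept = raw.decode("utf-8")

print(manual == auto)            # prints: True
print(kept[0] == "\ufeff")       # prints: True
```

The leftover U+FEFF in the last case is the character that the GBK codec could not encode in the traceback above.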

(iv) A leftover question
In section (ii) we converted str to unicode with the unicode function and the decode method. Why is the argument to both of them "gbk"?
My first guess was that it is because the encoding declaration uses GBK (# coding=gbk), but is that really so?
Modify the source file:
# coding=utf-8
s = "中文"
print unicode(s, "utf-8")
Running it produces an error:
Traceback (most recent call last):
  File "chinesetest.py", line 3, in <module>
    s = unicode(s, "utf-8")
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-1: invalid data
If the earlier examples worked only because both sides used GBK, then keeping both sides consistently UTF-8 here should work as well, yet it raises an error.
As a further example, convert with GBK here instead:
# coding=utf-8
s = "中文"
print unicode(s, "gbk")
Result: 中文
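
One way to reproduce this pair of results is sketched below in Python 3 (an assumption; the article uses Python 2.5.1). It rests on a further assumption that the article does not state: that the bytes held in s were GBK bytes, as they would be if the source file were still saved as ANSI.

```python
# Assumed contents of s: the GBK bytes of the text, whatever the declaration says.
stored = "中文".encode("gbk")

try:
    stored.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError as exc:
    # GBK bytes are generally not valid UTF-8, so decoding fails early.
    utf8_ok = False
    print("utf-8 decode failed at position", exc.start, ":", exc.reason)

# Decoding with the codec that matches the stored bytes succeeds.
decoded = stored.decode("gbk")
print(utf8_ok, decoded == "中文")  # prints: False True
```

Under that assumption, the codec passed to the decode step has to match the bytes actually stored, independently of what the declaration line says.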
I found an English document that roughly explains how print works in Python:
When Python executes a print statement, it simply passes the output to the operating system (using fwrite() or something like it), and some other program is responsible for actually displaying that output on the screen. For example, on Windows, it might be the Windows console subsystem that displays the result. Or if you are using Windows and running Python on a Unix box somewhere else, your Windows SSH client is actually responsible for displaying the data. If you are running Python in an xterm on Unix, then xterm and your X server handle the display.

To print data reliably, you must know the encoding that this display program expects.

Simply put, print in Python passes strings directly to the operating system, so the str has to be decoded with an encoding consistent with the one the operating system uses. Windows uses CP936 (almost the same as GBK), so GBK works here.
Final test:
# coding=utf-8
s = "中文"
print unicode(s, "cp936")
Result: 中文
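
A minimal sketch of the "know what the display program expects" idea, in Python 3 (an assumption; the article uses Python 2.5.1). The fallback to "utf-8" is a choice made for this example only, since sys.stdout.encoding can be None when output is redirected:

```python
import sys

text = "中文"

# Encoding the display side is assumed to expect.
target = sys.stdout.encoding or "utf-8"

# Encode explicitly for that target; "replace" substitutes characters the
# target encoding cannot represent instead of raising an error.
payload = text.encode(target, "replace")
print(target, len(payload))

# Python's codec registry lists cp936 as an alias of gbk, so both names
# produce the same bytes.
same = text.encode("cp936") == text.encode("gbk")
print(same)  # prints: True
```

On a Chinese Windows console the target would typically be cp936, which matches the article's choice of GBK.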
