Handling Chinese text and related files in Python

Last Update: 2013-12-17 Source: Internet

Author: User

Handling Chinese text in Python raises many difficulties and is a classic stumbling block for beginners. This article walks through some solutions to the problem, in the hope that they help you work with Chinese text more flexibly.

  >>> import sys
  >>> sys.version
  '2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit (Intel)]'

Use Notepad to create a file named ChineseTest.py, saved in the default ANSI encoding:

  s = "中文"
  print s

Test it:

  E:\Project\Python\Test>python ChineseTest.py
    File "ChineseTest.py", line 1
  SyntaxError: Non-ASCII character '\xd6' in file ChineseTest.py on line 1, but no encoding declared;

Quietly change the file's encoding to UTF-8 and run it again:

  E:\Project\Python\Test>python ChineseTest.py
    File "ChineseTest.py", line 1
  SyntaxError: Non-ASCII character '\xe4' in file ChineseTest.py on line 1, but no encoding declared;

That does not help.
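The two byte values in these error messages match the first byte of the string in each encoding. A minimal sketch, assuming the literal in ChineseTest.py is the two characters 中文, and written for Python 3 (where bytes plays the role of the Python 2 str used in this article):

```python
# Assumption: the string in ChineseTest.py is the two characters U+4E2D U+6587.
text = "\u4e2d\u6587"

# Bytes stored by an ANSI save on a Simplified Chinese Windows system (GBK)
gbk_bytes = text.encode("gbk")
# Bytes stored after converting the file to UTF-8
utf8_bytes = text.encode("utf-8")

# The first byte of each is the one named in the corresponding SyntaxError.
print(hex(gbk_bytes[0]))
print(hex(utf8_bytes[0]))
```

Either way the file starts with a byte outside the ASCII range, which is why changing the file encoding alone does not remove the error.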

Since the error message provides a URL, let's take a look. After a quick read, it turns out that if a source file contains non-ASCII characters, the encoding must be declared on the first or second line. Change the encoding of ChineseTest.py back to ANSI and add the encoding declaration:

  # coding=gbk
  s = "中文"
  print s

Run it again:

  E:\Project\Python\Test>python ChineseTest.py
  中文

It works :)
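Which encoding Python will assume for a source file can also be checked programmatically. A minimal sketch, assuming Python 3, whose standard tokenize module applies the same declaration rule (PEP 263). The helper name declared_encoding and the sample bytes are made up for illustration. Note that without a declaration Python 3 falls back to UTF-8, whereas the Python 2.5 used in this article falls back to ASCII:

```python
import io
import tokenize


def declared_encoding(source_bytes):
    """Return the source encoding Python 3 detects for the given file contents."""
    # detect_encoding reads at most the first two lines, as the declaration rule requires
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding


# Hypothetical file contents: one with a declaration, one without
with_declaration = b'# coding=gbk\ns = "\xd6\xd0\xce\xc4"\n'
without_declaration = b'print("hello")\n'

print(declared_encoding(with_declaration))
print(declared_encoding(without_declaration))
```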

I) Take a look at its length:

  # coding=gbk
  s = "中文"
  print len(s)

s is of type str, which is a byte string. In GBK each Chinese character takes two bytes, the same as two English characters, so the length is 4.
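The byte count depends on the encoding, while the character count does not. A small sketch, assuming the sample string is the two characters 中文 and using Python 3, where the bytes/str split makes the difference explicit:

```python
# Assumed sample string: the two characters U+4E2D U+6587
text = "\u4e2d\u6587"

gbk_length = len(text.encode("gbk"))     # two bytes per character in GBK
utf8_length = len(text.encode("utf-8"))  # three bytes per character in UTF-8
char_length = len(text)                  # characters, independent of any encoding

print(gbk_length, utf8_length, char_length)
```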
Now write it like this:

  # coding=gbk
  s = "中文"
  s1 = u"中文"
  s2 = unicode(s, "gbk")  # if the encoding argument is omitted, Python decodes with its default, ASCII
  s3 = s.decode("gbk")    # decode converts str to unicode, the same as the unicode function
  print len(s1)
  print len(s2)
  print len(s3)

All three print 2, because a unicode object counts characters rather than bytes.
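The comment about the default ASCII codec can be checked directly. A sketch in Python 3, where raw.decode("gbk") and str(raw, "gbk") are the counterparts of the Python 2 s.decode("gbk") and unicode(s, "gbk"). The bytes below are assumed to be the GBK encoding of 中文:

```python
# Assumed input: GBK bytes of the two-character sample string
raw = b"\xd6\xd0\xce\xc4"

via_method = raw.decode("gbk")      # counterpart of s.decode("gbk")
via_constructor = str(raw, "gbk")   # counterpart of unicode(s, "gbk")
print(via_method == via_constructor)

# ASCII is what Python 2 uses when the encoding argument is left out.
# It cannot represent any byte above 0x7F, so decoding fails.
try:
    raw.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True
print(ascii_failed)
```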

II) Then let's take a look at the processing of the file.:

Create a file named test.txt in ANSI format with the following content:
Abc Chinese
Read data using python

 
 
  
  # Coding = gbk
  
  Print open ("Test.txt"). read ()
  
  Result: abc (Chinese)

The file format into UTF-8:
Result: abc Juan
Obviously, decoding is required here:

 
 
  
  # Coding = gbk
  
  Import codecs
  
  Print open ("Test.txt"). read (). decode ("UTF-8 ")
  
  Result: abc (Chinese)
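Decoding can also be left to the file object. A minimal sketch using io.open with an explicit encoding; this is an alternative to the approach above, not the article's method, and io is only available from Python 2.6 onward (not in the 2.5.1 used here). The temporary file stands in for test.txt:

```python
import io
import os
import tempfile

# Hypothetical stand-in for test.txt, written as UTF-8 without a BOM
path = os.path.join(tempfile.mkdtemp(), "test.txt")
with io.open(path, "wb") as f:
    f.write(u"abc\u4e2d\u6587".encode("utf-8"))

# io.open decodes while reading, so no separate .decode() call is needed
with io.open(path, "r", encoding="utf-8") as f:
    content = f.read()

print(len(content))
```

The value read back is already text, so its length counts characters rather than bytes.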

I used Editplus to edit test.txt, but when I used the notepad that came with Windows to edit and coexist in UTF-8 format,
Running error:
Originally, some software, such as notepad, will insert three invisible characters 0xEF 0xBB 0xBF at the beginning of the file when saving a file encoded in UTF-8 ).
Therefore, we need to remove these characters during reading. The codecs module in python Chinese defines this constant:

 
 
  
  # Coding = gbk
  
  Import codecs
  
  Data = open ("Test.txt"). read ()
  
  If data [: 3] = codecs. BOM_UTF8:
  
  Datadata = data [3:]
  
  Print data. decode ("UTF-8 ")
  
  Result: abc (Chinese)
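As an alternative to slicing the bytes by hand, Python also ships a "utf-8-sig" codec that skips a leading BOM when one is present. A short sketch, written for Python 3; the sample bytes are assumed stand-ins for the two ways test.txt might have been saved:

```python
import codecs

# Assumed file contents: the same text saved with and without a BOM
with_bom = codecs.BOM_UTF8 + u"abc\u4e2d\u6587".encode("utf-8")
without_bom = u"abc\u4e2d\u6587".encode("utf-8")

# "utf-8-sig" drops a leading BOM and otherwise behaves like "utf-8"
same_text = with_bom.decode("utf-8-sig") == without_bom.decode("utf-8-sig")
print(same_text)

# Plain "utf-8" keeps the BOM as the character U+FEFF at the start
kept_bom = with_bom.decode("utf-8")[0] == u"\ufeff"
print(kept_bom)
```

This handles files from both kinds of editor with the same call.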

III) A few remaining questions

In part I, we used the unicode function and the decode method to convert str to unicode. Why did both take "gbk" as the argument?
The first guess is that it is because the encoding declaration uses gbk (# coding=gbk), but is that really so?
Modify the source file:

  # coding=UTF-8
  s = "中文"
  print unicode(s, "UTF-8")

Run it, and it fails:

  Traceback (most recent call last):
    File "ChineseTest.py", line 3, in <module>
      s = unicode(s, "UTF-8")
  UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-1: invalid data

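Put simply, the declaration does not change the bytes stored in the file: assuming the file is still saved as ANSI, s holds bytes in the system encoding. In Python, print passes a str straight to the operating system, so the str must be in the operating system's encoding and has to be decoded with that same encoding. Simplified Chinese Windows uses CP936 (almost the same as gbk), which is why gbk works here.
One last test:

  # coding=UTF-8
  s = "中文"
  print unicode(s, "cp936")

Result: 中文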
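The same outcome can be reproduced without a source file. A sketch in Python 3, assuming the script was saved as ANSI on a Simplified Chinese Windows system, so that the string literal holds the GBK bytes of 中文:

```python
# Assumed bytes of the string literal in a file saved as ANSI (CP936/GBK)
raw = b"\xd6\xd0\xce\xc4"

# The bytes are not valid UTF-8, whatever the declaration at the top of the file says
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Decoding with the encoding the file was actually saved in succeeds
decoded = raw.decode("cp936")

print(utf8_ok)
print(decoded == u"\u4e2d\u6587")
```

The codec passed to the decoder has to match how the bytes were written, not what the declaration claims.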

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
