Python character set encoding and file read/write

Source: Internet
Author: User

In Python 2, the default encoding is ASCII. It can be read and set as follows:

import sys
print sys.getdefaultencoding()
sys.setdefaultencoding('gbk')

However, the new default encoding does not take effect until Python is restarted. When I tried this, setdefaultencoding always failed with an error saying that sys has no such attribute, and dir(sys) indeed does not list it. My Python version is 2.5; I don't know whether the function has been removed.
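The missing attribute can be checked directly. The sketch below is only an illustration: it assumes a normally started interpreter, where Python 2's site module deletes sys.setdefaultencoding at startup (Python 3 does not have the function at all). The variable names are made up for this example.

```python
import sys

# The current default encoding: 'ascii' on a stock Python 2,
# 'utf-8' on Python 3.
default_encoding = sys.getdefaultencoding()

# On a normally started interpreter the setter is not present,
# which is why calling it raises AttributeError.
has_setter = hasattr(sys, 'setdefaultencoding')

print(default_encoding)
print(has_setter)
```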

When print is used for output, Python passes the content to the system, and Windows displays it according to the system's default encoding. If the content contains Chinese characters, pay attention to the following points.

1. Python code file encoding
By default, Python 2 treats .py files as ASCII. If the file contains Chinese characters and has no encoding declaration, an error occurs: SyntaxError: Non-ASCII character. To avoid it, add an encoding declaration on the first or second line of the code file:

# coding=gbk
print '中文'

2. String encoding
String literals entered directly, as above, are byte strings in the encoding of the code file. There are three ways to obtain a unicode string:

s1 = u'中文'
s2 = unicode('中文', 'gbk')
s3 = '中文'.decode('gbk')

unicode is a built-in function. Its second parameter gives the encoding of the source string.
decode is a string method that converts a byte string to unicode. Its parameter gives the encoding of the source string.
encode is also a string method. It converts a unicode string to the encoding named by its parameter.
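The relationship between decode and encode can be shown as a round trip. This is a minimal sketch, not part of the original article: the bytes are written as escapes so the snippet stays pure ASCII and should run on both Python 2.7 and Python 3, and the variable names are illustrative.

```python
# The two characters U+4E2D U+6587, stored as GBK bytes.
gbk_bytes = b'\xd6\xd0\xce\xc4'

# decode: byte string in a known encoding -> unicode string.
text = gbk_bytes.decode('gbk')

# encode: unicode string -> byte string in the requested encoding.
utf8_bytes = text.encode('utf-8')

# Encoding back to GBK returns the original bytes.
round_trip = text.encode('gbk')
```

The same two characters occupy four bytes in GBK and six bytes in UTF-8, which is why the source encoding has to be named correctly when decoding.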

3. Default System Encoding
On Chinese-language Windows systems the default encoding is GBK (GB2312 also works, because it is a subset of GBK). When print outputs a string, the string has to end up in this encoding. A byte string is passed through unchanged, so if the code file encoding is declared as GBK, a byte-string literal is already GBK and prints without problems. A unicode string has to be converted first. Where Python falls back to the default encoding (ASCII) for this implicit conversion, it fails for non-ASCII characters, so convert explicitly with the encode method:

# coding=gbk

s = u'中文'
print s.encode('gbk')
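Why the explicit encode is needed can be sketched by comparing the ASCII codec with GBK. This is an illustration under the assumption that the implicit conversion uses ASCII. It uses escapes instead of literal Chinese characters so that it should run unchanged on Python 2.7 and Python 3.

```python
# A unicode string holding U+4E2D U+6587.
text = u'\u4e2d\u6587'

# Encoding with ASCII (what an implicit conversion would use when the
# default encoding is ASCII) fails for non-ASCII characters.
try:
    text.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Encoding explicitly with GBK succeeds.
gbk_bytes = text.encode('gbk')
```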

File read/write

It is simple to read and write files in ASCII or GBK encoding format. The read and write operations are as follows:

# coding=gbk

f = open('C:/intimate.txt', 'r')  # 'r' is the file mode: read-only
s1 = f.read()        # read the entire file as one string
s2 = f.readline()    # read one line ('' here, because read() already reached the end)
s3 = f.readlines()   # read all remaining lines into a list

f.close()

f = open('C:/intimate.txt', 'w')  # 'w' opens the file for writing
f.write(s1)
f.writelines(s2)     # there is no writeline method
f.close()

f.writelines does not add line breaks; each string must already end with one.
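The behavior of writelines can be checked with a small sketch. It writes to a temporary file rather than the article's path, and the file name and variable names are made up for this example.

```python
import os
import tempfile

lines = ['first', 'second', 'third']
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')

# writelines writes the strings back to back, with no separators.
with open(path, 'w') as f:
    f.writelines(lines)
with open(path, 'r') as f:
    joined = f.read()

# To get one item per line, append the line break yourself.
with open(path, 'w') as f:
    f.writelines([line + '\n' for line in lines])
with open(path, 'r') as f:
    separated = f.read()
```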
Unicode file read/write:

# coding=gbk
import codecs

f = codecs.open('C:/intimate.txt', 'a', 'utf-8')  # append; unicode is encoded as UTF-8 on write
f.write(u'中文')
s = '中文'                 # GBK byte string, because of the coding declaration
f.write(s.decode('gbk'))   # convert to unicode before writing
f.close()

f = codecs.open('C:/intimate.txt', 'r', 'utf-8')  # bytes are decoded from UTF-8 on read
s = f.readlines()
f.close()
for line in s:
    print line.encode('gbk')  # encode explicitly for a GBK console
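The same round trip can be tried without a GBK console. The sketch below is an assumption-laden variant of the code above: it uses a temporary file instead of C:/intimate.txt, writes the characters as escapes, and adds explicit line breaks so that readlines returns two entries.

```python
import codecs
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'demo_utf8.txt')

# Write two lines; codecs.open encodes unicode to UTF-8 on the way out.
f = codecs.open(path, 'a', 'utf-8')
f.write(u'\u4e2d\u6587\n')
f.write(b'\xd6\xd0\xce\xc4'.decode('gbk') + u'\n')  # GBK bytes -> unicode
f.close()

# Read them back; codecs.open decodes UTF-8 to unicode on the way in.
f = codecs.open(path, 'r', 'utf-8')
lines = f.readlines()
f.close()

# The bytes on disk are UTF-8, whatever encoding the input came from.
with open(path, 'rb') as raw_file:
    raw = raw_file.read()
```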

 

# Update #---------------------------------------------------------------------------------------
The content above was reposted; below I add some notes of my own. The problem described above, where setdefaultencoding always fails, can be solved as follows:

import sys
reload(sys)
print sys.getdefaultencoding()
sys.setdefaultencoding('utf-8')

After reload(sys) the setdefaultencoding function is available again, so the code above changes the default encoding.

An exception is thrown when we decode a string that contains an invalid byte:

>>> s = "\x84\xe5\xb0\x8f\xe6\x98\x8e"
>>> s.decode('utf-8')
Traceback (most recent call last):
  File "<interactive input>", line 1, in ?
  File "E:\Program Files\python24\Lib\encodings\utf_8.py", line 16, in decode
UnicodeDecodeError: 'utf8' codec can't decode byte 0x84 in position 0: unexpected code byte
>>>

Next, analyze s. Written in binary, s is:

10000100 11100101 10110000 10001111 11100110 10011000 10001110

The first byte is 10000100. Under the UTF-8 encoding rules, bytes in the range 0x80 to 0xBF are continuation bytes only; they cannot start a character, so this byte is invalid in the first position. The second byte begins with three consecutive 1 bits, which means that it and the next two bytes (11100101 10110000 10001111) together encode one character. Likewise, the three bytes after that form one more character. Therefore we only need to remove the first, invalid byte to decode the string properly:

>>> s = '\xe5\xb0\x8f\xe6\x98\x8e'
>>> print s.decode('utf-8')
小明
>>>
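The byte-by-byte analysis above can be sketched in code. The helper classify_utf8_byte is hypothetical and deliberately simplified: it only looks at the leading bits and does not reject lead bytes that UTF-8 forbids for other reasons (0xC0, 0xC1, 0xF5 to 0xF7).

```python
def classify_utf8_byte(value):
    """Classify one byte value (0-255) by its leading bits."""
    if value < 0x80:
        return 'single'        # 0xxxxxxx: a one-byte character
    if value < 0xC0:
        return 'continuation'  # 10xxxxxx: cannot start a character
    if value < 0xE0:
        return 'lead-2'        # 110xxxxx: starts a 2-byte character
    if value < 0xF0:
        return 'lead-3'        # 1110xxxx: starts a 3-byte character
    if value < 0xF8:
        return 'lead-4'        # 11110xxx: starts a 4-byte character
    return 'invalid'

# bytearray yields integers on both Python 2 and Python 3.
data = bytearray(b'\x84\xe5\xb0\x8f\xe6\x98\x8e')
bits = [format(b, '08b') for b in data]
kinds = [classify_utf8_byte(b) for b in data]

for b, kind in zip(bits, kinds):
    print(b + ' ' + kind)
```

The first byte is classified as a continuation byte with no lead byte before it, which is exactly what the decoder complains about.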

But if we do not know where the invalid bytes are, how can we decode the string and still get the characters that are encoded correctly? There are two ways:

>>> s = '\x84\xe5\xb0\x8f\xe6\x98\x8e\x84'
>>> s.decode('utf-8')
Traceback (most recent call last):
  File "<interactive input>", line 1, in ?
  File "E:\Program Files\python24\Lib\encodings\utf_8.py", line 16, in decode
UnicodeDecodeError: 'utf8' codec can't decode byte 0x84 in position 0: unexpected code byte
>>> print s.decode('utf-8', 'ignore')
小明
>>> print s.decode('utf-8', 'replace')
?小明?
>>>

From the code above, we can see that the second parameter of the decode method specifies how invalid bytes are handled. The default is 'strict': an exception is thrown as soon as an invalid byte is found. In 'ignore' mode, invalid bytes are skipped. In 'replace' mode, invalid bytes are replaced with a fixed replacement character (U+FFFD, displayed as ? above).
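The three error modes can be compared side by side. This is a minimal sketch using the same bytes as the session above, written with a bytes literal so that it should run on both Python 2.7 and Python 3; the variable names are illustrative.

```python
data = b'\x84\xe5\xb0\x8f\xe6\x98\x8e\x84'

# 'strict' (the default) raises on the first invalid byte.
try:
    data.decode('utf-8')
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# 'ignore' drops the invalid bytes.
ignored = data.decode('utf-8', 'ignore')

# 'replace' substitutes U+FFFD for the invalid bytes.
replaced = data.decode('utf-8', 'replace')

print(strict_failed)
print(repr(ignored))
print(repr(replaced))
```

'ignore' loses the information that something was wrong, while 'replace' keeps a visible marker at each bad position, so 'replace' is usually the safer choice when the result will be shown to a person.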
