String encoding in Python has always been a headache for me, so I spent some time studying how it works. The main topics are the encoding of the source file, the system default encoding, and encoding conversion of strings.
This article does not cover the details of specific encodings; that material is easy to find with a web search.
File encoding

The file encoding is the encoding in which the Python source file is saved. An editor such as Notepad++ shows the encoding of a source file. The file encoding determines the encoding of the string literals defined in that file: if the source file is saved as UTF-8, the string defined below is UTF-8 encoded.
s = 'hello'
To facilitate the subsequent analysis of strings, we have defined two functions.
import chardet

def toHexString(s):
    return ":".join("{0:x}".format(ord(c)) for c in s)

def getCharset(s):
    return chardet.detect(s)['encoding']

With these two functions, you can inspect the bytes of a string and detect its encoding (the chardet library is required).

The file encoding can be declared in the source code; for details, see PEP 0263 -- Defining Python Source Code Encodings. Use one of the following three forms on the first or second line of the file so that the Python parser can read the file correctly.
# coding=<encoding name>

#!/usr/bin/python
# -*- coding: <encoding name> -*-

#!/usr/bin/python
# vim: set fileencoding=<encoding name> :
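As a rough way to check which encoding a declaration selects, the sketch below feeds each of the three forms to tokenize.detect_encoding, the standard-library function that applies the PEP 263 rules. It assumes Python 3, where an undeclared file is treated as utf-8; the helper declared_encoding and the sample header lines are made up for illustration.

```python
import io
import tokenize

def declared_encoding(source_bytes):
    """Return the encoding that PEP 263 detection finds in the first two lines."""
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding

# Hypothetical file headers, one per declaration style.
plain = b"# coding=gb2312\ns = 'hello'\n"
emacs = b"#!/usr/bin/python\n# -*- coding: utf_8 -*-\n"
vim = b"#!/usr/bin/python\n# vim: set fileencoding=gb2312 :\n"

for header in (plain, emacs, vim):
    print(declared_encoding(header))
```

The second header is reported as utf-8 even though it is spelled utf_8, because the detection normalizes the common spellings of that name.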
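If no encoding is declared, Python 2 assumes ascii. For the supported encodings, see the list of standard encodings in the Python codecs documentation. Note that utf_8 and utf-8 are aliases of the same encoding. In practice, a UTF-8 file saved with a byte order mark (BOM) is recognized without a declaration; without a BOM, Python 2 rejects non-ASCII bytes in an undeclared file, so it is safer to declare the encoding. In either case the string defined above is UTF-8 encoded. If the file is saved as ANSI (GB2312/GBK on a Simplified Chinese Windows system), use the following declaration so that the variable s above is parsed correctly; the bytes in s are then GB2312 encoded.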
#coding=gb2312
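To make the effect of the file encoding concrete, the following sketch prints the bytes that the same two-character Chinese text produces under UTF-8 and under GB2312. It is written for Python 3, where text and bytes are separate types, so the helper to_hex_string is an adapted stand-in for toHexString, and the sample text is an assumption rather than the article's original string.

```python
def to_hex_string(data):
    """Join the bytes of `data` as colon-separated lowercase hex."""
    return ":".join("{0:x}".format(b) for b in data)

text = "\u4e2d\u6587"  # sample text (assumed): the two characters of "zhongwen"

utf8_bytes = text.encode("utf-8")   # bytes a UTF-8 source file would store
gb_bytes = text.encode("gb2312")    # bytes a GB2312 (ANSI) source file would store

print(to_hex_string(utf8_bytes))  # e4:b8:ad:e6:96:87
print(to_hex_string(gb_bytes))    # d6:d0:ce:c4
```

The text is identical in both cases; only the stored bytes differ, which is why the declaration has to match the encoding the file was saved in.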
The system default encoding can be obtained as shown below. In Python 2 it is ascii. It matters for understanding the conversions between string encodings discussed later. Note that this description is a simplification intended only to aid understanding.
import sys
sys.getdefaultencoding()
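For comparison, here is a small sketch of the same call under Python 3 (an assumption; the article targets Python 2). There the function reports utf-8 rather than ascii, and codecs.lookup can confirm that the reported name is a registered codec.

```python
import codecs
import sys

default = sys.getdefaultencoding()
print(default)

# Resolve the name through the codec registry to get its canonical spelling.
print(codecs.lookup(default).name)
```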
For more information about this function, see the sys module documentation.

Encoding conversion

First, consider what encoding and decoding mean. Suppose we have the following script:
import base64
s1 = 'hello'
print s1
s2 = base64.b64encode(s1)
print s2  # out: aGVsbG8=
The content of s1 is 'hello'. After base64 encoding, the content of s2 is 'aGVsbG8='. The process from s1 to s2 is called encoding, and the process from s2 back to s1 is called decoding. Conversion between string encodings in Python works in a similar way. Strings provide two methods, str.encode and str.decode, and both convert through an intermediate form that this article loosely calls the system default encoding (strictly speaking, in Python 2 the intermediate form is a unicode object). encode converts from the intermediate form to the specified encoding, while decode converts from the specified encoding to the intermediate form. See the following example:
# coding=utf-8
import chardet

def toHexString(s):
    return ":".join("{0:x}".format(ord(c)) for c in s)

def getCharset(s):
    return chardet.detect(s)['encoding']

s = ''  # put a non-ASCII (e.g. Chinese) string literal here
print getCharset(s)
s1 = s.decode('utf-8').encode('gb2312')
print getCharset(s1)

The source file is saved as UTF-8, so s is UTF-8 encoded. To convert it to gb2312, you must first decode it into the intermediate form (a unicode object, loosely the system default encoding in this article) and then encode it as gb2312.
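The decode-then-encode step can be sketched as follows. This version assumes Python 3, where the intermediate form is explicitly a str (unicode) object and only bytes has a decode method; transcode is a hypothetical helper name and the sample text is an assumption.

```python
def transcode(data, source_encoding, target_encoding):
    """Convert encoded bytes from one encoding to another via unicode text."""
    text = data.decode(source_encoding)   # bytes -> unicode text
    return text.encode(target_encoding)   # unicode text -> bytes

utf8_bytes = "\u4e2d\u6587".encode("utf-8")
gb_bytes = transcode(utf8_bytes, "utf-8", "gb2312")

print(len(utf8_bytes), len(gb_bytes))               # 6 4
print(gb_bytes.decode("gb2312") == "\u4e2d\u6587")  # True

# Decoding with the wrong codec fails rather than silently converting.
try:
    gb_bytes.decode("utf-8")
except UnicodeDecodeError as exc:
    print("cannot decode as utf-8:", exc.reason)
```

There is no direct UTF-8 to GB2312 conversion in this sketch: every conversion passes through unicode text, which is the same two-step shape as the Python 2 example.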