Python string encoding rules

Last Update:2013-12-30 Source: Internet

Author: User


String encoding in Python has always been a headache for me, so I spent some time studying the rules. Three things are involved: the encoding of the source file, the system default encoding, and conversion between string encodings.

This article does not cover the details of any particular encoding; those are easy to find with a web search.

File encoding

The so-called file encoding is the encoding of the Python source file. An editor such as Notepad++ shows the encoding of a source file. The encoding of the source file determines the bytes of the string literals defined in it: if the source file is saved as UTF-8, the string defined below is UTF-8 encoded.

s = 'hello'
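As a rough illustration of this point, here is a minimal sketch that assumes Python 3 (where text and bytes are separate types). The sample text and the helper name to_hex are chosen for illustration and do not come from the original article. The same characters produce different bytes depending on the encoding used to store them:

```python
# Sample text for illustration only (the two characters U+4E2D U+6587);
# it is written with escapes so this snippet itself stays pure ASCII.
text = "\u4e2d\u6587"

def to_hex(data):
    """Return the bytes in `data` as colon-separated two-digit hex values."""
    return ":".join("{0:02x}".format(b) for b in data)

utf8_bytes = text.encode("utf-8")
gb2312_bytes = text.encode("gb2312")

print(to_hex(utf8_bytes))    # e4:b8:ad:e6:96:87
print(to_hex(gb2312_bytes))  # d6:d0:ce:c4
```

In Python 2, a byte string literal in a source file holds whichever of these byte sequences the editor wrote to disk.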

To facilitate the subsequent analysis of strings, we have defined two functions.

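import chardet

# Return the code of each character in s as colon-separated hex values.
def toHexString(s):
    return ":".join("{0:x}".format(ord(c)) for c in s)

# Return the encoding that chardet guesses for the byte string s.
def getCharset(s):
    return chardet.detect(s)['encoding']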

With these two functions you can inspect the bytes of a string and detect its encoding (the chardet library is required). The file encoding can be declared in the source code; for details, see PEP 0263 -- Defining Python Source Code Encodings. Declare the encoding in the first or second line of the file, in one of the following three forms, so that the Python parser reads the file correctly.

# coding=<encoding name>

#!/usr/bin/python
# -*- coding: <encoding name> -*-

#!/usr/bin/python
# vim: set fileencoding=<encoding name> :
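To check how such a declaration is read, the following is a minimal sketch that assumes Python 3, whose standard library exposes the detection logic as tokenize.detect_encoding. The helper name declared_encoding and the sample source bytes are hypothetical, and gb2312 is used only as an example encoding name:

```python
import io
import tokenize

def declared_encoding(source_bytes):
    """Return the encoding Python 3 would use to read this source code."""
    readline = io.BytesIO(source_bytes).readline
    encoding, _lines_read = tokenize.detect_encoding(readline)
    return encoding

# One sample per declaration form, each naming gb2312.
plain_form = b"# coding=gb2312\ns = 1\n"
emacs_form = b"#!/usr/bin/python\n# -*- coding: gb2312 -*-\ns = 1\n"
vim_form = b"#!/usr/bin/python\n# vim: set fileencoding=gb2312 :\ns = 1\n"

for sample in (plain_form, emacs_form, vim_form):
    print(declared_encoding(sample))  # gb2312 each time
```

Note that without any declaration Python 3 falls back to UTF-8, whereas the Python 2 interpreter discussed in this article falls back to ASCII.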

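If no encoding is declared, Python 2 assumes ASCII. For the supported encodings, see the list of standard encodings in the codecs module documentation. Note that utf_8 and utf-8 are names for the same encoding. In practice, a UTF-8 source file that starts with a UTF-8 byte order mark (which some Windows editors write) is recognized without a declaration, as PEP 0263 describes; otherwise a Python 2 file that contains non-ASCII characters needs one. With such a file, the string defined above is UTF-8. If the file is saved as ANSI (a GB2312-compatible code page on Simplified Chinese Windows), add the following declaration so that the definition of s above is parsed correctly; the bytes in s are then gb2312.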

#coding=gb2312
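The point about alternative names for one encoding can be checked with codecs.lookup from the standard library. This is a small sketch assuming Python 3; the helper name canonical_name is hypothetical:

```python
import codecs

def canonical_name(label):
    """Return the canonical codec name that an encoding label resolves to."""
    return codecs.lookup(label).name

print(canonical_name("utf_8"))  # utf-8
print(canonical_name("utf-8"))  # utf-8
print(canonical_name("UTF8"))   # utf-8

# A misspelled label is not an alias; the lookup raises LookupError.
try:
    canonical_name("uft-8")
    misspelling_accepted = True
except LookupError:
    misspelling_accepted = False
print(misspelling_accepted)     # False
```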

The system default encoding can be obtained as follows. In Python 2 it is ascii. It matters for understanding the conversions between string encodings described later. Note that this view is only a simplification to make the explanation easier to follow.

import sys
sys.getdefaultencoding()
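For comparison, here is a short sketch assuming Python 3, where the reported value is utf-8 rather than the ascii that this article describes for Python 2. The variable names are chosen for illustration:

```python
import sys

default_name = sys.getdefaultencoding()
print(default_name)

# On Python 3, str.encode() with no argument uses UTF-8.
sample = "\u4e2d"  # one non-ASCII character, written as an escape
print(sample.encode() == sample.encode("utf-8"))  # True
```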

For more information about this function, see the documentation for sys.getdefaultencoding.

Encoding conversion

First, consider what encoding and decoding mean. Suppose there is a script as follows:

import base64
s1 = 'hello'
print s1
s2 = base64.b64encode(s1)
print s2  # out: aGVsbG8=
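For readers on Python 3, where the base64 functions operate on bytes rather than text, a rough equivalent of this script is sketched below. It keeps the names s1 and s2 and adds a decoding step; the name restored is an addition for illustration:

```python
import base64

s1 = b"hello"                    # base64 works on bytes in Python 3
s2 = base64.b64encode(s1)        # encoding: s1 -> s2
print(s2)                        # b'aGVsbG8='

restored = base64.b64decode(s2)  # decoding: s2 -> s1
print(restored == s1)            # True
```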

The content of s1 is 'hello'. After base64 encoding, the content of s2 is 'aGVsbG8='. Going from s1 to s2 is called encoding, and going from s2 back to s1 is called decoding. Conversion between string encodings in Python works in a similar way. Strings provide two methods for it: str.encode and str.decode. Both go through an intermediate form, described here as the system default encoding: encode converts from the system default encoding to the specified encoding, while decode converts from the specified encoding to the system default encoding. (Strictly, in Python 2 decode returns a unicode object, and encode called on a byte string first decodes it with the system default encoding.) See the following example:

# coding=utf-8
import chardet

def toHexString(s):
    return ":".join("{0:x}".format(ord(c)) for c in s)

def getCharset(s):
    return chardet.detect(s)['encoding']

s = ''
print getCharset(s)
s1 = s.decode('utf-8').encode('gb2312')
print getCharset(s1)

The source file is UTF-8, so s is UTF-8 encoded. To convert it to gb2312, first decode it from UTF-8 into the intermediate form (the system default encoding in this article's model) and then encode it as gb2312; s1 then holds gb2312 bytes.
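The decode-then-encode step can be sketched as follows, assuming Python 3, where the intermediate form is a str (Unicode text) rather than a byte string. The helper name convert and the sample text are hypothetical and not taken from the original article:

```python
def convert(data, source_encoding, target_encoding):
    """Re-encode bytes by decoding them to text, then encoding the text."""
    text = data.decode(source_encoding)   # bytes -> Unicode text
    return text.encode(target_encoding)   # Unicode text -> bytes

# Sample text for illustration only (U+4E2D U+6587), stored as UTF-8 bytes.
utf8_bytes = "\u4e2d\u6587".encode("utf-8")
gb2312_bytes = convert(utf8_bytes, "utf-8", "gb2312")
print(gb2312_bytes)  # b'\xd6\xd0\xce\xc4'

# Decoding with a codec that does not match the bytes fails. This is the
# same kind of error Python 2 raises when it implicitly decodes with ascii.
try:
    utf8_bytes.decode("ascii")
    ascii_decode_failed = False
except UnicodeDecodeError:
    ascii_decode_failed = True
print(ascii_decode_failed)  # True
```

The conversion is reversible for text that both encodings can represent: convert(gb2312_bytes, "gb2312", "utf-8") gives back the original UTF-8 bytes.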

This article is an English version of an article originally written in Chinese on aliyun.com.
