Detailed description of string operations and encoding Unicode in Python

Last Update:2017-05-14 Source: Internet

Author: User

Strings are a data type like any other, but they carry one extra concern: encoding. This article describes the string types in Python 3, the UTF-8 byte layout, the byte order mark, and the functions used to convert between text and bytes.

String type

str: Unicode string. A literal written with '' or r'' (raw) is a str. Double quotation marks or triple quotation marks can be used in place of single quotation marks. Whichever form is used, Python stores the string the same way internally.

bytes: Binary string. Data such as a jpg file cannot be represented as str, so bytes is used instead. Each element of a bytes object is an integer from 0 to 255. When a bytes object is printed, Python shows the bytes that are printable ASCII as ASCII characters, which makes the output easier to read. bytes supports most str methods (str.format() is a notable exception), and the re module also accepts bytes patterns.

bytearray: A binary string that can be modified in place.
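
As a rough illustration of the three types above, the following sketch shows that indexing a bytes object yields integers and that a bytearray can be changed in place. The variable names are illustrative only and do not come from the original article.

```python
# Illustrative names only; none of these come from the original article.
text = 'caf\u00e9'             # str: a sequence of Unicode code points ("café")
raw = text.encode('utf-8')     # bytes: an immutable sequence of integers 0..255
buf = bytearray(raw)           # bytearray: the mutable counterpart of bytes

first_byte = raw[0]            # indexing bytes gives an int, not a 1-character string
buf[0] = ord('C')              # in-place change; the same assignment on raw raises TypeError

print(len(text), len(raw))     # 4 characters, but 5 bytes (é needs 2 bytes in UTF-8)
print(raw)                     # printable ASCII bytes are shown as ASCII
print(buf)
```

The difference between the character count and the byte count is the reason str and bytes are kept as separate types.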

UTF-8 encoding range

Range	Bytes	Storage format
0x0000 ~ 0x007F (0 ~ 127)	1 byte	0xxxxxxx
0x0080 ~ 0x07FF (128 ~ 2047)	2 bytes	110xxxxx 10xxxxxx
0x0800 ~ 0xFFFF (2048 ~ 65535)	3 bytes	1110xxxx 10xxxxxx 10xxxxxx
0x10000 ~ 0x1FFFFF (65536 ~ 2097151)	4 bytes	11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
0x200000 ~ 0x3FFFFFF	5 bytes	111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
0x4000000 ~ 0x7FFFFFFF	6 bytes	1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

The 5-byte and 6-byte forms belong to the original UTF-8 design. Current UTF-8 (RFC 3629) stops at U+10FFFF, so a valid sequence is at most 4 bytes long.
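
The ranges in the table can be checked from Python. The sketch below is a minimal example with hypothetical helper names (utf8_length and utf8_bits are not part of any library); it encodes one sample character from each of the first four ranges and prints the resulting bit pattern.

```python
def utf8_length(ch):
    """Return how many bytes the character occupies in UTF-8."""
    return len(ch.encode('utf-8'))


def utf8_bits(ch):
    """Return the UTF-8 bytes of the character as space-separated binary octets."""
    return ' '.join(format(b, '08b') for b in ch.encode('utf-8'))


# One sample per row of the table (1 to 4 bytes).
samples = ['A', '\u00e9', '\u4e2d', '\U0001F600']

for ch in samples:
    print(hex(ord(ch)), utf8_length(ch), utf8_bits(ch))
```

For '\u4e2d' (中) the output bits are 11100100 10111000 10101101, which matches the 3-byte pattern 1110xxxx 10xxxxxx 10xxxxxx.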

Byte order mark (BOM)

BOM is the abbreviation of byte order mark, a short byte sequence at the start of a file that identifies its encoding and byte order.

Encoding Rules

Python does not write a BOM when writing a file with the 'utf-8' encoding. Specifying the encoding 'utf-8-sig' makes Python write one.

Using 'utf-16-be' does not write a BOM, but using 'utf-16' does.

>>> open('h.txt','w',encoding='utf-8-sig').write('aaa')
3
>>> open('h.txt','rb').read()
b'\xef\xbb\xbfaaa'
>>> open('h.txt','w',encoding='utf-16').write('bbb')
3
>>> open('h.txt','rb').read()
b'\xff\xfeb\x00b\x00b\x00'
>>> open('hh.txt','w',encoding='utf-16-be').write('ccc')
3
>>> open('hh.txt','rb').read()
b'\x00c\x00c\x00c'
>>> open('h.txt','w',encoding='utf-8').write('ddd')
3
>>> open('h.txt','rb').read()
b'ddd'
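
The same rules can be observed without touching the file system by calling str.encode directly. This is a small sketch, not part of the original session. It assumes only the standard codecs module, whose BOM_UTF8 and BOM_UTF16 constants hold the BOM bytes (BOM_UTF16 is in the byte order of the machine running the code).

```python
import codecs

text = 'aaa'

with_sig = text.encode('utf-8-sig')      # BOM is prepended
without_sig = text.encode('utf-8')       # no BOM
utf16_default = text.encode('utf-16')    # BOM in the platform's native byte order
utf16_be = text.encode('utf-16-be')      # explicit byte order, so no BOM

print(with_sig)
print(without_sig)
print(utf16_default[:2] == codecs.BOM_UTF16)   # the first two bytes are the BOM
print(utf16_be)
```

Because 'utf-16' uses the native byte order, its exact bytes can differ between machines. The b'\xff\xfe' prefix in the session above corresponds to a little-endian platform.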

Read rules

If the matching encoding is specified when reading, the BOM is skipped. Otherwise the BOM shows up as garbled characters at the start of the text, or decoding raises an exception.

>>> open('h.txt', 'r').read()    # default encoding: the BOM comes back as garbled characters before 'ddd'
>>> open('h.txt', 'r', encoding='utf-8-sig').read()
'ddd'
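
To make the read rules reproducible on any platform, the sketch below writes a file with a BOM and reads it back twice, once as 'utf-8' and once as 'utf-8-sig'. The function name bom_demo and the temporary file are assumptions made for this example; the temporary file stands in for h.txt.

```python
import codecs
import os
import tempfile


def bom_demo():
    """Write 'ddd' with a UTF-8 BOM, then read it back three different ways."""
    fd, path = tempfile.mkstemp(suffix='.txt')   # stands in for 'h.txt'
    os.close(fd)
    try:
        with open(path, 'w', encoding='utf-8-sig') as f:
            f.write('ddd')
        with open(path, 'rb') as f:
            raw = f.read()                       # raw bytes, BOM included
        with open(path, 'r', encoding='utf-8') as f:
            plain = f.read()                     # BOM is decoded as U+FEFF and kept
        with open(path, 'r', encoding='utf-8-sig') as f:
            stripped = f.read()                  # BOM is recognised and skipped
    finally:
        os.remove(path)
    return raw, plain, stripped


raw, plain, stripped = bom_demo()
print(raw.startswith(codecs.BOM_UTF8))
print(repr(plain))
print(repr(stripped))
```

Reading with plain 'utf-8' does not fail, but it leaves the invisible character U+FEFF at the start of the string, which is a common source of confusing comparison bugs.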

Encoding and decoding

chr and ord

>>> ord('中')    # 20013
>>> chr(20013)   # '中'

Writing Unicode characters into a string literal with escapes

'\xhh': represents a character with 2 hexadecimal digits.

'\uhhhh': represents a character with 4 hexadecimal digits.

'\Uhhhhhhhh': represents a character with 8 hexadecimal digits.

>>> s = 'py\x74h\u4e2don'    # 'pyth中on'
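
As a quick check of the escapes and of chr and ord, the following sketch confirms that the three escape forms can name the same character and that ord and chr are inverses of each other. The variable names are illustrative.

```python
s = 'py\x74h\u4e2don'        # \x74 is 't', \u4e2d is '中'
wide = '\U00004e2d'          # the same character written with the 8-digit form

code_point = ord('中')        # character -> integer code point
round_trip = chr(code_point) # integer code point -> character

print(s, len(s))
print(code_point, hex(code_point))
print(round_trip == '\u4e2d' == wide)
```

Each escape produces exactly one character in the resulting str, regardless of how many bytes that character needs once encoded.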

str, bytes, and bytearray conversion

str.encode(encoding='utf-8')

bytes(s,encoding='utf-8')

bytes.decode(encoding='utf-8')

str(b, encoding='utf-8')

bytearray(string, encoding='utf-8')

bytearray(bytes)
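
The calls listed above come in equivalent pairs. The sketch below uses a sample string chosen for this example, runs each call once, and shows that decoding with an encoding that cannot represent the data raises UnicodeDecodeError.

```python
s = '中文'                               # sample text, chosen for illustration

b1 = s.encode(encoding='utf-8')         # str -> bytes
b2 = bytes(s, encoding='utf-8')         # same result through the constructor

s1 = b1.decode(encoding='utf-8')        # bytes -> str
s2 = str(b2, encoding='utf-8')          # same result through the constructor

ba1 = bytearray(s, encoding='utf-8')    # str -> bytearray
ba2 = bytearray(b1)                     # bytes -> bytearray (a mutable copy)

# Decoding with the wrong encoding fails instead of silently guessing.
try:
    b1.decode('ascii')
    error_name = None
except UnicodeDecodeError as exc:
    error_name = type(exc).__name__

print(b1, s1, ba1, error_name)
```

Note that str(b) without an encoding argument does not decode. It returns the printable representation of the bytes object, such as "b'...'", so the encoding argument should always be passed.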

Source file encoding declaration

Python 3 assumes that source files are UTF-8 encoded by default.

# -*- coding: latin-1 -*- : declares that the source file is latin-1 encoded.

Helper functions

sys.platform              # 'win32'
sys.getdefaultencoding()  # 'utf-8'
sys.byteorder             # 'little'
s.isalnum()               # s is a string
s.isalpha()
s.isdecimal()
s.isdigit()
s.isnumeric()
s.isprintable()
s.isspace()
s.isidentifier()          # returns True if the string can be used as a variable name
s.islower()
s.isupper()
s.istitle()
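
The three numeric tests in this list are easy to confuse, so here is a short sketch of how they differ on non-ASCII input. The helper name classify and the sample strings are assumptions made for this example.

```python
def classify(s):
    """Return (isdecimal, isdigit, isnumeric) for the string s."""
    return (s.isdecimal(), s.isdigit(), s.isnumeric())


print(classify('123'))       # ordinary ASCII digits
print(classify('\u00b2'))    # superscript two
print(classify('\u4e09'))    # the Chinese numeral for three

print('my_var'.isidentifier())
print('2fast'.isidentifier())
```

isdecimal is the strictest of the three and isnumeric the most permissive: '123' passes all three tests, the superscript two passes isdigit and isnumeric only, and the Chinese numeral passes isnumeric only.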

