A detailed look at string operations and Unicode encoding in Python

Source: Internet
Author: User
Strings are a data type like any other, but they come with an extra concern: encoding. This article describes how to work with strings and Unicode encoding in Python 3.

String type

str: Unicode string. A string literal written as '' or r'' is a str; the single quotes can be replaced by double quotes or triple quotes. Whichever form is used, Python stores the string the same way internally.

bytes: Binary string. Data such as the contents of a jpg file cannot be represented as str, so bytes is used instead. Each element of a bytes object is an integer from 0 to 255. When a bytes object is printed, Python shows the bytes that can be expressed in ASCII as ASCII characters, which makes the output easier to read. bytes supports most str methods (format() is a notable exception), and the re module also works with bytes patterns.

bytearray: A binary string that can be modified in place.
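The differences between the three types are easiest to see side by side. The following is a minimal sketch; the variable names are chosen only for illustration.

```python
text = "h\u00e9llo"                # str: a sequence of Unicode code points ("héllo")
raw = text.encode("utf-8")         # bytes: an immutable sequence of integers 0..255
buf = bytearray(raw)               # bytearray: a mutable copy of the same bytes

first_byte = raw[0]                # indexing bytes yields an int, not a 1-character string
buf[0] = ord("H")                  # in-place assignment works on bytearray only

print(len(text), len(raw))         # the accented letter needs 2 bytes in UTF-8
print(raw)                         # ASCII-representable bytes are shown as ASCII
print(buf.decode("utf-8"))         # decode the modified buffer back to str
```

Trying the same assignment on `raw` would raise a TypeError, because bytes objects are immutable.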

UTF-8 encoding range

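Range                                            Bytes     Storage format
0x0000 ~ 0x007F (0 ~ 127)                        1 byte    0xxxxxxx
0x0080 ~ 0x07FF (128 ~ 2047)                     2 bytes   110xxxxx 10xxxxxx
0x0800 ~ 0xFFFF (2048 ~ 65535)                   3 bytes   1110xxxx 10xxxxxx 10xxxxxx
0x10000 ~ 0x1FFFFF (65536 ~ 2097151)             4 bytes   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
0x200000 ~ 0x3FFFFFF (2097152 ~ 67108863)        5 bytes   111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
0x4000000 ~ 0x7FFFFFFF (67108864 ~ 2147483647)   6 bytes   1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

The 5-byte and 6-byte forms come from the original UTF-8 design. The current standard (RFC 3629) limits UTF-8 to code points up to U+10FFFF, so valid UTF-8 never uses more than 4 bytes per character.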
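The ranges in the table can be checked against Python's own encoder. The sketch below uses two hypothetical helpers, `utf8_length` and `utf8_bits`, which are not part of any library. It covers only the 1 to 4 byte forms, because Python's UTF-8 codec does not produce the 5-byte and 6-byte forms.

```python
def utf8_length(code_point: int) -> int:
    """Return the number of bytes UTF-8 uses for a code point (1 to 4 byte forms only)."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    return 4


def utf8_bits(ch: str) -> str:
    """Show the UTF-8 bytes of one character as space-separated binary octets."""
    return " ".join(f"{b:08b}" for b in ch.encode("utf-8"))


# One sample character per row of the table (1, 2, 3 and 4 bytes).
for ch in ["A", "\u00e9", "\u4e2d", "\U0001F600"]:
    # The range-based prediction should agree with the real encoder.
    assert utf8_length(ord(ch)) == len(ch.encode("utf-8"))
    print(f"U+{ord(ch):04X}", utf8_length(ord(ch)), utf8_bits(ch))
```

The leading bits of each printed octet match the storage format column: `0` for a single byte, `110`, `1110` or `11110` for a lead byte, and `10` for every continuation byte.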

Byte order mark (BOM)

BOM is the abbreviation of byte order mark, a marker at the start of a file that indicates its encoding and byte order.

Encoding Rules

Python does not write a BOM when writing a file with the 'utf-8' encoding. Specifying the encoding 'utf-8-sig' forces Python to write one.

Using 'utf-16-be' does not write a BOM, but using 'utf-16' does.

>>> open('h.txt', 'w', encoding='utf-8-sig').write('aaa')
3
>>> open('h.txt', 'rb').read()
b'\xef\xbb\xbfaaa'
>>> open('h.txt', 'w', encoding='utf-16').write('bbb')
3
>>> open('h.txt', 'rb').read()
b'\xff\xfeb\x00b\x00b\x00'
>>> open('hh.txt', 'w', encoding='utf-16-be').write('ccc')
3
>>> open('hh.txt', 'rb').read()
b'\x00c\x00c\x00c'
>>> open('h.txt', 'w', encoding='utf-8').write('ddd')
3
>>> open('h.txt', 'rb').read()
b'ddd'
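The same rules can be observed without touching the file system, because `str.encode` applies the same codecs as `open`. This is a small sketch that compares the results with the BOM constants from the standard `codecs` module.

```python
import codecs

with_sig = "aaa".encode("utf-8-sig")     # BOM followed by the data
plain = "aaa".encode("utf-8")            # data only, no BOM
utf16 = "bbb".encode("utf-16")           # BOM followed by data in native byte order
utf16_be = "ccc".encode("utf-16-be")     # explicit byte order, so no BOM is written

print(with_sig.startswith(codecs.BOM_UTF8))   # True
print(plain.startswith(codecs.BOM_UTF8))      # False
print(utf16.startswith(codecs.BOM_UTF16))     # True (BOM_UTF16 is the native-order BOM)
print(utf16_be)                               # b'\x00c\x00c\x00c'
```

On a little-endian machine the 'utf-16' BOM is `b'\xff\xfe'`, which matches the session shown above; on a big-endian machine it would be `b'\xfe\xff'`.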

Read rules

If the correct encoding is specified, the BOM is skipped when reading. Otherwise, the BOM shows up as garbled characters at the start of the text, or an exception is raised.

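>>> # h.txt is assumed here to have been written with encoding='utf-8-sig'
>>> open('h.txt', 'r').read()    # platform default encoding: the BOM appears as garbled characters before 'ddd'
>>> open('h.txt', 'r', encoding='utf-8-sig').read()
'ddd'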
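The read rules can be reproduced in memory by decoding BOM-prefixed bytes with different codecs. This is a minimal sketch; the exact garbled characters seen in practice depend on which wrong encoding is used, so only the 'utf-8' and 'ascii' cases are shown.

```python
data = "ddd".encode("utf-8-sig")          # b'\xef\xbb\xbfddd'

as_utf8 = data.decode("utf-8")            # the BOM survives as the character U+FEFF
as_sig = data.decode("utf-8-sig")         # the BOM is recognized and skipped
no_bom = b"ddd".decode("utf-8-sig")       # utf-8-sig also accepts data without a BOM

try:
    data.decode("ascii")                  # the BOM bytes are not valid ASCII
    error_name = None
except UnicodeDecodeError as exc:
    error_name = type(exc).__name__

print(repr(as_utf8), repr(as_sig), repr(no_bom), error_name)
```

Because 'utf-8-sig' tolerates a missing BOM, it is a reasonable choice for reading UTF-8 files whose origin is unknown.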

Encoding and decoding

  • chr and ord

>>> ord('中')
20013
>>> chr(20013)
'中'

  • Hard-coding Unicode characters into a string

'\xhh': represents a character with a 2-digit hexadecimal code.

'\uhhhh': represents a character with a 4-digit hexadecimal code.

'\Uhhhhhhhh': represents a character with an 8-digit hexadecimal code.

>>> s = 'py\x74h\u4e2don'
>>> s
'pyth中on'
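The three escape forms differ only in how many hexadecimal digits they take; they all produce ordinary characters. A short sketch of this, with variable names chosen for illustration:

```python
s = "py\x74h\u4e2don"                    # \x74 is 't', \u4e2d is the CJK character '中'
code = ord("\u4e2d")                     # code point of '中' as an integer
same = "\x41" == "\u0041" == "\U00000041" == "A"   # three spellings of the same character

beyond_bmp = "\U0001F600"                # code points above U+FFFF need the 8-digit form
raw_s = r"\u4e2d"                        # in a raw string the escape is not interpreted

print(s, code, same)
print(len(beyond_bmp), len(raw_s))       # one character versus six literal characters
```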

str, bytes, and bytearray conversion

str.encode(encoding='utf-8')   # str -> bytes

bytes(s, encoding='utf-8')   # str -> bytes

bytes.decode(encoding='utf-8')   # bytes -> str

str(b, encoding='utf-8')   # bytes -> str

bytearray(s, encoding='utf-8')   # str -> bytearray

bytearray(b)   # bytes -> bytearray
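The conversions listed above can be exercised in a round trip. The sketch below is illustrative only; it also shows one pitfall that the list does not mention, namely calling `str()` on bytes without an encoding.

```python
s = "\u4e2d\u6587"                        # two CJK characters ("中文")

b1 = s.encode(encoding="utf-8")           # str -> bytes
b2 = bytes(s, encoding="utf-8")           # str -> bytes, same result
ba = bytearray(s, encoding="utf-8")       # str -> bytearray

back1 = b1.decode(encoding="utf-8")       # bytes -> str
back2 = str(b2, encoding="utf-8")         # bytes -> str, same result
frozen = bytes(ba)                        # bytearray -> bytes

# Without an encoding argument, str() does not decode: it returns the repr text.
repr_text = str(b1)

print(b1, back1, repr_text)
```

Both directions give identical results whichever spelling is used, so the choice between the method form and the constructor form is a matter of style.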

Source file encoding declaration

Python 3 assumes that source files are UTF-8 encoded by default.

# -*- coding: latin-1 -*- : Declares that the source file is latin-1 encoded.
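One way to see how such a declaration is interpreted is the standard library function `tokenize.detect_encoding`, which reads the first lines of a source file the same way the interpreter's tokenizer does. The helper `declared_encoding` below is hypothetical and exists only for this sketch.

```python
import io
import tokenize


def declared_encoding(source: bytes) -> str:
    """Return the encoding that the given source bytes declare (or the default)."""
    encoding, _first_lines = tokenize.detect_encoding(io.BytesIO(source).readline)
    return encoding


with_cookie = declared_encoding(b"# -*- coding: latin-1 -*-\nx = 1\n")
without_cookie = declared_encoding(b"x = 1\n")

print(with_cookie)       # latin-1 is reported under its normalized name, iso-8859-1
print(without_cookie)    # no declaration, so the default applies
```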

Helper functions

sys.platform               # 'win32'
sys.getdefaultencoding()   # 'utf-8'
sys.byteorder              # 'little'

s.isalnum()                # s denotes a string
s.isalpha()
s.isdecimal()
s.isdigit()
s.isnumeric()
s.isprintable()
s.isspace()
s.isidentifier()           # returns True if the string can be used as a variable name
s.islower()
s.isupper()
s.istitle()
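Several of these predicates look interchangeable but differ on non-ASCII input. The sketch below compares them on a few sample strings; the sample values are chosen only for illustration.

```python
import keyword
import sys

# (isdecimal, isdigit, isnumeric) for an ASCII digit, superscript two, and one half.
numeric_checks = {
    ch: (ch.isdecimal(), ch.isdigit(), ch.isnumeric())
    for ch in ["5", "\u00b2", "\u00bd"]
}

# isidentifier() checks only the syntax of a name; reserved words still pass.
identifier_checks = {name: name.isidentifier() for name in ["my_var", "2fast", "class"]}
is_reserved = keyword.iskeyword("class")

title_check = "Hello World".istitle()

print(sys.getdefaultencoding(), sys.byteorder)
print(numeric_checks)
print(identifier_checks, is_reserved, title_check)
```

The three numeric predicates grow progressively more permissive: isdecimal() accepts only decimal digits, isdigit() also accepts digit-like characters such as superscripts, and isnumeric() additionally accepts fractions and other numeric characters.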

