Detailed description of string operations and Unicode encoding in Python
This article introduces string operations and Unicode encoding in Python: the string types, UTF-8, the byte order mark, and conversion between str and bytes.
String types
str
: Unicode string. A literal written with '' or r'' (a raw string) is a str; double quotation marks or triple quotation marks can be used instead. Whichever form is used, Python stores the string the same way internally.
bytes
: Binary string. Binary data such as the contents of a jpg file cannot be represented as str, so bytes is used instead. Each element of a bytes object is an integer from 0 to 255. When printed, Python shows the bytes that fall in the printable ASCII range as ASCII characters, which makes them easier to read. bytes supports almost all str methods except format() (%-style formatting is available on bytes since Python 3.5), and it can even be used with the re module.
bytearray()
: A binary string that can be modified in place.
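The difference between the three types can be sketched as follows. The sample value and variable names are made up for illustration.

```python
# Illustrative values only; the variable names are not from the article.
text = 'h\u00e9llo'                 # str: a sequence of Unicode code points ('héllo')
raw = text.encode('utf-8')          # bytes: an immutable sequence of integers 0..255
buf = bytearray(raw)                # bytearray: a mutable copy of the same bytes

first_byte = raw[0]                 # indexing bytes yields an int, not a 1-char string
buf[0] = ord('H')                   # in-place modification works for bytearray

try:
    raw[0] = ord('H')               # bytes is immutable, so this raises TypeError
    bytes_mutable = True
except TypeError:
    bytes_mutable = False

print(len(text), len(raw))          # 'é' is one character but two UTF-8 bytes
print(first_byte, bytes(buf), bytes_mutable)
```

Note that len() counts code points for str and bytes for bytes, so the two lengths differ as soon as the text contains a non-ASCII character.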
UTF-8 encoding range
Range | Bytes | Storage format
0x0000 ~ 0x007F (0 ~ 127) | 1 byte | 0xxxxxxx
0x0080 ~ 0x07FF (128 ~ 2047) | 2 bytes | 110xxxxx 10xxxxxx
0x0800 ~ 0xFFFF (2048 ~ 65535) | 3 bytes | 1110xxxx 10xxxxxx 10xxxxxx
0x10000 ~ 0x1FFFFF (65536 ~ 2097151) | 4 bytes | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
0x200000 ~ 0x3FFFFFF | 5 bytes | 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
0x4000000 ~ 0x7FFFFFFF | 6 bytes | 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
The 5-byte and 6-byte forms belong to the original UTF-8 design. The current standard (RFC 3629) limits UTF-8 to code points up to 0x10FFFF, so valid UTF-8 never uses more than 4 bytes per character.
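The first four rows of the table can be checked directly in Python. This is a small sketch: the sample characters and the helper name utf8_bits are chosen here for illustration.

```python
# One sample character from each of the 1- to 4-byte ranges (illustrative choice).
samples = ['A', '\u00e9', '\u4e2d', '\U0001F600']   # U+0041, U+00E9, U+4E2D, U+1F600

def utf8_bits(ch):
    """Return the UTF-8 bytes of ch as a list of 8-bit binary strings."""
    return [format(b, '08b') for b in ch.encode('utf-8')]

for ch in samples:
    # code point, number of UTF-8 bytes, and the bit pattern of each byte
    print(hex(ord(ch)), len(ch.encode('utf-8')), utf8_bits(ch))
```

The leading bits of the first byte (0, 110, 1110, 11110) give the length of the sequence, and every continuation byte starts with 10.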
Byte order mark (BOM)
BOM is the abbreviation of byte order mark: the code point U+FEFF placed at the start of a text stream to signal its byte order or encoding.
Encoding Rules
Python does not write a BOM when writing files with the 'utf-8' encoding. However, specifying the encoding 'utf-8-sig' forces Python to write one.
Using 'utf-16-be' will not write a BOM header, but using 'utf-16' will write a BOM header.
>>> open('h.txt','w',encoding='utf-8-sig').write('aaa')
3
>>> open('h.txt','rb').read()
b'\xef\xbb\xbfaaa'
>>> open('h.txt','w',encoding='utf-16').write('bbb')
3
>>> open('h.txt','rb').read()
b'\xff\xfeb\x00b\x00b\x00'
>>> open('hh.txt','w',encoding='utf-16-be').write('ccc')
3
>>> open('hh.txt','rb').read()
b'\x00c\x00c\x00c'
>>> open('h.txt','w',encoding='utf-8').write('ddd')
3
>>> open('h.txt','rb').read()
b'ddd'
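The same rules apply when encoding in memory, which avoids touching the file system. The sketch below compares the results with the BOM constants from the standard codecs module; the variable names are illustrative.

```python
import codecs

# In-memory counterpart of the file-writing session; no files are created.
with_sig = 'aaa'.encode('utf-8-sig')      # BOM followed by the payload
plain = 'aaa'.encode('utf-8')             # payload only
utf16 = 'bbb'.encode('utf-16')            # BOM in the platform's native byte order
utf16_be = 'ccc'.encode('utf-16-be')      # explicit byte order, so no BOM

print(with_sig.startswith(codecs.BOM_UTF8), plain.startswith(codecs.BOM_UTF8))
print(utf16[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))
print(utf16_be)
```

The b'\xff\xfe' prefix in the session above is the little-endian BOM; on a big-endian machine 'utf-16' would write b'\xfe\xff' instead.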
Read rules
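If the file is read with the matching encoding (for example 'utf-8-sig' for a UTF-8 file with a BOM), the BOM is skipped. Otherwise the BOM shows up as garbled characters or decoding raises an exception.
>>> open('h.txt','r').read()  # h.txt starts with a UTF-8 BOM; the platform default encoding here is GBK
'锘縟dd'
>>> open('h.txt','r',encoding='utf-8-sig').read()
'ddd'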
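The read behaviour can be reproduced without depending on the platform default encoding by always passing encoding explicitly. A minimal sketch, using a hypothetical scratch file in a temporary directory:

```python
import os
import tempfile

# Hypothetical scratch file; the name is for illustration only.
path = os.path.join(tempfile.mkdtemp(), 'bom_demo.txt')

with open(path, 'w', encoding='utf-8-sig') as f:
    f.write('ddd')                       # 'utf-8-sig' prepends the bytes EF BB BF

with open(path, 'rb') as f:
    raw = f.read()                       # raw bytes, BOM included

with open(path, 'r', encoding='utf-8') as f:
    plain = f.read()                     # BOM survives as the character U+FEFF

with open(path, 'r', encoding='utf-8-sig') as f:
    stripped = f.read()                  # BOM is recognized and dropped

print(raw, repr(plain), repr(stripped))
```

Reading with 'utf-8' does not fail, but it leaves an invisible U+FEFF at the start of the text, which is a common source of confusing comparison failures.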
Encoding and decoding
>>> ord('中')  # 20013
>>> chr(20013)  # '中'
- Writing Unicode characters directly in a string literal with escape sequences:
'\xhh': represents a character with 2 hexadecimal digits.
'\uhhhh': represents a character with 4 hexadecimal digits.
'\Uhhhhhhhh': represents a character with 8 hexadecimal digits.
>>> s = 'py\x74h\u4e2don'  # 'pyth中on'
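A short sketch tying ord(), chr(), and the three escape forms together. The emoji code point is an arbitrary example of a character above U+FFFF, which needs the 8-digit form.

```python
s = 'py\x74h\u4e2don'          # \x74 is 't', \u4e2d is '中'
wide = '\U0001F600'            # 8-digit escape for a code point above U+FFFF

print(s, len(s))               # the escapes each produce a single character
print(hex(ord('\u4e2d')), chr(0x4e2d) == '\u4e2d')
print(len(wide), hex(ord(wide)))
```

ord() and chr() are inverses of each other, and an escape sequence is just another way of writing the same code point in source code.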
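Conversion between str, bytes, and bytearray
str.encode(encoding='utf-8')  # str -> bytes
bytes(s, encoding='utf-8')  # str -> bytes
bytes.decode(encoding='utf-8')  # bytes -> str
str(b, encoding='utf-8')  # bytes -> str
bytearray(string, encoding='utf-8')  # str -> bytearray
bytearray(bytes)  # bytes -> bytearray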
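The conversions listed above can be exercised as a round trip. The sample string is chosen for illustration.

```python
s = '\u4e2d\u6587'                         # '中文', an illustrative sample

b1 = s.encode(encoding='utf-8')            # str -> bytes
b2 = bytes(s, encoding='utf-8')            # same result via the constructor

back1 = b1.decode(encoding='utf-8')        # bytes -> str
back2 = str(b2, encoding='utf-8')          # same result via the constructor

ba1 = bytearray(s, encoding='utf-8')       # str -> bytearray
ba2 = bytearray(b1)                        # bytes -> bytearray

repr_text = str(b1)                        # no encoding given: returns the repr, not the text

print(b1, back1, ba1 == ba2)
print(repr_text)
```

Calling str() on bytes without an encoding does not decode anything; it returns the printable representation starting with b', which is rarely what is wanted.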
Source file encoding declaration
Python 3 assumes that source files are encoded in UTF-8 by default.
# -*- coding: latin-1 -*-
: Declares that the source file is encoded in latin-1.
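One way to see which encoding Python would pick for a source file is tokenize.detect_encoding from the standard library, which looks for a BOM or a coding declaration in the first two lines. The sketch below feeds it two made-up in-memory sources rather than real files.

```python
import io
import tokenize

# Two made-up source files, held in memory as bytes.
declared = b"# -*- coding: latin-1 -*-\nname = 'caf\xe9'\n"   # 0xE9 is 'é' in latin-1
undeclared = b"name = 'cafe'\n"

# detect_encoding takes a readline callable and returns (encoding, lines_read).
enc_declared, _ = tokenize.detect_encoding(io.BytesIO(declared).readline)
enc_default, _ = tokenize.detect_encoding(io.BytesIO(undeclared).readline)

# latin-1 is reported under its normalized name 'iso-8859-1'.
print(enc_declared, enc_default)
```

The declaration only has an effect when it appears on the first or second line of the file.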
Helper functions
import sys
sys.platform  # 'win32'
sys.getdefaultencoding()  # 'utf-8'
sys.byteorder  # 'little'
s.isalnum()  # s is a string
s.isalpha()
s.isdecimal()
s.isdigit()
s.isnumeric()
s.isprintable()
s.isspace()
s.isidentifier()  # returns True if the string is a valid identifier, such as a variable name
s.islower()
s.isupper()
s.istitle()
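isdecimal(), isdigit(), and isnumeric() look alike but accept progressively wider sets of characters. The sketch below compares them on a few sample characters picked for illustration, and also shows that isidentifier() does not exclude reserved words.

```python
import keyword

# '5', superscript two, vulgar fraction one half, CJK numeral one (illustrative picks)
samples = ['5', '\u00b2', '\u00bd', '\u4e00']

# Map each character to (isdecimal, isdigit, isnumeric).
results = {ch: (ch.isdecimal(), ch.isdigit(), ch.isnumeric()) for ch in samples}
for ch, flags in results.items():
    print(repr(ch), flags)

# isidentifier() checks syntax only; keywords still need a separate check.
word = 'class'
print(word.isidentifier(), keyword.iskeyword(word))
```

For validating user input that should be converted with int(), isdecimal() is usually the safest of the three, since int() rejects characters such as the superscript two.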
Summary
That covers the string types, UTF-8 encoding, the BOM, and conversion between str, bytes, and bytearray in Python. I hope this article helps you in your study or work.