A string is a data type like any other, but it is special because it also raises the question of encoding. This article introduces string operations and Unicode encoding in Python; readers who need this material can use it as a reference.
String type
str
: A Unicode string. Literals written as '' or r'' are all str, and the single quotes can be replaced with double or triple quotes. Whichever form is used, Python stores the string the same way internally.
bytes
: A binary string. Data in other formats, such as JPG files, cannot be represented as str, so bytes is used instead; each element of a bytes object is an integer from 0 to 255. When printed, Python shows the bytes that fall in the ASCII range as ASCII characters, which makes them easier to read. bytes supports almost all the methods of str except format(), and it even works with the re module.
bytearray()
: A mutable binary string whose bytes can be changed in place.
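The differences between the three types can be seen in a minimal sketch like the one below. The variable names are illustrative and not part of the original article.

```python
# Illustrative names only; a sketch of how the three types behave.
text = 'h\u00e9llo'                 # str: a sequence of Unicode code points
raw = text.encode('utf-8')          # bytes: immutable sequence of integers 0-255
buf = bytearray(raw)                # bytearray: mutable counterpart of bytes

print(type(text).__name__, len(text))   # str 5
print(type(raw).__name__, len(raw))     # bytes 6 (the accented letter takes two bytes)
print(raw[0])                           # 104, indexing bytes yields an int

buf[0] = ord('H')                   # in-place modification works on bytearray
print(buf)                          # bytearray(b'H\xc3\xa9llo')

try:
    raw[0] = ord('H')               # bytes does not support item assignment
except TypeError as exc:
    print('bytes is immutable:', exc)
```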
UTF-8 Encoding Range
Range | Number of bytes | Storage format |
0x0000~0x007F (0 ~ 127) | 1 byte | 0xxxxxxx |
0x0080~0x07FF (128 ~ 2047) | 2 bytes | 110xxxxx 10xxxxxx |
0x0800~0xFFFF (2048 ~ 65535) | 3 bytes | 1110xxxx 10xxxxxx 10xxxxxx |
0x10000~0x1FFFFF (65536 ~ 2097151) | 4 bytes | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |
0x200000~0x3FFFFFF | 5 bytes | 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx |
0x4000000~0x7FFFFFFF | 6 bytes | 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx |
Note: the 5-byte and 6-byte forms belong to the original UTF-8 design. Since RFC 3629, UTF-8 is limited to code points up to 0x10FFFF, so at most 4 bytes are used in practice.
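The byte counts in the table can be checked directly from Python. The following is a small sketch; the sample characters are chosen here for illustration, one from each of the first four ranges.

```python
# One sample character per range: 1, 2, 3 and 4 bytes in UTF-8.
samples = ['A', '\u00e9', '\u4e2d', '\U0001F600']
lengths = []

for ch in samples:
    encoded = ch.encode('utf-8')
    lengths.append(len(encoded))
    # Show each byte as 8 binary digits so the leading-bit pattern is visible.
    bits = ' '.join(format(byte, '08b') for byte in encoded)
    print(f'U+{ord(ch):04X}  {len(encoded)} byte(s)  {bits}')

print(lengths)  # [1, 2, 3, 4]
```

The leading bits of the first byte (0, 110, 1110, 11110) tell a decoder how many bytes the character occupies, and every continuation byte starts with 10.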
Byte order mark (BOM)
BOM is the abbreviation for byte order mark, a marker at the very start of a file that identifies its encoding and byte order.
Rules for writing
When writing a file with the 'utf-8' encoding, Python does not write a BOM; specifying 'utf-8-sig' forces Python to write one.
Likewise, 'utf-16-be' does not write a BOM, but 'utf-16' does.
>>> open('h.txt', 'w', encoding='utf-8-sig').write('aaa')
3
>>> open('h.txt', 'rb').read()
b'\xef\xbb\xbfaaa'
>>> open('h.txt', 'w', encoding='utf-16').write('bbb')
3
>>> open('h.txt', 'rb').read()
b'\xff\xfeb\x00b\x00b\x00'
>>> open('hh.txt', 'w', encoding='utf-16-be').write('ccc')
3
>>> open('hh.txt', 'rb').read()
b'\x00c\x00c\x00c'
>>> open('h.txt', 'w', encoding='utf-8').write('ddd')
3
>>> open('h.txt', 'rb').read()
b'ddd'
Rules for Reading
If the correct encoding is specified, the BOM is skipped; otherwise the BOM shows up as garbled characters or an exception is raised.
The session below shows the result when h.txt begins with a UTF-8 BOM and the platform default encoding is GBK:
>>> open('h.txt', 'r').read()
'锘縟dd'
>>> open('h.txt', 'r', encoding='utf-8-sig').read()
'ddd'
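One way to avoid guessing is to inspect the first bytes of the file before choosing a codec. The sketch below assumes a hypothetical helper, sniff_bom, written for this illustration; it relies only on the BOM constants in the standard codecs module and does not handle UTF-32.

```python
import codecs
import os
import tempfile

def sniff_bom(data):
    """Hypothetical helper: return a codec name for a leading BOM, or None.

    UTF-32 is deliberately not handled in this sketch.
    """
    if data.startswith(codecs.BOM_UTF8):
        return 'utf-8-sig'
    if data.startswith(codecs.BOM_UTF16_LE) or data.startswith(codecs.BOM_UTF16_BE):
        return 'utf-16'          # the utf-16 codec reads the BOM to pick the byte order
    return None

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'h.txt')
    with open(path, 'w', encoding='utf-8-sig') as f:
        f.write('ddd')

    with open(path, 'rb') as f:
        detected = sniff_bom(f.read(4))

    # Fall back to plain utf-8 when no BOM is present (an assumption of this sketch).
    with open(path, 'r', encoding=detected or 'utf-8') as f:
        content = f.read()

    # Reading with plain utf-8 keeps the BOM as a U+FEFF character.
    with open(path, 'r', encoding='utf-8') as f:
        with_bom = f.read()

print(detected)          # utf-8-sig
print(repr(content))     # 'ddd'
print(repr(with_bom))    # '\ufeffddd'
```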
Encoding and decoding
>>> ord('中') # 20013
>>> chr(20013) # '中'
'\xhh': two hexadecimal digits represent one character
'\uhhhh': four hexadecimal digits represent one character
'\Uhhhhhhhh': eight hexadecimal digits represent one character
>>> s = 'py\x74h\u4e2don' #'pyth中on'
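The three escape forms can be compared in a short sketch. The emoji code point and the variable names below are chosen for illustration and are not from the original article.

```python
s = 'py\x74h\u4e2don'
print(s)                          # pyth中on
print(len(s))                     # 7, each escape produces exactly one character
print('\x74' == 't')              # True
print(hex(ord('\u4e2d')))         # 0x4e2d

# Code points above 0xFFFF need the eight-digit \U form.
face = '\U0001F600'
print(len(face))                  # 1
print(face.encode('utf-8'))       # b'\xf0\x9f\x98\x80'

# In a raw string the backslash is kept literally, so no escape is processed.
raw = r'\u4e2d'
print(len(raw))                   # 6
```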
Converting between str, bytes and bytearray
str.encode(encoding='utf-8')
bytes(s,encoding='utf-8')
bytes.decode(encoding='utf-8')
str(b, encoding='utf-8')
bytearray(string, encoding='utf-8')
bytearray(bytes)
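The calls listed above can be combined into a round trip. This is a minimal sketch with illustrative variable names; it also shows what happens when the wrong codec is used for decoding.

```python
s = 'pyth\u4e2don'

# str -> bytes: both spellings give the same result.
b1 = s.encode(encoding='utf-8')
b2 = bytes(s, encoding='utf-8')
print(b1)                             # b'pyth\xe4\xb8\xadon'
print(b1 == b2)                       # True

# str or bytes -> bytearray
ba1 = bytearray(s, encoding='utf-8')
ba2 = bytearray(b1)
print(ba1 == ba2)                     # True

# bytes or bytearray -> str
back1 = b1.decode(encoding='utf-8')
back2 = str(b1, encoding='utf-8')
back3 = ba1.decode('utf-8')
print(back1 == back2 == back3 == s)   # True

# Decoding with the wrong codec raises an exception unless errors= says otherwise.
try:
    b1.decode('ascii')
    failed = False
except UnicodeDecodeError:
    failed = True
lossy = b1.decode('ascii', errors='replace')
print(failed, '\ufffd' in lossy)      # True True
```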
Source file encoding declaration
Python 3 assumes that source files are encoded in UTF-8 by default.
# -*- coding: latin-1 -*-
: Declares that the source file is encoded in Latin-1.
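The standard tokenize module can report which encoding Python would pick for a given source file. The sketch below assumes a hypothetical helper, declared_encoding, and uses in-memory bytes in place of real files.

```python
import io
import tokenize

def declared_encoding(source_bytes):
    """Hypothetical helper: the encoding Python would use for this source."""
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding

with_cookie = b"# -*- coding: latin-1 -*-\nname = 'caf\xe9'\n"
without_cookie = b"name = 'cafe'\n"

print(declared_encoding(with_cookie))      # iso-8859-1 (normalized name of latin-1)
print(declared_encoding(without_cookie))   # utf-8

# Decoding the source with the declared encoding recovers the accented letter.
decoded = with_cookie.decode(declared_encoding(with_cookie))
print('caf\u00e9' in decoded)              # True
```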
Helper functions
sys.platform # 'win32'
sys.getdefaultencoding() # 'utf-8'
sys.byteorder # 'little'
s.isalnum() # s is a string
s.isalpha()
s.isdecimal()
s.isdigit()
s.isnumeric()
s.isprintable()
s.isspace()
s.isidentifier() # returns True if the string can be used as a variable name
s.islower()
s.isupper()
s.istitle()
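The three numeric predicates are easy to confuse, because they accept progressively wider sets of Unicode characters. The sample characters in this sketch are chosen for illustration.

```python
import keyword

# (label, character): ASCII digit, superscript two, Roman numeral eight, CJK five
samples = [('digit', '5'), ('superscript', '\u00b2'),
           ('roman', '\u2167'), ('cjk', '\u4e94')]

results = {}
for label, ch in samples:
    results[label] = (ch.isdecimal(), ch.isdigit(), ch.isnumeric())
    print(label, results[label])

# digit       (True, True, True)
# superscript (False, True, True)
# roman       (False, False, True)
# cjk         (False, False, True)

# isidentifier() checks only the syntax; keywords still pass, so test them separately.
print('my_var'.isidentifier())                               # True
print('2x'.isidentifier())                                   # False
print('class'.isidentifier(), keyword.iskeyword('class'))    # True True
```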