I. Character encoding

A computer stores everything as binary numbers, so every character has to be translated into a number:

Characters --------(translation process)-------> Numbers

The standard that defines which number each character corresponds to is called a character encoding.

The history of character encoding

Stage one: Modern computers originated in the United States, so the earliest encoding, ASCII, was designed with only English in mind. ASCII uses 1 byte per character (English letters and all the other characters on the keyboard). 1 byte = 8 bits, and 8 bits can take the values 0 to 2**8-1, that is, 256 different values. ASCII originally used only the low seven bits, giving 128 values (0 to 127), which was already enough for every character on the keyboard. Later, in order to bring Latin characters into the table as well, the highest bit was also put to use.

Stage two: To support Chinese, the Chinese defined GBK, which uses 2 bytes per Chinese character. Other countries likewise defined their own encodings: Japan put Japanese into Shift_JIS, and South Korea put Korean into EUC-KR.

Stage three: With every country having its own standard, conflicts were inevitable, and text that mixed several languages was displayed as garbled characters. The result was Unicode, which (in its original form) uniformly uses 2 bytes per character. 2**16 = 65536 values can represent more than 60,000 characters, enough to cover the world's languages. For text that is entirely English, however, this encoding doubles the storage space (the binary data is ultimately stored on electrical or magnetic media). This led to UTF-8, which uses only 1 byte for an English character and 3 bytes for a Chinese character.

One thing to emphasize:
Unicode: simple and crude. Every character takes 2 bytes. The advantage is fast character-to-number conversion; the disadvantage is that it takes up a lot of space.
UTF-8: precise. Different characters use different lengths. The advantage is that it saves space; the disadvantage is slower character-to-number conversion, because each time the number of bytes the character needs must be worked out before it can be displayed correctly.
- In memory the encoding used is Unicode, trading space for time (a program must be loaded into memory to run, so memory should be as fast as possible).
- On disk or over the network UTF-8 is used: network or disk I/O latency is far greater than the UTF-8 conversion delay, and I/O should save as much bandwidth as possible to keep data transmission stable.
Whatever encoding was used to encode the data must also be used to decode it!
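As a rough illustration of the space trade-off and of the "decode with the same encoding" rule, the sketch below encodes the same characters with several codecs and compares byte counts. It uses `utf-16-le` as a stand-in for a fixed 2-byte Unicode encoding, which is an assumption that only holds for characters in the Basic Multilingual Plane. The helper name `byte_lengths` is made up for this example.

```python
def byte_lengths(ch):
    """Return the number of bytes ch occupies under a few codecs."""
    sizes = {}
    for codec in ('utf-8', 'utf-16-le', 'gbk'):
        sizes[codec] = len(ch.encode(codec))
    return sizes

english = byte_lengths('A')
chinese = byte_lengths('中')
print(english)
print(chinese)

# Decoding with a different codec than the one used for encoding
# either fails or produces garbled text.
gbk_bytes = '中文'.encode('gbk')
try:
    gbk_bytes.decode('utf-8')
    mismatch_failed = False
except UnicodeDecodeError:
    mismatch_failed = True
print(mismatch_failed)
```

Decoding `gbk_bytes` with `'gbk'` again returns the original text, which is the point of the rule above.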
In Python 3, strings are Unicode, which means that Python strings support multiple languages, for example:
>>> print('包含中文的str')
包含中文的str
For a single character, Python provides the ord() function to get the integer representation of the character, and the chr() function to convert that integer back to the corresponding character:
>>> ord('A')
65
>>> ord('中')
20013
>>> chr(66)
'B'
>>> chr(25991)
'文'
If you know the integer code of a character, you can also write the str using hexadecimal escapes:
>>> '\u4e2d\u6587'
'中文'
The two formulations are completely equivalent.
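To make the link between ord(), chr() and the \u escape concrete, here is a small sketch that goes from a character to its code point, to the hexadecimal form used in the escape, and back. The variable names are illustrative only.

```python
ch = '中'
code_point = ord(ch)                   # integer code point of the character
hex_form = format(code_point, '04x')   # four hex digits, as used after \u
rebuilt = chr(int(hex_form, 16))       # back from hex digits to the character

print(code_point, hex_form, rebuilt)
print('\u4e2d\u6587' == '中文')
```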
Python's string type is str, which is represented in memory as Unicode, so one character corresponds to several bytes. To transmit a str over a network or save it to disk, you need to convert it into bytes.
Python writes data of type bytes as single or double quotes prefixed with b:
x = b'ABC'
Be aware of the distinction between 'ABC' and b'ABC': the former is a str, while the latter, although it is displayed the same way, is bytes, in which each character occupies only one byte.
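The difference between the two types shows up as soon as you compare or index them. The following is a minimal sketch, not an exhaustive comparison:

```python
text = 'ABC'    # str: a sequence of Unicode characters
data = b'ABC'   # bytes: a sequence of integers in range 0..255

print(type(text).__name__, type(data).__name__)
print(text == data)       # a str and a bytes object never compare equal
print(text[0], data[0])   # indexing str gives a 1-character str, bytes gives an int
print(list(data))         # the raw byte values
```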
A str, which is represented in Unicode, can be encoded into bytes of a specified encoding with the encode() method, for example:
>>> 'ABC'.encode('ascii')
b'ABC'
>>> '中文'.encode('utf-8')
b'\xe4\xb8\xad\xe6\x96\x87'
>>> '中文'.encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
A str containing only English can be encoded as bytes with ASCII, and the content is unchanged. A str containing Chinese can be encoded as bytes with UTF-8. A str containing Chinese cannot be encoded with ASCII, because Chinese code points fall outside the range of ASCII, and Python raises an error.
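When raising an error is not what you want, encode() also accepts an errors argument. The sketch below shows three of the standard handlers on a mixed string; which one is appropriate depends on the application, and silently dropping characters is rarely a good default.

```python
text = 'ABC中文'

ignored = text.encode('ascii', errors='ignore')             # drop unencodable characters
replaced = text.encode('ascii', errors='replace')           # substitute '?' for each one
escaped = text.encode('ascii', errors='backslashreplace')   # keep them as \uXXXX escapes

print(ignored)
print(replaced)
print(escaped)
```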
In bytes, any byte that cannot be displayed as an ASCII character is shown as \x##.
Conversely, if we read a byte stream from the network or from disk, the data we get is bytes. To turn bytes into a str, use the decode() method:
>>> b'ABC'.decode('ascii')
'ABC'
>>> b'\xe4\xb8\xad\xe6\x96\x87'.decode('utf-8')
'中文'
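Real byte streams are sometimes truncated or corrupted. As a sketch of how decode() behaves in that case, the example below appends one invalid byte to valid UTF-8 data and decodes it with the default strict handling and with two lenient handlers. The byte string is made up for the demonstration.

```python
broken = b'\xe4\xb8\xad\xff'   # valid UTF-8 for '中' followed by an invalid byte

try:
    broken.decode('utf-8')      # default is errors='strict'
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

lenient = broken.decode('utf-8', errors='ignore')    # skip the invalid byte
marked = broken.decode('utf-8', errors='replace')    # insert U+FFFD in its place

print(strict_ok, lenient, marked)
```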
To count how many characters a str contains, use the len() function:
>>> len('ABC')
3
>>> len('中文')
2
For a str, the len() function counts characters; for bytes, it counts bytes:
>>> len(b'ABC')
3
>>> len(b'\xe4\xb8\xad\xe6\x96\x87')
6
>>> len('中文'.encode('utf-8'))
6
As can be seen, 1 Chinese character typically takes 3 bytes when encoded in UTF-8, while 1 English character takes only 1 byte.
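The per-character cost can be inspected directly. The helper below, `utf8_sizes`, is a hypothetical name introduced for this sketch; it pairs each character with its UTF-8 byte length. The last line shows why "typically" is the right word: some characters outside the Basic Multilingual Plane need 4 bytes.

```python
def utf8_sizes(text):
    """Pair each character of text with its UTF-8 byte length."""
    return [(ch, len(ch.encode('utf-8'))) for ch in text]

sample = 'Hi中文'
sizes = utf8_sizes(sample)
total_bytes = len(sample.encode('utf-8'))
emoji_bytes = len('\U0001F600'.encode('utf-8'))   # a character beyond U+FFFF

print(len(sample), total_bytes)
print(sizes)
print(emoji_bytes)
```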
When manipulating strings, we often need to convert between str and bytes. To avoid garbled text, always use UTF-8 when converting between str and bytes.
Python source code is also a text file, so when your source code contains Chinese, be sure to save it as UTF-8. So that the Python interpreter reads the source code as UTF-8, we usually write these two lines at the beginning of the file:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
The first comment line tells Linux/OS X that this is an executable Python program; Windows ignores it. The second comment line tells the Python interpreter to read the source code as UTF-8; otherwise the Chinese output you write in the source code may be garbled.

II. File handling

The modes for opening a file in Python are:
- r, read-only mode [default mode; the file must exist, otherwise an exception is raised]
- w, write-only mode [not readable; created if it does not exist; emptied if it does]
- x, write-only mode [not readable; created if it does not exist; an error is raised if it does]
- a, append mode [not readable; created if it does not exist; content is only appended if it does]
"+" means the file can be read and written at the same time
- r+, read and write [readable, writable]
- w+, write and read [readable, writable]
- x+, write and read [readable, writable]
- a+, write and read [readable, writable]
"b" means to operate on bytes
- rb or r+b
- wb or w+b
- xb or x+b
- ab or a+b
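The behaviour of the modes listed above can be checked with a short experiment. This is a sketch that works in a scratch directory created with tempfile; the file name demo.txt is arbitrary.

```python
import os
import tempfile

workdir = tempfile.mkdtemp()                   # scratch directory for the demo
path = os.path.join(workdir, 'demo.txt')

with open(path, 'x', encoding='utf-8') as f:   # x: create, fail if the file exists
    f.write('first\n')

try:
    open(path, 'x', encoding='utf-8')          # the file exists now, so this fails
    x_failed = False
except FileExistsError:
    x_failed = True

with open(path, 'a', encoding='utf-8') as f:   # a: keep existing content, append
    f.write('second\n')
with open(path, 'r', encoding='utf-8') as f:
    after_append = f.read()

with open(path, 'w', encoding='utf-8') as f:   # w: empty the file, then write
    f.write('third\n')
with open(path, 'r', encoding='utf-8') as f:
    after_write = f.read()

with open(path, 'rb') as f:                    # b: content comes back as bytes
    raw = f.read()

try:
    open(path, 'rb', encoding='utf-8')         # binary mode rejects an encoding
    b_with_encoding_failed = False
except ValueError:
    b_with_encoding_failed = True

print(x_failed, repr(after_append), repr(after_write), type(raw).__name__)
```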
Note: when a file is opened in b mode, the content read is of type bytes, data written must also be bytes, and encoding cannot be specified.

# r mode: the default mode; raises an error if the file does not exist
f = open('a.txt', encoding='utf-8')
print('first-read: ', f.read())
print('second-read: ', f.read())    # empty: the first read() already reached the end
# print(f.readline(), end='')
# print(f.readlines())
# print(f.write('asdfasdfasdfasdfasdfasdf'))    # fails: the file is not writable in r mode
f.close()

# w mode: creates the file if it does not exist, overwrites it if it does
f = open('a.txt', 'w', encoding='utf-8')
f.write('1111111\n22222\n3333\n')
# f.write('2222222\n')
# f.writelines(['11111\n', '22222\n', '3333\n'])
f.close()

# a mode: creates the file if it does not exist; an existing file is not overwritten, writes are appended
f = open('a.txt', 'a', encoding='utf-8')
f.write('\n444444\n')
f.write('5555555\n')
f.close()

# other methods
f = open('a.txt', 'w', encoding='utf-8')
f.write('asdfasdf')
f.flush()                     # flush data held in memory to disk
print(f.name, f.encoding)
print(f.readable())
print(f.writable())
# f.readlines()               # fails: the file is not readable in w mode
f.close()
print(f.closed)               # check whether the file is closed

f = open('c.txt', 'r', encoding='utf-8')
print(f.read(3))
print('first_read: ', f.read())
f.seek(0)
print('second_read: ', f.read())
f.seek(3)
print(f.tell())
print(f.read())
f.close()

f = open('c.txt', 'w', encoding='utf-8')
f.write('1111\n')
f.write('2222\n')
f.write('3333\n')
f.write('444\n')
f.write('5555\n')
f.truncate(3)
f.close()

f = open('a.txt', 'a', encoding='utf-8')
# f.truncate(2)
f.close()

# r w a; rb wb ab
f = open('a.txt', 'rb')
# print(f.read())
print(f.read().decode('utf-8'))
f.close()
f = open('a.txt', 'wb')
# print(f.write('hello'.encode('utf-8')))
f.close()

Context management

We often forget one small step after opening a file: closing it with close(). We can avoid this mistake by opening the file with the with ... as statement:

with open('a.txt', 'w') as f:
    pass

How to batch modify the contents of a file:
import os

# Open the source file and, alongside it, create a file named .a.txt.swp
with open('a.txt', 'r', encoding='utf-8') as read_f, \
        open('.a.txt.swp', 'w', encoding='utf-8') as write_f:
    for line in read_f:
        if 'Alex' in line:    # find the content to be replaced
            line = line.replace('Alex', 'Alexsb')    # replace the old content with the new
        write_f.write(line)   # lines that do not match are written unchanged
os.remove('a.txt')                  # delete the source file
os.rename('.a.txt.swp', 'a.txt')    # rename .a.txt.swp to a.txt, completing the batch modification
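The swap-file approach above can be wrapped in a reusable function. The sketch below is one possible variant, not the original author's code: the function name `replace_in_file`, the `.swp` suffix and the demo file contents are assumptions made for illustration. It uses os.replace instead of a separate remove and rename, so the original is replaced in a single step.

```python
import os
import tempfile

def replace_in_file(path, old, new, encoding='utf-8'):
    """Rewrite path line by line, replacing old with new.

    Returns the number of lines that were changed.
    """
    swap_path = path + '.swp'   # hypothetical name for the temporary file
    changed = 0
    with open(path, 'r', encoding=encoding) as read_f, \
            open(swap_path, 'w', encoding=encoding) as write_f:
        for line in read_f:
            if old in line:
                line = line.replace(old, new)
                changed += 1
            write_f.write(line)
    os.replace(swap_path, path)  # move the new file over the original
    return changed

# Demo on a throwaway file
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, 'a.txt')
with open(demo_path, 'w', encoding='utf-8') as f:
    f.write('alpha\nbeta\nalpha beta\n')

count = replace_in_file(demo_path, 'alpha', 'gamma')
with open(demo_path, 'r', encoding='utf-8') as f:
    result = f.read()
print(count, repr(result))
```

Because both files are opened in one with statement, they are closed before the swap even if an exception interrupts the loop.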
The file object is created with the open function. The following table lists the methods commonly used on file objects:
No. | Method and description
1 | file.close() Closes the file. After closing, the file can no longer be read or written.
2 | file.flush() Flushes the file's internal buffer, writing the buffered data to the file immediately instead of passively waiting for the buffer to be written out.
3 | file.fileno() Returns an integer file descriptor (fd) that can be used in lower-level operations such as the read function of the os module.
4 | file.isatty() Returns True if the file is connected to a terminal device, otherwise False.
5 | file.next() Returns the next line of the file (Python 2; in Python 3 use next(file)).
6 | file.read([size]) Reads the specified number of bytes from the file (characters in text mode); reads everything if size is omitted or negative.
7 | file.readline([size]) Reads an entire line, including the "\n" character.
8 | file.readlines([sizeint]) Reads all lines and returns them as a list. If sizeint > 0 is given, returns lines totalling approximately sizeint bytes; the amount actually read may be larger than sizeint because the buffer has to be filled.
9 | file.seek(offset[, whence]) Sets the file's current position.
10 | file.tell() Returns the file's current position.
11 | file.truncate([size]) Truncates the file to size bytes, counted from the start of the file; with no size, truncates at the current position. Everything after the cut is deleted. On Windows a line break counts as 2 characters.
12 | file.write(str) Writes a string to the file. In Python 3 it returns the number of characters written (Python 2 returned nothing).
13 | file.writelines(sequence) Writes a sequence of strings to the file. No newline is added automatically; if line breaks are needed, include one in each string.
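Several of the methods in the table can be tried together on a temporary file. The sketch below opens the file in binary mode, an assumption made so that positions reported by tell() and used by seek() and truncate() are plain byte offsets on every platform.

```python
import os
import tempfile

fd, path = tempfile.mkstemp()   # throwaway file for the demo
os.close(fd)

with open(path, 'wb') as f:
    written = f.write(b'1111\n2222\n3333\n')   # write() returns the byte count

with open(path, 'rb+') as f:
    first = f.readline()          # one line, including the trailing newline
    pos_after_line = f.tell()     # position just past the first line
    f.seek(0)                     # back to the start of the file
    lines = f.readlines()         # all lines as a list
    f.seek(3)
    f.truncate()                  # no size given: cut at the current position
    f.seek(0)
    remaining = f.read()

os.remove(path)
print(written, first, pos_after_line, lines, remaining)
```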
python-character encoding and file processing