Character encoding and file handling

I. Character encoding

Translating a character into a binary number:

character --------(translation process)-------> number

This translation follows a standard that specifies which number each character corresponds to. That standard is called a character encoding.

History of character encoding

Stage one: Modern computers originated in the United States, so the earliest encoding, ASCII, was designed around English.
ASCII: one byte represents one character (English letters and all other characters on the keyboard). 1 byte = 8 bits, and 8 bits can take the values 0 to 2**8-1, so one byte can represent 256 characters.
ASCII originally used only the low seven bits, giving 128 code points (0 to 127), which was enough for every character on the keyboard. Later, to fit Latin characters into the table, the highest bit was put to use as well.

Stage two: To support Chinese, China defined GBK.
GBK: two bytes represent one Chinese character.
Other countries defined encodings of their own: Japan put Japanese into Shift_JIS, and South Korea put Korean into EUC-KR.

Stage three: Each country had its own standard, so conflicts were inevitable. As a result, text that mixes several languages is displayed as garbled characters.
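The conflict between national standards can be reproduced in a few lines. This is a minimal sketch, assuming Python 3 and its built-in gbk and shift_jis codecs; the sample string and variable names are illustrative only.

```python
# Sketch: bytes written under one national encoding, read back under another.
text = '中文'

# Encode with the Chinese standard (GBK): two bytes per Chinese character.
gbk_bytes = text.encode('gbk')
print(len(gbk_bytes))

# Decoding the same bytes with the Japanese standard yields different
# characters, which is what shows up on screen as garbled text.
garbled = gbk_bytes.decode('shift_jis', errors='replace')
print(garbled)

# Decoding with the matching standard restores the original text.
restored = gbk_bytes.decode('gbk')
print(restored == text)
```

The bytes themselves never change; only the table used to interpret them does.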
This led to Unicode, which in its original form used a uniform 2 bytes per character. 2**16 = 65536, so it can represent more than 60,000 characters and covers the world's languages. (Later versions of Unicode define more code points than fit in 2 bytes, so the fixed 2-byte picture is a simplification.)
For text that is entirely English, however, this encoding doubles the storage space (the binary data ultimately ends up on electrical or magnetic storage media).
This led to UTF-8, which uses only 1 byte for an English character and 3 bytes for a Chinese character.

To emphasize:
unicode: simple and blunt. Every character is 2 bytes. The advantage is fast conversion between characters and numbers; the disadvantage is that it takes up a lot of space.
utf-8: precise. Different characters use different lengths. The advantage is that it saves space; the disadvantage is slower conversion between characters and numbers, because each time the number of bytes needed for the character has to be worked out.

- The encoding used in memory is Unicode, trading space for time (a program must be loaded into memory to run, so memory should be as fast as possible).
- The encoding used on disk or for network transmission is UTF-8. Disk and network I/O latency is far greater than the cost of converting between Unicode and UTF-8, and I/O should save as much bandwidth as possible to keep data transmission stable.

Whatever encoding was used to encode the data must also be used to decode it!
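The space trade-off between a fixed 2-byte encoding and UTF-8 can be measured directly. This is a rough sketch, assuming Python 3; the 'utf-16-le' codec stands in for "2 bytes per character" (an assumption that holds only for characters such as these samples), and the sample strings are illustrative.

```python
# Sketch: compare the storage cost of a 2-byte-per-character encoding with UTF-8.
# 'utf-16-le' is used because it writes no byte-order mark.
samples = {'english': 'hello', 'chinese': '你好'}

sizes = {}
for name, text in samples.items():
    sizes[name] = {
        'chars': len(text),
        'two_byte': len(text.encode('utf-16-le')),
        'utf8': len(text.encode('utf-8')),
    }
    print(name, sizes[name])
```

For the English sample UTF-8 needs half the bytes, while for the Chinese sample it needs 3 bytes per character instead of 2.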
In Python 3, strings are Unicode, which means Python strings support multiple languages, for example:
>>> print('包含中文的str')
包含中文的str
For a single character, Python provides the ord() function to get the character's integer representation, and the chr() function to convert that integer back to the corresponding character:
>>> ord('A')
65
>>> ord('中')
20013
>>> chr(66)
'B'
>>> chr(25991)
'文'
If you know a character's integer code, you can also write it inside a str as a hexadecimal escape:
>>> '\u4e2d\u6587'
'中文'
The two formulations are completely equivalent.
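The equivalence can be checked directly. This is a small sketch, assuming Python 3; the variable names are illustrative.

```python
# Sketch: three ways of writing the same character.
literal = '中'
from_escape = '\u4e2d'     # hexadecimal escape inside a str literal
from_code = chr(0x4e2d)    # code point given as an integer

print(literal == from_escape == from_code)
print(hex(ord(literal)))   # the integer code, shown in hexadecimal
```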
Python's string type is str, which is represented in memory as Unicode, where one character may correspond to several bytes. To send a string over a network or save it to disk, you need to convert the str into bytes.
Python represents data of type bytes with single or double quotation marks prefixed with b:
x = b'ABC'
Be aware of the distinction between 'ABC' and b'ABC': the former is a str, while the latter, although it is displayed the same way, is bytes, in which each character occupies exactly one byte.
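The difference between the two types shows up as soon as the values are compared or indexed. This is a minimal sketch, assuming Python 3; the variable names are illustrative.

```python
# Sketch: the same three letters as str and as bytes.
text = 'ABC'
data = b'ABC'

print(type(text).__name__, type(data).__name__)
print(text == data)                   # a str never compares equal to bytes
print(text[0], data[0])               # indexing str gives a str; indexing bytes gives an int
print(data == text.encode('ascii'))   # equal only after an explicit conversion
```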
A str, which is expressed in Unicode, can be encoded into bytes with a specified encoding through the encode() method, for example:
>>> 'ABC'.encode('ascii')
b'ABC'
>>> '中文'.encode('utf-8')
b'\xe4\xb8\xad\xe6\x96\x87'
>>> '中文'.encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
A str of pure English can be encoded as bytes with ASCII, and the content stays the same. A str containing Chinese can be encoded as bytes with UTF-8. A str containing Chinese cannot be encoded with ASCII, because Chinese code points fall outside the ASCII range, and Python raises an error.
In bytes, any byte that cannot be displayed as an ASCII character is displayed as \x##.
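When raising an error is not acceptable, encode() accepts an errors argument that changes what happens to characters the target encoding cannot represent. The argument is not mentioned in the text above. This is a sketch, assuming Python 3, with an illustrative sample string.

```python
# Sketch: three alternatives to the default errors='strict' behaviour.
text = '中文abc'

ignored = text.encode('ascii', errors='ignore')            # drop unencodable characters
replaced = text.encode('ascii', errors='replace')          # substitute '?' for each one
escaped = text.encode('ascii', errors='backslashreplace')  # keep them as \uXXXX escapes

print(ignored)
print(replaced)
print(escaped)
```

All three lose or alter information compared with encoding to UTF-8, so they are a fallback rather than a fix.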
Conversely, if we read a byte stream from the network or from disk, the data read is bytes. To turn bytes into a str, use the decode() method:
>>> b'ABC'.decode('ascii')
'ABC'
>>> b'\xe4\xb8\xad\xe6\x96\x87'.decode('utf-8')
'中文'
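Bytes that arrive from outside are not always valid in the expected encoding. This is a small sketch, assuming Python 3, of how decode() behaves when one invalid byte is mixed in. The errors argument is not covered in the text above, and the sample data is made up for illustration.

```python
# Sketch: decoding UTF-8 data that contains one invalid byte.
data = '中文'.encode('utf-8') + b'\xff'   # 0xff never appears in valid UTF-8

try:
    data.decode('utf-8')                  # default errors='strict'
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

ignored = data.decode('utf-8', errors='ignore')     # drop the invalid byte
replaced = data.decode('utf-8', errors='replace')   # insert U+FFFD in its place

print(strict_ok, ignored, replaced)
```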
To count how many characters a str contains, use the len() function:
>>> len('ABC')
3
>>> len('中文')
2
The len() function counts characters for a str; given bytes, it counts bytes instead:
>>> len(b'ABC')
3
>>> len(b'\xe4\xb8\xad\xe6\x96\x87')
6
>>> len('中文'.encode('utf-8'))
6
As can be seen, one Chinese character typically takes 3 bytes when UTF-8 encoded, while one English character takes only 1 byte.
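The word "typically" matters because UTF-8 is variable-width. This is a short sketch, assuming Python 3, that prints the encoded length of a few sample characters. The samples are illustrative, and two are written as escapes so the source file's own encoding does not affect the result.

```python
# Sketch: UTF-8 byte counts for characters from different ranges.
samples = ['A', '\u00e9', '中', '\U0001F600']   # Latin letter, accented letter, Chinese, emoji

widths = {ch: len(ch.encode('utf-8')) for ch in samples}
for ch, n in widths.items():
    print(f'U+{ord(ch):04X} {ch!r}: {n} byte(s)')
```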
When manipulating strings we often need to convert between str and bytes. To avoid garbled text, always use UTF-8 for the conversion between str and bytes.
Python source code is itself a text file, so when your source code contains Chinese, be sure to save it as UTF-8. So that the Python interpreter reads the source code as UTF-8, we usually write these two lines at the beginning of the file:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
The first comment tells Linux/OS X that this file is an executable Python program; Windows ignores it. The second comment tells the Python interpreter to read the source code as UTF-8; otherwise, Chinese text written in the source code may come out garbled. (Python 3 already reads source files as UTF-8 by default, so the second line mainly matters for Python 2 or for files saved in another encoding.)

II. File handling

The modes for opening a file are:
- r, read-only mode [default mode; the file must exist, otherwise an exception is raised]
- w, write-only mode [not readable; creates the file if it does not exist; empties the content if it does]
- x, write-only mode [not readable; creates the file if it does not exist; raises an error if it does]
- a, append mode [not readable; creates the file if it does not exist; only appends content if it does]
"+" means the file can be read and written at the same time
- r+, read and write [readable, writable]
- w+, write and read [readable, writable]
- x+, write and read [readable, writable]
- a+, append and read [readable, writable]
"b" means operating on bytes
- rb or r+b
- wb or w+b
- xb or x+b
- ab or a+b
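The behaviour of the modes listed above can be observed on a throwaway file. This is a minimal sketch, assuming Python 3; the temporary directory, the file name demo.txt and the variable names are hypothetical and only there to keep the example self-contained.

```python
import os
import tempfile

# Work in a throwaway directory so no real file is touched.
workdir = tempfile.mkdtemp()
path = os.path.join(workdir, 'demo.txt')   # hypothetical file name

f = open(path, 'x', encoding='utf-8')      # x: create the file, fail if it exists
f.write('first\n')
f.close()

try:
    open(path, 'x', encoding='utf-8')      # the file exists now, so this fails
    x_failed = False
except FileExistsError:
    x_failed = True

f = open(path, 'a', encoding='utf-8')      # a: keep the content, append at the end
f.write('second\n')
f.close()

f = open(path, 'r', encoding='utf-8')      # r: read-only text mode
after_append = f.read()
f.close()

f = open(path, 'w', encoding='utf-8')      # w: empty the file before writing
f.write('third')
f.close()

f = open(path, 'rb')                       # b: raw bytes, no encoding argument
after_write = f.read()
f.close()

print(x_failed, repr(after_append), after_write)
```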
NOTE: When a file is opened in b mode, the content read is of type bytes and the content written must also be bytes; an encoding cannot be specified.

# r mode: the default mode; raises an error if the file does not exist
f = open('a.txt', encoding='utf-8')
print('first-read:', f.read())
print('second-read:', f.read())   # empty string: the first read() consumed the whole file
# print(f.readline(), end='')     # read a single line
# print(f.readlines())            # read all remaining lines into a list
# print(f.write('asdfasdfasdfasdfasdfasdf'))   # fails: the file is not writable in r mode
f.close()

# w mode: creates the file if it does not exist, overwrites it if it does
f = open('a.txt', 'w', encoding='utf-8')
f.write('1111111\n22222\n3333\n')
# f.write('2222222\n')
# f.writelines(['11111\n', '22222\n', '3333\n'])
f.close()

# a mode: creates the file if it does not exist; an existing file is not overwritten, and writes are appended
f = open('a.txt', 'a', encoding='utf-8')
f.write('\n444444\n')
f.write('5555555\n')
f.close()

# Other methods
f = open('a.txt', 'w', encoding='utf-8')
f.write('asdfasdf')
f.flush()                    # flush the data in memory to disk
print(f.name, f.encoding)
print(f.readable())
print(f.writable())
# f.readlines()              # fails: a file opened in w mode is not readable
f.close()
print(f.closed)              # check whether the file is closed

f = open('c.txt', 'r', encoding='utf-8')
print(f.read(3))             # read 3 characters
print('first_read:', f.read())
f.seek(0)                    # move back to the start of the file
print('second_read:', f.read())

f.seek(3)
print(f.tell())
print(f.read())
f.close()

f = open('c.txt', 'w', encoding='utf-8')
f.write('1111\n')
f.write('2222\n')
f.write('3333\n')
f.write('444\n')
f.write('5555\n')
f.truncate(3)                # cut the file down to 3 bytes
f.close()

# f = open('a.txt', 'a', encoding='utf-8')
# f.truncate(2)

# r w a; rb wb ab
f = open('a.txt', 'rb')
# print(f.read())
print(f.read().decode('utf-8'))
f.close()

f = open('a.txt', 'wb')
# print(f.write('hello'.encode('utf-8')))
f.close()

Context management

When working with files we often forget one small step: closing the file with close(). Opening the file with "with ... as" avoids this mistake, because the file is closed automatically when the block ends:

with open('a.txt', 'w') as f:
    pass

How to batch modify the contents of a file:
import os

# Open the source file for reading and create a second file named .a.txt.swp for writing
with open('a.txt', 'r', encoding='utf-8') as read_f, \
        open('.a.txt.swp', 'w', encoding='utf-8') as write_f:
    for line in read_f:
        if 'Alex' in line:                         # find the content to replace
            line = line.replace('Alex', 'Alexsb')  # replace the old content with the new content
        write_f.write(line)                        # lines that do not match are written unchanged

os.remove('a.txt')                 # delete the source file
os.rename('.a.txt.swp', 'a.txt')   # rename .a.txt.swp to a.txt, which completes the batch modification
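The same read, rewrite and swap steps can be wrapped in a reusable function. This is one possible sketch, assuming Python 3. The function name replace_in_file, the '.swp' suffix and the demo file are hypothetical, and os.replace is swapped in for the separate remove-then-rename steps used above.

```python
import os
import tempfile

def replace_in_file(path, old, new, encoding='utf-8'):
    """Rewrite the file at path, replacing old with new on every line."""
    swap_path = path + '.swp'   # hypothetical temporary name next to the original
    with open(path, 'r', encoding=encoding) as read_f, \
            open(swap_path, 'w', encoding=encoding) as write_f:
        for line in read_f:
            write_f.write(line.replace(old, new))
    # os.replace overwrites the destination in one step,
    # so no separate os.remove call is needed.
    os.replace(swap_path, path)

# Usage on a throwaway file
workdir = tempfile.mkdtemp()
demo_path = os.path.join(workdir, 'a.txt')
with open(demo_path, 'w', encoding='utf-8') as f:
    f.write('Alex is here\nnobody else\n')

replace_in_file(demo_path, 'Alex', 'Alexsb')
with open(demo_path, encoding='utf-8') as f:
    result = f.read()
print(result)
```

Because the file is processed line by line, memory use stays small even for large files.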
Python Basics (3): character encoding and file handling