Python Basics (c)----character encoding and file handling

Source: Internet
Author: User

Character encoding and file processing

I. Character encoding

Character encoding is the process of translating a character into a binary number:

character --------(translation process)-------> number

The standard that defines which number each character corresponds to is called a character encoding.

History of character encoding

Stage one: Modern computers originated in the United States, so the earliest encoding, ASCII, was designed around English.
ASCII: one byte represents one character (English letters and all the other characters on the keyboard). 1 byte = 8 bits, and 8 bits can take 2**8 = 256 different values (0 to 2**8-1), so one byte can represent 256 characters.
ASCII originally used only the low seven bits, that is 128 values (0 to 127), which was already enough for all the characters on the keyboard. Later, in order to fit Latin characters into the ASCII table, the highest bit was put to use as well.

Stage two: To support Chinese, China defined GBK, which uses 2 bytes to represent a Chinese character. Other countries defined their own encodings for the same reason: Japan put Japanese into Shift_JIS, and South Korea put Korean into EUC-KR.

Stage three: With every country following its own standard, conflicts were inevitable. As a result, text that mixes several languages is displayed garbled.

Thus Unicode was created. In the fixed-width form described here (UCS-2), every character takes 2 bytes, and 2 bytes can take 2**16 = 65536 different values (0 to 65535). That is enough for more than 60,000 characters, so it can cover many languages at once. Unicode has since grown beyond this range, but the trade-off below still applies.

For text that is English throughout, this encoding doubles the storage space (binary data is ultimately stored as electrical or magnetic states on a storage medium). Thus UTF-8 was created: English characters use only 1 byte, and Chinese characters use 3 bytes.

Points to emphasize:

unicode: simple and crude, all characters are 2 bytes. The advantage is that converting between characters and numbers is fast; the disadvantage is that it takes up a lot of space.
utf-8: precise, different characters use different lengths. The advantage is that it saves space; the disadvantage is that converting between characters and numbers is slower, because each time you have to work out how many bytes the character needs.

    • In memory, the encoding used is Unicode, trading space for time (a program must be loaded into memory to run, so memory should be as fast as possible).
    • On disk or over the network, the encoding used is UTF-8. Disk and network I/O latency is far greater than the delay of converting between Unicode and UTF-8, and I/O should save as much bandwidth as possible to keep data transmission stable.

Whatever encoding was used to encode the data must also be used to decode it!
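The byte counts described above can be checked directly in Python. The following is a minimal sketch, assuming the standard-library codec names gbk, utf-8 and utf-16-le (utf-16-le stands in here for the fixed two-byte form of Unicode that the text describes); the helper name byte_lengths is made up for illustration.

```python
# Compare how many bytes one character needs under different encodings.
# byte_lengths is a hypothetical helper, not part of any library.
def byte_lengths(ch, encodings=('gbk', 'utf-8', 'utf-16-le')):
    """Return {encoding name: number of bytes} for a single character."""
    return {enc: len(ch.encode(enc)) for enc in encodings}

english = byte_lengths('A')
chinese = byte_lengths('中')
print(english)
print(chinese)

# Decoding with a different standard than the one used for encoding
# gives garbled text instead of the original string.
raw = '中文'.encode('gbk')
garbled = raw.decode('utf-8', errors='replace')
restored = raw.decode('gbk')
print(garbled == '中文', restored == '中文')
```

The last two lines illustrate the closing rule of this section: only the codec that encoded the bytes can decode them back to the original text.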

In Python 3, strings are Unicode, which means a Python string can hold text in multiple languages, for example:

>>> print('包含中文的str')
包含中文的str

For a single character, Python provides the ord() function to get the integer representation of the character, and the chr() function to convert that integer back to the corresponding character:

>>> ord('A')
65
>>> ord('中')
20013
>>> chr(66)
'B'
>>> chr(25991)
'文'

If you know the integer encoding of a character, you can also write the character in a str as a hexadecimal \u escape:

>>> '\u4e2d\u6587'
'中文'

The two formulations are completely equivalent.
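As a small check of that equivalence, the sketch below compares the literal form with the escaped form and rebuilds the string from its code points. The variable names are chosen for illustration only.

```python
# The literal form and the \u escaped form produce equal strings.
literal = '中文'
escaped = '\u4e2d\u6587'
print(literal == escaped)

# ord() gives the code point; hex() shows it in the form used after \u.
code_points = [hex(ord(ch)) for ch in literal]
print(code_points)

# chr() goes the other way, from code point back to character.
rebuilt = ''.join(chr(int(cp, 16)) for cp in code_points)
print(rebuilt == literal)
```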

Python's string type is str, which is represented in memory as Unicode, where one character may correspond to several bytes. To transfer a string over a network or save it to disk, you need to convert the str into bytes.

Python writes data of type bytes as a single- or double-quoted literal with a b prefix:

x = b'ABC'

Be aware of the distinction between 'ABC' and b'ABC': the former is a str, while the latter, although it is displayed with the same content, is bytes, in which each character occupies exactly one byte.
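The difference between the two types shows up as soon as you index or compare them. A minimal sketch, with variable names chosen for illustration:

```python
text = 'ABC'      # str: a sequence of Unicode characters
data = b'ABC'     # bytes: a sequence of integers in the range 0-255

print(type(text).__name__, type(data).__name__)

# Indexing a str gives a one-character str; indexing bytes gives an int.
first_char = text[0]
first_byte = data[0]
print(first_char, first_byte)

# The two types do not compare equal, even though they display alike.
same = (text == data)
print(same)
```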

A str, which is expressed in Unicode, can be encoded into bytes in a specified encoding with the encode() method, for example:

>>> 'ABC'.encode('ascii')
b'ABC'
>>> '中文'.encode('utf-8')
b'\xe4\xb8\xad\xe6\x96\x87'
>>> '中文'.encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)

A str of pure English can be encoded to bytes as ASCII, and the content stays the same. A str containing Chinese can be encoded to bytes as UTF-8. A str containing Chinese cannot be encoded as ASCII, because the Chinese code points are outside the range of ASCII, so Python raises an error.
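If raising an error is not what you want, encode() also accepts an errors argument. The sketch below shows the default behaviour next to three of the standard error modes; none of them preserves the Chinese text in ASCII, they only change how the failure is handled.

```python
text = '中文'

# The default error mode is 'strict', which raises UnicodeEncodeError.
try:
    text.encode('ascii')
    failed = False
except UnicodeEncodeError:
    failed = True

# Other error modes avoid the exception at the cost of losing information.
replaced = text.encode('ascii', errors='replace')          # one '?' per character
ignored = text.encode('ascii', errors='ignore')            # characters are dropped
escaped = text.encode('ascii', errors='backslashreplace')  # \uXXXX escape text
print(failed, replaced, ignored, escaped)
```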

In bytes, any byte that cannot be displayed as an ASCII character is shown as \x##.

Conversely, when we read a byte stream from the network or disk, the data read is bytes. To turn bytes into a str, use the decode() method:

>>> b'ABC'.decode('ascii')
'ABC'
>>> b'\xe4\xb8\xad\xe6\x96\x87'.decode('utf-8')
'中文'
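Bytes that arrive from a network or disk may be incomplete or damaged, in which case decode() fails. The following sketch cuts one byte off a valid UTF-8 sequence to simulate that, then shows the errors='ignore' option; the variable names are illustrative.

```python
valid = b'\xe4\xb8\xad\xe6\x96\x87'
broken = valid[:-1]          # cut off the last byte of the second character

# Strict decoding (the default) rejects the incomplete sequence.
try:
    broken.decode('utf-8')
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# errors='ignore' silently drops the bytes that cannot be decoded.
partial = broken.decode('utf-8', errors='ignore')
print(decode_failed, len(partial))
```

Ignoring errors hides data loss, so it is usually better kept for cases where a partial result is acceptable.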

To count how many characters a str contains, use the len() function:

>>> len('ABC')
3
>>> len('中文')
2

For a str, the len() function counts characters; for bytes, the len() function counts bytes:

>>> len(b'ABC')
3
>>> len(b'\xe4\xb8\xad\xe6\x96\x87')
6
>>> len('中文'.encode('utf-8'))
6

As can be seen, one Chinese character typically takes 3 bytes when UTF-8 encoded, while one English character takes only 1 byte.
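The word "typically" matters: UTF-8 is a variable-length encoding, and other characters take 2 or 4 bytes. A minimal sketch, using a hypothetical helper named utf8_size:

```python
def utf8_size(text):
    """Return (number of characters, number of UTF-8 bytes) for text."""
    return len(text), len(text.encode('utf-8'))

# 6 ASCII letters plus 2 Chinese characters.
chars, size = utf8_size('Python中文')
print(chars, size)

# Other characters fall between or beyond these two cases.
accented = utf8_size('\u00e9')       # e with acute accent
emoji = utf8_size('\U0001F600')      # an emoji outside the 16-bit range
print(accented, emoji)
```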

We often need to convert between str and bytes when manipulating strings. To avoid garbled text, always use UTF-8 for conversions between str and bytes.

Python source code is also a text file, so when your source code contains Chinese, be sure to save it as UTF-8. So that the Python interpreter reads the source code as UTF-8, we usually write these two lines at the beginning of the file:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

The first comment line tells Linux/OS X that this file is a Python executable; Windows ignores it. The second comment line tells the Python interpreter to read the source code as UTF-8; otherwise the Chinese text you write in the source code may come out garbled. (Python 3 already assumes UTF-8 for source files by default, so the declaration mainly matters for Python 2 or for files saved in another encoding.)

II. File handling

The modes for opening a file in Python are:
    • r, read-only mode "default mode; the file must exist, otherwise an exception is raised"
    • w, write-only mode "not readable; creates the file if it does not exist; empties the content if it does"
    • x, write-only mode "not readable; creates the file if it does not exist; raises an error if it does"
    • a, append mode "not readable; creates the file if it does not exist; only appends content if it does"
"+" means the file can be read and written at the same time
    • r+, read and write "readable, writable"
    • w+, write and read "readable, writable"
    • x+, write and read "readable, writable"
    • a+, write and read "readable, writable"
"b" means the file is operated on as bytes
    • rb or r+b
    • wb or w+b
    • xb or x+b
    • ab or a+b
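The behaviour of the modes listed above can be observed in a throwaway directory. The following is a minimal sketch under stated assumptions: it uses tempfile so that no real file is touched, and the file name demo.txt and the results dictionary are made up for illustration.

```python
import os
import tempfile

results = {}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo.txt')   # hypothetical file name

    # 'r' requires the file to exist.
    try:
        open(path, 'r', encoding='utf-8')
    except FileNotFoundError:
        results['r_missing'] = 'FileNotFoundError'

    # 'x' creates the file, and fails if it is already there.
    with open(path, 'x', encoding='utf-8') as f:
        f.write('first\n')
    try:
        open(path, 'x', encoding='utf-8')
    except FileExistsError:
        results['x_existing'] = 'FileExistsError'

    # 'a' keeps the old content and writes at the end.
    with open(path, 'a', encoding='utf-8') as f:
        f.write('second\n')
    with open(path, 'r', encoding='utf-8') as f:
        results['after_append'] = f.read()

    # 'w' empties the file before writing, and is not readable.
    with open(path, 'w', encoding='utf-8') as f:
        f.write('third')
        results['w_readable'] = f.readable()

    # 'rb' returns bytes and takes no encoding argument.
    with open(path, 'rb') as f:
        results['binary'] = f.read()

print(results)
```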
NOTE: When a file is opened in b mode, the content read is of type bytes, the content written must also be bytes, and encoding cannot be specified.

# r mode, the default mode: raises an error if the file does not exist
f = open('a.txt', encoding='utf-8')
print('first-read:', f.read())
print('second-read:', f.read())   # empty: the first read() already reached the end
# print(f.readline(), end='')     # read a single line
# print(f.readlines())            # read the remaining lines into a list
# f.write('asdfasdfasdfasdfasdfasdf')   # fails: r mode is not writable
f.close()

# w mode: creates the file if it does not exist, overwrites it if it does
f = open('a.txt', 'w', encoding='utf-8')
f.write('1111111\n22222\n3333\n')
# f.writelines(['11111\n', '22222\n', '3333\n'])
f.close()

# a mode: creates the file if it does not exist; an existing file is not
# overwritten, and written content is appended at the end
f = open('a.txt', 'a', encoding='utf-8')
f.write('\n444444\n')
f.write('5555555\n')
f.close()

# Other methods
f = open('a.txt', 'w', encoding='utf-8')
f.write('asdfasdf')
f.flush()                         # flush the data in memory to disk
print(f.name, f.encoding)
print(f.readable())
print(f.writable())
f.close()
print(f.closed)                   # check whether the file is closed
# f.readlines()                   # raises ValueError: the file is closed

# Moving the file pointer (c.txt must already exist)
f = open('c.txt', 'r', encoding='utf-8')
print(f.read(3))                  # in text mode, read(3) reads 3 characters
print('first_read:', f.read())
f.seek(0)                         # move back to the start of the file
print('second_read:', f.read())
f.seek(3)
print(f.tell())
print(f.read())
f.close()

# truncate
f = open('c.txt', 'w', encoding='utf-8')
f.write('1111\n')
f.write('2222\n')
f.write('3333\n')
f.write('444\n')
f.write('5555\n')
f.truncate(3)                     # keep only the first 3 bytes of the file
f.close()

# f = open('a.txt', 'a', encoding='utf-8')
# f.truncate(2)

# Text modes: r w a; byte modes: rb wb ab
f = open('a.txt', 'rb')
# print(f.read())
print(f.read().decode('utf-8'))
f.close()

f = open('a.txt', 'wb')
# print(f.write('hello'.encode('utf-8')))
f.close()

Context management

We often forget one small step after opening a file: closing it with close(). Opening the file with a with ... as statement avoids this mistake, because the file is closed automatically when the block ends:

with open('a.txt', 'w') as f:
    pass

How to batch modify the contents of a file:

import os

# Open the source file for reading and create a new file named .a.txt.swp
with open('a.txt', 'r', encoding='utf-8') as read_f, \
        open('.a.txt.swp', 'w', encoding='utf-8') as write_f:
    for line in read_f:
        if 'Alex' in line:  # find the content to be replaced
            # replace the old content with the new content
            line = line.replace('Alex', 'Alexsb')
        write_f.write(line)  # lines that do not match are written unchanged
os.remove('a.txt')  # delete the source file
os.rename('.a.txt.swp', 'a.txt')  # rename .a.txt.swp to a.txt, which completes the batch modification
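The same write-then-swap idea can be wrapped in a reusable function. The following is a sketch, not the author's method: the function name replace_in_file is made up, it uses tempfile.mkstemp instead of a fixed .swp name, and it swaps the files with os.replace rather than os.remove plus os.rename. One assumption to be aware of: the rewritten file takes on the temporary file's permissions.

```python
import os
import tempfile

def replace_in_file(path, old, new, encoding='utf-8'):
    """Rewrite the file at path with every occurrence of old replaced by new."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write the new content to a temporary file in the same directory.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix='.swp')
    try:
        with os.fdopen(fd, 'w', encoding=encoding) as write_f, \
                open(path, 'r', encoding=encoding) as read_f:
            for line in read_f:
                write_f.write(line.replace(old, new))
        # os.replace overwrites the target in one step, so no separate
        # os.remove call is needed.
        os.replace(tmp_path, path)
    except BaseException:
        # On any failure, clean up the temporary file and re-raise.
        os.remove(tmp_path)
        raise

# Usage on a throwaway file so that nothing real is modified.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, 'a.txt')
    with open(target, 'w', encoding='utf-8') as f:
        f.write('Alex\nBob\nAlex again\n')
    replace_in_file(target, 'Alex', 'Alexsb')
    with open(target, 'r', encoding='utf-8') as f:
        result = f.read()
    leftovers = os.listdir(tmp)
print(result)
```

Reading line by line keeps memory use low for large files, which is the main reason to prefer this approach over reading the whole file into one string.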
