Python-based character encoding and file manipulation

Last Update:2017-07-25 Source: Internet

Author: User

I. Background knowledge for understanding character encoding

1. How a computer runs a program or reads a file

The CPU does not read data directly from the hard disk, because hard disk reads and writes are far slower than the CPU and such I/O would hold it back. Instead, the CPU reads data from memory, which is much faster. Program files and text files, however, are stored on the hard disk so that they persist. Running a program or reading a file therefore generally works like this: the operating system first directs the hard disk to load the program file or text file into memory, and the CPU then reads the data from memory and either executes it or outputs it to the terminal for display on the screen.

2. How a text editor reads a file

2.1 The text editor's program file is read into memory and run

2.2 The text file is read into the text editor's memory space

2.3 The text editor prints the contents of the text file from memory to the terminal

3. How the Python interpreter works

3.1 The Python interpreter's program file is read into memory and run

3.2 The Python program file is read into the Python interpreter's memory space

3.3 The Python interpreter interprets and executes the code in the Python program file

So the text editor and the Python interpreter read a Python program file in the same way, and before it is executed a Python program file is no different from an ordinary text file. The difference lies in the third step: the text editor reads the file in order to print it to the terminal, whereas the Python interpreter reads it in order to interpret and execute the Python code it contains.

II. What is character encoding

People recognize the characters in a file, but a computer can only work with binary numbers. Character encoding is what makes the characters usable by the computer: it maps each character to a specific number. Once that character-to-number correspondence exists, the numbers only need to be written in binary for the computer to handle the characters that people can read.
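The character-to-number mapping described above can be seen with Python's built-in ord and chr functions. This is only a small illustration; the variable names are chosen for the example.

```python
# Each character corresponds to a number (its code point).
letter = "A"
number = ord(letter)            # character -> number
binary = format(number, "08b")  # the same number written as 8 binary digits
restored = chr(number)          # number -> character

print(letter, number, binary, restored)
```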

Character encoding comes into play in only two scenarios:

1. When text files, including program source files, are saved or read

2. When a string in a program file is defined and used

III. Unicode and UTF-8

At first the only character encoding was ASCII, designed for English letters and some special symbols, with 1 byte per character. As computers became widespread, each country created its own character encoding. Unicode appeared in order to give characters a single standard, covering the languages of all countries, so that programs from every country could work correctly. Early Unicode encodings stored every character in 2 bytes (later versions need up to 4 bytes for some characters). Fixed-width storage doubles the space needed for English text, so UTF-8 was introduced to save that space: it stores each English character in 1 byte and other characters in 2 to 4 bytes, with common Chinese characters taking 3 bytes.

A fixed-width Unicode encoding uses the same number of bytes for every character, so no time is spent working out how wide each character is and conversion is faster. For that reason, characters in memory are stored in a Unicode format.

UTF-8 is the format used for hard disk storage and network transmission, because it saves space when storing English, and the time spent working out each character's width is far smaller than hard disk I/O time and network latency.
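To make the space trade-off concrete, the sketch below compares how many bytes one English letter and one Chinese character take in UTF-8 and in a 2-byte Unicode encoding. Here utf-16-le is used as the stand-in for the 2-byte form; that choice is an assumption made for the example.

```python
# Compare encoded sizes of an English letter and a Chinese character.
sizes = {}
for ch in ("A", "中"):
    utf8_len = len(ch.encode("utf-8"))       # variable width
    utf16_len = len(ch.encode("utf-16-le"))  # 2 bytes for these characters
    sizes[ch] = (utf8_len, utf16_len)
    print(ch, "utf-8:", utf8_len, "bytes,", "utf-16-le:", utf16_len, "bytes")
```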

IV. Converting between character encodings

4.1. Conversion relationship

Unicode ---encode---> other encoding formats

Other encoding formats ---decode---> Unicode

1) encode and decode must use the same character encoding; otherwise the result is garbled

2) When calling encode or decode, you need to state which character encoding to use; here the character encoding means the encoding format used by the file
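A minimal sketch of the two rules above in Python 3, using GBK as the "other encoding format". Decoding with the matching encoding restores the text, while decoding these particular bytes as UTF-8 fails outright.

```python
text = "你好"

# Unicode --encode--> GBK bytes
data = text.encode("gbk")

# GBK bytes --decode--> Unicode, using the same encoding
same = data.decode("gbk")

# Decoding with a different encoding does not give the text back;
# for these bytes UTF-8 decoding raises an error.
try:
    data.decode("utf-8")
    mismatch_failed = False
except UnicodeDecodeError:
    mismatch_failed = True

print(same == text, mismatch_failed)
```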

4.2. Default encoding

In Python 3, source files are read as UTF-8 by default, and strings are Unicode by default

In Python 2, source files are read as ASCII by default. The default string type is the encoded bytes type, and the encoding defaults to ASCII; prefixing a string with u, as in u'Hello', makes it a Unicode string

The default character encoding on Simplified Chinese Windows systems is GBK
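The defaults on a given machine can be inspected from Python 3 itself. The sketch below uses sys.getdefaultencoding and locale.getpreferredencoding from the standard library; the second value depends on the operating system and its settings, so no particular result is assumed for it.

```python
import locale
import sys

# Default used by str.encode() / bytes.decode() when no encoding is given.
string_default = sys.getdefaultencoding()

# Encoding preferred by the operating system's locale settings.
platform_default = locale.getpreferredencoding(False)

print("str/bytes default:", string_default)
print("platform default:", platform_default)
```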

4.3. Garbled text

Because a file is handled in two ways, saved from memory to the hard disk and read from the hard disk into memory, garbled text can arise in two scenarios

1) Garbled when the file is saved

If the file is saved with an unsuitable character encoding, its characters cannot be stored correctly. The file is garbled, and a file garbled this way is actually corrupted. For example, this happens when the contents are Chinese but a Japanese character encoding is used to save them.

2) Garbled when the file is read

If the file is read with a character encoding different from the one used to save it, the file does not display properly. This kind of garbling can be fixed: simply read the file again with the correct character encoding.
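The two scenarios can be imitated in memory with encode and decode. In this sketch ASCII stands in for "an unsuitable encoding" at save time (an assumption made to keep the example simple), and GBK stands in for the wrong encoding at read time.

```python
text = "你好"

# Scenario 1: saving with an encoding that cannot represent the characters.
# errors="replace" substitutes "?" for each character, so the original is lost.
lossy = text.encode("ascii", errors="replace")
unrecoverable = lossy.decode("ascii")

# Scenario 2: saved correctly as UTF-8 but read back as GBK.
saved = text.encode("utf-8")
garbled = saved.decode("gbk")   # wrong encoding: displays as garbled text
fixed = saved.decode("utf-8")   # correct encoding: the text is intact

print(unrecoverable, garbled, fixed)
```

Only the second case can be repaired, because the bytes on disk are still correct.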

4.4. Specifying the encoding format of a Python program file

A Python file may be saved in some other encoding format. To keep the interpreter from reading it with the default encoding and raising an error, tell the interpreter in advance which encoding the file uses. The simple way is to write a declaration on the first line of the file (or on the second line, if the first is a #! line): # coding: followed by the file's encoding format, for example # coding: gbk
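As a rough illustration of how such a declaration is picked up, the standard library's tokenize.detect_encoding reads the first lines of a source file and reports the declared encoding. The source bytes below are a made-up two-line program saved as GBK.

```python
import io
import tokenize

# A hypothetical source file saved as GBK, with a coding declaration on line 1.
source_bytes = "# coding: gbk\nname = '你好'\n".encode("gbk")

# detect_encoding takes a readline callable over the raw bytes.
declared, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)

# With the declared encoding known, the bytes decode to the intended text.
source_text = source_bytes.decode(declared)

print(declared)
print(source_text)
```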

4.5. Character encoding during program execution

Memory uses Unicode throughout, so every file is converted to Unicode when it is read into memory. When the program reaches a string definition, it allocates new space in memory to store the string. The string is usually stored in Unicode format, but other formats can be specified

4.6. The difference between Python 2 and Python 3

Python 2 has two ways to define a string. 1) name = 'Alex': the string is stored in memory as the bytes type, already encoded, because in Python 2 bytes and str are the same type. The encoding used is the one declared in the file header or, if none is declared, Python 2's default, ASCII. 2) name = u'Alex': a string defined this way is stored in memory in Unicode format.

Python 3 also has two forms. 1) name = 'Alex': a string defined this way is stored directly in Unicode format. 2) s = 'Alex' followed by s1 = s.encode('utf-8'): s1 is data of the bytes type in Python 3.
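The Python 3 half of this comparison can be checked directly. The sketch below reuses the names s and s1 from the text; back is an extra name added for the example.

```python
s = "Alex"               # str: Unicode text
s1 = s.encode("utf-8")   # bytes: the text encoded as UTF-8

print(type(s).__name__, type(s1).__name__)

# bytes must be decoded before they can be used as text again
back = s1.decode("utf-8")
print(back == s)
```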

V. Conclusions about character encoding

1. Decode with the same encoding that was used to encode

2. The bytes type holds data encoded from Unicode; to view its contents as text, decode it back into Unicode

3. Data stored on the hard disk or sent over the network must be of the bytes type

VI. File operations

1) Opening a file

Python opens files with the open function, as in open(r'file path', 'file open mode', encoding='character encoding used by the file'). The three parameters are:

File path: the path of the file to operate on. The path is usually prefixed with r so that backslash sequences such as \t and \n are treated as ordinary characters rather than as escapes

File open mode: the common modes are listed below. If no mode is specified, the file is opened in read-only mode.

r: read-only mode. The contents can be viewed but not changed, and the cursor starts at the beginning of the file

w: write-only mode. Content can be written but not viewed, the original contents of the file are emptied, and the cursor starts at the beginning of the file

a: append mode. Content can only be appended at the end of the file and cannot be viewed; the cursor starts at the end of the file

rb, wb, ab: the three modes above with b added, which means the file is handled in binary form. No encoding format needs to be considered, and these modes work for any file

Character encoding used by the file: the character encoding in which the file was saved. If it is not specified, Python falls back to a platform-dependent default (often UTF-8, but GBK on Simplified Chinese Windows, for example), so it is safest to state it explicitly

Opening a file returns a file handle, which needs to be assigned to a variable for subsequent operations on the file.
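A short sketch of the open call in text and binary modes. The file name demo.txt and the temporary folder are hypothetical, chosen so the example can run anywhere without touching existing files.

```python
import os
import tempfile

# Hypothetical file inside a fresh temporary folder.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

# "w": write-only text mode with an explicit encoding.
f = open(path, "w", encoding="utf-8")
f.write("你好\n")
f.close()

# "r": read-only text mode; the same encoding is used to read the file back.
f = open(path, "r", encoding="utf-8")
text = f.read()
f.close()

# "rb": binary mode returns raw bytes and takes no encoding argument.
f = open(path, "rb")
raw = f.read()
f.close()

print(repr(text), raw)
```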

2) Reading a file

f.read(): returns the entire contents of the file as a string. A length can also be given, as in f.read(3), which reads from the current position (the start of a newly opened file) through the third character

f.readline(): each call reads just one line of the file and returns it as a string

f.readlines(): returns the entire contents of the file as a list, with each line of the file as a string element of the list

Printing the file contents with a for loop:

for line in f:
    print(line)

This is the most memory-friendly way to read a large file, because the file is read one line at a time instead of being loaded into memory all at once
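The reading methods above can be tried together on a small file. The file name and its three lines are made up for this sketch.

```python
import os
import tempfile

# Hypothetical three-line file to read from.
path = os.path.join(tempfile.mkdtemp(), "read_demo.txt")
f = open(path, "w", encoding="utf-8")
f.write("first\nsecond\nthird\n")
f.close()

f = open(path, encoding="utf-8")
head = f.read(3)             # the first three characters
rest_of_line = f.readline()  # the remainder of the current line
remaining = f.readlines()    # every line that is left, as a list
f.close()

# Iterating over the file object yields one line at a time.
f = open(path, encoding="utf-8")
lines = [line.rstrip("\n") for line in f]
f.close()

print(head, repr(rest_of_line), remaining, lines)
```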

3) Writing a file

f.write(): writes a string to the file. No line break is added, so include \n where a new row should start

f.writelines(): writes a list of strings to the file, for example f.writelines(['1111\n', 'aaaa\n', 'dddd\n']) writes the three rows 1111, aaaa and dddd. Line breaks are not added automatically, so without the \n the three strings would end up on one row
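A minimal sketch of both write methods, assuming a throwaway file in a temporary folder. Each string carries its own \n because neither method adds one.

```python
import os
import tempfile

# Hypothetical output file.
path = os.path.join(tempfile.mkdtemp(), "write_demo.txt")

f = open(path, "w", encoding="utf-8")
written = f.write("header\n")                  # returns the number of characters written
f.writelines(["1111\n", "aaaa\n", "dddd\n"])   # writes each string in order
f.close()

f = open(path, encoding="utf-8")
rows = f.read().splitlines()
f.close()

print(written, rows)
```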

4) Closing a file

f.close(): asks the operating system to close the file. Afterwards the variable f still exists, but the file can no longer be operated on through it
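The effect of closing can be observed through the closed attribute of the file object and the error raised by any later operation. The file used here is a hypothetical temporary one.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "close_demo.txt")  # hypothetical file

f = open(path, "w", encoding="utf-8")
f.write("data")
f.close()

is_closed = f.closed  # the variable f still exists after close()

# Any further operation on the closed file raises ValueError.
try:
    f.write("more")
    error = None
except ValueError as exc:
    error = str(exc)

print(is_closed, error)
```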

5) Other file operations

1. Context management

To guard against programmers forgetting to close a file, Python provides the with statement, which closes the file automatically once the operations inside it finish. A single with statement can also open several files at once, for example:

with open('a.txt', 'r', encoding='utf-8') as read_f, \
        open('a.txt.swap', 'w') as write_f:
    pass  # read from read_f and write to write_f here
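A runnable version of the same pattern, as a sketch. The two files live in a temporary folder, and the upper-casing step is only there to give the loop something to do.

```python
import os
import tempfile

folder = tempfile.mkdtemp()
src = os.path.join(folder, "a.txt")       # hypothetical input file
dst = os.path.join(folder, "a.txt.swap")  # hypothetical output file

with open(src, "w", encoding="utf-8") as f:
    f.write("hello\nworld\n")

# Two files opened by one with statement; both are closed when the block ends.
with open(src, "r", encoding="utf-8") as read_f, \
        open(dst, "w", encoding="utf-8") as write_f:
    for line in read_f:
        write_f.write(line.upper())

both_closed = read_f.closed and write_f.closed

with open(dst, encoding="utf-8") as f:
    copied = f.read()

print(both_closed, repr(copied))
```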

2. Cursor movement

f.seek(): moves the cursor to a specified position, for example f.seek(0) moves the cursor to the beginning of the file

f.tell(): returns the current position of the cursor

f.truncate(): truncates the file, keeping only the content before the specified position. For example, f.truncate(3) keeps the first three bytes of the file and clears the rest. Positions are counted in bytes, and the file must be opened in a writable mode
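The three cursor operations can be sketched on a small file opened in rb+ mode. Binary mode is used here so that positions are plain byte offsets; the file name and contents are made up for the example.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "cursor_demo.bin")  # hypothetical file
with open(path, "wb") as f:
    f.write(b"abcdef")

# "rb+" allows both reading and writing without emptying the file.
with open(path, "rb+") as f:
    first = f.read(2)     # reading moves the cursor forward
    position = f.tell()   # current cursor position in bytes
    f.seek(0)             # back to the beginning of the file
    again = f.read(2)
    f.truncate(3)         # keep only the first three bytes

with open(path, "rb") as f:
    remaining = f.read()

print(first, position, again, remaining)
```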

3. Forcing in-memory data to be written to the hard disk

f.flush(): hands the data buffered in memory to the operating system for writing to the hard disk immediately, instead of waiting for the buffer to fill or the file to be closed
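The buffering that flush bypasses can be observed by reading the file through a second handle before and after the call. This sketch assumes the default buffering and a write small enough to stay in the buffer; the file name is hypothetical.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "flush_demo.txt")  # hypothetical file

writer = open(path, "w", encoding="utf-8")
writer.write("hello")  # small write: held in the in-memory buffer

with open(path, encoding="utf-8") as reader:
    before = reader.read()  # what a second reader sees before flush()

writer.flush()  # push the buffered data out now

with open(path, encoding="utf-8") as reader:
    after = reader.read()   # what a second reader sees after flush()

writer.close()
print(repr(before), repr(after))
```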

4. File modification

Modifying the content of a file in Python generally works like this: open the source file and also open a new blank file; read the source file line by line, writing unchanged lines to the blank file as they are and writing changed lines in their modified form; finally delete the source file and rename the new file to the source file's name.
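The procedure above can be sketched as follows. The file names, the sample lines and the replacement of alex with ALEX are all assumptions made for the example; os.remove and os.rename perform the final delete-and-rename step.

```python
import os
import tempfile

folder = tempfile.mkdtemp()
source = os.path.join(folder, "a.txt")  # hypothetical source file
swap = source + ".swap"                 # hypothetical blank file

with open(source, "w", encoding="utf-8") as f:
    f.write("alex is here\nnothing to change\n")

# Read the source line by line and write each (possibly modified) line
# to the new file.
with open(source, "r", encoding="utf-8") as read_f, \
        open(swap, "w", encoding="utf-8") as write_f:
    for line in read_f:
        write_f.write(line.replace("alex", "ALEX"))

# Delete the source file, then give the new file the original name.
os.remove(source)
os.rename(swap, source)

with open(source, encoding="utf-8") as f:
    result = f.read()

print(result)
```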

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only.
