Python programming (c) character encoding and file processing

Last Update:2017-06-13 Source: Internet

Author: User

A computer must be powered on to work: electricity drives it, and the signal it carries is either a high or a low voltage level (a high level is the binary digit 1, a low level is the binary digit 0). In other words, the computer only understands numbers.

The purpose of programming is to make the computer do work, yet a program is simply a sequence of characters. What we want, then, is for a sequence of characters to drive the computer.

So the characters have to go through a translation step:

character --------(translation process)-------> number

This translation follows a standard that maps each character to a specific number. Such a standard is called a character encoding.
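The character-to-number mapping can be seen directly with Python's built-in ord() and chr(). This is only a minimal sketch; the sample character 'A' is chosen for illustration and does not come from the article.

```python
# Map a character to its number and back using Python built-ins.
ch = 'A'
code_point = ord(ch)                 # character -> number
binary = format(code_point, '08b')   # the same number written as 8 binary digits
restored = chr(code_point)           # number -> character

print(ch, code_point, binary, restored)
```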

Character encoding: a standard that maps characters to binary numbers

Stage One:
ASCII: one byte represents one character (English letters and symbols)
1 byte = 8 bits, and 8 bits can represent 2**8 = 256 values
Originally only the low 7 bits were used, giving 128 characters; the highest bit was reserved

Stage Two:
China defined GBK
2 bytes represent one Chinese character

Stage Three:
With many languages, each with its own encoding, mixed text came out garbled
Unicode was created to unify them, using 2 bytes per character, which gives 2**16 = 65,536 code points (Unicode has since grown beyond this range, so a fixed 2 bytes no longer covers every character)

For English text, this fixed width doubles the space needed, which is wasteful
So UTF-8 was created: 1 byte for an English character, 3 bytes for a Chinese character

Unicode (as used in this article): every character takes 2 bytes
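The byte counts above can be checked by encoding one English and one Chinese character. In this sketch 'utf-16-le' stands in for the fixed 2-byte Unicode the article describes; that substitution, and the sample characters, are assumptions made for illustration.

```python
samples = ['a', '中']
encodings = ['gbk', 'utf-16-le', 'utf-8']

# Number of bytes each character occupies under each encoding.
sizes = {
    (ch, enc): len(ch.encode(enc))
    for ch in samples
    for enc in encodings
}

for (ch, enc), n in sizes.items():
    print(f'{ch!r} in {enc}: {n} byte(s)')
```

Note that GBK stores an ASCII character such as 'a' in a single byte; only the Chinese character takes two.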

Using character encodings correctly:
In memory, text is handled as Unicode
Saving a file: data is flushed from memory to the hard disk
Reading a file: data is read from the hard disk into memory
1. Before a file is executed: whatever encoding was used to store the file must also be used to read it
2. While the file is executed: encoding matters for the string data type
x = 'hello'    # in Python 3 a string is Unicode by default
A Unicode string can be encoded to bytes, e.g. x.encode('gbk') in Python 3
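A small sketch of both points: a Python 3 str is encoded to bytes, and a file written with one encoding is read back with the same one. The temporary path and the sample text are assumptions chosen for illustration.

```python
import os
import tempfile

x = '你好'                  # a Python 3 str holds Unicode text
data = x.encode('gbk')      # str -> bytes, using GBK

# Store and read back with the same encoding (rule 1 above).
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
with open(path, 'w', encoding='gbk') as f:
    f.write(x)
with open(path, 'r', encoding='gbk') as f:
    restored = f.read()

print(type(x).__name__, type(data).__name__, restored == x)
```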

Note:

Unicode (fixed width): simple and blunt. Every character takes 2 bytes in the scheme described here. The advantage is fast conversion between characters and numbers; the disadvantage is that it takes up more space.

UTF-8: precise. Different characters use different lengths. The advantage is that it saves space; the disadvantage is slower conversion between characters and numbers, because each time the decoder has to work out how many bytes the character occupies.

Every program eventually has to be loaded into memory. Programs are saved to hard disks in different countries in different encodings, but once in memory they must be compatible with every language (this is why one computer can run programs from any country), so memory uniformly uses Unicode. You may ask: could memory not use UTF-8, since it also covers every language? It could, and it would work. The reason it is not used is that fixed-width Unicode is more efficient to process than UTF-8 (Unicode here uses a fixed 2 bytes per character, while UTF-8 lengths have to be calculated). Unicode does waste more space, so this is a trade of space for time. When storing to the hard disk or transmitting over the network, Unicode is converted to UTF-8, because smaller data is transmitted more efficiently and more reliably.

In memory the encoding is Unicode, trading space for time (a program has to be loaded into memory to run, so memory access should be as fast as possible).
On the hard disk or over the network the encoding is UTF-8: network and disk I/O latency is far larger than the cost of UTF-8 conversion, and I/O should save as much bandwidth as possible to keep data transmission stable.

Garbled text:

Case one: the file is already garbled when it is stored

Suppose the document contains text from several countries, and we save it using only Shift_JIS.

In essence, Shift_JIS has no mapping for the other countries' characters, so storing them fails. This can be tested with the open function: f = open('a.txt', 'w', encoding='shift_jis')

f.write('你\nて\n')  # '你' cannot be saved because Shift_JIS has no mapping for it; writing only 'て\n' succeeds

But when we save from a text editor, the editor does the conversion for us and makes sure the Chinese text is stored with Shift_JIS anyway (a forced save, which is bound to be garbled). The file is therefore garbled from the storage stage onward.

When we then open the file with Shift_JIS, the Japanese displays normally while the Chinese is garbled.
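The storage-stage failure can be reproduced without a file by encoding directly. This is a sketch under assumptions: the sample character '这' is my own choice of a simplified Chinese character, and errors='replace' is used only as a rough stand-in for an editor's forced save, not as a description of what any particular editor does.

```python
text = '这\nて\n'   # '这' is simplified Chinese, 'て' is Japanese kana

# Strict encoding refuses characters that Shift_JIS cannot represent.
try:
    text.encode('shift_jis')
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# Forcing the save replaces the unsupported character with '?', so the
# original character is gone from the stored bytes.
forced = text.encode('shift_jis', errors='replace')
reopened = forced.decode('shift_jis')

print(strict_failed, repr(reopened))
```

After the forced save, no choice of decoder can bring the Chinese character back, which is why the article treats this case as data corruption.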

Case two: the file is stored correctly but garbled when read

The file is saved with UTF-8, which is compatible with every language, so nothing is garbled on disk. If it is then read with the wrong decoding, such as GBK, the text is garbled at the reading stage. Garbling at the reading stage can be fixed: choose the correct decoding and the text is fine. A file that was garbled when stored, however, is a form of data corruption.
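The read-stage case can be sketched the same way: the stored bytes are intact, and only the decoder differs. The sample text is an assumption chosen for illustration.

```python
text = '你好'
stored = text.encode('utf-8')        # the bytes on disk are intact

# Wrong decoder: the bytes are misread, but nothing on disk has changed.
# errors='replace' keeps this sketch from raising on invalid sequences.
wrong = stored.decode('gbk', errors='replace')

# Right decoder: the original text comes back unchanged.
right = stored.decode('utf-8')

print(repr(wrong), repr(right))
```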

Summary:

Whatever editor you use, the goal is to keep files from being garbled (note that a file containing code is just an ordinary file until it is executed; the garbling meant here is what we see when opening a file before it is executed).

The core rule: whatever encoding a file was stored in must also be used to open it.

File processing:

To open a file, you specify the file path and the mode in which to open it. Opening returns a file handle, and all later operations on the file go through that handle.

The modes for opening a file are:

r, read-only mode [default; the file must exist, otherwise an exception is thrown]
w, write-only mode [not readable; created if it does not exist; existing content is emptied]
x, write-only mode [not readable; created if it does not exist; an error is raised if it exists]
a, append mode [not readable; created if it does not exist; content is only appended at the end]

"+" means the file can be read and written at the same time

r+, read and write [readable, writable]
w+, write and read [readable, writable]
x+, write and read [readable, writable]
a+, append and read [readable, writable]

"b" means the file is handled as bytes

rb or r+b
wb or w+b
xb or x+b
ab or a+b

Note: when a file is opened with b, the content read is of type bytes, bytes must be supplied when writing, and encoding cannot be specified
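The behaviour of several of these modes can be observed on a scratch file. This is a minimal sketch; the temporary path and file contents are assumptions chosen for illustration.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'modes.txt')

with open(path, 'x', encoding='utf-8') as f:   # create; fails if present
    f.write('one\n')

try:
    open(path, 'x')                            # the file exists now
    x_failed = False
except FileExistsError:
    x_failed = True

with open(path, 'a', encoding='utf-8') as f:   # append to the end
    f.write('two\n')

with open(path, 'r', encoding='utf-8') as f:   # text mode returns str
    as_text = f.read()

with open(path, 'rb') as f:                    # binary mode returns bytes
    as_bytes = f.read()

with open(path, 'w', encoding='utf-8'):        # 'w' empties the file on open
    pass
size_after_w = os.path.getsize(path)

print(x_failed, repr(as_text), as_bytes, size_after_w)
```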

How flush works:

Reading a file means the software reads it from the hard disk into memory.
Writes to a file are likewise first stored in a buffer. Memory is much faster than the hard disk; if every write went straight from memory to the hard disk, the speed gap between the two would be paid on every write, which is inefficient. So data headed for the hard disk is first collected in a small area of memory, the buffer, and after some time the operating system flushes the buffer's contents to the hard disk.
flush forces the data written so far out of the buffer toward the hard disk.
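The effect of flushing can be observed by checking the file size before and after. A sketch under assumptions: the temporary path is illustrative, and the os.fsync() call is an extra step beyond what the article describes, since in Python f.flush() hands the data to the operating system and os.fsync() asks the operating system to write it to disk.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'flush.txt')

f = open(path, 'w', encoding='utf-8')
f.write('hello')
size_before = os.path.getsize(path)   # typically 0: data is still buffered

f.flush()                             # hand Python's buffer to the OS
os.fsync(f.fileno())                  # ask the OS to write it to disk
size_after = os.path.getsize(path)

f.close()
print(size_before, size_after)
```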

1. open() syntax

open(file[, mode[, buffering[, encoding[, errors[, newline[, closefd=True]]]]]])
The open function has many parameters; the commonly used ones are file, mode, and encoding
file: the file path, given as a quoted string
mode: the mode in which the file is opened; see section 3 below
buffering: takes one of three kinds of value, 0, 1, or >1. 0 turns buffering off (binary mode only), 1 selects line buffering (text mode only), and a value >1 sets the buffer size in bytes
encoding: the encoding used to decode or encode the file's data, typically utf8 or gbk (text mode only)
errors: usually strict or ignore. With strict, a character encoding problem raises an error; with ignore, the problem is skipped and the program carries on
newline: may be None, '', '\n', '\r', or '\r\n', and controls how line endings are handled; valid only in text mode
closefd: relates to the file argument and defaults to True, in which case file may be a file name. When it is False, file must be a file descriptor. A file descriptor is a non-negative integer; on Unix-like systems the kernel returns one when a file is opened.
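A short sketch of the errors, newline, and file-descriptor behaviour described above. The temporary path and the deliberately invalid byte 0xff are assumptions chosen for illustration.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'params.txt')
with open(path, 'wb') as f:
    f.write(b'ab\xffcd\r\n')          # 0xff is never valid in UTF-8

# errors='strict' raises on the bad byte.
try:
    with open(path, encoding='utf-8', errors='strict') as f:
        f.read()
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# errors='ignore' drops the bad byte; newline='' keeps '\r\n' untranslated.
with open(path, encoding='utf-8', errors='ignore', newline='') as f:
    ignored = f.read()

# file may also be an integer descriptor, here obtained from os.open().
fd = os.open(path, os.O_RDONLY)
with open(fd, 'rb') as f:             # closefd=True closes fd on exit
    raw = f.read()

print(strict_failed, repr(ignored), raw)
```

Silently dropping bytes with errors='ignore' loses data, so it is usually worth reserving for cases where that loss is acceptable.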

2. The difference between file() and open() in Python
Both can open a file and operate on it, and both have similar usage and parameters, but there is an essential difference between them. file is the file class, so opening a file with file() constructs an instance of that class, whereas open() opens the file through Python's built-in function. open() is recommended. (file() exists only in Python 2; it was removed in Python 3.)

3. Basic values of the parameter mode

Character	Meaning
'r'	open for reading (default)
'w'	open for writing, truncating the file first
'a'	open for writing, appending to the end of the file if it exists
'b'	binary mode
't'	text mode (default)
'+'	open a disk file for updating (reading and writing)
'U'	universal newline mode (for backwards compatibility; should not be used in new code)

r, w, and a are the basic modes for opening a file, corresponding to read-only, write-only, and append.
b, t, +, and U are combined with the basic modes above to select binary mode, text mode, read-and-write mode, and universal newlines; combine them as the situation requires.

Common mode value combinations

r or rt	default mode, text-mode read
rb	binary read
w or wt	text-mode write; the file's contents are emptied on open
wb	binary write; the file's contents are also emptied
a	append mode; can only write at the end of the file
a+	read/write mode; writes can only go to the end of the file
w+	read/write mode; unlike a+, it empties the file's contents on open
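The difference between a+ and w+ can be seen on a file that already has content. A sketch under assumptions: the temporary path and contents are illustrative, and the append-after-seek behaviour shown for a+ is what Unix-like systems do; other platforms may differ.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'plus.txt')
with open(path, 'w', encoding='utf-8') as f:
    f.write('old\n')

# a+ keeps existing content; on Unix-like systems writes land at the
# end of the file even after seeking back to the start.
with open(path, 'a+', encoding='utf-8') as f:
    f.seek(0)
    f.write('new\n')
    f.seek(0)
    after_a_plus = f.read()

# w+ empties the file on open, then allows both writing and reading.
with open(path, 'w+', encoding='utf-8') as f:
    f.write('fresh\n')
    f.seek(0)
    after_w_plus = f.read()

print(repr(after_a_plus), repr(after_w_plus))
```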

# Open (and create or empty) a file, then close it automatically
with open('a.txt', 'w') as f:
    pass

# Copy a.txt to b.txt using two file handles in one with statement
with open('a.txt', 'r') as read_f, open('b.txt', 'w') as write_f:
    data = read_f.read()
    write_f.write(data)

# Modify a file: write the changed lines to a swap file, then replace the original
import os

with open('a.txt', 'r', encoding='utf-8') as read_f, \
        open('.a.txt.swap', 'w', encoding='utf-8') as write_f:
    for line in read_f:
        if line.startswith('hello'):
            line = 'hahaha\n'
        write_f.write(line)

os.remove('a.txt')
os.rename('.a.txt.swap', 'a.txt')

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
