20. File Processing in Python


We usually follow the same logic when working with files: open the file, work on it, then save and close it.

In Python this becomes the following steps: create a file object, operate on the file object (read, write, and so on), and close the file.

File operations differ between Python 2.x and Python 3.x; in 3.x, for example, open accepts more parameters.

So note that everything below applies to Python 2.x, more precisely Python 2.7.

Each step is analyzed below:

1. Create a File object

There are two ways to create a file object. The first is the factory function file(name[, mode[, buffering]]) -> file object; the other is the built-in function open(name[, mode[, buffering]]) -> file object.

There is no intrinsic difference between the two; in fact, open is implemented in terms of file. The official Python documentation recommends open for creating file objects, so we follow that recommendation, and the demonstrations below are based on open.

When we use open to create a file object, we pass some parameters. name is required: it is a string giving the file name, which can be an absolute or a relative path. mode is a string indicating the mode in which the file object is created; different modes affect the file differently. Here is a summary table of the modes:

File mode   Action
r           Open read-only (default)
rU or U     Open for reading with universal newline support (PEP 278)
w           Open for writing (emptying the file if necessary)
a           Open in append mode (starting at EOF, creating the file if necessary)
r+          Open for reading and writing
w+          Open for reading and writing (see w)
a+          Open for reading and writing (see a)
rb          Open in binary read mode
wb          Open in binary write mode (see w)
ab          Open in binary append mode (see a)
rb+         Open in binary read-write mode (see r+)
wb+         Open in binary read-write mode (see w+)
ab+         Open in binary read-write mode (see a+)

The following explains these modes further:

r: As the name implies, a file object created in this mode can only be read, not written. An IOError exception is raised if the file to be opened does not exist.

w: Open in write-only mode: only writing is possible, not reading. If the file exists, it is emptied and then opened; if it does not exist, it is created and then opened. This mode is relatively dangerous because, whether or not the file existed, you always end up with an empty file. Writing then starts from scratch, that is, the file pointer is at the beginning of the file.

a: Open in append mode: the file cannot be read, but it can be written. If the file exists, it is opened and the file pointer is moved to the end of the file, so anything newly written lands at the very end. If the file does not exist, a new file is created and writing starts from the beginning. Unlike w, an existing file is not emptied; content is only appended.

r+: r itself cannot write, but once extended to r+ it can. Opening with r+ does not empty the file, and the pointer starts at the very beginning when writing. For example, if the file contains 123456 and I write the string 'abc' with r+, the file content becomes abc456. As with r, an exception is raised when the file does not exist.

w+, a+: w and a cannot read; with the + added they can. Their other behavior is unchanged.

U: Universal newline support. Different platforms use different characters to mark the end of a line, for example \n, \r, or \r\n. Code written to handle one kind of line ending will not behave the same on other platforms, and writing a handler for every platform is too cumbersome. So Python 2.3 introduced Universal NEWLINE Support (UNS). When you open a file with the 'U' flag, every line terminator, whatever it is, is replaced with a single newline character (\n) when returned by Python's input methods (for example, read()). ('rU' mode also supports the 'rb' option.) The feature also handles files that mix different kinds of line terminators; the newlines attribute of the file object records the line terminators it has "seen" so far.

b: Binary mode. On Linux there is no difference between text mode and binary mode, so the flag is optional there; but if you really need binary mode, write it explicitly so the code stays cross-platform. When writing in this mode you can write not only strings but also buffer objects.
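The behavior of the w, r+, and a modes described above can be sketched as follows. This is a minimal illustration rather than part of the original article: the scratch file name is hypothetical, it lives in a temporary directory, and the code is written so that it should run on Python 2.7 as well as Python 3.

```python
import os
import tempfile

# Hypothetical scratch file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'modes.txt')

# 'w' creates the file (or empties an existing one) and writes from the start.
f = open(path, 'w')
f.write('123456')
f.close()

# 'r+' keeps the content; writing starts at the beginning and overwrites.
f = open(path, 'r+')
f.write('abc')
f.close()

f = open(path, 'r')
after_rplus = f.read()
f.close()

# 'a' keeps the content; writing always lands at the end of the file.
f = open(path, 'a')
f.write('XYZ')
f.close()

f = open(path, 'r')
after_append = f.read()
f.close()

print(after_rplus)   # abc456
print(after_append)  # abc456XYZ
```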

  

  buffering indicates the buffering method used when accessing the file. 0 means no buffering, 1 means line buffering (only one line of data is buffered), and any other value greater than 1 means a buffer of approximately that size in bytes. If the parameter is omitted or negative, the system default is used: line buffering for teletype-like (tty) devices and normal buffering for other devices. In general, the system default is fine.
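As a small sketch of the buffering parameter, the following opens a file with buffering set to 0 and checks that the data has reached the operating system while the file is still open. The file name is hypothetical, and binary mode is an assumption made here so that the same code also runs on Python 3, which only allows unbuffered access for binary files.

```python
import os
import tempfile

# Hypothetical scratch file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'buffering.bin')

# The third argument is buffering; 0 asks for an unbuffered file.
f = open(path, 'wb', 0)
f.write(b'hello')

# The file is still open, yet the bytes are already visible on disk.
size_while_open = os.path.getsize(path)
f.close()

print(size_while_open)
```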

  

  Finally, to summarize: creating a file object does not read the content of the file, which differs from what "opening a file" means in everyday life. In Python, opening a file only obtains a handle to it, that is, an entry point to the file; the content still has to be read into Python's memory, which is known as input.

2. Input

Input means reading the content of the file into Python. There are several methods.

1. read([size]) -> read at most size bytes, returned as a string.

If size (in bytes) is negative or not given, the file is read to EOF, that is, the end of the file, and a string containing all of the content (including newline characters) is returned. Note that in non-blocking mode, less data than requested may be returned even when a size is given.

This method reads everything in one go, so loading a 1 GB file occupies about 1 GB of memory; it is not suitable for reading large files.
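One common way around the memory problem is to pass a size to read() and process the file piece by piece. The sketch below is an illustration under assumptions: the helper name read_in_chunks, the 4096-byte chunk size, and the scratch file are all hypothetical choices, not something the article prescribes.

```python
import os
import tempfile

def read_in_chunks(path, chunk_size=4096):
    """Yield the file's content in pieces of at most chunk_size bytes."""
    f = open(path, 'rb')
    try:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:      # an empty result means EOF
                break
            yield chunk
    finally:
        f.close()

# Build a 10000-byte scratch file to read back.
path = os.path.join(tempfile.mkdtemp(), 'big.bin')
f = open(path, 'wb')
f.write(b'x' * 10000)
f.close()

sizes = [len(chunk) for chunk in read_in_chunks(path)]
print(sizes)  # [4096, 4096, 1808]
```

Only one chunk is held in memory at a time, so the memory cost no longer depends on the size of the file.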

f = open('test.txt', 'r')
a = f.read()
f.close()
print a
print repr(a)

[Figure: file contents]

[Figure: code output]

Note the line breaks here: Enter was pressed three times in the file, so the string contains three newline characters.

The repr() function shows the raw byte representation of a string, so here we use it to see how the content read from the file is encoded.

First, the source file is declared as: # coding=utf-8

Then the source file is declared as: # coding=gbk

The returned string is the same regardless of the encoding declaration. This shows that the encoding declaration does not affect the encoding of a string read from a file.

So we write a file encoded in UTF-8:

#!/usr/bin/env python
# coding=utf-8
# The text means "first line / second line / third line".
f = open('test1.txt', 'w')
f.write('第一行\n第二行\n第三行\n')
f.close()

Reading it back:

The result is the same as above.

Then we write a file in GBK:

#!/usr/bin/env python
# coding=gbk
# The text means "first line / second line / third line".
f = open('test2.txt', 'w')
f.write('第一行\n第二行\n第三行\n')
f.close()

Reading it back:

This time the result is different.

  This illustrates one point: the encoding of a string read from a file object has nothing to do with the Python encoding declaration; it depends only on the encoding used when the file itself was saved.

  Python's encoding declaration only affects string literals created in the Python source. I used that fact above to create strings in different encodings in Python and save them to files, which is why the bytes in the two files differ. I hope this does not leave you dizzy.
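The point can be checked without relying on an encoding declaration at all. The sketch below is an illustration, not the article's own code: it encodes the same Unicode text in two ways, writes each to a hypothetical scratch file, and reads the raw bytes back. Unicode escapes are used so that the snippet itself needs no declaration and should run on Python 2.7 and Python 3 alike.

```python
import os
import tempfile

# "first line" in Chinese followed by a newline, written with escapes.
text = u'\u7b2c\u4e00\u884c\n'

tmp = tempfile.mkdtemp()
for name, codec in (('utf8.txt', 'utf-8'), ('gbk.txt', 'gbk')):
    f = open(os.path.join(tmp, name), 'wb')
    f.write(text.encode(codec))   # the bytes on disk depend on the codec
    f.close()

f = open(os.path.join(tmp, 'utf8.txt'), 'rb')
utf8_bytes = f.read()
f.close()

f = open(os.path.join(tmp, 'gbk.txt'), 'rb')
gbk_bytes = f.read()
f.close()

# Same text, different bytes: 3 bytes per character in UTF-8, 2 in GBK.
print(repr(utf8_bytes))
print(repr(gbk_bytes))
```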

With that in mind, let us look at how to convert the encoding of a file in Python.

Encoding conversion actually operates on strings, using the built-in string methods:

S.decode([encoding[, errors]]) -> object

The method returns the decoded string.

encoding: the encoding to use, such as 'utf-8'. If omitted, the default string encoding is used (normally ASCII in Python 2). errors: sets the error-handling scheme. The default is 'strict', which means that decoding errors raise a UnicodeError. Other possible values are 'ignore', 'replace', and any name registered through codecs.register_error().

code example:

f = open('test2.txt', 'r')
a = f.read()
f.close()
b = a.decode(encoding='gbk')
print b
print repr(b)

As you can see, a Unicode string is obtained after decoding.
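The errors parameter described above can be sketched with a byte string that is deliberately not valid UTF-8. This is only an illustration; the sample bytes are made up for the purpose, and the code should behave the same on Python 2.7 and Python 3.

```python
data = b'abc\xff'   # \xff can never appear in valid UTF-8

# errors defaults to 'strict', so a bad byte raises a UnicodeError.
try:
    data.decode('utf-8')
    strict_failed = False
except UnicodeError:
    strict_failed = True

ignored = data.decode('utf-8', 'ignore')     # the bad byte is dropped
replaced = data.decode('utf-8', 'replace')   # U+FFFD takes its place

print(strict_failed)
print(repr(ignored))
print(repr(replaced))
```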

Next we'll look at how to encode.

S.encode([encoding[, errors]]) -> object

The method returns the encoded string.

encoding: the encoding to use, such as 'utf-8'. If omitted, the default string encoding is used (normally ASCII in Python 2). errors: sets the error-handling scheme. The default is 'strict', which means that encoding errors raise a UnicodeError. Other possible values are 'ignore', 'replace', 'xmlcharrefreplace', 'backslashreplace', and any name registered through codecs.register_error().

code example:

f = open('test2.txt', 'r')
a = f.read()
f.close()
b = a.decode(encoding='gbk')
c = b.encode(encoding='utf-8')
print c
print repr(c)

The result is the UTF-8 encoded string, which completes the conversion GBK --> Unicode --> UTF-8. This shows what it means for Unicode to act as a bridge between encodings inside Python.

f = open('test2.txt', 'r')
a = f.read()
f.close()
b = a.decode(encoding='gbk')
c = b.encode(encoding='utf-8')
f = open('test2.txt', 'w')
f.write(c)
f.close()

The code above converts the encoding of the file. There is certainly room for optimization; it serves here only to demonstrate the logic.
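One possible tidier version of the same GBK to UTF-8 conversion is sketched below. It swaps in io.open, which the article does not use; io.open is available from Python 2.6 onward and decodes and encodes on the fly, so the same code should also run on Python 3. The function name convert_encoding and the scratch file are hypothetical.

```python
import io
import os
import tempfile

def convert_encoding(path, source='gbk', target='utf-8'):
    """Rewrite the file at path from the source encoding to the target one."""
    with io.open(path, 'r', encoding=source) as f:
        text = f.read()            # decoded into a Unicode string
    with io.open(path, 'w', encoding=target) as f:
        f.write(text)              # encoded again while writing

# Prepare a GBK-encoded scratch file ("first line" in Chinese).
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with io.open(path, 'wb') as f:
    f.write(u'\u7b2c\u4e00\u884c'.encode('gbk'))

convert_encoding(path)

with io.open(path, 'rb') as f:
    converted = f.read()
print(repr(converted))
```

Like the original demonstration, this reads the whole file into memory, so it is only suitable for files of moderate size.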

Next we continue with the input methods.

2. readline([size]) -> next line from the file, as a string.

Reads one line at a time, keeps the newline character, and returns a string. size (default -1, meaning read up to the line terminator) limits the maximum number of bytes read when it is non-negative; if it is smaller than the actual length of the line, an incomplete line is returned, and the rest of the line is not lost because the next call continues from where this one stopped. At EOF, an empty string is returned. With a for loop you can step through the file line by line. Because a string is returned, string methods such as decode and encode can be applied to the result.

f = open('test2.txt', 'r')
for x in range(2):
    print f.readline()

Because the print statement appends a newline by default and each line read from the file already ends with one, two line breaks appear and the output is double-spaced. A trailing comma suppresses print's default newline.

f = open('test2.txt', 'r')
for x in range(2):
    print f.readline(),

Also note one pitfall; do not write the loop like this:

f = open('test2.txt', 'r')
for x in f.readline():  # reads one line, then iterates over each byte of it
    print x             # prints a single byte; a Chinese character takes 3 bytes in UTF-8, so the output is garbled
    print repr(x)       # check that it really is a single byte

  Reading one line at a time naturally keeps memory consumption low, but it is hard to know in advance how many lines the file has, so this is not an ideal way to traverse a file.
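When the number of lines is unknown, a readline() loop can rely on the empty string that is returned at EOF. The following is a minimal sketch with a hypothetical scratch file; it should run on Python 2.7 and Python 3.

```python
import os
import tempfile

# Hypothetical three-line scratch file.
path = os.path.join(tempfile.mkdtemp(), 'lines.txt')
f = open(path, 'w')
f.write('one\ntwo\nthree\n')
f.close()

lines = []
f = open(path, 'r')
while True:
    line = f.readline()
    if line == '':          # an empty string means EOF was reached
        break
    lines.append(line.rstrip('\n'))
f.close()

print(lines)  # ['one', 'two', 'three']
```

A blank line in the file comes back as '\n' rather than '', so it does not end the loop early.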

3. readlines([size]) -> list of strings, each a line from the file.

Unlike the other two input methods, this method does not return a string: it reads all (remaining) lines and returns them as a list of strings. Its optional parameter size is a hint for the approximate total size to return. If it is greater than 0, the returned lines total approximately size bytes (possibly somewhat more, because the amount is rounded up to fill an internal buffer; for example, if the buffer size can only be a multiple of 4 KB and size is 15 KB, 16 KB may be returned).

Because a list of all lines is returned, it can be iterated over directly; note the difference from the readline() loop above.

f = open('test2.txt', 'r')
for x in f.readlines():
    print x,

  However, it also reads the entire file in one go, so memory consumption is again very large.

A more efficient alternative is the xreadlines() method, which takes no arguments. It behaves like a generator: each step returns one line, so the file is read line by line as you iterate. It has been deprecated since Python 2.3 in favor of iterating over the file object.

However, even that is not the best approach. The best way is to iterate directly over the file object:

f = open('test2.txt', 'r')
for x in f:
    print x,

Iterators and file iteration were introduced in Python 2.2: file objects became their own iterators, which means that users can iterate over each line of a file in a for loop without calling a read*() method. Alternatively, we can use the iterator's next method: file.next() reads the next line of the file. As with other iterators, Python raises a StopIteration exception once all lines have been iterated over. The for loop calls next automatically and handles the exception raised when the iteration completes, so iterating directly over the file object is the best choice.
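As an illustration of direct file iteration, the sketch below counts lines and finds the longest one without ever holding more than one line in memory. The with statement used here is an addition not covered by the article (available from Python 2.6); it closes the file automatically. The scratch file is hypothetical.

```python
import os
import tempfile

# Hypothetical three-line scratch file.
path = os.path.join(tempfile.mkdtemp(), 'lines.txt')
with open(path, 'w') as f:
    f.write('one\ntwo\nthree\n')

line_count = 0
longest = ''
with open(path, 'r') as f:
    for line in f:                     # one line per step, newline included
        line_count += 1
        stripped = line.rstrip('\n')
        if len(stripped) > len(longest):
            longest = stripped

print(line_count)  # 3
print(longest)     # three
```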

Another method, now deprecated, is readinto(), which reads a given number of bytes into a writable buffer object, the same type of object returned by the deprecated buffer() built-in function. (Because buffer() is no longer supported, readinto() is deprecated.)

3. Output

1. write(str) -> None. Write string str to file.

Writes a string to the file; this is not demonstrated again here.

2. writelines(sequence_of_strings) -> None. Write the strings to the file.

Accepts a list of strings as its parameter and writes them to the file. Line terminators are not added automatically, so if needed, you must add one to the end of each line before calling writelines().

Note that there is no writeline() method, because it would be equivalent to calling write() with a single-line string that ends with a line terminator.
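The need to add line terminators yourself can be seen in a short sketch. The row values and the scratch file names are made up for illustration; the code should run on Python 2.7 and Python 3.

```python
import os
import tempfile

rows = ['first', 'second', 'third']
tmp = tempfile.mkdtemp()

# Without terminators, the strings are simply written back to back.
f = open(os.path.join(tmp, 'plain.txt'), 'w')
f.writelines(rows)
f.close()

# Appending '\n' to each string gives one row per line.
f = open(os.path.join(tmp, 'lines.txt'), 'w')
f.writelines(row + '\n' for row in rows)
f.close()

f = open(os.path.join(tmp, 'plain.txt'), 'r')
without_terminators = f.read()
f.close()

f = open(os.path.join(tmp, 'lines.txt'), 'r')
content = f.read()
f.close()

print(repr(without_terminators))  # 'firstsecondthird'
print(repr(content))              # 'first\nsecond\nthird\n'
```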

4. Movement of pointers within files

  The file pointer is comparable to the cursor we see in an editor: it marks the position at which our operations take place. For example, when we open a file with r+ and write something, the cursor is at the beginning of the file, so the write in the earlier example overwrote the original characters. In 'a' mode the cursor sits at the end of the file after opening, so whatever is written in 'a' mode is appended to the end of the file.

1. seek(offset[, whence]) -> None. Move to new file position.

  Moves the cursor offset bytes from the position given by whence: 0 (the default) is the start of the file, 1 is the current position (determined by the open mode or by earlier cursor movements), and 2 is the end of the file. A positive offset moves toward the end of the file and a negative offset toward the start. Measured from the start of the file, the offset can only be positive; measured from the end, it should normally be negative. Although some files allow the cursor to move past the end, it is better not to.

2. tell() -> current file position, an integer (may be a long integer).

Returns an integer (possibly a long integer) giving the current cursor position.

3. truncate([size]) -> None. Truncate the file to at most size bytes.

Truncates the file to at most size bytes; size defaults to the current file position (size = file.tell()). Truncating keeps everything before the cursor and removes everything after it.
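The three pointer methods above can be sketched together on a ten-byte file. The file name and its content are hypothetical, and binary mode is an assumption made here so that offsets count bytes exactly and the code also runs on Python 3.

```python
import os
import tempfile

# Hypothetical ten-byte scratch file.
path = os.path.join(tempfile.mkdtemp(), 'pointer.bin')
f = open(path, 'wb')
f.write(b'0123456789')
f.close()

f = open(path, 'rb+')
f.seek(4)                    # whence defaults to 0: 4 bytes from the start
pos_after_seek = f.tell()    # 4
two = f.read(2)              # b'45'; the cursor is now at 6
f.seek(-3, 2)                # 3 bytes before the end of the file
tail = f.read()              # b'789'
f.seek(5)
f.truncate()                 # keep what is before the cursor, drop the rest
f.close()

f = open(path, 'rb')
remaining = f.read()
f.close()

print(repr(remaining))
```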

5. Closing and saving

1. close() -> None or (perhaps) an integer. Close the file.

Closes the file. While the file is open, written data may still sit in the buffer; closing the file writes the contents of the buffer to disk.

2. flush() -> None. Flush the internal I/O buffer.

Writes the contents of the buffer to disk without closing the file.
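The difference between flush() and close() can be sketched by checking the size of the file on disk while it is still open. The scratch file name and the text written are hypothetical.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'flush.txt')

f = open(path, 'w')
f.write('buffered text')     # may still be sitting in the buffer

f.flush()                    # hand the buffered data to the operating system
size_after_flush = os.path.getsize(path)
still_open = not f.closed    # flush() leaves the file open

f.close()                    # close() flushes as well, then releases the file
final_size = os.path.getsize(path)

print(size_after_flush)
print(still_open)
```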

6. Other methods

1. isatty() -> True or False.

Determines whether the file is a tty-like device.

2. x.next() -> the next value, or raise StopIteration.

Returns the next line of the file (similar to file.readline()) and raises a StopIteration exception when there are no more lines.
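Stepping through a file by hand can be sketched as follows. The built-in next(f) is used here instead of f.next(); that is a substitution made so the example also runs on Python 3, and in Python 2.6 and later the two are equivalent for file objects. The scratch file is hypothetical.

```python
import os
import tempfile

# Hypothetical two-line scratch file.
path = os.path.join(tempfile.mkdtemp(), 'next.txt')
f = open(path, 'w')
f.write('one\ntwo\n')
f.close()

f = open(path, 'r')
first = next(f)              # 'one\n'
second = next(f)             # 'two\n'
try:
    next(f)                  # nothing left to read
    exhausted = False
except StopIteration:
    exhausted = True
f.close()

print(exhausted)
```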

7. File object attributes

file.closed: True if the file has been closed, otherwise False.

file.encoding: the encoding used by the file. When Unicode strings are written to the file, they are automatically converted to byte strings using file.encoding; if file.encoding is None, the system default encoding is used.

file.mode: the access mode used when the file was opened.

file.name: the file name.

file.newlines: None if no line terminator has been read yet, a string if only one kind of line terminator has been seen, and a tuple containing all line terminators encountered so far if the file has more than one kind.

file.softspace: a flag used by the print statement to record whether a space must be printed before the next value. This attribute is generally not used by programmers; it is for internal use.
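A few of these attributes can be inspected directly, as in the sketch below. Only name, mode, and closed are shown, since they exist on file objects in both Python 2.7 and Python 3; the scratch file is hypothetical.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'attrs.txt')

f = open(path, 'w')
name = f.name                # the name passed to open()
mode = f.mode                # 'w'
closed_before = f.closed     # False while the file is open
f.close()
closed_after = f.closed      # True once close() has been called

print(mode)
print(closed_before)
print(closed_after)
```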

That concludes this summary of file operations; any errors or omissions will be corrected later.
