While recently processing a text file of about 2 GB, I ran into MemoryError exceptions and slow file reads. I found two methods for reading large files faster, and this article describes both of them.
Preliminaries
When we talk about "text processing", we usually mean the content being processed. Python makes it easy to read the contents of a text file into a string variable that can then be manipulated. A file object provides three read methods: .read(), .readline(), and .readlines(). Each method can take an argument that limits how much data is read at a time, but they are usually called without one. .read() reads the entire file at once and is typically used to place the file's contents into a single string variable. It produces the most direct string representation of the file's content, but it is unnecessary for sequential, line-oriented processing, and it is not possible at all if the file is larger than available memory. The following is an example of the .read() method:
f = None  # so the finally block is safe even if open() fails
try:
    f = open('/path/to/file', 'r')
    print(f.read())
finally:
    if f:
        f.close()
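The .readline() method mentioned above reads one line per call and returns an empty string at end of file. As a minimal sketch (the loop structure is my own illustration, and process() is a hypothetical placeholder, just as in the later examples):

f = None
try:
    f = open('/path/to/file', 'r')
    line = f.readline()
    while line:
        process(line)  # hypothetical handler
        line = f.readline()
finally:
    if f:
        f.close()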
If the file is small, reading it all at once with .read() is the most convenient; if you cannot be sure of the file size, calling .read(size) repeatedly is safer; and for something like a configuration file, .readlines() is the most convenient:
for line in f.readlines():
    process(line)  # <do something with line>
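A minimal sketch of the repeated .read(size) calls mentioned above (the next section turns this pattern into a reusable generator; the 4096-character block size is an arbitrary choice for illustration, and process() is again a hypothetical placeholder):

f = None
try:
    f = open('/path/to/file', 'r')
    while True:
        block = f.read(4096)  # read at most 4096 characters per call
        if not block:         # empty string means end of file
            break
        process(block)        # hypothetical handler
finally:
    if f:
        f.close()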
Read in Chunks
An obvious way to process a large file is to split it into a number of smaller pieces and process them one by one, releasing the memory used by each piece once it has been processed. iter and yield are used here:
def read_in_chunks(file_path, chunk_size=1024 * 1024):
    """
    Lazy function (generator) to read a file piece by piece.
    Default chunk size: 1M
    You can set your own chunk size
    """
    file_object = open(file_path)
    while True:
        chunk_data = file_object.read(chunk_size)
        if not chunk_data:
            break
        yield chunk_data

if __name__ == "__main__":
    file_path = './path/filename'
    for chunk in read_in_chunks(file_path):
        process(chunk)  # <do something with chunk>
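The text above mentions iter as well as yield. One way to use iter directly, as a sketch under the same assumptions (process() is still a hypothetical placeholder), is the two-argument form iter(callable, sentinel) together with functools.partial:

from functools import partial

with open('./path/filename') as file_object:
    # iter(callable, sentinel) keeps calling the callable until it returns
    # the sentinel value -- here, the empty string that read() returns at EOF.
    for chunk_data in iter(partial(file_object.read, 1024 * 1024), ''):
        process(chunk_data)  # hypothetical handler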
Using with open()
The with statement handles opening and closing the file, even when an exception is raised in the inner block. With for line in f, the file object f is treated as an iterator that automatically uses buffered I/O and memory management, so you don't have to worry about large files.
# If the file is line based
with open(...) as f:
    for line in f:
        process(line)  # <do something with line>
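As a concrete stand-in for the hypothetical process(line) placeholder, counting lines and characters shows that only one line needs to be held in memory at a time:

line_count = 0
char_count = 0
with open('/path/to/file', 'r') as f:
    for line in f:          # one buffered line at a time
        line_count += 1
        char_count += len(line)
print(line_count, char_count)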
Conclusion
When reading large files with Python, let the system handle the details: use the simplest approach, leave the buffering to the interpreter, and just take care of your own work.