How to read large files using Python
Background
While recently processing a text file of about 2 GB, I ran into a MemoryError, and reading the file was far too slow. I later found two fast ways to read large files; this article describes both reading methods.
Preparations
When we talk about "text processing", we usually mean the content to be processed. Python reads the content of a text file into a string variable that is easy to operate on. A file object provides three "read" methods: `.read()`, `.readline()`, and `.readlines()`. Each method can accept an argument that limits how much data is read at a time, but they are usually called without one.

`.read()` reads the entire file at once and is typically used to put the file content into a string variable. It produces the most direct string representation of the file content, but that is unnecessary for line-oriented processing, and it is impossible if the file is larger than the available memory. Here is a `read()` example:
```python
f = None
try:
    f = open('/path/to/file', 'r')
    print(f.read())
finally:
    if f:
        f.close()
```
Calling `read()` reads the entire content of the file at once; if the file is 10 GB, memory will blow up. To be safe, you can repeatedly call `read(size)`, which reads at most `size` bytes on each call. `readline()` reads one line at a time, and `readlines()` reads everything at once and returns a list of lines. You therefore need to decide which call fits your situation.

If the file is small, a one-shot `read()` is the most convenient; if the file size cannot be determined, repeatedly calling `read(size)` is relatively safe; if it is a configuration file, `readlines()` is the most convenient:
```python
for line in f.readlines():
    process(line)  # <do something with line>
```
Reading in chunks
It is natural to split a large file into several small chunks and process them one by one, releasing each chunk's memory after it has been handled. Here we use iter and yield:
```python
def read_in_chunks(filePath, chunk_size=1024 * 1024):
    """Lazy function (generator) to read a file piece by piece.
    Default chunk size: 1M.
    You can set your own chunk size.
    """
    file_object = open(filePath)
    while True:
        chunk_data = file_object.read(chunk_size)
        if not chunk_data:
            break
        yield chunk_data

if __name__ == "__main__":
    filePath = './path/filename'
    for chunk in read_in_chunks(filePath):
        process(chunk)  # <do something with chunk>
```
Use with open()

The `with` statement takes care of opening and closing the file, even when an exception is thrown inside the block. `for line in f` treats the file object `f` as an iterator, which automatically uses buffered IO and memory management, so you do not have to worry about large files.

The code is as follows:
```python
# If the file is line based
with open(...) as f:
    for line in f:
        process(line)  # <do something with line>
```
Optimization
There is no problem using `with open` for large data with millions of rows, but different open parameters also lead to different efficiency. In my tests, opening with "rb" was about six times as fast as "r". Binary reading is clearly the fastest mode.
```python
with open(filename, "rb") as f:
    for fLine in f:
        pass
```
Test results: the "rb" mode is the fastest, completing a full traversal in about 2.9 seconds in my test. It can basically meet the processing-efficiency needs of medium and large files. Switching the mode from "rb" (binary read) to "r" (text read) makes the traversal 5-6 times slower.

Conclusion
When using Python to read large files, let the system handle the buffering: the simplest way is to hand the iteration over to the interpreter and let it manage the work. At the same time, you can choose different read parameters to achieve higher performance for different requirements.