This article describes how to read large files in Python; readers who need this can refer to the examples below.
Background
While recently processing a text file (about 2 GB in size), I ran into a MemoryError and very slow file reads. I then found two faster ways to read large files; this article describes both methods.
Preparatory work
When we talk about "text processing," we usually mean the content we are working on. Python reads the contents of a text file into a string variable, which is very easy to manipulate. A file object provides three "read" methods: .read(), .readline(), and .readlines(). Each method can take an argument limiting how much data is read at a time, but they are typically called without one. .read() reads the entire file at once and is typically used to place the whole file contents into a string variable. However, while .read() produces the most direct string representation of the file's contents, it is unnecessary for continuous line-oriented processing, and it is impossible if the file is larger than the available memory. The following is an example of the read() method:
try:
    f = open('/path/to/file', 'r')
    print(f.read())
finally:
    if f:
        f.close()
A call to read() reads the entire contents of the file at once; if the file is 10 GB, memory will blow up. To be safe, you can repeatedly call read(size), which reads at most size bytes each time. In addition, readline() reads one line per call, and readlines() reads everything at once and returns a list of lines. You should therefore decide which call to use based on your needs.
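The read(size) loop described above can be sketched as follows. The file path and piece size here are illustrative; a small temporary file is created so the snippet is self-contained:

```python
import os
import tempfile

# Create a small sample file so the example runs anywhere.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "w") as f:
    f.write("hello world" * 100)

# Read the file in fixed-size pieces instead of all at once.
pieces = []
with open(path, "r") as f:
    while True:
        data = f.read(64)  # at most 64 characters per call
        if not data:       # read() returns "" at end of file
            break
        pieces.append(data)

content = "".join(pieces)
print(len(content))  # total characters read
os.remove(path)
```

Only one 64-character piece is held per call, so peak memory stays bounded regardless of file size.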
If the file is small, a one-shot read() is the most convenient; if the file size cannot be determined, repeatedly calling read(size) is safer; and if it is a configuration file, readlines() is the most convenient:
for line in f.readlines():
    process(line)  # <do something with line>
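readline() is mentioned above but not demonstrated; here is a minimal, self-contained sketch of it (the file contents are illustrative):

```python
import os
import tempfile

# Write a small line-based sample file.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "w") as f:
    f.write("first\nsecond\nthird\n")

lines = []
with open(path) as f:
    while True:
        line = f.readline()  # one line per call, "" at end of file
        if not line:
            break
        lines.append(line.rstrip("\n"))
print(lines)
os.remove(path)
```

Unlike readlines(), this holds only one line in memory at a time.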
Chunked Read
An obvious way to handle a large file is to split it into small pieces and process them one at a time, releasing that portion of memory after each piece. A generator (yield) is used here:
def read_in_chunks(file_path, chunk_size=1024*1024):
    """Lazy function (generator) to read a file piece by piece.
    Default chunk size: 1M. You can set your own chunk size."""
    file_object = open(file_path)
    while True:
        chunk_data = file_object.read(chunk_size)
        if not chunk_data:
            break
        yield chunk_data

if __name__ == "__main__":
    file_path = './path/filename'
    for chunk in read_in_chunks(file_path):
        process(chunk)  # <do something with chunk>
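A runnable variant of this chunked reader, sketched with a context manager (so the file is closed even on error) and binary mode; the chunk size and demo file are illustrative:

```python
import os
import tempfile

def read_in_chunks(file_path, chunk_size=1024 * 1024):
    """Yield a file piece by piece; chunk_size defaults to 1 MB."""
    with open(file_path, "rb") as file_object:
        while True:
            chunk_data = file_object.read(chunk_size)
            if not chunk_data:
                break
            yield chunk_data

# Self-contained demo on a small temporary file.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"x" * 2500)

chunks = list(read_in_chunks(path, chunk_size=1024))
print([len(c) for c in chunks])  # sizes of the pieces yielded
os.remove(path)
```

A 2500-byte file with a 1024-byte chunk size yields two full chunks and one 452-byte remainder; in real use you would process each chunk inside the for loop instead of collecting them into a list.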
Use with open()
The with statement opens and closes the file, even if an exception is raised in the inner block. In for line in f, the file object f is treated as an iterator that automatically uses buffered IO and memory management, so you don't have to worry about large files.
The code is as follows:
# If the file is line based
with open(...) as f:
    for line in f:
        process(line)  # <do something with line>
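The snippet above can be made concrete like this; here the hypothetical process(line) is replaced by a simple line counter, and a small temporary file stands in for the large input:

```python
import os
import tempfile

# Write a small line-based sample file.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "w") as f:
    for i in range(5):
        f.write("line %d\n" % i)

# Iterate the file object directly; buffering is handled for us,
# so only one line is held in memory at a time.
count = 0
with open(path) as f:
    for line in f:
        count += 1  # <do something with line>
print(count)
os.remove(path)
```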
Optimization
Using with open works fine for big data with millions of rows, but different parameters lead to different efficiency. In testing, passing "rb" as the mode argument was about 6 times faster than "r": binary reading is still the fastest mode.
with open(filename, "rb") as f:
    for fline in f:
        pass
Test result: "rb" mode is the fastest; a full traversal of 1,000,000 lines took 2.9 seconds, which basically meets the efficiency requirements for large-file processing. Changing "rb" (binary mode) to "r" (text mode) makes the read 5-6 times slower.
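A sketch of how such a comparison can be run; the file here is 100,000 lines rather than the 1,000,000 used in the article so it finishes quickly, and the actual speedup will vary by platform, so no particular ratio should be expected:

```python
import os
import tempfile
import time

# Build a line-based test file.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "w") as f:
    for i in range(100000):
        f.write("some sample line %d\n" % i)

def traverse(mode):
    """Time a full line-by-line traversal in the given open() mode."""
    start = time.perf_counter()
    n = 0
    with open(path, mode) as f:
        for _ in f:
            n += 1
    return n, time.perf_counter() - start

lines_r, t_r = traverse("r")    # text mode
lines_rb, t_rb = traverse("rb")  # binary mode
print(lines_r, lines_rb)  # both modes see the same number of lines
os.remove(path)
```

Binary mode skips the bytes-to-str decoding and universal-newline translation that text mode performs, which is where the speed difference comes from.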
Conclusion
When reading large files with Python, the simplest approach is to let the system handle it: hand the work to the interpreter and let it manage things itself. At the same time, choosing different read parameters according to your requirements can achieve even higher performance.