Ways to read large files using Python

Source: Internet
Author: User
This article describes several ways to read large files in Python; readers who need this may find it a useful reference.

Background

While recently processing a text document of about 2 GB, I ran into a MemoryError and very slow file reads. I then found two faster approaches for reading large files, and this article describes both methods.

Preparatory work

When we talk about "text processing," we usually mean the content being handled. Python reads the contents of a text file into a string variable, which is very easy to manipulate. A file object provides three "read" methods: .read(), .readline(), and .readlines(). Each method can accept an argument that limits how much data is read at a time, but they are usually called without one. .read() reads the entire file at once and is typically used to put the file contents into a single string variable. However, although .read() produces the most direct string representation of the file's contents, it is unnecessary for sequential line-oriented processing and impossible if the file is larger than available memory. The following is an example of the .read() method:

f = None
try:
    f = open('/path/to/file', 'r')
    print(f.read())
finally:
    if f:
        f.close()

A call to read() reads the entire contents of the file at once; if the file is 10 GB, memory will blow up. To be safe, you can instead call read(size) repeatedly, which reads at most size bytes each time. In addition, readline() reads one line per call, and readlines() reads everything at once and returns the content as a list of lines. Choose the call that fits your needs.
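The repeated-read(size) pattern above can be sketched as follows. This is a minimal runnable example; the sample file, its path, and its contents are invented here purely for demonstration:

```python
import os
import tempfile

# Create a small hypothetical sample file so the sketch is runnable.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w") as f:
    f.write("x" * 10000)

# Read the file in pieces of at most `size` bytes instead of all at once.
size = 4096
total = 0
with open(path, "r") as f:
    while True:
        data = f.read(size)   # returns '' at end of file
        if not data:
            break
        total += len(data)    # process the piece here

print(total)  # 10000
```

Because at most `size` bytes are held in memory at a time, this loop works the same way regardless of how large the file is.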

If the file is small, a one-shot read() is the most convenient; if the file size cannot be determined, repeatedly calling read(size) is safer; and for a configuration file, readlines() is the most convenient:

for line in f.readlines():
    process(line)  # <do something with line>

Chunked Read

The obvious way to process a large file is to divide it into small pieces, handling each piece and releasing its memory afterwards. An iterator built with yield is used here:

def read_in_chunks(file_path, chunk_size=1024*1024):
    """Lazy function (generator) to read a file piece by piece.
    Default chunk size: 1M. You can set your own chunk size."""
    file_object = open(file_path)
    while True:
        chunk_data = file_object.read(chunk_size)
        if not chunk_data:
            break
        yield chunk_data

if __name__ == "__main__":
    file_path = './path/filename'
    for chunk in read_in_chunks(file_path):
        process(chunk)  # <do something with chunk>
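Here is a self-contained demonstration of the chunked-generator pattern. The generator is restated so the snippet runs on its own, and it opens the file in binary mode (an assumption; the original does not specify a mode). The demo file and its path are invented:

```python
import os
import tempfile

def read_in_chunks(file_path, chunk_size=1024 * 1024):
    """Yield a file's contents piece by piece (default chunk: 1 MB)."""
    with open(file_path, "rb") as file_object:
        while True:
            chunk_data = file_object.read(chunk_size)
            if not chunk_data:
                break
            yield chunk_data

# Hypothetical 3 MB demo file, so several chunks are produced.
path = os.path.join(tempfile.mkdtemp(), "big.bin")
with open(path, "wb") as f:
    f.write(b"\0" * (3 * 1024 * 1024))

chunks = list(read_in_chunks(path))
print(len(chunks))                   # 3 chunks of 1 MB each
print(sum(len(c) for c in chunks))   # 3145728 bytes in total
```

Because the generator yields one chunk at a time, a real caller would process each chunk inside the for loop rather than collecting them into a list as this demo does.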

Use with open ()

The with statement handles opening and closing the file, including when an exception is thrown inside the block. In for line in f, the file object f is treated as an iterator that automatically uses buffered I/O and memory management, so you don't have to worry about large files.

The code is as follows:

# If the file is line based
with open(...) as f:
    for line in f:
        process(line)  # <do something with line>
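To see that line-by-line iteration stays memory-safe, here is a runnable sketch that streams a file of many lines while holding only the current line in memory. The sample file and its contents are hypothetical:

```python
import os
import tempfile

# Hypothetical sample file with 100,000 short lines.
path = os.path.join(tempfile.mkdtemp(), "lines.txt")
with open(path, "w") as f:
    for i in range(100_000):
        f.write(f"line {i}\n")

# Iterating the file object yields one line at a time; only the current
# line is held in memory, so the file's total size is not a concern.
line_count = 0
with open(path) as f:
    for line in f:
        line_count += 1   # process(line) would go here

print(line_count)  # 100000
```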

Optimization

Using with open works fine for data with millions of rows, but the mode parameter makes a difference to efficiency. In testing, opening with "rb" was about six times as fast as "r"; binary read remains the fastest mode.

with open(filename, "rb") as f:
    for fline in f:
        pass

Test result: "rb" mode is the fastest; a full traversal of 1,000,000 lines took 2.9 seconds, which basically meets the efficiency requirements of large-file processing. Changing "rb" (binary read) to "r" (text read) makes the traversal 5-6 times slower.
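A comparison along these lines can be sketched as below. Note this is an illustrative benchmark under invented data, not the author's original test: absolute timings and the speed ratio depend on the machine, the file, and the Python version, so treat the 5-6x figure as indicative only:

```python
import os
import tempfile
import time

# Hypothetical benchmark file: 200,000 identical short lines.
path = os.path.join(tempfile.mkdtemp(), "bench.txt")
with open(path, "w") as f:
    for _ in range(200_000):
        f.write("some sample line of text\n")

def traverse(mode):
    """Traverse the file line by line in the given mode, returning
    the line count and the elapsed wall-clock time."""
    start = time.perf_counter()
    n = 0
    with open(path, mode) as f:
        for _ in f:
            n += 1
    return n, time.perf_counter() - start

lines_rb, t_rb = traverse("rb")  # binary: no decoding, no newline translation
lines_r, t_r = traverse("r")     # text: decodes each line from bytes to str

print(lines_rb == lines_r)  # True: both modes see the same lines
```

The time difference comes from the text mode's extra work per line (decoding and universal-newline handling), which binary mode skips.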

Conclusion

When reading large files in Python, the simplest approach is to let the system handle it: hand the file object to the interpreter and let it manage its own work. At the same time, different read parameters can be chosen for different requirements to achieve even higher performance.
