How to read large files using Python

Source: Internet
Author: User

Background

While recently processing a text file (about 2 GB in size), I ran into a MemoryError and found that reading the file was far too slow. I later found two fast ways to read large files; this article describes both methods.


Preparations

When we talk about "text processing", we usually mean the content to be processed. Python reads the content of a text file into a string variable that can be operated on easily. The file object provides three "read" methods: .read(), .readline(), and .readlines(). Each method can accept an argument that limits the amount of data read per call, but they are usually called without one. .read() reads the entire file at once and is typically used to put the whole file content into a string variable. However, .read() produces the most direct string representation of the file content, which is unnecessary for continuous line-oriented processing, and if the file is larger than the available memory this approach is impossible. Below is a read() example:

f = None
try:
    f = open('/path/to/file', 'r')
    print(f.read())
finally:
    if f:
        f.close()

 

Calling read() reads the entire content of the file at once; if the file is 10 GB, memory will be exhausted. To be safe, you can call read(size) repeatedly to read at most size bytes per call. Alternatively, readline() reads one line at a time, and readlines() reads everything at once and returns a list of lines. You should therefore decide which method to call as needed.
If the file is small, read() is the most convenient way to read it in one go; if the file size cannot be determined, calling read(size) repeatedly is relatively safe (a sketch of that approach follows the snippet below); if it is a configuration file, readlines() is the most convenient:

for line in f.readlines():
    process(line)  # <do something with line>
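For the case where the file size cannot be determined, here is a minimal sketch of the repeated read(size) approach mentioned above; process() is a hypothetical helper, as in the other snippets, and the path and chunk size are placeholders:

f = None
try:
    f = open('/path/to/file', 'r')
    while True:
        data = f.read(1024)   # read at most 1024 bytes per call
        if not data:          # an empty string means end of file
            break
        process(data)         # <do something with data>
finally:
    if f:
        f.close()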

  

Reading in chunks

It is natural to think of splitting a large file into several smaller pieces and processing them one at a time, so that the memory for each piece is released after it is processed. Here we use a generator built with yield:

def read_in_chunks(filePath, chunk_size=1024*1024):
    """Lazy function (generator) to read a file piece by piece.
    Default chunk size: 1M
    You can set your own chunk size
    """
    file_object = open(filePath)
    while True:
        chunk_data = file_object.read(chunk_size)
        if not chunk_data:
            break
        yield chunk_data


if __name__ == "__main__":
    filePath = './path/filename'
    for chunk in read_in_chunks(filePath):
        process(chunk)  # <do something with chunk>
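As a small refinement (my own suggestion, not part of the original article), the generator can open the file with a with statement so that it is closed even if an exception is raised or the caller stops iterating early; the parameter names here mirror the snippet above:

def read_in_chunks(filePath, chunk_size=1024*1024):
    """Yield the file's content piece by piece (default chunk size: 1M)."""
    with open(filePath) as file_object:
        while True:
            chunk_data = file_object.read(chunk_size)
            if not chunk_data:
                break
            yield chunk_data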

 

Use with open()

The with statement takes care of opening and closing the file, even when an exception is raised inside the block. for line in f treats the file object f as an iterator, which automatically uses buffered IO and memory management, so you don't have to worry about large files.

The code is as follows:

# If the file is line based
with open(...) as f:
    for line in f:
        process(line)  # <do something with line>

 

Optimization

There is no problem using with open on large data sets with millions of rows, but different open parameters lead to different efficiency. In testing, opening the file with mode "rb" was about six times as fast as mode "r", so binary reading is still the fastest mode.

with open(filename, "rb") as f:
    for fLine in f:
        pass
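If you want to reproduce this comparison on your own data, a rough timing sketch is below; the file path is a placeholder, and the measured numbers will of course depend on your machine and file:

import time

def time_mode(filename, mode):
    """Return the seconds taken to iterate over every line of the file in the given mode."""
    start = time.time()
    with open(filename, mode) as f:
        for _ in f:
            pass
    return time.time() - start

print("r :", time_mode('/path/to/file', 'r'))
print("rb:", time_mode('/path/to/file', 'rb'))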

 

Test results: the rb mode is the fastest, traversing all the rows in 2.9 seconds, which is basically enough for processing medium and large files efficiently. Switching from rb (binary read) back to r (text read mode) makes the traversal 5-6 times slower.

Conclusion

When using Python to read large files, let the system handle them: the simplest way is to hand the work over to the interpreter and let it manage the iteration. At the same time, you can choose different read parameters to achieve higher performance for different requirements.
