How to read large files using Python
Background
While recently processing a text file of about 2 GB, I ran into a MemoryError, and reading the file was far too slow. I later found two fast ways to read large files; this article describes both reading methods.
Preparations
When we talk about "text processing", we usually mean the content to be processed. Python reads the content of a text file into a string variable that is easy to operate on. A file object provides three "read" methods: `.read()`, `.readline()`, and `.readlines()`. Each method can accept an argument that limits how much data is read at a time, but they are usually called without one.

`.read()` reads the entire file at once and is typically used to put the file content into a string variable. It produces the most direct string representation of the file content, but that is unnecessary for line-oriented processing, and it is impossible if the file is larger than the available memory. Here is a `read()` example:
```python
f = None
try:
    f = open('/path/to/file', 'r')
    print(f.read())
finally:
    if f:
        f.close()
```
Calling `read()` reads the entire content of the file at once; if the file is 10 GB, memory will blow up. To be safe, you can repeatedly call `read(size)`, which reads at most `size` bytes on each call. `readline()` reads one line at a time, and `readlines()` reads everything at once and returns a list of lines. You therefore need to decide which call fits your situation.

If the file is small, a one-shot `read()` is the most convenient; if the file size cannot be determined, repeatedly calling `read(size)` is relatively safe; if it is a configuration file, `readlines()` is the most convenient:
```python
for line in f.readlines():
    process(line)  # <do something with line>
```
Reading in chunks
It is natural to split a large file into several small chunks and process them one by one, releasing each chunk's memory after it has been handled. Here we use iter and yield:
```python
def read_in_chunks(filePath, chunk_size=1024 * 1024):
    """Lazy function (generator) to read a file piece by piece.
    Default chunk size: 1M.
    You can set your own chunk size.
    """
    file_object = open(filePath)
    while True:
        chunk_data = file_object.read(chunk_size)
        if not chunk_data:
            break
        yield chunk_data

if __name__ == "__main__":
    filePath = './path/filename'
    for chunk in read_in_chunks(filePath):
        process(chunk)  # <do something with chunk>
```
Use with open()

The `with` statement takes care of opening and closing the file, even when an exception is thrown inside the block. `for line in f` treats the file object `f` as an iterator, which automatically uses buffered IO and memory management, so you do not have to worry about large files.

The code is as follows:
```python
# If the file is line based
with open(...) as f:
    for line in f:
        process(line)  # <do something with line>
```
Optimization
There is no problem using `with open` for large data with millions of rows, but different open parameters also lead to different efficiency. In my tests, opening with "rb" was about six times as fast as "r". Binary reading is clearly the fastest mode.
```python
with open(filename, "rb") as f:
    for fLine in f:
        pass
```
Test results: the "rb" mode is the fastest, completing a full traversal in about 2.9 seconds in my test. It can basically meet the processing-efficiency needs of medium and large files. Switching the mode from "rb" (binary read) to "r" (text read) makes the traversal 5-6 times slower.

Conclusion
When using Python to read large files, let the system handle the buffering: the simplest way is to hand the iteration over to the interpreter and let it manage the work. At the same time, you can choose different read parameters to achieve higher performance for different requirements.