Python file-reading pitfalls with large files, and memory footprint detection

Source: Internet
Author: User
Tags: readline

Python's API for reading and writing files is simple, but it is easy to fall into a trap if you are not careful. This article records one such pitfall and offers a short summary, in the hope that readers can avoid some of these hidden traps when using Python.

1. read() and readlines():

Tutorials on reading and writing files in Python frequently mention the pair of functions read() and readlines(), so code like the following is common:

with open(file_path, 'rb') as f:
    sha1Obj.update(f.read())

Or

with open(file_path, 'rb') as f:
    for line in f.readlines():
        print(line)

This pair of methods raises no exception when reading small files, but with large files it can easily produce a MemoryError, i.e. a memory overflow.

#### Why MemoryError?
Let's take a look at these two methods first:

With the default parameter size=-1, the read() method reads until EOF, so when the file is larger than the available memory, a memory overflow error naturally occurs.

Similarly, readlines() constructs a list rather than an iterator, so all the content is held in memory at once, and a memory overflow error can likewise occur.
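To make the difference concrete, here is a minimal sketch (using a small throwaway file; the path and sizes are illustrative only) showing that readlines() returns a fully materialized list, while iterating the file object yields lines lazily:

```python
import os
import tempfile

# Build a small sample file; the real problem only bites when the
# file is larger than available memory.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w") as f:
    for i in range(1000):
        f.write("line %d\n" % i)

# readlines() materializes every line into one in-memory list.
with open(path) as f:
    lines = f.readlines()
print(type(lines).__name__, len(lines))  # list 1000

# Iterating the file object holds only one line at a time.
with open(path) as f:
    count = sum(1 for _ in f)
print(count)  # 1000
```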

2. Correct usage:

In a production system, writing the code above is very dangerous, and this "pit" is very well hidden. So let's look at the correct usage. It is very simple: just code according to the function's API documentation.

For binary files, the following pattern is recommended. You can specify how many bytes the buffer holds; within reason, a larger buffer generally means faster reads.

with open(file_path, 'rb') as f:
    while True:
        buf = f.read(1024)
        if buf:
            sha1Obj.update(buf)
        else:
            break
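As a side note, on Python 3.8+ the same chunked loop can be written more compactly with the "walrus" operator; the behavior is identical. A self-contained sketch (the file name and helper are illustrative, not from the original article):

```python
import hashlib
import os
import tempfile

# Chunked SHA-1 of a file; the walrus operator folds the read and the
# emptiness test into the while condition.
def sha1_of_file(path, chunk_size=1024):
    digest = hashlib.sha1()
    with open(path, 'rb') as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()

# Sanity check against hashing the whole (small) file at once.
demo = os.path.join(tempfile.mkdtemp(), "demo.bin")
with open(demo, "wb") as f:
    f.write(b"hello world" * 1000)
with open(demo, "rb") as f:
    whole = hashlib.sha1(f.read()).hexdigest()
print(sha1_of_file(demo) == whole)  # True
```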

For a text file, you can either call the readline() method or iterate over the file object directly (Python wraps the latter in syntactic sugar; the underlying logic of the two is the same, though iterating over the file is clearly more Pythonic). Reading one line at a time is relatively inefficient; in a quick test on a roughly 3 GB file, it was about 20% slower than the buffered read above.

with open(file_path, 'rb') as f:
    while True:
        line = f.readline()
        if line:
            print(line)
        else:
            break

with open(file_path, 'rb') as f:
    for line in f:
        print(line)

3. Memory detection tools:

To understand the memory footprint of Python code, you need to monitor its memory usage. Here are two small tools I recommend for detecting the memory footprint of Python code.

### memory_profiler
First, install memory_profiler with pip:

pip install memory_profiler

memory_profiler works through Python decorators, so we need to add a decorator to the function we are testing:

from hashlib import sha1
import sys

@profile
def my_func():
    sha1Obj = sha1()
    with open(sys.argv[1], 'rb') as f:
        while True:
            buf = f.read(10 * 1024 * 1024)
            if buf:
                sha1Obj.update(buf)
            else:
                break
    print(sha1Obj.hexdigest())

if __name__ == '__main__':
    my_func()

Then add **-m memory_profiler** when running the script (i.e. run it via `python -m memory_profiler`), and you can see the memory footprint of each line of the function's code.

### guppy

Likewise, first install guppy with pip:

pip install guppy

You can then use guppy in your code to print out directly how many objects of each Python type (list, tuple, dict, and so on) have been created at each step, and how much memory they occupy:

from guppy import hpy
import sys

def my_func():
    mem = hpy()
    with open(sys.argv[1], 'rb') as f:
        while True:
            buf = f.read(10 * 1024 * 1024)
            if buf:
                print(mem.heap())
            else:
                break

Running this prints the corresponding memory consumption data for each iteration of the loop.

Both guppy and memory_profiler can be used to monitor the memory footprint of Python code as it runs.
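As an aside not covered by the original article: since Python 3.4, the standard library's tracemalloc module offers similar memory monitoring with no third-party install. A minimal sketch (the allocation below is illustrative only):

```python
import tracemalloc

# Start tracing, allocate roughly 1 MB, then read back the tracked total.
tracemalloc.start()
data = [bytes(1000) for _ in range(1000)]  # ~1 MB of allocations
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(current > 900_000)  # True: about 1 MB was tracked while tracing
```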

4. Summary:

Python is a simple language, but precisely because of that simplicity, many details deserve careful scrutiny and thought. I hope that in daily work and study everyone can likewise keep notes on these details, and step into fewer unnecessary "pits."
