Pitfalls of reading large files in Python, and detecting memory footprint

Source: Internet
Author: User
Tags: readline

Python's API for reading and writing files is simple, which makes it easy to step into a pit when you are not careful. The author records one such experience here, along with a summary, in the hope that it helps others avoid some potential pitfalls when using Python.

1. read() and readlines():

Search for a Python tutorial on reading and writing files and you will often see the pair of methods read() and readlines(). As a result, code like the following is common:

    with open(file_path, 'rb') as f:
        sha1Obj.update(f.read())

Or

    with open(file_path, 'rb') as f:
        for line in f.readlines():
            print(line)

Neither method causes any problem with small files, but with a large file they can easily raise MemoryError, that is, run out of memory.

#### Why MemoryError?
Let's take a look at these two methods first:

With the default argument size=-1, read() reads until EOF, so running out of memory is the natural result when the file is larger than the available memory.

Similarly, readlines() builds a list rather than an iterator, so the entire content is held in memory at once and the same error occurs.
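The difference can be seen in a small sketch. It writes a throwaway temporary file (the file contents are made up for illustration) and checks what each call returns: read() and readlines() materialize everything at once, while the file object itself hands out one line at a time.

```python
import os
import tempfile

# Create a small throwaway file; a real large file behaves the same way,
# only with far more data held in memory at once.
with tempfile.NamedTemporaryFile('wb', delete=False) as tmp:
    tmp.write(b'line 1\nline 2\nline 3\n')
    file_path = tmp.name

with open(file_path, 'rb') as f:
    whole = f.read()          # size defaults to -1: everything up to EOF
with open(file_path, 'rb') as f:
    lines = f.readlines()     # a list holding every line
with open(file_path, 'rb') as f:
    first = next(f)           # the file object is its own iterator
    is_own_iterator = iter(f) is f

os.remove(file_path)

print(type(whole).__name__, len(whole))
print(type(lines).__name__, len(lines))
print(first, is_own_iterator)
```

With a multi-gigabyte file, the first two calls would need memory proportional to the file size, while next(f) only ever needs room for one line.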

2. Correct usage:

Code like the above is dangerous in a system that is actually running, and the pit is well hidden because it only shows up with large files. The correct usage is simple: write the code according to what the API documentation of each function describes.

For a binary file, the following pattern is recommended; you can specify the buffer size in bytes. A larger buffer generally reads faster because fewer read calls are needed, though the gain levels off as the buffer grows.

    with open(file_path, 'rb') as f:
        while True:
            buf = f.read(1024)
            if buf:
                sha1Obj.update(buf)
            else:
                break
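For reference, here is a self-contained sketch of the same chunked pattern. The function name sha1_of_file and the chunk_size parameter are illustrative names, not from the original code, and the demo file is a throwaway temporary file.

```python
import hashlib
import os
import tempfile


def sha1_of_file(file_path, chunk_size=1024 * 1024):
    """Return the SHA-1 hex digest of a file, reading chunk_size bytes at a time."""
    sha1_obj = hashlib.sha1()
    with open(file_path, 'rb') as f:
        # iter(callable, sentinel) keeps calling f.read(chunk_size)
        # until it returns b'' (EOF), so only one chunk is alive at a time.
        for buf in iter(lambda: f.read(chunk_size), b''):
            sha1_obj.update(buf)
    return sha1_obj.hexdigest()


# Usage on a throwaway file: the chunked digest matches hashing all at once.
data = b'x' * 5000
with tempfile.NamedTemporaryFile('wb', delete=False) as tmp:
    tmp.write(data)
    demo_path = tmp.name

chunked_digest = sha1_of_file(demo_path, chunk_size=1024)
os.remove(demo_path)
print(chunked_digest == hashlib.sha1(data).hexdigest())
```

The two-argument form of iter() is just a compact way to write the while True loop above; the behavior is the same.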

For a text file, you can use the readline() method or iterate over the file object directly. Iteration is syntactic sugar provided by Python; the underlying logic of the two is the same, but iterating over the file is clearly more Pythonic. Reading one line at a time is relatively inefficient: in the author's brief test on a 3 GB file, it was roughly 20% slower than the chunked read above.

    with open(file_path, 'rb') as f:
        while True:
            line = f.readline()
            if line:
                print(line)
            else:
                break

    with open(file_path, 'rb') as f:
        for line in f:
            print(line)

3. Memory detection tools:

It is worth monitoring the memory footprint of Python code while it runs. Here are two small tools for doing so.

### memory_profiler
First, install memory_profiler with pip:

    pip install memory_profiler

memory_profiler works through Python decorators, so we need to add the @profile decorator to the function under test.

    from hashlib import sha1
    import sys

    # @profile is made available by running the script with -m memory_profiler
    @profile
    def my_func():
        sha1Obj = sha1()
        with open(sys.argv[1], 'rb') as f:
            while True:
                # Read 10 MB at a time
                buf = f.read(10 * 1024 * 1024)
                if buf:
                    sha1Obj.update(buf)
                else:
                    break
        print(sha1Obj.hexdigest())

    if __name__ == '__main__':
        my_func()

Then add **-m memory_profiler** when you run the code (for example, python -m memory_profiler your_script.py, where your_script.py is a placeholder name), and you will see the memory footprint of each line of the function.

### guppy

As before, first install guppy with pip (on Python 3, the maintained fork is published as guppy3 and is still imported as guppy):

    pip install guppy

You can then use guppy in your code to print how many objects of each Python type (list, tuple, dict, and so on) have been created and how much memory they consume.

    from guppy import hpy
    import sys

    def my_func():
        mem = hpy()
        with open(sys.argv[1], 'rb') as f:
            while True:
                buf = f.read(10 * 1024 * 1024)
                if buf:
                    # Print a heap summary after each 10 MB chunk
                    print(mem.heap())
                else:
                    break

    if __name__ == '__main__':
        my_func()

Running this prints the corresponding memory consumption data for each chunk that is read.

Both guppy and memory_profiler can be used to monitor the memory footprint of Python code as it runs.
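If neither package is available, a rough check is also possible with the standard library alone. The sketch below uses tracemalloc, which is not one of the two tools recommended here, to compare the peak allocation of a full read() against a chunked read. The helper names (peak_bytes, read_all, read_chunked) and the 8 MiB demo file are made up for illustration, and the exact numbers will vary by platform and Python version.

```python
import os
import tempfile
import tracemalloc


def peak_bytes(func, *args):
    """Run func(*args) and return the peak traced allocation in bytes."""
    tracemalloc.start()
    try:
        func(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


def read_all(path):
    # Loads the whole file into one bytes object.
    with open(path, 'rb') as f:
        f.read()


def read_chunked(path, chunk_size=64 * 1024):
    # Reads and discards one chunk at a time until EOF.
    with open(path, 'rb') as f:
        while f.read(chunk_size):
            pass


with tempfile.NamedTemporaryFile('wb', delete=False) as tmp:
    tmp.write(b'\0' * (8 * 1024 * 1024))  # 8 MiB demo file
    demo_path = tmp.name

peak_all = peak_bytes(read_all, demo_path)
peak_chunked = peak_bytes(read_chunked, demo_path)
os.remove(demo_path)

print('read():       %d bytes peak' % peak_all)
print('chunked read: %d bytes peak' % peak_chunked)
```

The peak for read() should be at least the size of the file, while the chunked version stays close to the chunk size.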

4. Summary:

Python is a concise language, but precisely because of that brevity, many details need to be examined and thought through carefully. I hope everyone will take time to summarize such details in daily work and study, and step into fewer unnecessary pits.
