[Python] Multi-Process Memory Replication

Source: Internet
Author: User

This post looks into how Python actually behaves with respect to the copy-on-write mechanism across multiple processes. The experimental results so far show that when Python creates child processes with multiprocessing, each child copies the state of the parent process (its in-memory data and so on) even when that data is never modified. So if the main process holds a large amount of data, creating workers triggers unnecessary memory replication, which can exhaust memory.
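One reason the children end up with private copies even of data they only read: CPython stores a reference count inside every object, so merely touching an object writes to its memory page and forces the copy-on-write page to be duplicated. The following minimal sketch (the names BIG and probe are illustrative, and a Unix fork start method is assumed) shows that forked workers can read the parent's data without it being passed explicitly — which is exactly why the parent's whole dataset ends up replicated:

```python
import multiprocessing

# Large structure built in the parent before the pool is created.
BIG = list(range(1000000))

def probe(i):
    # Workers read the parent's data without it being passed explicitly:
    # each child inherited a copy-on-write view of BIG at fork time.
    # Note that merely reading a CPython object updates its reference
    # count, which writes to the page and forces it to be copied.
    return BIG[i]

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=2)
    print(pool.map(probe, [0, 999999]))
    pool.close()
    pool.join()
```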


Example
For example, suppose the main process reads every line of a large file into memory, creates a pool of worker processes with multiprocessing, and hands batches of lines to the workers for processing:
[Python]
import itertools
import multiprocessing

def parse_lines(args):
    # working: args is a (line, second_args) tuple
    ...

def main_logic():
    f = open(filename, 'r')
    lines = f.readlines()   # the whole file is loaded into the parent's memory
    f.close()

    pool = multiprocessing.Pool(processes=4)
    # the third argument is the chunksize: each worker gets about a quarter of the lines
    rel = pool.map(parse_lines,
                   zip(lines, itertools.repeat(second_args)),
                   len(lines) // 4)
    pool.close()
    pool.join()



The following are the top and ps results:

(Four sub-processes)


(Parent process and four child processes)

The two figures above show that the parent process and each child process occupy about 4 GB of memory apiece. Most of that memory is taken up by the lines read from the file, so this overhead is extremely wasteful.
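Besides top and ps, memory can also be checked from inside the script itself. A minimal sketch using the standard-library resource module (Unix-only; peak_rss_kb is a hypothetical helper name, and note that ru_maxrss is reported in kilobytes on Linux but bytes on macOS):

```python
import resource

def peak_rss_kb():
    # Peak resident set size of the calling process.
    # On Linux ru_maxrss is in kilobytes (bytes on macOS).
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

print('peak RSS: %d' % peak_rss_kb())
```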

 


Optimization plans

Plan 1: reduce memory overhead through shared memory.
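The article gives no code for plan 1, so here is a minimal sketch of one common approach (the names double_slice and data are illustrative): place the data in a multiprocessing.Array shared buffer so that all workers operate on the same memory instead of per-process copies.

```python
import multiprocessing

def double_slice(shared, start, end):
    # Each worker writes its own slice of the shared buffer in place;
    # no per-process copy of the data is made.
    for i in range(start, end):
        shared[i] *= 2

if __name__ == '__main__':
    # lock=False: the workers' slices do not overlap, so no locking is needed
    data = multiprocessing.Array('i', range(8), lock=False)
    workers = [multiprocessing.Process(target=double_slice,
                                       args=(data, n * 4, (n + 1) * 4))
               for n in range(2)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    print(list(data))
```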

 

 

 

Plan 2: the main process no longer reads the file; instead, each worker process opens the file itself and reads only its assigned portion.

Improved code:


[Python]
import multiprocessing

def line_count(file_name):
    count = -1   # so that an empty file reports 0 lines
    with open(file_name) as f:
        for count, line in enumerate(f):
            pass
    # enumerate yields (index, line) starting from 0, so add 1 to get the count
    return count + 1

def parse_lines(file_name, start, end):
    f = open(file_name, 'r')
    lines = f.readlines()[start:end]   # keep only this worker's share of the lines
    f.close()
    # working
    ...

def main_logic(filename, process_num):
    total_lines = line_count(filename)
    avg_len = total_lines // process_num
    left_cnt = total_lines % process_num

    pool = multiprocessing.Pool(processes=process_num)
    for i in range(process_num):
        ext_cnt = left_cnt if i >= process_num - 1 else 0   # the last worker takes the remainder
        st_line = i * avg_len
        # tell each worker which range of lines to read for itself
        pool.apply_async(parse_lines, (filename, st_line, st_line + avg_len + ext_cnt))
    pool.close()
    pool.join()

Use top or ps to view the memory usage of the process again:
(Four sub-processes)


(Parent process and four child processes)

 

 

Summary

Comparing the two sets of measurements, the parent and child processes occupy noticeably less memory after the change: overall memory usage is roughly half of what it was, which is the effect of avoiding the memory replication.

This experiment is still in progress; there are many more ways to optimize memory usage, and we will continue to study them.
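One further direction, sketched here as an illustration rather than as part of the original experiment: instead of readlines(), let the parent iterate the file lazily and stream lines to the workers with pool.imap, so that the full file is never materialized in any process (parse_line is a hypothetical per-line worker):

```python
import multiprocessing

def parse_line(line):
    # working: hypothetical per-line processing
    return len(line)

def main_logic(filename, process_num=4):
    pool = multiprocessing.Pool(processes=process_num)
    results = []
    with open(filename, 'r') as f:
        # imap consumes the file iterator lazily, in chunks of 1000 lines,
        # so the parent never holds the whole file in memory at once
        for result in pool.imap(parse_line, f, chunksize=1000):
            results.append(result)
    pool.close()
    pool.join()
    return results
```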

 
