[Python] Multi-Process Memory Replication

Source: Internet
Author: User

This post looks into how Python actually behaves with respect to the copy-on-write mechanism across multiple processes. The experimental results so far show that when Python creates child processes with multiprocessing, each child copies the state of the parent process (its in-memory data and so on) even when that data is never modified. So if the main process holds a large amount of data, creating workers triggers unnecessary memory replication, which can exhaust memory.
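One reason the children end up with private copies even of data they only read: CPython stores a reference count inside every object, so merely touching an object writes to its memory page and forces the copy-on-write page to be duplicated. The following minimal sketch (the names BIG and probe are illustrative, and a Unix fork start method is assumed) shows that forked workers can read the parent's data without it being passed explicitly — which is exactly why the parent's whole dataset ends up replicated:

```python
import multiprocessing

# Large structure built in the parent before the pool is created.
BIG = list(range(1000000))

def probe(i):
    # Workers read the parent's data without it being passed explicitly:
    # each child inherited a copy-on-write view of BIG at fork time.
    # Note that merely reading a CPython object updates its reference
    # count, which writes to the page and forces it to be copied.
    return BIG[i]

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=2)
    print(pool.map(probe, [0, 999999]))
    pool.close()
    pool.join()
```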


Example
For example, suppose the main process reads every line of a large file into memory, creates a pool of worker processes with multiprocessing, and hands batches of lines to the workers for processing:
[Python]
import itertools
import multiprocessing

def parse_lines(args):
    # working: args is a (line, second_args) tuple
    ...

def main_logic():
    f = open(filename, 'r')
    lines = f.readlines()   # the whole file is loaded into the parent's memory
    f.close()

    pool = multiprocessing.Pool(processes=4)
    # the third argument is the chunksize: each worker gets about a quarter of the lines
    rel = pool.map(parse_lines,
                   zip(lines, itertools.repeat(second_args)),
                   len(lines) // 4)
    pool.close()
    pool.join()



The following are the top and ps results:

(Four sub-processes)


(Parent process and four child processes)

The two figures above show that the parent process and each child process occupy about 4 GB of memory apiece. Most of that memory is taken up by the lines read from the file, so this overhead is extremely wasteful.
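Besides top and ps, memory can also be checked from inside the script itself. A minimal sketch using the standard-library resource module (Unix-only; peak_rss_kb is a hypothetical helper name, and note that ru_maxrss is reported in kilobytes on Linux but bytes on macOS):

```python
import resource

def peak_rss_kb():
    # Peak resident set size of the calling process.
    # On Linux ru_maxrss is in kilobytes (bytes on macOS).
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

print('peak RSS: %d' % peak_rss_kb())
```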

 


Optimization plans

Plan 1: reduce memory overhead through shared memory.
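The article gives no code for plan 1, so here is a minimal sketch of one common approach (the names double_slice and data are illustrative): place the data in a multiprocessing.Array shared buffer so that all workers operate on the same memory instead of per-process copies.

```python
import multiprocessing

def double_slice(shared, start, end):
    # Each worker writes its own slice of the shared buffer in place;
    # no per-process copy of the data is made.
    for i in range(start, end):
        shared[i] *= 2

if __name__ == '__main__':
    # lock=False: the workers' slices do not overlap, so no locking is needed
    data = multiprocessing.Array('i', range(8), lock=False)
    workers = [multiprocessing.Process(target=double_slice,
                                       args=(data, n * 4, (n + 1) * 4))
               for n in range(2)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    print(list(data))
```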

 

 

 

Plan 2: the main process no longer reads the file; instead, each worker process opens the file itself and reads only its assigned portion.

Improved code:


[Python]
import multiprocessing

def line_count(file_name):
    count = -1   # so that an empty file reports 0 lines
    with open(file_name) as f:
        for count, line in enumerate(f):
            pass
    # enumerate yields (index, line) starting from 0, so add 1 to get the count
    return count + 1

def parse_lines(file_name, start, end):
    f = open(file_name, 'r')
    lines = f.readlines()[start:end]   # keep only this worker's share of the lines
    f.close()
    # working
    ...

def main_logic(filename, process_num):
    total_lines = line_count(filename)
    avg_len = total_lines // process_num
    left_cnt = total_lines % process_num

    pool = multiprocessing.Pool(processes=process_num)
    for i in range(process_num):
        ext_cnt = left_cnt if i >= process_num - 1 else 0   # the last worker takes the remainder
        st_line = i * avg_len
        # tell each worker which range of lines to read for itself
        pool.apply_async(parse_lines, (filename, st_line, st_line + avg_len + ext_cnt))
    pool.close()
    pool.join()

Use top or ps to view the memory usage of the process again:
(Four sub-processes)


(Parent process and four child processes)

 

 

Summary

Comparing the two sets of measurements, the parent and child processes occupy noticeably less memory after the change: overall memory usage is roughly half of what it was, which is the effect of avoiding the memory replication.

This experiment is still in progress; there are many more ways to optimize memory usage, and we will continue to study them.
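One further direction, sketched here as an illustration rather than as part of the original experiment: instead of readlines(), let the parent iterate the file lazily and stream lines to the workers with pool.imap, so that the full file is never materialized in any process (parse_line is a hypothetical per-line worker):

```python
import multiprocessing

def parse_line(line):
    # working: hypothetical per-line processing
    return len(line)

def main_logic(filename, process_num=4):
    pool = multiprocessing.Pool(processes=process_num)
    results = []
    with open(filename, 'r') as f:
        # imap consumes the file iterator lazily, in chunks of 1000 lines,
        # so the parent never holds the whole file in memory at once
        for result in pool.imap(parse_line, f, chunksize=1000):
            results.append(result)
    pool.close()
    pool.join()
    return results
```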

 
