I was curious about how much Python actually benefits from the operating system's copy-on-write mechanism across processes. From my experiments, when Python creates worker processes with multiprocessing, each child effectively ends up with its own copy of the parent process's state (its in-memory data and so on), regardless of whether that data is ever modified. (On CPython this is largely because every object access updates reference counts, which dirties the memory pages and forces them to be copied.) As a result, if the main process already holds a large amount of data, this unnecessary memory duplication can exhaust memory.
Example
For example, suppose the main process reads all lines of a large file into memory, creates a pool of worker processes through multiprocessing, and hands the lines out to the workers for processing:
[Python]
def parse_lines(args):
    # worker logic
    ...

def main_logic():
    f = open(filename, 'r')
    lines = f.readlines()
    f.close()
    pool = multiprocessing.Pool(processes=4)
    rel = pool.map(parse_lines,
                   itertools.izip(lines, itertools.repeat(second_args)),
                   int(len(lines) / 4))
    pool.close()
    pool.join()
The following are the top and ps results:
(Four sub-processes)
(Parent process and four child processes)
The two figures show that the parent process and each child process occupy about 4 GB of memory apiece. Most of that memory holds the `lines` read from the file, so the overhead is wastefully multiplied across processes.
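The figures came from top and ps; the same kind of number can also be read from inside Python with the standard-library resource module (Unix only; on Linux, ru_maxrss is reported in kilobytes):

```python
import resource

# allocate some data so the peak resident set size is visibly non-trivial
payload = ["x" * 100 for _ in range(100000)]
peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print("peak RSS: %d kB" % peak_kb)
```

This is handy for logging a process's own memory high-water mark without shelling out to ps.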
Optimization plan
Plan 1: reduce the overhead through shared memory.
Plan 2: stop reading the file in the main process; instead, have each worker process open the file itself and read only its assigned portion.
Improved code:
[Python]
def line_count(file_name):
    count = -1  # so that an empty file reports 0 lines
    for count, line in enumerate(open(file_name)): pass
    # enumerate yields (index, line) tuples; count ends as the last
    # zero-based line index, so add 1 to get the line count
    return count + 1

def parse_lines(args):
    f = open(args[0], 'r')
    lines = f.readlines()[args[1]:args[2]]  # keep only this worker's slice
    f.close()
    # worker logic

def main_logic(filename, process_num):
    total_lines = line_count(filename)
    avg_len = int(total_lines / process_num)
    left_cnt = total_lines % process_num
    pool = multiprocessing.Pool(processes=process_num)
    for i in xrange(0, process_num):
        # the last worker also takes the leftover lines
        ext_cnt = (i >= process_num - 1 and [left_cnt] or [0])[0]
        st_line = i * avg_len
        # each task tells one worker which range of lines to read
        pool.apply_async(parse_lines,
                         ((filename, st_line, st_line + avg_len + ext_cnt),))
    pool.close()
    pool.join()
Checking the processes' memory usage again with top and ps:
(Four sub-processes)
(Parent process and four child processes)
Summary
Comparing the two sets of measurements, the memory occupied by the parent process and the child processes drops markedly after the improvement; the total is roughly half of the original, which is exactly the effect of eliminating the duplicated copy of the data.
This experiment is still a work in progress; there is plenty of room for further memory optimization, which I will keep exploring.