Python multi-core programming: mpi4py in practice


 

I. Overview

CPUs have evolved from the 8086 of long ago, through the Pentium of a decade back, to today's multi-core i7. In the beginning, performance grew quickly by raising the clock speed of a single core; thanks to architectural improvements and advances in integrated-circuit technology, single-core frequencies climbed from the MHz range to nearly 4 GHz. But because of physical and power-consumption limits, the single-core CPU has hit a ceiling, and a change of thinking was needed to keep up with endless performance demands. So the multi-core CPU stepped onto the stage: bolt two more engines onto your car and it feels like a Ferrari. These days even mobile phones boast 4-core and 8-core processors, let alone PCs.

For programmers, the question is how to harness such a powerful engine to get our work done. With ever larger data sets, larger problems, and more complex systems, single-core programming has become inadequate; if a program takes hours or even a whole day to run, you can hardly forgive yourself. So how can we quickly get started with multi-core parallel programming? Ha, by the power of the masses!

So far I have worked with three parallel-processing frameworks: MPI, OpenMP, and MapReduce (Hadoop) (CUDA is GPU parallel programming and is not covered here). MPI and Hadoop can both run on a cluster, while OpenMP, being based on shared memory, cannot; it only runs on a single machine. In addition, MPI can keep data in memory and preserve context for communication and data exchange between nodes, so it can run iterative algorithms, which Hadoop cannot. For this reason most machine learning algorithms that require iteration are implemented with MPI, although some of them can also be made to work on Hadoop with careful design. (If anything here is wrong, please kindly point it out. Thank you.)

This article mainly introduces the practical basics of MPI programming in the Python environment.

 

II. MPI and mpi4py

MPI stands for Message Passing Interface. In the message-passing model, each process in a parallel program has its own independent stack and code segment and runs as an independent, unrelated program; information exchange between processes is done by explicitly calling communication functions.

mpi4py is a Python library built on top of MPI that allows Python data structures to be transmitted between processes (or between multiple CPUs).

2.1 How MPI works

It is very simple: you start a group of MPI processes, and every process executes the same code! Each process then has an ID, its rank, to mark who it is. What does this mean? Suppose each CPU is a worker you hire, and there are 10 workers in total. You have 100 bricks to move, and it is only fair to have each worker move 10. You write the job down on a task card and have the 10 workers all carry out the same card, that is, move bricks! The "move bricks" on the task card is the code you write; 10 CPUs then run the same piece of code. Note that every variable in the code is private to each process, even though the names are the same.

For example, a script named test.py contains the following code:

 

from mpi4py import MPI

comm = MPI.COMM_WORLD
print('hello world')
print('my rank is: %d' % comm.Get_rank())
Run the following command in the command line:

 

# mpirun -np 5 python test.py

-np 5 tells mpirun to start five MPI processes to execute the program. It is as if the script were copied five times, each process running one copy without interfering with the others; the only difference at run time is that the rank (ID) differs. So this code prints "hello world" five times along with five different rank values, from 0 to 4.

2.2 Point-to-point communication

Point-to-point communication is the most basic facility of a message-passing system: two processes exchange data directly, one sending and the other receiving. The two interfaces are send and recv. Here is an example:

 

import mpi4py.MPI as MPI

comm = MPI.COMM_WORLD
comm_rank = comm.Get_rank()
comm_size = comm.Get_size()

# point-to-point communication: pass a list around the ring of processes
data_send = [comm_rank] * 5
comm.send(data_send, dest=(comm_rank + 1) % comm_size)
data_recv = comm.recv(source=(comm_rank - 1) % comm_size)
print('my rank is %d, and I received: %s' % (comm_rank, data_recv))

 

Start five processes and run the code above. The result is as follows:

 

my rank is 0, and I received: [4, 4, 4, 4, 4]
my rank is 1, and I received: [0, 0, 0, 0, 0]
my rank is 2, and I received: [1, 1, 1, 1, 1]
my rank is 3, and I received: [2, 2, 2, 2, 2]
my rank is 4, and I received: [3, 3, 3, 3, 3]

 

We can see that each process creates a list, passes it to the next process, and the last process passes it back to the first. comm_size is the number of MPI processes, i.e., the number given by -np, and MPI.COMM_WORLD is the communicator (communication group) the processes belong to.

However, there is a pitfall to be aware of. If the amount of data to send is small, MPI buffers it for us: when the send call executes, the data is copied into a buffer and the following statements run without waiting for the other process to call recv. But if the data to send is large, the sending process suspends and waits until the receiving process actually executes recv. So the code above works fine when sending [comm_rank] * 5, but if you send [comm_rank] * 500 the program can hang: every process is stuck in its send, waiting for the next process to post the matching recv, yet each process will only reach its recv after its own send completes. This is essentially a deadlock. We can therefore change the code to the following:

 

import mpi4py.MPI as MPI

comm = MPI.COMM_WORLD
comm_rank = comm.Get_rank()
comm_size = comm.Get_size()

data_send = [comm_rank] * 5
if comm_rank == 0:
    comm.send(data_send, dest=(comm_rank + 1) % comm_size)
if comm_rank > 0:
    data_recv = comm.recv(source=(comm_rank - 1) % comm_size)
    comm.send(data_send, dest=(comm_rank + 1) % comm_size)
if comm_rank == 0:
    data_recv = comm.recv(source=(comm_rank - 1) % comm_size)
print('my rank is %d, and I received: %s' % (comm_rank, data_recv))

 

The first process sends right away, while every other process starts by waiting to receive. Process 1 receives the data from process 0 and then sends its own data; process 2 receives that and sends its own data in turn, and so on, until process 0 finally receives the data from the last process. This avoids the problem above.
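Another way to break this kind of deadlock (my own addition, not part of the original article) is mpi4py's non-blocking send, isend, which returns immediately and hands back a request object that can be waited on later. A minimal sketch of the same ring exchange:

import mpi4py.MPI as MPI

comm = MPI.COMM_WORLD
comm_rank = comm.Get_rank()
comm_size = comm.Get_size()

data_send = [comm_rank] * 500                                  # large enough to block a plain send
req = comm.isend(data_send, dest=(comm_rank + 1) % comm_size)  # non-blocking send returns at once
data_recv = comm.recv(source=(comm_rank - 1) % comm_size)      # every process can now post its recv
req.wait()                                                     # make sure our own send has completed
print('my rank is %d, and I received %d items' % (comm_rank, len(data_recv)))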

A common arrangement is to designate a leader, that is, a main process; usually process 0 plays this role. The main process distributes data to the other processes, the other processes do the work and send their results back to process 0, and process 0 coordinates the whole computation.
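As an illustration of this leader/worker pattern, here is a minimal sketch of my own (the task content is made up; this is not code from the original article):

import mpi4py.MPI as MPI

comm = MPI.COMM_WORLD
comm_rank = comm.Get_rank()
comm_size = comm.Get_size()

if comm_rank == 0:
    # the leader: hand one task to every worker, then collect the results
    for worker in range(1, comm_size):
        comm.send('task for rank %d' % worker, dest=worker)
    results = [comm.recv(source=worker) for worker in range(1, comm_size)]
    print('leader collected: %s' % results)
else:
    # a worker: receive a task, "process" it, and send the result back to rank 0
    task = comm.recv(source=0)
    comm.send(task.upper(), dest=0)

Run it with at least two processes, e.g. mpirun -np 5.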

2.3 Group communication

Point-to-point communication is A telling B a message, like one person whispering a secret to another. Group (collective) communication is picking up a big megaphone and telling everyone at once: the former is one-to-one, the latter one-to-many. Collective communication is also organized to work more efficiently, and the principle is simple: use all processes as much as possible, all the time! We describe this below in the bcast section.

Group communication covers both directions: sending data to everyone at once, and collecting results back from everyone at once.

1) Broadcast: bcast

bcast sends one copy of the data to every process. For example, if I have 200 items of data and 10 processes, every process ends up with all 200 items.

 

import mpi4py.MPI as MPI

comm = MPI.COMM_WORLD
comm_rank = comm.Get_rank()
comm_size = comm.Get_size()

if comm_rank == 0:
    data = list(range(comm_size))
data = comm.bcast(data if comm_rank == 0 else None, root=0)
print('rank %d, got: %s' % (comm_rank, data))

 

The result is as follows:

 

rank 0, got: [0, 1, 2, 3, 4]
rank 1, got: [0, 1, 2, 3, 4]
rank 2, got: [0, 1, 2, 3, 4]
rank 3, got: [0, 1, 2, 3, 4]
rank 4, got: [0, 1, 2, 3, 4]

 

The root process creates a list and broadcasts it to all processes. Afterwards every process has the list and can do with it whatever it likes.

The most naive picture of a broadcast is a particular process sending the data to every other process one by one. With n processes and the data initially on process 0, process 0 would have to send it to the remaining n-1 processes, which is very inefficient: the complexity is O(n). Is there a faster way? The most common efficient approach is a tree broadcast, in which every process that has already received the data joins in forwarding it. At first only one process has the data and sends it to a second process, so two processes have it; both take part in the next round, after which four processes have it, and so on, with the number of processes holding the data doubling every round. With this tree-style broadcast the complexity drops to O(log n). This is the principle behind efficient collective communication: make full use of all processes to send and receive data.
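To make the O(log n) claim concrete, here is a tiny stand-alone sketch (plain Python arithmetic, not mpi4py, and my own addition) that counts how many doubling rounds a tree broadcast needs to reach n processes:

def tree_broadcast_rounds(n):
    # the number of processes holding the data doubles every round,
    # so about ceil(log2(n)) rounds are enough to reach all n processes
    have_data, rounds = 1, 0
    while have_data < n:
        have_data *= 2
        rounds += 1
    return rounds

for n in (2, 5, 16, 100):
    print('%3d processes -> %d broadcast rounds' % (n, tree_broadcast_rounds(n)))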

2) Scatter: scatter

scatter divides a piece of data among all processes. For example, if I have 200 items of data and 10 processes, each process gets 20 of them.

 

import mpi4py.MPI as MPI

comm = MPI.COMM_WORLD
comm_rank = comm.Get_rank()
comm_size = comm.Get_size()

if comm_rank == 0:
    data = list(range(comm_size))
    print(data)
else:
    data = None
local_data = comm.scatter(data, root=0)
print('rank %d, got: %s' % (comm_rank, local_data))

 

The result is as follows:

 

[0, 1, 2, 3, 4]
rank 0, got: 0
rank 1, got: 1
rank 2, got: 2
rank 3, got: 3
rank 4, got: 4

 

Here the root process creates a list and scatters it to all processes, which amounts to splitting the list up: each process gets one piece, in this case one number of the list (the split follows the list index, so the element at index i is sent to process i). If the data were a matrix, the rows would be divided evenly and each process would get the same number of rows to work on.

Note that in MPI every process executes all of the code, so every process executes the scatter call; but only when the root executes it does it combine the roles of sender and receiver (the root also gets its own share of the data), while every other process acts purely as a receiver.
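One practical detail (my own note, not from the original article): the lowercase comm.scatter expects a sequence with exactly comm_size elements, so to hand out, say, 200 items you first chop them into comm_size chunks, roughly like this:

import mpi4py.MPI as MPI

comm = MPI.COMM_WORLD
comm_rank = comm.Get_rank()
comm_size = comm.Get_size()

if comm_rank == 0:
    items = list(range(200))                                  # 200 items of data
    chunks = [items[i::comm_size] for i in range(comm_size)]  # comm_size roughly equal chunks
else:
    chunks = None
local_items = comm.scatter(chunks, root=0)
print('rank %d got %d items' % (comm_rank, len(local_items)))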

3) Gather: gather

Where there is a distribution, there must be a way to collect everything back. gather gathers the data from all processes and combines it into a list on the root. Here is a complete distribute-and-collect cycle combining scatter and gather:

 

import mpi4py.MPI as MPI

comm = MPI.COMM_WORLD
comm_rank = comm.Get_rank()
comm_size = comm.Get_size()

if comm_rank == 0:
    data = list(range(comm_size))
    print(data)
else:
    data = None
local_data = comm.scatter(data, root=0)
local_data = local_data * 2
print('rank %d, got and do: %s' % (comm_rank, local_data))
combine_data = comm.gather(local_data, root=0)
if comm_rank == 0:
    print(combine_data)

 

The result is as follows:

 

[0, 1, 2, 3, 4]
rank 0, got and do: 0
rank 1, got and do: 2
rank 2, got and do: 4
rank 4, got and do: 8
rank 3, got and do: 6
[0, 2, 4, 6, 8]

 

The root process distributes the data to all processes with scatter; each process does its work (here simply multiplying by 2); the root then collects their results with gather and, mirroring the distribution, assembles them into a list. A variant of gather is allgather, which you can think of as a gather followed by a bcast of the gathered result: the root collects the results from all processes and then announces the combined result to everyone, so that not only the root but every process can access combine_data.
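As a quick illustration of allgather, here is a minimal sketch of my own, reusing the same scatter example:

import mpi4py.MPI as MPI

comm = MPI.COMM_WORLD
comm_rank = comm.Get_rank()
comm_size = comm.Get_size()

if comm_rank == 0:
    data = list(range(comm_size))
else:
    data = None
local_data = comm.scatter(data, root=0) * 2
combine_data = comm.allgather(local_data)   # every process receives the full combined list
print('rank %d sees: %s' % (comm_rank, combine_data))

With five processes, every rank prints [0, 2, 4, 6, 8].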

4) Reduce: reduce

A reduction not only collects all the data back, it also performs a simple computation along the way, such as a sum or a maximum. Why bother? Couldn't we just gather everything into a list and then compute sum or max at the root? That would wear out the team leader while the workers stand idle. A reduction is in fact implemented with the same kind of tree: for a max, for example, workers first compare their values in pairs, the winners of each pair are compared at the next level, and so on until a single maximum reaches the leader. The leader delegates cleverly and efficiently, bringing the complexity down from O(n) to O(log n) (with base 2).

 

import mpi4py.MPI as MPI

comm = MPI.COMM_WORLD
comm_rank = comm.Get_rank()
comm_size = comm.Get_size()

if comm_rank == 0:
    data = list(range(comm_size))
    print(data)
else:
    data = None
local_data = comm.scatter(data, root=0)
local_data = local_data * 2
print('rank %d, got and do: %s' % (comm_rank, local_data))
all_sum = comm.reduce(local_data, root=0, op=MPI.SUM)
if comm_rank == 0:
    print('sum is: %d' % all_sum)

 

The result is as follows:

 

[0, 1, 2, 3, 4]
rank 0, got and do: 0
rank 1, got and do: 2
rank 2, got and do: 4
rank 3, got and do: 6
rank 4, got and do: 8
sum is: 20

 

As you can see, we obtain the sum of all the per-process values.
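Other reduction operators work the same way; for instance (my own variation, not in the original article), swapping the operator gives the maximum instead of the sum:

import mpi4py.MPI as MPI

comm = MPI.COMM_WORLD
comm_rank = comm.Get_rank()

local_data = comm_rank * 2                              # same per-process value as above
all_max = comm.reduce(local_data, root=0, op=MPI.MAX)   # maximum instead of sum
if comm_rank == 0:
    print('max is: %d' % all_max)                       # with 5 processes: max is: 8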

III. Common usage

3.1 Parallel processing of the lines of a file

 

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys
import os
import mpi4py.MPI as MPI
import numpy as np

#
# Global variables for MPI
#
# instance for invoking MPI related functions
comm = MPI.COMM_WORLD
# the node rank in the whole community
comm_rank = comm.Get_rank()
# the size of the whole community, i.e., the total number of working nodes in the MPI cluster
comm_size = comm.Get_size()

if __name__ == '__main__':
    if comm_rank == 0:
        sys.stderr.write('processor root starts reading data...\n')
        all_lines = sys.stdin.readlines()
    all_lines = comm.bcast(all_lines if comm_rank == 0 else None, root=0)
    num_lines = len(all_lines)
    local_lines_offset = np.linspace(0, num_lines, comm_size + 1).astype('int')
    local_lines = all_lines[local_lines_offset[comm_rank]:local_lines_offset[comm_rank + 1]]
    sys.stderr.write('%d/%d processor gets %d/%d data\n' %
                     (comm_rank, comm_size, len(local_lines), num_lines))
    cnt = 0
    for line in local_lines:
        fields = line.strip().split()   # split the line into fields here
        cnt += 1
        if cnt % 100 == 0:
            sys.stderr.write('processor %d has processed %d/%d lines\n' %
                             (comm_rank, cnt, len(local_lines)))
        output = line.strip() + ' process every line here'
        print(output)

 

3.2 Parallel processing of multiple files

If a single file is too large, say tens of millions of lines, MPI cannot bcast such a large amount of data to all processes. We can instead split the big file into many small files and let each process handle a few of them.

 

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys
import os
import mpi4py.MPI as MPI
import numpy as np

#
# Global variables for MPI
#
# instance for invoking MPI related functions
comm = MPI.COMM_WORLD
# the node rank in the whole community
comm_rank = comm.Get_rank()
# the size of the whole community, i.e., the total number of working nodes in the MPI cluster
comm_size = comm.Get_size()

if __name__ == '__main__':
    if len(sys.argv) != 2:
        sys.stderr.write('Usage: python *.py directory_with_files\n')
        sys.exit(1)
    path = sys.argv[1]
    if comm_rank == 0:
        file_list = os.listdir(path)
        sys.stderr.write('%d files\n' % len(file_list))
    file_list = comm.bcast(file_list if comm_rank == 0 else None, root=0)
    num_files = len(file_list)
    local_files_offset = np.linspace(0, num_files, comm_size + 1).astype('int')
    local_files = file_list[local_files_offset[comm_rank]:local_files_offset[comm_rank + 1]]
    sys.stderr.write('%d/%d processor gets %d/%d data\n' %
                     (comm_rank, comm_size, len(local_files), num_files))

    cnt = 0
    for file_name in local_files:
        hd = open(os.path.join(path, file_name))
        for line in hd:
            output = line.strip() + ' process every line here'
            print(output)
        cnt += 1
        sys.stderr.write('processor %d has processed %d/%d files\n' %
                         (comm_rank, cnt, len(local_files)))
        hd.close()

 

3.3 Using numpy to process the rows or columns of a matrix in parallel

mpi4py has excellent support for numpy!

 

import os, sys, time
import numpy as np
import mpi4py.MPI as MPI

#
# Global variables for MPI
#
# instance for invoking MPI related functions
comm = MPI.COMM_WORLD
# the node rank in the whole community
comm_rank = comm.Get_rank()
# the size of the whole community, i.e., the total number of working nodes in the MPI cluster
comm_size = comm.Get_size()

# test MPI
if __name__ == '__main__':
    # create a matrix
    if comm_rank == 0:
        all_data = np.arange(20).reshape(4, 5)
        print('************ data ******************')
        print(all_data)

    # broadcast the data to all processors
    all_data = comm.bcast(all_data if comm_rank == 0 else None, root=0)

    # divide the data among the processors
    num_samples = all_data.shape[0]
    local_data_offset = np.linspace(0, num_samples, comm_size + 1).astype('int')

    # get the local data which will be processed in this processor
    local_data = all_data[local_data_offset[comm_rank]:local_data_offset[comm_rank + 1]]
    print('****** %d/%d processor gets local data ****' % (comm_rank, comm_size))
    print(local_data)

    # reduce to get the sum of all elements
    local_sum = local_data.sum()
    all_sum = comm.reduce(local_sum, root=0, op=MPI.SUM)

    # process locally
    local_result = local_data ** 2

    # gather the result from all processors and broadcast it
    result = comm.allgather(local_result)
    result = np.vstack(result)

    if comm_rank == 0:
        print('*** sum: %s' % all_sum)
        print('************ result ******************')
        print(result)
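A side note of my own, not from the original article: besides the lowercase, pickle-based methods used above, mpi4py also offers uppercase methods (Send, Recv, Bcast, Reduce, ...) that transfer numpy arrays directly through their memory buffers, which avoids the pickling overhead for large arrays. A minimal sketch (run with at least two processes, e.g. mpirun -np 2):

import numpy as np
import mpi4py.MPI as MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    data = np.arange(10, dtype='float64')
    comm.Send(data, dest=1, tag=11)        # buffer-based send of a numpy array
elif rank == 1:
    data = np.empty(10, dtype='float64')   # pre-allocated receive buffer
    comm.Recv(data, source=0, tag=11)
    print('rank 1 received: %s' % data)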

 

IV. Setting up the MPI and mpi4py environment

This chapter is included as an appendix. Our environment is Linux. The packages to install are Python, openmpi, numpy, Cython, and mpi4py. The process is as follows:

4.1 install Python

 

#tar xzvf Python-2.7.tgz
#cd Python-2.7
#./configure --prefix=/home/work/vis/zouxiaoyi/my_tools
#make
#make install

 

First put Python on the PATH, together with the search path for Python libraries:

 

export PATH=/home/work/vis/zouxiaoyi/my_tools/bin:$PATH
export PYTHONPATH=/home/work/vis/zouxiaoyi/my_tools/lib/python2.7/site-packages:$PYTHONPATH

 

Run python. If you see the friendly >>> prompt, the installation was successful. Press Ctrl+D to exit.

4.2 install openmpi

 

#wget http://www.open-mpi.org/software/ompi/v1.4/downloads/openmpi-1.4.1.tar.gz
#tar xzvf openmpi-1.4.1.tar.gz
#cd openmpi-1.4.1
#./configure --prefix=/home/work/vis/zouxiaoyi/my_tools
#make -j 8
#make install

 

Then add the bin and lib paths to the environment variables:

 

export PATH=/home/work/vis/zouxiaoyi/my_tools/bin:$PATH
export LD_LIBRARY_PATH=/home/work/vis/zouxiaoyi/my_tools/lib:$LD_LIBRARY_PATH

 

Run mpirun. If the help message is printed, the installation is complete. One caveat: I failed to install several other versions and only succeeded with version 1.4.1, so it may take a bit of luck.

4.3 install numpy and Cython

For how to install a Python library, refer to my earlier blog post. The process is generally as follows:

 

#tar -xzvf Cython-0.20.2.tar.gz
#cd Cython-0.20.2
#python setup.py install

 

Open Python and import Cython. If no error is reported, the installation is successful.

4.4 install mpi4py

 

#tar -xzvf mpi4py_1.3.1.tar.gz
#cd mpi4py
#vi mpi.cfg

 

At line 68, under the [openmpi] section, change mpi_dir to the directory where openmpi was installed above, then build and install:

 

mpi_dir = /home/work/vis/zouxiaoyi/my_tools

#python setup.py install

 

Open Python and run import mpi4py.MPI as MPI. If no error is reported, the installation was successful.

Now you can start your parallel journey. Be brave enough to explore the fun of multi-core.

 

 
