MXNet: Multi-GPU Parallel Programming
I. Overview of Ideas
Suppose a machine has k GPUs. Given the model to be trained, each GPU independently maintains a complete copy of the model parameters.
In any iteration of training, given a random minibatch, we split its samples into k shares and give one share to each GPU.
Each GPU then computes the gradient of the model parameters based on the data shard it was given and the parameter copy it maintains.
Next, the local gradients from the k GPUs are added together to obtain the gradient for the current minibatch.
Finally, each GPU uses this minibatch gradient to update the complete copy of the model parameters that it maintains.
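The reason the summed per-GPU gradients can stand in for the full minibatch gradient is that, for a loss written as a sum over examples, the gradient on the whole batch equals the sum of the gradients on its shards. The tiny CPU-only sketch below is not from the original text; it uses a toy squared-error loss and made-up data purely to check this numerically:

# Minimal numerical check that shard gradients add up to the full-batch gradient.
from mxnet import autograd, nd

w = nd.array([2.0, -3.4])           # toy parameter vector
X = nd.random.normal(shape=(4, 2))  # 4 fake examples
y = nd.random.normal(shape=(4,))    # 4 fake labels

def grad_of_loss(Xs, ys):
    # Gradient of a summed squared-error loss with respect to a fresh copy of w.
    wc = w.copy()
    wc.attach_grad()
    with autograd.record():
        l = ((nd.dot(Xs, wc) - ys) ** 2).sum()
    l.backward()
    return wc.grad

print(grad_of_loss(X, y))                                        # gradient on the full batch
print(grad_of_loss(X[:2], y[:2]) + grad_of_loss(X[2:], y[2:]))   # sum of the two shard gradients

Both prints show (up to floating-point error) the same vector. This is also why the SGD update at the end of this section divides by the full batch size: the summed gradients correspond to a loss summed over every example in the minibatch.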
II. Network and auxiliary functions
We use the LeNet model introduced in the "Convolutional Neural Networks from Scratch" section as the example model for this section:
import gluonbook as gb  # companion utilities from the book (load_data_fashion_mnist, evaluate_accuracy, sgd)
import mxnet as mx
from mxnet import autograd, nd
from mxnet.gluon import loss as gloss
from time import time

# Initialize the model parameters.
scale = 0.01
W1 = nd.random.normal(scale=scale, shape=(20, 1, 3, 3))
b1 = nd.zeros(shape=20)
W2 = nd.random.normal(scale=scale, shape=(50, 20, 5, 5))
b2 = nd.zeros(shape=50)
# The fully connected shapes below assume 28 x 28 Fashion-MNIST images,
# which flatten to 50 * 4 * 4 = 800 features after the two conv/pool stages.
W3 = nd.random.normal(scale=scale, shape=(800, 128))
b3 = nd.zeros(shape=128)
W4 = nd.random.normal(scale=scale, shape=(128, 10))
b4 = nd.zeros(shape=10)
params = [W1, b1, W2, b2, W3, b3, W4, b4]

# Define the model.
def lenet(X, params):
    h1_conv = nd.Convolution(data=X, weight=params[0], bias=params[1],
                             kernel=(3, 3), num_filter=20)
    h1_activation = nd.relu(h1_conv)
    h1 = nd.Pooling(data=h1_activation, pool_type='avg', kernel=(2, 2),
                    stride=(2, 2))
    h2_conv = nd.Convolution(data=h1, weight=params[2], bias=params[3],
                             kernel=(5, 5), num_filter=50)
    h2_activation = nd.relu(h2_conv)
    h2 = nd.Pooling(data=h2_activation, pool_type='avg', kernel=(2, 2),
                    stride=(2, 2))
    h2 = nd.flatten(h2)
    h3_linear = nd.dot(h2, params[4]) + params[5]
    h3 = nd.relu(h3_linear)
    y_hat = nd.dot(h3, params[6]) + params[7]
    return y_hat

# Cross-entropy loss function.
loss = gloss.SoftmaxCrossEntropyLoss()
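As a quick sanity check (not part of the original code), we can run the randomly initialized network on a dummy Fashion-MNIST-sized batch on the CPU and confirm that the output has one row per example and ten class scores per row:

# Illustrative sanity check: 4 fake 1 x 28 x 28 images should yield 4 rows of 10 class scores.
X = nd.random.uniform(shape=(4, 1, 28, 28))
print(lenet(X, params).shape)  # Expected: (4, 10)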
Copying the parameter list to a specified device
The following function copies the model parameters [parameter one, parameter two, ...] to a specific GPU and attaches gradient buffers to the copies:
def get_params(params, ctx):
    # Copy each parameter to the target device and allocate gradient memory there.
    new_params = [p.copyto(ctx) for p in params]
    for p in new_params:
        p.attach_grad()
    return new_params
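For instance, assuming the machine has at least one GPU, copying params to gpu(0) could look like this (an illustrative snippet, not from the original text); the copies come back with gradient buffers attached:

# Illustrative usage; assumes at least one GPU is available.
new_params = get_params(params, mx.gpu(0))
print('b1 weight:', new_params[1])
print('b1 grad:', new_params[1].grad)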
Synchronizing the same parameter across devices
The following function sums a given parameter's data across all GPUs and then broadcasts the result back to every GPU:
def allreduce(data):
    # The input is a list holding copies of the same parameter on different devices.
    for i in range(1, len(data)):
        data[0][:] += data[i].copyto(data[0].context)  # Copy element i to device 0 and accumulate it.
    for i in range(1, len(data)):
        data[0].copyto(data[i])  # Overwrite element i with the accumulated result from device 0.
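A small way to check the behavior (illustrative, and it assumes two GPUs) is to place different values on each device, run allreduce, and confirm that both copies end up holding the sum:

# Illustrative test; assumes two GPUs. data[0] starts as [[1, 1]] and data[1] as [[2, 2]].
data = [nd.ones((1, 2), ctx=mx.gpu(i)) * (i + 1) for i in range(2)]
print('before allreduce:', data)
allreduce(data)
print('after allreduce:', data)  # Both entries now hold [[3, 3]] on their own device.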
Splitting data across devices
Given a batch of data samples, the following function divides them evenly and copies each part onto a GPU:
def split_and_load(data, ctx):
    # Split the batch into len(ctx) equal shards and place one shard on each device.
    n, k = data.shape[0], len(ctx)
    m = n // k
    assert m * k == n, '# examples is not divided by # devices.'
    return [data[i * m: (i + 1) * m].as_in_context(ctx[i]) for i in range(k)]
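For example (illustrative, and again assuming two GPUs), six rows of data split into two shards of three rows each:

# Illustrative usage; assumes two GPUs.
batch = nd.arange(24).reshape((6, 4))
ctx = [mx.gpu(0), mx.gpu(1)]
print('input:', batch)
print('load into:', ctx)
print('output:', split_and_load(batch, ctx))  # Two (3, 4) shards, one per GPU.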
III. The training process
The train function first copies the full model parameters onto each GPU, then performs multi-GPU training on a single minibatch at every iteration:
def train(num_gpus, batch_size, lr):
    train_iter, test_iter = gb.load_data_fashion_mnist(batch_size)
    ctx = [mx.gpu(i) for i in range(num_gpus)]  # List of devices to use.
    print('running on:', ctx)
    # Copy the model parameters to each of the num_gpus GPUs.
    gpu_params = [get_params(params, c) for c in ctx]  # Each element holds the parameters on one device.
    for epoch in range(1, 6):
        start = time()
        for X, y in train_iter:
            # Multi-GPU training on a single minibatch.
            train_batch(X, y, gpu_params, ctx, lr)
        nd.waitall()
        print('epoch %d, time: %.1f sec' % (epoch, time() - start))
        # Validate the model on gpu(0).
        net = lambda x: lenet(x, gpu_params[0])
        test_acc = gb.evaluate_accuracy(test_iter, net, ctx[0])
        print('validation accuracy: %.4f' % test_acc)
The train_batch function called above implements multi-GPU training on a single minibatch:
def train_batch(X, y, gpu_params, ctx, lr):
    # Divide the minibatch of samples and copy the shards onto each GPU.
    gpu_Xs = split_and_load(X, ctx)
    gpu_ys = split_and_load(y, ctx)
    # Compute the loss on each GPU.
    with autograd.record():
        ls = [loss(lenet(gpu_X, gpu_W), gpu_y)  # One loss per device.
              for gpu_X, gpu_y, gpu_W in zip(gpu_Xs, gpu_ys, gpu_params)]
    # Back-propagate on each GPU.
    for l in ls:
        l.backward()
    # Sum the gradients across GPUs, then broadcast the sum to all GPUs.
    for i in range(len(gpu_params[0])):  # gpu_params[0]: all parameters on device 0.
        allreduce([gpu_params[c][i].grad for c in range(len(ctx))])
    # Each GPU updates the full copy of the model parameters it maintains.
    for param in gpu_params:
        gb.sgd(param, lr, X.shape[0])
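With train_batch in place, the whole pipeline can be launched, for example on two GPUs (the hyperparameter values below are illustrative; adjust them to your hardware):

# Example invocation; assumes two GPUs are available.
train(num_gpus=2, batch_size=256, lr=0.2)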
"MXNet"--multi-GPU parallel programming