Distributed Gradient Descent
Parallelism Models
Model Parallelism: Different machines (such as GPUs and CPUs) in a distributed system are responsible for different parts of the network model. For example, different layers of the neural network may be assigned to different machines, or different model parameters within the same layer may be assigned to different machines.
Data Parallelism: Different machines hold copies of the same model, each machine is assigned a different portion of the data, and the computation results of all machines are then merged in some way.
Parallel Methods
Synchronous Data Parallel Methods
The data is split into many small portions, and each portion of training data is assigned to one machine. Each machine then trains on its local data and updates the neural network through gradient descent. After each machine has updated its local copy of the network, we want the machines to cooperate: each machine pushes what it has learned to a shared parameter server, the server collects everything sent from the local machines and merges it, and after this merge it obtains a model state that takes into account the information from all training machines. It then sends this model back to each machine, and each machine uses these parameters as the starting point for the next round of training on its own data. The cycle then starts again. This is the classic parameter server architecture.
A synchronous parallel implementation is equivalent to serial gradient descent, except that the effective mini-batch size changes. However, the training speed is always limited by the slowest machine.
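Below is a minimal sketch of this synchronous parameter-server loop, using NumPy and a simple least-squares model as a stand-in for a neural network. The shard count, learning rate, and structure are illustrative choices, not taken from any particular framework.

```python
import numpy as np

# Toy stand-in for a neural network: linear least squares, loss = ||X w - y||^2 / (2n).
rng = np.random.default_rng(0)
true_w = np.array([2.0, -3.0, 0.5])
X = rng.normal(size=(600, 3))
y = X @ true_w + 0.01 * rng.normal(size=600)

# Split the data into shards, one per worker machine.
num_workers = 4
shards = list(zip(np.array_split(X, num_workers), np.array_split(y, num_workers)))

def local_gradient(w, Xs, ys):
    """Gradient of the local least-squares loss at parameters w."""
    return Xs.T @ (Xs @ w - ys) / len(ys)

w_center = np.zeros(3)   # central model held by the parameter server
lr = 0.1

for step in range(200):
    # 1. The server broadcasts the current parameters to every worker.
    # 2. Each worker computes a gradient on its own data shard.
    grads = [local_gradient(w_center, Xs, ys) for Xs, ys in shards]
    # 3. The server waits for all workers (synchronous), merges the gradients by
    #    averaging, and updates the central model; this matches one serial SGD
    #    step with a larger effective mini-batch.
    w_center -= lr * np.mean(grads, axis=0)

print("learned:", np.round(w_center, 3), "true:", true_w)
```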
Model Averaging
Parameter averaging is the simplest form of data parallelism. If the parameter averaging method is used, the training process is as follows (a minimal sketch is given after the list):
- Initialize the network parameters randomly based on the model configuration
- Distribute the current set of parameters to each worker node
- Train on a portion of the dataset at each worker node
- Use the mean of the parameters from each worker node as the new global parameter value
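The sketch mentioned above follows here: each worker runs a few local SGD steps on a toy least-squares problem and the worker parameters are then averaged. The shard count, step count, and learning rate are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(1)
true_w = np.array([1.5, -2.0, 0.7])
X = rng.normal(size=(600, 3))
y = X @ true_w + 0.01 * rng.normal(size=600)

num_workers = 4
shards = list(zip(np.array_split(X, num_workers), np.array_split(y, num_workers)))

def local_sgd(w, Xs, ys, lr=0.1, steps=10):
    """Run a few plain gradient steps on one worker's local data shard."""
    w = w.copy()
    for _ in range(steps):
        w -= lr * Xs.T @ (Xs @ w - ys) / len(ys)
    return w

# Step 1: initialize the global parameters randomly from the model configuration.
w_global = rng.normal(size=3)

for round_ in range(20):
    # Step 2: distribute the current global parameters to every worker node.
    # Step 3: each worker trains on its own portion of the dataset.
    local_params = [local_sgd(w_global, Xs, ys) for Xs, ys in shards]
    # Step 4: the mean of the worker parameters becomes the new global value.
    w_global = np.mean(local_params, axis=0)

print("learned:", np.round(w_global, 3), "true:", true_w)
```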
EASGD (Elastic Averaging SGD)
The process is similar to the parameter averaging method, except that the average of the parameters is not taken directly as the global parameter value; instead, each worker is allowed to fluctuate around the central (average) parameters.
The value of ρ controls how far the local models may drift from the center: the smaller ρ is, the larger the exploration range in the parameter space; the larger ρ is, the closer each worker stays to the central parameters.
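A minimal sketch of the (synchronous) EASGD update on the same kind of toy problem, assuming the standard formulation with learning rate η and elasticity ρ: each worker takes a gradient step plus an elastic pull toward the center, while the center is pulled toward the workers. All constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
true_w = np.array([0.8, -1.2, 2.0])
X = rng.normal(size=(600, 3))
y = X @ true_w + 0.01 * rng.normal(size=600)

num_workers = 4
shards = list(zip(np.array_split(X, num_workers), np.array_split(y, num_workers)))

lr = 0.05          # learning rate (eta)
rho = 1.0          # elasticity: larger rho keeps workers closer to the center
alpha = lr * rho   # strength of the elastic pull per update

w_center = np.zeros(3)
w_workers = [w_center.copy() for _ in range(num_workers)]

for step in range(300):
    for i, (Xs, ys) in enumerate(shards):
        grad = Xs.T @ (Xs @ w_workers[i] - ys) / len(ys)
        # Elastic difference between this worker and the center.
        elastic = alpha * (w_workers[i] - w_center)
        # Worker: gradient step minus the elastic pull (stay near the center).
        w_workers[i] = w_workers[i] - lr * grad - elastic
        # Center: pulled toward the worker by the same elastic difference.
        w_center = w_center + elastic

print("center:", np.round(w_center, 3), "true:", true_w)
```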
Asynchronous Data Parallel Method
In synchronous data parallelism, heavily loaded (slow) workers cause serious delays. To overcome this and further reduce training time, a simple approach is to remove the synchronization constraint. However, this introduces some effects that are not immediately obvious. The simplest concept is parameter staleness. Parameter staleness is the delay between the time a worker pulls the parameters from the central variable and the time it commits its update back to the central variable. Intuitively, it means a worker is updating the model with a gradient computed from an earlier version of the parameters. As shown in the figure, a worker pulls the parameters from the central variable at one point in time for local computation; after the local computation completes, it commits its gradient to the central variable at a later point in time, and the time difference between the two is the parameter staleness. Because other gradient updates may be committed in between, the worker's update is applied to parameters that have already moved on, which is problematic.
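The following toy sketch shows one way to make staleness concrete, assuming a server-side version counter: a worker's staleness is the number of central updates committed between its pull and its push. The class and method names are purely illustrative.

```python
import random

class ParameterServer:
    """Toy central variable with a version counter (no real networking)."""
    def __init__(self, params):
        self.params = params
        self.version = 0  # incremented on every committed update

    def pull(self):
        return self.params, self.version

    def push(self, update, pulled_version):
        # Staleness: how many updates were committed since this worker pulled.
        staleness = self.version - pulled_version
        self.params = [p + u for p, u in zip(self.params, update)]
        self.version += 1
        return staleness

server = ParameterServer([0.0, 0.0])
random.seed(0)

# Simulate interleaved workers: each pulls, "computes" for a while, then pushes.
pending = []  # (worker_id, version_at_pull) for workers that pulled but have not pushed
for worker_id in range(6):
    _, version = server.pull()
    pending.append((worker_id, version))
    # Sometimes a worker that pulled earlier finishes now and pushes, so later
    # pushes see a larger version gap, i.e. more staleness.
    if random.random() < 0.7:
        wid, v = pending.pop(random.randrange(len(pending)))
        s = server.push([0.01, 0.01], v)
        print(f"worker {wid} pushed with staleness {s}")

for wid, v in pending:
    s = server.push([0.01, 0.01], v)
    print(f"worker {wid} pushed with staleness {s}")
```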
Asynchronous EASGD
The asynchronous EASGD update scheme is very similar to the synchronous scheme, but there are some important details. In the following description, we use a vector to denote the elastic difference, i.e. the scaled difference between a worker's parameters and the central variable, ηρ(θ_worker − θ_center). In the synchronous version, this vector is what bounds the exploration of the parameter space: in the worker's update equation it enters with a negative sign, pulling the worker's parameters back toward the central variable. This is exactly what happens when the elastic difference is applied to a worker.
In the asynchronous version, the elastic difference serves the same purpose, but it is also used to update the central variable. As described above, the elastic difference restricts a worker's exploration of the parameter space. If we negate the elastic difference, it can instead be used to move the central variable toward the worker (the reverse arrow in the figure), while still retaining the communication constraint; this is exactly how asynchronous EASGD updates the central variable.
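A minimal sketch of a single asynchronous EASGD exchange, under the formulation described above: the elastic difference is subtracted from the worker and, with the opposite sign, added to the central variable. The function name and constants are illustrative.

```python
import numpy as np

def easgd_exchange(w_worker, w_center, lr=0.05, rho=0.5):
    """One asynchronous EASGD communication round between a worker and the center.

    The elastic difference is subtracted from the worker (pulling it toward the
    central variable) and, with its sign flipped relative to the worker update,
    added to the center (pulling the center toward the worker). Each worker can
    call this independently, with no synchronization barrier.
    """
    elastic = lr * rho * (w_worker - w_center)
    return w_worker - elastic, w_center + elastic

# Tiny usage example with made-up parameter vectors.
w_worker = np.array([1.0, -2.0, 0.5])
w_center = np.array([0.0, 0.0, 0.0])
w_worker, w_center = easgd_exchange(w_worker, w_center)
print("worker:", w_worker, "center:", w_center)
```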
Downpour
In Downpour, the worker communicates with the parameter server whenever it has computed a gradient (or a sequence of gradients). When the parameter server receives a gradient update from a worker, it merges the update into the central variable. Unlike EASGD, Downpour does not assume any communication restrictions. More importantly, if a worker does not communicate frequently with the parameter server (which keeps the worker variance small), Downpour will not converge (this is also related to the implicit momentum caused by asynchrony). This is the same issue as discussed above: if we allow workers to explore "too much" of the parameter space, the workers will not jointly find a good solution for the central variable. In addition, Downpour has no built-in mechanism to keep the workers close to the central variable. Therefore, if the size of the communication window is increased, the magnitude of the gradient sent to the parameter server increases proportionally. To keep the workers' variance in the parameter space small, the central variable has to be updated frequently.
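A minimal sketch of a Downpour-style worker loop on a toy problem: gradients are accumulated over a communication window and pushed to the central variable without normalization, so the size of a push grows with the window. The shard count, window size, and learning rate are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
true_w = np.array([2.0, 0.5, -1.0])
X = rng.normal(size=(600, 3))
y = X @ true_w + 0.01 * rng.normal(size=600)

num_workers = 4
shards = list(zip(np.array_split(X, num_workers), np.array_split(y, num_workers)))

lr = 0.05
window = 4  # communication window: local gradients accumulated before a push

w_center = np.zeros(3)
w_workers = [w_center.copy() for _ in range(num_workers)]
accum = [np.zeros(3) for _ in range(num_workers)]
counts = [0] * num_workers

for step in range(600):
    i = int(rng.integers(num_workers))        # workers progress asynchronously
    Xs, ys = shards[i]
    grad = Xs.T @ (Xs @ w_workers[i] - ys) / len(ys)
    w_workers[i] -= lr * grad                 # local step
    accum[i] += grad                          # accumulate the gradient sequence
    counts[i] += 1
    if counts[i] == window:
        # Push the accumulated (un-normalized) gradient to the parameter server,
        # then pull the fresh central parameters back to the worker.
        w_center -= lr * accum[i]
        w_workers[i] = w_center.copy()
        accum[i] = np.zeros(3)
        counts[i] = 0

print("center:", np.round(w_center, 3), "true:", true_w)
```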
ADAG
We have noticed that a large communication window is associated with degraded model performance. With some models (such as Downpour, shown above), we observe that this effect can be mitigated by normalizing the accumulated gradient by the size of the communication window. This has several additional positive effects. For example, we do not normalize by the number of parallel workers, so the benefit of parallel gradient descent (convergence speed) is not lost. A side effect is that the variance each worker induces on the central variable also becomes smaller, which in turn benefits the central objective. Furthermore, because of the normalization, the model is much less sensitive to hyperparameters, in particular the size of the communication window. In other words, a large communication window usually hurts model performance because it lets workers explore more of the parameter space based only on the samples in their data shards, and normalization counteracts this. As a first model, we adapted Downpour to this idea and observed the following results. First, model performance improves significantly and becomes comparable even to sequential optimizers such as Adam. Second, we can make the communication window three times larger than in plain Downpour, so CPU resources are used more effectively and the total training time is further reduced. Finally, normalizing the accumulated gradient allows us to increase the communication window far enough to match the training time of EASGD and achieve roughly the same (sometimes better, sometimes worse) results.
In short, the core idea of ADAG, the asynchronous distributed adaptive gradient algorithm, can be combined with any distributed optimization scheme. Based on our observations and intuition (especially the implicit momentum caused by asynchrony), we conjecture that normalizing the accumulated gradient will help any distributed optimization scheme.
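As a small illustration of the normalization idea (not the authors' exact implementation), the sketch below contrasts a plain Downpour-style push with an ADAG-style push in which the accumulated gradient is divided by the communication window before being applied to the central variable.

```python
import numpy as np

def push_update(w_center, accum_grad, window, lr=0.05, normalize=True):
    """Apply one worker's accumulated gradient to the central variable.

    With normalize=True this is the ADAG-style push: the accumulated gradient
    is divided by the communication window, so the size of a push does not
    grow with the window. With normalize=False it is the plain Downpour push.
    """
    scale = window if normalize else 1
    return w_center - lr * accum_grad / scale

# Tiny usage example: the same accumulated gradient pushed both ways.
w_center = np.zeros(3)
accum_grad = np.array([4.0, -2.0, 8.0])   # e.g. the sum of 4 local gradients
print("Downpour push:", push_update(w_center, accum_grad, window=4, normalize=False))
print("ADAG push:    ", push_update(w_center, accum_grad, window=4, normalize=True))
```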