Paper Reading 4: Massively Parallel Methods for Deep Reinforcement Learning

Source: Internet
Author: User

Source: ICML Deep Learning Workshop, Google DeepMind

Innovation point: building the first large-scale distributed architecture for deep reinforcement learning.

The architecture consists of four parts:

    • Parallel actors: generate new behavior
    • Parallel learners: train from stored experience
    • Distributed neural network: represents the value function or policy
    • Distributed experience store (replay memory)
Experimental results:

Applying DQN to this architecture, the distributed agent beats non-distributed DQN on 41 of the 49 games tested, and it also needs less training time.

    • Better training results
    • Shorter training time
    • The main disadvantages of large-scale parallelism are high energy consumption, high cost, and a high barrier to entry. Most people cannot reproduce this setup.
Where to improve:

The main room for improvement is at the algorithm level. For example, DQN trains on only one game at a time; could training on several games at once speed things up?

Detailed Analysis

Introduction

The introduction opens with fairly standard background: deep learning has made great progress in vision and speech, which is attributed to its ability to automatically extract high-level features. Reinforcement learning has now been combined successfully with deep learning in the form of DQN, which achieved a breakthrough on Atari games.
Then comes the problem, which is the motivation for the paper.
The original DQN was trained on a single machine and took a long time to train, so the goal is to build a distributed architecture that exploits available computing resources to speed up training.

Deep learning itself is relatively easy to parallelize, and GPU acceleration is already standard. The main research question there is how to use large-scale parallel computing to train on huge amounts of data and so get better results. Both Google and Baidu are working on this.

The opportunity: nobody has yet built such a system for reinforcement learning, so parallelizing DQN is clearly worth doing!

One characteristic of reinforcement learning is that the agent interacts with the environment, so the data distribution it sees keeps changing. A natural idea follows: run multiple agents in parallel and have each agent store its experience in an experience pool (replay memory), which gives a distributed experience pool. This greatly increases the total capacity of the pool! Different agents can also follow different policies to gather different experience. It is exciting to think about!
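As a rough illustration of this idea, the sketch below has several actors with different exploration rates write into one shared replay memory. It is a toy under stated assumptions, not the paper's implementation: the chain environment, the `ReplayMemory` class, the `run_actor` function, and the epsilon values are all hypothetical, and the actors run one after another rather than truly in parallel.

```python
import random
from collections import deque

class ReplayMemory:
    """Bounded FIFO store of transitions (hypothetical helper)."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def add(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size, rng):
        return rng.sample(list(self.buffer), min(batch_size, len(self.buffer)))

    def __len__(self):
        return len(self.buffer)

def run_actor(actor_id, epsilon, steps, memory, rng):
    """Toy actor on a 1-D chain: states 0..9, actions -1 or +1, reward 1 at state 9."""
    state = 0
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.choice([-1, 1])      # explore
        else:
            action = 1                        # stand-in for the greedy Q-network action
        next_state = max(0, min(9, state + action))
        reward = 1.0 if next_state == 9 else 0.0
        memory.add((actor_id, state, action, reward, next_state))
        state = 0 if next_state == 9 else next_state   # restart after reaching the goal

rng = random.Random(0)
memory = ReplayMemory(capacity=1000)
# Three actors with different exploration rates, i.e. different behavior policies.
for actor_id, eps in enumerate([0.1, 0.5, 0.9]):
    run_actor(actor_id, eps, steps=100, memory=memory, rng=rng)

batch = memory.sample(32, rng)    # a learner would train on batches like this one
print(len(memory), len(batch))
```

Because every actor writes to the same memory, a sampled batch mixes experience gathered under different policies, which is the effect described above.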

The next step is distributed learners (many of them) that learn from the experience pools and update the network parameters. So the parameters themselves are distributed too. The question then is: how are the parameters merged in the end?

Google DeepMind's answer is Gorila (General Reinforcement Learning Architecture), a general-purpose reinforcement learning architecture. Faster and stronger!

Personal thoughts: research of any kind proceeds step by step. For someone who was previously a layperson, entered this field purely out of interest, has no mentor, and has to explore alone, what should one do to catch up, find a good entry point, and produce good results? Of course there is a lot to do in DRL, since it can be combined with almost any decision-making or control problem. But we want to make it general!!

Related Work

This part surveys earlier work on parallel RL, and of course there is some. The paper introduces the following:

    • Distributed multi-agent systems: multiple agents act together in one environment and cooperate to achieve a common goal, so these algorithms focus on effective teamwork and overall behavior.
    • Concurrent reinforcement learning: a single agent acts in multiple distributed environments at once (parallel universes).

The method in this paper differs from both: it only uses parallel computation to make the single-agent problem more efficient.

    • MapReduce framework: I am not familiar with this work (see the paper for details); it is mainly confined to linear function approximation.

The closest work is the following:
    • Parallel Sarsa algorithm: each machine has its own agent and environment and runs a simple linear Sarsa algorithm. The machines periodically communicate their parameters to each other, focusing on the parameters that have changed the most.
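The "communicate the parameters that changed the most" step can be sketched as follows. This is only a guess at the general shape: the functions `top_changed` and `broadcast` are hypothetical names, and the merge rule (the receiver averages the incoming value with its own) is an assumption, since the text above does not say how received parameters are combined.

```python
def top_changed(weights, last_sent, k):
    """Indices of the k weights that moved most since the last broadcast."""
    deltas = [abs(w - s) for w, s in zip(weights, last_sent)]
    return sorted(range(len(weights)), key=lambda i: deltas[i], reverse=True)[:k]

def broadcast(sender, receivers, last_sent, k):
    """Send the k most-changed weights to every receiver.

    Assumed merge rule: the receiver averages the incoming value with its own.
    """
    for i in top_changed(sender, last_sent, k):
        for r in receivers:
            r[i] = 0.5 * (r[i] + sender[i])
        last_sent[i] = sender[i]      # remember what was communicated

agent_a = [0.0, 2.0, 0.1, -1.0]       # linear Sarsa weights on machine A
agent_b = [0.0, 0.0, 0.0, 0.0]        # weights on machine B
sent_a = [0.0, 0.0, 0.0, 0.0]         # what A communicated last time

broadcast(agent_a, [agent_b], sent_a, k=2)
print(agent_b)   # [0.0, 1.0, 0.0, -0.5]
```

Only indices 1 and 3 are sent, because they changed the most; the small change at index 2 stays local until a later round.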

The method in this paper instead uses client-server communication and separates acting, learning, and parameter updating into three parts. Most importantly, of course, it applies deep learning.


Google's distributed deep learning system parallelizes training in two ways:
1. Model parallelism: different machines train different parts of the model.
2. Data parallelism: replicas of the same model are trained on different parts of the data.
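To make the two kinds of parallelism concrete, here is a minimal sketch on a linear model. Everything in it (the model, the data, the helpers `dot` and `grad_mse`) is a made-up illustration, not code from Google's system, and the "machines" are simulated in a single process.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def grad_mse(w, xs, ys):
    """Gradient of the mean squared error of a linear model y = w . x."""
    n = len(xs)
    g = [0.0] * len(w)
    for x, y in zip(xs, ys):
        err = dot(w, x) - y
        for j, xj in enumerate(x):
            g[j] += 2.0 * err * xj / n
    return g

w = [0.5, -1.0]
xs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
ys = [1.0, 2.0, 3.0, 4.0]

# Data parallelism: every worker holds the full model and half of the data,
# then the per-worker gradients are averaged.
shards = [(xs[:2], ys[:2]), (xs[2:], ys[2:])]
worker_grads = [grad_mse(w, sx, sy) for sx, sy in shards]
avg_grad = [sum(g[j] for g in worker_grads) / len(worker_grads)
            for j in range(len(w))]

# Model parallelism: every machine holds one slice of the weights and
# computes a partial output; the partial outputs are summed.
x = xs[0]
partial_outputs = [dot(w[:1], x[:1]), dot(w[1:], x[1:])]
y_hat = sum(partial_outputs)

full_grad = grad_mse(w, xs, ys)   # single-machine reference
print(avg_grad, full_grad, y_hat)
```

With equally sized shards, the averaged gradient matches the single-machine gradient, which is why data parallelism can speed up training without changing the update itself.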


Google's distributed system has been analyzed in a previous article, so it is not covered again here.

Gorila (General Reinforcement Learning Architecture)

The paper's structure diagram divides Gorila into the following parts:

    • Actors: select and execute actions. Gorila has N actor processes, and each actor generates its own trajectory in its own copy of the same environment, so the states each one visits differ. Each actor holds a replica of the Q-network that it uses to choose actions, and the replica's parameters are periodically synchronized from the parameter server.
    • Experience replay memory (the experience pool): stores each actor's state-action sequences. The total size of the replay memory can be adjusted as needed.
    • Learners: Gorila contains N learner processes, each with its own replica of the Q-network. Each learner samples data from the experience pool and applies an off-policy reinforcement learning algorithm such as DQN. The resulting gradients are sent to the parameter server, which updates the parameters.
    • Parameter server: the Q-network parameters are split across N machines, and each machine applies gradients only to the subset of parameters it holds. Updates use asynchronous stochastic gradient descent.
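The four parts above can be put together in a small single-process simulation. This is a sketch under heavy assumptions, not Gorila itself: a tabular Q-function stands in for the Q-network, the 4-state chain environment is invented, actors and learners take turns instead of running as separate processes, and all names (`ParameterServer`, `Actor`, `learner_gradient`, the shard layout, the hyperparameters) are hypothetical.

```python
import random

NUM_STATES, NUM_ACTIONS, GAMMA = 4, 2, 0.9    # toy chain; the last state is terminal

def idx(state, action):
    return state * NUM_ACTIONS + action

def env_step(state, action):
    """Action 1 moves right, action 0 moves left; reward 1 on reaching the last state."""
    nxt = min(NUM_STATES - 1, state + 1) if action == 1 else max(0, state - 1)
    done = nxt == NUM_STATES - 1
    return nxt, (1.0 if done else 0.0), done

class ParameterServer:
    """Q parameters split into disjoint shards, one per simulated machine."""
    def __init__(self, num_params, num_shards, lr):
        self.lr = lr
        self.owner = [i % num_shards for i in range(num_params)]
        self.shards = [{} for _ in range(num_shards)]
        for i, s in enumerate(self.owner):
            self.shards[s][i] = 0.0

    def apply_gradient(self, grad):
        # Each shard applies only the slice of the gradient it owns.
        for i, g in enumerate(grad):
            self.shards[self.owner[i]][i] -= self.lr * g

    def get_params(self):
        return [self.shards[s][i] for i, s in enumerate(self.owner)]

class Actor:
    def __init__(self, epsilon, sync_every, rng):
        self.epsilon, self.sync_every, self.rng = epsilon, sync_every, rng
        self.state, self.steps, self.q = 0, 0, None

    def act(self, server, replay):
        if self.steps % self.sync_every == 0:     # periodic sync of the local Q copy
            self.q = server.get_params()
        self.steps += 1
        if self.rng.random() < self.epsilon:
            action = self.rng.randrange(NUM_ACTIONS)
        else:
            action = max(range(NUM_ACTIONS), key=lambda a: self.q[idx(self.state, a)])
        nxt, reward, done = env_step(self.state, action)
        replay.append((self.state, action, reward, nxt, done))
        self.state = 0 if done else nxt

def learner_gradient(q, batch):
    """Gradient of the batch mean of 0.5 * (target - Q(s, a))^2 with a Q-learning target."""
    grad = [0.0] * len(q)
    for s, a, r, s2, done in batch:
        target = r if done else r + GAMMA * max(q[idx(s2, b)] for b in range(NUM_ACTIONS))
        grad[idx(s, a)] -= (target - q[idx(s, a)]) / len(batch)
    return grad

rng = random.Random(0)
server = ParameterServer(NUM_STATES * NUM_ACTIONS, num_shards=2, lr=0.5)
replay = []
actors = [Actor(eps, sync_every=10, rng=rng) for eps in (0.3, 0.7)]
for _ in range(2000):
    for actor in actors:
        actor.act(server, replay)
    for _learner in range(2):                     # two learners taking turns
        q_local = server.get_params()             # learner refreshes its copy
        batch = rng.sample(replay, min(8, len(replay)))
        server.apply_gradient(learner_gradient(q_local, batch))

q_final = server.get_params()
print([round(v, 2) for v in q_final])
```

The point of the sketch is the data flow: actors only read parameters and write experience, learners only read experience and send gradients, and the parameter server is the single place where parameters change.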
Gorila DQN

The description above looks long, but in effect it simply makes the algorithm distributed; DQN itself is unchanged. The main concern is how to coordinate parameter updates across machines so that the system stays stable.
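One possible safeguard of this kind is to reject gradients that were computed from parameters that are too old. The sketch below assumes a simple update counter as the version number; the `VersionedServer` class, its `pull`/`push` methods, and the `max_staleness` threshold are hypothetical names chosen for illustration, and the text above does not specify which mechanism Gorila actually uses.

```python
class VersionedServer:
    """Parameter store that counts updates and rejects gradients that are too stale."""
    def __init__(self, params, lr, max_staleness):
        self.params = list(params)
        self.lr = lr
        self.max_staleness = max_staleness
        self.version = 0

    def pull(self):
        """Return a copy of the parameters and the version they belong to."""
        return list(self.params), self.version

    def push(self, grad, grad_version):
        """Apply the gradient unless it was computed on parameters that are too old."""
        if self.version - grad_version > self.max_staleness:
            return False
        self.params = [p - self.lr * g for p, g in zip(self.params, grad)]
        self.version += 1
        return True

server = VersionedServer([1.0, 1.0], lr=0.1, max_staleness=1)
_, slow_version = server.pull()            # a slow learner pulls at version 0
for _ in range(3):                         # faster learners push three updates meanwhile
    _, v = server.pull()
    server.push([1.0, 1.0], v)

accepted = server.push([5.0, 5.0], slow_version)   # staleness is 3, above the threshold
print(accepted, server.version)   # False 3
```

Dropping the late gradient keeps one slow machine from pulling the parameters back toward an outdated state.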


Unsurprisingly, the distributed version does better than the single-machine version overall. Two things set it apart:

    • Training time is greatly reduced compared with a single machine.
    • Parameter updates work differently: each machine updates only part of the parameters, and the updates are asynchronous, which has some effect on the results. It also uses AdaGrad instead of RMSProp.
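For readers who want to see how the two optimizers differ, here are their textbook update rules for a single scalar parameter. This is a generic sketch: the function names, learning rate, decay, and the constant gradient are illustrative choices, and the exact RMSProp variant used by the original DQN may differ in details.

```python
import math

def adagrad_step(w, g, accum, lr=0.1, eps=1e-8):
    """AdaGrad: sum all past squared gradients, so the step size only shrinks."""
    accum = accum + g * g
    return w - lr * g / (math.sqrt(accum) + eps), accum

def rmsprop_step(w, g, avg, lr=0.1, decay=0.9, eps=1e-8):
    """RMSProp: exponential moving average of squared gradients, so old gradients fade."""
    avg = decay * avg + (1.0 - decay) * g * g
    return w - lr * g / (math.sqrt(avg) + eps), avg

w_a = w_r = 0.0
accum = avg = 0.0
for _ in range(100):                       # constant gradient of 1.0, for illustration only
    prev_a, prev_r = w_a, w_r
    w_a, accum = adagrad_step(w_a, 1.0, accum)
    w_r, avg = rmsprop_step(w_r, 1.0, avg)

last_adagrad_step = prev_a - w_a
last_rmsprop_step = prev_r - w_r
print(last_adagrad_step, last_rmsprop_step)
```

After 100 identical gradients, the AdaGrad step has shrunk to roughly 0.01 while the RMSProp step stays near 0.1, so the two optimizers can behave quite differently over a long training run.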

As a result, most games improve, but some get worse.

This also shows that different ways of exploring the game space can make a big difference to the result.

In fact, several DQN variants published since then have also failed to improve on every game. This question deserves further study.


Gorila is the first large-scale distributed deep reinforcement learning framework, though it is of course not open source. In Gorila, learning and acting run in parallel, using a distributed experience pool and a distributed neural network. The results show that with more computing power the algorithm's performance improves substantially, which suggests that deep learning has no visible upper limit yet as computing power and time increase.
