FeUdal Networks for Hierarchical Reinforcement Learning: reading notes


tags (space delimited): paper-notes reinforcement-learning algorithm

Contents: Abstract, Introduction, Model, Learning (Transition Policy Gradients), Architecture details, Dilated LSTM (not covered in these notes)

Abstract

This paper is mainly about improving and applying feudal reinforcement learning.
First, the form of feudal reinforcement learning:
1. It is mainly divided into two parts, a Manager model and a Worker model;
2. The Manager model plays the role of controlling which task the system should complete; in the paper the author encodes each task into an embedding (similar in spirit to word vectors in natural language processing);
3. The Worker model interacts with the environment (takes actions) to carry out a particular task;
4. Consequently, the paper says the Manager model's temporal resolution is very low, while the Worker model's temporal resolution is very high (a toy control loop after this list sketches this split);
5. The author mentions the concept of sub-policies; my understanding is that each task will have a different strategy;
6. Turning a task into an embedding makes it quick to take on a new task.
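
To make the low/high temporal resolution split concrete, here is a minimal, hypothetical control loop (the names manager, worker, env and the refresh interval c are my own placeholders, and the fixed every-c-steps goal refresh is a simplification of the paper's actual mechanism):

    # Toy sketch only: a Manager refreshes a goal every c ticks,
    # while the Worker picks a primitive action at every tick.
    c = 10  # Manager's (low) temporal resolution, in environment ticks

    def run_episode(env, manager, worker, max_steps=1000):
        obs = env.reset()
        goal = None
        for t in range(max_steps):
            if t % c == 0:                      # low temporal resolution: occasional goal updates
                goal = manager.set_goal(obs)    # hypothetical API
            action = worker.act(obs, goal)      # high temporal resolution: act every tick
            obs, reward, done, info = env.step(action)
            if done:
                break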

We introduce FeUdal Networks (FuNs): a novel architecture for hierarchical reinforcement learning. Our approach is inspired by the feudal reinforcement learning proposal of Dayan and Hinton, and gains power and efficacy by decoupling end-to-end learning across multiple levels, allowing it to utilise different resolutions of time. Our framework employs a Manager module and a Worker module. The Manager operates at a lower temporal resolution and sets abstract goals which are conveyed to and enacted by the Worker. The Worker generates primitive actions at every tick of the environment. The decoupled structure of FuN conveys several benefits: in addition to facilitating very long timescale credit assignment, it also encourages the emergence of sub-policies associated with different goals set by the Manager. These properties allow FuN to dramatically outperform a strong baseline agent on tasks that involve long-term credit assignment or memorisation. We demonstrate the performance of our proposed system on a range of tasks from the ATARI suite and also from a 3D DeepMind Lab environment.

Introduction

The author mentions several current difficulties in applying reinforcement learning:
1. Reinforcement learning has long had a (long-term) credit-assignment problem, which has traditionally been addressed with the Bellman equation; recently, some work has decomposed each chosen action into four consecutive actions;
2. The second difficulty is that reward feedback is sparse;

Building on previous work, the author proposes a network structure and training strategy for the two problems above:
1. A top-level, low-temporal-resolution Manager model and a low-level, high-temporal-resolution Worker model;
2. The Manager model learns a latent state (my understanding: it implies toward which goal the state should develop), and the Worker model receives the Manager model's signal and selects actions;
3. The Manager model's learning signal is not provided by the Worker model but only by the external environment; in other words, the external environment's reward is given to the Manager model;
4. The Worker model's learning signal is an intrinsic reward produced inside the system;
5. No gradients are propagated between the Manager model and the Worker model (a minimal sketch of this decoupling follows this list).
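
A minimal sketch of point 5, assuming a PyTorch-style autograd setup; the two Linear layers are stand-ins for the real Manager and Worker networks (not the authors' code), and detaching the goal is one simple way to realise "no gradients between the two modules":

    import torch

    manager = torch.nn.Linear(8, 4)      # hypothetical Manager head producing a goal vector
    worker = torch.nn.Linear(4 + 8, 3)   # hypothetical Worker head producing action logits

    state = torch.randn(1, 8)
    goal = manager(state)                                   # goal keeps a grad path to the Manager
    worker_in = torch.cat([goal.detach(), state], dim=-1)   # detach(): Worker gradients stop here
    logits = worker(worker_in)

    worker_loss = logits.sum()           # stand-in for the Worker's policy-gradient loss
    worker_loss.backward()
    print(manager.weight.grad)           # None: the Worker's loss did not touch the Manager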

The architecture explored in this work is a fully-differentiable neural network with two levels of hierarchy (though there are obvious generalisations to deeper hierarchies). The top level, the Manager, sets goals at a lower temporal resolution in a latent state-space that is itself learnt by the Manager. The lower level, the Worker, operates at a higher temporal resolution and produces primitive actions, conditioned on the goals it receives from the Manager. The Worker is motivated to follow the goals by an intrinsic reward. However, significantly, no gradients are propagated between Worker and Manager; the Manager receives its learning signal from the environment alone. In other words, the Manager learns to select latent goals that maximise extrinsic reward.

The author summarises the contributions of this paper:
1. Feudal reinforcement learning is generalised so that it can be used in many systems;
2. A new approach to training the Manager model is proposed (the transition policy gradient), which produces goals that carry some semantic meaning (I think of it as embedding the goal);
3. Traditionally the learning signal depends entirely on the external environment, but in this paper the external learning signal (reward) is used to train the Manager model, while the Worker model is trained with an internally generated signal;
4. The author also uses a new LSTM variant, the dilated LSTM, because the Manager model needs to remember state over long horizons, since its temporal resolution is relatively low (see the rough sketch after this list).
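
The following is my rough sketch of the dilated LSTM idea as I understand it, not the authors' implementation: r groups of recurrent state share one LSTM cell, and at step t only group t % r is updated, so each group effectively sees a time series dilated by a factor of r. Details such as how outputs are pooled across groups are omitted here.

    import torch

    class DilatedLSTM(torch.nn.Module):
        """Sketch: r state groups share one LSTMCell; only group t % r is updated at step t."""

        def __init__(self, input_size, hidden_size, r=10):
            super().__init__()
            self.r = r
            self.cell = torch.nn.LSTMCell(input_size, hidden_size)
            self.h = [torch.zeros(1, hidden_size) for _ in range(r)]
            self.c = [torch.zeros(1, hidden_size) for _ in range(r)]

        def forward(self, x, t):
            i = t % self.r                                        # group active at this tick
            self.h[i], self.c[i] = self.cell(x, (self.h[i], self.c[i]))
            return self.h[i]                                      # the paper also pools recent outputs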

The author compares his method with the policy-over-options approach (the options framework) presented in 2017.

A key difference between our approach and the options framework is that in our proposal the top level produces a meaningful and explicit goal for the bottom level to achieve. Sub-goals emerge as directions in the latent state-space and are naturally diverse.
My understanding:
1. The Manager model sits at the top of the whole model and produces a guiding signal for the lower network (the Worker model);
2. The second point is that each big task contains many small tasks, and the reward at different stages of the task may differ, so the author argues that the many small tasks under a large task lead to diversity in the embeddings, somewhat similar to the idea of the model in [1].

The following are the model diagram and the specific calculation formulas:

Here $h^M$ and $h^W$ correspond to the internal states of the Manager and the Worker respectively. A linear transform $\phi$ maps a goal $g_t$ into an embedding vector $w_t \in \mathbb{R}^k$, which is then combined via a matrix product with the matrix $U_t$ (the Worker's output) to produce the policy $\pi$, a vector of probabilities over primitive actions.
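
Since the original figure with the formulas is not reproduced here, the following is my reconstruction of the forward pass (Formulas 1-6 in the paper), using the notation of the caption above; treat it as a sketch rather than an exact transcription:

\begin{align}
z_t &= f^{\mathrm{percept}}(x_t) \\
s_t &= f^{\mathrm{Mspace}}(z_t) \\
h^M_t,\ \hat{g}_t &= f^{\mathrm{Mrnn}}(s_t, h^M_{t-1}), \qquad g_t = \hat{g}_t / \lVert \hat{g}_t \rVert \\
w_t &= \phi\Big(\sum_{i=t-c}^{t} g_i\Big) \\
h^W_t,\ U_t &= f^{\mathrm{Wrnn}}(z_t, h^W_{t-1}) \\
\pi_t &= \mathrm{SoftMax}(U_t\, w_t)
\end{align}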

Description:
1. $f^{\mathrm{percept}}$ is a feature-extraction layer;
2. $f^{\mathrm{Mspace}}$ does not change the dimension; there are two possibilities, an L2 normalisation or a fully connected layer;
3. $\phi$ is a fully connected layer with no bias;
4. $w_t$ is the so-called goal embedding;
5. From Formula 6 we can see that in the end the Worker model outputs the probability of each action.

Learning

In this section, the author describes how the weights of the system are updated:
1. The convolutional part of the network (the feature-extraction layer) has two update channels: the first is the policy gradient and the second is TD-learning, corresponding to the Worker model and the Manager model respectively;
2. The author briefly explains that if gradients were propagated between the Worker model and the Manager model during training, the Manager model might lose some of its semantic information, so $g_t$ is treated as hidden information internal to the system;
3. The Manager model's update is driven by a value-based (TD) advantage, while the Worker model is updated with a policy-based gradient;
4. As for the learning signals (rewards), the Manager model's learning signal is the sparse signal from the environment, while the Worker model's learning signal is produced by the Manager.





Formula 7: the Manager model's gradient (loss function), where $A^M_t$ is the Manager model's TD-error (advantage). Formula 8: the Worker model's reward, computed internally (the intrinsic reward). Formula 9: the Worker model's loss function, where $A^D_t$ is the Worker model's TD-error (advantage).
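
For reference, my reconstruction of Formulas 7-9 from the paper (a sketch from memory, where $d_{\cos}$ is the cosine similarity, $c$ the Manager's horizon, and $\alpha$ weights the intrinsic reward):

\begin{align}
\nabla g_t &= A^M_t \,\nabla_\theta\, d_{\cos}\!\big(s_{t+c}-s_t,\ g_t(\theta)\big), \qquad A^M_t = R_t - V^M_t(x_t,\theta) \tag{7}\\
r^I_t &= \frac{1}{c}\sum_{i=1}^{c} d_{\cos}\!\big(s_t - s_{t-i},\ g_{t-i}\big) \tag{8}\\
\nabla \pi_t &= A^D_t \,\nabla_\theta \log \pi(a_t \mid x_t;\theta), \qquad A^D_t = R_t + \alpha R^I_t - V^D_t(x_t;\theta) \tag{9}
\end{align}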

The corresponding sentence in the paper:
