David Silver Reinforcement Learning 8: Integrating Learning and Planning

First, model-based RL

Model-free RL: learn value functions (and/or policies) directly from experience.

Model-based RL: first learn an MDP model of the environment directly from experience (the state-transition probabilities P and the reward function R), then plan the value function (and/or policy) from that model. This can make learning more sample-efficient and allows us to reason about model uncertainty, but the disadvantage is that it introduces two sources of error: one from learning the model and one from planning with it.

An important assumption here is that R and P are conditionally independent: given the state and action at one moment, (s, a), the next reward r ∼ R and the next state s′ ∼ P are generated independently of each other. The first step, learning the model from experience, is then two supervised learning problems:

Regression problem: s, a → r

Classification problem: s, a → s′

For the model's P and R, Gaussian process models, linear Gaussian models, and neural network models are all viable choices.
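As a concrete illustration of these two supervised problems, here is a minimal sketch of the simplest option, a table-lookup model, which estimates R(s, a) as the mean observed reward and P(s′ | s, a) from empirical visit counts. The class name and interface below are my own, purely for illustration.

```python
import numpy as np
from collections import defaultdict

class TableLookupModel:
    """Illustrative table-lookup model: R(s,a) as a running mean,
    P(s'|s,a) from empirical transition counts."""

    def __init__(self):
        self.transition_counts = defaultdict(lambda: defaultdict(int))  # (s, a) -> {s': count}
        self.reward_sum = defaultdict(float)                            # (s, a) -> sum of rewards
        self.visit_count = defaultdict(int)                             # (s, a) -> N(s, a)

    def update(self, s, a, r, s_next):
        """One supervised 'training' step from a real transition (s, a, r, s')."""
        self.transition_counts[(s, a)][s_next] += 1
        self.reward_sum[(s, a)] += r
        self.visit_count[(s, a)] += 1

    def reward(self, s, a):
        """Regression target: expected immediate reward R(s, a)."""
        return self.reward_sum[(s, a)] / max(self.visit_count[(s, a)], 1)

    def sample_next_state(self, s, a):
        """Classification/density target: sample s' ~ P(.|s, a) from empirical counts.
        Assumes (s, a) has been observed at least once."""
        counts = self.transition_counts[(s, a)]
        states = list(counts.keys())
        probs = np.array([counts[sp] for sp in states], dtype=float)
        probs /= probs.sum()
        return states[np.random.choice(len(states), p=probs)]
```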

The second step is to plan with the learned model. We have value iteration, policy iteration, tree-search methods, and more. Alternatively, we can sample experience directly from the learned model and plan with the model-free methods from the previous lectures, such as Q-learning, Sarsa, or Monte Carlo control; a sketch of this sample-based planning follows.
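A rough sketch of sample-based planning under the same assumptions: it reuses the hypothetical TableLookupModel above as a simulator and runs tabular Q-learning on the experience it generates. The function name and hyperparameters are mine, not from the lecture.

```python
import numpy as np
from collections import defaultdict

def sample_based_planning(model, actions, gamma=0.9, alpha=0.1, n_updates=1000):
    """Plan by running Q-learning on experience sampled from a learned model.
    Assumes the model has already seen some real transitions."""
    q = defaultdict(float)
    # Only plan over (s, a) pairs the model has actually observed.
    seen = [sa for sa, count in model.visit_count.items() if count > 0]
    for _ in range(n_updates):
        s, a = seen[np.random.randint(len(seen))]       # revisit a previously observed (s, a)
        r = model.reward(s, a)                          # model's expected reward
        s_next = model.sample_next_state(s, a)          # s' ~ estimated P(.|s, a)
        best_next = max(q[(s_next, b)] for b in actions)
        q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])   # Q-learning update
    return q
```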

Second, integrated architectures

Dyna: learn and plan the value function (or policy) from both real experience and simulated experience, where the latter is sampled from the (imprecise) MDP model we have learned.

In terms of the algorithm, at each step we use the real environment's sample to learn Q and to learn a model, and then use samples produced by the model to update Q a further n times; see the sketch below.
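A minimal Dyna-Q sketch of that loop, assuming a gym-style environment with the classic reset()/step() 4-tuple interface, a discrete action space, and a deterministic table model; every name and hyperparameter here is illustrative.

```python
import numpy as np
from collections import defaultdict

def dyna_q(env, n_episodes=100, n_planning=10, alpha=0.1, gamma=0.95, epsilon=0.1):
    """Dyna-Q sketch: at each real step, (a) direct RL update of Q,
    (b) model learning, (c) n_planning simulated updates from the model."""
    q = defaultdict(float)
    model = {}                                   # (s, a) -> (r, s'): deterministic table model
    actions = list(range(env.action_space.n))    # assumes a discrete action space

    def epsilon_greedy(s):
        if np.random.rand() < epsilon:
            return np.random.choice(actions)
        return max(actions, key=lambda a: q[(s, a)])

    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            a = epsilon_greedy(s)
            s_next, r, done, _ = env.step(a)
            # (a) direct RL: learn Q from the real transition
            target = r + gamma * (0 if done else max(q[(s_next, b)] for b in actions))
            q[(s, a)] += alpha * (target - q[(s, a)])
            # (b) model learning: remember what the environment did
            model[(s, a)] = (r, s_next)
            # (c) planning: n simulated Q-learning updates using the model
            for _ in range(n_planning):
                (ps, pa), (pr, ps_next) = list(model.items())[np.random.randint(len(model))]
                p_target = pr + gamma * max(q[(ps_next, b)] for b in actions)
                q[(ps, pa)] += alpha * (p_target - q[(ps, pa)])
            s = s_next
    return q
```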

Third, simulation-based search

Focus on the current state: use a forward-search algorithm to build a search tree with the current state s_t as its root.

Simulation-based search: starting from the current state, use our model to simulate K episodes, then apply model-free methods to this simulated experience for learning and planning.

The policy used during simulation: if the required state and action are already contained in the constructed tree, maximise Q; otherwise select an action at random (exploration). A sketch follows.
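A rough Monte-Carlo search sketch along those lines, where model_step(s, a) → (r, s′, done) is an assumed simulator interface and all names are illustrative: inside the tree it acts greedily with respect to Q, outside the tree it acts randomly, and it backs up Monte-Carlo returns for the visited state-action pairs.

```python
import numpy as np
from collections import defaultdict

def simulation_based_search(root, model_step, actions, k_episodes=100,
                            horizon=50, gamma=1.0):
    """Simulate k_episodes from the current state `root` using the model,
    learn Q by Monte-Carlo backups, and return the greedy root action."""
    q = defaultdict(float)
    n = defaultdict(int)
    in_tree = {root}

    for _ in range(k_episodes):
        s, done, trajectory = root, False, []
        for _ in range(horizon):
            if done:
                break
            if s in in_tree:
                a = max(actions, key=lambda b: q[(s, b)])    # tree policy: maximise Q
            else:
                a = actions[np.random.choice(len(actions))]  # default policy: random exploration
                in_tree.add(s)                               # expand the tree (simplified)
            r, s_next, done = model_step(s, a)
            trajectory.append((s, a, r))
            s = s_next
        # Monte-Carlo backup: move Q(s, a) towards the return that followed it
        g = 0.0
        for s, a, r in reversed(trajectory):
            g = r + gamma * g
            n[(s, a)] += 1
            q[(s, a)] += (g - q[(s, a)]) / n[(s, a)]

    # real action: act greedily at the root
    return max(actions, key=lambda a: q[(root, a)])
```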

Dyna-2: use real experience to learn a long-term memory, and simulated experience to learn a short-term (search) memory.

Original address: Http://cairohy.github.io/2017/09/11/deeplearning/%E3%80%8ADavid%20Silver%E5%BC%BA%E5%8C%96%E5%AD%A6%E4%B9%a0%e5%85%ac%e5%bc%80%e8%af%be%e3%80%8b-8%ef%bc%9aintegrating%20learning%20and%20planning/
