First, Model-Based RL
Model-free RL: learn value functions (and/or policies) directly from experience.
Model-based RL: learn the MDP model of the environment (the state-transition probabilities P and the reward function R) directly from experience, then plan the value function (and/or policy) from that model. Learning this way can be more efficient and lets us reason about the model's uncertainty, but the disadvantage is that it introduces two sources of error (model learning and planning).
Here there is an important assumption: R and P are conditionally independent given the state and action, i.e., from a state-action pair $(s, a)$ at one moment, the next reward $r \sim \mathcal{R}(\cdot \mid s, a)$ and the next state $s' \sim \mathcal{P}(\cdot \mid s, a)$ are drawn independently. So the first step, learning the model from experience, decomposes into two supervised learning problems (a minimal sketch follows the list):
Regression problem: $s, a \rightarrow r$
Classification problem: $s, a \rightarrow s'$
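For small discrete problems, both of these can be solved with a table-lookup model that simply counts observed transitions. A minimal sketch, assuming integer-indexed states and actions and a `transitions` list of `(s, a, r, s')` tuples (all names here are illustrative, not from the lecture):

```python
import numpy as np

def fit_table_lookup_model(transitions, n_states, n_actions):
    """Estimate P_hat(s'|s,a) and R_hat(s,a) by counting observed transitions."""
    counts = np.zeros((n_states, n_actions, n_states))   # N(s, a, s')
    reward_sum = np.zeros((n_states, n_actions))
    for s, a, r, s_next in transitions:
        counts[s, a, s_next] += 1
        reward_sum[s, a] += r
    n_sa = counts.sum(axis=2)                            # visit counts N(s, a)
    with np.errstate(invalid="ignore"):                  # unvisited (s, a) give 0/0
        P = counts / n_sa[:, :, None]                    # P_hat(s'|s,a) = N(s,a,s') / N(s,a)
        R = reward_sum / n_sa                            # R_hat(s,a) = mean observed reward
    return np.nan_to_num(P), np.nan_to_num(R)
```

Here the regression problem is solved by the empirical mean reward and the transition problem by empirical frequencies; for large or continuous spaces this is where the parametric models mentioned next come in.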
For the model's P and R, Gaussian-process models, linear Gaussian models, and neural-network models are all options. The second step is to use the learned model for planning.
We can use value iteration, policy iteration, tree-search methods, and more. Alternatively, we can sample directly from the learned model and apply the model-free methods from the previous lectures, such as Q-learning, Sarsa, and Monte-Carlo control, to the sampled experience (sample-based planning).
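A hedged sketch of sample-based planning, reusing the `P`, `R` tables from the sketch above and assuming every state-action pair has been visited so each `P[s, a]` is a valid distribution:

```python
import numpy as np

def sample_based_planning(P, R, n_updates=10_000, gamma=0.9, alpha=0.1, seed=0):
    """Q-learning on transitions sampled from the learned model (P_hat, R_hat)."""
    rng = np.random.default_rng(seed)
    n_states, n_actions = R.shape
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_updates):
        s = rng.integers(n_states)                 # pick a state-action pair to back up
        a = rng.integers(n_actions)
        s_next = rng.choice(n_states, p=P[s, a])   # sample s' ~ P_hat(.|s, a)
        # standard Q-learning update on the simulated transition
        Q[s, a] += alpha * (R[s, a] + gamma * Q[s_next].max() - Q[s, a])
    return Q
```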
Second, Integrated Architectures

Dyna: learn and plan the value function (and/or policy) from both real experience and simulated experience, where the latter consists of samples produced by the learned (imprecise) MDP model.
In terms of the algorithm's flow: at each step, use the sample from the real environment to update Q and to update the model, then use n samples produced by the model to update Q another n times, as in the sketch below.
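A minimal Dyna-Q sketch of that per-step loop; the `env` interface (`reset`, `step`, `actions`) and the deterministic dictionary model are assumptions for illustration:

```python
import random
from collections import defaultdict

def dyna_q(env, n_steps, n_planning=5, gamma=0.95, alpha=0.1, eps=0.1):
    """Dyna-Q: each real step updates Q and the model, then replays n simulated steps."""
    Q = defaultdict(float)            # Q[(s, a)]
    model = {}                        # deterministic model: (s, a) -> (r, s')
    s = env.reset()
    for _ in range(n_steps):
        # epsilon-greedy action in the real environment
        if random.random() < eps:
            a = random.choice(env.actions)
        else:
            a = max(env.actions, key=lambda a_: Q[(s, a_)])
        r, s_next = env.step(a)
        # (1) direct RL: Q-learning on the real transition
        best_next = max(Q[(s_next, a_)] for a_ in env.actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        # (2) model learning: remember the observed transition
        model[(s, a)] = (r, s_next)
        # (3) planning: n Q-learning updates on transitions sampled from the model
        for _ in range(n_planning):
            (ps, pa), (pr, ps_next) = random.choice(list(model.items()))
            best = max(Q[(ps_next, a_)] for a_ in env.actions)
            Q[(ps, pa)] += alpha * (pr + gamma * best - Q[(ps, pa)])
        s = s_next
    return Q
```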
Third, Simulation-Based Search

Forward search focuses on the current state: it builds a search tree with the current state $s_t$ as its root.
Simulation-based search: starting from the current state, use the model to simulate K episodes, then apply model-free methods to the simulated experience for learning and planning.
The policy used during simulation: if the required state is already contained in the constructed tree, choose the action that maximizes Q; otherwise select an action at random (exploration). A sketch follows.
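A sketch of that simulation policy inside a simple Monte-Carlo tree-search loop; the `model(s, a) -> (r, s_next, done)` interface, the episode length, and the expansion rule (one new state per episode) are assumptions for illustration:

```python
import random
from collections import defaultdict

def simulation_based_search(model, s_root, actions, n_episodes=1000,
                            max_depth=50, gamma=1.0):
    """MC tree search: greedy w.r.t. Q inside the tree, random outside it."""
    Q = defaultdict(float)       # Q[(s, a)], estimated as a running mean of returns
    N = defaultdict(int)         # visit counts
    tree = {s_root}              # states already expanded into the search tree
    for _ in range(n_episodes):
        s, path, rewards, new_state = s_root, [], [], None
        for _ in range(max_depth):
            if s in tree:        # tree policy: maximize current Q
                a = max(actions, key=lambda a_: Q[(s, a_)])
            else:                # default policy: random action (exploration)
                if new_state is None:
                    new_state = s
                a = random.choice(actions)
            r, s_next, done = model(s, a)   # one simulated step from the learned model
            path.append((s, a))
            rewards.append(r)
            s = s_next
            if done:
                break
        if new_state is not None:
            tree.add(new_state)  # expand the tree by one state per episode
        G = 0.0                  # back up the discounted return along the path
        for (ps, pa), pr in zip(reversed(path), reversed(rewards)):
            G = pr + gamma * G
            N[(ps, pa)] += 1
            Q[(ps, pa)] += (G - Q[(ps, pa)]) / N[(ps, pa)]
    return max(actions, key=lambda a_: Q[(s_root, a_)])  # greedy action at the root
```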
Dyna-2: use real experience to learn long-term memory, and simulated experience to learn short-term memory.

Original address: http://cairohy.github.io/2017/09/11/deeplearning/%E3%80%8ADavid%20Silver%E5%BC%BA%E5%8C%96%E5%AD%A6%E4%B9%a0%e5%85%ac%e5%bc%80%e8%af%be%e3%80%8b-8%ef%bc%9aintegrating%20learning%20and%20planning/