Finite Markov Decision Processes in Reinforcement Learning


Thanks to Richard S. Sutton and Andrew G. Barto for their great work, Reinforcement Learning: An Introduction (2nd Edition).

Here we summarize some basic notions and formulations that appear in most reinforcement learning problems. This note does not include a detailed explanation of each notion; refer to the reference above if you want deeper insight.

  • Agent-Environment Interface
  • Goals and Rewards
  • Returns and Episodes
  • Policies and Value Functions
  • Dynamic Programming
  • Policy Evaluation (Prediction Problem)
  • Policy Improvement
  • Policy Iteration
  • Convergence Proof

Markov decision processes (MDPs) are a classical formalization of sequential decision making, where actions influence not just immediate rewards but also subsequent situations, or states, and through those the future rewards. MDPs are a mathematically idealized form of the reinforcement learning problem.

Agent-Environment Interface


The MDP is meant to be a straightforward framing of the problem of learning from interaction to achieve a goal. The learner and decision maker is called the agent. The thing it interacts with, comprising everything outside the agent, is called the environment. These interact continually, the agent selecting actions and the environment responding to these actions and presenting new situations (states) to the agent. The environment also gives rise to rewards, special numerical values that the agent seeks to maximize over time through its choice of actions.

More specifically, the agent and environment interact at each of a sequence of discrete time steps, $t = 0, 1, 2, 3, \dots$. At each time step $t$, the agent receives some representation of the environment's state, $S_t \in \mathcal{S}$, and on this basis selects an action, $A_t \in \mathcal{A}(s)$. One time step later, in part as a consequence of its action, the agent receives a numerical reward, $R_{t+1} \in \mathcal{R} \subset \mathbb{R}$, and finds itself in a new state, $S_{t+1}$. The MDP and agent together thereby give rise to a sequence, or trajectory, that begins like this:
$$S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, \dots$$
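
To make the interface concrete, here is a minimal sketch of the interaction loop in Python. The environment and policy are hypothetical stand-ins chosen only for illustration (a two-state environment with random transitions, an arbitrary reward rule, and a uniformly random policy); `env_step` and `random_policy` are not part of the book's formalism.

```python
import random

STATES = [0, 1]    # finite state set S (assumed for illustration)
ACTIONS = [0, 1]   # finite action set A(s), same for every state here

def env_step(state, action):
    """Hypothetical environment dynamics: return (reward, next_state)."""
    next_state = random.choice(STATES)               # random transition (assumption)
    reward = 1.0 if action == state else 0.0         # arbitrary reward rule (assumption)
    return reward, next_state

def random_policy(state):
    """A uniformly random policy over A(s)."""
    return random.choice(ACTIONS)

# Generate a trajectory S0, A0, R1, S1, A1, R2, ... as in the equation above.
state = random.choice(STATES)                        # S_0
trajectory = [state]
for t in range(3):
    action = random_policy(state)                    # A_t
    reward, state = env_step(state, action)          # R_{t+1}, S_{t+1}
    trajectory += [action, reward, state]

print(trajectory)   # e.g. [0, 1, 0.0, 1, 0, 0.0, 0]
```

Note how the indexing in the loop mirrors the trajectory: the reward $R_{t+1}$ and the next state $S_{t+1}$ arrive together, one time step after the action $A_t$ is taken.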
