Thanks to Richard S. Sutton and Andrew G. Barto for their great work, *Reinforcement Learning: An Introduction* (2nd edition).
Here we summarize some basic notions and formulations that appear in most reinforcement learning problems. This note does not include a detailed explanation of each notion; refer to the reference above if you want a deeper insight.
Agent-Environment Interface
Goals and Rewards
Returns and Episodes
Policies and Value Functions
Dynamic Programming
Policy Evaluation (Prediction Problem)
Policy Improvement
Policy Iteration
Convergence Proof
Markov decision processes (MDPs) are a classical formalization of sequential decision making, where actions influence not just immediate rewards but also subsequent situations, or states, and through those the future rewards. MDPs are a mathematically idealized form of the reinforcement learning problem.

Agent-Environment Interface
The MDP is meant to be a straightforward framing of the problem of learning from interaction to achieve a goal. The learner and decision maker is called the *agent*. The thing it interacts with, comprising everything outside the agent, is called the *environment*. These interact continually, the agent selecting *actions* and the environment responding to these actions and presenting new situations (*states*) to the agent. The environment also gives rise to *rewards*, special numerical values that the agent seeks to maximize over time through its choice of actions.
More specifically, the agent and environment interact at each of a sequence of discrete time steps, $t = 0, 1, 2, 3, \dots$. At each time step $t$, the agent receives some representation of the environment's *state*, $S_t \in \mathcal{S}$, and on this basis selects an *action*, $A_t \in \mathcal{A}(s)$. One time step later, in part as a consequence of its action, the agent receives a numerical *reward*, $R_{t+1} \in \mathcal{R} \subset \mathbb{R}$, and finds itself in a new state, $S_{t+1}$. The MDP and agent together thereby give rise to a sequence or *trajectory* that begins like this:
$$S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, \dots$$
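The interaction loop above can be sketched in code. The following is a minimal illustration, not from the book: the two-state MDP, its dynamics in `step`, and the random `policy` are all made-up assumptions, chosen only to show how the trajectory $S_0, A_0, R_1, S_1, \dots$ is generated.

```python
import random

# Hypothetical toy MDP (an assumption for illustration):
# two states "s0" and "s1"; action "a" in "s0" yields reward 1
# and moves to "s1", everything else yields reward 0.
ACTIONS = {"s0": ["a", "b"], "s1": ["a"]}

def step(state, action):
    """Environment dynamics: map (state, action) to (reward, next state)."""
    if state == "s0" and action == "a":
        return 1.0, "s1"
    if state == "s0":
        return 0.0, "s0"
    return 0.0, "s0"  # from "s1" the only action leads back to "s0"

def policy(state):
    """A uniformly random policy over the actions available in `state`."""
    return random.choice(ACTIONS[state])

def rollout(start_state, n_steps):
    """Generate the trajectory S0, A0, R1, S1, A1, R2, ..."""
    trajectory = [start_state]
    state = start_state
    for _ in range(n_steps):
        action = policy(state)
        reward, state = step(state, action)
        trajectory += [action, reward, state]
    return trajectory

traj = rollout("s0", 3)
```

Each pass through the loop appends one $(A_t, R_{t+1}, S_{t+1})$ triple, so a rollout of $n$ steps produces a list of $3n + 1$ elements with states at positions $0, 3, 6, \dots$.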