Strictly speaking, SARSA is TD(0) applied to state-action value estimation in its on-policy form, so its basic architecture differs little from the on-policy TD algorithm for estimating $v_{\pi}$, and it will not be elaborated separately here. In this post, two simple examples are used to apply the SARSA algorithm in practice and to summarize its process and basic structure.
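For reference, the parallel between the two updates can be written out explicitly; these are the standard TD(0) forms, with $\alpha$ as the step size and $\gamma$ as the discount factor:

$$V(S_t) \leftarrow V(S_t) + \alpha\bigl[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\bigr]$$

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\bigl[R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)\bigr]$$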
When the statistical methods of reinforcement learning (including Monte Carlo and TD) are implemented on episodic tasks, they all share, without exception, the same basic two-layer loop structure. If we regard each episode as a game, then a game has a beginning and an end, and the statistical method plays one game after another and then summarizes the optimal policy. The difference between Monte Carlo and TD is that Monte Carlo sums up once a game is finished, while TD summarizes as it plays. So in the two-layer basic structure, the outer loop runs over the number of games played, and the inner loop runs over the process of a single game, as sketched below.
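As a minimal sketch, assuming a generic environment interface (`env.reset`, `env.step`), a `policy` function, and an arbitrary episode count, none of which appear in the original text, the two-layer structure might look like this:

```python
NUM_EPISODES = 500                        # outer loop: how many "games" to play (assumed value)

for episode in range(NUM_EPISODES):       # outer loop: one iteration per game
    state = env.reset()                   # a game has a beginning...
    done = False
    trajectory = []                       # only needed by Monte Carlo
    while not done:                       # inner loop: the process of one game
        action = policy(state)
        next_state, reward, done = env.step(action)
        # Monte Carlo: just record the step here and summarize after the game ends.
        trajectory.append((state, action, reward))
        # TD: would instead update its value estimate right here, "summarizing while playing".
        state = next_state
    # Monte Carlo performs its update here, once the game (episode) is over.
```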
As the on-policy control algorithm in the TD family, SARSA updates the action-value function and the policy while the game is still being played, so the inner loop of the SARSA algorithm can be refined from the TD structure into the following form:
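A minimal sketch of that refined inner loop, assuming an `epsilon_greedy` helper, a tabular `Q`, and step-size and discount parameters `alpha` and `gamma` (all illustrative names, not from the original):

```python
state = env.reset()
action = epsilon_greedy(Q, state)                 # choose A from S using the policy derived from Q
done = False
while not done:
    next_state, reward, done = env.step(action)
    next_action = epsilon_greedy(Q, next_state)   # choose A' from S' with the same (on-policy) rule
    # SARSA update: TD(0) on the state-action value, performed while still playing
    target = reward + (0 if done else gamma * Q[next_state][next_action])
    Q[state][action] += alpha * (target - Q[state][action])
    state, action = next_state, next_action       # the policy improves implicitly as Q changes
```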