The Markov property:
The next state depends only on the current state and is independent of all earlier states; in other words, the current state already contains all relevant historical information.
Markov reward process, $<S, P, R, \gamma>$:
$S$ is a finite state set.
$P$ is the state transition probability matrix, $P_{ss'} = {\rm P}[S_{t+1} = s' | S_t = s]$
$R$ is the reward function, $R_s = {\rm E}[R_{t+1} | S_t = s]$
$\gamma$ is a discount factor.
Why is a discount factor needed?
1. It is mathematically convenient and keeps the return well defined.
2. It avoids infinite returns in cyclic Markov processes.
3. The discount expresses uncertainty about the future.
4. If rewards are financial, immediate rewards earn more interest than delayed rewards.
5. Human and animal behavior shows a preference for immediate rewards.
6. Undiscounted Markov reward processes ($\gamma = 1$) are also used, e.g. when every sequence terminates.
Define the return $G_t$ as the sum of discounted rewards starting from time step $t$:
$G_t = R_{t+1} + \gamma R_{t+2} + ... = \sum\limits_{k = 0}^\infty \gamma^k R_{t+k+1}$
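As a quick numerical check of the definition, a minimal sketch that accumulates a discounted return; the reward sequence and $\gamma$ below are made-up values:

```python
# Hypothetical reward sequence R_{t+1}, R_{t+2}, ... and a made-up discount factor.
rewards = [1.0, 0.0, 2.0, 5.0]
gamma = 0.9

# G_t = sum_k gamma^k * R_{t+k+1}
G = sum(gamma ** k * r for k, r in enumerate(rewards))
print(G)  # 1.0 + 0.9*0.0 + 0.81*2.0 + 0.729*5.0 = 6.265
```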
Define the value function $V(s)$ of a Markov reward process as the long-term value of state $s$, i.e. the expected return starting from state $s$:
$V(s) = {\rm E}[G_t | S_t = s]$
To expose the recursive structure of the Markov reward process, the value function can be decomposed with the Bellman equation:
$\begin{aligned} V(s) &= {\rm E}[G_t | S_t = s] \\ &= {\rm E}[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + ... | S_t = s] \\ &= {\rm E}[R_{t+1} + \gamma (R_{t+2} + \gamma R_{t+3} + ...) | S_t = s] \\ &= {\rm E}[R_{t+1} + \gamma G_{t+1} | S_t = s] \\ &= {\rm E}[R_{t+1} + \gamma V(S_{t+1}) | S_t = s] \end{aligned}$
That is, $V(s) = {\rm E}[R_{t+1} + \gamma V(S_{t+1}) | S_t = s]$
It can be seen that the value function $V(s)$ of the current state equals the expected immediate reward $R_s$ plus the discounted expected value of the successor state, $\gamma \sum\limits_{s' \in S} P_{ss'} V(s')$:
$V(s) = R_s + \gamma \sum\limits_{s' \in S} P_{ss'} V(s')$
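Because this equation is linear in $V$, a small MRP can be solved exactly as $V = (I - \gamma P)^{-1} R$. A minimal sketch with numpy; the 3-state transition matrix and reward vector below are made-up values:

```python
import numpy as np

# Made-up 3-state MRP: P[i, j] = P(S_{t+1}=j | S_t=i), R[i] = expected immediate reward in state i.
P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.2, 0.7],
              [0.0, 0.0, 1.0]])   # state 2 is absorbing
R = np.array([1.0, 2.0, 0.0])
gamma = 0.9

# Bellman equation in matrix form: V = R + gamma * P @ V  =>  (I - gamma * P) V = R
V = np.linalg.solve(np.eye(3) - gamma * P, R)
print(V)
```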
Markov decision process, $<S, A, P, R, \gamma>$:
Introducing a finite action set $A$ turns the Markov reward process into a Markov decision process.
$S$ is a finite state set.
$A$ is a finite action set.
$P$ is the state transition probability matrix, $P_{ss'}^a = {\rm P}[S_{t+1} = s' | S_t = s, A_t = a]$
$R$ is the reward function, $R_s^a = {\rm E}[R_{t+1} | S_t = s, A_t = a]$
$\gamma$ is a discount factor.
Define a policy $\pi$ as a distribution over actions given a state: $\pi(a|s) = {\rm P}[A_t = a | S_t = s]$
1. A policy completely defines the behavior of the agent.
2. In a Markov decision process, a decision is a choice of action; it depends only on the current state and not on the history.
3. Because the policy $\pi$ is a probability distribution over actions in a given state, averaging over it recovers the state transition probabilities $P_{ss'}$ and the expected immediate rewards $R_s$ of the induced Markov reward process (see the sketch after the two equations below):
$P_{ss'} = \sum\limits_{a \in A} \pi(a|s) P_{ss'}^a$
$R_s = \sum\limits_{a \in A} \pi(a|s) R_s^a$
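A minimal sketch of this reduction, assuming (the names, shapes, and values are made up for illustration) the MDP is stored as numpy arrays `P[a, s, s']` and `R[s, a]` and the policy as `pi[s, a]`:

```python
import numpy as np

n_states, n_actions = 3, 2
rng = np.random.default_rng(0)

# Made-up MDP: P[a, s, s'] = P(s' | s, a), R[s, a] = expected immediate reward.
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))
R = rng.random((n_states, n_actions))

# Made-up uniform-random policy pi[s, a] = pi(a | s).
pi = np.full((n_states, n_actions), 1.0 / n_actions)

# P_pi[s, s'] = sum_a pi(a|s) * P(s'|s, a)
P_pi = np.einsum('sa,ast->st', pi, P)
# R_pi[s] = sum_a pi(a|s) * R(s, a)
R_pi = (pi * R).sum(axis=1)

# The induced MRP <S, P_pi, R_pi, gamma> can then be solved exactly as above.
gamma = 0.9
V_pi = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
print(V_pi)
```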
Define the value function $V_\pi(s)$ of a Markov decision process as the expected return when starting from state $s$ and following policy $\pi$, i.e. selecting actions according to $\pi$:
$V_\pi(s) = {\rm E}_\pi[G_t | S_t = s]$
Define the action-value function $Q_\pi(s, a)$ of a Markov decision process as the expected return when starting from state $s$, taking action $a$, and thereafter following policy $\pi$:
$Q_\pi(s, a) = {\rm E}_\pi[G_t | S_t = s, A_t = a]$
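Because both definitions are expectations of the return $G_t$, they can be estimated by sampling episodes under $\pi$ and averaging the discounted returns. A minimal Monte Carlo sketch for $V_\pi(s)$, using the same made-up array layout as the previous snippet and truncating each episode after a fixed horizon:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 3, 2, 0.9

# Same kind of made-up MDP and uniform-random policy as in the previous sketch.
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))   # P[a, s, s']
R = rng.random((n_states, n_actions))                               # R[s, a]
pi = np.full((n_states, n_actions), 1.0 / n_actions)                # pi[s, a]

def rollout_return(s, horizon=200):
    """Sample one episode from state s under pi; return its (truncated) discounted return."""
    G, discount = 0.0, 1.0
    for _ in range(horizon):
        a = rng.choice(n_actions, p=pi[s])      # A_t ~ pi(. | S_t)
        G += discount * R[s, a]                 # accumulate gamma^k * R_{t+k+1}
        s = rng.choice(n_states, p=P[a, s])     # S_{t+1} ~ P(. | S_t, A_t)
        discount *= gamma
    return G

# Monte Carlo estimate of V_pi(0): average the sampled returns.
print(np.mean([rollout_return(0) for _ in range(2000)]))
```

Truncating at horizon $H$ ignores a tail of at most $\gamma^H / (1 - \gamma)$ times the maximum reward, which is negligible here.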
To expose the recursive structure of the Markov decision process, the Bellman equations give the following recursions:
$V_\pi(s) = {\rm E}_\pi[R_{t+1} + \gamma V_\pi(S_{t+1}) | S_t = s]$
$Q_\pi(s, a) = {\rm E}_\pi[R_{t+1} + \gamma Q_\pi(S_{t+1}, A_{t+1}) | S_t = s, A_t = a]$
$V_\pi(s) = \sum\limits_{a \in A} \pi(a|s) Q_\pi(s, a)$ (1)
$Q_\pi(s, a) = R_s^a + \gamma \sum\limits_{s' \in S} P_{ss'}^a V_\pi(s')$ (2)
Substituting (2) into (1) gives the Bellman expectation equation for $V_\pi(s)$:
$V_\pi(s) = \sum\limits_{a \in A} \pi(a|s) \left( R_s^a + \gamma \sum\limits_{s' \in S} P_{ss'}^a V_\pi(s') \right)$
Substituting (1) into (2) gives the Bellman expectation equation for $Q_\pi(s, a)$:
$Q_\pi(s, a) = R_s^a + \gamma \sum\limits_{s' \in S} P_{ss'}^a \sum\limits_{a' \in A} \pi(a'|s') Q_\pi(s', a')$
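These recursions translate directly into iterative policy evaluation: start from $V_\pi = 0$ and repeatedly apply the right-hand side of the $V_\pi$ equation until it stops changing. A minimal sketch, again with a made-up MDP in the `P[a, s, s']` / `R[s, a]` / `pi[s, a]` layout used earlier:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 3, 2, 0.9

P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))   # P[a, s, s']
R = rng.random((n_states, n_actions))                               # R[s, a]
pi = np.full((n_states, n_actions), 1.0 / n_actions)                # pi[s, a]

# Iterative policy evaluation: V <- sum_a pi(a|s) (R_s^a + gamma * sum_s' P_ss'^a V(s'))
V = np.zeros(n_states)
for _ in range(1000):
    Q = R + gamma * np.einsum('ast,t->sa', P, V)    # Q[s, a] = R_s^a + gamma * sum_s' P_ss'^a V(s')
    V_next = (pi * Q).sum(axis=1)                   # V(s) = sum_a pi(a|s) Q(s, a)
    if np.max(np.abs(V_next - V)) < 1e-10:          # stop once the backup no longer changes V
        break
    V = V_next
print(V)
```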
Theorem:
For any Markov decision process,
1. There exists an optimal policy $\pi_*$ that is better than or equal to every other policy: $\pi_* \ge \pi, \ \forall \pi$
2. Every optimal policy achieves the optimal value functions: $V_{\pi_*}(s) = V_*(s)$ and $Q_{\pi_*}(s, a) = Q_*(s, a)$
If we know $Q_{\pi_*}(s, a)$, we can obtain an optimal policy $\pi_*$ by selecting, in each state, the action $a$ with the maximum Q value.
Because the Q value already accounts for all possible future states, acting greedily at each step is sufficient.
The Bellman optimality equations are as follows:
$V_*(s) = \max\limits_{a \in A} \left( R_s^a + \gamma \sum\limits_{s' \in S} P_{ss'}^a V_*(s') \right)$
$Q_*(s, a) = R_s^a + \gamma \sum\limits_{s' \in S} P_{ss'}^a \max\limits_{a' \in A} Q_*(s', a')$
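A minimal value-iteration sketch that repeatedly applies the $V_*$ backup and then reads off a greedy (deterministic) policy from the resulting Q values; the MDP arrays are again made up:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 3, 2, 0.9

P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))   # P[a, s, s']
R = rng.random((n_states, n_actions))                               # R[s, a]

# Value iteration: V <- max_a (R_s^a + gamma * sum_s' P_ss'^a V(s'))
V = np.zeros(n_states)
for _ in range(1000):
    Q = R + gamma * np.einsum('ast,t->sa', P, V)   # Q[s, a] = R_s^a + gamma * sum_s' P_ss'^a V(s')
    V_next = Q.max(axis=1)                         # V_*(s) = max_a Q(s, a)
    if np.max(np.abs(V_next - V)) < 1e-10:
        break
    V = V_next

# Greedy policy: in each state pick the action with the largest Q value.
pi_star = Q.argmax(axis=1)
print(V, pi_star)
```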
Next, we discuss how to obtain the optimal policy through iteration.
References:
1. David Silver, UCL Course on Reinforcement Learning
2. Richard S. Sutton and Andrew G. Barto, Reinforcement Learning: An Introduction