Markov Decision Process


 

The Markov property:

The state at the next time step depends only on the state at the current time step, not on any earlier states. In other words, the current state captures all relevant information from the history.

Markov reward process, $<S, P, R, \gamma>$:

$S$ is a finite set of states.

$P$ is the state transition probability matrix, $P_{ss'} = \mathrm{P}[S_{t+1} = s' | S_t = s]$.

$R$ is the reward function, $R_s = \mathrm{E}[R_{t+1} | S_t = s]$.

$\gamma$ is a discount factor, $\gamma \in [0, 1]$.
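To make these definitions concrete, here is a minimal sketch of a toy Markov reward process in Python with NumPy. The state names, transition probabilities, and rewards are invented purely for illustration; only the structure (a row-stochastic matrix $P$, a reward vector $R$, and a discount factor $\gamma$) follows the definitions above.

```python
import numpy as np

# A toy 3-state Markov reward process (states and numbers are made up).
states = ["A", "B", "C"]

# P[i, j] = probability of moving from state i to state j; each row sums to 1.
P = np.array([
    [0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],   # C is absorbing
])

# R[i] = expected immediate reward received after leaving state i.
R = np.array([-1.0, -2.0, 0.0])

gamma = 0.9  # discount factor

assert np.allclose(P.sum(axis=1), 1.0), "each row of P must be a probability distribution"
```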

Why is the discount factor required?

1. It is mathematically convenient to work with.

2. It avoids infinite returns in cyclic Markov processes.

3. It expresses uncertainty about the future.

4. If rewards are financial, immediate rewards earn more interest than delayed rewards.

5. Human and animal behavior shows a preference for immediate reward.

6. An undiscounted Markov reward process ($\gamma = 1$) is also possible, for example when every sequence is guaranteed to terminate.

 

Define the return $G_t$ as the total discounted reward from time step $t$:

$G_t = R_{t+1} + \gamma R_{t+2} + \ldots = \sum\limits_{k = 0}^{\infty} \gamma^k R_{t+k+1}$
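As a quick illustration of the return, the sketch below discounts and sums a finite sequence of sampled rewards (the reward values and $\gamma$ are arbitrary):

```python
def discounted_return(rewards, gamma):
    """Compute G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ... for a finite reward list."""
    g = 0.0
    # Work backwards so each earlier reward picks up one more factor of gamma.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Rewards observed after time t (made-up numbers), discounted with gamma = 0.9.
print(discounted_return([-1.0, -2.0, 0.0, 10.0], gamma=0.9))  # -1 + 0.9*(-2) + 0.9^2*0 + 0.9^3*10 = 4.49
```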

Define the value function $V(s)$ of the Markov reward process as the long-term value of state $s$, that is, the expected return starting from state $s$:

$V(s) = \mathrm{E}[G_t | S_t = s]$

 

To expose the recursive structure of the Markov reward process, the value function can be decomposed with the Bellman equation:

$\begin{aligned} V(s) &= \mathrm{E}[G_t | S_t = s] \\ &= \mathrm{E}[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots | S_t = s] \\ &= \mathrm{E}[R_{t+1} + \gamma (R_{t+2} + \gamma R_{t+3} + \ldots) | S_t = s] \\ &= \mathrm{E}[R_{t+1} + \gamma G_{t+1} | S_t = s] \\ &= \mathrm{E}[R_{t+1} + \gamma V(S_{t+1}) | S_t = s] \end{aligned}$

$V(s) = \mathrm{E}[R_{t+1} + \gamma V(S_{t+1}) | S_t = s]$

That is, the value function $V(s)$ of the current state equals the expected immediate reward $R_s$ plus the discounted expected value of the next state, $\gamma \sum\limits_{s' \in S} P_{ss'} V(s')$:

$V(s) = R_s + \gamma \sum\limits_{s' \in S} P_{ss'} V(s')$
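In matrix form this reads $V = R + \gamma P V$, which for a small state space can be solved directly as $V = (I - \gamma P)^{-1} R$. A minimal sketch, reusing the same toy numbers as the earlier snippet (all of them invented):

```python
import numpy as np

# The same toy 3-state MRP as before (illustrative numbers only).
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0]])
R = np.array([-1.0, -2.0, 0.0])
gamma = 0.9

# Bellman equation in matrix form: V = R + gamma * P V  =>  (I - gamma * P) V = R
V = np.linalg.solve(np.eye(len(R)) - gamma * P, R)
print(V)  # value of each state
```

Solving the linear system directly is only practical for small state spaces; iterative methods avoid forming the inverse.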

 

Markov decision process, $<S, A, P, R, \gamma>$:

Adding a finite action space $A$ extends the Markov reward process to a Markov decision process.

 

$S$ is a finite set of states.

$A$ is a finite set of actions.

$P$ is the state transition probability matrix, $P_{ss'}^a = \mathrm{P}[S_{t+1} = s' | S_t = s, A_t = a]$.

$R$ is the reward function, $R_s^a = \mathrm{E}[R_{t+1} | S_t = s, A_t = a]$.

$\gamma$ is a discount factor.

 

Define a policy $\pi$ as a distribution over actions given a state, $\pi(a|s) = \mathrm{P}[A_t = a | S_t = s]$.

1. A policy completely defines the behavior of the agent.

2. In a Markov decision process, a decision is a choice of action; it depends only on the current state, not on the history.

3. Because the policy $\pi$ is a probability distribution over actions in each state, we can average over it to obtain the state transition probabilities $P_{ss'}$ and the expected immediate rewards $R_s$ of the induced Markov reward process:

$P_{ss'} = \sum\limits_{a \in A} \pi(a|s) P_{ss'}^a$

$R_s = \sum\limits_{a \in A} \pi(a|s) R_s^a$
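In other words, an MDP together with a fixed policy collapses into a Markov reward process. A minimal sketch of that averaging step, with a made-up two-state, two-action MDP (array names, shapes, and numbers are all assumptions for illustration):

```python
import numpy as np

# P_sa[a, s, s2]: probability of moving from s to s2 under action a.
P_sa = np.array([
    [[0.9, 0.1],    # action 0, from state 0
     [0.2, 0.8]],   # action 0, from state 1
    [[0.5, 0.5],    # action 1, from state 0
     [0.0, 1.0]],   # action 1, from state 1
])
# R_sa[s, a]: expected immediate reward for taking action a in state s.
R_sa = np.array([[1.0, 0.0],
                 [0.0, 2.0]])
# pi[s, a]: probability that the policy takes action a in state s.
pi = np.array([[0.5, 0.5],
               [0.3, 0.7]])

# Average out the actions to get the induced MRP.
P_pi = np.einsum("sa,ast->st", pi, P_sa)   # P_pi[s, s'] = sum_a pi(a|s) P(s'|s,a)
R_pi = np.einsum("sa,sa->s", pi, R_sa)     # R_pi[s]     = sum_a pi(a|s) R(s,a)
print(P_pi, R_pi)
```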

 

Define the value function $V_\pi(s)$ of the Markov decision process as the expected return obtained by starting in state $s$ and then selecting actions according to policy $\pi$:

$V_\pi(s) = \mathrm{E}_\pi[G_t | S_t = s]$

Define the action-value function $q_\pi(s, a)$ of the Markov decision process as the expected return obtained by starting in state $s$, taking action $a$, and then following policy $\pi$:

$q_\pi(s, a) = \mathrm{E}_\pi[G_t | S_t = s, A_t = a]$

 

For the Markov decision process, the Bellman equation again yields recursive forms:

$V_\pi(s) = \mathrm{E}_\pi[R_{t+1} + \gamma V_\pi(S_{t+1}) | S_t = s]$

$q_\pi(s, a) = \mathrm{E}_\pi[R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) | S_t = s, A_t = a]$

 

 

The state value can be written as the expectation of the action values under the policy:

$V_\pi(s) = \sum\limits_{a \in A} \pi(a|s) q_\pi(s, a)$ (1)

 

 

The action value can be written as the immediate reward plus the discounted expected value of the successor state:

$q_\pi(s, a) = R_s^a + \gamma \sum\limits_{s' \in S} P_{ss'}^a V_\pi(s')$ (2)

Substituting equation (2) into equation (1) gives the Bellman expectation equation for $V_\pi(s)$:

$V_\pi(s) = \sum\limits_{a \in A} \pi(a|s) \left( R_s^a + \gamma \sum\limits_{s' \in S} P_{ss'}^a V_\pi(s') \right)$

Substituting equation (1) into equation (2) gives the Bellman expectation equation for $q_\pi(s, a)$:

 

$q_\pi(s, a) = R_s^a + \gamma \sum\limits_{s' \in S} P_{ss'}^a \sum\limits_{a' \in A} \pi(a'|s') q_\pi(s', a')$
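These recursions translate directly into iterative policy evaluation: treat the Bellman expectation equation as an update rule and repeat until $V_\pi$ stops changing. A minimal sketch, assuming the same array layout as the previous snippet (P_sa indexed [action, state, next state], R_sa indexed [state, action]):

```python
import numpy as np

def policy_evaluation(P_sa, R_sa, pi, gamma=0.9, tol=1e-8):
    """Iteratively apply V(s) <- sum_a pi(a|s) [ R(s,a) + gamma * sum_s' P(s'|s,a) V(s') ]."""
    V = np.zeros(R_sa.shape[0])
    while True:
        # q[s, a] = R(s,a) + gamma * expected value of the successor state.
        q = R_sa + gamma * np.einsum("ast,t->sa", P_sa, V)
        # Average the action values under the policy to get the new state values.
        V_new = np.einsum("sa,sa->s", pi, q)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```

Because $\gamma < 1$ the update is a contraction, so the iteration converges to the unique fixed point $V_\pi$.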

 

Theorem:

For any Markov decision process:

1. There exists an optimal policy $\pi_*$ that is better than or equal to all other policies: $\pi_* \ge \pi, \forall \pi$.

2. All optimal policies achieve the optimal value functions: $V_{\pi_*}(s) = V_*(s)$ and $q_{\pi_*}(s, a) = q_*(s, a)$.

 

If we know $q_*(s, a)$, we can obtain an optimal policy by selecting, in each state, the action $a$ with the maximum Q value.

Because the Q value already accounts for everything that can happen afterwards, acting greedily at each step is sufficient.
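For example, given a table of optimal action values, extracting the greedy policy is just an argmax per state (the Q table below is made up only to show the shape of the operation):

```python
import numpy as np

# q_star[s, a]: optimal action values for a toy 2-state, 2-action problem (invented numbers).
q_star = np.array([[3.2, 5.1],
                   [4.0, 1.7]])

# Deterministic greedy policy: in each state pick the action with the largest Q value.
pi_star = np.argmax(q_star, axis=1)
print(pi_star)  # [1 0]
```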

 

The Bellman optimality equations are as follows:

$V_*(s) = \max\limits_{a \in A} \left( R_s^a + \gamma \sum\limits_{s' \in S} P_{ss'}^a V_*(s') \right)$

$q_*(s, a) = R_s^a + \gamma \sum\limits_{s' \in S} P_{ss'}^a \max\limits_{a' \in A} q_*(s', a')$

Next, let's look at how to obtain the optimal policy through iteration.
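As a minimal sketch of one such iterative method, value iteration repeatedly applies the Bellman optimality equation as an update; the array layout is the same assumed one as in the earlier snippets:

```python
import numpy as np

def value_iteration(P_sa, R_sa, gamma=0.9, tol=1e-8):
    """Repeatedly apply V(s) <- max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) V(s') ]."""
    V = np.zeros(R_sa.shape[0])
    while True:
        # q[s, a] under the current value estimate.
        q = R_sa + gamma * np.einsum("ast,t->sa", P_sa, V)
        V_new = q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            # Acting greedily with respect to the converged q gives an optimal policy.
            return V_new, np.argmax(q, axis=1)
        V = V_new
```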

 

 

References:

1. David Silver, Reinforcement Learning course (UCL).

2. Richard S. Sutton and Andrew G. Barto, Reinforcement Learning: An Introduction.
