The Markov property:
The next state depends only on the current state and is independent of all earlier states; in other words, the current state already contains all relevant historical information.
Markov reward process, $<S, P, R, \gamma>$:
$S$ is a finite state set.
$P$ is the state transition probability matrix, $P_{ss'} = {\rm P}[S_{t+1} = s' | S_t = s]$
$R$ is the reward function, $R_s = {\rm E}[R_{t+1} | S_t = s]$
$\gamma$ is a discount factor.
Why is a discount factor needed?
1. It is mathematically convenient and keeps the return well defined.
2. It avoids infinite returns in cyclic Markov processes.
3. The discount expresses uncertainty about the future.
4. If rewards are financial, immediate rewards earn more interest than delayed rewards.
5. Human and animal behavior shows a preference for immediate rewards.
6. Undiscounted Markov reward processes ($\gamma = 1$) are also used, e.g. when every sequence terminates.
Define the return $G_t$ as the sum of discounted rewards starting from time step $t$:
$G_t = R_{t+1} + \gamma R_{t+2} + ... = \sum\limits_{k = 0}^\infty \gamma^k R_{t+k+1}$
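As a quick numerical check of the definition, a minimal sketch that accumulates a discounted return; the reward sequence and $\gamma$ below are made-up values:

```python
# Hypothetical reward sequence R_{t+1}, R_{t+2}, ... and a made-up discount factor.
rewards = [1.0, 0.0, 2.0, 5.0]
gamma = 0.9

# G_t = sum_k gamma^k * R_{t+k+1}
G = sum(gamma ** k * r for k, r in enumerate(rewards))
print(G)  # 1.0 + 0.9*0.0 + 0.81*2.0 + 0.729*5.0 = 6.265
```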
Define the value function $V(s)$ of a Markov reward process as the long-term value of state $s$, i.e. the expected return starting from state $s$:
$V(s) = {\rm E}[G_t | S_t = s]$
To expose the recursive structure of the Markov reward process, the value function can be decomposed with the Bellman equation:
$\begin{aligned} V(s) &= {\rm E}[G_t | S_t = s] \\ &= {\rm E}[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + ... | S_t = s] \\ &= {\rm E}[R_{t+1} + \gamma (R_{t+2} + \gamma R_{t+3} + ...) | S_t = s] \\ &= {\rm E}[R_{t+1} + \gamma G_{t+1} | S_t = s] \\ &= {\rm E}[R_{t+1} + \gamma V(S_{t+1}) | S_t = s] \end{aligned}$
That is, $V(s) = {\rm E}[R_{t+1} + \gamma V(S_{t+1}) | S_t = s]$
It can be seen that the value function $V(s)$ of the current state equals the expected immediate reward $R_s$ plus the discounted expected value of the successor state, $\gamma \sum\limits_{s' \in S} P_{ss'} V(s')$:
$V(s) = R_s + \gamma \sum\limits_{s' \in S} P_{ss'} V(s')$
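Because this equation is linear in $V$, a small MRP can be solved exactly as $V = (I - \gamma P)^{-1} R$. A minimal sketch with numpy; the 3-state transition matrix and reward vector below are made-up values:

```python
import numpy as np

# Made-up 3-state MRP: P[i, j] = P(S_{t+1}=j | S_t=i), R[i] = expected immediate reward in state i.
P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.2, 0.7],
              [0.0, 0.0, 1.0]])   # state 2 is absorbing
R = np.array([1.0, 2.0, 0.0])
gamma = 0.9

# Bellman equation in matrix form: V = R + gamma * P @ V  =>  (I - gamma * P) V = R
V = np.linalg.solve(np.eye(3) - gamma * P, R)
print(V)
```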
Markov decision process, $<S, A, P, R, \gamma>$:
Introducing a finite action set $A$ turns the Markov reward process into a Markov decision process.
$S$ is a finite state set.
$A$ is a finite action set.
$P$ is the state transition probability matrix, $P_{ss'}^a = {\rm P}[S_{t+1} = s' | S_t = s, A_t = a]$
$R$ is the reward function, $R_s^a = {\rm E}[R_{t+1} | S_t = s, A_t = a]$
$\gamma$ is a discount factor.
Define a policy $\pi$ as a distribution over actions given a state: $\pi(a|s) = {\rm P}[A_t = a | S_t = s]$
1. A policy completely defines the behavior of the agent.
2. In a Markov decision process, a decision is a choice of action; it depends only on the current state and not on the history.
3. Because the policy $\pi$ is a probability distribution over actions in a given state, averaging over it recovers the state transition probabilities $P_{ss'}$ and the expected immediate rewards $R_s$ of the induced Markov reward process (see the sketch after the two equations below):
$P_{ss'} = \sum\limits_{a \in A} \pi(a|s) P_{ss'}^a$
$R_s = \sum\limits_{a \in A} \pi(a|s) R_s^a$
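A minimal sketch of this reduction, assuming (the names, shapes, and values are made up for illustration) the MDP is stored as numpy arrays `P[a, s, s']` and `R[s, a]` and the policy as `pi[s, a]`:

```python
import numpy as np

n_states, n_actions = 3, 2
rng = np.random.default_rng(0)

# Made-up MDP: P[a, s, s'] = P(s' | s, a), R[s, a] = expected immediate reward.
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))
R = rng.random((n_states, n_actions))

# Made-up uniform-random policy pi[s, a] = pi(a | s).
pi = np.full((n_states, n_actions), 1.0 / n_actions)

# P_pi[s, s'] = sum_a pi(a|s) * P(s'|s, a)
P_pi = np.einsum('sa,ast->st', pi, P)
# R_pi[s] = sum_a pi(a|s) * R(s, a)
R_pi = (pi * R).sum(axis=1)

# The induced MRP <S, P_pi, R_pi, gamma> can then be solved exactly as above.
gamma = 0.9
V_pi = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
print(V_pi)
```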
Define the value function $V_\pi(s)$ of a Markov decision process as the expected return when starting from state $s$ and following policy $\pi$, i.e. selecting actions according to $\pi$:
$V_\pi(s) = {\rm E}_\pi[G_t | S_t = s]$
Define the action-value function $Q_\pi(s, a)$ of a Markov decision process as the expected return when starting from state $s$, taking action $a$, and thereafter following policy $\pi$:
$Q_\pi(s, a) = {\rm E}_\pi[G_t | S_t = s, A_t = a]$
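Because both definitions are expectations of the return $G_t$, they can be estimated by sampling episodes under $\pi$ and averaging the discounted returns. A minimal Monte Carlo sketch for $V_\pi(s)$, using the same made-up array layout as the previous snippet and truncating each episode after a fixed horizon:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 3, 2, 0.9

# Same kind of made-up MDP and uniform-random policy as in the previous sketch.
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))   # P[a, s, s']
R = rng.random((n_states, n_actions))                               # R[s, a]
pi = np.full((n_states, n_actions), 1.0 / n_actions)                # pi[s, a]

def rollout_return(s, horizon=200):
    """Sample one episode from state s under pi; return its (truncated) discounted return."""
    G, discount = 0.0, 1.0
    for _ in range(horizon):
        a = rng.choice(n_actions, p=pi[s])      # A_t ~ pi(. | S_t)
        G += discount * R[s, a]                 # accumulate gamma^k * R_{t+k+1}
        s = rng.choice(n_states, p=P[a, s])     # S_{t+1} ~ P(. | S_t, A_t)
        discount *= gamma
    return G

# Monte Carlo estimate of V_pi(0): average the sampled returns.
print(np.mean([rollout_return(0) for _ in range(2000)]))
```

Truncating at horizon $H$ ignores a tail of at most $\gamma^H / (1 - \gamma)$ times the maximum reward, which is negligible here.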
To expose the recursive structure of the Markov decision process, the Bellman equations give the following recursions:
$V_\pi(s) = {\rm E}_\pi[R_{t+1} + \gamma V_\pi(S_{t+1}) | S_t = s]$
$Q_\pi(s, a) = {\rm E}_\pi[R_{t+1} + \gamma Q_\pi(S_{t+1}, A_{t+1}) | S_t = s, A_t = a]$
$V_\pi(s) = \sum\limits_{a \in A} \pi(a|s) Q_\pi(s, a)$ (1)
$Q_\pi(s, a) = R_s^a + \gamma \sum\limits_{s' \in S} P_{ss'}^a V_\pi(s')$ (2)
Substituting (2) into (1) gives the Bellman expectation equation for $V_\pi(s)$:
$V_\pi(s) = \sum\limits_{a \in A} \pi(a|s) \left( R_s^a + \gamma \sum\limits_{s' \in S} P_{ss'}^a V_\pi(s') \right)$
Substituting (1) into (2) gives the Bellman expectation equation for $Q_\pi(s, a)$:
$Q_\pi(s, a) = R_s^a + \gamma \sum\limits_{s' \in S} P_{ss'}^a \sum\limits_{a' \in A} \pi(a'|s') Q_\pi(s', a')$
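These recursions translate directly into iterative policy evaluation: start from $V_\pi = 0$ and repeatedly apply the right-hand side of the $V_\pi$ equation until it stops changing. A minimal sketch, again with a made-up MDP in the `P[a, s, s']` / `R[s, a]` / `pi[s, a]` layout used earlier:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 3, 2, 0.9

P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))   # P[a, s, s']
R = rng.random((n_states, n_actions))                               # R[s, a]
pi = np.full((n_states, n_actions), 1.0 / n_actions)                # pi[s, a]

# Iterative policy evaluation: V <- sum_a pi(a|s) (R_s^a + gamma * sum_s' P_ss'^a V(s'))
V = np.zeros(n_states)
for _ in range(1000):
    Q = R + gamma * np.einsum('ast,t->sa', P, V)    # Q[s, a] = R_s^a + gamma * sum_s' P_ss'^a V(s')
    V_next = (pi * Q).sum(axis=1)                   # V(s) = sum_a pi(a|s) Q(s, a)
    if np.max(np.abs(V_next - V)) < 1e-10:          # stop once the backup no longer changes V
        break
    V = V_next
print(V)
```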
Theorem:
For any Markov decision process,
1. There exists an optimal policy $\pi_*$ that is better than or equal to every other policy: $\pi_* \ge \pi, \ \forall \pi$
2. Every optimal policy achieves the optimal value functions: $V_{\pi_*}(s) = V_*(s)$ and $Q_{\pi_*}(s, a) = Q_*(s, a)$
If we know $Q_{\pi_*}(s, a)$, we can obtain an optimal policy $\pi_*$ by selecting, in each state, the action $a$ with the maximum Q value.
Because the Q value already accounts for all possible future states, acting greedily at each step is sufficient.
The Bellman optimality equations are as follows:
$V_*(s) = \max\limits_{a \in A} \left( R_s^a + \gamma \sum\limits_{s' \in S} P_{ss'}^a V_*(s') \right)$
$Q_*(s, a) = R_s^a + \gamma \sum\limits_{s' \in S} P_{ss'}^a \max\limits_{a' \in A} Q_*(s', a')$
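A minimal value-iteration sketch that repeatedly applies the $V_*$ backup and then reads off a greedy (deterministic) policy from the resulting Q values; the MDP arrays are again made up:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 3, 2, 0.9

P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))   # P[a, s, s']
R = rng.random((n_states, n_actions))                               # R[s, a]

# Value iteration: V <- max_a (R_s^a + gamma * sum_s' P_ss'^a V(s'))
V = np.zeros(n_states)
for _ in range(1000):
    Q = R + gamma * np.einsum('ast,t->sa', P, V)   # Q[s, a] = R_s^a + gamma * sum_s' P_ss'^a V(s')
    V_next = Q.max(axis=1)                         # V_*(s) = max_a Q(s, a)
    if np.max(np.abs(V_next - V)) < 1e-10:
        break
    V = V_next

# Greedy policy: in each state pick the action with the largest Q value.
pi_star = Q.argmax(axis=1)
print(V, pi_star)
```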
Next, we discuss how to obtain the optimal policy through iteration.
References:
1. David Silver, UCL Course on Reinforcement Learning
2. Richard S. Sutton and Andrew G. Barto, Reinforcement Learning: An Introduction