"Cs229-lecture16" Markov decision process

Last Update:2015-04-13 Source: Internet

Author: User

Earlier lectures covered supervised learning and unsupervised learning. Today's main topic is reinforcement learning:

Markov decision process (MDP)

Value function

Value iteration (an algorithm for solving MDPs)

Policy iteration (an algorithm for solving MDPs)

What is reinforcement learning?

Reinforcement learning (also called evaluative learning) is an important machine learning method with many applications in intelligent robot control and in analysis and prediction. Traditional classifications of machine learning do not mention reinforcement learning, whereas connectionist learning divides learning algorithms into three types: unsupervised learning, supervised learning, and reinforcement learning.

Based on its current state, the agent selects an action a and interacts with the environment. It then observes the next state and receives a reward r (which may be good or bad).
By interacting with the environment repeatedly in this way, the agent will, under certain conditions, learn an optimal or near-optimal policy.
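The interaction loop described above can be sketched in a few lines of Python. This is only an illustration: the `CorridorEnv` class, its `step` method, and the two policies are hypothetical names invented for this sketch, not part of the lecture.

```python
import random

class CorridorEnv:
    """Hypothetical toy environment: positions 0..4, reward +1 on reaching position 4."""

    def __init__(self):
        self.state = 0

    def step(self, action):
        # action is -1 (left) or +1 (right); the position is clipped to [0, 4]
        self.state = max(0, min(4, self.state + action))
        done = self.state == 4
        reward = 1.0 if done else 0.0
        return self.state, reward, done

def run_episode(env, policy, max_steps=100):
    """Let the agent interact with the environment and return the total reward."""
    state, total = env.state, 0.0
    for _ in range(max_steps):
        action = policy(state)                  # agent picks an action from its current state
        state, reward, done = env.step(action)  # environment returns next state and reward
        total += reward
        if done:
            break
    return total

def random_policy(state):
    return random.choice([-1, 1])

def always_right(state):
    return 1

print(run_episode(CorridorEnv(), always_right))
```

Neither policy here learns anything; the sketch only shows the state, action, reward cycle that a learning algorithm would sit on top of.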

Markov decision process

A Markov decision process is an optimal decision process for stochastic dynamic systems, based on the theory of Markov processes. It is the main research area of sequential decision making. It combines Markov processes with deterministic dynamic programming, so it is also called Markov stochastic dynamic programming, and it belongs to the mathematical programming branch of operations research.

(The following is reproduced from: http://blog.csdn.net/dark_scope/article/details/8252969)

A Markov decision process is a five-tuple (S, A, P, γ, R). The example of a robot walking on a map illustrates the role of each element:

S: the state set, i.e. all possible states. In the robot example, these are all the positions where the robot might be.

A: the action set, i.e. all possible actions. In the robot example, assume the robot can only move in four directions, so A = {N, S, E, W}.

P: the state transition probabilities. $P_{sa}(s')$ is the probability of ending up in state s' after the robot takes action a in state s.

γ: the discount factor, a number between 0 and 1. It determines how much an action affects the result depending on when it is taken; in a board-game example, it controls how much a given move affects the final outcome. This may sound vague for now; the explanation below should make it clearer.

R: the reward function. It may be defined on state-action pairs or on states alone; in the robot example, it corresponds to a weight attached to each position on the map.
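One way to hold the five elements in code is a small container plus a constructor for a toy map. The sketch below assumes a one-dimensional corridor with only two actions (E and W) instead of the four-direction map, and a hypothetical `slip` probability with which the robot moves the opposite way. The names `MDP` and `make_corridor` are made up for illustration.

```python
from dataclasses import dataclass

@dataclass
class MDP:
    states: list   # S: all possible states
    actions: list  # A: all possible actions
    P: dict        # P[(s, a)] -> {next_state: probability}
    gamma: float   # discount factor, between 0 and 1
    R: dict        # R[s] -> reward for being in state s

def make_corridor(n=4, slip=0.2, gamma=0.9):
    """Hypothetical 1-D map: cells 0..n-1; the last cell is an absorbing goal."""
    states = list(range(n))
    actions = ["W", "E"]
    P = {}
    for s in states:
        for a in actions:
            if s == n - 1:
                P[(s, a)] = {s: 1.0}  # once at the goal, stay there
                continue
            step = 1 if a == "E" else -1
            intended = max(0, min(n - 1, s + step))
            opposite = max(0, min(n - 1, s - step))
            dist = {}
            dist[intended] = dist.get(intended, 0.0) + (1.0 - slip)
            dist[opposite] = dist.get(opposite, 0.0) + slip
            P[(s, a)] = dist
    R = {s: (1.0 if s == n - 1 else 0.0) for s in states}
    return MDP(states, actions, P, gamma, R)

mdp = make_corridor()
print(mdp.P[(1, "E")])
```

Storing P as a dictionary of next-state distributions keeps each row easy to check: the probabilities for every (state, action) pair must sum to 1.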

With such a decision process, the robot's activity on the map can be expressed as:

$s_0 \xrightarrow{a_0} s_1 \xrightarrow{a_1} s_2 \xrightarrow{a_2} \cdots$

That is, starting from the initial position, the robot selects an action and reaches another state, and this repeats until the final state is reached. We define the value of this process as:

$R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \cdots$

Earlier decisions therefore have a greater effect on the value, and each later reward is discounted by a further factor of γ.
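As a small worked example, assume the value of a trajectory is the discounted sum R(s_0) + γ R(s_1) + γ² R(s_2) + ... of the rewards collected along the way. The helper name `discounted_return` is chosen only for this sketch.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * rewards[t] over a finite reward sequence."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total

# The same reward is worth less the later it arrives.
early = discounted_return([1.0, 0.0, 0.0], gamma=0.9)
late = discounted_return([0.0, 0.0, 1.0], gamma=0.9)
print(early, late)
```

With γ close to 1 the agent values distant rewards almost as much as immediate ones; with γ close to 0 it cares almost only about the next step.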

Given an MDP, all of its elements are fixed, so there exists an optimal policy. A policy assigns an action to each state. The optimal policy is the one under which the value obtained on the way to the final state is maximal, whatever the initial state. A policy is denoted by π. We use

$V^\pi(s)$

to denote the expected value obtained by following the policy π with s as the initial state. By the Bellman equation, it equals:

$V^\pi(s) = R(s) + \gamma \sum_{s'} P_{s\pi(s)}(s') V^\pi(s')$

where $P_{s\pi(s)}(s')$ is the probability of moving from s to s' under the action π(s). Note that this is a recursive definition: the value function at s can only be known once the value function at every successor state s' is known. (The value function here refers to $V^\pi(\cdot)$.)
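The recursion can be resolved numerically by applying the Bellman equation as an update until the values stop changing. The sketch below does this for a fixed policy on a hypothetical three-state chain (state 2 is an absorbing goal with reward 1). The chain, its transition probabilities, and the function names are assumptions made for illustration.

```python
STATES = [0, 1, 2]
GAMMA = 0.9
R = {0: 0.0, 1: 0.0, 2: 1.0}
# P[(s, a)] -> {next_state: probability}; moves fail with probability 0.2
P = {
    (0, "L"): {0: 0.8, 1: 0.2},
    (0, "R"): {1: 0.8, 0: 0.2},
    (1, "L"): {0: 0.8, 2: 0.2},
    (1, "R"): {2: 0.8, 0: 0.2},
    (2, "L"): {2: 1.0},
    (2, "R"): {2: 1.0},
}

def backup(s, a, V):
    """Right-hand side of the Bellman equation for state s and action a."""
    return R[s] + GAMMA * sum(p * V[s2] for s2, p in P[(s, a)].items())

def evaluate_policy(policy, tol=1e-10):
    """Iterate V(s) <- R(s) + gamma * sum_s' P(s'|s, policy[s]) V(s') to convergence."""
    V = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            new_v = backup(s, policy[s], V)
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:
            return V

V_right = evaluate_policy({0: "R", 1: "R", 2: "R"})
print(V_right)
```

Because the state set is finite, the same values could also be obtained by solving the Bellman equations as a linear system; the iterative form is used here only to avoid extra dependencies.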

We denote the optimal policy by π* and the optimal value function by V*. The two are interchangeable in the sense that each can be obtained from the other.

Value iteration

This process is relatively simple. Because R is known, we can update V repeatedly, and V eventually converges to V*. The optimal policy π* can then be obtained from V*: in each state, choose the action that leads to the largest expected value. The update is based on the Bellman equation. Solving the Bellman equations yields all the values of V; here this is done by dynamic programming.

Note that P in a Markov decision process describes uncertainty that exists objectively, for example the robot may not turn exactly in the intended direction. It is not the probability that the robot chooses action a in state s.

In other words, the transition probability $P_{sa}(s')$ (the probability of reaching s' after taking action a in state s) is an objective statistic.
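A minimal sketch of value iteration follows, assuming the update V(s) := R(s) + γ max_a Σ_{s'} P_sa(s') V(s') and a greedy read-out of the policy from the converged values. The three-state chain and all names are hypothetical and chosen only to keep the example small.

```python
STATES = [0, 1, 2]
ACTIONS = ["L", "R"]
GAMMA = 0.9
R = {0: 0.0, 1: 0.0, 2: 1.0}
# P[(s, a)] -> {next_state: probability}; state 2 is an absorbing goal
P = {
    (0, "L"): {0: 0.8, 1: 0.2},
    (0, "R"): {1: 0.8, 0: 0.2},
    (1, "L"): {0: 0.8, 2: 0.2},
    (1, "R"): {2: 0.8, 0: 0.2},
    (2, "L"): {2: 1.0},
    (2, "R"): {2: 1.0},
}

def expected_next_value(s, a, V):
    """Expected value of the next state when taking action a in state s."""
    return sum(p * V[s2] for s2, p in P[(s, a)].items())

def value_iteration(tol=1e-10):
    """Repeat the max-over-actions Bellman update until V stops changing."""
    V = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            new_v = R[s] + GAMMA * max(expected_next_value(s, a, V) for a in ACTIONS)
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:
            return V

def greedy_policy(V):
    """In each state, pick the action with the largest expected next-state value."""
    return {s: max(ACTIONS, key=lambda a: expected_next_value(s, a, V)) for s in STATES}

V_star = value_iteration()
pi_star = greedy_policy(V_star)
print(V_star, pi_star)
```

In the absorbing goal state both actions are equally good, so the action reported there depends only on tie-breaking.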

Policy iteration

The policy iteration method is one of the basic methods in dynamic programming for finding an optimal policy. Using the basic equations of dynamic programming, it alternates between two steps, "value computation" (policy evaluation) and "policy improvement", to produce a sequence of successively improved policies that eventually reaches or converges to the optimal policy.

This is done by improving π at every step so that π converges to π* and V converges to V*. However, because the value function of the current π has to be recomputed at every step, this algorithm is used less often.
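The alternation between evaluation and improvement can be sketched as follows. This is an illustration on a hypothetical three-state chain. It evaluates each policy iteratively rather than by solving the linear system exactly, and it changes an action only when another one is strictly better, which is one simple way to guarantee termination when actions tie.

```python
STATES = [0, 1, 2]
ACTIONS = ["L", "R"]
GAMMA = 0.9
R = {0: 0.0, 1: 0.0, 2: 1.0}
# P[(s, a)] -> {next_state: probability}; state 2 is an absorbing goal
P = {
    (0, "L"): {0: 0.8, 1: 0.2},
    (0, "R"): {1: 0.8, 0: 0.2},
    (1, "L"): {0: 0.8, 2: 0.2},
    (1, "R"): {2: 0.8, 0: 0.2},
    (2, "L"): {2: 1.0},
    (2, "R"): {2: 1.0},
}

def expected_next_value(s, a, V):
    return sum(p * V[s2] for s2, p in P[(s, a)].items())

def evaluate(policy, tol=1e-10):
    """Policy evaluation: compute the value function of a fixed policy."""
    V = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            new_v = R[s] + GAMMA * expected_next_value(s, policy[s], V)
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:
            return V

def policy_iteration():
    policy = {s: ACTIONS[0] for s in STATES}  # arbitrary starting policy
    while True:
        V = evaluate(policy)                  # step 1: evaluate the current policy
        stable = True
        for s in STATES:                      # step 2: improve it greedily
            best = max(ACTIONS, key=lambda a: expected_next_value(s, a, V))
            gain = expected_next_value(s, best, V) - expected_next_value(s, policy[s], V)
            if gain > 1e-9:
                policy[s] = best
                stable = False
        if stable:
            return policy, V

pi_star, V_star = policy_iteration()
print(pi_star, V_star)
```

On this toy chain the loop stops after the second evaluation. The cost of each evaluation grows with the number of states, which is the recomputation overhead mentioned above.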

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
