"Cs229-lecture16" Markov decision process

Source: Internet
Author: User

Earlier lectures covered supervised learning and unsupervised learning; today's main topic is reinforcement learning.

  • Markov decision process (MDP)
  • Value function
  • Value iteration (an algorithm for solving an MDP)
  • Policy iteration (an algorithm for solving an MDP)

What is reinforcement learning?

Reinforcement learning (also called evaluative learning) is an important machine learning method with many applications in intelligent robot control and in analysis and prediction. Traditional taxonomies of machine learning do not mention reinforcement learning. In connectionist learning, however, learning algorithms are divided into three types: unsupervised learning, supervised learning, and reinforcement learning.

    • Based on its current state, the agent selects an action a and interacts with the environment; it then observes the next state and receives a reward r (which may be good or bad).

    • By interacting with the environment repeatedly in this way, the agent will, under certain conditions, learn an optimal or suboptimal policy.
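
The interaction loop described in the two points above can be sketched in a few lines of Python. This is only an illustration: the corridor environment, the `step` function, the 20% slip chance, and the fixed always-east policy are all assumptions made up for the example, and no learning takes place in it.

```python
import random

def step(state, action, rng):
    """Hypothetical environment: cells 0..4 on a line, reward 1.0 on landing in cell 4."""
    move = 1 if action == "E" else -1
    if rng.random() < 0.2:          # assumed 20% chance that the move goes the wrong way
        move = -move
    next_state = min(4, max(0, state + move))
    reward = 1.0 if next_state == 4 else 0.0
    return next_state, reward

def run_episode(policy, start_state, n_steps, rng):
    """Repeat: choose an action, let the environment respond, record what was observed."""
    state, history = start_state, []
    for _ in range(n_steps):
        action = policy(state)                          # agent acts on its current state
        next_state, reward = step(state, action, rng)   # environment returns s' and r
        history.append((state, action, next_state, reward))
        state = next_state
    return history

# A fixed policy that always tries to move east; a learning agent would update it over time.
history = run_episode(lambda s: "E", start_state=0, n_steps=10, rng=random.Random(0))
```

Each entry of `history` is one (state, action, next state, reward) observation, which is the raw material a reinforcement learning algorithm would learn from.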

Markov decision process

A Markov decision process is the optimal decision process for a stochastic dynamic system, based on the theory of Markov processes. It is the main object of study in sequential decision-making. It combines Markov processes with deterministic dynamic programming, so it is also called Markov stochastic dynamic programming, and it belongs to the mathematical programming branch of operations research.

(The following is reposted from: http://blog.csdn.net/dark_scope/article/details/8252969)

A Markov decision process is a five-tuple (S, A, P, γ, R). The role of each element is illustrated below with the example of a robot walking on a map.

S: the state set, i.e. all possible states. In the robot example, these are all the positions on the map where the robot might be.

A: the action set, i.e. all possible actions. In the robot example, assume the robot can move in only four directions, so A = {N, S, E, W}, representing the four directions.

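P: the state transition probabilities, i.e. the probability of ending up in each possible next state s' when the robot takes action a in state s, written P_sa(s').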

γ (gamma): the discount factor, a number between 0 and 1. It determines how much each step contributes to the final result; in a board-game example, it controls how strongly the current move affects the outcome. This description is admittedly vague, and the explanation below should make it clearer.

R: the reward function. It gives the reward as a number; in the robot example it corresponds to a weight attached to each position on the map.
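
To make the five elements concrete, here is a minimal sketch of how such a tuple might be stored in Python. Everything in it is an assumption chosen for illustration: the `MDP` container, the `make_corridor` helper, a one-dimensional corridor with only the actions W and E, and a 10% chance of slipping the opposite way. P is stored as transition probabilities, which is the reading the text settles on further below.

```python
from typing import Dict, List, NamedTuple, Tuple

class MDP(NamedTuple):
    states: List[int]                            # S
    actions: List[str]                           # A
    P: Dict[Tuple[int, str], Dict[int, float]]   # P[(s, a)][s'] = probability of landing in s'
    gamma: float                                 # discount factor
    R: Dict[int, float]                          # R[s] = reward attached to state s

def make_corridor(n=4, slip=0.1, gamma=0.9):
    """Hypothetical map: n cells in a row; only the rightmost cell carries a reward."""
    states = list(range(n))
    actions = ["W", "E"]
    P = {}
    for s in states:
        for a in actions:
            step = 1 if a == "E" else -1
            intended = min(n - 1, max(0, s + step))   # walls clamp the position
            slipped = min(n - 1, max(0, s - step))    # the move goes the opposite way
            dist = {}
            dist[intended] = dist.get(intended, 0.0) + (1.0 - slip)
            dist[slipped] = dist.get(slipped, 0.0) + slip
            P[(s, a)] = dist
    R = {s: (1.0 if s == n - 1 else 0.0) for s in states}
    return MDP(states, actions, P, gamma, R)

mdp = make_corridor()
```

For every (state, action) pair the stored probabilities sum to 1, which is a quick sanity check on any hand-built transition table.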

With this definition, the robot's activity on the map can be written as the following process:

$$ s_0 \xrightarrow{a_0} s_1 \xrightarrow{a_1} s_2 \xrightarrow{a_2} s_3 \cdots $$

That is, starting from the initial position s_0, the robot selects an action and reaches another state, and so on until a final state is reached. The value of this process is defined as:

$$ R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \cdots $$

Rewards from earlier decisions therefore have a greater effect on the value, and each later reward is discounted by a further factor of γ.
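
The discounted sum can be checked numerically with a short sketch. The function name `discounted_return` and the sample reward sequences are assumptions for illustration only.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards[t] * gamma**t, accumulated from the last reward backwards."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total

early = discounted_return([1.0, 0.0, 0.0], gamma=0.5)   # reward received at step 0
late = discounted_return([0.0, 0.0, 1.0], gamma=0.5)    # same reward received at step 2
```

With γ = 0.5, the same reward of 1 is worth 1 when received immediately and 0.25 when received two steps later, which is the decay described above.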

Given an MDP, all of its elements are fixed, so an optimal policy exists. A policy assigns an action to each state, and the optimal policy is the one under which the value obtained from any initial state is as large as possible. A policy is denoted by π. Let

$$ V^\pi(s) $$

denote the value obtained under policy π when s is the initial state. By the Bellman equation,

$$ V^\pi(s) = R(s) + \gamma \sum_{s'} P_{s\pi(s)}(s') V^\pi(s') $$

where P_{sπ(s)}(s') is the probability of moving from s to s' when taking the action π(s). Note that this is recursive: the value function at s depends on the value function at every successor state s'. (Here "value function" refers to V^π(·).)
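
One way to resolve the recursion is to apply the Bellman equation repeatedly until the values stop changing. The sketch below does this for a fixed policy; the function name `evaluate_policy`, the stopping tolerance, and the two-state example are assumptions for illustration, and P is taken to be a table of transition probabilities P[(s, a)][s'].

```python
def evaluate_policy(states, P, R, gamma, policy, tol=1e-10):
    """Iterate V(s) <- R(s) + gamma * sum_s' P[(s, policy[s])][s'] * V(s') until stable."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            a = policy[s]
            new_v = R[s] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v                      # in-place update, reused within the same sweep
        if delta < tol:
            return V

# Hypothetical two-state example: the single action "go" always leads to state 1.
states = [0, 1]
P = {(0, "go"): {1: 1.0}, (1, "go"): {1: 1.0}}
R = {0: 0.0, 1: 1.0}
V = evaluate_policy(states, P, R, gamma=0.5, policy={0: "go", 1: "go"})
```

By hand, V(1) = 1 + 0.5 · V(1) gives V(1) = 2, and then V(0) = 0 + 0.5 · 2 = 1; the iteration should approach these values.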

We denote the optimal policy by π* and the optimal value function by V*. The two determine each other: V* is the value function of π*, and π* can be recovered from V*.

Value iteration

This process is relatively simple. Because R is known, we can update V repeatedly, and V eventually converges to V*:

$$ V(s) := R(s) + \max_{a} \gamma \sum_{s'} P_{sa}(s') V(s') $$

The optimal policy π* can then be obtained from V*: in each state, look at all the actions and choose the one whose expected next-state value is largest,

$$ \pi^*(s) = \arg\max_{a} \sum_{s'} P_{sa}(s') V^*(s') $$

Solving the Bellman equations yields all the values of V, and the iterative update here is one way of doing so. Note that P in a Markov decision process describes randomness that exists in the environment, for example that the robot's turn may not land exactly in the intended direction. It is not the probability that the robot chooses action a in state s, a point that was not made clear earlier. Under this interpretation, the transition probability P_sa(s') is an objective statistic.
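
A minimal sketch of value iteration followed by greedy policy extraction is given below. The helper names (`backup`, `value_iteration`, `greedy_policy`, `transitions`), the three-cell corridor, the 10% slip probability, and the tolerance are all assumptions chosen for illustration. The slip is meant to mirror the point above that P models randomness in the environment, not the robot's choice.

```python
def backup(P, V, s, a):
    """Expected value of the next state when taking action a in state s."""
    return sum(p * V[s2] for s2, p in P[(s, a)].items())

def value_iteration(states, actions, P, R, gamma, tol=1e-10):
    """Repeat V(s) <- R(s) + gamma * max_a sum_s' P_sa(s') V(s') until V stops changing."""
    V = {s: 0.0 for s in states}
    while True:
        new_V = {s: R[s] + gamma * max(backup(P, V, s, a) for a in actions) for s in states}
        delta = max(abs(new_V[s] - V[s]) for s in states)
        V = new_V
        if delta < tol:
            return V

def greedy_policy(states, actions, P, V):
    """In each state, pick the action with the largest expected next-state value."""
    return {s: max(actions, key=lambda a: backup(P, V, s, a)) for s in states}

def transitions(s, a, n=3, slip=0.1):
    """Hypothetical corridor: the move succeeds with prob 1 - slip, else goes the other way."""
    step = 1 if a == "E" else -1
    intended = min(n - 1, max(0, s + step))
    slipped = min(n - 1, max(0, s - step))
    dist = {}
    dist[intended] = dist.get(intended, 0.0) + (1.0 - slip)
    dist[slipped] = dist.get(slipped, 0.0) + slip
    return dist

states = [0, 1, 2]
actions = ["W", "E"]
P = {(s, a): transitions(s, a) for s in states for a in actions}
R = {0: 0.0, 1: 0.0, 2: 1.0}          # only the rightmost cell is rewarded
V_star = value_iteration(states, actions, P, R, gamma=0.9)
pi_star = greedy_policy(states, actions, P, V_star)
```

In this toy corridor the extracted policy heads east in every cell, toward the only rewarded state, and the values grow as the robot gets closer to it.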

Policy iteration

Policy iteration is one of the basic methods for finding an optimal policy in dynamic programming. Using the basic equations of dynamic programming, it alternates two steps, policy evaluation and policy improvement, to produce a sequence of successively improved policies that eventually reaches or converges to the optimal policy.

Each round improves π, so that π converges to π* and V converges to V*. However, because the value function of π has to be recomputed in every round, this algorithm is used less often.
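
The alternation of evaluation and improvement can be sketched as follows. As with any toy example, the names (`expected_next`, `evaluate`, `policy_iteration`, `transitions`), the three-cell corridor with a 10% slip, the iterative evaluation with a tolerance, and the arbitrary starting policy are assumptions for illustration, not a definitive implementation.

```python
def expected_next(P, V, s, a):
    """Expected value of the next state when taking action a in state s."""
    return sum(p * V[s2] for s2, p in P[(s, a)].items())

def evaluate(states, P, R, gamma, policy, tol=1e-10):
    """Policy evaluation: compute the value function of a fixed policy by iteration."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            new_v = R[s] + gamma * expected_next(P, V, s, policy[s])
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:
            return V

def policy_iteration(states, actions, P, R, gamma):
    policy = {s: actions[0] for s in states}            # arbitrary starting policy
    while True:
        V = evaluate(states, P, R, gamma, policy)       # evaluation step
        improved = {s: max(actions, key=lambda a: expected_next(P, V, s, a))
                    for s in states}                    # improvement step
        if improved == policy:                          # no change: stop
            return policy, V
        policy = improved

def transitions(s, a, n=3, slip=0.1):
    """Hypothetical corridor: the move succeeds with prob 1 - slip, else goes the other way."""
    step = 1 if a == "E" else -1
    intended = min(n - 1, max(0, s + step))
    slipped = min(n - 1, max(0, s - step))
    dist = {}
    dist[intended] = dist.get(intended, 0.0) + (1.0 - slip)
    dist[slipped] = dist.get(slipped, 0.0) + slip
    return dist

states = [0, 1, 2]
actions = ["W", "E"]
P = {(s, a): transitions(s, a) for s in states for a in actions}
R = {0: 0.0, 1: 0.0, 2: 1.0}
policy, V = policy_iteration(states, actions, P, R, gamma=0.9)
```

The `evaluate` call inside the loop is the part that gets repeated for every candidate policy, which is the cost the paragraph above refers to.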

