Reinforcement Learning: the Markov Decision Process (MDP). I recently started learning machine learning because of research needs. Before this I only knew a little about CNNs, and my overall understanding of machine learning was fairly thin, so I am building the foundations up from scratch and using this blog to record my own learning process. If any experts spot errors in these posts, corrections are very welcome!
Reinforcement learning (RL) has become one of the main methods in machine learning and intelligent control in recent years. Reinforcement learning revolves around three concepts: state, action, and reward.
State is a description of the current situation. For a robot that is learning to walk, the state is the position of its two legs. For a Go program, the state is the position of all the pieces on the board.
An action is something the agent can do in a given state. Given its state (the position of its two legs), a walking robot can take a step of a certain length. Usually an agent can only choose actions from a limited or fixed range; for example, a robot's stride might be restricted to between 0.01 m and 1 m, and a Go program can only place a stone on one of the 361 positions of the 19x19 board.
The reward is an abstract concept describing feedback from the environment. Rewards can be positive or negative: a positive value corresponds to what we usually call a reward, and a negative value corresponds to what we usually call a punishment.
The core problem of reinforcement learning is therefore: given an autonomous agent that can perceive its environment, how can it learn an optimal policy π: S → A, which, for the current state s in the set S, outputs a suitable action a from the set A?
Before studying the algorithms used to find such policies, we must first fully understand the Markov decision process (MDP).
Markov Decision Process (MDP)
For many problems, the Markov decision process gives us a formal framework for reasoning about planning and acting. As we should all know, the Markov chain shares a key property with the MDP: memorylessness (the Markov property), meaning that the next state of the system depends only on the current state and not on any earlier states. This property lays the theoretical foundation for reinforcement learning. The difference is that the MDP also takes actions into account: the next state of the system depends not only on the current state but also on the action currently taken.
So an MDP can be represented as a tuple (S, A, P_sa, R):
S: the set of all possible states.
A: the set of actions the agent can take in each state.
P_sa: the state transition distribution, which, when we take action a in state s, describes the probability distribution over the new state the system moves to.
R: the reward function, a core concept of reinforcement learning that describes the reward an action produces. For example, R^π(s, a) describes the reward of taking action a in state s under policy π; it is also called the immediate reward and can take different forms.
A toy example of such a tuple written out in code is sketched below.
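As a concrete (and entirely hypothetical) illustration, the tuple can be written down directly in code for a tiny two-state MDP; all state names, action names, and numbers below are made up for the example, assuming a tabular representation with Python dictionaries.

# A toy two-state MDP written out as the tuple (S, A, P_sa, R).
# All state names, action names, and numbers are made up for illustration only.
S = ["s0", "s1"]            # set of states
A = ["stay", "move"]        # set of actions

# P_sa: state transition distribution, indexed by (state, action),
# giving the probability of each possible next state.
P = {
    ("s0", "stay"): {"s0": 0.9, "s1": 0.1},
    ("s0", "move"): {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s0": 0.1, "s1": 0.9},
    ("s1", "move"): {"s0": 0.7, "s1": 0.3},
}

# R: reward function, here a simple immediate reward for each (state, action) pair.
R = {
    ("s0", "stay"): 0.0,
    ("s0", "move"): 1.0,
    ("s1", "stay"): 2.0,
    ("s1", "move"): -1.0,
}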
But when selecting the optimal policy, looking only at immediate rewards cannot tell us which policy is better. What we hope for is that, after adopting a policy π, the whole state sequence obtains the largest discounted return:
R(s0, a0) + γ·R(s1, a1) + γ^2·R(s2, a2) + …    (1)
where γ is called the discount factor; its interpretation in economics is the risk-free interest rate, meaning that money now is worth more than money in the future.
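As a quick, hypothetical sanity check of formula (1), the snippet below computes the discounted sum of a finite reward sequence; the reward values and γ are made up.

def discounted_return(rewards, gamma=0.9):
    """Compute R(s0,a0) + gamma*R(s1,a1) + gamma^2*R(s2,a2) + ...
    for a finite sequence of immediate rewards (formula (1), truncated)."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Immediate rewards collected along one trajectory (made-up numbers).
print(discounted_return([1.0, 0.0, 2.0, 5.0], gamma=0.9))  # 1 + 0 + 1.62 + 3.645 = 6.265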
The concepts above therefore constitute a complete description of reinforcement learning: find a policy such that, for the states s0, s1, s2, …, taking the corresponding actions a0, a1, a2, … given by the policy maximizes the expected value of formula (1). This expectation is called the value function V^π: S → R, which indicates the long-term effect of the policy π in the current state. The function starts from state s and acts according to π:
V^π(s) = E_π[ R(s0, a0) + γ·R(s1, a1) + γ^2·R(s2, a2) + … | s0 = s ]
This function is also called the state value function, because the initial state s and the policy π are given by us, and the actions are a = π(s). Corresponding to it is the action value function, also known as the Q function:
Q^π(s, a) = E_π[ R0 + γ·R1 + γ^2·R2 + … | s0 = s, a0 = a ]
Here the initial state and the initial action are both given by us.
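One way to make these expectations concrete is to estimate V^π(s) by averaging the discounted return of many sampled episodes that start in s and follow a fixed policy π. The sketch below reuses the toy (S, A, P, R) tuple defined earlier and a hypothetical deterministic policy; it only illustrates the definition, not an efficient algorithm.

import random

def sample_next_state(P, s, a):
    # Draw the next state s' from the transition distribution P_sa.
    next_states, probs = zip(*P[(s, a)].items())
    return random.choices(next_states, weights=probs)[0]

def estimate_v(policy, s, P, R, gamma=0.9, episodes=1000, horizon=50):
    # Monte Carlo estimate of V^pi(s): average discounted return of
    # truncated episodes started in s that follow the policy pi.
    total = 0.0
    for _ in range(episodes):
        state, ret, discount = s, 0.0, 1.0
        for _ in range(horizon):
            a = policy[state]
            ret += discount * R[(state, a)]
            discount *= gamma
            state = sample_next_state(P, state, a)
        total += ret
    return total / episodes

# Hypothetical deterministic policy pi: S -> A for the toy MDP above.
pi = {"s0": "move", "s1": "stay"}
print(estimate_v(pi, "s0", P, R))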
Function optimization and solution
Going further, we define the optimal value function and the optimal Q function:
V*(s) = max_π V^π(s)
Q*(s, a) = max_π Q^π(s, a)
It is not difficult to prove that V* and V^π satisfy the following two equations:
V*(s) = max_a [ R(s, a) + γ·Σ_{s'} P_sa(s')·V*(s') ]
V^π(s) = R(s, π(s)) + γ·Σ_{s'} P_sπ(s)(s')·V^π(s')
The Bellman equations above give a recursive definition of V* and V^π, and also a way to solve for V*. The V* formula says that the optimal action a is selected at each step, and V* is obtained by continuing to take optimal actions afterwards. The meaning of the V^π equation is that if we keep selecting actions according to the policy π, then the expected return of π is the current reward plus the expected discounted return from then on.
Similarly, the following equation holds for the Q function:
Q*(s, a) = R(s, a) + γ·Σ_{s'} P_sa(s')·max_{a'} Q*(s', a')
This recursive form is also convenient when we actually solve for these functions in an implementation.
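Because of this recursive structure, the Bellman optimality equation can be turned directly into a fixed-point iteration; this is essentially the value iteration mentioned again at the end of this post. The sketch below assumes the toy (S, A, P, R) from above and is only meant to show the shape of the computation.

def value_iteration(S, A, P, R, gamma=0.9, iterations=100):
    # Repeatedly apply V(s) <- max_a [ R(s,a) + gamma * sum_s' P_sa(s') * V(s') ].
    V = {s: 0.0 for s in S}
    for _ in range(iterations):
        V = {
            s: max(
                R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())
                for a in A
            )
            for s in S
        }
    return V

V_star = value_iteration(S, A, P, R)
print(V_star)   # approximate V*(s) for each state of the toy MDP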
Once we know V* and Q*, we can obtain the optimal policy π* from the following formula:
π*(s) = argmax_a Q*(s, a) = argmax_a [ R(s, a) + γ·Σ_{s'} P_sa(s')·V*(s') ]
That is, if we know Q*, we can compute the optimal policy very easily, whereas to obtain the optimal policy from V* we must also know the state transition distribution P_sa.
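Continuing the same toy example, once V* has been computed the greedy policy can be read off with an argmax over actions. Note that this route does need the transition distribution P_sa, exactly as stated above; the function name and setup below are illustrative assumptions.

def greedy_policy(S, A, P, R, V, gamma=0.9):
    # pi*(s) = argmax_a [ R(s,a) + gamma * sum_s' P_sa(s') * V(s') ]
    policy = {}
    for s in S:
        policy[s] = max(
            A,
            key=lambda a: R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items()),
        )
    return policy

print(greedy_policy(S, A, P, R, V_star))   # optimal action in each state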
The specific algorithms for finding the optimal policy include value iteration, policy iteration, Monte Carlo methods, Q-learning, and so on; I will continue to write these up later.