Reinforcement Learning Notes -- Introducing Reinforcement Learning


As we all know, when AlphaGo defeated the world Go champion Lee Sedol, the whole industry was excited, and more and more researchers realized that reinforcement learning is a very exciting direction in the field of artificial intelligence. Here I will share my reinforcement learning study notes.

The basic concepts of reinforcement learning

Machine learning can be divided into three categories: supervised learning, unsupervised learning and reinforcement learning. The difference between reinforcement learning and the other kinds of machine learning is that there is no teacher signal and no label; there is only a reward, and the reward is in effect the equivalent of a label. Feedback is delayed and cannot be returned immediately. The input data is, in effect, sequence data.

The actions performed by the agent affect subsequent data.

The key elements of reinforcement learning are: environment, reward, action and state. With these elements, we can build a reinforcement learning model. The problem reinforcement learning solves is to obtain, for a specific task, an optimal policy under which the reward obtained is maximal. The so-called policy is in fact a series of actions, that is, sequential data.
Reinforcement learning can be depicted as the following interaction loop: from the task to be completed we extract an environment and abstract the state, the action, and the instantaneous reward received for performing the action.
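As a rough sketch of this loop (assuming the classic OpenAI Gym interface, with CartPole used as a stand-in environment and a purely random agent), the interaction looks like this:

import gym

env = gym.make("CartPole-v1")    # the environment; any Gym environment works here
state = env.reset()              # initial state
done = False
total_reward = 0.0
while not done:
    action = env.action_space.sample()             # the agent picks an action (random here)
    state, reward, done, info = env.step(action)   # the environment returns the next state and a scalar reward
    total_reward += reward                         # accumulate the reward signal
print(total_reward)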
Reward

The reward is usually denoted R_t, the reward received at time step t. All of reinforcement learning is based on the reward hypothesis. The reward is a scalar.

Action

The action comes from the action space. At each step the agent decides which action to perform based on the current state and the reward of the previous step, and executes actions so as to maximize the expected reward until the algorithm finally converges; the resulting policy is a sequence of actions, i.e. sequential data.

State

Refers to the state the agent is currently in. Concretely, in the Atari game Pong the state is the position of the ball at the current time step; in Flappy Bird it is the current position of the bird on the screen.

Policy

Policy is the agent's behavior, i.e. the mapping from state to action. Policies are divided into deterministic policies and stochastic policies: a deterministic policy gives a definite action in a given state, a = \pi(s), while a stochastic policy is described by a probability, namely the probability of performing a given action in a given state: \pi(a|s) = P[A_t = a \mid S_t = s].
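For instance (a minimal sketch over a hypothetical two-state, two-action problem), a deterministic policy can be stored as a direct state-to-action mapping, while a stochastic policy stores a probability for each action and is sampled from:

import random

# Deterministic policy: a = pi(s), one fixed action per state
deterministic_policy = {"s0": "left", "s1": "right"}

# Stochastic policy: pi(a|s) = P[A_t = a | S_t = s]
stochastic_policy = {
    "s0": {"left": 0.8, "right": 0.2},
    "s1": {"left": 0.3, "right": 0.7},
}

def sample_action(policy, state):
    """Draw an action according to pi(.|state)."""
    actions, probs = zip(*policy[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(deterministic_policy["s0"])               # always "left"
print(sample_action(stochastic_policy, "s0"))   # "left" with probability 0.8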

Value function

Reinforcement learning can be summarized as maximizing reward in order to obtain an optimal policy. But if only the instantaneous reward were maximized, then at every step we would simply pick the action in the action space with the largest immediate reward, which is just the simplest greedy policy. To describe the goal properly, the quantity to maximize must also include future rewards (that is, the total reward from the current moment until the goal state is reached should be maximal); the instantaneous reward alone is not enough. The value function describes this quantity. The expression is as follows:
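In a common form (using the discount factor \gamma explained below):

v(s) = \mathbb{E}\left[ R_{t+1} + \gamma R_{t+2} + \gamma^{2} R_{t+3} + \cdots \mid S_t = s \right]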

\gamma is the discount factor (with value in [0,1]); it reduces the impact of future rewards on the current action. Then, by selecting an appropriate policy, the value function is made as large as possible. Later we will derive the famous Bellman equation; the Bellman equation underlies the major algorithms of reinforcement learning (e.g. value iteration, policy iteration, Q-learning).

Model

The model is used to predict what the environment will do next: in a given state, what the next state will be after an action is taken, and what reward the action will receive. So a model is described by the state transition probability and the state-action reward. The specific formulas are as follows:
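In standard MDP notation (s and s' denote states, a an action):

\mathcal{P}_{ss'}^{a} = P[S_{t+1} = s' \mid S_t = s, A_t = a]

\mathcal{R}_{s}^{a} = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]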
Markov decision process (MDP)

Everyone should be very familiar with the Markov process: it describes transitions from one state to another, and the most important quantity is the one-step transition probability matrix; this one-step transition probability matrix alone is enough to characterize the entire Markov process.
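As a small illustration (a made-up three-state chain), the entry P[i][j] of the one-step transition probability matrix is the probability of moving from state i to state j, and the n-step transition probabilities are simply the matrix power P^n:

import numpy as np

# One-step transition probability matrix of a made-up 3-state Markov chain;
# each row sums to 1.
P = np.array([
    [0.9, 0.1, 0.0],
    [0.2, 0.6, 0.2],
    [0.0, 0.3, 0.7],
])

# n-step transition probabilities are the matrix power P^n
P3 = np.linalg.matrix_power(P, 3)
print(P3[0])   # distribution over states after 3 steps, starting from state 0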

Here is a description of the Markov decision process (MDP). It is mainly characterized by the following variables: the state space S (a finite set of states), the action space A (a finite set of actions), the state transition probability matrix P, the reward function R, and the discount factor \gamma (\gamma \in [0,1]).
The following describes the functions that an MDP uses to characterize rewards.
1. Return G_t
The return is the reward that can be obtained from the actions taken after time step t, i.e. the sum of the rewards at times t+1, t+2, ... (the rewards of future moments are reflected at the present moment), with each later reward multiplied by a power of the discount factor \gamma. The expression is as follows:
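In standard notation:

G_t = R_{t+1} + \gamma R_{t+2} + \gamma^{2} R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}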

2. State value function v(s)
Defined as the expected return when the state at time t is s. The expression is as follows:
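In standard notation:

v(s) = \mathbb{E}[G_t \mid S_t = s]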

3. Action value function q_\pi(s,a)
Defined as the expected return when, in state s at time t, a specific action a is selected. The expression is as follows:
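In standard notation:

q_\pi(s,a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]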

To explain the derivation of the famous Bellman equation, we first derive how to iterate the value function, that is, how to update the value function:
1. Value function

v(s) = \mathbb{E}[G_t \mid S_t = s]
     = \mathbb{E}[R_{t+1} + \gamma (R_{t+2} + \gamma R_{t+3} + \cdots) \mid S_t = s]
     = \mathbb{E}[R_{t+1} + \gamma G_{t+1} \mid S_t = s]
     = \mathbb{E}[R_{t+1} + \gamma v(S_{t+1}) \mid S_t = s]
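To make the iteration concrete, here is a minimal value-iteration sketch in Python over a small MDP; the transition table, rewards and discount factor below are made up purely for illustration:

import numpy as np

# Made-up MDP with 3 states and 2 actions.
# P[s][a] is a list of (probability, next_state, reward) triples.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(0.9, 2, 2.0), (0.1, 1, 0.0)]},
    2: {0: [(1.0, 2, 0.0)], 1: [(1.0, 2, 0.0)]},
}
gamma = 0.9          # discount factor
V = np.zeros(3)      # initial value function

# Value iteration: repeatedly apply the Bellman backup until the values stop changing.
for _ in range(1000):
    V_new = np.zeros_like(V)
    for s in P:
        # expected value of taking each action a in state s, then following V
        q = [sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]) for a in P[s]]
        V_new[s] = max(q)
    if np.max(np.abs(V_new - V)) < 1e-8:   # convergence check
        V = V_new
        break
    V = V_new

print(V)   # approximately optimal state values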
