Read and Understand the Reinforcement Learning Behind AlphaGo

Author | Joshua Greaves
Translated by | Liu Chang, Lin Yu 眄

This article distills the most important content of the book "Reinforcement Learning: An Introduction". It aims to introduce the basic concepts and principles of reinforcement learning so that readers can get up to speed with the latest models as quickly as possible. After all, for any machine learning practitioner, RL (reinforcement learning) is a very useful tool, especially in the wake of AlphaGo's success.

In the first part, we will look closely at MDPs (Markov decision processes) and the main components of the reinforcement learning framework. In the second part, we will build up the theory of the value function and the Bellman equation, the most important formula in reinforcement learning, and derive and explain it step by step to take some of the mystery out of reinforcement learning.

Of course, this article only tries to give you the fastest, most intuitive path to the theory behind reinforcement learning. To deepen your understanding of the topic, Sutton and Barto's "Reinforcement Learning: An Introduction" is well worth a careful read. In addition, David Silver's reinforcement learning lectures on YouTube are also worth careful study.

Supervised Learning vs. Evaluative Learning

For many problems of interest, the supervised learning paradigm does not provide the flexibility we need. The main difference between supervised learning and reinforcement learning is whether the feedback received is instructive or evaluative. Instructive feedback tells you how to reach your goal, while evaluative feedback only tells you how well you are doing. Supervised learning solves problems based on instructive feedback; reinforcement learning solves them based on evaluative feedback. Image classification is a practical example of solving a problem with supervised learning and instructive feedback: when the algorithm tries to classify a piece of data, instructive feedback tells it what the true class is. Evaluative feedback, on the other hand, only tells you to what extent you achieved your goal. If you train a classifier with evaluative feedback, your classifier might say "I think this is a hamster" and receive 50 points in return. Without any context, however, we have no idea what those 50 points mean. We would have to make other classifications and explore whether 50 points means we were accurate or not; maybe 10,000 points is a better score, and we simply will not know until we try to classify other data.

Guessing "hamster" earns two gold stars and a smiley face, while guessing "gerbil" earns one silver star and a thumbs-up.

For many of the problems we care about, the idea of evaluative feedback is much more intuitive and much easier to realize. For example, imagine a system that controls the temperature of a data center. Instructive feedback does not seem very useful here: how would you tell your algorithm what the correct setting of every component is at every time step? Evaluative feedback is where it comes into its own. It is easy to know how much electricity was used over a given period, what the average temperature was, or even how many machines overheated. This is, in fact, how Google approached the problem with reinforcement learning. Let's dive straight in.

Markov Decision Processes

A state s has the Markov property if, given s, the future is conditionally independent of the past. This means that s summarizes everything about the past that matters for what comes next. If that is hard to grasp, an example makes it simpler. Suppose a ball is flying through the air. If its state consists of its position and velocity, that is enough to describe where it is now and where it will be next (ignoring physical modelling details and outside influences), so this state has the Markov property. However, if we only knew the ball's position and not its velocity, its state would no longer be Markov: the current state would not summarize all previous states, and we would need information from earlier time steps to build a proper model of the ball.

Reinforcement learning problems can usually be modelled as a Markov decision process, or MDP (Markov Decision Process). An MDP is a directed graph whose nodes and edges describe the transitions between Markov states. Here is a simple example:

A simple Markov decision process

This MDP shows the process of learning about Markov decision processes. You start in the "don't understand" state, and from there you have two possible actions: study or don't study. If you choose not to study, there is a 100% chance of staying in the "don't understand" state. If you choose to study, there is only a 20% chance of ending up back where you started, and an 80% chance of ending up in the "understand" state.
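To make this structure concrete, here is a minimal sketch in Python of how the transition distributions of this toy MDP could be written down. The state and action names are hypothetical labels, and the behaviour of the "understand" state is an assumption, since the figure only spells out the transitions from "don't understand".

    import random

    # Transition model: state -> action -> list of (next_state, probability).
    # The 20%/80% split for "study" comes from the example above; the
    # self-loop in "understand" is an assumption for illustration.
    transitions = {
        "dont_understand": {
            "dont_study": [("dont_understand", 1.0)],
            "study": [("dont_understand", 0.2), ("understand", 0.8)],
        },
        "understand": {
            "study": [("understand", 1.0)],
        },
    }

    def step(state, action):
        """Sample the next state from the transition distribution."""
        next_states, probs = zip(*transitions[state][action])
        return random.choices(next_states, weights=probs)[0]

    print(step("dont_understand", "study"))  # most likely prints "understand"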

In reality, I am sure the transition to the "understand" state is more likely than 80%. The core of an MDP is this: in a state you can take one of a set of actions, and after you take an action there is some distribution over the states you can transition to. This transition is also well defined in the case of the "don't study" action.

The goal of reinforcement learning is to learn how to spend more time in the more valuable states. For there to be more valuable states, our MDP needs to provide more information.

You don't need an MDP to tell you to eat when you're hungry, but a reinforcement learning agent does.

This MDP adds a reward mechanism: every time you transition to a state, you receive a reward. In this example you receive a negative reward for ending up hungry, and an even larger negative reward for ending up starving. If you end up full, you receive a positive reward. Now that our MDP is fully specified, we can start to think about how to act so as to receive the highest possible reward.
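As a sketch of how rewards can be attached to transitions, the eating MDP could be written down like this in Python. The states, probabilities, and reward values here are invented for illustration; they are not taken from the original figure.

    # state -> action -> list of (next_state, probability, reward)
    eating_mdp = {
        "hungry": {
            "eat":      [("full", 1.0, +1.0)],
            "dont_eat": [("hungry", 0.9, -1.0), ("starving", 0.1, -10.0)],
        },
        "full": {
            "dont_eat": [("hungry", 1.0, 0.0)],
        },
        "starving": {
            "eat":      [("full", 1.0, +1.0)],
        },
    }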

Since this MDP is very simple, it is easy to see that the way to stay in the higher-reward region is to eat whenever we are hungry. There is not much else to choose in this model when we are full, but we will inevitably become hungry again and should then immediately choose to eat. The problems reinforcement learning is actually interested in have much larger and more complex Markov decision processes, and we usually do not know the best strategy before we start exploring them.

Formalizing the Reinforcement Learning Problem

Now that we have most of the pieces we need, we should turn to the terminology of reinforcement learning. The most important components are the agent and the environment. The agent is what we control indirectly, and it lives inside the environment. Looking back at our Markov decision model, the agent chooses an action to take in the state it is given, an action that can have a significant effect. However, the agent does not fully control the dynamics of the environment: the environment receives the action and then returns the new state and the reward.

This diagram, from Sutton and Barto's book "Reinforcement Learning: An Introduction" (which is strongly recommended), illustrates the interaction between the agent and the environment. At time step t, the agent is in state s_t and takes action a_t. The environment then returns a new state s_{t+1} and a reward r_{t+1}. The reward carries the subscript t+1 because it is returned by the environment together with the state s_{t+1}, so it makes sense to keep the two consistent (as shown in the figure above).
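The interaction loop in that diagram can be sketched in a few lines of Python. ToyEnvironment and its reset/step methods are made-up stand-ins for this article, not a real library API.

    import random

    class ToyEnvironment:
        """A made-up two-state environment, used only to illustrate the loop."""

        def reset(self):
            self.state = "hungry"
            return self.state

        def step(self, action):
            # The environment receives a_t, then returns s_{t+1} and r_{t+1} together.
            if action == "eat":
                self.state, reward = "full", 1.0
            else:
                self.state, reward = "hungry", -1.0
            return self.state, reward

    env = ToyEnvironment()
    state = env.reset()
    for t in range(5):
        action = random.choice(["eat", "dont_eat"])  # the agent picks a_t given s_t
        state, reward = env.step(action)             # the environment answers with s_{t+1}, r_{t+1}
        print(t, action, state, reward)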

We now have a framework for the reinforcement learning problem, and we are ready to learn how to maximize the reward. In the next section we will study the state value function and the action value function, as well as the Bellman equation on which reinforcement learning algorithms are built, and then explore some simple and effective dynamic programming solutions.

Rewards and Returns

As mentioned earlier, a reinforcement learning agent learns how to maximize its cumulative future reward. The term for this cumulative future reward is the return, usually denoted R; we use a subscript t for the return at a particular time step. Written as a formula, it looks like this:
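The formula referred to here is the standard definition of the return from Sutton and Barto, reconstructed below in the notation of this article (rewards r, return R):

    R_t = r_{t+1} + r_{t+2} + r_{t+3} + ...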

If we let this series extend to infinity, the return may be infinite, which makes the problem ill-defined. Therefore, this equation only makes sense if we expect the series of rewards to end. Tasks that always terminate are called episodic tasks. Card games are a good example of an episodic problem: an episode starts with dealing cards to every player and inevitably ends according to the rules of the game. Then the next episode begins, and the cards are dealt again.

Rather than the plain cumulative future reward, it is more common to use the discounted cumulative future reward:
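A reconstruction of the discounted return, which is the standard definition; γ (gamma) is the discount factor:

    R_t = r_{t+1} + γ·r_{t+2} + γ²·r_{t+3} + ...  =  Σ_{k=0}^{∞} γ^k · r_{t+k+1}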

Here 0 < γ < 1. Defining the return this way has two benefits: the return is still well defined even for an infinite series, and later rewards are weighted less, meaning we care more about rewards that are coming up soon than about rewards we will only receive far in the future. The smaller the value of γ, the more pronounced this is. In the special cases we can set γ equal to 0 or 1. When γ equals 1 we are back at the first equation, where we care about all rewards equally, no matter how far in the future they occur. When γ equals 0, on the other hand, we care only about the immediate reward and ignore every reward after it. This leads to an algorithm with no long-term vision: it learns to take the action that is best for the current situation without considering that action's effect on the future.

Policies

A policy, written π(s, a), describes a way of acting. It is a function that takes a state and an action and returns the probability of taking that action in that state. Therefore, for a given state, the probabilities over all of its actions must sum to one: Σ_a π(s, a) = 1. In the example below, when we are hungry we can choose between two actions, eating or not eating.

Our policy should describe how to act in every state, so an equal-probability random policy would look like π(hungry, E) = 0.5 and π(hungry, Ē) = 0.5, where E stands for the action of eating and Ē for the action of not eating. It means that if you are hungry, you choose to eat or not to eat with equal probability.
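As a sketch, such a policy can be represented as a simple table of probabilities in Python; the state and action names are hypothetical labels.

    # π(s, a): probability of taking action a in state s
    policy = {
        ("hungry", "eat"):      0.5,
        ("hungry", "dont_eat"): 0.5,
    }

    # For any state, the probabilities over its actions must sum to 1.
    assert sum(p for (s, _), p in policy.items() if s == "hungry") == 1.0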

The goal of reinforcement learning is to learn an optimal policy, π*, which tells us how to act in every state so as to maximize the return. Ours is just a simple example in which it is easy to see that the optimal policy is to eat whenever you are hungry. In this example, as in many MDPs (Markov decision processes), the optimal policy is deterministic: every state has one best action. Sometimes this is written as π*(s) = a, a mapping from states to the optimal action in those states.

Value Functions

To learn the optimal policy we make use of value functions. There are two types of value function in reinforcement learning: the state value function, written V(s), and the action value function, written Q(s, a).

The state value function describes the value of a state when following a policy. It is the expected return when starting from state s and acting according to our policy π:
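A reconstruction of the missing formula, the standard definition of the state value function; this is the equation the derivation below refers back to as equation (1):

    V^π(s) = E_π[ R_t | s_t = s ]        (1)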

It is worth noting that even for the same environment, the value function changes depending on the policy. This is because the value of a state depends on how you act, since the way you act in a particular state affects the return you can expect to see. Also note the importance of the expectation (an expectation is like an average: it is the return you expect to see). The reason we use an expectation is that there is randomness in what happens once you reach a state. You may have a stochastic policy, which means we need to combine the results of all the different actions we might take. Likewise, the transition function can be stochastic, meaning we may not end up in any particular state with 100% probability. Remember the example above: when you choose an action, the environment returns the next state, and there may be several states it can return even for a single action. We will see more of this in the Bellman equation. The expectation takes all of this randomness into account.

The other value function we will use is the action value function. The action value function is the value of taking a particular action in a particular state while following a policy. It is the expected return given the state and the action, under the policy π:
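A reconstruction of the missing formula, the standard definition of the action value function; the appendix refers to it as equation (2):

    Q^π(s, a) = E_π[ R_t | s_t = s, a_t = a ]        (2)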

The notes on the state value function apply to the action value function as well: the expectation takes into account the randomness of future actions as well as the randomness of the next states returned by the environment.

The Bellman Equation

Richard Bellman was an American applied mathematician who derived the equations that allow us to start solving these MDPs (Markov decision processes). The Bellman equation is ubiquitous in reinforcement learning, and it is essential for understanding how reinforcement learning algorithms work. Before we get to the Bellman equation, though, we need a couple of useful pieces of notation. We define P and R as follows:
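P is the transition probability: the probability of landing in state s' when we start in state s and take action a. A reconstruction of the standard definition:

    P^a_{ss'} = Pr( s_{t+1} = s' | s_t = s, a_t = a )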

R is the other definition: it is the expected (or mean) reward we receive when we start in state s, take action a, and move to state s':
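A reconstruction of the standard definition:

    R^a_{ss'} = E[ r_{t+1} | s_t = s, a_t = a, s_{t+1} = s' ]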

Finally, with this knowledge in hand, we are ready to derive the Bellman equation. We will consider the Bellman equation for the state value function. Using the definition of the return, we can rewrite equation (1) as follows:
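A reconstruction of this step, expanding the return inside the expectation of equation (1):

    V^π(s) = E_π[ r_{t+1} + γ·r_{t+2} + γ²·r_{t+3} + ... | s_t = s ]
           = E_π[ Σ_{k=0}^{∞} γ^k · r_{t+k+1} | s_t = s ]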

If we pull the first reward out of the sum, the formula can be rewritten like this:
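A reconstruction of this step, with the first reward separated from the rest of the discounted sum:

    V^π(s) = E_π[ r_{t+1} + γ · Σ_{k=0}^{∞} γ^k · r_{t+k+2} | s_t = s ]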

The expectation here describes what we expect the return to be if we continue from state s following policy π. The expectation can be written out explicitly by summing over all possible actions and all possible next states. The next two equations help us take that step.
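Reconstructions of those two expectations, written as sums over actions a and next states s' using the P and R defined above:

    E_π[ r_{t+1} | s_t = s ] = Σ_a π(s, a) · Σ_{s'} P^a_{ss'} · R^a_{ss'}

    E_π[ γ · Σ_{k=0}^{∞} γ^k · r_{t+k+2} | s_t = s ]
        = Σ_a π(s, a) · Σ_{s'} P^a_{ss'} · γ · E_π[ Σ_{k=0}^{∞} γ^k · r_{t+k+2} | s_{t+1} = s' ]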

By distributing the expectation over these two parts, we can transform our equation into the following form:
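A reconstruction of the combined equation:

    V^π(s) = Σ_a π(s, a) · Σ_{s'} P^a_{ss'} · [ R^a_{ss'} + γ · E_π[ Σ_{k=0}^{∞} γ^k · r_{t+k+2} | s_{t+1} = s' ] ]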

Note that equation (1) has exactly the same form as the final part of this equation, so we can substitute it in and obtain the following:
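The inner expectation is the state value function of equation (1) evaluated at s', which gives the Bellman equation for V^π:

    V^π(s) = Σ_a π(s, a) · Σ_{s'} P^a_{ss'} · [ R^a_{ss'} + γ · V^π(s') ]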

The Bellman equation for the action value function can be derived in a similar way; those interested can find the specific steps at the end of the article. The final result is as follows:
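A reconstruction of that result, the standard Bellman equation for the action value function:

    Q^π(s, a) = Σ_{s'} P^a_{ss'} · [ R^a_{ss'} + γ · Σ_{a'} π(s', a') · Q^π(s', a') ]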

The importance of the Bellman equation is that it lets us express the value of one state in terms of the values of other states. This means that if we know the value of state s_{t+1}, we can easily compute the value of state s_t. It opens the door to iterative approaches that compute the value of every state, because if we know the value of the next state, we know the value of the current state. Here it is worth keeping the equation numbering in mind. Finally, with the Bellman equation in hand, we can begin to work out how to compute the optimal policy and write our first reinforcement learning agent.
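To make the iterative idea concrete, here is a minimal policy-evaluation sketch in Python that repeatedly applies the Bellman equation as an update rule until the values settle. The MDP, policy, and numbers are a hypothetical eating example invented for illustration, not anything from the original article.

    # state -> action -> list of (next_state, probability, reward): a made-up MDP
    mdp = {
        "hungry": {
            "eat":      [("full", 1.0, +1.0)],
            "dont_eat": [("hungry", 1.0, -1.0)],
        },
        "full": {
            "dont_eat": [("hungry", 1.0, 0.0)],
        },
    }
    policy = {                        # π(s, a)
        "hungry": {"eat": 0.5, "dont_eat": 0.5},
        "full":   {"dont_eat": 1.0},
    }
    gamma = 0.9
    V = {s: 0.0 for s in mdp}         # initialize V^π(s) arbitrarily

    # Repeatedly apply the Bellman equation for V^π as an update rule.
    for _ in range(100):
        for s in mdp:
            V[s] = sum(
                policy[s][a] * sum(p * (r + gamma * V[s2]) for s2, p, r in outcomes)
                for a, outcomes in mdp[s].items()
            )

    print(V)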

Next: Dynamic Programming

In the next article we will look at using dynamic programming to compute optimal policies, which lays the groundwork for more advanced algorithms. It will also be our first opportunity to actually write a reinforcement learning algorithm: we will look at policy iteration and value iteration and their respective strengths and weaknesses. Until then, thank you for reading.

As promised: deriving the Bellman equation for the action value function

Just as we did when deriving the Bellman equation for the state value function, we follow the same series of steps to derive a chain of equations, this time starting from equation (2):
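A reconstruction of that chain of equations, following the same steps as before: expand the return, pull out the first reward, write the expectation as sums over next states s' (and next actions a'), and recognize equation (2) in the tail:

    Q^π(s, a) = E_π[ R_t | s_t = s, a_t = a ]
              = E_π[ r_{t+1} + γ · Σ_{k=0}^{∞} γ^k · r_{t+k+2} | s_t = s, a_t = a ]
              = Σ_{s'} P^a_{ss'} · [ R^a_{ss'} + γ · E_π[ Σ_{k=0}^{∞} γ^k · r_{t+k+2} | s_{t+1} = s' ] ]
              = Σ_{s'} P^a_{ss'} · [ R^a_{ss'} + γ · Σ_{a'} π(s', a') · Q^π(s', a') ]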

Related links:
Reinforcement Learning: An Introduction
http://incompleteideas.net/sutton/book/the-book-2nd.html
RL Course by David Silver
https://www.youtube.com/watch?v=2pWv7GOvuf0
RL Course by David Silver (slides)
http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html
Everything You Need to Know to Get Started in Reinforcement Learning
https://joshgreaves.com/reinforcement-learning/introduction-to-reinforcement-learning/
Understanding RL: The Bellman Equations
https://joshgreaves.com/reinforcement-learning/understanding-rl-the-bellman-equations/
