A Brief Introduction to Deep Q-Networks


Original address: https://www.nervanasys.com/demystifying-deep-reinforcement-learning/

About the Author:

Tambet Matiisen is a PhD student at the University of Tartu, Estonia. After working in industry for a while and founding his own SaaS startup, he decided to join academia again. He hates programming and is interested in making machines learn the same way humans do. He shares his life with a dog-obsessed wife and two out-of-hand kids. At less busy moments he enjoys obscure flashbacks from the 90s, like old-skool breakbeat or MSX home computers.

Reinforcement Learning

Reinforcement learning sits between supervised and unsupervised learning. In supervised learning every training sample has a target label, and in unsupervised learning there are no labels at all; in reinforcement learning we have sparse and time-delayed labels: the rewards. Based only on these rewards, the agent has to learn to behave in the environment.

Figure 1: Atari Breakout game. Image credit: DeepMind.

Although the idea is straightforward, there are many challenges in practice. For example, when you hit a brick and score a reward in the Breakout game, the reward often has nothing to do with the actions (paddle movements) you performed just before receiving it. All the hard work was already done when you positioned the paddle correctly and bounced the ball back. This is the so-called credit assignment problem: which of the preceding actions was responsible for the reward, and to what extent?

Once you have figured out a strategy that collects a certain amount of reward, should you stick with it or experiment with something that might lead to even greater rewards? In the Breakout game above, a simple strategy is to move to the left edge and wait there. When launched, the ball tends to fly left more often than right, and you can easily score about 10 points before you die. Will you be satisfied with this, or do you want more? This is called the explore-exploit dilemma: should you exploit the known working strategy, or explore other, possibly better strategies?

Reinforcement learning is an important model of how we learn. Praise from our parents, grades in school, salary at work: these are all examples of rewards. Credit assignment problems and explore-exploit dilemmas come up every day, both in business and in relationships. That is why it is important to study this problem, and games form a wonderful sandbox for trying out new approaches.

Markov Decision Process

  

The question now is: how do we formalize a reinforcement learning problem so that we can reason about it? The most common method is to represent it as a Markov decision process.

Suppose you are an agent situated in an environment (e.g., the Breakout game). The environment is in a certain state (e.g., the location of the paddle, the location and direction of the ball, the existence of every brick, and so on). The agent can perform certain actions in the environment (e.g., move the paddle to the left or right). These actions sometimes result in a reward (e.g., an increase in score). Actions transform the environment and lead to a new state, where the agent can perform another action, and so on. The rules for how you choose those actions are called the policy. The environment in general is stochastic, which means the next state may be somewhat random (for example, when you lose a ball and launch a new one, it goes in a random direction).

Figure 2: Left: reinforcement learning problem. Right: Markov decision process.

The set of states and actions, together with the rules for transitioning from one state to another, make up a Markov decision process. One episode of this process (e.g., one game) forms a finite sequence of states, actions, and rewards.
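
Using the notation explained in the next paragraph, such an episode can be written as the sequence

$$s_0, a_0, r_1, s_1, a_1, r_2, s_2, \ldots, s_{n-1}, a_{n-1}, r_n, s_n$$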

Here s_i is the state, a_i is the action, and r_{i+1} is the reward received after performing that action. The episode ends with the terminal state s_n (e.g., the "game over" screen). A Markov decision process relies on the Markov assumption: the probability of the next state s_{i+1} depends only on the current state s_i and action a_i, and not on any preceding states or actions.

  

Discounted Future Reward  

To perform well in the long term, we need to take into account not only the immediate rewards but also the future rewards we are going to get. How should we do that? Given one run of the Markov decision process, we can calculate the total reward for one episode.
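
With the rewards numbered as in the episode sequence above, the total reward is simply their sum:

$$R = r_1 + r_2 + r_3 + \cdots + r_n$$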

Similarly, the total future reward from time step t onward can be expressed as follows:
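
$$R_t = r_t + r_{t+1} + r_{t+2} + \cdots + r_n$$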

But because our environment is stochastic, we can never be sure whether we will get the same rewards the next time we perform the same actions. The further into the future we go, the more the rewards may diverge. For that reason it is common to use the discounted future reward instead:
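
$$R_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots + \gamma^{n-t} r_n$$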

Here γ is the discount factor between 0 and 1: the further a reward is in the future, the less we take it into consideration. It is easy to see that the discounted future reward at time step t can be expressed in terms of the discounted future reward at time step t+1:
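
$$R_t = r_t + \gamma \left( r_{t+1} + \gamma \left( r_{t+2} + \cdots \right) \right) = r_t + \gamma R_{t+1}$$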

If we set the discount factor to γ=0, our strategy will be short-sighted and rely only on immediate rewards. If we want to balance immediate and future rewards, we should set the discount factor to something like γ=0.9. If our environment is deterministic and the same actions always result in the same rewards, then we can set the discount factor to γ=1.

A good strategy for an agent is to always choose the action that maximizes the (discounted) future reward.

Q-learning

In Q-learning we define a function Q(s, a) representing the maximum discounted future reward when we perform action a in state s and continue optimally from that point on.
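
With R_{t+1} denoting the discounted future reward accumulated from the next time step onward, this definition can be written as

$$Q(s_t, a_t) = \max R_{t+1}$$

where the maximum is taken over all possible ways of continuing the game after performing the action.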

The way to think about Q(s, a) is that it is "the best possible score at the end of the game after performing action a in state s." It is called the Q-function because it represents the "quality" of an action in a given state.

This may sound like a puzzling definition. How can we estimate the score at the end of the game if we know only the current state and action, and not the states and actions that come after? We really can't. But as a theoretical construct we can assume such a function exists. Close your eyes and repeat to yourself five times: "Q(s, a) exists, Q(s, a) exists, ...". Feel it?

If you are still not convinced, consider what it would mean to have such a function. Suppose you are in some state and wondering whether you should take action a or action b. You want to select the action that leads to the highest score at the end of the game. Once you have the magical Q-function, the answer becomes really simple: pick the action with the highest Q-value!
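
Written as a formula, this rule for choosing the action is

$$\pi(s) = \arg\max_a Q(s, a)$$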

Here π represents the policy: the rule for how we choose an action in each state.

OK, but how do we get that Q-function? Let's focus on just one transition <s, a, r, s'>. Just like with the discounted future rewards in the previous section, we can express the Q-value of state s and action a in terms of the Q-value of the next state s':
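
$$Q(s, a) = r + \gamma \max_{a'} Q(s', a')$$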

This is called the Bellman equation. If you think about it, it is quite logical: the maximum future reward for this state and action is the immediate reward plus the maximum future reward for the next state.

The main idea in Q-learning is that we can iteratively approximate the Q-function using the Bellman equation. In the simplest case the Q-function is implemented as a table, with states as rows and actions as columns.

The key points of the Q-learning algorithm are as follows:
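
A minimal sketch of the tabular algorithm in Python, assuming a Gym-style environment interface (reset()/step()) and a fixed list of actions; the function names and hyperparameter values here are illustrative, not taken from the original pseudocode:

```python
import random
from collections import defaultdict

def q_learning(env, actions, num_episodes, alpha=0.1, gamma=0.9):
    Q = defaultdict(float)  # Q[(s, a)], unseen state-action pairs start at 0
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            # Select and carry out an action (here simply at random;
            # better exploration strategies are discussed further below).
            a = random.choice(actions)
            s_next, r, done, _ = env.step(a)

            # Update the table entry towards the Bellman target:
            # Q[s,a] <- Q[s,a] + alpha * (r + gamma * max_a' Q[s',a'] - Q[s,a])
            best_next = 0.0 if done else max(Q[(s_next, a2)] for a2 in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q
```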

Here α is a learning rate that controls how much of the difference between the previous Q-value and the newly proposed Q-value is taken into account.

In particular, when α=1, the two Q[s, a] terms cancel and the update is exactly the Bellman equation.

The max Q(s', a') that we use to update Q(s, a) is only an approximation, and in the early stages of learning it may be completely wrong. However, the approximation gets more and more accurate with every iteration, and it has been shown that if we perform this update enough times, the Q-function converges and represents the true Q-values.

Deep Q Network

The state of the environment in the Breakout game can be defined by a few things: the location of the paddle, the location and direction of the ball, and the presence or absence of every brick. This intuitive representation, however, is specific to the game. Could we come up with something more universal that would be suitable for all games? The obvious choice is screen pixels: they implicitly contain all the relevant information about the game situation, except for the speed and direction of the ball. Those could be covered by considering two consecutive screens.

If we apply the same preprocessing to the game screens as in the DeepMind paper, that is, take the four last screen images, resize them to 84x84, and convert them to grayscale with 256 gray levels, we would have 256^(84x84x4) ≈ 10^67970 possible game states. That means more rows in our imaginary Q-table than there are atoms in the known universe, which is clearly impossible to store. One might argue that many pixel combinations (and therefore states) never occur, so we could perhaps use a sparse table containing only the visited states. Even so, most states are visited very rarely and it would take a lifetime of the universe for the Q-table to converge. Ideally, we would also like to have a good guess for the Q-values of states we have never seen before.

This is where deep learning comes in. Neural networks are exceptionally good at coming up with good features for highly structured data. We could represent our Q-function with a neural network that takes the state (four game screens) and an action as input and outputs the corresponding Q-value. Alternatively, we could take only the game screens as input and output the Q-value for each possible action. The advantage of the latter approach is that if we want to perform a Q-value update or pick the action with the highest Q-value, we only have to do one forward pass through the network and obtain the Q-values of all actions immediately.

Figure 3: Left: naive formulation of the deep Q-network. Right: more optimized architecture of the deep Q-network, used in the DeepMind paper.

The network architecture that DeepMind used is as follows:

This is a classical convolutional neural network with three convolutional layers followed by two fully connected layers. People familiar with object recognition networks may notice that there are no pooling layers. But if you really think about it, pooling layers buy you translation invariance: the network becomes insensitive to the location of an object in the image. That makes perfect sense for a classification task like ImageNet, but in a game the location of the ball is crucial in determining the potential reward, and we wouldn't want to discard that information!

The input to the network is four 84x84 grayscale game screens. The output of the network is the Q-value for each possible action (18 in Atari). Q-values can be any real values, which makes this a regression task that can be optimized with a simple squared error loss.
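
A sketch of such a network in PyTorch; the filter sizes, strides, and layer widths below are the commonly cited values from the DeepMind paper and should be treated as assumptions rather than a definitive reproduction of the architecture table:

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    """Three convolutional layers followed by two fully connected layers,
    mapping four stacked 84x84 grayscale screens to one Q-value per action."""
    def __init__(self, num_actions=18):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),   # 4x84x84 -> 32x20x20
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),  # -> 64x9x9
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),  # -> 64x7x7
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, num_actions),  # linear output: one Q-value per action
        )

    def forward(self, screens):          # screens: [batch, 4, 84, 84]
        return self.net(screens)         # returns: [batch, num_actions]

# One forward pass yields the Q-values of all actions at once.
q_values = DQN()(torch.zeros(1, 4, 84, 84))
```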

Given a transition <s, a, r, s'>, the Q-table update rule from the previous algorithm must be replaced with the following procedure (sketched in code after the list):

1. Do a feedforward pass for the current state s to get the predicted Q-values for all actions.

2. Do a feedforward pass for the next state s' and calculate the maximum over the network outputs, max_a' Q(s', a').

3. Set the Q-value target for action a to r + γ max_a' Q(s', a'), using the maximum calculated in step 2. For all other actions, set the Q-value target to the value returned in step 1, making the error 0 for those outputs.

4. Update the weights using backpropagation.
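
A minimal sketch of these four steps for a single transition, assuming q_network is the DQN defined above and optimizer is any torch optimizer over its parameters (terminal states, which would use just r as the target, are omitted for brevity):

```python
import torch
import torch.nn.functional as F

def dqn_update(q_network, optimizer, s, a, r, s_next, gamma=0.99):
    # 1. Feedforward pass for the current state s: Q-values for all actions.
    q_pred = q_network(s)                                      # shape: [1, num_actions]

    # 2. Feedforward pass for the next state s' and take max_a' Q(s', a').
    with torch.no_grad():
        max_q_next = q_network(s_next).max(dim=1).values[0]

    # 3. Targets equal the predictions for every action except the one taken,
    #    so the error is zero for all other outputs.
    q_target = q_pred.detach().clone()
    q_target[0, a] = r + gamma * max_q_next

    # 4. Squared error loss and backpropagation to update the weights.
    loss = F.mse_loss(q_pred, q_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```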

Experience Replay

By now we have an idea of how to estimate future rewards in each state with Q-learning and how to approximate the Q-function with a convolutional neural network. But it turns out that approximating Q-values with nonlinear functions is not very stable. There is a whole bag of tricks you have to use to actually make it converge, and it takes a long time, almost a week on a single GPU.

The most important trick is experience replay. During gameplay, all experiences <s, a, r, s'> are stored in a replay memory. When training the network, random minibatches from the replay memory are used instead of the most recent transition. This breaks the similarity of subsequent training samples, which otherwise might drive the network into a local minimum. Experience replay also makes the training task more similar to the usual supervised learning, which simplifies debugging and testing of the algorithm. One could actually collect all these experiences from human gameplay and then train the network on them.
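
A minimal sketch of a replay memory, assuming transitions are stored as (s, a, r, s_next, done) tuples; the class and parameter names here are illustrative, not taken from the DeepMind code:

```python
import random
from collections import deque

class ReplayMemory:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest experiences are dropped first

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        # Random minibatches break the correlation between consecutive transitions.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```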

Exploration-exploitation

Notice that Q-learning attempts to solve the credit assignment problem: it propagates rewards back in time until they reach the crucial decision point that was the actual cause of the obtained reward. But we haven't dealt with the exploration-exploitation dilemma yet.

The first observation is that when a Q-table or Q-network is initialized randomly, its predictions are initially random as well. If we pick the action with the highest Q-value, that action will be random, and the agent performs crude "exploration." As the Q-function converges, it returns more consistent Q-values and the amount of exploration decreases. So one could say that Q-learning incorporates exploration as part of the algorithm. But this exploration is "greedy": it settles on the first effective strategy it finds. A simple and effective fix for this problem is ε-greedy exploration: with probability ε choose a random action, otherwise go with the greedy action that has the highest Q-value. In their system DeepMind actually decreases ε over time from 1 to 0.1: in the beginning the system makes completely random moves to explore the state space maximally, and then it settles down to a fixed exploration rate.
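
A minimal sketch of ε-greedy action selection, assuming q_values is a list with one Q-value per action; the annealing schedule and step count below are illustrative assumptions:

```python
import random

def epsilon_greedy(q_values, epsilon):
    if random.random() < epsilon:
        return random.randrange(len(q_values))                       # explore: random action
    return max(range(len(q_values)), key=lambda a: q_values[a])      # exploit: best action

def epsilon_at(step, start=1.0, end=0.1, anneal_steps=1_000_000):
    # Linearly decrease epsilon from `start` to `end`, then keep it fixed.
    fraction = min(step / anneal_steps, 1.0)
    return start + fraction * (end - start)
```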

Deep Q-learning Algorithm

Putting everything above together gives us the final deep Q-learning algorithm with experience replay, sketched below.
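
A minimal sketch of the full loop, reusing the DQN, ReplayMemory, epsilon_at, and epsilon_greedy pieces above. The environment interface (reset()/step()) is assumed to be Gym-like, to_tensor() is an assumed preprocessing helper that stacks four screens into a [1, 4, 84, 84] tensor, and all hyperparameter values are illustrative:

```python
import torch
import torch.nn.functional as F

def deep_q_learning(env, q_network, optimizer, num_steps, gamma=0.99, batch_size=32):
    memory = ReplayMemory()
    s = env.reset()
    for step in range(num_steps):
        # Select an action: random with probability epsilon, greedy otherwise.
        with torch.no_grad():
            q_values = q_network(to_tensor(s))[0].tolist()
        a = epsilon_greedy(q_values, epsilon_at(step))

        # Carry out the action, observe reward and new state, store the experience.
        s_next, r, done, info = env.step(a)
        memory.add(s, a, r, s_next, done)
        s = env.reset() if done else s_next

        # Train on a random minibatch of stored transitions.
        if len(memory) >= batch_size:
            losses = []
            for ss, aa, rr, ss_next, dd in memory.sample(batch_size):
                pred = q_network(to_tensor(ss))[0, aa]
                with torch.no_grad():
                    # Terminal transitions bootstrap nothing: the target is just the reward.
                    bootstrap = 0.0 if dd else q_network(to_tensor(ss_next)).max()
                    target = torch.as_tensor(rr + gamma * bootstrap, dtype=torch.float32)
                losses.append(F.mse_loss(pred, target))
            optimizer.zero_grad()
            torch.stack(losses).mean().backward()
            optimizer.step()
```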

DeepMind actually used several more tricks, such as a target network, error clipping, reward clipping, and so on, but these are beyond the scope of this introduction. The most amazing part of this algorithm is that it learns anything at all. Think about it: because our Q-function is initialized randomly, it initially outputs complete garbage, and we use this garbage (the maximum Q-value of the next state) as targets for the network, only occasionally folding in a tiny reward. That sounds crazy. How could it learn anything meaningful? The fact is, it does.
