7 mins version: DQN for Flappy Bird

Overview
This project follows the description of the Deep Q-Learning algorithm described in Playing Atari with Deep Reinforcement Learning [2] and shows that this learning algorithm can be further generalized to the notorious Flappy Bird.

Installation Dependencies:
Python 2.7 or 3
TensorFlow 0.7
pygame
OpenCV-Python

How to Run?
git clone https://github.com/yenchenlin1994/DeepLearningFlappyBird.git
cd DeepLearningFlappyBird
python deep_q_network.py
What is Deep Q-Network?
It is a convolutional neural network, trained with a variant of Q-learning, whose input is raw pixels and whose output is a value function estimating future rewards.
For those who are interested in deep reinforcement learning, I highly recommend reading the following post:

Demystifying Deep Reinforcement Learning

Deep Q-Network Algorithm
The pseudo-code for the Deep Q-Learning algorithm, as given in [1], can be found below:
Initialize replay memory D to size N
Initialize action-value function Q with random weights
for episode = 1, M do
    Initialize state s_1
    for t = 1, T do
        With probability ϵ select random action a_t
        otherwise select a_t = max_a Q(s_t, a; θ_i)
        Execute action a_t in emulator and observe r_t and s_(t+1)
        Store transition (s_t, a_t, r_t, s_(t+1)) in D
        Sample a minibatch of transitions (s_j, a_j, r_j, s_(j+1)) from D
        Set y_j :=
            r_j                                       for terminal s_(j+1)
            r_j + γ * max_(a') Q(s_(j+1), a'; θ_i)    for non-terminal s_(j+1)
        Perform a gradient descent step on (y_j - Q(s_j, a_j; θ_i))^2 with respect to θ
    end for
end for
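The target y_j in the inner loop can be sketched in plain NumPy; the network is abstracted away, and all names and values here are illustrative rather than taken from the repo:

```python
import numpy as np

GAMMA = 0.99  # discount factor γ for future rewards (illustrative value)

def q_targets(rewards, next_q_values, terminals):
    """Compute the targets y_j for a minibatch, as in the pseudo-code above.

    rewards:       shape (B,)              immediate rewards r_j
    next_q_values: shape (B, num_actions)  Q(s_(j+1), a'; θ_i) from the network
    terminals:     shape (B,)              booleans, True where s_(j+1) is terminal
    """
    max_next_q = next_q_values.max(axis=1)  # max_(a') Q(s_(j+1), a')
    # y_j = r_j for terminal s_(j+1), r_j + γ * max_(a') Q(...) otherwise
    return rewards + GAMMA * max_next_q * (~terminals)

# toy minibatch of size 2: one terminal and one non-terminal transition
y = q_targets(np.array([1.0, 0.1]),
              np.array([[0.5, 2.0], [1.0, 3.0]]),
              np.array([True, False]))
print(y)  # → [1.0, 0.1 + 0.99 * 3.0]
```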
Experiments
Environment
Since the deep Q-network is trained on the raw pixel values observed from the game screen at each time step, [3] finds that removing the background present in the original game makes it converge faster. This process can be visualized in the following figure:
Network Architecture
According to [1], I first preprocess the game screens with the following steps:
1. Convert the image to grayscale
2. Resize the image to 80x80
3. Stack the last 4 frames to produce an 80x80x4 input array for the network
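These steps can be sketched without OpenCV (the repo itself uses opencv-python; the luminosity weights and nearest-neighbor resize below are a dependency-free approximation):

```python
import numpy as np

def preprocess(frame_rgb, size=80):
    # step 1: grayscale via the standard luminosity weights
    gray = frame_rgb.astype(np.float32) @ np.array([0.299, 0.587, 0.114])
    # step 2: nearest-neighbor resize to size x size
    h, w = gray.shape
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return gray[np.ix_(rows, cols)]

# step 3: stack the 4 most recent preprocessed frames into the network input
frames = [preprocess(np.random.randint(0, 255, (288, 512, 3))) for _ in range(4)]
state = np.stack(frames, axis=-1)
print(state.shape)  # (80, 80, 4)
```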
The architecture of the network is shown in the figure below. The first layer convolves the input image with an 8x8x4x32 kernel at a stride size of 4. The output is then put through a 2x2 max pooling layer. The second layer convolves with a 4x4x32x64 kernel at a stride of 2. We then max pool again. The third layer convolves with a 3x3x64x64 kernel at a stride of 1. We then max pool one more time. The last hidden layer consists of fully connected ReLU nodes.
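Assuming TensorFlow's SAME padding (an assumption, not stated above), each conv or pool layer maps a spatial size n at stride s to ceil(n/s), so the shapes can be traced through the layers:

```python
import math

def same_out(n, stride):
    # SAME padding: output spatial size depends only on input size and stride
    return math.ceil(n / stride)

n = 80                 # 80x80x4 input
n = same_out(n, 4)     # conv1, 8x8x4x32, stride 4   -> 20x20x32
n = same_out(n, 2)     # 2x2 max pool                -> 10x10x32
n = same_out(n, 2)     # conv2, 4x4x32x64, stride 2  ->  5x5x64
n = same_out(n, 2)     # 2x2 max pool                ->  3x3x64
n = same_out(n, 1)     # conv3, 3x3x64x64, stride 1  ->  3x3x64
n = same_out(n, 2)     # 2x2 max pool                ->  2x2x64
print(n * n * 64)      # 256 features into the fully connected layer
```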
The final output layer has the same dimensionality as the number of valid actions that can be performed in the game, where the 0th index always corresponds to doing nothing. The values at this output layer represent the Q function given the input state for each valid action. At each time step, the network performs whichever action corresponds to the highest Q value, using an ϵ-greedy policy.

Training
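The ϵ-greedy action selection can be sketched as follows (function and variable names are illustrative):

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability ϵ pick a random action, otherwise the argmax action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

# index 0 = do nothing, index 1 = flap; with ϵ = 0 the choice is purely greedy
print(epsilon_greedy([0.2, 1.5], epsilon=0.0))  # 1
```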
At first, I initialize all weight matrices randomly using a normal distribution with a standard deviation of 0.01, then set the replay memory to a max size of 500,000 experiences.
I start training by choosing actions uniformly at random for the first 10,000 time steps, without updating the network weights. This allows the system to populate the replay memory before training begins.
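A replay memory of this kind is commonly a bounded deque; this sketch (illustrative names and sizes) shows the populate-then-sample pattern:

```python
import random
from collections import deque

REPLAY_MEMORY = 50000   # illustrative max size; oldest experiences drop off
BATCH = 32              # illustrative minibatch size

memory = deque(maxlen=REPLAY_MEMORY)

def store(s, a, r, s_next, terminal):
    memory.append((s, a, r, s_next, terminal))

# populate with random-policy experience before any weight updates
for t in range(1000):
    store(t, random.randrange(2), 0.1, t + 1, False)

minibatch = random.sample(memory, BATCH)
print(len(memory), len(minibatch))  # 1000 32
```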
Note that unlike [1], which initializes ϵ = 1, I linearly anneal ϵ from 0.1 to 0.0001 over the course of the next 3,000,000 frames. The reason I set it this way is that the agent can choose an action every 0.03 s (FPS = 30) in our game, so a high ϵ makes it flap too much, which keeps it at the top of the game screen and eventually bumps it into a pipe in a clumsy way. This condition makes the Q function converge relatively slowly, since it only starts to explore other situations once ϵ is low.
However, in other games, initializing ϵ to 1 is more reasonable.
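The annealing schedule described above can be sketched with the same parameter names used later in the FAQ:

```python
OBSERVE = 10000           # random-action steps before annealing starts
EXPLORE = 3000000         # frames over which ϵ is annealed
INITIAL_EPSILON = 0.1
FINAL_EPSILON = 0.0001

def epsilon_at(t):
    """Linearly anneal ϵ from INITIAL_EPSILON to FINAL_EPSILON over EXPLORE frames."""
    if t <= OBSERVE:
        return INITIAL_EPSILON
    frac = min(1.0, (t - OBSERVE) / EXPLORE)
    return INITIAL_EPSILON - frac * (INITIAL_EPSILON - FINAL_EPSILON)

print(epsilon_at(0), epsilon_at(OBSERVE + EXPLORE // 2), epsilon_at(OBSERVE + EXPLORE))
```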
During training, at each time step, the network samples minibatches from the replay memory and performs a gradient descent step on the loss function described above, using the Adam optimization algorithm with a learning rate of 0.000001. After annealing finishes, the network continues to train indefinitely, with ϵ fixed at 0.001.

FAQ

Checkpoint not found
Change the first line of saved_networks/checkpoint to
model_checkpoint_path: "saved_networks/bird-dqn-2920000"

How to reproduce?
Comment out these lines
Modify deep_q_network.py's parameters as follows:
OBSERVE = 10000
EXPLORE = 3000000
FINAL_EPSILON = 0.0001
INITIAL_EPSILON = 0.1
References
[1] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518:529–533, 2015.
[2] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. NIPS Deep Learning Workshop, 2013.
[3] Kevin Chen. Deep Reinforcement Learning for Flappy Bird. Report | YouTube result

Disclaimer
This work is highly based on the following repos:
1. [sourabhv/FlapPyBird](https://github.com/sourabhv/FlapPyBird)
2. asrivat1/DeepLearningVideoGames