Deep Q-Learning (Space Invaders)

09 Mar 2016

Ever since I learned about neural networks playing Atari games I wanted to reimplemnted it and learn how it works. Below you can see an AI playing Space Invaders. I trained it during my batch at Recurse Center on little over 50M frames.

It is more awesome if you realize that the AI was trained in a similar way a human would learn: the only inputs are screen and number of gained (or lost) points after each action taken by the AI.

DQN does much better then a best-action strategy (do nothing but shoot) and random strategy.

algorithm	points
DQN	550
Best action	240
Random	150

You can play with my implementation here: Deep Q-Learning. This is first post on the topic, stay tuned for the next ones!

Average game reward (600 games) after N games played. Blue line is random strategy baseline, red line is best-action strategy baseline.

#Algorithm So what is Deep Q-Learning (DQN)? Below you will find a gentle introduction. I omit certain details for the sake of simplicity and I encourage you to read the original paper.

The task for Neural Network in DQN is to learn which action to take based on the screen and previous experience. Obviously the nueral network should choose the best action but how to learn which one is best?

Turns out your neural network can be pretty simple: the input is game screen and hidden layers consists of 3 convolutional layers and a single fully connected layer. The number of neurons in last layer corresponds to number of actions that can be taken. In the case of Space Invaders there were 4 actions (do nothing, shoot, go left, go right), therefore there were 4 neurons in the output layer.

The tricky and crucial part is the loss function. In ‘normal’ neural networks the loss function is straigtforward as for each training example $X$ there is a known expected outcome $y$ . Nothing like that is available in our case but we can deal with it thanks to some insights from Q-Learning!

Loss function

The best measure of how good an action is accumulated future reward

$\sum_{i=t_0} r_i$

where sum is taken over time from $t_0$ until the end of the game and $r_i$ is reward gained at time $i$ . There is a couple of problems with that simplified definition and we’ll deal with them one by one.

For one there is no way to calculate that sum as we don’t know the future. With this, we’ll deal at the end though.

Intuitively speaking the immediate reward $r_{t_0}$ should be more valuable then a very distant one. After all, future is uncertain and we might never get this distant reward at all.
Let’s introduce discounted accumulated future reward.

$\sum_{i=t_0} \gamma^i r_i$

Where $\gamma$ is between 0 and 1. We’ll set $\gamma$ to $0.99$ , though, as the distant rewards are very important.

Finally our game is stochastic (we don’t know when an enemy shoots a laser beam) therefore we should rather think in terms of expected value. Ideally, what we want the neural network to learn is function Q defined as:

$Q(s)(a) = \mathbb{E}\left(\sum_{i=t_0} \gamma^i r_i \right) \quad \quad \text{Expected discounted accumulated future reward}$

where $s$ is the input game screen at time $t_0$ , $a$ indicates the neuron corresponding with action $a$ , $r_{i}$ is reward obtained after action taken at time $i$ . That is what we want each neuron of the output layer to learn.

To do that efficiently we need to realise that $Q(s)(a) = r + \gamma \max_{a'}Q(s')(a')$ where $s'$ is game screen experienced after taking action $a$ after seeing game screen $s$ .

Now if $Q^{*}$ is our neural network we can treat $Q^{*}(s)(a) - \left(r + \gamma \max_{a'}Q^{*}(s')(a')\right)$ as a measure of surprise and therefore a loss function (after squaring).

Note that the loss depends on the neural network itself in an untypical way.

Correlation

Since we play the game online it is tempting to simply update the network after each taken action or in mini-batches of, say, 32 actions taken. Such mini-batches would be highly correlated and any stochastic update algorithm would fail on that.

To deal with that issue we keep previous experiences in memory and after each action taken we draw a mini-batch of experiences from that memory to perform the update step.

Other technicalities

provide only every 4th frame to the neural network
stack 4 frames one on top of the other to make the neural network aware of time. This could be avoided if you used LSTM.
keep a stale network at hand and calculate loss with regards to that stale network
gradient clipping (to avoid blowing up gradients)
DeepMind Rmsprop (instead of normal one) - improved performance by 40% in my case.
I used Arcade Learning Environment to play space invaders.

Lessons Learned

That was my first exposure to training non trivial neural networks so there is plenty of things that I learned.

Nan’s as weights is no good. Seems obvious but it does not mean that it’s easy to track down such problems. Theano provides means of doing that efficiently. Of course an NaN usually means that you divide $\infty$ by $0$ .
Test your Theano code. It’s not so hard! And here is relevant documentation.
If your network converges or diverges to $\infty$ very quickly it’s probably caused by suboptimal learning rates applied in your update function. Little is known about how to correctly choose network’s hyperparameters so trial, error and verification is what’s left.
Update method might play a gigantic role in performance of your neural network. In my case, learning curve of my DQN implementation flattened after 10M frames around 400 points for traditional RMSProp. “DeepMind” RmsProp was learning slower but boosted the performance to over 550 points on average after 50M frames and one can clearly see that it kept learning all the time.