David's Tiny Game: Q-learning and SARSA demo

If you can read this instead of seeing the applet, Java is not working in your browser. The applet has been tested in Firefox, and it works there.

The aim of this applet is to show how Q-learning and SARSA work on a very small, fully observable game with 6 states and 4 actions in each state.

The Game

There are 6 states, arranged in a grid 2 squares wide and 3 squares high. The agent gets a reward of +10 for leaving the top-left square to the left, and a reward of -100 for leaving the middle-left square to the left. Hitting a wall in any other case gives a reward of -1.

There are 4 actions available to the agent: up, careful up, left, and right. The action "left" goes 1 step left. The action "right" goes 1 step right. The action "careful up" goes 1 step up, but has a reward of -1. The action "up" has an 80% chance of going up, a 10% chance of going right, and a 10% chance of going left. When an action results in hitting a wall, the agent doesn't move. The one exception is moving left from the top-left square, which takes the agent to the bottom-left square.
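The rules above can be sketched in Java roughly as follows. This is an illustration only, not the code in TGameEnv.java: the class name TinyGameSketch, the (row, col) coordinates with row 0 at the top, and the bottom-left starting square are all assumptions. The text does not say whether "careful up" is penalised further when it hits the top wall, or whether a sideways slip of "up" earns the usual rewards for that direction; the sketch assumes "careful up" always costs exactly -1 and that slips are rewarded like ordinary moves.

```java
import java.util.Random;

/** Illustrative sketch only; this is not the code in TGameEnv.java. */
final class TinyGameSketch {
    static final int UP = 0, CAREFUL_UP = 1, LEFT = 2, RIGHT = 3;
    static final int ROWS = 3, COLS = 2; // assumed layout: 2 wide, 3 high

    int row = ROWS - 1; // assumed start: bottom-left square
    int col = 0;
    private final Random rng;

    TinyGameSketch(long seed) {
        rng = new Random(seed);
    }

    /** Carries out one action, updates (row, col) and returns the reward. */
    double step(int action) {
        if (action == UP) {
            // 80% up, 10% right, 10% left.
            double roll = rng.nextDouble();
            if (roll < 0.1) return move(LEFT);
            if (roll < 0.2) return move(RIGHT);
            return move(UP);
        }
        if (action == CAREFUL_UP) {
            move(UP);    // always goes up (or stays put at the top wall)
            return -1.0; // assumption: the cost is -1 with or without a wall
        }
        return move(action);
    }

    /** Deterministic move in one direction: UP, LEFT or RIGHT. */
    double move(int direction) {
        switch (direction) {
            case UP:
                if (row == 0) return -1.0;       // hit the top wall
                row--;
                return 0.0;
            case LEFT:
                if (col > 0) { col--; return 0.0; }
                if (row == 0) {                  // leaving the top-left square
                    row = ROWS - 1;              // agent goes to the bottom-left
                    return 10.0;
                }
                if (row == 1) return -100.0;     // middle-left square
                return -1.0;                     // bottom-left: ordinary wall
            case RIGHT:
                if (col == COLS - 1) return -1.0; // hit the right wall
                col++;
                return 0.0;
            default:
                throw new IllegalArgumentException("unknown direction " + direction);
        }
    }

    public static void main(String[] args) {
        TinyGameSketch game = new TinyGameSketch(42L);
        double reward = game.step(CAREFUL_UP);
        System.out.println("row=" + game.row + " col=" + game.col + " reward=" + reward);
    }
}
```

Keeping the random choice in step and the wall logic in move means the deterministic part can be checked without depending on the random number generator.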

The blue arrows and the numbers represent the Q-values for the state-action pairs. The bottom number represents the "careful up" action; the others refer to the obvious corresponding action. The blue arrow points to the action with the highest Q-value for each state.

The Controller

You can control the agent yourself (using the up, left, right, and careful up buttons) or you can step the agent a number of times.

There are some parameters you can change:

  • Discount: specifies how much future rewards are discounted.
  • Step count: specifies the number of steps that occur for each press of the "step" button.
  • The greedy exploit percent: specifies what percentage of the time the agent chooses one of the best actions; the rest of the time it acts randomly.
  • Alpha: gives the learning rate if the fixed box is checked; otherwise alpha changes over time so that each Q-value is the empirical average of its updates.
  • Use SARSA: when this box is checked the agent uses the SARSA algorithm; otherwise it uses Q-learning.
  • Initial value: specifies the value that the Q-values are set to when Reset is pressed.
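As a rough sketch of how these parameters could fit together in a tabular learner, consider the Java below. It is not the code in TGameQController.java; the class, field and method names are made up for illustration. It assumes that the non-fixed alpha is 1/k on the k-th update of a state-action pair (which makes the Q-value the running average of its update targets) and that ties between best actions are broken at random.

```java
import java.util.Arrays;
import java.util.Random;

/** Illustrative sketch only; names are hypothetical, not from the applet. */
final class TinyLearnerSketch {
    final double[][] q;      // q[state][action]
    final int[][] visits;    // update counts, used when alpha is not fixed
    double discount = 0.9;
    double exploitPercent = 80.0;
    double alpha = 0.1;
    boolean alphaFixed = true;
    boolean useSarsa = false;
    private final Random rng;

    TinyLearnerSketch(int states, int actions, double initialValue, long seed) {
        q = new double[states][actions];
        visits = new int[states][actions];
        for (double[] row : q) Arrays.fill(row, initialValue);
        rng = new Random(seed);
    }

    /** Index of a highest-valued action; ties are broken at random. */
    int bestAction(int state) {
        int best = 0, ties = 1;
        for (int a = 1; a < q[state].length; a++) {
            if (q[state][a] > q[state][best]) {
                best = a;
                ties = 1;
            } else if (q[state][a] == q[state][best] && rng.nextInt(++ties) == 0) {
                best = a;
            }
        }
        return best;
    }

    /** Exploits with probability exploitPercent/100, else acts at random. */
    int chooseAction(int state) {
        if (rng.nextDouble() * 100.0 < exploitPercent) return bestAction(state);
        return rng.nextInt(q[state].length);
    }

    /**
     * Updates Q(s, a) after seeing reward r and next state s2.
     * nextAction is the action actually chosen in s2; only SARSA uses it.
     */
    void update(int s, int a, double r, int s2, int nextAction) {
        double nextValue = useSarsa
                ? q[s2][nextAction]            // SARSA: value of the action taken
                : q[s2][bestAction(s2)];       // Q-learning: value of a best action
        visits[s][a]++;
        double rate = alphaFixed ? alpha : 1.0 / visits[s][a];
        q[s][a] += rate * (r + discount * nextValue - q[s][a]);
    }

    public static void main(String[] args) {
        TinyLearnerSketch learner = new TinyLearnerSketch(6, 4, 0.0, 7L);
        int action = learner.chooseAction(0);
        learner.update(0, action, -1.0, 0, learner.chooseAction(0));
        System.out.println(Arrays.toString(learner.q[0]));
    }
}
```

The only difference between the two algorithms in this sketch is which Q-value of the next state enters the update: SARSA uses the action the agent will actually take, so exploratory moves affect what it learns, while Q-learning always uses a best action.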

The applet reports the number of steps and the total reward received. It also reports the minimum accumulated reward (which indicates when the agent has started to learn) and the point at which the accumulated reward changes from negative to positive. Reset initializes these to zero. "Trace on console" lists the steps and rewards on the console, in case you want to plot them.
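One way such statistics could be tracked is sketched below in Java. The class RewardStatsSketch and its field names are hypothetical and are not taken from the applet's source. The sketch assumes that "the point" of the sign change means the step number of the most recent crossing, and that a value of zero means no crossing has happened yet.

```java
/** Illustrative sketch only; names are hypothetical, not from the applet. */
final class RewardStatsSketch {
    long steps;            // number of steps taken
    double total;          // accumulated reward so far
    double minTotal;       // lowest accumulated reward seen
    long minStep;          // step at which that minimum occurred
    long zeroCrossingStep; // step at which total last went from < 0 to >= 0

    /** Records the reward received on one step. */
    void record(double reward) {
        double before = total;
        steps++;
        total += reward;
        if (total < minTotal) {
            minTotal = total;
            minStep = steps;
        }
        if (before < 0 && total >= 0) {
            zeroCrossingStep = steps;
        }
    }

    /** Sets every statistic back to zero, as Reset does. */
    void reset() {
        steps = 0;
        total = 0;
        minTotal = 0;
        minStep = 0;
        zeroCrossingStep = 0;
    }

    public static void main(String[] args) {
        RewardStatsSketch stats = new RewardStatsSketch();
        stats.record(-1.0);
        stats.record(10.0);
        System.out.println(stats.steps + " steps, total " + stats.total
                + ", minimum " + stats.minTotal + " at step " + stats.minStep);
    }
}
```

Once the agent has learned a good policy the accumulated reward stops falling, so the step at which the minimum occurred stops changing.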

You can change the size of the font and change the size of the grid.

Other Applets for a related, larger, game:

You can get the code: TGameGUI.java is the GUI. The environment code is in TGameEnv.java. The controllers are in TGameController.java and TGameQController.java. You can also get the javadoc for a number of my applets. You can get the complete set of the reinforcement learning applets from rl.zip.

This applet comes with ABSOLUTELY NO WARRANTY. This is free software, and you are welcome to redistribute it under certain conditions; see the code for more details. Copyright © David Poole, 2010.