David's Simple Game - Model-based reinforcement learning controller

If you can read this, Java is not working in your browser. This applet has been tested in Firefox, and it works there!

This applet shows a simple game on a 5x5 grid. The agent (shown as a circle) can move up, down, left or right.

The Game

There can be a prize at one of the 4 corners (shown in cyan when it is there). When the agent lands on a prize it gets a reward of +10 and the prize disappears. When there is no prize, a prize can appear randomly at one of the 4 corners. The prize stays there until the agent lands on it.

There are 5 locations where monsters can appear randomly. The monsters are shown as red in the square. If a monster appears when the agent is at that location, the agent becomes damaged if it wasn't already damaged. If it was already damaged, the agent incurs a penalty of 10 (i.e., a reward of -10). Monsters appear at these locations independently at each time step. The agent can get repaired by visiting the repair station (the second-from-left location on the top row, shown in magenta). The agent is yellow when it isn't damaged and pink when it is damaged.

There are 4 actions available to the agent: up, down, left and right. If the agent carries out one of these actions, it has a 0.7 chance of going one step in the desired direction and a 0.1 chance of going one step in each of the other three directions. If it bumps into the outside wall or an inside wall (i.e., the square computed as above is outside the grid or through an internal wall), there is a penalty of 1 (i.e., a reward of -1) and the agent doesn't actually move.
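
The stochastic move dynamics can be sketched as follows. This is an illustration only, not the applet's actual code; the class and method names are hypothetical, and it assumes a grid with no internal walls.

```java
import java.util.Random;

// Sketch of the stochastic move dynamics described above
// (hypothetical names, not taken from the applet's source).
public class MoveModel {
    // Direction offsets for up, down, left, right.
    static final int[] DX = {0, 0, -1, 1};
    static final int[] DY = {-1, 1, 0, 0};

    // Sample the direction actually taken: 0.7 for the intended
    // direction, 0.1 for each of the other three directions.
    static int sampleDirection(int intended, Random rng) {
        double u = rng.nextDouble();
        if (u < 0.7) return intended;
        int k = Math.min(2, (int) ((u - 0.7) / 0.1)); // which of the other 3
        int count = 0;
        for (int d = 0; d < 4; d++) {
            if (d == intended) continue;
            if (count == k) return d;
            count++;
        }
        return intended; // unreachable
    }

    // Move the agent on a size x size grid (no internal walls here).
    // Mutates pos = {x, y}; returns the movement reward: -1 for a
    // bump into the outside wall (no movement), 0 otherwise.
    static double step(int[] pos, int action, int size, Random rng) {
        int dir = sampleDirection(action, rng);
        int nx = pos[0] + DX[dir], ny = pos[1] + DY[dir];
        if (nx < 0 || nx >= size || ny < 0 || ny >= size) return -1.0;
        pos[0] = nx;
        pos[1] = ny;
        return 0.0;
    }
}
```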

This is a difficult game to learn to play well. All that visiting the repair location does is change the damage status. Initially the agent does not know that being damaged is bad. It has to learn that being damaged leads to worse outcomes than not being damaged. To learn this, it has to stay undamaged long enough to realize that it is good to be undamaged. After this, it has to discover that, in the state (1,1), getting repaired is worthwhile. It has to explore enough to try to get repaired!

The Controller

The numbers in the squares represent the Q-values of the state made up of that location, the current value of the prize, and the current damage. The blue arrows give the value for the optimal action. The agent acts greedily with respect to the Q-values for the percentage of time specified in the applet, and chooses a random action the rest of the time.
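
The greedy-exploit selection can be sketched as follows (hypothetical names; the applet's actual code may differ, e.g. in how it breaks ties among equally good actions):

```java
import java.util.Random;

// Sketch of greedy-exploit action selection over Q-values.
public class Explore {
    // With probability greedyPercent/100, pick an action with maximal
    // Q-value; otherwise pick uniformly at random among all actions.
    static int selectAction(double[] q, double greedyPercent, Random rng) {
        if (rng.nextDouble() * 100.0 < greedyPercent) {
            int best = 0;
            for (int a = 1; a < q.length; a++)
                if (q[a] > q[best]) best = a;
            return best;
        }
        return rng.nextInt(q.length);
    }
}
```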

You can control the agent yourself (using the up, left, right, down buttons) or you can step the agent for the number of times specified in the applet.

There are some parameters you can change:

  • Discount: specifies how much future rewards are discounted.
  • Step count: specifies the number of steps that occur for each press of the "step" button.
  • The greedy exploit percent: specifies the percentage of the time the agent chooses one of the best actions; the rest of the time it acts randomly.
  • The "updates/step" gives the number of asynchronous value iteration updates per step. The first update is for the state-action pair just carried out; the others are for random state-action pairs.
  • The initial value specifies the Q-values when Reset is pressed.
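
The model-based update behind these parameters can be sketched as follows: keep transition counts and reward sums, and back up each Q-value from the empirical model. This is a minimal sketch with hypothetical field names, assuming states and actions are indexed by integers; the applet's SGameModelController.java may be organized differently.

```java
// Sketch of a model-based controller doing asynchronous value
// iteration from an empirical model (hypothetical names).
public class ModelController {
    int nStates, nActions;
    int[][][] transCount;  // transCount[s][a][s2]: observed transitions
    int[][] visitCount;    // visitCount[s][a]: times (s,a) was tried
    double[][] rewardSum;  // rewardSum[s][a]: summed observed rewards
    double[][] q;          // current Q-value estimates
    double discount;

    ModelController(int nStates, int nActions, double discount, double initQ) {
        this.nStates = nStates;
        this.nActions = nActions;
        this.discount = discount;
        transCount = new int[nStates][nActions][nStates];
        visitCount = new int[nStates][nActions];
        rewardSum = new double[nStates][nActions];
        q = new double[nStates][nActions];
        for (double[] row : q) java.util.Arrays.fill(row, initQ);
    }

    // Record one observed transition (s, a) -> s2 with reward r.
    void observe(int s, int a, double r, int s2) {
        visitCount[s][a]++;
        rewardSum[s][a] += r;
        transCount[s][a][s2]++;
    }

    // One asynchronous value iteration update of Q[s][a]:
    // Q[s][a] = avg reward + discount * sum_s2 P(s2|s,a) * max_a2 Q[s2][a2].
    void update(int s, int a) {
        int n = visitCount[s][a];
        if (n == 0) return;  // no data yet: avoid divide-by-zero
        double expected = rewardSum[s][a] / n;
        for (int s2 = 0; s2 < nStates; s2++) {
            if (transCount[s][a][s2] == 0) continue;
            double maxQ = q[s2][0];
            for (int a2 = 1; a2 < nActions; a2++)
                maxQ = Math.max(maxQ, q[s2][a2]);
            expected += discount * (transCount[s][a][s2] / (double) n) * maxQ;
        }
        q[s][a] = expected;
    }
}
```

With "updates/step" set to k, each step would call observe once, then update once on the pair just experienced and k-1 times on randomly chosen pairs.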

The applet reports the number of steps and the total reward received. It specifies the minimum accumulated reward (which indicates when it has started to learn), and the point at which the accumulated reward changes from negative to positive. Reset initializes these to zero and the Q-values to the initial value. Trace on console lists the history of steps and rewards on the console, if you want to plot it.

The commands "Brighter" and "Dimmer" change the contrast (the mapping between non-extreme values and colour). "Grow" and "Shrink" change the size of the grid.

Here are some suggestions for changes that make good exercises:

  • The current applet uses a dense representation of the transition counts, even though many are zero. It may be more efficient to have a sparse representation that only stores the non-zero values.
  • The current applet does not use pseudo-counts. (Instead it needs to check to avoid divide-by-zero errors.) Does it work better with pseudo-counts?
  • Which division of effort between stepping in the environment, and updating the Q-values works best in terms of how quickly it learns and how good the policy is?
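
For the pseudo-count exercise, one way to start is to act as if each state-action pair had been tried a few extra times with some prior reward, which also makes the divisions always well defined. A sketch (the method and parameter names are hypothetical):

```java
// Sketch of pseudo-counts for the reward model: pretend (s,a) was
// tried `pseudo` extra times, each yielding `priorReward`.
public class PseudoCounts {
    static double expectedReward(double rewardSum, int visits,
                                 double pseudo, double priorReward) {
        // With visits == 0 this returns priorReward, so no special
        // divide-by-zero check is needed.
        return (rewardSum + pseudo * priorReward) / (visits + pseudo);
    }
}
```

The same idea applies to the transition probabilities: add a pseudo-count to each transition count before normalizing.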

Other Applets for the game:

You can get the code: SGameGUI.java is the GUI. The environment code is at SGameEnv.java. The controller is at SGameModelController.java. You can also get the javadoc for my RL applets. This applet comes with ABSOLUTELY NO WARRANTY. This is free software, and you are welcome to redistribute it under certain conditions, see the code for more details. Copyright © David Poole, 2010.