# David's Simple Game - SARSA with an Adversary Controller

This applet shows a simple game on a 5x5 grid. The agent (shown as a circle) can move up, down, left or right.

## The Game

There can be a prize at one of the 4 corners (the prize is shown in cyan when it is there). When the agent lands on a prize, it gets a reward of +10 and the prize disappears. When there is no prize, a prize can appear at one of the 4 corners. The prize stays there until the agent lands on it. In this version, an adversary chooses which corner the prize appears at. Note that the adversary does not explore: it acts greedily with respect to the Q-function (i.e., it chooses the prize location that has the lowest Q-value).

There are 5 locations where monsters can appear randomly. A monster is shown by coloring its square red. If a monster appears when the agent is at that location, the agent gets damaged if it wasn't already damaged. If it was already damaged, the agent gets a penalty of 10 (i.e., a reward of -10). Monsters appear at each of the locations independently at each time step. The agent can get repaired by visiting the repair station (second location from the left on the top row, shown in magenta). The agent is yellow when it isn't damaged and pink when it is damaged.
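The damage rule above can be sketched as a small pure function. This is only an illustration, not code from SGameAdvEnv.java: the class `DamageRule`, the `Outcome` type and the parameter names are hypothetical, and it assumes that a monster hit is resolved before the repair station is considered, which the description does not specify.

```java
/** Hypothetical sketch of the damage rule; not taken from SGameAdvEnv.java. */
class DamageRule {
    /** Damage status after the step and the reward earned from monsters. */
    static final class Outcome {
        final boolean damaged;
        final int reward;

        Outcome(boolean damaged, int reward) {
            this.damaged = damaged;
            this.reward = reward;
        }
    }

    /**
     * @param damaged          whether the agent was damaged before this step
     * @param monsterHere      whether a monster appeared on the agent's square
     * @param atRepairStation  whether the agent is on the repair station
     */
    static Outcome apply(boolean damaged, boolean monsterHere, boolean atRepairStation) {
        if (monsterHere) {
            // A first hit only damages the agent; a hit while damaged costs 10.
            return damaged ? new Outcome(true, -10) : new Outcome(true, 0);
        }
        if (atRepairStation) {
            // Assumed: visiting the station clears the damage at no cost.
            return new Outcome(false, 0);
        }
        return new Outcome(damaged, 0);
    }
}
```

For example, `DamageRule.apply(true, true, false)` gives a reward of -10 and leaves the agent damaged, while `DamageRule.apply(true, false, true)` repairs it.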

There are 4 actions available to the agent: up, down, left and right. If the agent carries out one of these actions, it has a 0.7 chance of going one step in the desired direction and a 0.1 chance of going one step in each of the other three directions. If it bumps into the outside wall or an inside wall (i.e., the square computed as above is outside the grid or through an internal wall), there is a penalty of 1 (i.e., a reward of -1) and the agent doesn't actually move.
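As a rough illustration of these dynamics, the sketch below samples the direction actually taken and applies the wall penalty. It is not the applet's code: the class `NoisyMove`, the direction encoding (0 = up, 1 = right, 2 = down, 3 = left) and the coordinate layout are assumptions, and internal walls are left out because their positions are not given here.

```java
import java.util.Random;

/** Hypothetical sketch of the noisy movement rule; outer walls only. */
class NoisyMove {
    static final int SIZE = 5;
    // Assumed encoding: 0 = up, 1 = right, 2 = down, 3 = left; y grows downward.
    static final int[] DX = {0, 1, 0, -1};
    static final int[] DY = {-1, 0, 1, 0};

    /** Intended direction with probability 0.7, each other direction with 0.1. */
    static int actualDirection(int intended, Random rng) {
        int roll = rng.nextInt(10);          // uniform over 0..9
        if (roll < 7) {
            return intended;                 // 7 outcomes out of 10
        }
        return (intended + (roll - 6)) % 4;  // rolls 7, 8, 9 pick the other three
    }

    /** Returns {newX, newY, reward}; a bump leaves the agent in place with reward -1. */
    static int[] step(int x, int y, int intended, Random rng) {
        int dir = actualDirection(intended, rng);
        int nx = x + DX[dir];
        int ny = y + DY[dir];
        if (nx < 0 || nx >= SIZE || ny < 0 || ny >= SIZE) {
            return new int[] {x, y, -1};
        }
        return new int[] {nx, ny, 0};
    }
}
```

Calling `NoisyMove.step(0, 0, 0, new Random())` from the top-left corner bumps into the wall whenever the sampled direction is up or left, so under these assumptions the agent stays put with reward -1 in 80% of such steps.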

## The Controller

The numbers in the squares represent the Q-values of the state made up of that location, the current value of the prize, and the current damage. The blue arrows give the value for the optimal action. The agent acts greedily with respect to the Q-values for the percentage of steps given in the applet, and chooses a random action the rest of the time.
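The agent's action choice might look like the following sketch. It is an illustration rather than the code in SGameAdvController.java: the class `GreedyChoice` and its parameters are hypothetical, and breaking ties uniformly among the best actions is an assumption.

```java
import java.util.Random;

/** Hypothetical sketch of the exploit/explore action choice. */
class GreedyChoice {
    /**
     * @param qValues        Q-values of the actions in the current state
     * @param exploitPercent percentage (0 to 100) of steps that act greedily
     */
    static int choose(double[] qValues, double exploitPercent, Random rng) {
        if (rng.nextDouble() * 100.0 < exploitPercent) {
            // Exploit: find the highest Q-value, then pick among the ties.
            double best = Double.NEGATIVE_INFINITY;
            for (double q : qValues) {
                best = Math.max(best, q);
            }
            int ties = 0;
            for (double q : qValues) {
                if (q == best) ties++;
            }
            int pick = rng.nextInt(ties);
            for (int a = 0; a < qValues.length; a++) {
                if (qValues[a] == best && pick-- == 0) {
                    return a;
                }
            }
        }
        // Explore: any action, uniformly at random.
        return rng.nextInt(qValues.length);
    }
}
```

With `exploitPercent` set to 100 the choice is always a best action; with 0 it is always uniformly random.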

The adversary chooses the corner that results in the state with the lowest Q-value (i.e., the state that is worst for the moving agent).
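The adversary's greedy choice amounts to an argmin over the four corners, as in the sketch below. The class `Adversary` and the method `worstCorner` are hypothetical names; the sketch assumes the caller has already reduced each resulting state to a single value, and that ties go to the lowest-numbered corner, neither of which is specified above.

```java
/** Hypothetical sketch of the adversary's greedy prize placement. */
class Adversary {
    /**
     * @param cornerValues value of the state that results from placing the
     *                     prize at each corner (assumed precomputed by the caller)
     * @return index of the corner with the lowest value; ties go to the lowest index
     */
    static int worstCorner(double[] cornerValues) {
        int worst = 0;
        for (int c = 1; c < cornerValues.length; c++) {
            if (cornerValues[c] < cornerValues[worst]) {
                worst = c;
            }
        }
        return worst;
    }
}
```

For instance, with corner values `{3.0, -1.5, 2.0, 0.0}` the adversary would place the prize at corner 1.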

You can control the agent yourself (using the up, left, right, down buttons) or you can step the agent for the number of times specified in the applet.

There are some parameters you can change:

• Discount: specifies how much future rewards are discounted.
• Step count: specifies the number of steps that occur for each press of the "step" button.
• The greedy exploit percent: specifies what percentage of the time the agent chooses one of the best actions; the rest of the time it acts randomly.
• The alpha gives the learning rate if the fixed box is checked; otherwise alpha changes so as to compute the empirical average.
• The initial value specifies the Q-values when Reset is pressed.
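To show how the discount and alpha enter the learning step, here is a minimal sketch of a standard SARSA update. It is not copied from SGameAdvController.java: the names are hypothetical, and reading "empirical average" as alpha = 1/k, where k is the number of updates made so far to that state-action pair, is an assumption.

```java
/** Hypothetical sketch of a SARSA update with fixed or averaging alpha. */
class SarsaUpdate {
    /** Learning rate: the fixed value, or 1/k to form a running average (assumed). */
    static double alphaFor(boolean fixed, double fixedAlpha, int visitCount) {
        return fixed ? fixedAlpha : 1.0 / visitCount;
    }

    /**
     * Standard SARSA rule: Q(s,a) += alpha * (r + discount * Q(s',a') - Q(s,a)).
     *
     * @param qSA   current estimate Q(s,a)
     * @param qNext estimate Q(s',a') for the action actually chosen next
     * @return the updated Q(s,a)
     */
    static double update(double qSA, double reward, double discount,
                         double qNext, double alpha) {
        return qSA + alpha * (reward + discount * qNext - qSA);
    }
}
```

With an initial value of 0, a reward of +10, and alpha 0.5, `SarsaUpdate.update(0.0, 10.0, 0.9, 0.0, 0.5)` moves the estimate halfway to the target.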

The applet reports the number of steps and the total reward received. It specifies the minimum accumulated reward (which indicates when it has started to learn), and the point at which the accumulated reward changes from negative to positive. Reset initializes these to zero and the Q-values to the initial value. Trace on console lists the history of steps and rewards on the console, if you want to plot it.
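The reported statistics could be tracked along the lines of the sketch below. The class `RewardStats` and its fields are hypothetical, and treating "changes from negative to positive" as the accumulated reward going from below zero to zero or above is an assumption.

```java
/** Hypothetical sketch of the running statistics the applet reports. */
class RewardStats {
    int steps = 0;
    double total = 0.0;
    double minTotal = 0.0;      // lowest accumulated reward seen so far
    int minStep = 0;            // step at which that minimum occurred
    int zeroCrossingStep = -1;  // -1 until the total climbs back to zero or above

    /** Records the reward from one step and updates the statistics. */
    void record(double reward) {
        double before = total;
        steps++;
        total += reward;
        if (total < minTotal) {
            minTotal = total;
            minStep = steps;
        }
        if (before < 0 && total >= 0) {
            zeroCrossingStep = steps;  // keeps the most recent crossing
        }
    }
}
```

Feeding in the rewards -1, -10, +10, +10 gives accumulated totals of -1, -11, -1 and 9, so the minimum is -11 at step 2 and the crossing is at step 4.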

The commands "Brighter" and "Dimmer" change the contrast (the mapping between non-extreme values and colour). "Grow" and "Shrink" change the size of the grid.

You can get the code: SGameGUI.java is the GUI. The environment code is at SGameAdvEnv.java. The controller is at SGameAdvController.java. You can also get the javadoc for my RL applets. This applet comes with ABSOLUTELY NO WARRANTY. This is free software, and you are welcome to redistribute it under certain conditions, see the code for more details. Copyright © David Poole, 2006-2007.