3.7.6 Dynamic Programming

Dynamic programming is a general method for optimization that involves storing partial solutions to problems, so that a solution that has already been found can be retrieved rather than being recomputed. Dynamic programming algorithms are used throughout AI.

Dynamic programming can be used for finding paths in graphs. Intuitively, dynamic programming for graph searching can be seen as constructing the perfect heuristic function so that A*, even if it keeps only one element of the frontier, is guaranteed to find a solution. This cost-to-goal function represents the exact cost of a minimal-cost path from each node to the goal.

A policy is a specification of which arc to take from each node. The cost-to-goal function can be computed offline and can be used to build an optimal policy. Online, an agent can use this policy to determine what to do at each point.

Let cost_to_goal(n) be the actual cost of a lowest-cost path from node n to a goal; cost_to_goal(n) can be defined as

cost_to_goal(n) = 0                                                 if is_goal(n),
cost_to_goal(n) = min_{⟨n,m⟩∈A} ( cost(⟨n,m⟩) + cost_to_goal(m) )   otherwise.

The general idea is to start at the goal and build a table of the cost_to_goal(n) value for each node. This can be done by carrying out a lowest-cost-first search, with multiple-path pruning, from the goal nodes in the inverse graph, which is the graph with all arcs reversed. Rather than having a goal to search for, the dynamic programming algorithm records the cost_to_goal values for each node found. It uses the inverse graph to compute the costs from each node to the goal and not the costs from the goal to each node. In essence, dynamic programming works backward from the goal by trying to build the lowest-cost paths to the goal from each node in the graph.
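
As a concrete illustration, here is a minimal Python sketch of this backward search. The graph representation (a dictionary mapping each node to a list of (neighbor, arc cost) pairs) is an assumption of this sketch. It runs a lowest-cost-first search from the goal nodes over the inverse graph; multiple-path pruning corresponds to recording cost_to_goal for a node only the first time it is expanded.

import heapq

def cost_to_goal_table(graph, goals):
    """Compute cost_to_goal(n) for every node with a path to a goal.
    graph: dict mapping each node to a list of (neighbor, arc_cost) pairs.
    goals: collection of goal nodes."""
    # Build the inverse graph by reversing every arc.
    inverse = {n: [] for n in graph}
    for n, arcs in graph.items():
        for m, cost in arcs:
            inverse.setdefault(m, []).append((n, cost))
    cost_to_goal = {}                    # node -> cost of a lowest-cost path to a goal
    frontier = [(0, g) for g in goals]   # priority queue ordered by cost so far
    heapq.heapify(frontier)
    while frontier:
        c, n = heapq.heappop(frontier)
        if n in cost_to_goal:            # multiple-path pruning: first expansion is cheapest
            continue
        cost_to_goal[n] = c
        for m, arc_cost in inverse.get(n, []):
            if m not in cost_to_goal:
                heapq.heappush(frontier, (c + arc_cost, m))
    return cost_to_goal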

For a particular goal, once the cost_to_goal value for each node has been recorded, an agent can use the cost_to_goal value to determine the next arc on an optimal path. From node n it should go to a neighbor m that minimizes cost(⟨n,m⟩)+cost_to_goal(m). Following this policy will take the agent from any node to a goal along a lowest-cost path. Given cost_to_goal, determining which arc is optimal takes constant time with respect to the size of the graph, assuming a bounded number of neighbors for each node. Dynamic programming takes time and space linear in the size of the graph to build the cost_to_goal table.
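
Given the table, selecting the next arc is a lookup over the current node's neighbors. A sketch, under the same assumed graph representation:

def next_arc(graph, cost_to_goal, n):
    """Return a neighbor m of n minimizing cost(⟨n,m⟩) + cost_to_goal(m),
    or None if no path from n to a goal exists."""
    reachable = [(cost + cost_to_goal[m], m)
                 for m, cost in graph[n] if m in cost_to_goal]
    return min(reachable)[1] if reachable else None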

Dynamic programming is useful when

  • the goal nodes are explicit (the previous methods only assumed a function that recognizes goal nodes);
  • a lowest-cost path is needed;
  • the graph is finite and small enough to be able to store the cost_to_goal value for each node;
  • the goal does not change very often; and
  • the policy is used a number of times for each goal, so that the cost of generating the cost_to_goal values can be amortized over many instances of the problem.

The main problems with dynamic programming are that

  • it only works when the graph is finite and the table can be made small enough to fit into memory,
  • an agent must recompute a policy for each different goal, and
  • the time and space required is linear in the size of the graph, where the graph size for finite graphs is typically exponential in the path length.

Example 3.18: For the graph given in Figure 3.2, suppose r123 is the goal. The cost from r123 to the goal is then 0; thus,
cost_to_goal(r123)=0.

Continuing with a lowest-cost-first search from r123:

cost_to_goal(o123)=4
cost_to_goal(o119)=13
cost_to_goal(o109)=29
cost_to_goal(b4)=36
cost_to_goal(b2)=39
cost_to_goal(o103)=41
cost_to_goal(b3)=43
cost_to_goal(b1)=45

At this stage the backward search halts. Two things can be noticed here. First, if a node does not have a cost_to_goal value, then no path to the goal exists from that node. Second, an agent can quickly determine the next arc on a lowest-cost path to the goal for any node. For example, if the agent is at o103, to determine a lowest-cost path to r123 it compares 4+43 (the cost of going via b3) with 12+29 (the cost of going straight to o109) and can quickly determine to go to o109.
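
To check this concretely, the sketches above can be run on a transcription of the arc costs of Figure 3.2 (the figure is not reproduced in this section, so treat the costs below as an assumption of the example rather than as authoritative):

# Arc costs as read off Figure 3.2 (a transcription, assumed here).
delivery_graph = {
    'o103': [('ts', 8), ('b3', 4), ('o109', 12)],
    'ts':   [('mail', 6)],
    'o109': [('o111', 4), ('o119', 16)],
    'o119': [('storage', 7), ('o123', 9)],
    'o123': [('o125', 4), ('r123', 4)],
    'b1':   [('c2', 3), ('b2', 6)],
    'b2':   [('b4', 3)],
    'b3':   [('b1', 4), ('b4', 7)],
    'b4':   [('o109', 7)],
    'c1':   [('c3', 8)],
    'c2':   [('c3', 6), ('c1', 4)],
    'mail': [], 'o111': [], 'o125': [], 'storage': [], 'c3': [], 'r123': [],
}

ctg = cost_to_goal_table(delivery_graph, {'r123'})
print(ctg['o103'])                            # 41
print(next_arc(delivery_graph, ctg, 'o103'))  # o109, since 12+29 < 4+43
print('mail' in ctg)                          # False: no path from mail to r123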

When building the cost_to_goal function, the searcher has implicitly determined which neighbor leads to the goal. Instead of determining at run time which neighbor is on an optimal path, it can store this information.
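
As a sketch continuing the code above, the best neighbor can be precomputed in one pass over the table, so that online the agent does a single dictionary lookup per step:

def build_policy(graph, cost_to_goal):
    """Map each node that can reach a goal to a neighbor on a lowest-cost
    path (None when the node has no such neighbor, e.g., the goal here)."""
    return {n: next_arc(graph, cost_to_goal, n)
            for n in graph if n in cost_to_goal}

policy = build_policy(delivery_graph, ctg)
print(policy['o103'])   # o109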


Optimality of the A* algorithm

A search algorithm is optimal if no other search algorithm uses less time or space or expands fewer nodes, while still guaranteeing solution quality. The optimal search algorithm would be one that picks the correct node at each choice. However, this specification is not effective, because there is no way to directly implement it. Whether such an algorithm is even possible is related to the open question of whether P = NP. There does, however, seem to be a statement that can be proved.

Optimality of A*: Among search algorithms that only use arc costs and a heuristic estimate of the cost from a node to a goal, no algorithm expands fewer nodes than A* and guarantees to find a lowest-cost path.

Proof sketch: Given only the information about the arc costs and the heuristic information, unless the algorithm has expanded each path p, where f(p) is less than the cost of an optimal path, it does not know whether p leads to a lower-cost path. More formally, suppose an algorithm A' found a path for a problem P where some path p was not expanded such that f(p) was less than the solution found. Suppose there was another problem P', which was the same as P, except that there really was a path via p with cost f(p). The algorithm A' cannot tell P' from P, because it did not expand the path p, so it would report the same solution for P' as for P, but the solution found for P would not be optimal for P' because the solution found has a higher cost than the path via p. Therefore, an algorithm is not guaranteed to find a lowest-cost path unless it explores all paths with f-values less than the value of an optimal path; that is, it must explore all the paths that A* explores.

Counterexample: Although this proof seems reasonable, there are algorithms that explore fewer nodes. Consider an algorithm that does a forward A*-like search and a backward dynamic programming search, where the steps are interleaved in some way (e.g., by alternating between the forward steps and the backward steps). The backward search builds a table of cost_to_goal(n) values of the actual discovered cost from n to a goal, and it maintains a bound b, where it has explored all paths of cost less than b to a goal. The forward search uses a priority queue on cost(p)+c(n), where n is the node at the end of the path p, and c(n) is cost_to_goal(n) if it has been computed; otherwise, c(n) is max(h(n),b). The intuition is that, if a path exists from the end of path p to a goal node, either it uses a path that has been discovered by the backward search or it uses a path that costs at least b. This algorithm is guaranteed to find a lowest-cost path and often expands fewer nodes than A* (see Exercise 3.11).
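
The following is a runnable sketch of one such interleaving. Several choices here are assumptions, since the text leaves them open: the strict alternation schedule, the dictionary graph representation used earlier, an admissible heuristic h, and lazy re-evaluation of forward priorities that go stale as b grows.

import heapq

def bidirectional_search(graph, start, goals, h):
    """Interleaved forward A*-like search and backward dynamic programming.
    Returns (cost of a lowest-cost path, forward path up to a node n settled
    by the backward search); the remaining segment is the backward
    lowest-cost path from n, not reconstructed in this sketch."""
    goals = set(goals)
    inverse = {}                       # the inverse graph: all arcs reversed
    for u, arcs in graph.items():
        for v, cost in arcs:
            inverse.setdefault(v, []).append((u, cost))
    cost_to_goal = {}                  # nodes settled by the backward search
    back = [(0, g) for g in goals]     # backward (Dijkstra) frontier
    heapq.heapify(back)
    b = 0                              # all paths to a goal costing < b are explored

    def c(n):                          # heuristic used by the forward search
        return cost_to_goal[n] if n in cost_to_goal else max(h(n), b)

    fwd = [(c(start), 0, start, (start,))]   # (priority, cost so far, node, path)
    best = {start: 0}
    while fwd:
        if back:                       # one backward step: settle the cheapest node
            d, n = heapq.heappop(back)
            b = d                      # Dijkstra pops in nondecreasing order
            if n not in cost_to_goal:
                cost_to_goal[n] = d
                for m, cost in inverse.get(n, []):
                    if m not in cost_to_goal:
                        heapq.heappush(back, (d + cost, m))
        f, cost, n, path = heapq.heappop(fwd)    # one forward step
        if f < cost + c(n):            # priority went stale as b grew: re-queue
            heapq.heappush(fwd, (cost + c(n), cost, n, path))
            continue
        if n in cost_to_goal or n in goals:      # exact completion known: optimal
            return cost + cost_to_goal.get(n, 0), path
        for m, arc_cost in graph.get(n, []):
            new_cost = cost + arc_cost
            if new_cost < best.get(m, float('inf')):   # cheaper route to m found
                best[m] = new_cost
                heapq.heappush(fwd, (new_cost + c(m), new_cost, m, path + (m,)))
    return None                        # no path from start to a goal

With h(n) = 0 everywhere this degenerates to an interleaved bidirectional lowest-cost-first search; the gain over plain A* appears when the backward table lets the forward search price paths near the goal exactly rather than heuristically.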

Conclusion: Having a counterexample would seem to mean that the optimality of A* is false. However, the proof does seem to have some appeal, and perhaps it should not be dismissed outright. A* is not optimal among the class of all search algorithms, but the proof does seem to hold for the class of algorithms that carry out only forward search. See Exercise 3.12.


Dynamic programming can be used to construct heuristics for A* and branch-and-bound searches. One way to build a heuristic function is to simplify the problem (e.g., by leaving out some details) until the simplified problem has a small enough state space. Dynamic programming can be used to find the optimal path length to a goal in the simplified problem. This information can then be used as a heuristic for the original problem.
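
As a sketch of that construction (here simplify, the simplified graph, and its goals are hypothetical placeholders for whatever abstraction is chosen; cost_to_goal_table is the sketch from earlier in this section):

def dp_heuristic(simplified_graph, simplified_goals, simplify):
    """Build h(n) from a dynamic-programming table over a simplified problem.
    simplify maps an original node to a simplified-problem node.  If the
    simplification only drops constraints (so every original path has a
    simplified counterpart of no greater cost), h is admissible."""
    table = cost_to_goal_table(simplified_graph, simplified_goals)
    def h(n):
        # 0 is the safe underestimate if the simplified node was not reached.
        return table.get(simplify(n), 0)
    return h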