Full text of the second edition of Artificial Intelligence: foundations of computational agents, Cambridge University Press, 2017 is now available.

### 3.7.6 Dynamic Programming

Dynamic programming is a general method for optimization that involves storing partial solutions to problems, so that a solution that has already been found can be retrieved rather than being recomputed. Dynamic programming algorithms are used throughout AI.

Dynamic programming can be
used for finding paths in graphs. Intuitively, **dynamic programming** for graph searching can be seen as constructing the *perfect* heuristic function so that
*A ^{*}*, even if it keeps only one element of the frontier, is
guaranteed to find a solution. This cost-to-goal function represents the exact cost of a minimal-cost
path from each node to the goal.

A **policy** is a specification of which arc to take from each
node.
The cost-to-goal function can be computed offline and can be used to build an optimal policy. Online, an agent can use this policy to determine what to do at each point.

Let *cost_to_goal(n)* be the actual cost of a lowest-cost path from node
*n* to a goal; *cost_to_goal(n)* can be defined as

cost_to_goal(n) = 0 if is_goal(n), min _{⟨n,m⟩∈A}( cost(⟨n,m⟩)+cost_to_goal(m)) otherwise.

The general idea is to start at the goal and build a table of
the *cost_to_goal(n)* value for each node. This can be done by carrying out a lowest-cost-first search, with multiple-path pruning, from
the goal nodes in the *inverse graph*, which is the graph with all arcs
reversed. Rather than having a goal to search for, the dynamic
programming algorithm records the *cost_to_goal* values for each node found.
It uses the inverse graph to compute the costs from each
node to the goal and not the costs from the goal to each node.
In essence, dynamic programming works backward from the goal by
trying to build the lowest-cost paths to the goal from each node in the graph.

For a particular goal, once the *cost_to_goal* value for each node has been
recorded, an agent can use the *cost_to_goal* value to determine the
next arc on an optimal path. From node *n* it should go to
a neighbor *m* that minimizes *cost(⟨n,m⟩)+cost_to_goal(m)*.
Following this policy will take the agent from any node to a goal
along a
lowest-cost path. Given *cost_to_goal*, determining which arc is optimal takes constant
time with respect to the size of the graph, assuming a bounded number
of neighbors for each node. Dynamic programming takes time and space linear in the size
of the graph to build the *cost_to_goal* table.

Dynamic programming is useful when

- the goal nodes are explicit (the previous methods only assumed a function that recognizes goal nodes);
- a lowest-cost path is needed;
- the graph is finite and small enough to be able to store the
*cost_to_goal*value for each node; - the goal does not change very often; and
- the policy is used a number of times for each goal, so that the cost of
generating the
*cost_to_goal*values can be amortized over many instances of the problem.

The main problems with dynamic programming are that

- it only works when the graph is finite and the table can be made small enough to fit into memory,
- an agent must recompute a policy for each different goal, and
- the time and space required is linear in the size of the graph, where the graph size for finite graphs is typically exponential in the path length.

**Example 3.18:**For the graph given in Figure 3.2, the cost from

*r123*to the goal is

*0*; thus,

cost_to_goal(r123)=0.

Continuing with a lowest-cost-first search from *r123*:

*cost_to_goal(o123)=4*

*cost_to_goal(o119)=13*

*cost_to_goal(o109)=29*

*cost_to_goal(b4)=36*

*cost_to_goal(b2)=39*

*cost_to_goal(o103)=41*

*cost_to_goal(b3)=43*

*cost_to_goal(b1)=45*

At this stage the backward search halts. Two things can be noticed here. First, if a node does not have a *cost_to_goal*
value, then no path to the goal exists from that node. Second, an agent
can quickly determine the next arc on a lowest-cost path to the goal for
any node. For example, if the agent is at *o103*, to determine a lowest-cost path to
*r123* it compares *4+43* (the cost of going via *b3*) with
*12+29* (the cost of going straight to *o109*) and can quickly
determine to go to *o109*.

When building the *cost_to_goal* function, the searcher has implicitly determined which
neighbor leads to the goal. Instead of determining at run time which neighbor is
on an optimal path, it can store this information.

**Optimality of the**

*A*algorithm^{*}A search algorithm is **optimal** if no other search algorithm uses less
time or space or expands fewer nodes, both with a guarantee of
solution quality. The optimal search algorithm would be one that picks the correct node at each choice. However, this specification is not
effective because we cannot directly
implement it. Whether such an algorithm is possible is an open
question (as to whether *P=NP*). There does, however, seem to be a
statement that can be proved.

**Optimality of A^{*}:**
Among search algorithms that only use arc costs
and a heuristic estimate of the cost from a node to a goal, no
algorithm expands fewer nodes than

*A*and guarantees to find a lowest-cost path.

^{*}**Proof sketch:**
Given only the information about the arc costs and the heuristic
information, unless the algorithm has expanded each path *p*, where
*f(p)* is less than the cost of an optimal path, it does not know
whether *p* leads to a lower-cost path. More formally, suppose an
algorithm *A'* found a path for a problem *P* where some path *p* was not
expanded such that *f(p)* was less than the solution found. Suppose there
was another problem *P'*, which was the same as *P*, except that there
really was a path via *p* with cost *f(p)*. The algorithm *A'* cannot tell
*P'* from *P*, because it did not expand the path *p*, so it would report
the same solution for *P'* as for *P*, but the solution found for *P*
would not be optimal for *P'* because the solution found has a higher cost than the
path via *p*. Therefore, an algorithm is not guaranteed to find
a lowest-cost path unless it explores all paths with *f*-values less than
the value of an optimal path; that is, it must explore all the paths that
*A ^{*}* explores.

**Counterexample:** Although this proof seems reasonable,
there are algorithms that explore fewer nodes. Consider an algorithm that does a forward *A ^{*}*-like search and
a backward dynamic programming search, where the steps are
interleaved in some way (e.g., by alternating between the forward
steps and the backward steps). The backward search builds a table of

*cost_to_goal(n)*values of the actual discovered cost from

*n*to a goal, and it maintains a bound

*b*, where it has explored all paths of cost less than

*b*to a goal. The forward search uses a priority queue on

*cost(p)+c(n)*, where

*n*is the node at the end of the path

*p*, and

*c(n)*is

*cost_to_goal(n)*if it has been computed; otherwise,

*c(n)*is

*max(h(n),b)*. The intuition is that, if a path exists from the end of path

*p*to a goal node, either it uses a path that has been discovered by the backward search or it uses a path that costs at least

*b*. This algorithm is guaranteed to find a lowest-cost path and often expands fewer nodes than

*A*(see Exercise 3.11).

^{*}**Conclusion:** Having a counterexample would seem to mean
that the optimality of *A ^{*}* is false. However, the proof does seem to
have some appeal and perhaps it should not be dismissed
outright.

*A*is not optimal out of the class of all algorithms, but the proof seems right for the class of algorithms that only do forward search. See Exercise 3.12.

^{*}Dynamic programming can be used to construct heuristics for *A ^{*}* and
branch-and-bound searches. One way to build a heuristic function is to simplify the problem (e.g.,
by leaving out some details) until the simplified problem has a
small enough state space. Dynamic
programming can be used to find the optimal path length
to a goal in
the simplified problem. This information can then be used as a heuristic for the
original problem.