Full text of the second edition of Artificial Intelligence: foundations of computational agents, Cambridge University Press, 2017 is now available.

#### 4.10.2.1 Continuous Domains

For optimization where the domains are continuous, a local search becomes more complicated because it is not obvious what the neighbors of a total assignment are. This problem does not arise with hard constraint satisfaction problems because the constraints implicitly discretize the space.

For optimization, **gradient descent** can be
used to find a minimum value,
and **gradient ascent** can be used to find a maximum value. Gradient
descent is like walking downhill and always taking a step in the
direction that goes down
the most.
The general idea is that the neighbor of a total assignment is to step
downhill in proportion to the slope of the evaluation function *h*. Thus,
gradient descent takes steps in each direction proportional to
the negative of the partial derivative in that direction.

In one dimension, if *X* is a real-valued variable with the current
value of *v*, the next value should be

v-η×(dh/dX)(v),

where

*η*, the**step size**, is the constant of proportionality that determines how fast gradient descent approaches the minimum. If*η*is too large, the algorithm can overshoot the minimum; if*η*is too small, progress becomes very slow.*(d h)/(d X)*, the derivative of*h*with respect to*X*, is a function of*X*and is evaluated for*X=v*.

**Example 4.32:**Figure 4.12 shows a typical one-dimensional example for finding a local minimum of a one-dimensional function. It starts at a position marked as 1. The derivative is a big positive value, so it takes a step to the left to position 2. Here the derivative is negative, and closer to zero, so it takes a smaller step to the right to position 3. At position 3, the derivative is negative and closer to zero, so it takes a smaller step to the right. As it approaches the local minimum value, the slope becomes closer to zero and it takes smaller steps.

For multidimensional optimization, when there are many variables,
gradient descent takes a step in each dimension proportional to the
partial derivative of that dimension.
If *⟨X _{1},..., X_{n}⟩* are the variables that have to be
assigned values, a total assignment corresponds to a tuple of values

*⟨v*. Assume that the evaluation function,

_{1},..., v_{n}⟩*h*, is differentiable. The next neighbor of the total assignment

*⟨v*is obtained by moving in each direction in proportion to the slope of

_{1},..., v_{n}⟩*h*in that direction. The new value for

*X*is

_{i}v_{i}- η×(∂h /∂X_{i})(v_{1},..., v_{n}),

where *η* is the **step
size**. The partial derivative,
*(∂h /∂X _{i})*,
is a function of

*X*. Applying it to the point

_{1},...,X_{n}*(v*gives

_{1},..., v_{n})(∂h /∂X_{i})(v_{1},..., v_{n}) = lim_{ε→∞}(h(v_{1},...,v_{i}+ε,..., v_{n})- h(v_{1},...,v_{i},..., v_{n}))/ε.

If the partial derivative of *h* can be computed analytically, it is
usually good to do so. If not, it can be estimated
using a small value of *ε*.

Gradient descent is used for parameter learning, in which there may be thousands of real-valued parameters to be optimized. There are many variants of this algorithm that we do not discuss. For example, instead of using a constant step size, the algorithm could do a binary search to determine a locally optimal step size.