Artificial Intelligence - foundations of computational agents -- 4.10.2.1 Continuous Domains

Third edition of Artificial Intelligence: foundations of computational agents, Cambridge University Press, 2023 is now available (including the full text).

4.10.2.1 Continuous Domains

For optimization where the domains are continuous, a local search becomes more complicated because it is not obvious what the neighbors of a total assignment are. This problem does not arise with hard constraint satisfaction problems because the constraints implicitly discretize the space.

For optimization, gradient descent can be used to find a minimum value, and gradient ascent can be used to find a maximum value. Gradient descent is like walking downhill and always taking a step in the direction that goes down the most. The general idea is that the neighbor of a total assignment is to step downhill in proportion to the slope of the evaluation function h. Thus, gradient descent takes steps in each direction proportional to the negative of the partial derivative in that direction.

In one dimension, if X is a real-valued variable with the current value of v, the next value should be

v-η×(dh/dX)(v),

where

η, the step size, is the constant of proportionality that determines how fast gradient descent approaches the minimum. If η is too large, the algorithm can overshoot the minimum; if η is too small, progress becomes very slow.
(d h)/(d X), the derivative of h with respect to X, is a function of X and is evaluated for X=v.

Figure 4.12: Gradient descent

Example 4.32: Figure 4.12 shows a typical one-dimensional example for finding a local minimum of a one-dimensional function. It starts at a position marked as 1. The derivative is a big positive value, so it takes a step to the left to position 2. Here the derivative is negative, and closer to zero, so it takes a smaller step to the right to position 3. At position 3, the derivative is negative and closer to zero, so it takes a smaller step to the right. As it approaches the local minimum value, the slope becomes closer to zero and it takes smaller steps.

For multidimensional optimization, when there are many variables, gradient descent takes a step in each dimension proportional to the partial derivative of that dimension. If ⟨X₁,..., X_n⟩ are the variables that have to be assigned values, a total assignment corresponds to a tuple of values ⟨v₁,..., v_n⟩. Assume that the evaluation function, h, is differentiable. The next neighbor of the total assignment ⟨v₁,..., v_n⟩ is obtained by moving in each direction in proportion to the slope of h in that direction. The new value for X_i is

v_i - η×(∂h /∂X_i)(v₁,..., v_n),

where η is the step size. The partial derivative, (∂h /∂X_i), is a function of X₁,...,X_n. Applying it to the point (v₁,..., v_n) gives

(∂h /∂X_i)(v₁,..., v_n) = lim_ε→∞ (h(v₁,...,v_i+ε,..., v_n)- h(v₁,...,v_i,..., v_n))/ε.

If the partial derivative of h can be computed analytically, it is usually good to do so. If not, it can be estimated using a small value of ε.

Gradient descent is used for parameter learning, in which there may be thousands of real-valued parameters to be optimized. There are many variants of this algorithm that we do not discuss. For example, instead of using a constant step size, the algorithm could do a binary search to determine a locally optimal step size.