
11.2.4 Structure Learning

Suppose we have complete data and no hidden variables, but we do not know the structure of the belief network. This is the setting for structure learning of belief networks.

There are two main approaches to structure learning:

  • The first is to use the definition of a belief network in terms of conditional independence. Given a total ordering of the variables, make the parents of a variable X be a subset of the predecessors of X in the total ordering that render X independent of its other predecessors. This approach has two main challenges: the first is to determine the best total ordering; the second is to find a way to measure independence. It is difficult to determine conditional independence when there is limited data.
  • The second method is to have a score for networks, for example, using the MAP model, which takes into account fit to the data and model complexity. Given such a measure, you can search for the structure that maximizes the score.

In this section we concentrate on the second method, often called a search and score method.

Assume that the data is a set E of examples, where each example has a value for each variable.
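
For concreteness in the sketches that follow, assume the examples are represented as a list of dictionaries, one per example, each mapping every variable to its observed value. The variable names, values, and candidate structure below are invented purely for illustration:

```python
# A toy complete dataset: each example assigns a value to every variable.
# Variable names and values are hypothetical, purely for illustration.
data = [
    {"Fire": True,  "Smoke": True,  "Alarm": True},
    {"Fire": False, "Smoke": False, "Alarm": False},
    {"Fire": False, "Smoke": True,  "Alarm": True},
    {"Fire": False, "Smoke": False, "Alarm": True},
]
variables = ["Fire", "Smoke", "Alarm"]

# A candidate structure: a dictionary mapping each variable to its parents.
model = {"Fire": [], "Smoke": ["Fire"], "Alarm": ["Smoke"]}
```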

The aim of the search and score method is to choose a model that maximizes

P(model|data)∝P(data|model) P(model).

The likelihood, P(data|model), is the product of the probabilities of the examples. Using the product decomposition, the probability of each example given the model is the product of the probability of each variable given its parents in the model. Thus,

P(data|model) P(model)
= (∏e∈E P(e|model)) P(model)
= (∏e∈E ∏Xi Pmodele(Xi | par(Xi,model))) P(model),

where par(Xi,model) denotes the parents of Xi in the model, and Pmodele(·) denotes the probability of example e as specified in the model.

This is maximized when its logarithm is maximized. When taking logarithms, products become sums:

log P(data|model) + log P(model)
= ∑e∈E ∑Xi log Pmodele(Xi | par(Xi,model)) + log P(model).
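
As a concrete sketch of this score, the function below computes log P(data|model) + log P(model) for a candidate structure represented as a dictionary mapping each variable to its list of parents (as in the toy data above). The conditional probabilities are estimated from Laplace-smoothed counts over the data; that estimator, the helper names, and passing the log prior in as a constant are assumptions made here for illustration, not details given in the text.

```python
import math

def estimated_prob(data, var, parents, example, pseudo=1.0):
    """Estimate P(var = example[var] | parents take their values in example)
    from the data using Laplace-smoothed counts (an assumed estimator)."""
    parent_vals = [example[p] for p in parents]
    matching = [e for e in data if [e[p] for p in parents] == parent_vals]
    agree = sum(1 for e in matching if e[var] == example[var])
    domain_size = len({e[var] for e in data})   # observed domain of var
    return (agree + pseudo) / (len(matching) + pseudo * domain_size)

def log_score(data, model, log_prior=0.0):
    """log P(data|model) + log P(model)
       = sum over examples e and variables Xi of
         log P(Xi | par(Xi, model)), plus log P(model)."""
    total = log_prior
    for example in data:
        for var, parents in model.items():
            total += math.log(estimated_prob(data, var, parents, example))
    return total
```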

To make this approach feasible, assume that the prior probability of the model decomposes into components for each variable. That is, assume that the probability of the model is the product of the probabilities of local models, one for each variable. Let model(Xi) be the local model for variable Xi, so that P(model) = ∏Xi P(model(Xi)).
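
The text does not fix a particular form for P(model(Xi)). One decomposable choice, used here purely as an illustrative assumption, penalizes the complexity of each local model, for example in proportion to the number of free parameters of a tabular conditional probability table:

```python
def log_local_prior(data, var, parents, penalty=1.0):
    """A hypothetical decomposable prior: log P(model(Xi)) proportional to
    minus the number of free parameters of a tabular CPT for Xi.
    This is one possible choice, not the only one."""
    domain = lambda v: {e[v] for e in data}      # observed values of v
    num_parent_configs = 1
    for p in parents:
        num_parent_configs *= len(domain(p))
    free_params = num_parent_configs * (len(domain(var)) - 1)
    return -penalty * free_params
```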

Thus, we want to maximize the following:

∑e∈E ∑Xi log Pmodele(Xi | par(Xi,model)) + ∑Xi log P(model(Xi))
= ∑Xi (∑e∈E log Pmodele(Xi | par(Xi,model))) + ∑Xi log P(model(Xi))
= ∑Xi (∑e∈E log Pmodele(Xi | par(Xi,model)) + log P(model(Xi))).
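
Continuing the sketch above (and reusing the assumed estimated_prob and log_local_prior helpers), this decomposition means each variable can be scored on its own, and the total score is just the sum of the per-variable scores:

```python
def local_score(data, var, parents):
    """Score contribution of one variable:
       sum over examples of log P(var | parents) + log P(model(var))."""
    fit = sum(math.log(estimated_prob(data, var, parents, e)) for e in data)
    return fit + log_local_prior(data, var, parents)

def total_score(data, model):
    """Sum of the per-variable scores; this equals log_score when the same
    decomposable prior is used."""
    return sum(local_score(data, var, parents)
               for var, parents in model.items())
```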

We could optimize this by optimizing each variable separately, except for the fact that the parent relation is constrained by the acyclic condition of the belief network. However, given a total ordering of the variables, we have a classification problem in which we want to predict the probability of each variable given its predecessors in the total ordering. To represent P(Xi|par(Xi,model)) we could use, for example, a decision tree with probabilities at the leaves [as described in Section 7.5.1.1] or learn a squashed linear function. Given the preceding score, we can search over total orderings of the variables to maximize this score.
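
As an illustrative sketch of this idea (reusing local_score from above, and again only an assumed implementation rather than the method prescribed by the text), the following chooses, for each variable, the subset of its predecessors in a given total ordering that maximizes the local score. Brute-force enumeration of subsets is only feasible for small networks, and the outer search over orderings is left implicit:

```python
from itertools import combinations

def best_parents(data, var, predecessors):
    """Choose the subset of predecessors that maximizes the local score
    (brute force over subsets; practical only for few predecessors)."""
    best, best_val = [], float("-inf")
    for k in range(len(predecessors) + 1):
        for subset in combinations(predecessors, k):
            val = local_score(data, var, list(subset))
            if val > best_val:
                best, best_val = list(subset), val
    return best

def learn_structure(data, ordering):
    """Given a total ordering, pick parents for each variable independently;
    the result is acyclic because every parent precedes its child."""
    model = {}
    for i, var in enumerate(ordering):
        model[var] = best_parents(data, var, ordering[:i])
    return model

# e.g. learn_structure(data, ["Fire", "Smoke", "Alarm"])
```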