# 15.2.1 Structure Learning: Inductive Logic Programming

The task of predicting which relations are true based on the truth of other relations has mainly been carried out in the framework of logic programming, and so is typically called inductive logic programming.

###### Example 15.12.

Suppose a trading agent has a data set, in terms of individual–property–value triples, about which resorts a person likes, as in Figure 15.1.

The agent wants to learn what Joe likes. What is important is not the value of the $likes$ property, which is just a meaningless name, but the properties of the individual denoted by the name. Feature-based representations cannot do anything with this data set beyond learning that Joe likes $resort\_14$, but not $resort\_35$.

The theory the agent should learn is that, for example, Joe likes resorts near sandy beaches. This theory can be expressed as a logic program:

 $\displaystyle{{prop}(joe,likes,R)\leftarrow\mbox{}}$ $\displaystyle\ \ \ \ {{prop}(R,type,resort)\wedge\mbox{}}$ $\displaystyle\ \ \ \ {{prop}(R,near,B)\wedge\mbox{}}$ $\displaystyle\ \ \ \ {{prop}(B,type,beach)\wedge\mbox{}}$ $\displaystyle\ \ \ \ {{prop}(B,covered\_in,S)\wedge\mbox{}}$ $\displaystyle\ \ \ \ {{prop}(S,type,sand).}$

Logic programs provide the ability to be able to represent deterministic theories about individuals and relations. This rule can be applied to resorts and beaches that Joe has not yet visited.

The input to an inductive logic programming learner includes the following:

• $A$ is a set of atoms whose definitions the agent is learning.

• $E^{+}$ is a set of ground instances of elements of $A$, called the positive examples, that are observed to be true.

• $E^{-}$ is a set of ground instances of elements of $A$, called the negative examples, that are observed to be false.

• $B$, the background knowledge, is a set of clauses that define relations that can be used in the learned logic programs.

• $H$ is a space of possible hypotheses. $H$ is often represented implicitly as a set of operators that can generate the possible hypotheses. Each hypothesis is a logic program.

###### Example 15.13.

In Example 15.12, suppose the agent wants to learn what Joe likes. In this case, the inputs are

• $A=\{prop(joe,likes,R)\}$

• $E^{+}=\{prop(joe,likes,resort\_14),\dots\}$. For the example that follows, assume there are many such items that Joe likes.

• $E^{-}=\{prop(joe,likes,resort\_35),\dots\}$. These are written in the positive form; they have been observed to be false.

• $B=\{prop(resort\_14,type,resort),prop(resort\_14,near,beach\_18),\dots\}$. This set contains all of the background facts about the world that are not instances of $A$. The agent is not learning about these.

• $H$ is a set of logic programs defining $prop(joe,likes,R)$. The heads of the clauses unify with $prop(joe,likes,R)$. $H$ is too big to enumerate.

All of these, except for $H$, are given explicitly in the formulation of the problem.

The aim is to find a simplest hypothesis $h\in H$ such that

 $\displaystyle{B\wedge h\models E^{+}\hbox{ and }}$ $\displaystyle{B\wedge h\not\models E^{-}.}$

That is, the hypothesis implies the positive evidence and does not imply the negative evidence. It must be consistent with the negative evidence being false.

The aim is to find an element of the version space, where the elements of the version space are logic programs. This is similar to the definition of abduction, in which the knowledge base in abduction corresponds to the background knowledge. The hypothesis space of inductive logic programming is the set of logic programs. The second condition corresponds to consistency.

Assume that there is a single target $A=\{t(X_{1},\dots,X_{n})\}$ The hypothesis space consists of possible definitions for this relation using a logic program.

There are two main strategies used in inductive logic programming:

• The first strategy is to start with the simplest hypotheses and make them more complicated to fit the data. Because the logic program only states positive facts, the simplest hypothesis is the empty program, which specifies that $t(X_{1},\dots,X_{n})$ is always false. This is is the most specific hypothesis, but is not correct unless $E^{+}$ is empty. The second simplest hypothesis is the most general hypothesis, which is simply that $t(X_{1},\dots,X_{n})$ is always true. This hypothesis implies the positive examples, but it also implies the negative examples, if any exist. One strategy involves a general-to-specific search. It tries to find the simplest hypothesis that fits the data by searching the hypothesis space from the most general hypothesis to more complex hypotheses, always implying the positive examples, until a hypothesis that does not imply the negative examples is found.

• The second strategy is to start with a hypothesis that fits the data and to make it simpler while still fitting the data. A hypothesis that fits the data is the set of positive examples. This strategy involves a specific-to-general search: start with the very specific hypothesis in which only the positive examples are true, and then generalize the clauses, avoiding the negative cases.

Here we expand on the general-to-specific search for definite clauses. The initial hypothesis contains a single clause:

 $\{t(X_{1},\dots,X_{n})\leftarrow\}.$

A specialization operator takes a set $G$ of clauses and returns a set $S$ of clauses that specializes $G$. To specialize means that $S\models G$.

The following are three primitive specialization operators:

• Split a clause in $G$ on condition $c$. Clause $a\leftarrow\mbox{}b$ in $G$ is replaced by two clauses: $a\leftarrow\mbox{}b\wedge c$ and $a\leftarrow\mbox{}b\wedge\neg c$.

• Split a clause $a\leftarrow\mbox{}b$ in $G$ on a variable $X$ that appears in $a$ or $b$. Clause $a\leftarrow\mbox{}b$ is replaced by the clauses

 $\displaystyle{a\leftarrow\mbox{}b\wedge\mbox{}X=t_{1}.}$ $\displaystyle{\dots}$ $\displaystyle{a\leftarrow\mbox{}b\wedge\mbox{}X=t_{k}.}$

where the $t_{i}$ are terms.

• Remove a clause that is not necessary to prove the positive examples.

The last operation changes the predictions of the clause set. Those cases no longer implied by the clauses are false.

These primitive specialization operators are used together to form the operators of $H$. The operators of $H$ are combinations of primitive specialization operations designed so that progress can be evaluated using a greedy look-ahead. That is, the operators are defined so that one step is adequate to evaluate progress.

The first two primitive specialization operations should be carried out judiciously to ensure that the simplest hypothesis is found. An agent should carry out the splitting operations only if they make progress. Splitting is only useful when combined with clause removal. For example, adding an atom to the body of a clause is equivalent to splitting on the atom and then removing the clause containing the negation of the atom. A higher-level specialization operator may be to split on the atom ${prop}(X,type,T)$ for some variable $X$ that appears in the clause, split on the values of $T$, and remove the resulting clauses that are not required to imply the positive examples. This operator makes progress on determining which types are useful.

Figure 15.2 shows a nondeterministic algorithm for the top-down induction of a logic program. It maintains a single hypothesis that it iteratively improves until finding a hypothesis that fits the data or until it fails to find such a hypothesis. The “choose” of line 15 can be implemented by search.

At each time step it chooses an operator to apply with the constraint that every hypothesis entails the positive examples. This algorithm glosses over two important details:

• which operators to consider and

• which operator to select.

The operators should be at a level so that they can be evaluated according to which one is making progress toward a good hypothesis. In a manner similar to decision tree learning, the algorithm can perform a myopically optimal choice. That is, it chooses the operator that makes the most progress in minimizing the error. The error of a hypothesis can be the number of negative examples that are implied by the hypothesis.

###### Example 15.14.

Consider Example 15.12, in which the agent must learn about Joe’s likes and dislikes.

The first hypothesis is that Joe likes everything:

 $\{prop(joe,likes,R)\leftarrow{}\},$

which is inconsistent with the negative evidence.

It can split the clause on conditions or split on a variable. The only way for the specialization to make progress – to prove fewer negative examples while implying all positive examples – is for the created rules to contain the variable $R$ in the body.

• It could consider splitting on the property $type$, splitting on the value of $type$, and keeping only those that are required to prove the positive examples. This results in the following clause (assuming the positive examples only include resorts):

 $\{prop(joe,likes,R)\leftarrow prop(R,type,resort)\}.$

There can be other clauses if the positive examples include things other than resorts that Joe likes. If the negative examples include non-resorts, this split would be useful in reducing the error.

• It could consider splitting on other properties that can be used in proofs of the positive examples, such as $near$, resulting in

 $\{prop(joe,likes,R)\leftarrow prop(R,near,B)\}$

if all of the positive examples are near something. If some of the negative examples are not near anything, this specialization could be useful in reducing the error.

• It could consider splitting on the variable $R$, considering different constants for $R$. This does not allow for generalization.

It could then choose whichever of these splits makes the most progress. In the next step it could add the other one, other properties of $R$, or properties of $B$.