15.4 Datalog: A Relational Rule Language

Datalog expands the language of propositional definite clauses to include individuals and relations. It can be seen as a restricted form of the predicate logic where

• The only formulas allowed are definite clauses, of the form $h\leftarrow\mbox{}a_{1}\wedge\mbox{}\ldots\wedge\mbox{}a_{m}$, where $h$ is the head and $a_{1}\wedge\mbox{}\ldots\wedge\mbox{}a_{m}$ is the body. If $m>0$, the clause is called a rule. If $m=0$, the “$\leftarrow\mbox{}$” is ignored and the clause is called a fact.

• There is no explicit quantification; all variables are assumed to be universally quantified at the outside of the clause.

The clause $h(X)\leftarrow\mbox{}b(X,Y)$ thus means $\forall X~{}\forall Y~{}(h(X)\leftarrow\mbox{}b(X,Y))$, which is equivalent to $\forall X~{}(h(X)\leftarrow\mbox{}\exists Y~{}b(X,Y))$.

Datalog is of interest because it is a database language for defining and querying relations.

Example 15.10.

A relation is often written as a table:

Course Section Time Room
cs111 7 830 dp101
cs422 2 1030 cc208
cs502 1 1230 dp202

This can be represented using a relation $scheduled(Course,Section,Time,Room)$, with the knowledge base containing the facts

 $\begin{array}[]{l}scheduled(cs111,7,830,dp101).\\ scheduled(cs422,2,1030,cc208).\\ scheduled(cs502,1,1230,dp202).\end{array}$

Suppose the relation $enrolled(StudentNum,Course,Section)$ is also defined by facts. The relation $busy(StudentNum,Time)$, which is the join of these relations, projected into student number and time, can be represented using the rule

 $busy(Student,Time)\leftarrow\mbox{}$ $enrolled(Student,Course,Section)\wedge\mbox{}$ $scheduled(Course,Section,Time,Room).$

A student is busy at a time if they are enrolled in a section of a course that is scheduled at that time.

Example 15.11.

Example 5.8 represented the electrical environment of Figure 5.2 using propositions. Using individuals and relations can make the representation more intuitive, because the general knowledge about how switches work can be clearly separated from the knowledge about a specific house.

To represent this domain, the first step is to decide what individuals exist in the domain. In what follows, assume that each switch, each light, and each power outlet is an individual. Each wire between two switches and between a switch and a light is also an individual. Someone may claim that, in fact, there are pairs of wires joined by connectors and that the electricity flow must obey Kirchhoff’s laws. Someone else may decide that even that level of abstraction is inappropriate because the flow of electrons should be modeled. However, an appropriate level of abstraction is one that is useful for the task at hand. A resident of the house may not know the whereabouts of the connections between the individual strands of wire or even the voltage. Therefore, let’s assume a flow model of electricity, where power flows from the outside of the house through wires to lights. This model is appropriate for the task of determining whether a light should be lit or not, but it may not be appropriate for other tasks.

Next, give names to each individual to which you want to refer. This is done in Figure 5.2 by writing the name next to the component. For example, the individual $w_{0}$ is the wire between light $l_{1}$ and switch $s_{2}$.

Next, choose which relationships to represent. Assume the following predicates with their associated intended interpretations:

• $light(L)$ is true if the individual denoted by $L$ is a light.

• $lit(L)$ is true if the light $L$ is lit and emitting light.

• $live(W)$ is true if there is power coming into $W$; that is, $W$ is live.

• $up(S)$ is true if switch $S$ is up.

• $down(S)$ is true if switch $S$ is down.

• $ok(E)$ is true if $E$ is not faulty; $E$ can be either a circuit breaker or a light.

• $connected\_to(X,Y)$ is true if component $X$ is connected to $Y$ such that current would flow from $Y$ to $X$.

At this stage, the computer has not been told anything. It does not know what the predicates are, let alone what they mean. It does not know which individuals exist or their names.

Before anything about the particular house is known, the system can be told general rules such as

 $lit(L)\leftarrow\mbox{}light(L)\wedge\mbox{}live(L)\wedge\mbox{}ok(L).$

This means that a light is lit if it has electricity coming in and is not broken.

Recursive rules let you state what is live from what is connected to what:

 $live(X)\leftarrow\mbox{}connected\_to(X,Y)\wedge\mbox{}live(Y).$ $live(outside).$

This specifies that there is electricity coming into the building, and if there is electricity coming into $Y$ and $X$ is connected to $Y$, then there will be electricity coming into $X$. Note that the order of clauses and the order of the elements of the body has no effect on the semantics.

For the particular house and configuration of components and their connections, the following facts about the world can be told to the computer:

 $\begin{array}[]{lll}{light(l_{1}).}&{light(l_{2}).}&{down(s_{1}).}\\ {up(s_{2}).}&{ok(cb_{1}).}\\ {connected\_to(w_{0},w_{1})\leftarrow\mbox{}up(s_{2}).}\\ {connected\_to(w_{0},w_{2})\leftarrow\mbox{}down(s_{2}).}\\ {connected\_to(w_{1},w_{3})\leftarrow\mbox{}up(s_{1}).}\\ {connected\_to(w_{3},outside)\leftarrow\mbox{}ok(cb_{1}).}\end{array}$

These rules and atomic clauses are all that the computer is told. It does not know the meaning of these symbols. However, it can now answer queries about this particular house.

15.4.1 Queries with Variables

Queries are used to ask whether some statement is a logical consequence of a knowledge base. With propositional queries, a user can only ask yes-or-no queries. Queries with variables allow users to ask for the individuals that make the query true.

An instance of a query is obtained by substituting terms for the variables in the query. Each occurrence of a distinct variable in a query must be replaced by the same term; if variable $X$ appears in a formula, each occurrence of $X$ must be replaced by the same term.

Given a query with free variables, an answer is either an instance of the query that is a logical consequence of the knowledge base, or “$no$”, meaning that no instances of the query logically follow from the knowledge base. Instances of the query are specified by providing values for the variables in the query. Determining which instances of a query follow from a knowledge base is known as answer extraction.

An answer of “no” does not mean that the query is false in the intended interpretation; it simply means that there is no instance of the query that is a logical consequence.

Example 15.12.

Consider the facts about some of the counties in South America in Figure 15.2. The computer knows nothing about geography or South America. All it knows are the clauses it is given, however it can compute logical consequences. Note that the constants denoting the countries and languages start with a lower-case letter, as that is the convention of the language used; English has the opposite convention, where proper nouns start with an upper-case letter.

The user can ask the following query:

 $language(chile,spanish).$

 $language(venezuela,spanish).$

and the answer is no. This means it is not a logical consequence, not that it is false. There is not enough information in the database to determine the principal language of $venezuela$.

The query

 $language(C,spanish).$

has four answers. The answer with $X{\,{=}\,}chile$ means $language(chile,spanish)$ is a logical consequence of the clauses.

The borders relation is true of two countries when they share a border. This relation is symmetric; $borders(X,Y)$ if and only if $borders(Y,X)$. To represent this, Figure 15.2 represents one of the pairs (arbitrarily, with the alphabetically first one as the first argument) using $borders0$, and then $borders$ is derived from this. This uses half as many facts as would be if $borders$ were represented directly, with the extra cost of two rules.

The query

 $\mbox{{ask}~{}}~{}borders0(paraguay,X).$

 $\mbox{{ask}~{}}~{}borders(paraguay,X).$

has two answers: $X{\,{=}\,}argentina$ and $X{\,{=}\,}brazil$.

The query

 $\mbox{{ask}~{}}~{}borders(X,Y).$

To ask for a country that borders Chile with an area greater than two million square kilometers, one could give the query

 $\mbox{{ask}~{}}~{}borders(chile,Y)\land area(Y,A)\land A>2000000.$

where $>$ is a predicate that is true if its arguments are numbers and its left argument is greater than its right argument. As is traditional in mathematics, this predicate is written using infix notation.

Example 15.13.

Figure 15.3 gives more facts about some South American countries and their capitals. Note how it distinguishes the name of the country from the country itself. The constant chile denotes the county, and the constant ”Chile” denotes the string that is the name of the country. These are very different, for example ”Chile” is five characters long, whereas chile is 4270 kilometers long (from North to South).

To ask for the name of the capital city of Chile, one would ask

 $\mbox{{ask}~{}}~{}capital(chile,C)\land name(C,N).$