15.7 Applications in Natural Language

Natural language processing is an interesting and difficult domain in which to develop and evaluate representation and reasoning theories. Many of the problems of AI arise in this domain; solving “the natural language problem” is almost as difficult as solving “the AI problem” because any domain can be expressed in natural language. The field of computational linguistics has a wealth of techniques and knowledge. This book only gives an overview.

 

Two Traditions for Natural Language Processing

Natural language processing systems in AI have generally followed two traditions:

  • Systems with broad coverage, trying to understand language in the wild, but being content with some errors.

  • Systems in a constrained domain where the users can be expected to use controlled natural language, and the results can be unambiguous.

Systems of the first type are used for predictive typing in smartphones and for broad-coverage language translation. Systems of the second type are used when the same people interact with the system on an ongoing basis, such as database queries for supermarket inventories, games, or when the language is stylized – such as parsing the question in the game of Jeopardy! (as was done by the Watson system; see references).

One difference is how to treat ungrammatical sentences or questions, or unfamiliar words. In the first type, these are treated like grammatical ones. However, systems of the second type may ask for clarification or rephrasing, or even fail for such sentences.

Translation using broad-coverage systems is good for casual use, where the cost of being wrong is low, or when the person you are communicating with can ask a question and interact to find the appropriate meaning. However, you might not want to use such a system when there is a large cost for errors or misunderstanding, such as for legal contracts.

The state of the art for systems with broad coverage is to learn them from data, for example using neural networks and deep learning.

This chapter considers systems of the second type, where the techniques presented here are still used.

 

There are at least three reasons for studying natural language processing:

  • Users want to communicate on their own terms and many prefer natural language to some artificial language or a graphical user interface. This is particularly important for casual users and those users, such as managers and children, who have neither the time nor the inclination to learn new interaction skills.

  • There is a vast store of information recorded in natural language that could be accessible using computers. Information is constantly generated in the form of tweets, blogs, books, news, business and government reports, and scientific papers, many of which are available online. A system requiring a great deal of information must be able to process natural language to retrieve much of the information available on computers.

  • Many of the problems of AI arise in a very clear and explicit form in natural language processing and, thus, it is a good domain in which to experiment with general theories.

There are at least three major aspects of natural language:

Syntax

The syntax describes the form of the language. Natural language is much more complicated than the formal languages used for logics and computer programs. Syntax is usually specified by a grammar. Some natural language models are represented using neural networks that predict each word from its context [see Section 8.5], without explicitly building a parse tree.

Semantics

The semantics provides the meaning of utterances or sentences of the language. Although general semantic theories exist, when natural language processing systems are built for a particular application, it is typical to use the simplest representation available. For example, in the development that follows, there is a fixed mapping between words and concepts in the knowledge base, which is inappropriate for many domains but simplifies development. There can be different senses for the same word (such as a bank of a river and a bank to keep money). Neural models typically represent words using fixed-length vectors called embeddings, which represent enough semantics to predict each word in context or other tasks they are trained on.

Pragmatics

The pragmatic component explains how the utterances relate to the world. To understand language, an agent should consider more than the sentence; it has to take into account the context of the sentence, the state of the world, the goals of the speaker and the listener, special conventions, and the like.

To understand the difference among these aspects, consider the following sentences, which might appear at the start of an AI textbook:

  • This book is about artificial intelligence.

  • The green frogs sleep soundly.

  • Colorless green ideas sleep furiously.

  • Furiously sleep ideas green colorless.

The first sentence would be quite appropriate at the start of such a book; it is syntactically, semantically, and pragmatically well formed. The second sentence is syntactically and semantically well formed, but it would appear very strange at the start of an AI book; it is not pragmatically well formed for that context. The last two sentences are by the linguist Noam Chomsky [1957]. The third sentence is syntactically well formed, but it is semantically nonsensical. The fourth sentence is syntactically ill formed; it does not make any sense – syntactically, semantically, or pragmatically.

The next section shows how to write a natural language query answering system that is applicable to very narrow domains using stylized natural language that users have to adhere to. This approach may be adequate for domains in which little, if any, ambiguity exists. At the other extreme are shallow but broad systems, such as the help system presented in Example 9.36 and Example 10.5.

15.7.1 Using Definite Clauses for Context-Free Grammars

This section shows how to use definite clauses to represent aspects of the syntax and semantics of natural language.

Languages are defined by their legal sentences. A sentence is a sequence of tokens, which typically represent words in the language, common phrases (such as “artificial intelligence”), and often include punctuation. Some models represent words in terms of their parts, for example, splitting off the ending such as “ing” and “er” as separate tokens or using sequences of characters (as in Example 8.11). Sentences are represented here using lists, and tokens as strings written in double quotes.

The legal sentences are specified by a grammar. A context-free grammar is defined by a set of rewrite rules, with non-terminal symbols transforming into a sequence of terminal and non-terminal symbols. A sentence of the language is a sequence of terminal symbols generated by such rewriting rules. For example, the grammar rule

sentence ⟼ noun_phrase, verb_phrase

means that a non-terminal symbol sentence can be a noun_phrase followed by a verb_phrase. The symbol “⟼” means “can be rewritten as.”

A context-free grammar provides a first approximation of the grammar of some natural languages. For natural languages, the terminal symbols are the tokens of the language. If a sentence of natural language is represented as a list of tokens, the following definite clause means that a list of words is a sentence if it is a noun phrase followed by a verb phrase:

sentence(S) ← noun_phrase(N) ∧ verb_phrase(V) ∧ append(N,V,S).

To say that the word “country” is a noun, you could write

noun(["country"]).
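Read procedurally, the append-based clause tries every way of splitting the token list into a prefix and a suffix. A minimal Python sketch of that reading, assuming two toy phrase tables (NOUN_PHRASES and VERB_PHRASES are illustrative names and contents, not part of the text):

```python
# Toy stand-ins for noun_phrase(N) and verb_phrase(V); these word lists
# are illustrative assumptions, not part of the grammar in the text.
NOUN_PHRASES = {("the", "country"), ("a", "city")}
VERB_PHRASES = {("borders", "Chile"), ("sleeps",)}


def is_noun_phrase(tokens):
    return tuple(tokens) in NOUN_PHRASES


def is_verb_phrase(tokens):
    return tuple(tokens) in VERB_PHRASES


def is_sentence(tokens):
    """sentence(S) <- noun_phrase(N) & verb_phrase(V) & append(N, V, S).

    append(N, V, S) holds for every split of S into a prefix N and a
    suffix V, so each split point is tried in turn.
    """
    return any(
        is_noun_phrase(tokens[:i]) and is_verb_phrase(tokens[i:])
        for i in range(len(tokens) + 1)
    )
```

In this sketch every non-terminal defined this way would have to re-split its own input, which is the bookkeeping an explicit append imposes.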

An alternative, simpler, representation of grammar rules, known as a definite-clause grammar (DCG), uses definite clauses without requiring an explicit append. To represent a context-free grammar, each non-terminal symbol s becomes a predicate with two arguments, s(L1,L2), which is true when list L2 is an ending of list L1 such that all of the words in L1 before L2 form a sequence of words of the category s. Lists L1 and L2 together form a difference list of words that make the class given by the non-terminal symbol, because it is the difference of these that forms the syntactic category.

Example 15.32.

Under this representation, noun_phrase(L1,L2) is true if list L2 is an ending of list L1 such that all of the words in L1 before L2 form a noun phrase. L2 is the part of L1 after the noun phrase.

noun_phrase(["large","country","bordering","Paraguay","borders","Chile"],
       ["bordering","Paraguay","borders","Chile"])

is true in the intended interpretation because “large country” forms a noun phrase.

noun_phrase(["large","country","bordering","Paraguay","borders","Chile"],
       ["borders","Chile"])

is also true because “large country bordering Paraguay” forms a noun phrase.

The grammar rule

sentence ⟼ noun_phrase, verb_phrase

means that there is a sentence between some L0 and L2 if there exists a noun phrase between L0 and L1 and a verb phrase between L1 and L2:

L0 |---- noun_phrase ----| L1 |---- verb_phrase ----| L2
L0 |-------------------- sentence -------------------| L2

This grammar rule can be specified as the clause

sentence(L0,L2) ←
       noun_phrase(L0,L1) ∧
       verb_phrase(L1,L2).

In general, the rule

h ⟼ b1, b2, …, bn

says that h is composed of a b1 followed by a b2, …, followed by a bn, and is written as the definite clause

h(L0,Ln) ←
       b1(L0,L1) ∧
       b2(L1,L2) ∧
       ⋮
       bn(Ln-1,Ln).

using the interpretation

L0 |- b1 -| L1 |- b2 -| L2 … Ln-1 |- bn -| Ln
L0 |------------------ h ------------------| Ln

where the Li are unique variables.

To say that non-terminal h gets mapped to the terminal symbols t1, …, tn, one would write

h([t1,…,tn|T],T)

using the interpretation

|- t1, …, tn -| T
|----- h -----|

Thus, h(L1,L2) is true if L1=[t1,…,tn|L2].

Example 15.33.

The rule that specifies that the non-terminal h can be rewritten to the non-terminal a followed by the non-terminal b followed by the terminal symbols c and d, followed by the non-terminal symbol e followed by the terminal symbol f and the non-terminal symbol g, can be written as

h ⟼ a, b, [c,d], e, [f], g

and can be represented as

h(L0,L6) ←
       a(L0,L1) ∧
       b(L1,[c,d|L3]) ∧
       e(L3,[f|L5]) ∧
       g(L5,L6).

Note that the translations L2=[c,d|L3] and L4=[f|L5] were done manually.
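The translation scheme can also be sketched with Python generators, where each non-terminal is a function from a token list L0 to the remainders Ln it can leave behind. The helper names terminals and seq, and the one-token definitions given to a, b, e, and g, are assumptions made only for this illustration:

```python
def terminals(*words):
    """h([t1,...,tn | T], T): succeed if the input starts with the words."""
    n = len(words)

    def parse(tokens):
        if tuple(tokens[:n]) == words:
            yield tokens[n:]

    return parse


def seq(*parts):
    """h(L0,Ln) <- b1(L0,L1) & ... & bn(Ln-1,Ln): thread the remainders."""

    def parse(tokens):
        if not parts:
            yield tokens
            return
        first, rest = parts[0], seq(*parts[1:])
        for remainder in first(tokens):
            yield from rest(remainder)

    return parse


# Toy definitions for the non-terminals a, b, e, g (assumed for illustration:
# each one matches a single token spelled like its name in upper case).
a, b, e, g = (terminals(w) for w in ("A", "B", "E", "G"))

# h -> a, b, [c,d], e, [f], g
h = seq(a, b, terminals("c", "d"), e, terminals("f"), g)
```

Here the intermediate lists are passed along by seq, so no manual translation of the terminal steps is needed; each call such as h(tokens) yields one remainder per way of recognizing an h at the front of tokens.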

% A noun phrase is a determiner (or article, such as “the” or “a”) followed by adjectives followed by a noun followed by an optional modifying phrase

noun_phrase(L0,L4) ←
       det(L0,L1) ∧
       adjectives(L1,L2) ∧
       noun(L2,L3) ∧
       omp(L3,L4).

% Adjectives consist of a (possibly empty) sequence of adjectives

adjectives(L,L).
adjectives(L0,L2) ←
       adj(L0,L1) ∧
       adjectives(L1,L2).

% A modifying phrase is a relation (verb or preposition) followed by a noun phrase

mp(L0,L2) ←
       reln(L0,L1) ∧
       noun_phrase(L1,L2).

% An optional modifying phrase is a modifying phrase or nothing

omp(L,L).
omp(L0,L1) ← mp(L0,L1).

% Some simple questions are “What” and “What is” questions

question(["What"|L0],L2) ←
       noun_phrase(L0,L1) ∧
       mp(L1,L2).
question(["What","is"|L0],L1) ←
       mp(L0,L1).

Figure 15.8: A context-free grammar for simple English questions

det(L,L).
det(["a"|L],L).
det(["the"|L],L).
noun(["country"|L],L).
noun(["city"|L],L).
noun([N|L],L) ← name(E,N).
adj(["large"|L],L).
reln(["bordering"|L],L).
reln(["borders"|L],L).
reln(["next","to"|L],L).
reln(["the","name","of"|L],L).

Figure 15.9: A simple dictionary

Figure 15.8 shows a simple grammar of English questions. Note that % indicates that the rest of the line is a comment. Figure 15.9 gives a simple dictionary of words and their parts of speech, which can be used with this grammar. The first rule for question allows for questions such as “What country borders Chile?” or “What large country bordering Paraguay borders Chile?” while the second rule allows for “What is bordering Chile?”

Example 15.34.

Consider the question “What large country bordering Paraguay borders Chile?” This is represented as a list of strings, one for each word.

For the grammar of Figure 15.8, the dictionary of Figure 15.9, and the definition of name of Figure 15.3, the query

ask noun_phrase(["large","country","bordering",
       "Paraguay","borders","Chile"],R).

has an answer

R=["bordering","Paraguay","borders","Chile"]

meaning “large country” is a noun phrase.

Another answer is

R=["borders","Chile"]

meaning “large country bordering Paraguay” is a noun phrase.
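The grammar of Figure 15.8 and the dictionary of Figure 15.9 can be sketched in Python, with each predicate s(L1,L2) becoming a generator that takes L1 and yields every possible L2. The helpers word, either, and empty, and the NAMES set standing in for the name relation of Figure 15.3, are assumptions made for this illustration:

```python
# Proper nouns; an assumed stand-in for the name relation of Figure 15.3.
NAMES = {"Paraguay", "Chile"}


def word(*ws):
    """Dictionary entry cat([w1,...,wn | L], L)."""
    def parse(tokens):
        if tuple(tokens[:len(ws)]) == ws:
            yield tokens[len(ws):]
    return parse


def either(*alts):
    """Several clauses for one non-terminal: try each in order."""
    def parse(tokens):
        for alt in alts:
            yield from alt(tokens)
    return parse


def empty(tokens):
    yield tokens


def proper_noun(tokens):
    if tokens and tokens[0] in NAMES:
        yield tokens[1:]


det = either(empty, word("a"), word("the"))
adj = word("large")
noun = either(word("country"), word("city"), proper_noun)
reln = either(word("bordering"), word("borders"),
              word("next", "to"), word("the", "name", "of"))


def adjectives(tokens):
    yield tokens                      # adjectives(L, L).
    for l1 in adj(tokens):            # adj followed by more adjectives
        yield from adjectives(l1)


def mp(tokens):
    for l1 in reln(tokens):
        yield from noun_phrase(l1)


def omp(tokens):
    yield tokens                      # omp(L, L).
    yield from mp(tokens)


def noun_phrase(tokens):
    for l1 in det(tokens):
        for l2 in adjectives(l1):
            for l3 in noun(l2):
                yield from omp(l3)


def question(tokens):
    if tokens[:1] == ["What"]:
        for l1 in noun_phrase(tokens[1:]):
            yield from mp(l1)
    if tokens[:2] == ["What", "is"]:
        yield from mp(tokens[2:])
```

Calling list(noun_phrase(...)) on the six-word list of this example yields the two remainders given above. Under this sketch it also yields the empty list, because the grammar accepts the whole list as a single noun phrase (“Paraguay borders Chile” being read as a noun followed by a modifying phrase).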

15.7.2 Augmenting the Grammar

A context-free grammar does not adequately express the complexity of the grammar of natural languages, such as English. Two mechanisms can be added to this grammar to make it more expressive:

  • extra arguments to the non-terminal symbols

  • arbitrary constraints on the rules.

The extra arguments enable us to do several things, including constructing a parse tree and representing a query to a database. The use of arbitrary arguments and conditions means that a definite-clause grammar can represent much more than a context-free grammar; it can represent anything computable by a Turing machine (see Exercise 15.13).
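As a small illustration of the added expressiveness, consider the language of strings consisting of n copies of a, then n copies of b, then n copies of c, which no context-free grammar generates. This example is a standard one and is not taken from the text. A hedged Python sketch, in which the count returned by repeat plays the role of an extra argument and the equality tests play the role of constraints on the rule (the names repeat and abc are illustrative):

```python
def repeat(word, tokens):
    """Non-terminal with one extra argument, the repetition count.

    Yields (count, remaining tokens) for every number of leading
    copies of `word` that could be consumed.
    """
    count = 0
    yield count, tokens
    while tokens[:1] == [word]:
        tokens = tokens[1:]
        count += 1
        yield count, tokens


def abc(tokens):
    """Accept n a's, then n b's, then n c's; the equality tests are the
    constraints added to the otherwise context-free rule."""
    for n_a, l1 in repeat("a", tokens):
        for n_b, l2 in repeat("b", l1):
            if n_b != n_a:
                continue
            for n_c, l3 in repeat("c", l2):
                if n_c == n_a and l3 == []:
                    return True
    return False
```

For instance, abc(list("aabbcc")) succeeds while abc(list("aabbc")) fails, a distinction that depends on comparing the three counts.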

15.7.3 Building a Natural Language Interface to a Database

The preceding grammar can be augmented to implement a simple natural language interface to a database. Instead of transforming sub-phrases into parse trees, you can transform them directly into the entity the part of speech is about. To do this, let’s make the following simplifying assumptions, which are not always true, but form a useful first approximation:

  • Proper nouns (such as “Chile”) correspond to individuals.

  • Nouns (e.g., “country”) and adjectives (e.g., “large”) correspond to properties.

  • Verbs (e.g., “borders”) and prepositions (e.g., “next to”) correspond to a binary relation between two individuals, the subject and the object.

In this case, a noun phrase represents an individual with a set of properties defining it. To answer a question, the system can find an individual that has these properties. A modifying phrase (such as a prepositional phrase or relative clause) describes an individual in terms of a relation with another individual. The following assumes only the linguistic structure required to implement database queries from limited natural language.

Example 15.35.

In the sentence “What large country borders Chile?” the phrase “large country” is the subject of the verb “borders” and “Chile” is the object. Assume the geographic database of Figure 15.2. For the individual S that is the subject, large(S) and country(S) are true. The object is the individual chile and the verb specifies borders(S,chile). Thus, the question “What large country borders Chile?” can be converted into the query

ask large(S) ∧ country(S) ∧ borders(S,chile)

where large is a predicate that might be true of countries larger than two million square kilometers.

The question “What is the name of the capital of a Spanish-speaking country that borders Argentina?” could be translated into asking for a value of S for the query

ask name(C,S) ∧ capital(X,C) ∧ language(X,spanish) ∧
       borders(X,argentina).
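To see how such a conjunctive query can be answered, here is a minimal Python sketch of a query evaluator. The FACTS set is a small toy database assumed for illustration (it stands in for Figure 15.2, which is not reproduced here), variables are written as strings starting with "?", and solve is a hypothetical helper name:

```python
# Toy facts standing in for the geographic database of Figure 15.2
# (an assumption; the figure is not reproduced here).
FACTS = {
    ("country", "argentina"), ("country", "brazil"), ("country", "chile"),
    ("country", "peru"),
    ("large", "argentina"), ("large", "brazil"),
    ("borders", "argentina", "chile"), ("borders", "peru", "chile"),
    ("borders", "brazil", "argentina"), ("borders", "chile", "argentina"),
    ("language", "argentina", "spanish"), ("language", "chile", "spanish"),
    ("language", "brazil", "portuguese"),
    ("capital", "chile", "santiago"), ("capital", "brazil", "brasilia"),
    ("name", "santiago", "Santiago"), ("name", "brasilia", "Brasilia"),
}


def solve(query, bindings=None):
    """Yield every variable binding that makes all atoms in the query true."""
    bindings = bindings or {}
    if not query:
        yield bindings
        return
    first, rest = query[0], query[1:]
    for fact in FACTS:
        if len(fact) != len(first) or fact[0] != first[0]:
            continue
        new = dict(bindings)
        for term, value in zip(first[1:], fact[1:]):
            if term.startswith("?"):
                # Bind the variable, or check it against its earlier binding.
                if new.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            yield from solve(rest, new)


large_query = [("large", "?S"), ("country", "?S"), ("borders", "?S", "chile")]
name_query = [("name", "?C", "?S"), ("capital", "?X", "?C"),
              ("language", "?X", "spanish"), ("borders", "?X", "argentina")]
```

With these toy facts, the first query binds ?S to argentina and the second binds ?S to the string "Santiago"; a real database would of course give more answers.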

% A noun phrase is a sequence of adjectives followed by a noun followed by an optional modifying phrase, all about the same individual (determiners such as “a” are treated as adjectives)

noun_phrase(L1,L4,Ind) ←
       adjectives(L1,L2,Ind) ∧
       noun(L2,L3,Ind) ∧
       omp(L3,L4,Ind).

% Adjectives consist of a sequence of adjectives about the same individual

adjectives(L0,L2,Ind) ←
       adj(L0,L1,Ind) ∧
       adjectives(L1,L2,Ind).
adjectives(L,L,Ind).

% A modifying phrase/relative clause is a relation (verb or preposition) followed by a noun phrase

mp(L0,L2,Subject) ←
       reln(L0,L1,Subject,Object) ∧
       noun_phrase(L1,L2,Object).

% An optional modifying phrase is either a modifying phrase or nothing

omp(L0,L1,Ind) ← mp(L0,L1,Ind).
omp(L,L,Ind).

% adj(L0,L1,Ind) is true if L0−L1 is an adjective that is true of Ind

adj(["large"|L],L,Ind) ← large(Ind).
adj([LangName,"speaking"|L],L,Ind) ←
       name(Lang,LangName) ∧ language(Ind,Lang).
adj(["a"|L],L,Ind).

% noun(L0,L1,Ind) is true if L0−L1 is a noun that is true of Ind

noun(["country"|L],L,Ind) ← country(Ind).
noun([N|L],L,E) ← name(E,N).

% reln(L0,L1,Sub,Obj) is true if L0−L1 is a relation on individuals Sub and Obj

reln(["borders"|L],L,Sub,Obj) ← borders(Sub,Obj).
reln(["bordering"|L],L,Sub,Obj) ← borders(Sub,Obj).
reln(["the","capital","of"|L],L,Sub,Obj) ← capital(Obj,Sub).
reln(["the","name","of"|L],L,Sub,Obj) ← name(Obj,Sub).

Figure 15.10: Simple grammar that directly answers a question

Figure 15.10 shows a simple grammar that parses an English question and answers it at the same time. This ignores most of the grammar of English, such as the differences between prepositions and verbs or between determiners and adjectives, and makes a guess at the meaning, even if the question is not grammatical. Adjectives, nouns, and noun phrases refer to an individual. The extra argument to the predicates is an individual which satisfies the adjectives and nouns. Here an mp is a modifying phrase, which could be a prepositional phrase or a relative clause. A reln, either a verb or a preposition, is a relation between two individuals, the subject and the object, so these are extra arguments to the reln predicate.

Example 15.36.

Suppose question(Q,A) means A is an answer to question Q, where a question is a list of words. The following provides some ways questions can be asked from the clauses of Figure 15.10, even given the very limited vocabulary used there. The following ignores punctuation.

The following rule is used to answer questions such as “What is next to Chile?”

question(["What","is"|L0],Ind) ←
       mp(L0,[],Ind).

Note that [] means that in the question there is nothing after the modifying phrase.

The following rule is used to answer questions such as “What is a large Spanish-speaking country next to Chile?”

question(["What","is"|L],Ind) ←
       noun_phrase(L,[],Ind).

The following rule allows it to answer questions, such as “What large country bordering Paraguay borders Chile?”

question(["What"|L0],Ind) ←
       noun_phrase(L0,L1,Ind) ∧
       mp(L1,[],Ind).

The preceding grammar directly found an answer to the natural language question. Two problems with this way of answering questions are:

  • It is difficult to separate the cases where the program could not parse the language from the case where there were no answers; in both cases the answer is “no”.

  • When there is ambiguity, meaning multiple legal parses, and there are possibly multiple answers for each parse, it gives all of the answers for all of the parses, without distinguishing them.

These properties make it difficult to debug the programs that use this style.
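The first problem can be seen in a small Python sketch of a grammar that answers while it parses. This is a generate-and-test approximation rather than the logic program itself: each candidate individual is tried in turn, and the toy knowledge base, the helper names, and the reduced vocabulary (no “speaking” adjectives) are all assumptions made for illustration:

```python
# Toy knowledge base (an assumption standing in for Figures 15.2 and 15.3).
COUNTRIES = {"argentina", "brazil", "chile", "colombia", "paraguay"}
LARGE = {"argentina", "brazil"}
BORDER_PAIRS = {("argentina", "chile"), ("argentina", "paraguay"),
                ("argentina", "brazil"), ("brazil", "paraguay"),
                ("brazil", "colombia")}
NAME = {c.capitalize(): c for c in COUNTRIES}   # "Chile" -> "chile"
INDIVIDUALS = sorted(COUNTRIES)


def borders(x, y):
    return (x, y) in BORDER_PAIRS or (y, x) in BORDER_PAIRS


def adj(tokens, ind):
    if tokens[:1] == ["large"] and ind in LARGE:
        yield tokens[1:]
    if tokens[:1] == ["a"]:
        yield tokens[1:]


def adjectives(tokens, ind):
    for l1 in adj(tokens, ind):
        yield from adjectives(l1, ind)
    yield tokens


def noun(tokens, ind):
    if tokens[:1] == ["country"] and ind in COUNTRIES:
        yield tokens[1:]
    if tokens and NAME.get(tokens[0]) == ind:
        yield tokens[1:]


def mp(tokens, subject):
    # reln: "borders" and "bordering" both denote the borders relation.
    if tokens[:1] in (["borders"], ["bordering"]):
        for obj in INDIVIDUALS:
            if borders(subject, obj):
                yield from noun_phrase(tokens[1:], obj)


def omp(tokens, ind):
    yield from mp(tokens, ind)
    yield tokens


def noun_phrase(tokens, ind):
    for l2 in adjectives(tokens, ind):
        for l3 in noun(l2, ind):
            yield from omp(l3, ind)


def parses(tokens, ind):
    """Remainders left by each way of reading tokens as a question about ind."""
    if tokens[:2] == ["What", "is"]:
        yield from mp(tokens[2:], ind)
        yield from noun_phrase(tokens[2:], ind)
    if tokens[:1] == ["What"]:
        for l1 in noun_phrase(tokens[1:], ind):
            yield from mp(l1, ind)


def question(words):
    """Individuals for which some parse consumes the whole question."""
    return [ind for ind in INDIVIDUALS
            if any(rest == [] for rest in parses(words, ind))]
```

With these toy facts, “What large country bordering Chile borders Colombia” is grammatical but has no answer, while “What large country is near Chile” cannot be parsed; question returns an empty list in both cases, so the caller cannot tell them apart.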

An alternative is, instead of directly querying the knowledge base while parsing, to build a logical form of the natural language – a logical proposition that conveys the meaning of the utterance – before asking it of the knowledge base. The logical form can be used for other tasks, such as telling the system knowledge, paraphrasing natural language, or even translating it into a different language.

You can construct a query by allowing noun phrases to return an individual and a list of constraints imposed by the noun phrase on the individual. Appropriate grammar rules are specified in Figure 15.11, and they are used with the dictionary of Figure 15.12.

% A noun phrase is a determiner followed by adjectives followed by a noun followed by an optional modifying phrase

noun_phrase(L0,L4,Ind,C0,C4) ←
       det(L0,L1,Ind,C0,C1) ∧
       adjectives(L1,L2,Ind,C1,C2) ∧
       noun(L2,L3,Ind,C2,C3) ∧
       omp(L3,L4,Ind,C3,C4).

% Adjectives consist of a sequence of adjectives.

adjectives(L,L,Ind,C,C).
adjectives(L0,L2,Ind,C0,C2) ←
       adj(L0,L1,Ind,C0,C1) ∧
       adjectives(L1,L2,Ind,C1,C2).

% A modifying phrase is a relation followed by a noun phrase

mp(L0,L2,Sub,C0,C2) ←
       reln(L0,L1,Sub,Obj,C0,C1) ∧
       noun_phrase(L1,L2,Obj,C1,C2).

% An optional modifying phrase is either nothing or a modifying phrase

omp(L,L,Ind,C,C).
omp(L0,L1,Ind,C0,C1) ← mp(L0,L1,Ind,C0,C1).

Figure 15.11: Part of a grammar that constructs a query

In this grammar,

noun_phrase(L0,L1,Ind,C0,C1)

means that list L1 is an ending of list L0, and the words in L0 before L1 form a noun phrase. This noun phrase refers to the individual Ind. C1 is an ending of C0, and the formulas in C0 before C1 are the constraints on the individual Ind imposed by the noun phrase.

Procedurally, L0 is the list of words to be parsed, and L1 is the list of remaining words after the noun phrase. C1 is the list of conditions coming into the noun phrase, and C0 is C1 with the extra conditions imposed by the noun-phrase added.

det(L,L,O,C,C).
det(["a"|T],T,O,C,C).
det(["the"|T],T,O,C,C).
noun(["country"|L],L,Ind,[country(Ind)|C],C).
noun(["city"|L],L,Ind,[city(Ind)|C],C).
noun([N|L],L,Ind,C,C) ← name(Ind,N).
adj(["large"|L],L,Ind,[large(Ind)|C],C).
adj([LangName,"speaking"|L],L,Ind,
       [language(Ind,Lang),name(Lang,LangName)|C],C).
reln(["borders"|L],L,Sub,Obj,[borders(Sub,Obj)|C],C).
reln(["bordering"|L],L,Sub,Obj,[borders(Sub,Obj)|C],C).
reln(["next","to"|L],L,Sub,Obj,[borders(Sub,Obj)|C],C).
reln(["the","capital","of"|L],L,Sub,Obj,[capital(Obj,Sub)|C],C).
reln(["the","name","of"|L],L,Sub,Obj,[name(Obj,Sub)|C],C).

Figure 15.12: A dictionary for constructing a query

Example 15.37.

The query

ask noun_phrase(["a","Spanish","speaking","country",
       "bordering","Argentina"],[],E1,C0,[]).

returns

C0=[language(E1,A),name(A,"Spanish"),country(E1),
       borders(E1,argentina)].

The query

ask mp(["the","name","of","the","capital","of","a",
       "country","bordering","Argentina"],[],Ind,C0,[]).

returns

C0=[name(A,Ind),capital(B,A),country(B),borders(B,argentina)].

If the elements of list C0 are queried against a database that uses these relations and constants, as in Figure 15.2, they can return a Spanish-speaking country that borders Argentina or the name of the capital of a country bordering Argentina. In Prolog, the built-in predicate call queries an atom.
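The construction of the constraint list can be sketched in Python as follows. This is an approximation of the rules shown in Figures 15.11 and 15.12 only, under several assumptions: constraints are tuples, variables are strings such as "?0" drawn from a counter, the NAME table stands in for the name relation of Figure 15.3, and a proper noun replaces the provisional variable by substitution rather than by unification. All function names are illustrative.

```python
import itertools

# Assumed stand-in for the name relation of Figure 15.3: word -> constant.
NAME = {"Argentina": "argentina", "Chile": "chile", "Paraguay": "paraguay"}
RELN = {
    ("borders",): lambda s, o: ("borders", s, o),
    ("bordering",): lambda s, o: ("borders", s, o),
    ("next", "to"): lambda s, o: ("borders", s, o),
    ("the", "capital", "of"): lambda s, o: ("capital", o, s),
    ("the", "name", "of"): lambda s, o: ("name", o, s),
}


def det(tokens):
    yield tokens
    if tokens[:1] in (["a"], ["the"]):
        yield tokens[1:]


def adj(tokens, ind, fresh):
    if tokens[:1] == ["large"]:
        yield tokens[1:], [("large", ind)]
    if tokens[1:2] == ["speaking"]:
        lang = next(fresh)
        yield tokens[2:], [("language", ind, lang), ("name", lang, tokens[0])]


def adjectives(tokens, ind, fresh):
    yield tokens, []
    for l1, c1 in adj(tokens, ind, fresh):
        for l2, c2 in adjectives(l1, ind, fresh):
            yield l2, c1 + c2


def noun(tokens, var):
    """Yield (rest, individual, constraints); a proper noun fixes the individual."""
    if tokens[:1] in (["country"], ["city"]):
        yield tokens[1:], var, [(tokens[0], var)]
    if tokens and tokens[0] in NAME:
        yield tokens[1:], NAME[tokens[0]], []


def mp(tokens, subject, fresh):
    for words, make in RELN.items():
        if tuple(tokens[:len(words)]) == words:
            for rest, obj, cs in noun_phrase(tokens[len(words):], fresh):
                yield rest, [make(subject, obj)] + cs


def omp(tokens, ind, fresh):
    yield tokens, []
    yield from mp(tokens, ind, fresh)


def noun_phrase(tokens, fresh):
    var = next(fresh)  # provisional individual; a proper noun's constant replaces it
    for l1 in det(tokens):
        for l2, adj_cs in adjectives(l1, var, fresh):
            for l3, ind, noun_cs in noun(l2, var):
                head = [tuple(ind if t == var else t for t in c) for c in adj_cs]
                for l4, mp_cs in omp(l3, ind, fresh):
                    yield l4, ind, head + noun_cs + mp_cs


def parse_noun_phrase(words):
    """Return (individual, constraint list) for each parse of all the words."""
    fresh = (f"?{i}" for i in itertools.count())
    return [(ind, cs) for rest, ind, cs in noun_phrase(words, fresh) if rest == []]
```

For the first query of this example, parse_noun_phrase returns a single parse whose constraint list matches C0 above up to variable names. Because parsing and querying are now separate steps, an input with no parse (an empty result here) can be told apart from a parse whose constraints have no solution in the database.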

15.7.4 Comparison with Large Language Models

The preceding grammars provide a different way to answer natural language queries than the large language models discussed in Section 8.5.5. The differences in aims were discussed in the box “Two Traditions for Natural Language Processing” at the start of this section.

The large language models, such as the GPT family, can be used to answer the questions about the geography of South America, for example. The main differences are:

  • GPT models give a probabilistic prediction of the next word. The next word can be sampled, and the prediction can be repeated to give a natural language answer. The grammar of the previous section directly answers the question, giving a symbolic answer, designed for subsequent reasoning.

  • The preceding grammar can be used to give all the answers and then stop when there are no more answers, whereas GPT models predict answers with a probability or sample from the distribution of answers.

  • GPT models provide each answer for each parse without distinguishing the parses, similar to the grammar that directly answers questions, shown in Figure 15.10. The grammar of Figure 15.11 can be used to separately provide the parses and the answers for each parse, which might be useful for some applications.

  • Modern large language models can output realistic-looking queries in, for example, SQL or Datalog. To actually use a generated query to interface to a database, the language model needs access to the database schema, to find out the predicates, their arity and meaning, etc., as well as to the database itself, to disambiguate constants such as “Chile” (which can mean the country, the football team, or a spelling of the chili pepper).

  • The preceding grammars can fail if the user does not use the appropriate grammar and vocabulary, whereas GPT models are very forgiving in the use of language.

  • GPT models have a much broader coverage and can converse on arbitrary topics, whereas the preceding grammars need to be engineered for each domain.

Depending on your needs, on whether you are prepared to engineer a system for each domain, and whether the users can be trained to use the assumed grammar, either one of these techniques might be more useful.
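The contrast between sampling a next word and enumerating symbolic answers can be illustrated with a toy sampler. This is only a sketch: the NEXT_WORD table is a made-up distribution over a handful of words, not the output of any real language model, and sample_sentence is a hypothetical helper name.

```python
import random

# Hypothetical next-word distributions; the numbers are made up for illustration.
NEXT_WORD = {
    "<start>": {"Argentina": 0.6, "Brazil": 0.3, "Peru": 0.1},
    "Argentina": {"borders": 0.5, "<end>": 0.5},
    "Brazil": {"borders": 0.5, "<end>": 0.5},
    "Peru": {"borders": 0.5, "<end>": 0.5},
    "borders": {"Chile": 0.8, "Paraguay": 0.2},
    "Chile": {"<end>": 1.0},
    "Paraguay": {"<end>": 1.0},
}


def sample_sentence(rng, max_len=10):
    """Repeatedly sample the next word until <end> or max_len words."""
    word, sentence = "<start>", []
    while len(sentence) < max_len:
        choices = NEXT_WORD[word]
        word = rng.choices(list(choices), weights=list(choices.values()))[0]
        if word == "<end>":
            break
        sentence.append(word)
    return sentence
```

Calling sample_sentence(random.Random(0)) returns one sampled word sequence, and different seeds can give different sequences. Nothing in the sampler checks a sequence against a database, so this toy model can produce a fluent but false statement such as “Brazil borders Chile”, whereas the grammar-based approach returns only answers that the knowledge base supports.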