17 Relational Learning and Probabilistic Reasoning 17.7 References and Further Reading 18 The Social Impact of Artificial Intelligence

17.8 Exercises

Exercise 17.1.

A variant of the gradient descent algorithm for collaborative filtering (Figure 17.2) can be used to predict $P(rating>threshold)$ for various values of threshold in $\{1,2,3,4\}$ . Modify the code so that it learns such a probability. [Hint: Make the prediction the sigmoid of the linear function as in logistic regression.] Does this modification work better for the task of recommending the top- $n$ movies, for, say $n=10$ , where the aim is to have the maximum number of movies rated 5 in the top- $n$ list? Which threshold works best? What if the top- $n$ is judged by the number of movies rated 4 or 5?

Exercise 17.2.

An alternative regularization for collaborative filtering is to minimize

\displaystyle\sum_{\left<u,i,r\right>\in D}

\displaystyle\left(\left(\widehat{r}(u,i)-r\right)^{2}+\lambda(ib[i]^{2}+ub[u]% ^{2}+\sum_{p}(ip[i,p]^{2}+up[u,p]^{2}))\right).

(a)

How doe this differ from the regularization of Formula 17.1? [Hint: Compare the regularization for the items or users with few ratings with those with many ratings.]
(b)

How does the code of Figure 17.2 need to be modified to implement this regularization?
(c)

Which works better on test data? [Hint: You will need to set $\lambda$ to be different for each method; for each method, choose the value of $\lambda$ by cross validation.]

Exercise 17.3.

Change the stochastic gradient descent algorithm of Figure 17.2 that minimizes Formula 17.1 so it adjusts the parameters, including regularization, after a batch of examples. How does the complexity of this algorithm differ from the algorithm of Figure 17.2? Which one works better in practice? [Hint: Think about whether you need to regularize all of the parameters or just those used in the batch.]

Exercise 17.4.

Suppose Boolean parameterized random variables $young(Person)$ and $cool(Item)$ are parents of Boolean $buys(Person,Item)$ . Suppose there are 3000 people and 200 items.

(a)

Draw this in plate notation.
(b)

How many random variables are in the grounding of this model?
(c)

How many numbers need to be specified for a tabular representation of this model. (Do not include any numbers that are functions of other specified numbers.)
(d)

Draw the grounding belief network assuming the population of $P e r s o n$ is $\{sam,chris\}$ and the population of $I t e m$ is $\{iwatch,mortgage,spinach\}$ .
(e)

What could be observed to make $cool(iwatch)$ and $cool(mortgage)$ probabilistically dependent on each other given the observations?

Exercise 17.5.

Consider Example 17.8. Suppose that the motive for $X$ to shoot $Y$ is that they are being paid by someone else, $Z$ , who needs to have motive to kill $Y$ . Draw the corresponding plate model. (If the plates become complicated to draw, just use the arguments of the parametrized random variables). It must be clear what logical variables the parametrized random variables depend on.

Exercise 17.6.

Suppose you have a relational probabilistic model for movie prediction, which represents

P(likes(P,M)\mid age(P),genre(M))

where $age(P)$ and $genre(M)$ are a priori independent.

(a)

What is the treewidth of the ground belief network (after pruning irrelevant variables) for querying $age(Sam)$ given $a g e$ and $g e n r e$ are not observed and given the observations for $l i k e s$ in Figure 17.14.

Person Movie Likes

Sam Hugo yes

Chris Hugo no

Sam Avatar no

Sam Harry Potter 6 yes

Chris Harry Potter 6 yes

Chris AI no

Chris Avatar no

David AI yes

David Avatar yes

Figure 17.14: Data for Example 17.6
(b)

For the same probabilistic model, for $m$ movies, $n$ people, and $r$ ratings, what is the worst-case treewidth of the corresponding graph (after pruning irrelevant variables), where only ratings are observed? [Hint: The treewidth depends on the structure of the observations; think about how the observations can be structured to maximize the treewidth.]
(c)

For the same probabilistic model, for $m$ movies, $n$ people, and $r$ ratings, what is the worst-case treewidth of the corresponding graph, where only some of the ratings but all of the genres are observed?

Person	Movie	Likes
Sam	Hugo	yes
Chris	Hugo	no
Sam	Avatar	no
Sam	Harry Potter 6	yes
Chris	Harry Potter 6	yes
Chris	AI	no
Chris	Avatar	no
David	AI	yes
David	Avatar	yes

Exercise 17.7.

Consider diagnosing the errors school students make when adding multi-digit numbers. Consider adding two multi-digit numbers to form a third multi-digit number, as in

\begin{array}[]{ccc}&A_{1}&A_{0}\\ +&B_{1}&B_{0}\\ \hline\cr C_{2}&C_{1}&C_{0}\end{array}

where $A_{i}$ , $B_{i}$ , and $C_{i}$ are all digits.

(a)

Suppose you want to model whether students know the skills of single-digit addition and carrying. If students know how, they usually get the correct answer, but sometimes make mistakes. Students who do not know simply guess. Draw a belief network for a single student and problem involving adding two 2-digit numbers to get a 3-digit number. [Hint: Each of the $A_{i}$ , $B_{i}$ , and $C_{i}$ , and the carries, are variables, and there is a variable for each skill of the student.]
(b)

Suppose there are multiple students and multiple problems. Give a plate representation, either drawing plates or specifying the parametrized random variables. There should be plates for students and problems. Think about what variables depend on the student, the problem, or both.
(c)

Suppose now you want to also represent multiple digits and multiple times. Assume that the students’ knowledge can change through time. Draw the plate model. What challenge arises in plate models for times and digits that did not arise for students and problems?

Exercise 17.8.

Represent the electrical domain of previous chapters as a probabilistic logic program, so that it will run in AIPython (aipython.org) or Problog. The representation should include the probabilistic dependencies of Example 9.15 and the relations of Example 15.11.

Artificial Intelligence 3E