foundations of computational agents
A variant of the gradient descent algorithm for collaborative filtering (Figure 17.2) can be used to predict $P(rating>threshold)$ for various values of threshold in $\{1,2,3,4\}$. Modify the code so that it learns such a probability. [Hint: Make the prediction the sigmoid of the linear function, as in logistic regression.] Does this modification work better for the task of recommending the top-$n$ movies, for, say, $n=10$, where the aim is to have the maximum number of movies rated 5 in the top-$n$ list? Which threshold works best? What if the top-$n$ is judged by the number of movies rated 4 or 5?
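One possible shape for the modification is sketched below. It assumes the parameter names of the Figure 17.2 learner (a global bias $b$, item and user biases $ib$ and $ub$, and item and user property vectors $ip$ and $up$); the data layout and hyperparameter values are illustrative assumptions, not the book's code. Each step trains on the Boolean target $rating>threshold$ using the log loss:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sgd_step(u, i, rating, threshold, b, ib, ub, ip, up,
             lr=0.05, lam=0.01):
    """One stochastic-gradient step on the log loss for the Boolean
    target rating > threshold (a sketch; parameter names assumed)."""
    y = 1.0 if rating > threshold else 0.0
    lin = b + ib[i] + ub[u] + sum(ip[i][p] * up[u][p]
                                  for p in range(len(ip[i])))
    pred = sigmoid(lin)            # predicted P(rating > threshold)
    err = pred - y                 # derivative of log loss w.r.t. lin
    b -= lr * err
    ib[i] -= lr * (err + lam * ib[i])
    ub[u] -= lr * (err + lam * ub[u])
    for p in range(len(ip[i])):
        gi = err * up[u][p] + lam * ip[i][p]
        gu = err * ip[i][p] + lam * up[u][p]
        ip[i][p] -= lr * gi
        up[u][p] -= lr * gu
    return b, pred
```

Training one such model for each threshold in $\{1,2,3,4\}$ gives the four probabilities; for the top-$n$ task, rank items by the predicted probability for the chosen threshold.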
An alternative regularization for collaborative filtering is to minimize
$$\sum_{\langle u,i,r\rangle\in D}\left({\left(\widehat{r}(u,i)-r\right)}^{2}+\lambda \left(ib{[i]}^{2}+ub{[u]}^{2}+{\sum _{p}}\left(ip{[i,p]}^{2}+up{[u,p]}^{2}\right)\right)\right)$$
How does this differ from the regularization of Formula 17.1? [Hint: Compare the regularization for items or users with few ratings with that for those with many ratings.]
How does the code of Figure 17.2 need to be modified to implement this regularization?
Which works better on test data? [Hint: You will need to set $\lambda $ to be different for each method; for each method, choose the value of $\lambda $ by cross validation.]
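The hint can be made concrete with a sketch, using a simplified predictor $ib[i]+ub[u]$ and assumed names rather than the book's full model. The alternative scheme puts the $\lambda$ term inside every squared error, so a parameter is shrunk once per rating that mentions it, whereas under Formula 17.1 each parameter is shrunk once per pass over the data regardless of how many ratings mention it:

```python
def step_alternative(u, i, r, ib, ub, lr, lam):
    # Alternative scheme: regularization is part of each example's loss,
    # so frequently rated items and users are shrunk more in total.
    err = (ib[i] + ub[u]) - r
    ib[i] -= lr * (err + lam * ib[i])
    ub[u] -= lr * (err + lam * ub[u])

def step_formula_17_1(u, i, r, ib, ub, lr):
    # Formula 17.1 scheme: plain error gradient per example ...
    err = (ib[i] + ub[u]) - r
    ib[i] -= lr * err
    ub[u] -= lr * err

def regularize_all(ib, ub, lr, lam):
    # ... with one shrinkage step per parameter per pass over the data,
    # independent of how many ratings mention each parameter.
    for i in range(len(ib)):
        ib[i] -= lr * lam * ib[i]
    for u in range(len(ub)):
        ub[u] -= lr * lam * ub[u]
```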
Change the stochastic gradient descent algorithm of Figure 17.2 that minimizes Formula 17.1 so it adjusts the parameters, including regularization, after a batch of examples. How does the complexity of this algorithm differ from the algorithm of Figure 17.2? Which one works better in practice? [Hint: Think about whether you need to regularize all of the parameters or just those used in the batch.]
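A batched update might look like the following sketch, again with a simplified predictor $b+ib[i]+ub[u]$ and assumed names. Following the hint, only parameters mentioned in the batch are regularized; per batch, the cost is then proportional to the batch size rather than to the total number of parameters:

```python
def batch_step(batch, b, ib, ub, lr=0.01, lam=0.1):
    """One batched gradient step; batch is a list of (u, i, r) triples.
    Accumulate gradients over the batch, then update each parameter once."""
    gb = 0.0
    gib, gub = {}, {}
    for (u, i, r) in batch:
        err = (b + ib[i] + ub[u]) - r
        gb += err
        gib[i] = gib.get(i, 0.0) + err
        gub[u] = gub.get(u, 0.0) + err
    b -= lr * gb
    # regularize only the parameters the batch mentions (see the hint)
    for i, g in gib.items():
        ib[i] -= lr * (g + lam * ib[i])
    for u, g in gub.items():
        ub[u] -= lr * (g + lam * ub[u])
    return b
```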
Suppose Boolean parameterized random variables $young(Person)$ and $cool(Item)$ are parents of Boolean $buys(Person,Item)$. Suppose there are 3000 people and 200 items.
Draw this in plate notation.
How many random variables are in the grounding of this model?
How many numbers need to be specified for a tabular representation of this model? (Do not include any numbers that are functions of other specified numbers.)
Draw the grounding belief network assuming the population of $Person$ is $\{sam,chris\}$ and the population of $Item$ is $\{iwatch,mortgage,spinach\}$.
What could be observed to make $cool(iwatch)$ and $cool(mortgage)$ probabilistically dependent on each other given the observations?
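For the counting parts above, the arithmetic can be organized as follows (populations taken from the exercise statement):

```python
# Populations given in the exercise.
n_people, n_items = 3000, 200

# Grounding: one young(p) per person, one cool(i) per item,
# and one buys(p, i) per person-item pair.
n_ground_vars = n_people + n_items + n_people * n_items

# Tabular parameters (free numbers only): P(young) and P(cool) take one
# number each; buys has two Boolean parents, so 2 * 2 parent
# assignments, with one free number per assignment.
n_numbers = 1 + 1 + 2 * 2

print(n_ground_vars, n_numbers)
```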
Consider Example 17.8. Suppose that the motive for $X$ to shoot $Y$ is that they are being paid by someone else, $Z$, who needs to have motive to kill $Y$. Draw the corresponding plate model. (If the plates become complicated to draw, just use the arguments of the parametrized random variables). It must be clear what logical variables the parametrized random variables depend on.
Suppose you have a relational probabilistic model for movie prediction, which represents
$$P(likes(P,M)\mid age(P),genre(M))$$
where $age(P)$ and $genre(M)$ are a priori independent.
What is the treewidth of the ground belief network (after pruning irrelevant variables) for querying $age(Sam)$, given that $age$ and $genre$ are not observed and given the observations for $likes$ in Figure 17.14?
| Person | Movie | Likes |
|---|---|---|
| Sam | Hugo | yes |
| Chris | Hugo | no |
| Sam | Avatar | no |
| Sam | Harry Potter 6 | yes |
| Chris | Harry Potter 6 | yes |
| Chris | AI | no |
| Chris | Avatar | no |
| David | AI | yes |
| David | Avatar | yes |
For the same probabilistic model, for $m$ movies, $n$ people, and $r$ ratings, what is the worst-case treewidth of the corresponding graph (after pruning irrelevant variables), where only ratings are observed? [Hint: The treewidth depends on the structure of the observations; think about how the observations can be structured to maximize the treewidth.]
For the same probabilistic model, for $m$ movies, $n$ people, and $r$ ratings, what is the worst-case treewidth of the corresponding graph, where only some of the ratings but all of the genres are observed?
Consider diagnosing the errors school students make when adding multi-digit numbers. Consider adding two multi-digit numbers to form a third multi-digit number, as in
$$\begin{array}{ccc}& {A}_{1}& {A}_{0}\\ +& {B}_{1}& {B}_{0}\\ \hline {C}_{2}& {C}_{1}& {C}_{0}\end{array}$$
where ${A}_{i}$, ${B}_{i}$, and ${C}_{i}$ are all digits.
Suppose you want to model whether students know the skills of single-digit addition and carrying. If students know how, they usually get the correct answer, but sometimes make mistakes. Students who do not know simply guess. Draw a belief network for a single student and problem involving adding two 2-digit numbers to get a 3-digit number. [Hint: Each of the ${A}_{i}$, ${B}_{i}$, and ${C}_{i}$, and the carries, are variables, and there is a variable for each skill of the student.]
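One way to check a drawn network against the hint is to write out the parent structure. The sketch below is a hypothetical structure, not the unique answer; variable names (`knows_addition`, `knows_carry`, `carry1`) are assumptions for illustration:

```python
# Hypothetical parent structure for one student and one problem of
# adding two 2-digit numbers. Digit variables have no parents; each
# answer digit depends on its operand digits, the incoming carry, and
# the relevant skill variable.
parents = {
    "knows_addition": [], "knows_carry": [],
    "A1": [], "A0": [], "B1": [], "B0": [],
    "C0": ["A0", "B0", "knows_addition"],
    "carry1": ["A0", "B0", "knows_carry"],
    "C1": ["A1", "B1", "carry1", "knows_addition"],
    "C2": ["A1", "B1", "carry1", "knows_carry"],
}
```

Checking that every parent is itself a variable, and that the answer digits depend on the carry, catches common slips when translating the drawing.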
Suppose there are multiple students and multiple problems. Give a plate representation, either drawing plates or specifying the parametrized random variables. There should be plates for students and problems. Think about what variables depend on the student, the problem, or both.
Suppose now you want to also represent multiple digits and multiple times. Assume that the students’ knowledge can change through time. Draw the plate model. What challenge arises in plate models for times and digits that did not arise for students and problems?
Represent the electrical domain of previous chapters as a probabilistic logic program, so that it will run in AIPython (aipython.org) or Problog. The representation should include the probabilistic dependencies of Example 9.15 and the relations of Example 15.11.