17.2 Embedding-Based models

17.2.1 Learning a Binary Relation

Suppose you are given a database of a binary relation where both arguments are entities, and the aim is to predict whether other tuples are true. The machine learning methods of Chapters 7 and 8 are not directly applicable as the entities are represented using meaningless names (identifiers). An algorithm that does this is defined below, for the case where there are no other relations defined, but where there can also be a numeric prediction for the pair. This is explained using a particular example of predicting a user’s ratings for an item such as a movie.

In a recommender system, users are given personalized recommendations of items they may like. One technique for recommender systems is to predict the rating of a user on an item from the ratings of similar users or similar items, by what is called collaborative filtering.

Example 17.1.

MovieLens (https://movielens.org/) is a movie recommendation system that acquires movie ratings from users. The rating is from 1 to 5 stars, where 5 stars is better. The first few tuples of one dataset are shown in Table 17.1,

where each user is given a unique number, each item (movie) is given a unique number, and the timestamp is the Unix standard of seconds since 1970-01-01 UTC. Such data can be used in a recommendation system to make predictions of other movies a user might like.

One way for a recommender system to personalize recommendations is to present the user with the top-n items, where $n$ is a positive integer, say 10. This can be done by first estimating the rating of each item for a user and then presenting the $n$ items with the highest predicted rating. This is not a good solution in general, as all of the top items might be very similar. The system should also take diversity into account when choosing the set of items.

In the example above, the items were movies, but they could also be consumer goods, restaurants, holidays, or other items.

Netflix Prize

There was a considerable amount of research on collaborative filtering with the Netflix Prize to award \$1,000,000 to the team that could improve the prediction accuracy of Netflix’s proprietary system, measured in terms of sum-of-squares, by 10%. Each rating gives a user, a movie, a rating from 1 to 5 stars, and a date and time the rating was made. The dataset consisted of approximately 100 million ratings from 480,189 anonymized users on 17,770 movies that was collected over a 7-year period. The prize was won in 2009 by a team that averaged over a collection of hundreds of predictors, some of which were quite sophisticated. After 3 years of research, the winning team beat another team by just 20 minutes to win the prize. They both had solutions which had essentially the same error, which was just under the threshold to win. Interestingly, an average of the two solutions was better than either alone.

The algorithm presented here is the basic algorithm that gave the most improvement.

The Netflix dataset is no longer available because of privacy concerns. Although users were only identified by a number, there was enough information, if combined with other information, to potentially identify some of the users.

Suppose $E$ is a dataset of $\left$ triples, where $\left$ means user $u$ gave item $i$ a rating of $r$ (ignoring the timestamp). Let ${\widehat{r}(u,i)}$ be the predicted rating of user $u$ on item $i$. The aim is to optimize the sum-of-squares error

 $\sum_{\left\in E}({\widehat{r}(u,i)}-r)^{2}.$

As with most machine learning, the aim is to optimize for the test examples, not for the training examples.

Following is a sequence of increasingly sophisticated models for ${\widehat{r}(u,i)}$, the prediction of the rating of a user for an item. Each model adds more terms to the model, in a similar way to how features are added in boosting and residuals in deep learning.

Make a single prediction

The simplest case is to predict the same rating for all users and items: ${\widehat{r}(u,i)}=\mu$, where $\mu$ is the mean rating. Recall that when predicting the same value for every instance, predicting the mean minimizes the sum-of-squares error.

Some users might give higher ratings than other users, and some movies may have higher ratings than other movies. You can take this into account using

 ${\widehat{r}(u,i)}=\mu+b_{1}[u]+b_{2}[i]$

where user $u$ has bias $b_{1}[u]$ and item $i$ has bias $b_{2}[i]$. The parameters $b_{1}[u]$ and $b_{2}[i]$ are chosen to minimize the sum-of-squares error. If there are $n$ users and $m$ items, there are $n+m$ parameters to tune (assuming $\mu$ is fixed). Finding the best parameters is an optimization problem that can be done with a method like gradient descent, as was used for the linear learner.

One might think that $b_{1}[u]$ should be directly related to the average rating for user $u$ and $b_{2}[i]$ should be directly related to the average rating for item $i$. However, it is possible that $b_{1}[u]<0$ even if all of the ratings for user $u$ were above the mean $\mu$. This can occur if user $u$ only rated popular items and rated them lower than other people.

Optimizing the $b_{1}[u]$ and $b_{2}[i]$ parameters can help get better estimates of the ratings, but it does not help in personalizing the recommendations, because the movies are still ordered the same for every user.

You could hypothesize that there is an underlying property of each user and each movie that enables more accurate predictions, such as the age of each user, and the age-appeal of each movie. A real-valued latent property or hidden property for each entity can be tuned to fit the data.

Suppose the latent property for user $u$ has a value $f_{1}[u]$, and the latent property for item $i$ has value $f_{2}[i]$. The product of these is used to offset the rating of that item for that user:

 ${\widehat{r}(u,i)}=\mu+b_{1}[u]+b_{2}[i]+f_{1}[u]*f_{2}[i].$

If $f_{1}[u]$ and $f_{2}[i]$ are both positive or both negative, the property will increase the prediction. If one of $f_{1}[u]$ or $f_{2}[i]$ is negative and the other is positive, the property will decrease the prediction.

Example 17.2.

Figure 17.1 shows a plot of the ratings as a function of a single latent property. This uses the subset of the MovieLens 100k dataset, containing the 20 movies that had the most ratings, and the 20 users who had the most ratings for these movies. It was trained for 1000 iterations of gradient descent, with a single latent property.

On the $x$-axis are the users, ordered by their value, $f_{1}[u]$, on the property. On the $y$-axis are the movies, ordered by their value, $f_{2}[i]$ on the property. Each rating is then plotted against the user and the movie, so that triple $\left$ is depicted by plotting $r$ at the $(x,y)$ position $(f_{1}[u],f_{2}[i])$. Thus each vertical column of numbers corresponds to a user, and each horizontal row of numbers is a movie. The columns overlap if two users have very similar values on the property. The rows overlap for movies that have very similar values on the property. The users and the movies where the property values are close to zero are not affected by this property, as the prediction uses the product of values of the properties.

In general, high ratings are in the top-right and the bottom-left, as these are the ratings that are positive in the product, and low ratings in the top-left and bottom-right, as their product is negative. Note that what is high and low is relative to the user and movie biases; what is high for one movie may be different from what is high for another movie.

Add $k$ latent properties

Instead of using just one property, consider using $k$ latent properties for each user and item. There is a value $E_{1}[u][f]$ (using Python notation for matrices) for each user $u$ and property $f\in\{0,\dots,k-1\}$. The vector $E_{1}[u]$ of length $k$ is called the user embedding for user $u$, analogous to the word embeddings used in deep learning. Similarly, there is a value $E_{2}[i][f]$ for every item $i$ and property $f$, where $E_{2}[i]$ is the item embedding for item $i$. The contributions of the separate properties are added. This gives the prediction

 ${\widehat{r}(u,i)}=\mu+b_{1}[u]+b_{2}[i]+\sum_{f}{E_{1}}[u][f]*{E_{2}}[i][f].$

This is often called a matrix factorization method as the summation corresponds to matrix multiplication of $E_{1}$ and the transpose – flipping the arguments – of $E_{2}$.

Regularize

To avoid overfitting, a regularization term can be added to prevent the parameters from growing too much and overfitting the given data. Here an $L_{2}$ regularizer for each of the parameters is added to the optimization. The goal is to choose the parameters to

 $\displaystyle minimize$ $\displaystyle\left(\sum_{\left\in E}\left(\widehat{r}(u,i)-r% \right)^{2}\right)$ $\displaystyle+\lambda\left(\sum_{u}(b_{1}[u]^{2}+\sum_{f}E_{1}[u][f]^{2})+\sum% _{i}(b_{2}[i]^{2}+\sum_{f}E_{2}[i][f]^{2})\right)$ (17.1)

where $\lambda$ is a regularization parameter, which can be tuned by cross validation.

The parameters $b_{1}$, $b_{2}$, $E_{1}$, and $E_{2}$ can be optimized using stochastic gradient descent, as shown in Figure 17.2. The parameters there are adjusted in each iteration, but regularized after going through the dataset, which might not be the best choice; see Exercise 17.3. Note that the elements of $E_{1}[i][f]$ and $E_{2}[u][f]$ need to be initialized randomly (and not to the same value) to force each property to be different.

To understand what is being learned, consider the following thought experiment:

Example 17.3.

In the rating running example, suppose there is a genre of movies that some people really like and others hate. Suppose property $f$ has 1 for movies of that genre, and 0 for other movies. Users who like that genre more than would be expected from other properties have a high value for property $f$, those that dislike it more have a negative value, and others have values close to zero. This can be seen as a soft clustering – as in expectation maximization (EM) – for users into the categories of likes, dislikes, and indifferent. Suppose now that the user property $f$ is fixed to these values, but the corresponding property for the movie is learned, so it might not exactly correspond to the genre. The movies that have a high value for the property are the movies that are liked by the people who like the genre and disliked by those who hate the genre. Thus it learns which movies those people like or dislike. Again it is providing a soft clustering of movies. With multiple latent properties, it can find multiple soft clusterings, with the values added together in a similar way to boosting and residuals in deep learning.

The alternating least squares algorithm acts like the above description, alternating fixing user or item properties and optimizing the other, but starting with random assignments. Optimizing Equation 17.1 with user or item parameters fixed gives a linear regression on the other parameters. The adjustments in Figure 17.2 are more fine grained, but the results are similar.

This algorithm can be evaluated by how well it predicts future ratings by training it with the data up to a certain time, and testing it on future data.

The algorithm can be modified to take into account observed attributes of items and users by fixing some of the properties. For example, the Movielens dataset includes 19 genres, which can be represented using 19 of the embedding positions, one position for each genre (forming a one-hot encoding for part of the embedding space). If position $g$ represents a particular genre, a movie has 1 at position $g$ if the movie has the corresponding genre and 0 otherwise. Each user has a learnable parameter at position $g$ to indicate whether the genre is positive or negative for that user, as in Example 17.3. Different positions are used for user properties. Using explicit properties is useful when users or items have few ratings from which to learn, or for the cold-start problem, which is how to make recommendations about new items or for new users; predictions can be made just on the observed properties, without any ratings needed.

This algorithm ignores the timestamp, but using the timestamp may help as users’ preferences may change and items may come in and out of fashion. The dynamics of preferences becomes important when predicting future ratings – which is usually what is wanted – but there is usually very little data to learn the dynamics for each user.

The above was for regression, predicting the numerical rating of an item for a user. For Boolean classification, for example, whether the rating is greater than 3, the sigmoid of a linear function provides a probabilistic prediction. When trained to optimize binary log loss, the algorithm can remain unchanged, except for an added sigmoid in the definition of $\widehat{r}$ on line 11 of Figure 17.2.

The Boolean function of whether a rating is greater than 3 is atypical because both positive and negative examples are provided. Most relations do not contain negative examples. For some, the closed-world assumption – everything not stated is false – may be appropriate, but then there is nothing to learn. For relations where only positive examples are provided, for example predicting $rated$, which is true when a rating is provided, the above algorithm will conclude that everything is true, as that prediction minimizes Equation 17.1. The mean value ($\mu$ in Figure 17.2) is not defined. To handle this case, either $\mu$ can be treated as a parameter which must be regularized, or negative examples need to be provided. The ratio of positive to negative examples and the regularization parameter both provide a prior assumption of the probability, which is not something that can be estimated from the data. Because the algorithm can handle noise, it is reasonable to just choose random cases as negative examples, even though they could be positive. The absence of negative examples also makes evaluation challenging and is one reason why relational prediction methods are often judged using ranking, such as predicting the top-n items for each user.

When used for recommendations, it might be better to take the diversity of the recommendations into account and recommend very different items than to recommend similar items. Thus, while predicting the ratings is useful, it does not lead directly to useful recommendations.

17.2.2 Learning Knowledge Graphs

Suppose the aim is to learn the triples of a knowledge graph. Consider predicting the truth of the triple $(s,r,o)$, where $s$ is the subject, $r$ is the relation (verb), and $o$ the object of the triple. A straightforward extension to matrix factorization to predict triples is the polyadic decomposition, which decomposes a tensor into matrices. It can be used to predict the probability of a triple – an element of a three-dimensional tensor – as

 ${\widehat{p}{((s,r,o))}}=sigmoid(\mu+b_{1}[s]+b_{2}[r]+b_{3}[o]+\sum_{f}E_{1}[% s][f]*E_{2}[r][f]*E_{3}[o][f])$

using the tunable parameters

• a global bias $\mu$

• two biases for each entity $e$, namely $b_{1}[e]$, used when $e$ is in the first position, and $b_{3}[e]$, used when $e$ is in the third position

• a bias for each relation $r$, given by $b_{2}[r]$

• a matrix ${E_{1}}$, indexed by entities and properties, so that $E_{1}[e]$ is the vector of length $k$, called the subject embedding for $e$, used when $e$ appears as the first item of a triple, and a matrix $E_{3}$, providing the object embedding for each entity when it appears as the third item of a triple

• a matrix $E_{2}$, indexed by relations and properties, where $E_{2}[r]$ is the relation embedding for relation $r$.

All of the embeddings $E_{1}[e]$, $E_{2}[r]$, and $E_{3}[e]$ are the same length.

Training this model on a dataset of triples, minimizing categorical log loss with L2 regularization, is like the algorithm of Figure 17.2, but with a different predictor and more parameters to tune and regularize. With only a knowledge graph of triples with no negative examples, as with the binary case above, without regularizing $\mu$, all triples being true minimizes the error. Either negative examples need to be added or regularization of $\mu$ is required. The ratio of positive to negative examples, and the regularization of $\mu$, convey prior information about the proportion of tuples expected to be true, which cannot be obtained from a dataset of only positive triples.

The polyadic decomposition, however, does not work well because the subject and object embeddings are independent of each other.

Example 17.4.

Suppose user $u$ likes movies directed by a particular person $d$, and the knowledge is represented as $(u,likes,m)$ and $(m,directed\_by,d)$. Whether someone likes a movie only depends on the object embedding of the movie, but the directorship depends only on the subject embedding of the movie. As these embeddings do not interact, there is no way for the above embedding model to represent the interdependence. A solution to this is to also represent the inverse of the relations where $(m,likes^{-1},u)\leftrightarrow(u,likes,m)$ and $(d,directed\_by^{-1},m)\leftrightarrow(m,directed\_by,d)$.

The polyadic decomposition with inverses has embeddings for each relation $r$ and its inverse $r^{-1}$, where $(s,r,o)=(o,r^{-1},s)$, to make the prediction that is the average of the prediction from the two ways of representing the same tuple:

 $\widehat{p}(s,r,o)=\frac{1}{2}(\widehat{pd}{(s,r,o)}+\widehat{pd}{(o,r^{-1},s)})$

where $\widehat{pd}$ is the prediction from the polyadic decomposition. This allows the interdependence of the subject and object embeddings, which can solve the problem of Example 17.4.

The polyadic decomposition with inverses is fully expressive, meaning it can approximate any relation with a log loss less than and $\epsilon>0$, even if all embeddings are restricted to be non-negative, and the value of all embedding values is bounded by a constant (which may depend on $\epsilon$). This means that it can memorize any relation, which is the best it can do in general; see the no-free-lunch theorem. It also means that there are no inherent restrictions on what can be represented. If the same embedding is used for the subject and the object, only symmetric relations can be represented or approximated.

Example 17.3 explains how the user and items embedding used in prediction can be used to understand what is being learned. That explanation holds for a relation, for each embedding position where the relation has a non-zero value. In particular, for the case where all embeddings are non-negative:

• For the subject embedding, each embedding position is a soft clustering of entities. Those entities in the cluster have a large value for that position, where large is relative to the other values. Similarly, for the object embedding.

• For each position, the product of the subject, relation, and object embeddings is only large if all three are large. When the relation embedding is large at a position, the entities in the subject clustering for that position are related to the entities in the corresponding object clustering.

• The addition ($\sum_{f}$) provides for multiple clusterings to be added together.

In the general case, when an even number of the values being multiplied are negative, the result is positive and it acts like the above. When an odd number of the values is negative (one or three of the subject, relation, and object is negative), the product is negative and this position can act like exceptions.

This method – and related methods – is clustering entities in various ways to provide the best predictions. What is learned is not general knowledge that can be applied to other populations. If the goal is to model a single population, for example in a social network, that might be adequate. However, if the aim is to learn knowledge that can be applied to new populations, this method does not work.

Decomposing a relation into matrices does not work well for relations with multiple arguments. If the relation is converted to triples using reification, the introduced entity has very few instances – the number of arguments to the relation – which makes learning challenging. If the polyadic deposition is used directly, the problem of independent positions – which motivated the inverses above – occurs for every pair of positions. Modeling the interaction between every pair of argument predictions leads to overfitting.

Embedding-based models that decompose a relation do not work for predicting properties with a fixed range, such as the age of users, because they rely on a lower-dimensional representation and there isn’t one for properties. The vector for a user can be used to memorize the age in one or more of the embedding positions, which means it does not generalize.