# 10.3 Unsupervised Learning

This chapter has so far considered supervised learning, where target features are observed in the training data. In unsupervised learning, the target features are not given in the training examples.

One general method for unsupervised learning is clustering, which partitions the examples into clusters. Each cluster predicts feature values for the examples in the cluster. The best clustering is the one that minimizes the prediction error, such as squared error or log loss.

Often the term class is used as a semantically meaningful term, but while you might want the clusters to be semantically meaningful, they are not always.

###### Example 10.8.

A diagnostic assistant may want to group treatments into groups that predict the desirable and undesirable effects of the treatment. The assistant may not want to give a patient a drug because similar drugs may have had disastrous effects on similar patients.

A tutoring agent may want to cluster students’ learning behavior so that strategies that work for one member of a cluster may work for other members.

In hard clustering, each example is placed definitively in a cluster. The cluster is then used to predict the feature values of the example. The alternative to hard clustering is soft clustering, in which each example has a probability distribution over clusters. The prediction of the values for the features of an example is the weighted average of the predictions of the clusters the example is in, weighted by the probability of the example being in the cluster. Soft clustering is described in Section 10.3.2.

## 10.3.1 k-Means

The k-means algorithm is used for hard clustering. The training examples, $E$, and the number of clusters, $k$, are given as input. It requires the value of each feature to be a (real) number, so that differences in values make sense.

The algorithm constructs $k$ clusters, a prediction of a value for each feature for each cluster, and an assignment of examples to clusters.

Suppose the input features, $X_{1},\dots,X_{n}$, are observed for each example. Let ${X_{j}}({e})$ be the value of input feature $X_{j}$ for example $e$. Associate a cluster with each integer $c\in\{1,\dots,k\}$.

The $k$-means algorithm constructs

• a function $cluster:E\rightarrow\{1,\dots,k\}$ that maps each example to a cluster (if $cluster(e)=c$, example $e$ is said to be in cluster $c$)

• a function $prediction(j,c)$ that returns the predicted value of each element of cluster $c$ on feature $X_{j}$.

Example $e$ is thus predicted to have value $prediction(j,cluster(e))$ for feature $X_{j}$.

The aim is to find the functions $cluster$ and $prediction$ that minimize the sum of squared loss:

 $\sum_{e\in E}\sum_{j=1}^{n}\left(prediction(j,cluster(e))-{X_{j}}({e})\right)^% {2}.$

To minimize the squared loss, the prediction of a cluster should be the mean of the prediction of the examples in the cluster; see Figure 7.5. Finding an optimal clustering is NP-hard. When there are only a few examples, it is possible to enumerate the assignments of examples to clusters. For more than a few examples, there are too many partitions of the examples into $k$ clusters for exhaustive search to be feasible.

The $k$-means algorithm iteratively improves the squared loss. Initially, it randomly assigns the examples to clusters. Then it carries out the following two steps:

• For each cluster $c$ and feature $X_{j}$, make $prediction(j,c)$ be the mean value of ${X_{j}}({e})$ for each example $e$ in cluster $c$:

 $\frac{\displaystyle\sum_{e:cluster(e)=c}{X_{j}}({e})}{\left|\{e:cluster(e)=c\}% \right|}$

where the denominator is the number of examples in cluster $c$.

• Reassign each example to a cluster: assign each example $e$ to a cluster $c$ that minimizes

 $\sum_{j=1}^{n}\left(prediction(j,c)-{X_{j}}({e})\right)^{2}.$

These two steps are repeated until the second step does not change the assignment of any example.

An algorithm that implements $k$-means is shown in Figure 10.4. It constructs sufficient statistics to compute the mean of each cluster for each feature, namely

• $cc[c]$ is the number of examples in cluster $c$

• $fs[j,c]$ is the sum of the value for $X_{j}(e)$ for examples in cluster $c$.

These are sufficient statistics because they contain all of the information from the data necessary to compute $cluster(e)$ and $prediction(j,c)$. The current values of $fs$ and $cc$ are used to determine the next values (in $fs\_new$ and $cc\_new$).

The random initialization could be assigning each example to a cluster at random, selecting $k$ points at random to be representative of the clusters, or assigning some, but not all, of the examples to construct the initial sufficient statistics. The latter two methods may be more useful if the dataset is large, as they avoid a pass through the whole dataset for initialization.

An assignment of examples to clusters is stable if an iteration of $k$-means does not change the assignment. Stability requires that $arg~{}min$ in the definition of $cluster$ gives a consistent value for each example in cases where more than one cluster is minimal. This algorithm has reached a stable assignment when each example is assigned to the same cluster in one iteration as in the previous iteration. When this happens, $fs$ and $cluster\_count$ do not change, and so the Boolean variable $stable$ becomes $true$.

This algorithm will eventually converge to a stable local minimum. This is easy to see because the sum-of-squares error keeps reducing and there are only a finite number of reassignments. This algorithm often converges in a few iterations.

###### Example 10.9.

Suppose an agent has observed the $\left$ pairs

$\left<0.7,5.1\right>$, $\left<1.5,6.1\right>$, $\left<2.1,4.5\right>$, $\left<2.4,5.5\right>$, $\left<3.1,4.4\right>$, $\left<3.5,5.1\right>$, $\left<4.5,1.5\right>$, $\left<5.2,0.7\right>$, $\left<5.3,1.8\right>$, $\left<6.2,1.7\right>$, $\left<6.7,2.5\right>$, $\left<8.5,9.2\right>$, $\left<9.1,9.7\right>$, $\left<9.5,8.5\right>$.

These data points are plotted in Figure 10.5(a). The agent wants to cluster the data points into two clusters ($k=2$).

In Figure 10.5(b), the points are randomly assigned into the clusters; one cluster is depicted as $+$ and the other as $*$. The mean of the points marked with $+$ is $\left<4.6,3.65\right>$, shown with $\oplus$. The mean of the points marked with $*$ is $\left<5.2,6.15\right>$, shown with $\circledast$.

In Figure 10.5(c), the points are reassigned according to the closer of the two means. After this reassignment, the mean of the points marked with $+$ is then $\left<3.96,3.27\right>$. The mean of the points marked with $*$ is $\left<7.15,8.34\right>$.

In Figure 10.5(d), the points are reassigned to the closest mean. This assignment is stable in that no further reassignment will change the assignment of the examples.

A different initial assignment to the points can give different clustering. One clustering that arises in this dataset is for the lower points (those with a $Y$-value less than 3) to be in one cluster, and for the other points to be in another cluster.

Running the algorithm with three clusters ($k=3$) typically separates the data into the top-right cluster, the left-center cluster, and the lower cluster. However, there are other possible stable assignments that could be reached, such as having the top three points in two different clusters, and the other points in another cluster.

Some stable assignments may be better, in terms of sum-of-squares error, than other stable assignments. To find the best assignment, it is often useful to try multiple starting configurations, using a random restart and selecting a stable assignment with the lowest sum-of-squares error. Note that any permutation of the labels of a stable assignment is also a stable assignment, so there are invariably multiple local and global minima.

One problem with the $k$-means algorithm is that it is sensitive to the relative scale of the dimensions. For example, if one feature is $height$ in centimeters, another feature is $age$, and another is a binary ($\{0,1\}$) feature, the different values need to be scaled so that they can be compared. How they are scaled relative to each other affects the clusters found. It is common to scale the dimensions to between 0 and 1 or with a mean of 0 and a variance of 1, but this assumes that all dimensions are relevant and independent of each other.

Finding an appropriate number of clusters is a classic problem in trading off model complexity and fit to data. One solution is to use the Bayesian information criteria (BIC) score, similar to its use in decision trees where the number of clusters is used instead of the number of leaves. While it is possible to construct $k+1$ clusters from $k$ clusters, the optimal division into three clusters, for example, may be quite different from the optimal division into two clusters.

## 10.3.2 Expectation Maximization for Soft Clustering

A hidden variable or latent variable is a probabilistic variable that is not observed in a dataset. A Bayes classifier can be the basis for unsupervised learning by making the class a hidden variable.

The expectation maximization (EM) algorithm can be used to learn probabilistic models with hidden variables. Combined with a naive Bayes classifier, it does soft clustering, similar to the $k$-means algorithm, but where examples are probabilistically in clusters.

As in the $k$-means algorithm, the training examples and the number of clusters, $k$, are given as input.

Given the data, a naive Bayes model is constructed where there is a variable for each feature in the data and a hidden variable for the class. The class variable is the only parent of the other features. This is shown in Figure 10.6. The class variable has domain $\{1,2,\dots,k\}$, where $k$ is the number of classes. The probabilities needed for this model are the probability of the class $C$ and the probability of each feature given $C$. The aim of the EM algorithm is to learn probabilities that best fit the data.

The EM algorithm conceptually augments the data with a class feature, $C$, and a count column. Each original example gets mapped into $k$ augmented examples, one for each class. The counts for these examples are assigned so that they sum to 1.

For four features and three classes, the example $\left$ is mapped into the three tuples, shown in the table on the left of Figure 10.7. EM works by iteratively determining the count from the model, and the model from the count.

The EM algorithm repeats the following two steps:

• E step. Update the augmented counts based on the probability distribution. For each example $\left$ in the original data, the count associated with $\left$ in the augmented data is updated to

 $P(C{\,{=}\,}c\mid X_{1}{\,{=}\,}v_{1},\dots,X_{n}{\,{=}\,}v_{n}).$

This step involves probabilistic inference. If multiple examples have the same values for the input features, they can be treated together, with the probabilities multiplied by the number of examples. This is an expectation step because it computes the expected values.

• M step. Infer the probabilities for the model from the augmented data. Because the augmented data has values associated with all the variables, this is the same problem as learning probabilities from data in a naive Bayes classifier. This is a maximization step because it computes the maximum likelihood estimate or the maximum a posteriori probability (MAP) estimate of the probability.

The EM algorithm starts with random probabilities or random counts. EM will converge to a local maximum of the likelihood of the data.

This algorithm returns a probabilistic model, which is used to classify an existing or new example. An example is classified using

 $\displaystyle{P(C{\,{=}\,}c\mid X_{1}{\,{=}\,}v_{1},\dots,X_{n}{\,{=}\,}v_{n})}$ $\displaystyle\mbox{}=\frac{P(C{\,{=}\,}c)*\prod_{i=1}^{n}P(X_{i}{\,{=}\,}v_{i}% \mid C{\,{=}\,}c)}{\sum_{c^{\prime}}P(C{\,{=}\,}c^{\prime})*\prod_{i=1}^{n}P(X% _{i}{\,{=}\,}v_{i}\mid C{\,{=}\,}c^{\prime})}~{}.$

The algorithm does not need to store the augmented data, but can maintain a set of sufficient statistics, which is enough information to compute the required probabilities. Assuming categorical features, sufficient statistics for this algorithm are

• $cc$, the class count, a $k$-valued array such that $cc[c]$ is the sum of the counts of the examples in the augmented data with $class{\,{=}\,}c$

• $fc$, the feature count, a three-dimensional array; for $i$ from 1 to $n$, for each value $v$ in $domain(X_{i})$, and for each class $c$, $fc[i,v,c]$ is the sum of the counts of the augmented examples $t$ with $X_{i}(t){\,{=}\,}v$ and $class(t){\,{=}\,}c$.

In each iteration, it sweeps through the data once to compute the sufficient statistics. The sufficient statistics from the previous iteration are used to infer the new sufficient statistics for the next iteration.

The probabilities required of the model can be computed from $cc$ and $fc$:

 $P(C{\,{=}\,}c)=\frac{cc[c]}{\left|E\right|}$

where $\left|E\right|$ is the number of examples in the original dataset (which is the same as the sum of the counts in the augmented dataset).

 $P(X_{i}{\,{=}\,}v\mid C{\,{=}\,}c)=\frac{fc[i,v,c]}{cc[c]}.$

The algorithm of Figure 10.8 computes the sufficient statistics. Evaluating $P(C{\,{=}\,}c\mid X_{1}{\,{=}\,}v_{1},\dots,X_{n}{\,{=}\,}v_{n})$ in line 17 relies on the counts in $cc$ and $fc$. This algorithm has glossed over how to initialize the counts. One way is for $P(C\mid X_{1}{\,{=}\,}v_{1},\dots,X_{n}{\,{=}\,}v_{n})$ to return a random distribution for the first iteration, so the counts come from the data. Alternatively, the counts can be assigned randomly before seeing any data. See Exercise 10.7.

The algorithm will eventually converge when $cc$ and $fc$ do not change significantly in an iteration. The threshold for the approximately equal in line 21 can be tuned to trade off learning time and accuracy. An alternative is to run the algorithms for a fixed number of iterations.

###### Example 10.10.

Consider Figure 10.7.

When example $\left$ is encountered in the dataset, the algorithm computes

 $\displaystyle{P(C{\,{=}\,}c\mid x_{1}\wedge\neg x_{2}\wedge x_{3}\wedge x_{4})}$ $\displaystyle\mbox{}\propto\mbox{}$ $\displaystyle P(X_{1}{\,{=}\,}1\mid C{\,{=}\,}c)*P(X_{2}{\,{=}\,}0\mid C{\,{=}% \,}c)*P(X_{3}{\,{=}\,}1\mid C{\,{=}\,}c)$ $\displaystyle\mbox{}*P(X_{4}{\,{=}\,}1\mid C{\,{=}\,}c)*P(C{\,{=}\,}c)$ $\displaystyle\mbox{}=\mbox{}$ $\displaystyle\frac{fc[1,1,c]}{cc[c]}*\frac{fc[2,0,c]}{cc[c]}*\frac{fc[3,1,c]}{% cc[c]}*\frac{fc[4,1,c]}{cc[c]}*\frac{cc[c]}{\left|E\right|}$ $\displaystyle\mbox{}\propto\mbox{}$ $\displaystyle\frac{fc[1,1,c]*fc[2,0,c]*fc[3,1,c]*fc[4,1,c]}{cc[c]^{3}}$

for each class $c$ and normalizes the results. Suppose the value computed for class $1$ is 0.4, class 2 is 0.1, and class 3 is 0.5 (as in the augmented data in Figure 10.7). Then $cc\_new[1]$ is incremented by $0.4$, $cc\_new[2]$ is incremented by $0.1$, etc. Values $fc\_new[1,1,1]$, $fc\_new[2,0,1]$, etc. are each incremented by $0.4$. Next, $fc\_new[1,1,2]$, $fc\_new[2,0,2]$ are each incremented by $0.1$, etc.

Notice the similarity with the $k$-means algorithm. The E step (probabilistically) assigns examples to classes, and the M step determines what the classes predict.

As long as $k>1$, EM, like $k$-means, virtually always has multiple local and global maxima. In particular, any permutation of the class labels will give the same probabilities. To try to find a global maximum, multiple restarts can be tried, and a model with the lowest log-likelihood returned.