10.2 Unsupervised Learning 10.2 Unsupervised Learning 10.2.2 Expectation Maximization for Soft Clustering

The third edition of Artificial Intelligence: foundations of computational agents, Cambridge University Press, 2023 is now available (including full text).

10.2.1 k-Means

The k-means algorithm is used for hard clustering. The training examples and the number of classes, $k$ , are given as input. The algorithm assumes that the domain of each feature defining the examples is cardinal (so that differences in values make sense).

The algorithm constructs $k$ classes, a prediction of a value for each feature for each class, and an assignment of examples to classes.

Suppose $E$ is the set of all training examples, and the input features are $X_{1},\dots,X_{n}$ . Let ${X_{j}}({e})$ be the value of input feature $X_{j}$ for example $e$ . We assume that these are observed. We will associate a class with each integer $c\in\{1,\dots,k\}$ .

The $k$ -means algorithm constructs:

•

a function $class:E\rightarrow\{1,\dots,k\}$ , which maps each example to a class. If $class(e)=c$ , we say that $e$ is in class $c$ .
•

for each feature $X_{j}$ , a function $\widehat{X_{j}}$ from classes into the domain of $X_{j}$ , where $\widehat{X_{j}}({c})$ gives the prediction for every member of class $c$ of the value of feature $X_{j}$ .

Example $e$ is predicted to have value $\widehat{X_{j}}({class(e)})$ for feature $X_{j}$ .

The sum-of-squares error is:

\sum_{e\in E}\sum_{j=1}^{n}\left(\widehat{X_{j}}({class(e)})-{X_{j}}({e})% \right)^{2}.

The aim is to find the functions $c l a s s$ and $\widehat{X_{j}}$ that minimize the sum-of-squares error.

As shown in Proposition 7.1, to minimize the sum-of-squares error, the prediction of a class should be the mean of the prediction of the examples in the class. When there are only a few examples, it is possible to search over assignments of examples to classes to minimize the error. Unfortunately, for more than a few examples, there are too many partitions of the examples into $k$ classes for exhaustive search to be feasible.

The $k$ -means algorithm iteratively improves the sum-of-squares error. Initially, it randomly assigns the examples to the classes. Then it carries out the two steps:

•

For each class $c$ and feature $X_{j}$ , assign to $\widehat{X_{j}}({c})$ the mean value of ${X_{j}}({e})$ for each example $e$ in class $c$ :

$\widehat{X_{j}}({c})\leftarrow\frac{\displaystyle\sum_{e:class(e)=c}{X_{j}}({e% })}{\left|\{e:class(e)=c\}\right|}$

where the denominator is the number of examples in class $c$ .
•

Reassign each example to a class: assign each example $e$ to a class $c$ that minimizes

$\sum_{j=1}^{n}\left(\widehat{X_{j}}({c})-{X_{j}}({e})\right)^{2}.$

These two steps are repeated until the second step does not change the assignment of any example.

1: procedure k-means(

X s, E s, k

)

2: Inputs

X s

set of features,

X=\{X_{1},\dots,X_{n}\}

E s

set of training examples

k

number of classes

6: Output

c l a s s

: function from examples to classes

p r e d n

: function from feature and class to a value for that feature

9: Local

10: integer

cc[c]

cc\_new[c]

\triangleright

old and new class count for class

c

11: real

fs[j,c]

fs\_new[j,c]

\triangleright

sum of feature

X_{j}

for class

c

12: Boolean

s t a b l e

13: Initialize

f s

and

c c

randomly based on data

14: define

predn(j,c)={fs[j,c]}/{cc[c]}

\triangleright

estimate of

\widehat{X_{j}}({c})

15: define

class(e)=\arg\min_{c}\sum_{j=1}^{n}\left(predn(j,c)-{X_{j}}({e})\right)^{2}

16: repeat

17:

fs\_new

and

cc\_new

initialized to be all zero

18: for each example

e\in Es

19:

c\;{:}{=}\;\mbox{}class(e)

20:

cc\_new[c]+=1

21: for each feature

X_{j}\in Xs

22:

fs\_new[j,c]+=X_{j}(e)

23:

stable\;{:}{=}\;\mbox{}(fs\_new=fs)

and

(cc\_new=cc)

24:

fs\;{:}{=}\;\mbox{}fs\_new

25:

cc\;{:}{=}\;\mbox{}cc\_new

26: until

s t a b l e

27: return

c l a s s

p r e d n

Figure 10.2: k-means for unsupervised learning

An algorithm that implements $k$ -means is shown in Figure 10.2. It constructs the sufficient statistics to compute the mean of each class for each feature, namely $cc[c]$ is the number of examples in class $c$ , and $fs[j,c]$ is the sum of the value for $X_{j}(e)$ for examples in class $c$ . These are then used in the function $predn(j,c)$ which is the latest estimate of $\widehat{X_{j}}({c})$ and $class(e)$ the class of example $e$ . It uses the current values of $f s$ and $c c$ to determine the next values (in $fs\_new$ and $cc\_new$ ).

The random initialization could be assigning each example to a class at random, selecting $k$ points at random to be representative of the classes, or assigning some, but not all, of the examples to construct the initial sufficient statistics. The latter two methods may be more useful if the data set is large, as they avoid a pass through the whole data set for initialization.

An assignment of examples to classes is stable if an iteration of $k$ -means does not change the assignment. Stability requires that $arg~{}min$ in the definition of $c l a s s$ gives a consistent value for each example in cases where more than one class is minimal. This algorithm has reached a stable assignment when each example is assigned to the same class in one iteration as in the previous iteration. When this happens, $f s$ and $class\_count$ do not change, and so the Boolean variable $s t a b l e$ becomes $t r u e$ .

This algorithm will eventually converge to a stable local minimum. This is easy to see because the sum-of-squares error keeps reducing and there are only a finite number of reassignments. This algorithm often converges in a few iterations.

Example 10.9.

Suppose an agent has observed the $\left<X,Y\right>$ pairs:

$\left<0.7,5.1\right>$ , $\left<1.5,6.1\right>$ , $\left<2.1,4.5\right>$ , $\left<2.4,5.5\right>$ , $\left<3.1,4.4\right>$ , $\left<3.5,5.1\right>$ , $\left<4.5,1.5\right>$ , $\left<5.2,0.7\right>$ , $\left<5.3,1.8\right>$ , $\left<6.2,1.7\right>$ , $\left<6.7,2.5\right>$ , $\left<8.5,9.2\right>$ , $\left<9.1,9.7\right>$ , $\left<9.5,8.5\right>$ .

These data points are plotted in Figure 10.3(a). The agent wants to cluster the data points into two classes ( $k=2$ ).

Figure 10.3: A trace of the

k

-means algorithm for

k=2

for Example 10.9

In Figure 10.3(b), the points are randomly assigned into the classes; one class is depicted as $+$ and the other as $*$ . The mean of the points marked with $+$ is $\left<4.6,3.65\right>$ , shown with $\oplus$ . The mean of the points marked with $*$ is $\left<5.2,6.15\right>$ , shown with $\circledast$ .

In Figure 10.3(c), the points are reassigned according to the closer of the two means. After this reassignment, the mean of the points marked with $+$ is then $\left<3.96,3.27\right>$ . The mean of the points marked with $*$ is $\left<7.15,8.34\right>$ .

In Figure 10.3(d), the points are reassigned to the closest mean. This assignment is stable in that no further reassignment will change the assignment of the examples.

A different initial assignment to the points can give different clustering. One clustering that arises in this data set is for the lower points (those with a $Y$ -value less than 3) to be in one class, and for the other points to be in another class.

Running the algorithm with three classes ( $k=3$ ) typically separates the data into the top-right cluster, the left-center cluster, and the lower cluster. However, there are other possible stable assignments that could be reached, such as, having the top three point in two different classes, and the other points in another class. It is even possible for a class to contain no examples.

Some stable assignments may be better, in terms of sum-of-squares error, than other stable assignments. To find the best assignment, it is often useful to try multiple starting configurations, using a random restart, and selecting a stable assignment with the lowest sum-of-squares error. Note that any permutation of the labels of a stable assignment is also a stable assignment, so there are invariable multiple local minima.

One problem with the $k$ -means algorithm is that it is sensitive to the relative scale of the dimensions. For example, if one feature is $h e i g h t$ , another feature is $a g e$ , and another is a binary feature, the different values need to be scaled so that they can be compared. How they are scaled relative to each other affects the classification.

To find an appropriate number of classes (the $k$ ), an agent could search over the number of classes. Note that $k+1$ classes can always result in a lower error than $k$ classes as long as more than $k$ different values are involved. A natural number of classes would be a value $k$ when there is a large reduction in error going from $k-1$ classes to $k$ classes, but there is only gradual reduction in error for larger values. While it is possible to construct $k+1$ classes from $k$ classes, the optimal division into three classes, for example, may be quite different from the optimal division into two classes.

Artificial Intelligence 2E

10.2.1 k-Means

Example 10.9.

Artificial
Intelligence 2E