10.2 Unsupervised Learning

10.2.1 k-Means

The k-means algorithm is used for hard clustering. The training examples and the number of classes, k, are given as input. The algorithm assumes that the domain of each feature defining the examples is cardinal (so that differences in values make sense).

The algorithm constructs k classes, a prediction of a value for each feature for each class, and an assignment of examples to classes.

Suppose $E$ is the set of all training examples, and the input features are $X_1, \dots, X_n$. Let $X_j(e)$ be the value of input feature $X_j$ for example $e$. We assume that these values are observed. We will associate a class with each integer $c \in \{1, \dots, k\}$.

The k-means algorithm constructs:

  • a function $class : E \rightarrow \{1, \dots, k\}$, which maps each example to a class. If $class(e) = c$, we say that $e$ is in class $c$.

  • for each feature $X_j$, a function $\widehat{X_j}$ from classes into the domain of $X_j$, where $\widehat{X_j}(c)$ gives the prediction for every member of class $c$ of the value of feature $X_j$.

Example $e$ is predicted to have value $\widehat{X_j}(class(e))$ for feature $X_j$.

The sum-of-squares error is:

$$\sum_{e \in E} \sum_{j=1}^{n} \left(\widehat{X_j}(class(e)) - X_j(e)\right)^2.$$

The aim is to find the functions $class$ and $\widehat{X_j}$ that minimize the sum-of-squares error.
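
As a concrete reading of this formula, the following small Python sketch (illustrative, not the book's code) computes the sum-of-squares error. It assumes each example is a list of numeric feature values, classes[i] is the class of example i, and predn(j, c) is the prediction of feature j for class c; these names are assumptions made for the sketch.

def sum_of_squares_error(examples, num_features, classes, predn):
    """Sum over all examples and features of the squared difference
    between the class's prediction and the example's actual value."""
    return sum((predn(j, classes[i]) - e[j]) ** 2
               for i, e in enumerate(examples)
               for j in range(num_features))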

As shown in Proposition 7.1, to minimize the sum-of-squares error, the prediction for a class should be the mean of the values of the examples in that class. When there are only a few examples, it is possible to search over assignments of examples to classes to minimize the error. Unfortunately, for more than a few examples, there are too many partitions of the examples into k classes for exhaustive search to be feasible.
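
To see why the mean minimizes the within-class error, fix a class $c$ and a feature $X_j$, write $E_c = \{e : class(e) = c\}$, and let $v$ stand for the candidate prediction $\widehat{X_j}(c)$. The contribution of this class and feature to the error is the quadratic

$$f(v) = \sum_{e \in E_c} \left(v - X_j(e)\right)^2, \qquad f'(v) = 2 \sum_{e \in E_c} \left(v - X_j(e)\right) = 0 \;\Longrightarrow\; v = \frac{1}{|E_c|} \sum_{e \in E_c} X_j(e),$$

which is exactly the mean of the values of $X_j$ for the examples in class $c$.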

The k-means algorithm iteratively improves the sum-of-squares error. Initially, it randomly assigns the examples to the classes. Then it carries out the two steps:

  • For each class $c$ and feature $X_j$, assign to $\widehat{X_j}(c)$ the mean value of $X_j(e)$ for the examples $e$ in class $c$:

    $$\widehat{X_j}(c) \leftarrow \frac{\sum_{e : class(e)=c} X_j(e)}{|\{e : class(e)=c\}|}$$

    where the denominator is the number of examples in class $c$.

  • Reassign each example to a class: assign each example $e$ to a class $c$ that minimizes

    $$\sum_{j=1}^{n} \left(\widehat{X_j}(c) - X_j(e)\right)^2.$$

These two steps are repeated until the second step does not change the assignment of any example.

procedure k-means(Xs, Es, k)
     Inputs
          Xs: set of features, Xs = {X_1, ..., X_n}
          Es: set of training examples
          k: number of classes
     Output
          class: function from examples to classes
          predn: function from feature and class to a value for that feature
     Local
          integer cc[c], cc_new[c]        ▷ old and new class count for class c
          real fs[j, c], fs_new[j, c]     ▷ old and new sum of feature X_j for class c
          Boolean stable
     initialize fs and cc randomly based on data
     define predn(j, c) = fs[j, c] / cc[c]        ▷ estimate of X̂_j(c)
     define class(e) = argmin_c Σ_{j=1}^{n} (predn(j, c) - X_j(e))^2
     repeat
          initialize fs_new and cc_new to be all zero
          for each example e ∈ Es do
               c := class(e)
               cc_new[c] += 1
               for each feature X_j ∈ Xs do
                    fs_new[j, c] += X_j(e)
          stable := (fs_new = fs) and (cc_new = cc)
          fs := fs_new
          cc := cc_new
     until stable
     return class, predn

Figure 10.2: k-means for unsupervised learning

An algorithm that implements k-means is shown in Figure 10.2. It maintains the sufficient statistics needed to compute the mean of each class for each feature: cc[c] is the number of examples in class c, and fs[j,c] is the sum of the values of $X_j(e)$ for the examples in class c. These statistics are used by the function predn(j,c), the latest estimate of $\widehat{X_j}(c)$, and by class(e), the class of example e. The algorithm uses the current values of fs and cc to determine the next values (in fs_new and cc_new).
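
For readers who want to experiment, here is a compact Python sketch of the algorithm of Figure 10.2 (an illustration under the assumptions stated below, not the book's implementation). It assumes each example is a sequence of num_features numeric values; the names kmeans, predn, and class_of are chosen to mirror the pseudocode.

import random

def kmeans(examples, num_features, k, max_iter=100):
    """Sketch of the k-means of Figure 10.2 using sufficient statistics.
    examples: list of examples, each a sequence of num_features numbers.
    Returns the class of each example and the per-class predictions."""
    # Initialize the sufficient statistics from a random assignment of the
    # examples to classes (one of the initialization options in the text).
    cc = [0] * k                                   # cc[c]: number of examples in class c
    fs = [[0.0] * num_features for _ in range(k)]  # fs[c][j]: sum of feature j in class c
    for e in examples:
        c = random.randrange(k)
        cc[c] += 1
        for j in range(num_features):
            fs[c][j] += e[j]

    def predn(j, c):
        # Latest estimate of the mean of feature j in class c (arbitrary if the class is empty).
        return fs[c][j] / cc[c] if cc[c] > 0 else 0.0

    def class_of(e):
        # Class whose predictions are closest to e in squared distance.
        return min(range(k),
                   key=lambda c: sum((predn(j, c) - e[j]) ** 2
                                     for j in range(num_features)))

    # The figure loops until stable; a cap on iterations is added as a safeguard.
    for _ in range(max_iter):
        cc_new = [0] * k
        fs_new = [[0.0] * num_features for _ in range(k)]
        for e in examples:          # one pass over the data per iteration
            c = class_of(e)         # uses the current statistics via predn
            cc_new[c] += 1
            for j in range(num_features):
                fs_new[c][j] += e[j]
        stable = (cc_new == cc) and (fs_new == fs)
        cc, fs = cc_new, fs_new
        if stable:                  # no example changed class
            break
    return [class_of(e) for e in examples], predn

As in the figure, only the sufficient statistics fs and cc are carried from one iteration to the next, so each iteration makes a single pass over the data.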

The random initialization could be assigning each example to a class at random, selecting k points at random to be representative of the classes, or assigning some, but not all, of the examples to construct the initial sufficient statistics. The latter two methods may be more useful if the data set is large, as they avoid a pass through the whole data set for initialization.
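
For instance, the second option, selecting k points at random to be representative of the classes, could be sketched as follows, treating each chosen example as the entire initial sufficient statistics of one class (illustrative code, using the same assumed representation as the sketch above):

import random

def init_from_representatives(examples, num_features, k):
    """Pick k distinct examples at random; each becomes the initial
    statistics of one class: a count of 1 and that example's values as
    the feature sums, so the initial class means are the k examples."""
    reps = random.sample(examples, k)
    cc = [1] * k
    fs = [[rep[j] for j in range(num_features)] for rep in reps]
    return cc, fs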

An assignment of examples to classes is stable if an iteration of k-means does not change the assignment. Stability requires that the argmin in the definition of class gives a consistent value for each example in cases where more than one class is minimal. The algorithm has reached a stable assignment when each example is assigned to the same class in one iteration as in the previous iteration. When this happens, fs and cc do not change, and so the Boolean variable stable becomes true.

This algorithm will eventually converge to a stable local minimum. This is easy to see: each iteration either reduces the sum-of-squares error or leaves the assignment unchanged, and there are only a finite number of possible assignments of examples to classes. In practice, the algorithm often converges in a few iterations.

Example 10.9.

Suppose an agent has observed the following ⟨X, Y⟩ pairs:

⟨0.7, 5.1⟩, ⟨1.5, 6.1⟩, ⟨2.1, 4.5⟩, ⟨2.4, 5.5⟩, ⟨3.1, 4.4⟩, ⟨3.5, 5.1⟩, ⟨4.5, 1.5⟩, ⟨5.2, 0.7⟩, ⟨5.3, 1.8⟩, ⟨6.2, 1.7⟩, ⟨6.7, 2.5⟩, ⟨8.5, 9.2⟩, ⟨9.1, 9.7⟩, ⟨9.5, 8.5⟩.

These data points are plotted in Figure 10.3(a). The agent wants to cluster the data points into two classes (k=2).
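
Using the kmeans sketch above, this data can be clustered as follows; which stable assignment is found depends on the random initialization, so different runs may give different clusterings.

# The ⟨X, Y⟩ pairs of Example 10.9.
data = [(0.7, 5.1), (1.5, 6.1), (2.1, 4.5), (2.4, 5.5), (3.1, 4.4),
        (3.5, 5.1), (4.5, 1.5), (5.2, 0.7), (5.3, 1.8), (6.2, 1.7),
        (6.7, 2.5), (8.5, 9.2), (9.1, 9.7), (9.5, 8.5)]

classes, predn = kmeans(data, num_features=2, k=2)
print(classes)                                          # class of each example
print([(predn(0, c), predn(1, c)) for c in range(2)])   # the two class means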

[Figure 10.3: A trace of the k-means algorithm for k=2 for Example 10.9. Panels: (a) the examples; (b) random assignment; (c) first reassignment; (d) stable assignment found. Each panel plots the points with both axes running from 0 to 10.]

In Figure 10.3(b), the points are randomly assigned into the classes; one class is depicted as + and the other as *. The mean of the points marked with + is ⟨4.6, 3.65⟩, and the mean of the points marked with * is ⟨5.2, 6.15⟩; both means are also plotted in the figure.

In Figure 10.3(c), the points are reassigned according to the closer of the two means. After this reassignment, the mean of the points marked with + is ⟨3.96, 3.27⟩, and the mean of the points marked with * is ⟨7.15, 8.34⟩.

In Figure 10.3(d), the points are reassigned to the closest mean. This assignment is stable in that no further reassignment will change the assignment of the examples.

A different initial assignment of the points can give a different clustering. One clustering that arises in this data set is for the lower points (those with a Y-value less than 3) to be in one class, and for the other points to be in the other class.

Running the algorithm with three classes (k=3) typically separates the data into the top-right cluster, the left-center cluster, and the lower cluster. However, other stable assignments can be reached, such as having the three points at the top right split across two classes and all the other points in the third class. It is even possible for a class to contain no examples.

Some stable assignments may be better, in terms of sum-of-squares error, than other stable assignments. To find the best assignment, it is often useful to try multiple starting configurations, using random restarts, and to select a stable assignment with the lowest sum-of-squares error. Note that any permutation of the labels of a stable assignment is also a stable assignment, so there are invariably multiple local minima.
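
A random-restart wrapper around the kmeans sketch above might look as follows, reusing the sum_of_squares_error function sketched earlier (illustrative code):

def kmeans_restarts(examples, num_features, k, restarts=20):
    """Run k-means from several random starts and keep the stable
    assignment with the lowest sum-of-squares error."""
    best = None
    for _ in range(restarts):
        classes, predn = kmeans(examples, num_features, k)
        error = sum_of_squares_error(examples, num_features, classes, predn)
        if best is None or error < best[0]:
            best = (error, classes, predn)
    return best   # (error, classes, predn) of the best run found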

One problem with the k-means algorithm is that it is sensitive to the relative scale of the dimensions. For example, if one feature is height, another feature is age, and another is a binary feature, the different values need to be scaled so that they can be compared. How they are scaled relative to each other affects the classification.
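
One common choice, sketched below, is to standardize each feature to zero mean and unit variance before clustering, so that no feature dominates the squared distances merely because of its units; other rescalings are possible and lead to different classifications (illustrative code):

def standardize(examples, num_features):
    """Rescale each feature to zero mean and unit variance."""
    n = len(examples)
    means = [sum(e[j] for e in examples) / n for j in range(num_features)]
    sds = [(sum((e[j] - means[j]) ** 2 for e in examples) / n) ** 0.5
           for j in range(num_features)]
    return [[(e[j] - means[j]) / sds[j] if sds[j] > 0 else 0.0
             for j in range(num_features)]
            for e in examples]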

To find an appropriate number of classes (the k), an agent could search over the number of classes. Note that k+1 classes can always result in a lower error than k classes, as long as more than k different values are involved. A natural number of classes would be a value k where there is a large reduction in error going from k-1 classes to k classes, but only a gradual reduction in error for larger values. While it is possible to construct k+1 classes from k classes, the optimal division into three classes, for example, may be quite different from the optimal division into two classes.
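
One way to carry out this search, using the restart wrapper above, is to record the best error found for each candidate k and look for the value beyond which the error only decreases gradually (illustrative code):

def error_by_k(examples, num_features, k_values, restarts=10):
    """Best sum-of-squares error found for each candidate number of classes."""
    return {k: kmeans_restarts(examples, num_features, k, restarts)[0]
            for k in k_values}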