Artificial Intelligence - foundations of computational agents -- 11.1.3 EM for Soft Clustering

11.1.3 EM for Soft Clustering

The EM algorithm can be used for soft clustering. Intuitively, for clustering, EM is like the k-means algorithm, but examples are probabilistically in classes, and probabilities define the distance metric. We assume here that the features are discrete.

As in the k-means algorithm, the training examples and the number of classes, k, are given as input.

When clustering, the role of the categorization is to be able to predict the values of the features. To use EM for soft clustering, we can use a naive Bayesian classifier, where the input features probabilistically depend on the class and are independent of each other given the class. The class variable has k values, which can be just {1,...,k}.

Model + Data → Probabilities, where the data are tuples over X₁, X₂, X₃, X₄ (t f t t; f t t f; f f t t; ...) and the probabilities are P(C), P(X₁|C), P(X₂|C), P(X₃|C), and P(X₄|C).

Figure 11.2: EM algorithm: Bayesian classifier with hidden class

Given the naive Bayesian model and the data, the EM algorithm produces the probabilities needed for the classifier, as shown in Figure 11.2. The class variable is C in this figure. The probability distribution of the class and the probabilities of the features given the class are enough to classify any new example.

To initialize the EM algorithm, augment the data with a class feature, C, and a count column. Each original tuple gets mapped into k tuples, one for each class. The counts for the k tuples derived from one original tuple are assigned randomly so that they sum to 1. For example, for four features and three classes, we could have the following:

X₁	X₂	X₃	X₄
...	...	...	...
t	f	t	t
...	...	...	...

-->

X₁	X₂	X₃	X₄	C	Count
...	...	...	...	...	...
t	f	t	t	1	0.4
t	f	t	t	2	0.1
t	f	t	t	3	0.5
...	...	...	...	...	...

If the set of training examples contains multiple tuples with the same values on the input features, these can be grouped together in the augmented data. If there are m tuples in the set of training examples with the same assignment of values to the input features, the sum of the counts in the augmented data with those feature values is equal to m.

Figure 11.3: EM algorithm for unsupervised learning
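The initialization described above can be sketched in Python as follows. This is a minimal illustration, not the book's code: the function name `init_augmented`, the dictionary representation of the augmented data, and the fixed random seed are all assumptions made for the example. Identical tuples are grouped, so a tuple that occurs m times receives counts that sum to m.

```python
import random
from collections import Counter

def init_augmented(data, k, seed=0):
    """Map each distinct tuple to k entries (one per class) with random counts.

    Hypothetical helper: the key is the tuple extended with a class in 1..k,
    and the value is the count. The counts for a tuple that occurs m times
    in `data` sum to m.
    """
    rng = random.Random(seed)
    augmented = {}
    for row, m in Counter(data).items():
        weights = [rng.random() for _ in range(k)]
        total = sum(weights)
        for c in range(1, k + 1):
            augmented[row + (c,)] = m * weights[c - 1] / total
    return augmented

# Made-up data: the tuple (t, f, t, t) occurs twice.
data = [("t", "f", "t", "t"), ("f", "t", "t", "f"), ("t", "f", "t", "t")]
aug = init_augmented(data, k=3)
for key, count in aug.items():
    print(key, round(count, 3))
```

Here the three entries for (t, f, t, t) have counts that sum to 2, and the three entries for (f, t, t, f) have counts that sum to 1.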

The EM algorithm, illustrated in Figure 11.3, maintains both the probability tables and the augmented data. In the E step, it updates the counts, and in the M step it updates the probabilities.

1: Procedure EM(X,D,k)
2:           Inputs
3:                     X set of features X={X₁,...,X_n}
4:                     D data set on features {X₁,...,X_n}
5:                     k number of classes
             Output
6:                     P(C), P(X_i|C) for each i∈{1:n}, where C={1,...,k}.
             Local
7:                     real array A[X₁,...,X_n,C]
8:                     real array P[C]
9:                     real arrays M_i[X_i,C] for each i∈{1:n}
10:                     real arrays P_i[X_i,C] for each i∈{1:n}
11:           s← number of tuples in D
12:           Assign P[C], P_i[X_i,C] arbitrarily
13:           repeat
14:                     // E Step
15:                     for each assignment ⟨X₁=v₁,...,X_n=v_n⟩∈D do
16:                               let m ←|⟨X₁=v₁,...,X_n=v_n⟩∈D|
17:                               for each c ∈{1:k} do
18:                                         A[v₁,...,v_n,c]←m×P(C=c|X₁=v₁,...,X_n=v_n)
19:
20:
21:                     // M Step
22:                     for each i∈{1:n} do
23:                               M_i[X_i,C]←∑_{X₁,...,X_{i-1},X_{i+1},...,X_n} A[X₁,...,X_n,C]
24:                               P_i[X_i,C]←M_i[X_i,C]/(∑_{X_i'} M_i[X_i',C])
25:
26:                     P[C]←(∑_{X₁,...,X_n} A[X₁,...,X_n,C])/s
27:           until termination

Figure 11.4: EM for unsupervised learning

The algorithm is presented in Figure 11.4. In this figure, A[X₁,...,X_n,C] contains the augmented data; M_i[X_i,C] holds the marginal counts for X_i and C derived from A, which are proportional to the marginal probability P(X_i,C); and P_i[X_i,C] is the conditional probability P(X_i|C). The algorithm repeats two steps:

E step: Update the augmented data based on the probability distribution. Suppose there are m copies of the tuple ⟨X₁=v₁,...,X_n=v_n⟩ in the original data. In the augmented data, the count associated with class c, stored in A[v₁,...,v_n,c], is updated to
m ×P(C=c|X₁=v₁,...,X_n=v_n).

Note that this step involves probabilistic inference, as shown below.
M step: Infer the maximum-likelihood probabilities for the model from the augmented data. This is the same problem as learning probabilities from data.

The EM algorithm presented starts with made-up probabilities. It could have started with made-up counts. EM will converge to a local maximum of the likelihood of the data. The algorithm can terminate when the changes are small enough.
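The two steps can be put together in a short Python sketch. It follows the structure of Figure 11.4 but is only an illustration under stated assumptions: the name `em`, the dictionary-based tables, the `domains` argument, and the fixed iteration count (used in place of a "changes are small enough" test) are choices made for this example and are not part of the original figure.

```python
import math
import random
from collections import Counter

def em(data, domains, k, iterations=50, seed=0):
    """Soft clustering with a naive Bayes model and a hidden class.

    data: list of tuples of discrete feature values
    domains: domains[i] lists the possible values of feature i
    Returns (P, Pi, A): P[c] ~ P(C=c), Pi[i][(v, c)] ~ P(X_i=v | C=c),
    and the augmented counts A[(row, c)].
    """
    rng = random.Random(seed)
    n, s = len(domains), len(data)
    classes = range(k)

    def random_dist(keys):
        w = {key: rng.random() + 0.1 for key in keys}  # strictly positive
        z = sum(w.values())
        return {key: x / z for key, x in w.items()}

    # Arbitrary (made-up) initial probabilities.
    P = random_dist(classes)
    Pi = [{} for _ in range(n)]
    for i in range(n):
        for c in classes:
            d = random_dist(domains[i])
            for v in domains[i]:
                Pi[i][(v, c)] = d[v]

    tuple_counts = Counter(data)  # m for each distinct tuple
    A = {}
    for _ in range(iterations):
        # E step: A[row, c] = m * P(C=c | row)
        for row, m in tuple_counts.items():
            joint = [P[c] * math.prod(Pi[i][(row[i], c)] for i in range(n))
                     for c in classes]
            z = sum(joint)
            for c in classes:
                A[(row, c)] = m * joint[c] / z
        # M step: maximum-likelihood probabilities from the augmented counts.
        class_count = [0.0] * k
        M = [dict.fromkeys(Pi[i], 0.0) for i in range(n)]
        for (row, c), count in A.items():
            class_count[c] += count
            for i in range(n):
                M[i][(row[i], c)] += count
        P = {c: class_count[c] / s for c in classes}
        for i in range(n):
            for (v, c) in M[i]:
                if class_count[c] > 0:  # keep old values for an empty class
                    Pi[i][(v, c)] = M[i][(v, c)] / class_count[c]
    return P, Pi, A

# Made-up Boolean data on four features.
data = [("t", "f", "t", "t"), ("f", "t", "t", "f"),
        ("f", "f", "t", "t"), ("t", "f", "t", "t")]
domains = [["t", "f"]] * 4
P, Pi, A = em(data, domains, k=2)
print({c: round(p, 3) for c, p in P.items()})
```

A different seed can lead to a different local maximum, so in practice one might run the sketch from several starting points and compare the resulting likelihoods.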

This algorithm returns a probabilistic model, which can be used to classify an existing or new example. The way to classify a new example, and the way to evaluate line 18, is to use the following:

$$P(C=c \mid X_1=v_1,\dots,X_n=v_n) = \frac{P(C=c)\,\prod_{i=1}^{n} P(X_i=v_i \mid C=c)}{\sum_{c'} P(C=c')\,\prod_{i=1}^{n} P(X_i=v_i \mid C=c')}.$$

The probabilities in this equation are specified as part of the model learned.

Notice the similarity with the k-means algorithm. The E step (probabilistically) assigns examples to classes, and the M step determines what the classes predict.

Example 11.3: Consider Figure 11.3. Let E' be the augmented examples (i.e., with C and the count columns). Suppose there are m examples in total (the quantity called s in Figure 11.4). Thus, at all times the sum of the counts in E' is m.

In the M step, P(C=i) is set to the proportion of the counts with C=i, which is

(∑_{X₁,...,X_n} A[X₁,...,X_n,C=i])/(m) ,

which can be computed with one pass through the data.

M₁[X₁,C], for example, becomes

∑_{X₂,X₃ ,X₄} A[X₁,...,X₄,C].

It is possible to update all of the M_i[X_i,C] arrays with one pass through the data. See Exercise 11.3. The conditional probabilities represented by the P_i arrays can be computed from the M_i arrays by normalizing.
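One way the M step might be organized as a single pass over the augmented rows is sketched below. The function name `m_step` and the row layout (feature values, then class, then count) are assumptions for illustration. The first three rows are the ones shown in Figure 11.3; the counts for the second tuple are made up.

```python
from collections import defaultdict

def m_step(augmented, n):
    """One pass over augmented rows of the form (v_1, ..., v_n, c, count).

    Returns (P, cond) with P[c] ~ P(C=c) and cond[i][(v, c)] ~ P(X_i=v | C=c).
    Pairs (v, c) that never occur are simply absent (probability 0).
    """
    class_total = defaultdict(float)             # sum of counts per class
    M = [defaultdict(float) for _ in range(n)]   # M[i][(v, c)]: marginal counts
    for *values, c, count in augmented:
        class_total[c] += count
        for i, v in enumerate(values):
            M[i][(v, c)] += count
    total = sum(class_total.values())
    P = {c: t / total for c, t in class_total.items()}
    # Normalize each M_i over the values of X_i, separately for each class.
    cond = [{(v, c): m / class_total[c] for (v, c), m in M[i].items()}
            for i in range(n)]
    return P, cond

rows = [
    ("t", "f", "t", "t", 1, 0.4),
    ("t", "f", "t", "t", 2, 0.1),
    ("t", "f", "t", "t", 3, 0.5),
    ("f", "t", "t", "f", 1, 0.2),  # hypothetical counts for a second tuple
    ("f", "t", "t", "f", 2, 0.7),
    ("f", "t", "t", "f", 3, 0.1),
]
P, cond = m_step(rows, n=4)
print(P)
```

With these rows, the class totals are 0.6, 0.8, and 0.6 out of a total count of 2, so P(C) comes out as approximately 0.3, 0.4, and 0.3.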

The E step updates the counts in the augmented data. It will replace the 0.4 in Figure 11.3 with

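$$P(C=1 \mid x_1 \wedge \neg x_2 \wedge x_3 \wedge x_4) = \frac{P(x_1 \mid C=1)\,P(\neg x_2 \mid C=1)\,P(x_3 \mid C=1)\,P(x_4 \mid C=1)\,P(C=1)}{\sum_{i=1}^{3} P(x_1 \mid C=i)\,P(\neg x_2 \mid C=i)\,P(x_3 \mid C=i)\,P(x_4 \mid C=i)\,P(C=i)}.$$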

Each of these probabilities is provided as part of the learned model.
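To make this E-step update concrete, the following sketch evaluates the posterior for the tuple x₁ ∧ ¬x₂ ∧ x₃ ∧ x₄ under a three-class model. The probability tables are invented purely for illustration (the source does not give numeric values), and the function name `posterior` and the table layout `cond[i][c]` are assumptions.

```python
import math

def posterior(example, prior, cond):
    """Return [P(C=c | example) for each class c] under a naive Bayes model.

    prior[c] is P(C=c); cond[i][c] is P(X_i = true | C=c) for Boolean X_i.
    """
    joint = []
    for c, p_c in enumerate(prior):
        likelihood = math.prod(
            cond[i][c] if value else 1.0 - cond[i][c]
            for i, value in enumerate(example)
        )
        joint.append(p_c * likelihood)
    z = sum(joint)  # the denominator: sum over all classes
    return [j / z for j in joint]

# Hypothetical model: list position 0, 1, 2 stands for class 1, 2, 3.
prior = [0.5, 0.3, 0.2]
cond = [
    [0.8, 0.2, 0.5],  # P(x1 | C)
    [0.5, 0.5, 0.5],  # P(x2 | C)
    [0.9, 0.9, 0.9],  # P(x3 | C)
    [0.5, 0.5, 0.5],  # P(x4 | C)
]
new_counts = posterior((True, False, True, True), prior, cond)
print([round(p, 3) for p in new_counts])  # [0.714, 0.107, 0.179]
```

Under these made-up tables, the count 0.4 for class 1 would be replaced by 5/7 ≈ 0.714, and the counts for classes 2 and 3 by 3/28 and 5/28, so the three counts still sum to 1.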

Note that, as long as k>1, the likelihood that EM maximizes virtually always has multiple local maxima, so EM can converge to different solutions from different starting points. In particular, any permutation of the class labels of a local maximum is also a local maximum.
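The label-permutation point can be checked numerically. The sketch below computes the log-likelihood of a small data set under a two-class model and under the same model with the class labels swapped. The data, the probability tables, and the function name `log_likelihood` are all hypothetical; note that the tables here are indexed as `cond[c][i]` so that swapping classes is a simple reversal.

```python
import math

def log_likelihood(data, prior, cond):
    """Log-likelihood of Boolean data under a naive Bayes model with hidden class.

    prior[c] is P(C=c); cond[c][i] is P(X_i = true | C=c).
    """
    total = 0.0
    for row in data:
        # P(row) sums the joint probability over the hidden class.
        p_row = sum(
            prior[c] * math.prod(
                cond[c][i] if x else 1.0 - cond[c][i]
                for i, x in enumerate(row))
            for c in range(len(prior)))
        total += math.log(p_row)
    return total

# Made-up data and model parameters.
data = [(True, False, True), (False, True, True), (True, True, False)]
prior = [0.6, 0.4]
cond = [[0.9, 0.2, 0.7], [0.1, 0.8, 0.5]]

# Swap the two class labels: the same model with the classes renamed.
swapped_prior = prior[::-1]
swapped_cond = cond[::-1]

original = log_likelihood(data, prior, cond)
swapped = log_likelihood(data, swapped_prior, swapped_cond)
print(math.isclose(original, swapped))  # True
```

Because the class is summed out when computing the probability of each tuple, renaming the classes cannot change the likelihood, which is why the class labels found by EM carry no intrinsic meaning.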
