11.2.3 Missing Data

Data can be incomplete in ways other than having an unobserved variable. A data set can simply be missing the values of some variables for some of the tuples. When some values are missing, one must be very careful in using the data set, because whether a value is missing may itself be correlated with the phenomenon of interest.

Example 11.6: Suppose you have a (claimed) treatment for a disease that does not actually cure the disease or relieve its symptoms. All it does is make sick people sicker. Suppose patients are randomly assigned either to take the treatment or not. In both groups, the sickest people drop out of the study because they become too sick to participate. Because the treatment makes sick people sicker, the sick people who took the treatment drop out at a faster rate than the sick people who did not. Thus, if the patients for whom the data is missing are ignored, it looks like the treatment works: there are fewer sick people among those who took the treatment and remained in the study!
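The arithmetic behind this example can be sketched in a few lines of Python. This is only an illustration under assumed numbers: the group sizes, the dropout fractions, and the names `Arm`, `sick_rate_all`, and `sick_rate_observed` are hypothetical and do not come from any real study. It also assumes that patients who recover never drop out and that the treatment has no effect on who recovers.

```python
from dataclasses import dataclass


@dataclass
class Arm:
    """One group of the hypothetical study (all numbers are made up)."""
    name: str
    n_sick: int          # patients who are still sick at the end
    n_recovered: int     # patients who recovered on their own
    dropout_sick: float  # assumed fraction of sick patients who drop out


def sick_rate_all(arm: Arm) -> float:
    """Proportion sick among everyone assigned to the arm (no data missing)."""
    return arm.n_sick / (arm.n_sick + arm.n_recovered)


def sick_rate_observed(arm: Arm) -> float:
    """Proportion sick among those who remain after sick patients drop out."""
    remaining_sick = arm.n_sick * (1 - arm.dropout_sick)
    return remaining_sick / (remaining_sick + arm.n_recovered)


# The treatment does not change who recovers, so both arms have the same
# true outcome. It only makes sick people sicker, so more of them drop out.
treated = Arm("treated", n_sick=500, n_recovered=500, dropout_sick=0.6)
control = Arm("control", n_sick=500, n_recovered=500, dropout_sick=0.2)

for arm in (treated, control):
    print(f"{arm.name}: true sick rate {sick_rate_all(arm):.2f}, "
          f"observed sick rate {sick_rate_observed(arm):.2f}")
```

With these assumed numbers, both arms have a true sick rate of 0.50, but among the patients who remain the treated arm shows about 0.29 and the control arm about 0.44. Ignoring the dropouts makes a useless treatment look beneficial.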

If the data is missing at random, that is, if whether a value is missing is unrelated to the phenomenon of interest, the missing data can be ignored. However, "missing at random" is a strong assumption. In general, an agent should construct a model of why the data is missing or, preferably, it should go out into the world and find out why the data is missing.
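The difference between the two situations can be illustrated with a small simulation. This is a sketch under assumptions, not a general method: the population size, the 50% sickness rate, the dropout probabilities, and the helper `observed_sick_rate` are all hypothetical choices made for illustration.

```python
import random


def observed_sick_rate(patients, keep):
    """Proportion sick among the patients for whom keep(sick) is True."""
    kept = [sick for sick in patients if keep(sick)]
    return sum(kept) / len(kept)


rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical population: each patient is sick with probability 0.5.
patients = [rng.random() < 0.5 for _ in range(100_000)]
true_rate = sum(patients) / len(patients)

# Case 1: every patient drops out with probability 0.3, sick or not.
random_dropout = observed_sick_rate(
    patients, lambda sick: rng.random() >= 0.3)

# Case 2: only sick patients drop out, each with probability 0.6.
sick_dropout = observed_sick_rate(
    patients, lambda sick: rng.random() >= (0.6 if sick else 0.0))

print(f"true rate:            {true_rate:.3f}")
print(f"random dropout:       {random_dropout:.3f}")
print(f"sickness-based dropout: {sick_dropout:.3f}")
```

When dropout is unrelated to sickness, the rate among the remaining patients stays close to the true rate, so ignoring the missing tuples does little harm. When dropout depends on sickness, the remaining patients look much healthier than the population really is, which is why a model of the missingness is needed.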