foundations of computational agents
When data is missing some values for some features, the missing data cannot be ignored. Example 10.13 gives an example where ignoring missing data leads to wrong conclusions. Making inference from missing data is a causality problem, as it cannot be solved by observation, but requires a causal model.
A missingness graph, or m-graph for short, is used to model data where some values might be missing. Start with a belief network model of the domain. In the m-graph, all variables in the original graph exist with the same parents. For each variable $V$ that could be observed with some values missing, the m-graph contains two extra variables:
$M\mathrm{\_}V$, a Boolean variable that is true when $V$’s value is missing. The parents of this node can be whatever variables the missingness is assumed to depend on.
A variable ${V}^{\ast}$, with domain $dom(V)\cup \{missing\}$, where $missing$ is a new value (not in the domain of $V$). The only parents of ${V}^{\ast}$ are $V$ and $M\mathrm{\_}V$. The conditional probability table contains only 0 and 1, with the 1s being
$$P({V}^{\ast}=missing\mid M\mathrm{\_}V=true)=1$$
$$P({V}^{\ast}=v\mid M\mathrm{\_}V=false\wedge V=v)=1.$$
If the value of $V$ is observed to be $v$, then ${V}^{\ast}=v$ is conditioned on. If the value for $V$ is missing, ${V}^{\ast}=missing$ is conditioned on. Note that ${V}^{\ast}$ is always observed and conditioned on, and $V$ is never conditioned on, in this augmented model. When modeling a domain, the parents of $M\mathrm{\_}V$ specify what the missingness depends on.
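The deterministic conditional probability table for ${V}^{\ast}$ can be sketched in a few lines of Python. This is only an illustration of the definition above; the names `v_star_cpt`, `observe`, and `MISSING` are hypothetical and not part of any library.

```python
MISSING = "missing"  # the new value, assumed not to be in dom(V)

def v_star_cpt(domain):
    """Return P(V* = s | V = v, M_V = m) as a dict keyed by (s, v, m)."""
    star_domain = list(domain) + [MISSING]
    cpt = {}
    for v in domain:
        for m in (True, False):
            for s in star_domain:
                if m:
                    # M_V is true: V* is missing, whatever V is.
                    cpt[(s, v, m)] = 1.0 if s == MISSING else 0.0
                else:
                    # M_V is false: V* copies the value of V.
                    cpt[(s, v, m)] = 1.0 if s == v else 0.0
    return cpt

def observe(v, m):
    """The value of V* that is conditioned on, given V = v and M_V = m."""
    return MISSING if m else v

cpt = v_star_cpt(["a", "b"])
```

Every entry of the table is 0 or 1, and for each pair of parent values exactly one value of ${V}^{\ast}$ has probability 1.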
Example 10.13 gives a problematic case of a drug that just makes people sicker, so they drop out of the study, giving missing data. A graphical model for it is shown in Figure 11.5.
Assume $Take\mathrm{\_}drug$ is Boolean and the domains of $Sick\mathrm{\_}before$ and $Sick\mathrm{\_}after$ are $\{well$, $sick$, $very\mathrm{\_}sick\}$. Then the domain of $Sick\mathrm{\_}afte{r}^{\ast}$ is $\{well$, $sick$, $very\mathrm{\_}sick$, $missing\}$. The variable $M\mathrm{\_}Sick\mathrm{\_}after$ is Boolean.
Suppose there is a dataset from which to learn, with $Sick\mathrm{\_}before$ and $Take\mathrm{\_}drug$ observed for each example, and some examples have $Sick\mathrm{\_}after$ observed and some have it missing. To condition the m-graph on an example, all of the variables except $Sick\mathrm{\_}after$ are conditioned on. $Sick\mathrm{\_}afte{r}^{\ast}$ has the value of $Sick\mathrm{\_}after$ when it is observed, and has value $missing$ otherwise.
You might think that you can learn the missing data using expectation maximization (EM), with $Sick\mathrm{\_}after$ as a hidden variable. There are, however, many probability distributions that are consistent with the data. All of the missing cases could have value $well$ for $Sick\mathrm{\_}after$, or they all could be $very\mathrm{\_}sick$; you can’t tell from the data. EM can converge to any one of these distributions that are consistent with the data. Thus, although EM may converge, it does not converge to something that makes predictions that can be trusted.
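The following minimal sketch illustrates why the data cannot distinguish these alternatives. It makes simplifying assumptions: $Sick\mathrm{\_}before$ and $Take\mathrm{\_}drug$ are ignored, the probabilities are made up, and the names `observed_distribution`, `marginal`, `model_a`, and `model_b` are hypothetical. Two joint distributions over $Sick\mathrm{\_}after$ and $M\mathrm{\_}Sick\mathrm{\_}after$ are built that disagree about the missing cases but induce exactly the same distribution over $Sick\mathrm{\_}afte{r}^{\ast}$.

```python
from fractions import Fraction as F

DOMAIN = ["well", "sick", "very_sick"]

def observed_distribution(joint):
    """Map a joint P(Sick_after, M_Sick_after) to the distribution of Sick_after*."""
    dist = {v: F(0) for v in DOMAIN + ["missing"]}
    for (value, is_missing), p in joint.items():
        dist["missing" if is_missing else value] += p
    return dist

def marginal(joint):
    """Marginal P(Sick_after), summing out M_Sick_after."""
    out = {v: F(0) for v in DOMAIN}
    for (value, _), p in joint.items():
        out[value] += p
    return out

# Made-up probabilities for the non-missing cases, shared by both models.
observed_part = {
    ("well", False): F(3, 10),
    ("sick", False): F(2, 10),
    ("very_sick", False): F(1, 10),
}
# Model A: everyone who dropped out was well.
model_a = {**observed_part, ("well", True): F(4, 10)}
# Model B: everyone who dropped out was very sick.
model_b = {**observed_part, ("very_sick", True): F(4, 10)}

# Both models predict the same observable data ...
print(observed_distribution(model_a) == observed_distribution(model_b))  # True
# ... but disagree about P(Sick_after = well).
print(marginal(model_a)["well"], marginal(model_b)["well"])  # 7/10 3/10
```

Because both models assign the same probability to every possible dataset of observed values, no amount of such data favors one over the other.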
To determine appropriate model parameters, one should find some data about the relationship between $Sick\mathrm{\_}after$ and $M\mathrm{\_}Sick\mathrm{\_}after$. When doing a human study, the designers of the study need to try to find out why people dropped out of the study. These cases cannot just be ignored.
A distribution is recoverable or identifiable from missing data if the distribution can be accurately measured from the data, even with parts of the data missing. Whether a distribution is recoverable is a property of the underlying graph. A distribution that is not recoverable cannot be reconstructed from observational data, no matter how large the dataset. The distribution in Example 11.5 is not recoverable.
Data is missing completely at random (MCAR) if $V$ and $M\mathrm{\_}V$ are independent. If the data is missing completely at random, the examples with missing values can be ignored. This is a strong assumption that rarely holds in practice, but is often implicitly assumed when missingness is ignored.
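As a small check of this claim, the sketch below computes $P(V\mid M\mathrm{\_}V=false)$ exactly for a made-up two-valued variable. The function name `distribution_among_observed` and the numbers are assumptions chosen for illustration.

```python
from fractions import Fraction as F

def distribution_among_observed(p_v, p_missing_given_v):
    """Return P(V | M_V = false) from P(V) and P(M_V = true | V)."""
    observed_mass = {v: p * (1 - p_missing_given_v[v]) for v, p in p_v.items()}
    total = sum(observed_mass.values())
    return {v: mass / total for v, mass in observed_mass.items()}

p_v = {"well": F(1, 2), "sick": F(1, 2)}

# MCAR: the chance of going missing is the same for every value of V.
mcar = distribution_among_observed(p_v, {"well": F(1, 5), "sick": F(1, 5)})

# Not MCAR: only sick cases go missing.
dependent = distribution_among_observed(p_v, {"well": F(0), "sick": F(1, 2)})

print(mcar == p_v)        # True
print(dependent["well"])  # 2/3
```

When missingness is independent of $V$, the non-missing examples have the same distribution as the whole population; when it depends on $V$, dropping the missing examples overestimates the proportion who are well.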
A weaker assumption is that a variable $Y$ is missing at random (MAR), which occurs when $Y$ is independent of $M\mathrm{\_}Y$ given the observed variables ${V}_{o}$. That is, when $P(Y\mid {V}_{o},M\mathrm{\_}Y)=P(Y\mid {V}_{o})$. This occurs when the reason the data is missing can be observed. The distribution over $Y$ and the observed variables is recoverable by $P(Y,{V}_{o})=P(Y\mid {V}_{o},M\mathrm{\_}Y=false)P({V}_{o})$. Thus, the non-missing data is used to estimate $P(Y\mid {V}_{o})$ and all of the data is used to estimate $P({V}_{o})$.
Suppose you have a dataset of education and income, where the income values are often missing, and have modeled that income depends on education. You want to learn the joint probability of $Income$ and $Education$.
If income is missing completely at random, shown in Figure 11.6(a), the missing data can be ignored when learning the probabilities:
$$P(Income,Education)=P(Incom{e}^{\ast},Education\mid M\mathrm{\_}Income=false)$$
since $M\mathrm{\_}Income$ is independent of $Income$ and $Education$.
If income is missing at random, shown in Figure 11.6(b), the missing data cannot be ignored when learning the probabilities; however, the joint distribution can still be recovered:
$$\begin{aligned}
P(Income,Education) &= P(Income\mid Education)\ast P(Education)\\
&= P(Income\mid Education\wedge M\mathrm{\_}Income=false)\ast P(Education)\\
&= P(Incom{e}^{\ast}\mid Education\wedge M\mathrm{\_}Income=false)\ast P(Education).
\end{aligned}$$
Both of these can be estimated from the data. The first probability can ignore the examples with $Income$ missing, and the second cannot.
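A minimal sketch of this estimate on a tiny made-up dataset follows. It assumes $Income$ is missing at random given $Education$; the data, the value names, and the functions `mar_joint` and `complete_case_joint` are hypothetical. Missing incomes are represented by `None`.

```python
from collections import Counter
from fractions import Fraction

def mar_joint(rows):
    """Estimate P(Income, Education), assuming Income is MAR given Education."""
    n = len(rows)
    edu_all = Counter(e for e, _ in rows)            # all rows, for P(Education)
    complete = [(e, i) for e, i in rows if i is not None]
    edu_complete = Counter(e for e, _ in complete)   # rows with Income observed
    pair = Counter(complete)
    # P(Income | Education, M_Income = false) * P(Education)
    return {(i, e): Fraction(c, edu_complete[e]) * Fraction(edu_all[e], n)
            for (e, i), c in pair.items()}

def complete_case_joint(rows):
    """Estimate that drops rows with missing Income (justified only under MCAR)."""
    complete = [(e, i) for e, i in rows if i is not None]
    pair = Counter(complete)
    return {(i, e): Fraction(c, len(complete)) for (e, i), c in pair.items()}

# Made-up (Education, Income) rows; income is missing only for "uni" rows here.
rows = ([("hs", "low")] * 3 + [("hs", "high")]
        + [("uni", "low"), ("uni", "high")] + [("uni", None)] * 2)

mar = mar_joint(rows)
cc = complete_case_joint(rows)
print(mar[("high", "uni")], cc[("high", "uni")])  # 1/4 1/6
```

In this toy data half of the rows have `uni` education, but only a third of the rows with income observed do, so dropping incomplete rows underweights that group. An education value with no observed incomes at all would simply be absent from the `mar_joint` result, since $P(Income\mid Education)$ cannot be estimated for it.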
If $Income$ is missing not at random, as shown in Figure 11.6(c), which is similar to Figure 11.5, the probability $P(Income,Education)$ cannot be learned from data, because there is no way to determine whether those who don’t report income are those with very high income or very low income. While algorithms like EM converge, what they learn is fiction, converging to one of the many possible hypotheses about how the data could be missing.
The main points to remember are:
You cannot learn from missing data without making modeling assumptions.
Some distributions are not recoverable from missing data, and some are. It depends on the independence structure of the underlying graph.
If the distribution is not recoverable, a learning algorithm still may be able to learn parameters, but the resulting distributions should not be trusted.