8.5 Sequential Probability Models


8.5.1 Markov Chains

A Markov chain is a belief network with random variables in a sequence, where each variable only directly depends on its predecessor in the sequence. Markov chains are used to represent sequences of values, such as the sequence of states in a dynamic system or the sequence of words in a sentence. Each point in the sequence is called a stage.

Figure 8.11: A Markov chain as a belief network

Figure 8.11 shows a generic Markov chain as a belief network. The network shown has five stages ($S_0$ to $S_4$), but it does not have to stop at stage 4; it can extend indefinitely. The belief network conveys the independence assumption

$$P(S_{i+1} \mid S_0, \ldots, S_i) = P(S_{i+1} \mid S_i),$$

which is called the Markov assumption.

Often the sequences are in time, and $S_t$ represents the state at time $t$. Intuitively, $S_t$ conveys all of the information about the history that could affect the future states. The independence assumption of the Markov chain can be seen as saying “the future is conditionally independent of the past given the present.”

A Markov chain is a stationary model or time-homogeneous model if the variables all have the same domain and the transition probabilities are the same for each stage, i.e.,

$$\text{for all } i \ge 0, \quad P(S_{i+1} \mid S_i) = P(S_1 \mid S_0).$$

To specify a stationary Markov chain, two conditional probabilities are provided:

  • $P(S_0)$ specifies the initial conditions

  • $P(S_{i+1} \mid S_i)$ specifies the dynamics, which is the same for each $i \ge 0$.

Stationary Markov chains are of interest for the following reasons:

  • They provide a simple model that is easy to specify.

  • The assumption of stationarity is often the natural model, because the dynamics of the world typically does not change in time. If the dynamics does change in time, it is usually because of some other feature that could also be modeled.

  • The network extends indefinitely. Specifying a small number of parameters gives an infinite network. You can ask queries or make observations about any arbitrary points in the future or the past.
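To make the two-parameter specification concrete, here is a minimal sketch (a hypothetical example, not code from the book) that represents $P(S_0)$ as a vector and $P(S_{i+1} \mid S_i)$ as a row-stochastic matrix over an assumed three-value domain, and answers a query about an arbitrary stage by summing out the preceding variables:

    import numpy as np

    # Hypothetical three-state chain (an assumed example, not from the book).
    states = ["sunny", "cloudy", "rainy"]          # the domain of each S_i

    # P(S_0): the initial distribution over the states.
    p0 = np.array([0.6, 0.3, 0.1])

    # P(S_{i+1} | S_i): row k is the distribution of the next state when the
    # current state is states[k]; each row sums to 1.
    transition = np.array([[0.7, 0.2, 0.1],
                           [0.3, 0.4, 0.3],
                           [0.2, 0.4, 0.4]])

    def state_distribution(i):
        """Return P(S_i) by summing out S_0, ..., S_{i-1} in order."""
        dist = p0
        for _ in range(i):
            # P(S_{j+1}=s) = sum_{s'} P(S_{j+1}=s | S_j=s') * P(S_j=s')
            dist = dist @ transition
        return dist

    print(state_distribution(10))   # the distribution over S_10

The same two arrays define the chain at every stage, which is what makes queries about arbitrarily distant stages possible with so few parameters.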

Pagerank

Google’s initial search engine [Brin and Page, 1998] was based on Pagerank. Pagerank [Page et al., 1999] is a probability measure over web pages where the most influential web pages have the highest probability. It is based on a Markov chain of a random web surfer who starts on a random page, and with some probability d picks a random page that is linked from the current page, and otherwise (if the current page has no outgoing links or with probability 1-d) picks a page at random. The Markov chain is defined as follows:

  • The domain of Si is the set of all web pages.

  • $P(S_0)$ is the uniform distribution over web pages: $P(S_0=p_j)=1/N$ for each web page $p_j$, where $N$ is the number of web pages.

  • The transition is defined as follows:

    $$P(S_{i+1}=p_j \mid S_i=p_k) = \frac{1-d}{N} + d \cdot \begin{cases} 1/n_k & \text{if } p_k \text{ links to } p_j \\ 1/N & \text{if } p_k \text{ has no links} \\ 0 & \text{otherwise} \end{cases}$$

    where there are $N$ web pages and there are $n_k$ links on page $p_k$. The way to think about this is that $p_k$ is the current web page and $p_j$ is the next web page. If $p_k$ has no outgoing links, then $p_j$ is a page chosen at random, which is the effect of the middle case. If $p_k$ has outgoing links, then with probability $d$ the surfer picks a random page linked from $p_k$, and otherwise picks a page at random.

  • $d \approx 0.85$ is the probability someone picks a link on the current page.

This Markov chain converges to a distribution over web pages. Page et al. [1999] reported the search engine had converged to “a reasonable tolerance” for i=52 with 322 million links.
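The convergence described above can be reproduced on a toy scale. The following sketch (a hypothetical four-page web, not Google's implementation) builds the transition matrix from the definition above and iterates the chain from the uniform distribution until the change falls below a small tolerance:

    import numpy as np

    # Hypothetical toy web of N = 4 pages; links[k] lists the pages that
    # page k links to (page 3 has no outgoing links).
    links = {0: [1, 2], 1: [2], 2: [0], 3: []}
    N = 4
    d = 0.85

    # Build the transition matrix row by row from the definition above.
    P = np.zeros((N, N))
    for k in range(N):
        for j in range(N):
            if links[k]:
                follow = 1 / len(links[k]) if j in links[k] else 0.0
            else:
                follow = 1 / N                 # dangling page: jump anywhere
            P[k, j] = (1 - d) / N + d * follow

    # Start from the uniform P(S_0) and iterate until the distribution stops
    # changing (a small tolerance stands in for "a reasonable tolerance").
    rank = np.full(N, 1 / N)
    for _ in range(1000):
        new_rank = rank @ P
        if np.abs(new_rank - rank).sum() < 1e-10:
            break
        rank = new_rank

    print(rank)   # approximate Pagerank of each page

The vector it settles on approximates the stationary distribution of this chain, a notion defined formally later in this section.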

Pagerank provides a measure of influence. To get a high Pagerank, a web page should be linked from other pages with a high Pagerank. It is difficult, but not impossible, to manipulate Pagerank for selfish reasons. One could try to boost the Pagerank of a specific page artificially by creating many pages that point to it, but it is difficult for those referring pages to also have a high Pagerank.

In the initial reported version, Brin and Page [1998] used 24 million web pages and 76 million links. The web is more complex now with many pages being dynamically generated, and search engines use much more sophisticated algorithms.

To determine the probability distribution of state $S_i$, variable elimination can be used to sum out the preceding variables. The variables after $S_i$ are irrelevant to the probability of $S_i$ and need not be considered. To compute $P(S_i \mid S_k)$, where $i>k$, only the variables between $S_i$ and $S_k$ need to be considered, and if $i<k$, only the variables less than $k$ need to be considered.
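Because of the Markov assumption, computing $P(S_i \mid S_k)$ for $i > k$ reduces to propagating the dynamics forward $i-k$ steps from a point distribution at stage $k$. A minimal sketch, reusing the hypothetical three-state chain from the earlier example:

    import numpy as np

    # The hypothetical three-state chain from the sketch above.
    states = ["sunny", "cloudy", "rainy"]
    transition = np.array([[0.7, 0.2, 0.1],
                           [0.3, 0.4, 0.3],
                           [0.2, 0.4, 0.4]])

    def conditional_distribution(i, k, s):
        """P(S_i | S_k = s) for i > k: propagate the dynamics i - k steps."""
        dist = np.zeros(len(states))
        dist[states.index(s)] = 1.0        # the observation pins down S_k
        for _ in range(i - k):
            dist = dist @ transition       # sums out one intermediate stage
        return dist

    print(conditional_distribution(5, 2, "rainy"))   # P(S_5 | S_2 = rainy)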

A stationary distribution of a Markov chain is a distribution over the states such that, if it holds at one time, it also holds at the next time. Thus $P$ is a stationary distribution if, for each state $s$, $P(S_{i+1}=s)=P(S_i=s)$; that is,

$$P(S_i=s) = \sum_{s'} P(S_{i+1}=s \mid S_i=s') \cdot P(S_i=s').$$
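As a hypothetical worked example (not from the text), consider a chain with two states $a$ and $b$, where $P(S_{i+1}{=}a \mid S_i{=}a)=0.7$ and $P(S_{i+1}{=}a \mid S_i{=}b)=0.4$. Writing $x$ for the stationary probability of $a$, the equation above becomes

$$x = 0.7\,x + 0.4\,(1-x) \;\Longrightarrow\; 0.7\,x = 0.4 \;\Longrightarrow\; x = 4/7,$$

so the stationary distribution assigns probability $4/7$ to $a$ and $3/7$ to $b$.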

A Markov chain is ergodic if, for any two states $s_1$ and $s_2$ in the domain of $S_i$, there is a non-zero probability of eventually reaching $s_2$ from $s_1$. A Markov chain is periodic with period $p$ if the difference between the times when it visits the same state is always divisible by $p$. For example, consider the Markov chain whose states are the integers 0 to 9, and which at each time either adds 1 or adds 9 (modulo 10), each with probability 0.5. This Markov chain is periodic with period 2: if it starts in an even state at time 0, it will be in an even state at even times and in an odd state at odd times. If the only period of a Markov chain is 1, then the Markov chain is aperiodic.

If a Markov chain is ergodic and aperiodic, then there is a unique stationary distribution, and this is the equilibrium distribution that will be approached from any starting state. Thus for any distribution over S0, the distribution over Si will get closer and closer to the equilibrium distribution, as i gets larger.
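This convergence is easy to observe numerically. A minimal sketch, again using the hypothetical three-state chain from above (which is ergodic and aperiodic): starting from two very different distributions over $S_0$, repeated application of the dynamics drives both to the same equilibrium distribution.

    import numpy as np

    # The hypothetical three-state chain from above; it is ergodic and
    # aperiodic, so it has a unique equilibrium distribution.
    transition = np.array([[0.7, 0.2, 0.1],
                           [0.3, 0.4, 0.3],
                           [0.2, 0.4, 0.4]])

    def run(start, steps=200):
        dist = np.array(start, dtype=float)
        for _ in range(steps):
            dist = dist @ transition
        return dist

    # Two very different initial distributions over S_0 ...
    print(run([1.0, 0.0, 0.0]))   # start certain of the first state
    print(run([0.0, 0.0, 1.0]))   # start certain of the last state
    # ... yield (numerically) the same equilibrium distribution.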