When do Markov chains converge?

Foreword: This post was written for SoME4. The goal of this post is to present some standard Markov chain concepts, but make them genuinely intuitive through interactive visualizations that let you explore the ideas hands-on. This post is written for undergraduates who have some probability background. You'll get the most out of this if you're familiar with basic Markov chain concepts (states, transition matrices, etc.), but even if you're not, the interactive elements should help fill in gaps, and a quick Google search or ChatGPT query can clarify anything unfamiliar. One important note: please read this on a computer or laptop since some of the animations won't work properly on mobile devices.

At the start of the 20th century, it was commonly believed that the strong law of large numbers only applied to systems with independent events (think repeatedly flipping a coin or rolling a die). However, Russian mathematician Andrey Markov believed that even in systems where events influence each other, the system would converge to predictable averages over time, so long as the system satisfied certain criteria.^[1]

For those unfamiliar with the strong law of large numbers (SLLN), here is a simple example of what it tells us: if you flip a fair coin repeatedly, the fraction of flips that land on heads will eventually converge to $1/2$, the probability of getting heads for an arbitrary flip. It relates the limiting fraction of times an event occurs to the probability of it happening for an arbitrary independent trial. Pretty intuitive, right?

Now consider the following somewhat strange variation: the first time you flip the coin there is a 100% chance of heads and a 0% chance of tails. For every flip thereafter, there is a 90% chance that the coin lands on the same face as the previous flip. This coin-flipping process is represented by the Markov chain below: the black dot starts at heads and each time you advance the chain by pressing the arrow button, there is a 90% chance the dot stays in the same state and a 10% chance it switches states.

Since each coin flip now depends on the previous one, the SLLN does not apply. However, Markov believed that, even without independence, some analogue of the SLLN should still hold: that is, the fraction of times each event occurs will eventually converge to a predictable quantity. Is this true for the above Markov chain? Observe how the fraction of heads and tails evolves as the above chain advances (this is the percentage shown under each state). Do they seem like they are converging to something? If so, do you think they will always converge to these quantities, regardless of how many times the Markov chain is restarted?

An independence-free law of large numbers

Markov eventually succeeded and proved what is now sometimes referred to as the Fundamental Theorem of Markov Chains (FTMC). When using the SLLN, we had the fraction of heads converge to the probability of heads for an arbitrary coin flip. However, there is no longer a clear definition for an arbitrary coin flip: the heads probability can change between flips. To resolve this, we will instead consider the heads probability of a coin flip very far in the future. To measure this, we can imagine doing the following:

Flip the coin a large number of times, say 1000 (i.e., advance the above Markov chain 999 steps).
Record whether the 1000th flip is heads or tails.
Restart and repeat a large number of times.

The result of this process is the distribution of the 1000th coin flip. We call such a distribution an ensemble-average distribution since it is an average over many sample paths (many runs of the Markov chain). This is distinct from a time-average distribution which is the fraction of times an event occurs over a single sample path (a single run of the Markov chain). If we wanted to instead measure the time-average distribution up to the 1000th coin flip (for a single sample path), we would do the following:

Flip the coin 1000 times.
Record the number of heads and tails for each of the 1000 flips.
Stop.

The FTMC tells us that the time-average distribution of the first $n$ coin flips will converge to the ensemble-average distribution of the $n$-th coin flip as $n$ gets larger and larger. If each coin flip was independent, then the ensemble-average distribution of the $n$-th coin flip is identical to that of an arbitrary flip, and we get back the SLLN. Thus, the FTMC can be viewed as a generalization of the SLLN to certain settings where the events are not independent.

To see this in action, we can simulate the ensemble-average distribution after each flip. The Markov chain below is identical to the previous one, except that now the percentages show the ensemble-average distribution after $n$ flips (whereas they showed the time-average distribution in the previous one). In this case, the small black dots are only a visual aid to remind us that the ensemble-average distribution comes from averaging over many sample paths, and to help us better understand how the distribution evolves over time.

Does it seem like the ensemble-average distribution in this simulation and the time-average distribution in the previous simulation are converging? In this case, the answer should be yes (if it doesn't look like it, let them both run for longer). Systems where time and ensemble-average distributions converge are called ergodic.^[2] This is a desirable property since it means that we can learn the statistical properties of a system by observing any single (long) trajectory, rather than needing to find a way to restart the system over and over, which can be hard or impossible in the real world.

In this post, we will see that a Markov chain is ergodic if:

a unique limiting ensemble-average distribution exists, and
a unique limiting time-average distribution exists.

Here, by unique, we mean that the limiting distribution does not depend on the initial state, nor on the particular sample path. However, it is not obvious by looking at a Markov chain whether either of these conditions holds. This is where the FTMC comes in. The FTMC provides us two structural properties and tells us that if the Markov chain satisfies both these properties, then both the above conditions hold and the Markov chain is ergodic.

The purpose of the remainder of this post is to understand what these two properties are and why they are sufficient for the Markov chain to be ergodic. In particular, we will explore the following questions:

For which Markov chains does a unique limiting ensemble-average distribution exist?
For which Markov chains does a unique limiting time-average distribution exist?
If both distributions exist, why must they be the same?

To keep things simple, we will restrict ourselves to considering finite Markov chains.

One more example of an ergodic Markov chain

Before jumping into answering these questions, let's look at one more, slightly more complicated Markov chain. While this example won't necessarily teach us anything new, it can be helpful to see more than one example, especially if the first example is very simple (e.g. a two-state Markov chain).

For this example, imagine the following scenario. There is a tiny cafe and we are trying to model the number of people waiting in line. At this cafe:

Each minute, the customer at the front of the line has a 50% chance of being served and leaving.
Each minute, there's a 50% chance that a new customer arrives at the cafe.
New customers wait until the end of the minute and then join the line if there's space; otherwise, they leave.

Since this cafe is tiny, there is only space for 5 customers in the line. We can model this with the following Markov chain.

In this case we are assuming that the cafe is initially full (5 customers). If you let this run for long enough, you will see the time-average distribution (very slowly) converge. Will the time-average distribution converge to the same thing if the cafe is initially empty? You can test this out by pressing on state "0" after resetting the chain and then letting it run again.

Like in the coin flip example, we can also simulate the ensemble-average distribution.

You can experiment with how the ensemble-average distribution changes as you change the initial state. You should observe that both the time and ensemble-average distributions for this chain do not depend on the initial state, and that the distributions are eventually roughly the same.

Part 1: When does a unique limiting ensemble-average distribution exist?

To answer this question, we will:

Look for counterexamples: Markov chains where either a limiting ensemble-average distribution does not exist, or where it exists but is not unique.
Show that the counterexamples we found are truly the only barriers to convergence. We do this by proving convergence for all finite Markov chains that aren't one of these counterexamples.

Properties preventing existence or uniqueness (ensemble-average)

Can we construct a Markov chain whose limiting ensemble-average distribution either doesn't exist or isn't unique? If so, which properties of the chain are responsible for preventing existence or uniqueness?

I could give you the answers directly, but you'll learn more by exploring on your own first! Press the "Try it out!" button below to construct your own Markov chains and test whether their limiting ensemble-average distributions exist and are unique. Try to construct one chain where the limiting distribution doesn't exist, and one where it exists but isn't unique. Once you've given it a try, scroll down for the answers.

Now that you've explored a bit, let's talk about it. It turns out that there are actually only two things that can prevent a unique limiting ensemble-average distribution from existing in a finite Markov chain:

Multiple trapping regions: The Markov chain has multiple subsets of the state space where it can "get trapped". Depending on the starting state, it may end up trapped in different subsets, so the limiting distribution isn't unique.
Infinite cycling: The Markov chain cycles endlessly between a handful of different distributions, so there is no limiting distribution.

Formally, we refer to case (1) as a Markov chain with multiple recurrent classes (where a recurrent class is the formal name for a trapping subset), and we refer to case (2) as a periodic Markov chain. Let's fully define each of these terms and then look at some examples.

Markov chains with multiple recurrent classes

To fully explain what a recurrent class is, we need a little more terminology:

We say that two states of a Markov chain, say $i$ and $j$, communicate if it is possible for the chain to eventually reach state $i$ starting at state $j$ and vice versa.
A communicating class is a set of states that all communicate with each other, and where every other state in the Markov chain doesn't communicate with any of them.

We separate communicating classes into two types:

A transient class is a communicating class that the Markov chain can (and will) eventually leave forever.
A recurrent class is a communicating class that, once the Markov chain enters it, it is stuck inside forever.

Don't worry if these definitions are a little confusing right now; the best way to understand these ideas is through examples! In each of the following examples, the Markov chain has multiple recurrent classes. Observe the effect this has: even though limiting ensemble-average distributions exist, they depend on the starting state.

Example 1: In this example, the Markov chain has two recurrent classes: $\{\text{R1}\}$ and $\{\text{R2}\}$. The states R1 and R2 don't communicate with each other, and if the Markov chain is at one of these states, it stays there forever.

Example 2: This is similar to the above example, but the Markov chain now also has a transient class, $\{\text{T1}\}$. T1 doesn't communicate with either of R1 or R2 since, while the chain can go from T1 to either state, it cannot go back: once the Markov chain leaves T1, it never returns. To see that the limiting ensemble-average distribution depends on the starting state, click on one of R1 or R2 to set a different starting state.

Example 3: This chain has one transient class, $\{\text{T1-1}, \text{T1-2}, \text{T1-3}\}$, and three recurrent classes: $\{\text{R1-1}\}$, $\{\text{R2-1}, \text{R2-2}\}$, and $\{\text{R3-1}, \text{R3-2}, \text{R3-3}\}$.

Periodic Markov chains

Throughout this post, we're going to use notation like $\left(X_n\right)_{n \geq 0}$ to represent Markov chains. That is $X_0$ represents the starting state of the Markov chain, $X_1$ is the state after 1 step, and so on for all $n \geq 0$. We mention this here since we're about to need it :).

We'll say that a recurrent class of a Markov chain is aperiodic if there exists some $N \geq 0$ such that $$\mathbf P[X_N = j ~|~ X_0 = i] > 0$$ for every pair of states $i$ and $j$ in the recurrent class. Intuitively, this means that the chain doesn't get trapped in rigid cycles: after $N$ steps, you have at least some probability of being at any state in the recurrent class, regardless of where in the recurrent class you started. We say that a recurrent class is periodic if it is not aperiodic.

We say a Markov chain is periodic if any of its recurrent classes are periodic, and we say it is aperiodic otherwise.

Here are some examples of periodic Markov chains. Notice how, in each of these examples, a limiting ensemble-average distribution does not exist.

Example 1: This chain has a single recurrent class that is periodic: after an even number of steps, the chain will always be at R1-1, and after an odd number of steps, the chain will always be at R1-2.

Example 2: This chain has three transient classes: $\{\text{T1-1}\}$, $\{\text{T2-1}, \text{T2-2}\}$, and $\{\text{T3-1}, \text{T3-2}, \text{T3-3}\}$. The single recurrent class, $\{\text{R1-1}, \text{R1-2}, \text{R1-3}\}$ is periodic.

Example 3: This chain has one transient class, $\{\text{T1-1}, \text{T1-2}\}$, and two recurrent classes: $\{\text{R2-1}, \text{R2-2}, \text{R2-3}\}$ and $\{\text{R3-1}, \text{R3-2}\}$. Both the recurrent classes are periodic.

We have seen two different properties of a finite Markov chain that can prevent a unique limiting ensemble-average distribution from existing: periodicity and multiple recurrent classes. In the next section, we will show that these are the only obstacles to convergence.

Proving convergence for finite, aperiodic Markov chains with a single recurrent class (ensemble-average)

We will now prove that if a finite Markov chain is aperiodic and has only one recurrent class, then its ensemble-average distribution converges to a unique limit. This will confirm that periodicity and multiple recurrent classes are indeed the only barriers to convergence for finite Markov chains.

One thing you may have noticed from exploring the above simulations is that, when a limiting distribution does exist, it is always "stable" in the sense that advancing the Markov chain by one more step does not change the distribution. Intuitively this makes sense: if the distribution changed after one more step, then it couldn't be the limiting distribution, which results from taking infinitely many steps. Distributions with this property are known as stationary distributions. Formally, a distribution $\mathbf x$ is stationary if $\mathbf x = \mathbf x P$, where $P$ is the Markov chain's transition matrix.

Lemma 1

If a limiting ensemble-average distribution exists, it is stationary.

Let $\mathbf x_0$ denote the initial distribution of the Markov chain and let $P$ be the Markov chain's transition matrix. Then $\mathbf x_n = \mathbf x_{n-1}P$ is the distribution after advancing $n$ steps, and $\mathbf x = \lim_{n \to \infty} \mathbf x_n$ is the limiting ensemble-average distribution, which we are assuming exists.

Observe that, $$ \mathbf x P = \lim_{n \to \infty} \left(\mathbf x_n \right) P = \lim_{n \to \infty} \left(\mathbf x_n P \right) = \lim_{n \to \infty} \left(\mathbf x_{n+1} \right) = \mathbf x,$$ which means that $\mathbf x$ is stationary. The key step is the second equality, which follows from the continuity of matrix multiplication (which allows us to move the $P$ into the limit).

With this lemma in hand, our proof strategy becomes clearer. Rather than just showing that the ensemble-average distribution converges to something, we now know we should show convergence to a stationary distribution. The proof of this is very elegant and employs a clever technique.

Theorem 1

Let $\left(X_n\right)_{n \geq 0}$ be a finite, aperiodic Markov chain with a single recurrent class and arbitrary initial distribution. Suppose that this Markov chain has a stationary distribution $\pi$. Then, $$\mathbf P[X_n = j] \to \pi(j) \quad \text{as $n \to \infty$ for all states $j$}.$$ That is, the Markov chain's ensemble-average distribution will converge to the stationary distribution independent of the initial distribution.

The key step is to cleverly construct multiple copies of the same Markov chain that depend on each other in a useful way.^[3]

Start with two independent copies of the Markov chain:

$\left(X_n\right)_{n\geq 0}$ with arbitrary initial distribution.
$\color[rgb]{0.215,0.698,0.302} \left(Y_n\right)_{n\geq 0}$ with stationary initial distribution $\pi$.

Note that since $\pi$ is stationary, $\color[rgb]{0.215,0.698,0.302}Y_n$ has distribution $\pi$ for all $n \geq 0$.

Now comes the clever part. We construct a third chain $\color[rgb]{1,0.5725,0.1686}\left(Z_n\right)_{n\geq 0}$ that starts by following $X_n$ but switches to following $\color[rgb]{0.215,0.698,0.302}Y_n$ the first time they all meet. Specifically, let $T$ be the first time $X_n = {\color[rgb]{0.215,0.698,0.302}Y_n}$. Define $$ {\color[rgb]{1,0.5725,0.1686}Z_n} = \begin{cases} X_n & n < T \cr {\color[rgb]{0.215,0.698,0.302}Y_n} & n \geq T \end{cases}. $$ The simulation below illustrates this construction. The black dot represents $X_n$, the green dot represents $\color[rgb]{0.215,0.698,0.302}Y_n$, and the orange dot represents $\color[rgb]{1,0.5725,0.1686}Z_n$. You will see that the orange dot follows the black dot until all three meet, and then switches to following the green dot. You can press "Sample $Y_n$ from stationary" to randomly set $\color[rgb]{0.215,0.698,0.302}Y_n$'s initial state according to the stationary distribution, or you can shift-click any state to manually set where $\color[rgb]{0.215,0.698,0.302}Y_n$ starts.

What's special about $\color[rgb]{1,0.5725,0.1686}Z_n$?

First, if we know that $X_n$ and $\color[rgb]{0.215,0.698,0.302}Y_n$ meet eventually, then $\color[rgb]{1,0.5725,0.1686}Z_n$'s distribution will converge to the stationary distribution. This is because for large $n$, it becomes increasingly likely that $X_n$ and $\color[rgb]{0.215,0.698,0.302}Y_n$ have already met, which would mean that $\color[rgb]{1,0.5725,0.1686}Z_n$ is at the same state as $\color[rgb]{0.215,0.698,0.302}Y_n$ (which has the stationary distribution).

More rigorously, if $\mathbf P[n < T] \to 0$, then $$ \mathbf P[{\color[rgb]{1,0.5725,0.1686}Z_n} = j] = \mathbf P[{\color[rgb]{0.215,0.698,0.302}Y_n} = j \text{ and } n \geq T] + \mathbf P[X_n = j \text{ and } n < T] \sim \mathbf P[{\color[rgb]{0.215,0.698,0.302}Y_n} = j] = \pi(j). $$ (If you are unfamiliar with the $\sim$ notation, it means the expressions converge as $n \to \infty$.^[4])

It can be tempting to say that in a finite Markov chain, it's obvious that $X_n$ and $\color[rgb]{0.215,0.698,0.302}Y_n$ meet eventually, but this is not always true. Let's see what can go wrong by revisiting our earlier examples of chains where the ensemble-average distribution does not converge.

Example 1: If the chain is periodic, the $X_n$ and $\color[rgb]{0.215,0.698,0.302}Y_n$ may cycle forever without meeting. In this example, $\mathbf P[n < T] = 1$ for all $n \geq 0$ if $X_0 \neq {\color[rgb]{0.215,0.698,0.302}Y_0}$.

Example 2: If the chain has multiple recurrent classes, then $X_n$ and $\color[rgb]{0.215,0.698,0.302}Y_n$ may end up in different recurrent classes and never meet. In this example, $\mathbf P[n < T] = 0.5$ for all $n \geq 1$. (Note that since $\mathbf P[n < T] > 0$, the two chains do sometimes meet, so you may need to reset a couple times to see an example where they don't).

However, $X_n$ and $\color[rgb]{0.215,0.698,0.302}Y_n$ will meet eventually for any finite, aperiodic Markov chain with a single recurrent class. You can see an example of this in the simulation below, which shows the probability that $\color[rgb]{1,0.5725,0.1686}Z_n$ is already following $\color[rgb]{0.215,0.698,0.302}Y_n$ at each step. The orange probability below the state with the green dot represents $\mathbf P[n \geq T]$, and you will see that this approaches $1$ as time progresses.

The fact that aperiodicity and a single recurrent class are sufficient to ensure that $X_n$ and $\color[rgb]{0.215,0.698,0.302}Y_n$ eventually meet is intuitive:

Both chains must end up forever moving around the same set of states (the recurrent class).
They can't get stuck in rigid cycles within that set and therefore permanently avoid each other.

For completeness, we'll prove this rigorously later (Lemma 2), but we'll take this for granted for now.

So to summarize, so far we've shown that $\color[rgb]{1,0.5725,0.1686}Z_n$'s ensemble-average distribution will converge to $\pi$, the stationary distribution. How does this help us prove the same thing for $X_n$?

The key observation is that $X_n$ and $\color[rgb]{1,0.5725,0.1686}Z_n$ have identical distributions at every time step, even after $\color[rgb]{1,0.5725,0.1686}Z_n$ starts following $\color[rgb]{0.215,0.698,0.302}Y_n$. This is because they are both copies of the same Markov chain starting from the same initial distribution (since $\color[rgb]{1,0.5725,0.1686}Z_n$ initially follows $X_n$).

Many people find this initially confusing: $\color[rgb]{1,0.5725,0.1686}Z_n$ eventually follows $\color[rgb]{0.215,0.698,0.302}Y_n$, so how can it still be distributed like $X_n$? What's important to remember is that the distribution of $\color[rgb]{1,0.5725,0.1686}Z_n$ depends only on the transition probabilities between states (and the distribution of $\color[rgb]{1,0.5725,0.1686}Z_0$), not on where it goes for any individual sample path. Since $X_n$ and $\color[rgb]{1,0.5725,0.1686}Z_n$ start at the same place, and then move according to the same transition probabilities even after $\color[rgb]{1,0.5725,0.1686}Z_n$ starts following $\color[rgb]{0.215,0.698,0.302}Y_n$, their distributions will be the same. Here is one last simulation of the coupled system that tries to make this visually clear. It initially shows only a single sample path, but it switches to showing how the probability distributions of $X_n$, $\color[rgb]{0.215,0.698,0.302}Y_n$, and $\color[rgb]{1,0.5725,0.1686}Z_n$ evolve after the first time they meet.

The important thing to observe here is that even though $\color[rgb]{1,0.5725,0.1686}Z_n$ switches to following $\color[rgb]{0.215,0.698,0.302}Y_n$ after they all meet, $\color[rgb]{1,0.5725,0.1686}Z_n$'s distribution continues to evolve in the same way that $X_n$'s does.

Therefore, for all states $j$ and $n \geq 0$, $$ \mathbf P[X_n = j] = \mathbf P[{\color[rgb]{1,0.5725,0.1686}Z_n} = j] \to \pi(j), $$ which completes the proof of Theorem 1.

Note that an immediate corollary of Theorem 1 is that, if a finite, aperiodic Markov chain with a single recurrent class has a stationary distribution $\pi$, that must be the unique stationary distribution. This is because, if there were two stationary distributions $\pi$ and $\pi'$, the theorem tells us that the ensemble-average distribution of $X_n$ converges to both $\pi$ and $\pi'$, which is clearly impossible unless $\pi = \pi'$.

A careful reader may have noticed an important gap in our results: we have not yet proven that a stationary distribution must exist. We have only proven that if such a distribution exists, the ensemble-average distribution will converge to it. This gap will be resolved in the next section where we consider limiting time-average distributions.

We conclude this section by proving the lemma we skipped in the above proof. This proof is included for completeness, but can be safely skipped—the intuition was already covered in the proof of Theorem 1.

Lemma 2

If the Markov chain is finite, aperiodic, and has a single recurrent class, then $$\mathbf P[n < T] \to 0.$$ That is, $X_n$ and $\color[rgb]{0.215,0.698,0.302}Y_n$ eventually meet with probability $1$.

Since the chain is finite and has a single recurrent class, with probability $1$, $X_n$ and $\color[rgb]{0.215,0.698,0.302}Y_n$ will eventually both be in the recurrent class. So for simplicity, let's just assume that they both start in the recurrent class.

Since the chain is aperiodic, there is some $N \geq 0$ such that $P^N_{ij} > 0$ for every pair of states $i$ and $j$ in the recurrent class (where $P^N_{ij}$ is the $ij$-th entry of the transition matrix raised to the $N$-th power).

Thus, if $X_n = i$ and ${\color[rgb]{0.215,0.698,0.302}Y_n} = j$, the probability that they are at the same state after $N$ steps is, $$\mathbf P[X_{n+N} = Y_{n + N}] = \sum_{k} P^N_{ik} \cdot P^N_{jk} > 0,$$ where the positivity comes from the fact that $P^N_{ik}$ and $P^N_{jk}$ must be positive for all states $i$, $j$, and $k$. Since there are only finitely many pairs of states in the recurrent class, this means that, regardless of where they start, $X_n$ and ${\color[rgb]{0.215,0.698,0.302}Y_n}$ will meet with at least probability, $$ p_\text{meet} = \min_{(i,j)} \sum_{k} P^N_{ik} \cdot P^N_{jk} > 0, $$ after $N$ steps. Thus, the probability of them eventually meeting is at least as great as the probability of eventually getting a heads when repeatedly flipping a coin with heads probability $p_\text{meet} > 0$, which is 1.

Part 2: When does a unique limiting time-average distribution exist?

The structure of this section is identical to the previous one. We will first consider counterexamples where no unique limiting time-average distribution exists, and then show these counterexamples are exhaustive.

Properties preventing existence or uniqueness (time-average)

Limiting time-average distributions are a bit simpler than limiting ensemble-average distributions. For ensemble-averages there were two problematic properties:

periodicity,
multiple recurrent classes.

On the other hand, only the latter is a problem for time-average distributions—they will converge even in a periodic Markov chain. This is most easily understood by considering two of our simplest counterexamples from earlier.

Example 1: The periodicity is no longer a barrier to convergence since we are averaging over time.

Example 2: Having multiple recurrent classes is still a problem since the time-average distribution will be completely determined by which recurrent class the chain gets trapped in.

Proving convergence for finite Markov chains with a single recurrent class (time-average)

Earlier when we analyzed ensemble-averages, we first proved that any limiting ensemble-average distribution must be stationary, then showed that such a limit actually exists. For time-averages, we'll take the opposite approach. We'll first prove that the time-average distribution converges to a unique limit in any finite Markov chain with a single recurrent class. Then we'll show that this limit must be stationary.

Just as in the ensemble-average case, the convergence proof is quite elegant (but less tricky this time!). The key insight is that we can decompose any sample path of the Markov chain into an initial transient phase followed by i.i.d. cycles:

Choose any state $j$ in the recurrent class. Since there is only one recurrent class, the Markov chain is guaranteed to always return to $j$ eventually, regardless of where it starts.
We can now break down any sample path into:
- An initial phase: The chain moves from its starting state to its first visit of state $j$.
- A cycling phase: The chain repeatedly goes on excursions that start at state $j$, and end the first time the chain returns to state $j$. Because the future behavior of a Markov chain depends only on its current state, each excursion from $j$ to $j$ is independent and identically distributed.

Why does this mean the time-average distribution converges? In the limit, the initial phase becomes negligible because it always has finite length. So the limiting fraction of time spent at each state will be determined entirely by the cycling phase. Since we have i.i.d. cycles, the strong law of large numbers tells us that the average behavior across cycles will converge to the expected behavior of a single cycle.

To make that last claim precise:

Let $C^j$ be the length of an excursion starting and ending at state $j$, and let $V^j(i)$ be the number of visits to state $i$ during that excursion. (Here, the excursion begins after initially leaving state $j$ so as to not double count.) Let $\big(C^j_1, V^j_1(i)\big), \big(C^j_2, V^j_2(i)\big), \dots$ be i.i.d. copies of $\big(C^j, V^j(i)\big)$. By the SLLN, the following holds with probability $1$: $$\frac{V^j_1(i) + \cdots + V^j_k(i)}{k} \to \mathbf E[V^j(i)] \quad \text{and} \quad \frac{C^j_1 + \cdots + C^j_k}{k} \to \mathbf E[C^j].$$ However, we want the fraction of time spent at state $i$, which is: $$\frac{\text{total visits to $i$}}{\text{total time}} = \frac{V^j_1(i) + V^j_2(i) + V^j_3(i) + \cdots}{C^j_1 + C^j_2 + C^j_3 + \cdots}$$ Intuitively, since $C^j_1 + \cdots + C^j_k \approx k\mathbf E[C^j]$ for large $k$, we get: $$\frac{V^j_1(i) + \cdots + V^j_k(i)}{C^j_1 + \cdots + C^j_k} \approx \frac{V^j_1(i) + \cdots + V^j_k(i)}{k\mathbf E[C^j]} \to \frac{\mathbf E[V^j(i)]}{\mathbf E[C^j]}.$$ This final convergence is rigorously established by the Renewal-Reward theorem, which handles the ratio of two sums that each satisfy the SLLN, even when the corresponding terms in the two sums are dependent (as $C^j_k$ and $V^j_k(i)$ are here).

Putting this all together, we get the following convergence result.

Theorem 2

Let $\left(X_n\right)_{n \geq 0}$ be a finite Markov chain with a single recurrent class and an arbitrary initial distribution. Let $j$ be a state in the recurrent class. Then for all states $i$, $$\lim_{n \to \infty} \frac{1}{n} \sum_{k=0}^{n-1} 1(X_k = i) = \frac{\mathbf E[V^j(i)]}{\mathbf E[C^j]} \quad \textit{with probability $1$}.$$ In particular, the Markov chain's limiting time-average distribution exists and is independent of the initial state.

We make the following observations:

If state $i$ is in a transient class, the theorem tells us that the limiting fraction of time spent at state $i$ is zero, because $\mathbf E[V^j(i)] = 0$ (once in the recurrent class, the chain can never return to a transient state). This aligns with the fact that transient states are only visited finitely many times.
If state $i$ is in the recurrent class, applying the theorem with $j=i$ tells us that the limiting fraction of time spent at state $i$ is $\frac{1}{\mathbf E[C^i]}$ (since there is exactly one visit to $i$ in an excursion from $i$ to $i$), which is one over the average time between returns to $i$. It's worth pausing here to think about why this result makes sense.

Now that we've proven convergence for time-average distributions, we will show that the limiting distribution must be stationary.

Lemma 3

If a limiting time-average distribution exists, it is stationary.

Let $\mathbf x$ be the limiting time-average distribution. We'll show that $\mathbf x = \mathbf x P$ using a straightforward counting argument.

Fix any state $j$. For each state $i$, let $T_n(i,j)$ denote the number of times the Markov chain transitions directly from state $i$ to state $j$ in the first $n$ time steps. Let $S_n(j) = \sum_{k=1}^n \mathbf{1}(X_k = j)$ denote the total number of visits to state $j$ in the first $n$ time steps.

The key observation is that we can count the total visits to state $j$ in two equivalent ways:

Direct count: The number of times we visit state $j$ is $S_n(j)$.
Transition count: Since every visit to $j$ (except possibly the first, if the chain started there) comes from some transition into $j$, we have $S_n(j) = \sum_i T_n(i,j) + \text{(initial visit)}$. To keep notation clean, we'll ignore the possible initial visit since it becomes negligible in the limit ($1/n \to 0$), and simply write $S_n(j) \approx \sum_i T_n(i,j)$.

The below simulation shows this equivalence by tracking both the total number of visits to state $j$ and the number of transitions to state $j$ from each other state.

Now, as $n$ gets large, the number of direct transitions from $i$ to $j$ will converge to exactly a $P_{ij}$ fraction of the number of visits to state $i$ (this can also be seen in the simulation!): $$T_n(i,j) \sim P_{ij} \cdot S_n(i).$$ (Recall that $\sim$ means that the expressions converge as $n \to \infty$.^[4:1])

Thus, using the fact that $\mathbf x(j)$ is the limiting fraction of time at state $j$: $$\mathbf x(j) = \lim_{n \to \infty} \frac{S_n(j)}{n} = \lim_{n \to \infty} \sum_i \frac{T_n(i,j)}{n} = \sum_i \lim_{n \to \infty} \frac{P_{ij} \cdot S_n(i)}{n} = \sum_i P_{ij} \mathbf x(i).$$

Since this holds for any state $j$, we have $\mathbf x(j) = (\mathbf x P)_j$ for all $j$, which means $\mathbf x = \mathbf x P$.

Therefore, the limiting time-average distribution is stationary if it exists.

Part 3: Conclusion

Let's summarize what we've learned. For every finite, aperiodic Markov chain with a single recurrent class:

A stationary distribution exists. (Theorem 2 & Lemma 3)
This stationary distribution is unique. (Theorem 1)
The long-run ensemble-average distribution converges to this stationary distribution. (Theorem 1)
The long-run time-average distribution converges to this stationary distribution. (Theorem 2)
Thus, the limiting time-average and ensemble-average distributions are the same.

These five facts together make up the Fundamental Theorem of Markov Chains. At the start of this post we said that the FTMC can be viewed as generalizing the SLLN to certain cases where there is no independence. We now know what those "certain cases" are any finite, aperiodic Markov chain with a single recurrent class.^[5]

We also understand all the reasons that one of these limits might either not exist or not be unique. This means that we can reason about what might go wrong in a periodic Markov chain, or a Markov chain with multiple recurrent classes. For example, we know that in a periodic Markov chain with a single recurrent class, the time-average distribution has a unique limit, but the ensemble-average distribution does not converge.

Thanks for sticking with me to the end! This is my first ever blog post, so I'd love to hear any feedback, suggestions, or requests.

Last note: terminology used in textbooks and papers

You will often see these results stated for either:

ergodic Markov chains, or
irreducible and aperiodic Markov chains,

rather than for aperiodic Markov chains with a single recurrent class, as we have stated them here. An irreducible Markov chain is one with a single recurrent class and no transient classes (so the entire state space is the single recurrent class), and an ergodic Markov chain is defined as one that is irreducible and aperiodic. So such results are the same as the ones we've stated here, except that they also assume no transient classes (which make no real difference anyways).

If you'd like to learn more about this history, this article is a good read. ↩︎
The topic of ergodicity extends well beyond Markov chains and stochastic processes. If you're interested in learning more, the Wikipedia page is a good place to start. ↩︎
This proof technique is known as coupling and is used throughout probability theory and stochastic processes. ↩︎
Formally, if $f(n) \sim g(n)$, then $\frac{f(n)}{g(n)} \to 1$ as $n \to \infty$. ↩︎ ↩︎
The FTMC also holds for countably infinite Markov chains if the recurrent class is positive recurrent, which means that the expected length of a cycle from a state to itself is finite for every state in the class. This is outside the scope of this blog post but worth learning about! ↩︎