Sequence Model 3: Past Observations as Auxiliary Variables

Deriving identifiability theorems for sequence models using the sufficient variability framework with past observations as auxiliary variables.

Recap

In Sequence Model 1, we introduced the identifiability problem: a model can perfectly reconstruct observations while learning a meaningless internal representation. The rotational ambiguity of VAEs with Gaussian priors exemplifies this challenge.

In Sequence Model 2, we introduced the sufficient variability assumption. The key insight was that if we have access to data from multiple distributions indexed by an auxiliary variable \(\theta\), and if the log-density derivatives change in \(2n+1\) linearly independent directions as \(\theta\) varies, we can achieve component-wise identifiability.

The natural question is: what should we use as the auxiliary variable \(\theta\) in sequence models?

In this post, we show that the answer is remarkably elegant: the history itself. Specifically, the past observations \(\mathbf{x}_{<t}\) serve as a natural auxiliary variable. This insight, formalized in the TDRL framework, allows us to derive identifiability guarantees for temporal data without requiring external domain labels or interventions.

History as Auxiliary Variable

Recall from the sufficient variability framework that we need the distribution \(p(\mathbf{z}; \theta)\) to change across different values of \(\theta\). A necessary requirement on the choice of \(\theta\) is that the latent variables \(z_i\) are conditionally independent given \(\theta\). In temporal settings, we have a natural source of such a variable: the past observations \(\mathbf{x}_{<t}\).

Consider the latent dynamics from time \(t-1\) to \(t\) (here we focus on first-order Markov dynamics for simplicity; higher-order dependencies can be handled similarly by augmenting the state space):

\[p(\mathbf{z}_t \mid \mathbf{z}_{t-1})\]

Different values of \(\mathbf{z}_{t-1}\) induce different conditional distributions over \(\mathbf{z}_t\). Since, as in the previous discussion, the decoder \(g\) is a deterministic invertible function, we have \(\mathbf{z}_{t-1} = g^{-1}(\mathbf{x}_{t-1})\), so \(p(\mathbf{z}_t \mid \mathbf{z}_{t-1})\) is equivalent to \(p(\mathbf{z}_t \mid g^{-1}(\mathbf{x}_{t-1}))\). This is exactly what we need! The past observation \(\mathbf{x}_{t-1}\) plays the role of \(\theta\):

\[\theta \leftarrow \mathbf{x}_{t-1}\]

This observation is powerful because it means temporal data is self-supervised for identifiability. We don’t need external labels, domain indices, or interventions—the sequential structure itself provides the necessary variation.
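To make this concrete, here is a minimal numpy sketch. The transition model (a Gaussian whose mean is `tanh(A @ z_prev)`, with weights `A`) is purely hypothetical, chosen only to illustrate that different past states index different conditional distributions over \(\mathbf{z}_t\):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2
A = rng.normal(size=(n, n))  # hypothetical transition weights, for illustration

def cond_mean(z_prev):
    """Mean of an illustrative transition p(z_t | z_{t-1}) = N(tanh(A z_{t-1}), sigma^2 I)."""
    return np.tanh(A @ z_prev)

# Two different past states induce two different conditionals over z_t,
# so the history itself plays the role of the auxiliary variable theta.
z_a, z_b = rng.normal(size=n), rng.normal(size=n)
print(cond_mean(z_a))
print(cond_mean(z_b))  # differs from the line above: the conditionals vary
```

Each distinct value of \(\mathbf{x}_{t-1} = g(\mathbf{z}_{t-1})\) thus hands us a different member of the family \(p(\mathbf{z}_t; \theta)\) for free.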

Applying Sufficient Variability

Now we apply the sufficient variability machinery with \(\theta = \mathbf{x}_{t-1}\). From Sequence Model 2, recall that the independence constraint for recovered latents \(\hat{\mathbf{z}}_t\) (which must also be conditionally independent given \(\mathbf{z}_{t-1}\)) leads to:

\[\frac{\partial^2 \log p(\hat{\mathbf{z}}_t \mid \mathbf{x}_{t-1})}{\partial \hat{z}_{t,i} \partial \hat{z}_{t,j}} = \frac{\partial^2 \log p(\hat{\mathbf{z}}_t \mid \mathbf{z}_{t-1})}{\partial \hat{z}_{t,i} \partial \hat{z}_{t,j}} = 0 \quad \text{for } i \neq j\]
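As a sanity check in the other direction, a component-wise \(h\) always satisfies this constraint: its Jacobian is diagonal, so the transformed log-density stays separable. A small sympy sketch for \(n=2\), using generic placeholder functions `f1`, `f2` for the factorized log-density terms (the conditioning on a fixed \(\mathbf{z}_{t-1}\) is left implicit):

```python
import sympy as sp

zh1, zh2 = sp.symbols('zhat1 zhat2')
f1, f2 = sp.Function('f1'), sp.Function('f2')   # factorized log-density terms
h1 = sp.Function('h1')(zh1)                     # component-wise map z = h(zhat)
h2 = sp.Function('h2')(zh2)

# log q(zhat) = log p(h(zhat)) + log|J_h|; a component-wise h has a diagonal
# Jacobian, so log|J_h| = log h1'(zhat1) + log h2'(zhat2).
logq = f1(h1) + f2(h2) + sp.log(sp.diff(h1, zh1)) + sp.log(sp.diff(h2, zh2))

cross = sp.diff(logq, zh1, zh2)
print(sp.simplify(cross))   # 0: the cross partial vanishes for component-wise h
```

The theorem below shows the converse: under sufficient variability, component-wise maps are the *only* transformations consistent with this constraint.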

Applying the chain rule and using \(\mathbf{z}_t = h(\hat{\mathbf{z}}_t)\) for some transformation \(h\):

\[\sum_{k} \frac{\partial \log p(\mathbf{z}_t \mid \mathbf{z}_{t-1})}{\partial z_{t,k}} \frac{\partial^2 z_{t,k}}{\partial \hat{z}_{t,i} \partial \hat{z}_{t,j}} + \sum_{k} \frac{\partial^2 \log p(\mathbf{z}_t \mid \mathbf{z}_{t-1})}{\partial z_{t,k}^2} \frac{\partial z_{t,k}}{\partial \hat{z}_{t,i}} \frac{\partial z_{t,k}}{\partial \hat{z}_{t,j}} + \frac{\partial^2 \log |J_h|}{\partial \hat{z}_{t,i} \partial \hat{z}_{t,j}} = 0\]

In the temporal setting, instead of constructing different contexts \(\theta\) manually, we can leverage the natural variability in \(\mathbf{z}_{t-1}\) by taking one more partial derivative with respect to \(z_{t-1,l}\). Analogous to taking differences across domain indices in the previous post, this operation removes the last term involving the Jacobian \(J_h\), because \(J_h\) does not depend on \(\mathbf{z}_{t-1}\):

\[\frac{\partial}{\partial z_{t-1,l}} \left[ \sum_{k} \frac{\partial \log p(\mathbf{z}_t \mid \mathbf{z}_{t-1})}{\partial z_{t,k}} \frac{\partial^2 z_{t,k}}{\partial \hat{z}_{t,i} \partial \hat{z}_{t,j}} + \sum_{k} \frac{\partial^2 \log p(\mathbf{z}_t \mid \mathbf{z}_{t-1})}{\partial z_{t,k}^2} \frac{\partial z_{t,k}}{\partial \hat{z}_{t,i}} \frac{\partial z_{t,k}}{\partial \hat{z}_{t,j}} \right] = 0\]

Expanding this derivative, we obtain:

\[\sum_{k} \frac{\partial^2 \log p(\mathbf{z}_t \mid \mathbf{z}_{t-1})}{\partial z_{t,k} \partial z_{t-1,l}} \frac{\partial^2 z_{t,k}}{\partial \hat{z}_{t,i} \partial \hat{z}_{t,j}} + \sum_{k} \frac{\partial^3 \log p(\mathbf{z}_t \mid \mathbf{z}_{t-1})}{\partial z_{t,k}^2 \partial z_{t-1,l}} \frac{\partial z_{t,k}}{\partial \hat{z}_{t,i}} \frac{\partial z_{t,k}}{\partial \hat{z}_{t,j}} = 0\]

This equation must hold for all \(i \neq j\) and all \(l \in \{1, \ldots, n\}\). Let us denote:

\[\mathbf{v}_k = \left( \frac{\partial^2 \log p}{\partial z_{t,k} \partial z_{t-1,1}}, \ldots, \frac{\partial^2 \log p}{\partial z_{t,k} \partial z_{t-1,n}} \right)^\top\] \[\mathring{\mathbf{v}}_k = \left( \frac{\partial^3 \log p}{\partial z_{t,k}^2 \partial z_{t-1,1}}, \ldots, \frac{\partial^3 \log p}{\partial z_{t,k}^2 \partial z_{t-1,n}} \right)^\top\]
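To build intuition for these quantities, consider an illustrative Gaussian transition with a state-dependent mean \(m(z_{t-1})\) but constant variance. A quick sympy check (one coordinate, for illustration) shows that every entry of \(\mathring{\mathbf{v}}_k\) then vanishes, since \(\partial^2 \log p / \partial z_{t,k}^2 = -1/\sigma^2\) no longer depends on the past state:

```python
import sympy as sp

zt, zp = sp.symbols('z_t z_prev', real=True)
sig = sp.symbols('sigma', positive=True)
m = sp.Function('m')(zp)                     # state-dependent mean
logp = -((zt - m) ** 2) / (2 * sig ** 2)     # Gaussian with constant variance

v = sp.diff(logp, zt, zp)          # entry of v_k: generically nonzero
v_ring = sp.diff(logp, zt, 2, zp)  # entry of v_ring_k
print(sp.simplify(v_ring))         # 0: constant variance kills v_ring_k
```

Such dynamics supply at most \(n\) nonzero vectors, so richer dependence on the past (e.g., a state-dependent variance) is needed for the condition below to hold.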

The sufficient variability condition becomes: for each value of \(\mathbf{z}_t\), the \(2n\) vector functions \(\mathbf{v}_1, \mathbf{v}_2, \dots, \mathbf{v}_n\) and \(\mathring{\mathbf{v}}_1, \mathring{\mathbf{v}}_2, \dots, \mathring{\mathbf{v}}_n\) are linearly independent as functions of \(\mathbf{z}_{t-1}\). Under this condition, the same argument from Sequence Model 2 applies: the transformation \(h\) must be a component-wise function.
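The condition can also be probed numerically: evaluate the \(2n\) vectors at many sampled values of \(\mathbf{z}_{t-1}\), stack them into a matrix, and check that it has full column rank \(2n\). A finite-difference sketch, again with hypothetical dynamics (weights `A`, `B` chosen only for illustration; the softplus standard deviation makes the \(\mathring{\mathbf{v}}_k\) nonzero):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2
A = rng.normal(size=(n, n))   # hypothetical mean weights
B = rng.normal(size=(n, n))   # hypothetical variance weights

def logp(zt, zp):
    """Illustrative factorized Gaussian transition with state-dependent mean and std."""
    mu = np.tanh(A @ zp)
    sig = np.log1p(np.exp(B @ zp)) + 0.1   # softplus, bounded away from zero
    return np.sum(-0.5 * ((zt - mu) / sig) ** 2 - np.log(sig))

def v(zt, zp, k, l, h=1e-3):
    """d^2 log p / d z_{t,k} d z_{t-1,l} via central differences."""
    ek, el = np.eye(n)[k] * h, np.eye(n)[l] * h
    return (logp(zt + ek, zp + el) - logp(zt + ek, zp - el)
            - logp(zt - ek, zp + el) + logp(zt - ek, zp - el)) / (4 * h * h)

def v_ring(zt, zp, k, l, h=1e-3):
    """d^3 log p / d z_{t,k}^2 d z_{t-1,l} via central differences."""
    ek, el = np.eye(n)[k] * h, np.eye(n)[l] * h
    d2 = lambda zp_: (logp(zt + ek, zp_) - 2 * logp(zt, zp_)
                      + logp(zt - ek, zp_)) / (h * h)
    return (d2(zp + el) - d2(zp - el)) / (2 * h)

zt = np.array([0.2, -0.4])
# Stack v_k and v_ring_k evaluated at many past states; full column rank 2n
# certifies that the 2n vector functions are linearly independent.
rows = []
for zp in rng.normal(size=(20, n)):
    rows.append([[v(zt, zp, k, l) for k in range(n)]
                 + [v_ring(zt, zp, k, l) for k in range(n)] for l in range(n)])
M = np.asarray(rows).reshape(-1, 2 * n)
print(np.linalg.matrix_rank(M))   # full rank 2n = 4 for these dynamics
```

This kind of rank check is a diagnostic for a chosen transition model, not a proof; the theorem requires linear independence to hold for each value of \(\mathbf{z}_t\).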

In this post we have seen how past observations can serve as an auxiliary variable, establishing identifiability without external domain variables or interventions. Note, however, that the latent causal dynamics \(\mathbf{z}_t = f(\mathbf{z}_{t-1})\) here is a fixed function. In the next post, we will explore how to extend this framework to more general settings with nonstationary, time-varying dynamics.

Citation

If you found this post useful, please consider citing it:

@article{song2026seqmodel3,
  title={Sequence Model 3: Past Observations as Auxiliary Variables},
  author={Song, Xiangchen},
  year={2026},
  month={February},
  url={https://xiangchensong.github.io/blog/2026/seq-model-3/}
}
