Causality is Key for Interpretability Claims to Generalise
Professional Abstract
"The research paper addresses the critical challenges in the interpretability of large language models (LLMs), particularly focusing on the limitations of existing interpretability studies. The authors argue that while significant insights have been gained regarding model behavior, persistent issues such as non-generalizable findings and causal interpretations that exceed the available evidence remain prevalent. The paper emphasizes the importance of causal inference as a framework for establishing valid mappings from model activations to high-level structures that are invariant across different contexts. By employing Judea Pearl's causal hierarchy, the authors delineate the boundaries of what interpretability studies can substantiate. The paper outlines the distinction between mere observations, which identify associations between model behavior and internal components, and more rigorous interventions, such as ablations or activation patching, that can demonstrate how specific modifications influence behavioral metrics like changes in token probabilities across various prompts. However, the authors highlight a significant gap in the ability to make counterfactual claims—predictions about model outputs under hypothetical unobserved interventions—without the presence of controlled supervision. This limitation underscores the need for a more robust methodological approach to causal inference in the context of LLMs. To address these challenges, the authors propose the concept of causal representation learning (CRL), which operationalizes the causal hierarchy and clarifies which variables can be inferred from model activations and under what assumptions. This framework not only aids in understanding the causal relationships within LLMs but also provides a diagnostic tool for practitioners. It helps in selecting appropriate methods and evaluations that align claims with empirical evidence, thereby enhancing the generalizability of findings. Overall, this research contributes to the ongoing discourse on model interpretability by advocating for a more structured and evidence-based approach to causal inference, ultimately aiming to improve the reliability and applicability of interpretability studies in the field of artificial intelligence."