Causality is Key for Interpretability Claims to Generalise
Professional Abstract
"The research paper addresses the critical challenges in the interpretability of large language models (LLMs), particularly focusing on the limitations of existing interpretability studies. The authors argue that while significant insights have been gained regarding model behavior, persistent issues such as non-generalizable findings and causal interpretations that exceed the available evidence remain prevalent. The paper emphasizes the importance of causal inference as a framework for establishing valid mappings from model activations to high-level structures that are invariant across different contexts. By employing Judea Pearl's causal hierarchy, the authors delineate the boundaries of what interpretability studies can substantiate. The paper outlines the distinction between mere observations, which identify associations between model behavior and internal components, and more rigorous interventions, such as ablations or activation patching, that can demonstrate how specific modifications influence behavioral metrics like changes in token probabilities across various prompts. However, the authors highlight a significant gap in the ability to make counterfactual claims—predictions about model outputs under hypothetical unobserved interventions—without the presence of controlled supervision. This limitation underscores the need for a more robust methodological approach to causal inference in the context of LLMs. To address these challenges, the authors propose the concept of causal representation learning (CRL), which operationalizes the causal hierarchy and clarifies which variables can be inferred from model activations and under what assumptions. This framework not only aids in understanding the causal relationships within LLMs but also provides a diagnostic tool for practitioners. It helps in selecting appropriate methods and evaluations that align claims with empirical evidence, thereby enhancing the generalizability of findings. Overall, this research contributes to the ongoing discourse on model interpretability by advocating for a more structured and evidence-based approach to causal inference, ultimately aiming to improve the reliability and applicability of interpretability studies in the field of artificial intelligence."