Are Object-Centric Representations Better At Compositional Generalization?

arXiv • February 18, 2026 • Ferdinand Kapl, Amir Mohammad Karimi Mamaghan, Maximilian Seitzer, Karl Henrik Johansson, Carsten Marr, Stefan Bauer, Andrea Dittadi

Professional Abstract

"Compositional generalization is the ability to understand and reason about new combinations of familiar concepts. It comes naturally to humans but remains a significant challenge in machine learning, including in visual question answering (VQA). This paper investigates whether object-centric (OC) representations improve compositional generalization in visually rich environments. The authors introduce a benchmark that evaluates how well vision encoders, with and without object-centric biases, generalize to unseen combinations of object properties across three visual worlds: CLEVRTex, Super-CLEVR, and MOVi-C. To keep the comparison fair, they control for training data diversity, sample size, representation size, downstream model capacity, and compute. They use two vision encoders, DINOv2 and SigLIP2, as foundation models, alongside their object-centric counterparts. The experiments yield three key findings. First, OC approaches perform better in the harder compositional generalization settings. Second, dense representations outperform OC ones only in easier settings, and they typically need more compute to reach comparable results. Third, OC models are more sample efficient: they generalize well from fewer training images, whereas dense encoders match or exceed them only when given sufficient data and diversity.
Overall, the findings indicate that object-centric representations offer stronger compositional generalization when dataset size, training data diversity, or compute is constrained. The work adds evidence to the ongoing discussion of how to build models that approach human-like reasoning in complex visual environments."

Technical Insights

1. Compositional generalization is essential for human cognition and a significant challenge in machine learning.

2. The study introduces a Visual Question Answering benchmark across three controlled visual worlds: CLEVRTex, Super-CLEVR, and MOVi-C.

3. The research evaluates vision encoders with and without object-centric biases to measure how well they generalize.

4. Object-centric approaches outperform dense representations in the harder compositional generalization tasks.

5. Dense representations excel only in simpler scenarios and typically require more computational resources to achieve similar performance.

6. Object-centric models are more sample efficient, achieving better generalization with fewer training images.

7. Dense encoders match or exceed the performance of OC models only when sufficient data and diversity are provided.

8. The study controls for variables such as training data diversity, sample size, representation size, and downstream model capacity.

9. Overall, the research highlights the advantages of object-centric representations for compositional generalization under constrained conditions.
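
To make "unseen combinations of object properties" concrete, the following is a minimal sketch of how a compositional train/test split can be built. It is not the benchmark's actual procedure: the attribute vocabularies, the function names `covers_all_values` and `compositional_split`, and the 25% hold-out fraction are all assumptions chosen for illustration. The idea is that every individual attribute value appears during training, while some combinations of values are reserved for testing only.

```python
from itertools import product
import random

# Hypothetical attribute vocabularies (assumed for illustration; the
# benchmark's real property sets are not listed in this summary).
SHAPES = ["cube", "sphere", "cylinder"]
COLORS = ["red", "green", "blue"]
MATERIALS = ["metal", "rubber"]
VOCABS = [set(SHAPES), set(COLORS), set(MATERIALS)]


def covers_all_values(combos):
    """Return True if every attribute value appears in at least one combo."""
    seen = [set() for _ in VOCABS]
    for combo in combos:
        for slot, value in enumerate(combo):
            seen[slot].add(value)
    return seen == VOCABS


def compositional_split(holdout_fraction=0.25, seed=0):
    """Split attribute combinations into seen (train) and unseen (test) sets.

    A combination is held out only if the remaining training combinations
    still contain every individual attribute value, so the test set is novel
    in how familiar values are combined, not in the values themselves.
    """
    rng = random.Random(seed)
    train = list(product(SHAPES, COLORS, MATERIALS))
    rng.shuffle(train)
    n_holdout = int(len(train) * holdout_fraction)

    held_out = []
    for combo in list(train):
        if len(held_out) == n_holdout:
            break
        candidate = [c for c in train if c != combo]
        if covers_all_values(candidate):
            train = candidate
            held_out.append(combo)
    return train, held_out


train_combos, test_combos = compositional_split()
print(len(train_combos), "training combinations,", len(test_combos), "held out")
```

A model would then be trained on scenes containing only `train_combos` and evaluated on questions about scenes containing `test_combos`. The difference between accuracy on seen and held-out combinations gives a simple measure of compositional generalization.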