Are Object-Centric Representations Better At Compositional Generalization?
Professional Abstract
"Compositional generalization is a crucial cognitive ability that allows humans to understand and reason about new combinations of familiar concepts. This capability poses a significant challenge in the field of machine learning, particularly within the domain of visual question answering (VQA). The study presented in this paper addresses this challenge by investigating the effectiveness of object-centric (OC) representations in enhancing compositional generalization in visually rich environments. The authors introduce a benchmark designed to evaluate how well various vision encoders, both with and without object-centric biases, can generalize to unseen combinations of object properties across three distinct visual worlds: CLEVRTex, Super-CLEVR, and MOVi-C. To ensure a robust and fair evaluation, the authors meticulously controlled for several factors, including training data diversity, sample size, representation size, downstream model capacity, and computational resources. They utilized two prominent vision encoders—DINOv2 and SigLIP2—as the foundational models, alongside their object-centric counterparts. The results of their experiments yield several key insights: firstly, OC approaches demonstrate superior performance in more challenging compositional generalization scenarios, suggesting that these representations are particularly adept at handling complex visual reasoning tasks. Conversely, traditional dense representations tend to excel only in simpler settings, often requiring significantly more computational resources to achieve comparable results. Secondly, the study highlights the sample efficiency of OC models, which can achieve robust generalization with fewer training images compared to dense encoders. The latter only begin to match or exceed the performance of OC models when provided with ample data and diversity in training samples. Overall, the findings underscore the potential of object-centric representations to facilitate stronger compositional generalization, particularly when constraints are placed on dataset size, training diversity, or computational capacity. This research contributes valuable insights into the ongoing discourse surrounding the development of more effective machine learning models capable of mimicking human-like reasoning abilities in complex visual environments."