Saliency-Aware Multi-Route Thinking: Revisiting Vision-Language Reasoning
Professional Abstract
"The paper presents a significant advancement in the field of vision-language models (VLMs), which are designed to integrate and reason with both visual and textual inputs. The authors identify a critical challenge in the current architecture of VLMs: the reliance on visual inputs only at the onset of generation, which leads to a predominance of textual reasoning that can result in compounded errors from initial visual grounding. This issue is exacerbated by the coarse and noisy nature of existing guidance mechanisms for visual grounding during inference, making it difficult to maintain accuracy in reasoning over extended textual outputs. To tackle these challenges, the authors propose the Saliency-Aware Principle (SAP) selection method. This innovative approach emphasizes high-level reasoning principles rather than focusing on token-level trajectories, which allows for more stable control over the generation process, even in the presence of noisy feedback. Notably, SAP enables the model to revisit visual evidence during later reasoning steps, thereby enhancing the accuracy of visual grounding when necessary. Furthermore, SAP facilitates multi-route inference, allowing for parallel exploration of various reasoning behaviors, which can lead to richer and more nuanced outputs. The authors highlight that SAP is model-agnostic and does not require additional training, making it a versatile solution applicable across different VLM architectures. Empirical evaluations demonstrate that SAP significantly reduces the phenomenon of object hallucination—where the model generates incorrect or fabricated visual information—while maintaining competitive performance within the same token-generation budgets. Additionally, the implementation of SAP results in more stable reasoning processes and reduced response latency compared to traditional chain-of-thought (CoT) reasoning methods. This research not only addresses pressing issues in VLMs but also sets the stage for future developments in multimodal AI systems, emphasizing the importance of effective visual grounding in enhancing the overall reasoning capabilities of these models."