$τ$-Knowledge: Evaluating Conversational Agents over Unstructured Knowledge
Professional Abstract
The paper introduces $τ$-Knowledge, a framework for evaluating conversational agents in knowledge-intensive environments, with a focus on the fintech sector. As conversational agents take on more customer support roles, their effectiveness depends on retrieving and using domain-specific knowledge from large, unstructured data sources. Existing benchmarks typically assess retrieval or tool use in isolation and so fail to capture how the two interact in real-world applications. To close this gap, the authors extend the existing $τ$-Bench framework so that agent performance can be evaluated in scenarios requiring both knowledge retrieval and tool application. The study focuses on the $τ$-Banking domain, which simulates realistic customer support workflows in financial technology. There, agents must navigate approximately 700 interconnected knowledge documents while executing tool-mediated account updates, which requires them not only to retrieve relevant information but also to apply it accurately and in compliance with internal policies. The evaluation shows that even state-of-the-art models with advanced reasoning capabilities achieve pass rates of only about 25.5% in this environment. Their reliability also deteriorates sharply over repeated trials, indicating that they struggle to consistently retrieve the correct documents from the densely interlinked knowledge base and to reason over the intricate internal policies governing customer interactions. The work thus provides a realistic testbed for developing conversational agents that integrate unstructured knowledge in human-facing deployments. By addressing the limitations of existing evaluation frameworks, $τ$-Knowledge aims to support the design of more capable and reliable conversational agents and, ultimately, better customer support in knowledge-intensive domains.