Academic Research

Deep dives into computer science, AI, and machine learning from the world's leading research institutions.

Research

SimpliHuMoN: Simplifying Human Motion Prediction

Human motion prediction encompasses both trajectory forecasting and human pose prediction. These tasks have traditionally been handled by specialized models tailored to each one, and combining them into a single framework for holistic motion prediction has proven difficult: existing methods often fall short when benchmarked on the individual tasks. To close this gap, the authors propose a transformer-based model built from a stack of self-attention modules that capture both the spatial dependencies within a single pose and the temporal relationships across a motion sequence. The same architecture handles pose-only, trajectory-only, and combined prediction without task-specific modifications. The authors evaluate the model on Human3.6M, AMASS, ETH-UCY, and 3DPW and report state-of-the-art performance across all evaluated tasks. Its combination of simplicity and strong results makes it relevant to applications in robotics, animation, and human-computer interaction, and to future work on more complex motion dynamics.
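
To make the idea of one attention stack covering both spatial and temporal structure concrete, here is a minimal NumPy sketch of single-head self-attention applied to tokens flattened over frames and joints. The tokenization (one token per joint per frame), the sizes `T`, `J`, `D`, and the random projection matrices are assumptions for illustration; the summary does not describe the paper's actual layer design.

```python
import numpy as np

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a token sequence."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # each row is a softmax distribution
    return weights @ v, weights

rng = np.random.default_rng(0)
T, J, D = 4, 3, 8  # hypothetical sizes: 4 frames, 3 joints, 8-dim embeddings
motion = rng.normal(size=(T, J, D))

# Flatten frames and joints into one sequence, so every token can attend to
# other joints in the same frame (spatial) and to other frames (temporal).
tokens = motion.reshape(T * J, D)
w_q, w_k, w_v = (rng.normal(size=(D, D)) for _ in range(3))
out, weights = self_attention(tokens, w_q, w_k, w_v)
print(out.shape, weights.shape)
```

Because the attention matrix spans all frame-joint pairs, no separate spatial and temporal modules are needed in this toy version; whether the paper factorizes attention differently is not stated in the summary.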

arXiv

Research

Accurate and Efficient Hybrid-Ensemble Atmospheric Data Assimilation in Latent Space with Uncertainty Quantification

The paper presents a novel data assimilation (DA) method called HLOBA (Hybrid-Ensemble Latent Observation-Background Assimilation), which aims to overcome the limitations of traditional and machine-learning DA techniques in achieving simultaneous accuracy, efficiency, and uncertainty quantification. Data assimilation is a critical process in meteorology and climate science, as it combines model forecasts with observational data to provide optimal atmospheric state estimates and initial conditions for weather predictions. The authors identify that existing methods often struggle to balance these three key aspects, which can lead to suboptimal performance in weather forecasting and climate reanalyses. HLOBA introduces a three-dimensional hybrid-ensemble framework that operates within a latent space derived from an autoencoder (AE). The method employs an AE to learn a compressed representation of the atmospheric state, allowing both model forecasts and observational data to be mapped into this shared latent space. This is achieved through two main components: the AE encoder, which processes model forecasts, and an end-to-end Observation-to-Latent-space mapping network (O2Lnet), which translates observations into the latent space. The fusion of these two data sources is performed using a Bayesian update mechanism, where the weights for the update are inferred from time-lagged ensemble forecasts. The efficacy of HLOBA is demonstrated through both idealized and real-observation experiments. The results indicate that HLOBA achieves comparable performance to traditional four-dimensional DA methods in terms of analysis and forecast skill. Notably, it does so while maintaining a level of efficiency that allows for end-to-end inference, making it adaptable to various forecasting models. This flexibility is a significant advantage, as it can potentially streamline the data assimilation process across different atmospheric models. 
A key innovation of HLOBA is its ability to exploit the error decorrelation property of latent variables. This capability enables the method to provide element-wise uncertainty estimates for the latent analysis, which are then propagated back to the model space using the decoder. The idealized experiments conducted in the study reveal that these uncertainty estimates are particularly valuable, as they highlight regions of large errors and capture their seasonal variability. This aspect of the method not only enhances the reliability of the atmospheric state estimates but also contributes to a better understanding of the uncertainties inherent in weather predictions. In summary, HLOBA represents a significant advancement in the field of data assimilation, offering a robust and efficient approach to atmospheric state estimation that integrates the strengths of machine learning with traditional DA techniques. Its ability to quantify uncertainty and provide flexible application across various models positions it as a promising tool for improving weather forecasting and climate research.
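
The Bayesian fusion step can be sketched as follows, under the assumption that latent errors are Gaussian and decorrelated, so the update reduces to an independent scalar update per latent element. The function name, the toy numbers, and the use of explicit variance arrays are illustrative; the autoencoder, O2Lnet, and the procedure for deriving variances from time-lagged ensembles are not modeled here.

```python
import numpy as np

def latent_bayesian_update(z_b, var_b, z_o, var_o):
    """Element-wise Gaussian update of a latent background with a latent observation.

    z_b, var_b: background latent state and its error variance
                (in HLOBA the weights are inferred from time-lagged ensembles).
    z_o, var_o: observation mapped to latent space and its error variance.
    """
    gain = var_b / (var_b + var_o)   # per-element weight given to the observation
    z_a = z_b + gain * (z_o - z_b)   # analysis mean
    var_a = (1.0 - gain) * var_b     # analysis variance (element-wise uncertainty)
    return z_a, var_a

z_b = np.array([0.0, 1.0, 2.0])
var_b = np.array([1.0, 1.0, 4.0])
z_o = np.array([1.0, 1.0, 0.0])
var_o = np.array([1.0, 3.0, 4.0])

z_a, var_a = latent_bayesian_update(z_b, var_b, z_o, var_o)
print(z_a)    # [0.5 1.  1. ]
print(var_a)  # [0.5  0.75 2.  ]
```

The analysis variance is never larger than the background variance, which is what allows the latent uncertainty estimate to be decoded back to model space as a map of where the analysis is least certain.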

arXiv

Research

SELDON: Supernova Explosions Learned by Deep ODE Networks

The paper introduces SELDON, a novel continuous-time variational autoencoder designed to handle the challenges posed by the anticipated influx of optical transient alerts from the Vera C. Rubin Observatory's Legacy Survey of Space and Time (LSST). With projections estimating up to 10 million public alerts per night, traditional physics-based inference methods are at risk of being overwhelmed due to their slow processing times, often requiring hours for each object. SELDON aims to address this issue by providing millisecond-scale inference capabilities for thousands of astronomical objects daily. The core of SELDON's architecture is a masked GRU-ODE (Gated Recurrent Unit - Ordinary Differential Equation) encoder, which is adept at summarizing panels of sparse and irregularly sampled astrophysical light curves. These light curves are characterized by their nonstationary, heteroscedastic, and dependent nature, complicating traditional analysis methods. The encoder is designed to effectively learn from imbalanced and correlated data, even when only a limited number of observations are available. Following the encoding process, SELDON employs a neural ODE to propagate the learned hidden state forward in continuous time, allowing for the extrapolation of future observations that have not yet been recorded. This capability is crucial for timely decision-making in astrophysical surveys, where rapid follow-up observations can significantly enhance the understanding of transient phenomena. The extrapolated time series is subsequently encoded using deep sets, leading to a latent distribution that is decoded into a weighted sum of Gaussian basis functions. The parameters derived from this decoding process—such as rise time, decay rate, and peak flux—are not only interpretable but also physically meaningful, providing valuable insights for downstream tasks like prioritizing spectroscopic follow-ups. 
The implications of SELDON extend beyond astronomy, as its architecture offers a versatile framework for continuous-time sequence modeling applicable in various fields where data is multivariate, sparse, heteroscedastic, and irregularly spaced. This adaptability positions SELDON as a significant advancement in the field of machine learning and data analysis, promising to enhance the efficiency and effectiveness of data-driven decision-making in a range of scientific domains.
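
As an illustration of decoding into a weighted sum of Gaussian basis functions, the sketch below evaluates a light curve from basis weights, centers, and widths. The parameterization and the toy values are assumptions; the summary does not specify how SELDON's decoder lays out its basis or how rise time, decay rate, and peak flux are derived from it.

```python
import numpy as np

def gaussian_basis_flux(t, weights, centers, widths):
    """Evaluate flux(t) = sum_i weights[i] * exp(-0.5 * ((t - centers[i]) / widths[i])**2)."""
    t = np.asarray(t, dtype=float)[:, None]              # shape (n_times, 1)
    basis = np.exp(-0.5 * ((t - centers) / widths) ** 2)  # shape (n_times, n_basis)
    return basis @ weights

# Hypothetical decoded parameters: a narrow early component and a broad later one.
weights = np.array([1.0, 0.5])
centers = np.array([0.0, 10.0])   # days relative to a reference epoch
widths = np.array([2.0, 5.0])

times = np.array([0.0, 10.0])
flux = gaussian_basis_flux(times, weights, centers, widths)
print(flux)
```

Because the curve is a closed-form function of a few parameters, it can be evaluated at any time, including epochs that have not been observed yet.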

arXiv

Research

A Dual-Helix Governance Approach Towards Reliable Agentic AI for WebGIS Development

The paper examines the challenges of developing WebGIS systems with large language models (LLMs). The authors identify five limitations of LLMs that hinder agentic AI applications: context constraints, cross-session forgetting, stochasticity, instruction failure, and adaptation rigidity. They frame these as structural governance problems that cannot be resolved by increasing model capacity alone. To address them, the authors propose a dual-helix governance framework, operationalized through a three-track architecture comprising Knowledge, Behavior, and Skills. The architecture uses a knowledge graph substrate to stabilize execution by externalizing domain-specific facts and enforcing executable protocols, which improves the reliability of agentic AI systems in geospatial engineering tasks. The framework is demonstrated on the FutureShorelines WebGIS tool, where a governed agent refactored a 2,265-line monolithic codebase into modular ES6 components. The refactoring yielded a 51% reduction in cyclomatic complexity and a 7-point increase in the maintainability index. A comparative experiment against a zero-shot LLM indicates that externalized governance mechanisms, rather than model capability alone, drive operational reliability. The authors position the framework as a contribution to the broader effort of integrating AI into complex engineering domains. 
The approach is made accessible through the open-source AgentLoom governance toolkit, which aims to facilitate the adoption of these governance strategies in future AI-driven projects.

arXiv

Research

ZipMap: Linear-Time Stateful 3D Reconstruction with Test-Time Training

The paper presents ZipMap, an innovative stateful feed-forward model designed to address the computational inefficiencies associated with existing state-of-the-art 3D vision methods, particularly those that scale quadratically with the number of input images. Traditional models like VGGT and $π^3$ have demonstrated impressive results in 3D reconstruction but suffer from significant computational costs, making them impractical for large-scale image collections. In contrast, ZipMap leverages a linear-time approach that not only reduces computational overhead but also maintains or exceeds the accuracy of its quadratic-time counterparts. The authors introduce a novel mechanism involving test-time training layers that facilitate the compression of an entire image collection into a compact hidden scene state during a single forward pass. This breakthrough allows ZipMap to reconstruct over 700 frames in less than 10 seconds on a single H100 GPU, achieving a speed improvement of more than 20 times compared to VGGT. Furthermore, the paper highlights the advantages of a stateful representation, which enhances real-time scene-state querying and supports sequential streaming reconstruction. The implications of this research are significant, as they pave the way for more efficient and scalable 3D vision applications, particularly in scenarios where rapid processing of large image datasets is essential. The authors provide extensive experimental results that validate the performance of ZipMap, demonstrating its potential to revolutionize the field of 3D vision by combining speed, efficiency, and accuracy in a single framework.
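
The linear-time claim rests on folding each input into a fixed-size state instead of comparing every input with every other. The sketch below shows a generic test-time-training style fast-weight update, where a state matrix takes one gradient step per token on a reconstruction loss. This is an assumed, simplified stand-in: the function names, the loss, and the learning rate are hypothetical, and ZipMap's actual layers are not described in the summary.

```python
import numpy as np

def compress_into_state(keys, values, lr=0.1):
    """Fold a sequence of (key, value) tokens into one fixed-size state matrix.

    Each token triggers a single gradient step on 0.5 * ||state @ k - v||^2,
    so total cost grows linearly with the number of tokens.
    """
    state = np.zeros((values.shape[1], keys.shape[1]))
    for k, v in zip(keys, values):
        error = state @ k - v              # prediction error for this token
        state -= lr * np.outer(error, k)   # gradient of the loss w.r.t. state
    return state

def query_state(state, q):
    """Read from the compressed state; cost is independent of sequence length."""
    return state @ q

rng = np.random.default_rng(0)
keys = np.eye(3)                   # orthonormal keys make the toy case exact
values = rng.normal(size=(3, 4))
state = compress_into_state(keys, values, lr=1.0)
recovered = np.stack([query_state(state, k) for k in keys])
print(np.allclose(recovered, values))
```

With orthonormal keys and a unit step size the stored values are recovered exactly; with real, correlated features the state is a lossy compression, which is the trade-off a stateful model makes for linear scaling.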

arXiv

Research

AgentIR: Reasoning-Aware Retrieval for Deep Research Agents

The emergence of Deep Research agents as primary consumers of retrieval systems has necessitated a reevaluation of how these systems interpret user intent and context. Traditional retrieval systems often overlook the nuanced reasoning that precedes a query, which is critical for understanding user intent. This paper introduces a novel paradigm called Reasoning-Aware Retrieval, which integrates the reasoning process of Deep Research agents into the retrieval mechanism. By embedding the agent's reasoning alongside its query, the system can leverage additional contextual information that enhances retrieval accuracy. Furthermore, the authors present DR-Synth, a data synthesis method designed to create training data specifically for Deep Research retrievers from existing question-answering datasets. The effectiveness of these innovations is demonstrated through the development of AgentIR-4B, an embedding model that significantly outperforms conventional models on the BrowseComp-Plus benchmark. AgentIR-4B achieved an impressive 68% accuracy with the open-weight agent Tongyi-DeepResearch, compared to 50% accuracy from larger conventional embedding models and a mere 37% from the traditional BM25 algorithm. The results underscore the importance of reasoning in retrieval tasks and suggest that integrating reasoning traces can lead to substantial improvements in performance. The code and data for this research are publicly available, promoting further exploration and development in this area.
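
A toy sketch of why embedding the reasoning alongside the query can change retrieval results. The bag-of-words `embed` function is a stand-in for a learned embedding model such as AgentIR-4B, and the simple concatenation of reasoning and query is an assumption; the summary does not specify the input format the model uses. The documents and query are invented for illustration.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0

def rank(query_text, docs):
    """Return docs sorted by similarity to the query text, best first."""
    q = embed(query_text)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)

docs = [
    "the 1998 winter olympics were held in nagano japan",
    "the 1998 world cup final was played in paris france",
]
reasoning = "the athlete competed in a world cup final so I need the host city"
query = "1998 host city"

plain = rank(query, docs)                    # query only
aware = rank(reasoning + " " + query, docs)  # reasoning trace plus query
print(plain[0])
print(aware[0])
```

The bare query is ambiguous between the two events, while the reasoning trace carries the terms that disambiguate it, so the ranking flips in favor of the relevant document.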

arXiv

Research

TaxonRL: Reinforcement Learning with Intermediate Rewards for Interpretable Fine-Grained Visual Reasoning

In the realm of vision-language models, the challenge of fine-grained taxonomic reasoning has emerged as a significant hurdle, particularly when differentiating between visually similar species within the same taxonomic group. The research presented in this paper introduces TaxonRL, a novel reinforcement learning framework that leverages Group Relative Policy Optimization (GRPO) to enhance the model's ability to perform hierarchical reasoning across various taxonomic levels. This innovative approach decomposes the classification task into a series of intermediate predictions that focus on species, genus, and family characteristics, thereby fostering a structured reasoning process that is both interpretable and verifiable. The methodology employed in this study involves the application of GRPO, which is designed to optimize the model's decision-making process by providing intermediate rewards based on the accuracy of predictions at each taxonomic level. This structured reward system encourages the model to engage in a more nuanced analysis of the visual data, leading to improved classification accuracy. The researchers evaluated TaxonRL using the Birds-to-Words dataset, a challenging benchmark that tests the model's ability to recognize and classify bird species based on visual inputs and associated language descriptions. The results of the study are compelling; TaxonRL achieved an impressive average accuracy of 91.7%, significantly surpassing human performance, which was recorded at 77.3%. This remarkable achievement not only highlights the efficacy of the proposed method but also underscores the potential for reinforcement learning techniques to enhance fine-grained visual discrimination tasks. Furthermore, the model's ability to generate interpretable reasoning traces provides valuable insights into the decision-making process, allowing for greater transparency in how classifications are made. 
Additionally, the research demonstrates strong cross-domain generalization capabilities, with TaxonRL showing substantial improvements in tasks involving primate and marine species verification. This suggests that the hierarchical reasoning framework established by the model is not only effective within the domain of avian species but can also be adapted to other biological classifications, thereby broadening its applicability. The significance of this research lies in its contribution to the field of machine learning and computer vision, particularly in enhancing the interpretability and accuracy of models tasked with fine-grained classification. By enforcing a structured approach to reasoning, TaxonRL sets a precedent for future developments in vision-language models, emphasizing the importance of hierarchical thinking in complex classification scenarios. This work opens new avenues for research into reinforcement learning applications in biological taxonomy and beyond, paving the way for more sophisticated models that can tackle similar challenges across various domains.
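
The combination of intermediate rewards and group-relative optimization can be sketched as below. The per-level weights, the exact-match scoring, and the use of the population standard deviation are assumptions made for illustration; the summary does not give TaxonRL's reward values or normalization details.

```python
import statistics

# Hypothetical weights for partial credit at each taxonomic level.
LEVEL_WEIGHTS = {"family": 0.2, "genus": 0.3, "species": 0.5}

def hierarchical_reward(pred, target):
    """Sum the weights of every taxonomic level the prediction gets right."""
    return sum(w for level, w in LEVEL_WEIGHTS.items() if pred.get(level) == target[level])

def group_relative_advantages(rewards):
    """GRPO-style advantage: each reward standardized within its sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]

target = {"family": "Paridae", "genus": "Poecile", "species": "atricapillus"}
group = [
    {"family": "Paridae", "genus": "Poecile", "species": "atricapillus"},  # all levels right
    {"family": "Paridae", "genus": "Poecile", "species": "carolinensis"},  # species wrong
    {"family": "Corvidae", "genus": "Corvus", "species": "corax"},         # all levels wrong
]

rewards = [hierarchical_reward(p, target) for p in group]
advantages = group_relative_advantages(rewards)
print(rewards)
print(advantages)
```

A response that gets the family and genus right but misses the species still earns more than one that is wrong at every level, which is the signal that pushes the model toward structured, level-by-level reasoning.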

arXiv

Research

Helios: Real Real-Time Long Video Generation Model

The Helios model represents a significant advancement in the field of video generation, introducing a 14 billion parameter autoregressive diffusion model capable of generating videos at 19.5 frames per second (FPS) on a single NVIDIA H100 GPU. This model is particularly notable for its ability to generate minute-scale videos while maintaining quality comparable to existing strong baselines. The development of Helios addresses several critical challenges in video generation, including long-video drifting, real-time generation capabilities, and efficient training methodologies. One of the primary innovations of Helios is its robustness to long-video drifting, a common issue in video generation that can lead to inconsistencies and repetitive motion. Unlike traditional methods that rely on anti-drifting heuristics such as self-forcing or keyframe sampling, Helios employs a novel training strategy that simulates drifting during the training process. This approach allows the model to learn to mitigate drifting effectively while eliminating repetitive motion at its source. In terms of performance, Helios achieves real-time video generation without the use of standard acceleration techniques like KV-cache or sparse attention. This is a significant achievement, as many existing models require complex optimizations to achieve similar speeds. The model's efficiency is further enhanced by its ability to compress historical and noisy context, as well as by reducing the number of sampling steps, resulting in computational costs that are comparable to or even lower than those of smaller models with only 1.3 billion parameters. Helios also stands out for its training efficiency, as it does not rely on parallelism or sharding frameworks. This enables the model to utilize image-diffusion-scale batch sizes while fitting up to four 14B models within 80 GB of GPU memory. Such optimizations are crucial for researchers and developers working with limited computational resources. 
The authors conducted extensive experiments to validate the performance of Helios, demonstrating that it consistently outperforms prior methods in both short- and long-video generation tasks. The results indicate that Helios not only meets but exceeds the expectations set by existing models in terms of quality and efficiency. To support further research and development in the community, the authors plan to release the code, base model, and a distilled version of the model. This open-source approach is expected to foster collaboration and innovation in the field of video generation, allowing other researchers to build upon the advancements made with Helios. Overall, the introduction of the Helios model marks a pivotal moment in video generation technology, with its unique capabilities and performance metrics paving the way for future developments in this rapidly evolving area of artificial intelligence.

arXiv

Research

Robustness of Agentic AI Systems via Adversarially-Aligned Jacobian Regularization

The research paper presents a significant advancement in the training of Large Language Models (LLMs) within the context of autonomous multi-agent systems. As these models evolve, the need for robust minimax training becomes increasingly critical, particularly due to the instability that can arise when non-linear policies lead to extreme local curvature during the inner maximization process. Traditional methods that enforce global Jacobian bounds have been found to be overly conservative, as they suppress sensitivity across all directions, which results in a substantial Price of Robustness. This paper introduces a novel approach termed Adversarially-Aligned Jacobian Regularization (AAJR), which offers a trajectory-aligned strategy for controlling sensitivity specifically along adversarial ascent directions. The authors provide a theoretical foundation for AAJR, demonstrating that it allows for a strictly larger admissible policy class compared to global constraints under mild conditions. This implies a reduction in the approximation gap and less degradation in nominal performance. Furthermore, the paper outlines specific step-size conditions that ensure AAJR maintains effective smoothness along optimization trajectories, thereby enhancing inner-loop stability. The findings contribute to a structural theory of agentic robustness, effectively decoupling minimax stability from the limitations imposed by global expressivity restrictions. This research not only addresses the challenges of training LLMs in complex environments but also sets the stage for future developments in robust AI systems, emphasizing the importance of tailored regularization techniques that align with the dynamics of adversarial interactions.
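
The contrast between a global Jacobian bound and a direction-aligned one can be illustrated numerically. The sketch below uses a toy `tanh` policy and treats the gradient of a simple squared-error loss as the adversarial ascent direction; the policy, the loss, and both penalty definitions are assumptions chosen for illustration and are not taken from the paper.

```python
import numpy as np

def policy_jacobian(x, w):
    """Jacobian of the toy policy tanh(w @ x) with respect to the input x."""
    pre_activation = w @ x
    return (1.0 - np.tanh(pre_activation) ** 2)[:, None] * w

def global_penalty(jac):
    """Squared Frobenius norm: penalizes sensitivity in every input direction."""
    return float(np.sum(jac ** 2))

def aligned_penalty(jac, direction):
    """Squared norm of the Jacobian applied to one unit direction only."""
    unit = direction / np.linalg.norm(direction)
    return float(np.sum((jac @ unit) ** 2))

rng = np.random.default_rng(0)
w = rng.normal(size=(3, 5))
x = rng.normal(size=5)
target = np.zeros(3)

jac = policy_jacobian(x, w)
# Ascent direction of the toy loss 0.5 * ||tanh(w @ x) - target||^2 w.r.t. x.
ascent = jac.T @ (np.tanh(w @ x) - target)

global_value = global_penalty(jac)
aligned_value = aligned_penalty(jac, ascent)
print(aligned_value <= global_value)
```

For any unit direction the aligned penalty is bounded by the global one, so constraining only the aligned quantity leaves the policy free to remain sensitive in directions the adversary does not exploit. This matches the intuition behind the larger admissible policy class described above.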

arXiv

Research

$τ$-Knowledge: Evaluating Conversational Agents over Unstructured Knowledge

The paper introduces $τ$-Knowledge, an innovative framework designed to evaluate conversational agents in complex, knowledge-intensive environments, particularly in the fintech sector. As conversational agents become more prevalent in customer support roles, their effectiveness hinges on their ability to retrieve and utilize domain-specific knowledge from extensive, unstructured data sources. Traditional benchmarks have typically assessed retrieval capabilities or tool usage in isolation, failing to capture the intricate interplay between these components in real-world applications. The authors highlight this gap and propose $τ$-Knowledge as a solution, extending the existing $τ$-Bench framework to facilitate a more comprehensive evaluation of agent performance in scenarios requiring both knowledge retrieval and tool application. The study specifically focuses on the $τ$-Banking domain, which simulates realistic customer support workflows in the financial technology sector. In this context, agents must navigate approximately 700 interconnected knowledge documents while executing tool-mediated account updates. This complexity presents significant challenges, as agents are required to not only retrieve relevant information but also to apply it accurately in compliance with internal policies. The results of the evaluation reveal that even state-of-the-art models, equipped with advanced reasoning capabilities, achieve only around 25.5% pass rates in this environment. Moreover, the reliability of these models deteriorates sharply with repeated trials, indicating that they struggle to consistently retrieve the correct documents from the densely interlinked knowledge base and to reason effectively over the intricate internal policies governing customer interactions. The significance of this research lies in its provision of a realistic testbed for the development of conversational agents that can effectively integrate unstructured knowledge in human-facing deployments. 
By addressing the limitations of existing evaluation frameworks, $τ$-Knowledge aims to foster advancements in the design and implementation of more capable and reliable conversational agents, ultimately enhancing customer support experiences in knowledge-intensive domains.
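
The drop in reliability over repeated trials can be quantified with a pass^k style estimator of the kind used in τ-Bench: the chance that k trials drawn from n recorded attempts all succeed, averaged over tasks. The summary does not state the exact metric used in this paper, so the estimator and the toy success counts below are assumptions for illustration.

```python
from math import comb

def pass_hat_k(successes_per_task, num_trials, k):
    """Average over tasks of C(c, k) / C(n, k), where c is the task's success count.

    This estimates the probability that k independent trials of a task all succeed.
    """
    total = sum(comb(c, k) / comb(num_trials, k) for c in successes_per_task)
    return total / len(successes_per_task)

# Hypothetical results: successes out of 4 trials for each of 4 tasks.
successes = [4, 2, 0, 1]
for k in (1, 2, 4):
    print(k, pass_hat_k(successes, num_trials=4, k=k))
```

Only the task solved on every attempt contributes at k = 4, so an agent that succeeds intermittently scores well at k = 1 and poorly at higher k.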

arXiv