ZipMap: Linear-Time Stateful 3D Reconstruction with Test-Time Training

arXiv • March 4, 2026 • Haian Jin, Rundi Wu, Tianyuan Zhang, Ruiqi Gao, Jonathan T. Barron, Noah Snavely, Aleksander Holynski

Professional Abstract

ZipMap is a stateful feed-forward model for 3D reconstruction that targets the main cost of current state-of-the-art methods: computation that scales quadratically with the number of input images. Models such as VGGT and $\pi^3$ produce strong reconstructions, but that quadratic cost makes them impractical for large image collections. ZipMap instead runs in time linear in the number of images while matching or exceeding the accuracy of its quadratic-time counterparts. It does this with test-time training layers that compress an entire image collection into a compact hidden scene state in a single forward pass. As a result, ZipMap reconstructs over 700 frames in under 10 seconds on a single H100 GPU, more than 20 times faster than VGGT. The stateful representation also supports real-time querying of the scene state and extends to sequential streaming reconstruction. The authors back these claims with extensive experiments, and the results matter most where large image collections must be processed quickly.
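The idea of compressing a stream of inputs into a fixed-size state through test-time training can be illustrated with a toy example. The sketch below is not ZipMap's architecture: it assumes a generic fast-weight layer in which the state is a small matrix `W`, updated by one gradient step per token on the loss 0.5 * ||W k - v||^2. The class name `FastWeightState`, the methods `update` and `read`, and the key/value inputs are all hypothetical and chosen only for illustration.

```python
from typing import List

Vector = List[float]


class FastWeightState:
    """Toy hidden state: a dim x dim matrix trained online, one step per token.

    Illustrative assumption only; not the layer used in the paper.
    """

    def __init__(self, dim: int, lr: float = 1.0) -> None:
        self.dim = dim
        self.lr = lr
        self.W = [[0.0] * dim for _ in range(dim)]
        self.tokens_seen = 0

    def read(self, key: Vector) -> Vector:
        """Query the state: return W @ key."""
        return [sum(w * k for w, k in zip(row, key)) for row in self.W]

    def update(self, key: Vector, value: Vector) -> None:
        """One gradient step on 0.5 * ||W key - value||^2.

        The gradient with respect to W is (W key - value) key^T, so the
        cost per token is O(dim^2) regardless of how many tokens came before.
        """
        err = [p - v for p, v in zip(self.read(key), value)]
        for i in range(self.dim):
            for j in range(self.dim):
                self.W[i][j] -= self.lr * err[i] * key[j]
        self.tokens_seen += 1


if __name__ == "__main__":
    state = FastWeightState(dim=2, lr=1.0)
    # Stream two tokens through the state, one at a time.
    state.update([1.0, 0.0], [3.0, 4.0])
    state.update([0.0, 1.0], [5.0, 6.0])
    # The state can be queried at any point without revisiting past tokens.
    print(state.read([1.0, 0.0]))
    print(state.read([0.0, 1.0]))
```

Because each update touches only the fixed-size matrix, total work grows linearly with the number of tokens, and the same object serves both streaming updates and later queries. With unit-norm, mutually orthogonal keys and a learning rate of 1.0, as in the usage example, each stored value is recovered exactly; real inputs would not be this well behaved.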

Technical Insights

1. ZipMap performs bidirectional 3D reconstruction in time linear in the number of input images, which significantly reduces computational cost relative to quadratic-time methods.

2. The model processes over 700 frames in under 10 seconds on a single H100 GPU, more than 20 times faster than VGGT.

3. Test-time training layers compress an entire image collection into a compact hidden scene state in a single forward pass.

4. Accuracy matches or surpasses that of existing state-of-the-art quadratic-time methods.

5. The stateful representation allows the scene state to be queried in real time, which supports interactive use.

6. The architecture extends to sequential streaming reconstruction, which suits real-time applications.

7. The work addresses a limitation of earlier sequential-reconstruction approaches, which trade quality for speed.

8. Experimental results across multiple datasets support ZipMap's robustness and effectiveness in practice.

9. The findings mark a step forward in the scalability of 3D vision, with implications for fields such as robotics, augmented reality, and computer vision more broadly.
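The scaling and throughput points in the list above can be made concrete with a little arithmetic. The sketch below uses a deliberately simplified cost model: it assumes the quadratic method's cost is dominated by global attention over all tokens of all frames, and that the linear method performs one state update per token. The function names and the `tokens_per_frame` parameter are hypothetical, and the counts are abstract operation counts, not measured runtimes.

```python
def attention_pair_count(num_frames: int, tokens_per_frame: int) -> int:
    """Token-pair interactions under full global attention: (N * T)^2."""
    total_tokens = num_frames * tokens_per_frame
    return total_tokens * total_tokens


def stateful_update_count(num_frames: int, tokens_per_frame: int) -> int:
    """State updates in a single linear pass: N * T."""
    return num_frames * tokens_per_frame


def min_throughput(frames: int, seconds: float) -> float:
    """Frames per second implied by processing `frames` in `seconds`."""
    return frames / seconds


if __name__ == "__main__":
    tokens = 10  # hypothetical tokens per frame, for illustration only
    for n in (100, 200, 400):
        print(n, attention_pair_count(n, tokens), stateful_update_count(n, tokens))
    # "Over 700 frames in under 10 seconds" implies more than this rate.
    print(min_throughput(700, 10.0))
```

Under this model, doubling the number of frames quadruples the attention pair count but only doubles the number of state updates, which is why the gap between the two approaches widens as collections grow. The reported figure of over 700 frames in under 10 seconds corresponds to a rate above 70 frames per second.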