ZipMap: Linear-Time Stateful 3D Reconstruction with Test-Time Training
Professional Abstract
The paper presents ZipMap, a stateful feed-forward model for 3D reconstruction whose cost grows linearly with the number of input images. State-of-the-art feed-forward methods such as VGGT and $\pi^3$ produce strong reconstructions, but their compute scales quadratically with the number of images, which makes them impractical for large image collections. ZipMap reduces this cost while matching or exceeding the accuracy of its quadratic-time counterparts. Its core mechanism is a set of test-time training layers that compress an entire image collection into a compact hidden scene state during a single forward pass. With this design, ZipMap reconstructs over 700 frames in less than 10 seconds on a single H100 GPU, more than 20 times faster than VGGT. The stateful representation also supports real-time querying of the scene state and sequential streaming reconstruction. The authors report extensive experiments that validate these results, which point toward more efficient and scalable 3D vision applications, particularly where large image collections must be processed quickly.
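To make the idea of compressing a collection into a fixed-size state more concrete, here is a minimal sketch of a generic fast-weight test-time training update. It is not ZipMap's actual layer: the function names (`ttt_update`, `compress`, `query`), the squared-error objective, the learning rate, and the use of plain key/value vectors in place of image tokens are all assumptions made for illustration. Each token triggers one gradient step on a small weight matrix, so the state never grows and the total work is linear in the number of tokens.

```python
from typing import List, Sequence, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(w: Matrix, x: Sequence[float]) -> Vector:
    """Multiply matrix w by vector x."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in w]


def ttt_update(w: Matrix, key: Vector, value: Vector, lr: float) -> Matrix:
    """One gradient step on 0.5 * ||W k - v||^2 with respect to W.

    The gradient is (W k - v) k^T, so the update is a rank-one correction.
    (Hypothetical objective; the paper's layers may use a different one.)
    """
    err = [p - v for p, v in zip(matvec(w, key), value)]
    return [
        [w_ij - lr * e_i * k_j for w_ij, k_j in zip(row, key)]
        for row, e_i in zip(w, err)
    ]


def compress(tokens: Sequence[Tuple[Vector, Vector]], dim: int, lr: float = 0.5) -> Matrix:
    """Fold a token stream into a dim x dim state, one update per token."""
    w: Matrix = [[0.0] * dim for _ in range(dim)]
    for key, value in tokens:
        w = ttt_update(w, key, value, lr)  # cost per token does not depend on N
    return w


def query(w: Matrix, q: Vector) -> Vector:
    """Read from the compressed state without revisiting the tokens."""
    return matvec(w, q)


if __name__ == "__main__":
    # Two toy "tokens" with orthonormal keys (assumed data, not from the paper).
    tokens = [([1.0, 0.0], [2.0, 3.0]), ([0.0, 1.0], [-1.0, 4.0])]
    state = compress(tokens, dim=2, lr=1.0)
    print(query(state, [1.0, 0.0]))
    print(query(state, [0.0, 1.0]))
```

In this toy setting, processing N tokens costs N updates of a fixed-size matrix, whereas all-pairs attention over the same tokens would compare on the order of N^2 pairs. Because the state is updated one token at a time, the same loop also covers the streaming case: new tokens can be folded in as they arrive and the state queried at any point.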