Topologically Consistent Multi-view 3D Head Reconstruction via Coarse-Guided Layered Surface Sampling

Google
SIGGRAPH 2026

TL;DR: From calibrated multi-view images, SHELLS reconstructs 18k-vertex 3D heads in 0.08 seconds. It aggregates DinoV2 features via projective, surface-aware feature sampling, allowing a transformer to predict dense semantic meshes 3.5× faster and with 88% less GPU memory than state-of-the-art methods. The result is an efficient, globally consistent multi-view reconstruction framework.


Abstract

We present SHELLS (Semantic Head Estimation via Layered Local Sampling), an efficient feed-forward framework for 3D head reconstruction in dense semantic correspondence from multi-view images. State-of-the-art methods typically refine vertices independently using localized feature volumes. This couples memory-intensive feature sampling to the output mesh resolution, which limits scalability for dense topologies (≥ 10k vertices) and produces noisy surfaces due to a lack of global context. In contrast, SHELLS decouples feature extraction from mesh resolution via a hierarchical sampling strategy. We extract multi-view features using a DinoV2 backbone with LoRA adaptation, projectively sample a sparse global feature cloud, and predict an intermediate coarse mesh. This coarse prior guides the construction of layered, surface-aware sampling shells that serve as a discrete search space for the final reconstruction. By using a shared transformer architecture for both stages, SHELLS maintains global surface consistency while requiring only 12% of the GPU memory of volume-based approaches. Experimental results show that SHELLS reduces median registration error by 21–29% and achieves a 3.5× speedup, predicting 18k-vertex meshes in 0.08 seconds. Notably, our model is trained exclusively on synthetic data yet generalizes effectively to real-world captures, eliminating the need for the costly, pre-registered multi-view datasets common in prior work.
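The projective sampling step mentioned above can be illustrated with a minimal NumPy sketch. It assumes a simple pinhole camera (intrinsics `K`, rotation `R`, translation `t`) and bilinear interpolation; the function names `project_points`, `sample_bilinear`, and `sample_views` are hypothetical and are not taken from the SHELLS implementation, which may handle lens distortion, visibility, and feature-map scaling differently.

```python
import numpy as np

def project_points(points, K, R, t):
    """Project (N, 3) world points to (N, 2) pixel coordinates (x, y).

    Assumes an ideal pinhole camera; also returns the per-point depth.
    """
    cam = points @ R.T + t          # world frame -> camera frame
    uvw = cam @ K.T                 # camera frame -> homogeneous pixels
    return uvw[:, :2] / uvw[:, 2:3], cam[:, 2]

def sample_bilinear(feat, uv):
    """Bilinearly sample an (H, W, C) feature map at (N, 2) pixel coordinates."""
    H, W, _ = feat.shape
    x = np.clip(uv[:, 0], 0, W - 1)  # clamp to the image border
    y = np.clip(uv[:, 1], 0, H - 1)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, W - 1)
    y1 = np.minimum(y0 + 1, H - 1)
    wx = (x - x0)[:, None]
    wy = (y - y0)[:, None]
    top = feat[y0, x0] * (1 - wx) + feat[y0, x1] * wx
    bottom = feat[y1, x0] * (1 - wx) + feat[y1, x1] * wx
    return top * (1 - wy) + bottom * wy

def sample_views(points, feature_maps, cameras):
    """Sample every point in every view; returns (N_views, N, C)."""
    sampled = []
    for feat, (K, R, t) in zip(feature_maps, cameras):
        uv, _ = project_points(points, K, R, t)
        sampled.append(sample_bilinear(feat, uv))
    return np.stack(sampled)

# Toy example: a 16x16 "feature map" whose two channels store (x, y).
xs, ys = np.meshgrid(np.arange(16), np.arange(16))
feat = np.stack([xs, ys], axis=-1).astype(float)
K = np.array([[10.0, 0.0, 8.0], [0.0, 10.0, 8.0], [0.0, 0.0, 1.0]])
camera = (K, np.eye(3), np.zeros(3))
cloud = np.array([[0.1, -0.2, 2.0], [0.0, 0.0, 1.0]])
per_view = sample_views(cloud, [feat, feat], [camera, camera])
```

Because the number of sampled points is fixed by the sparse cloud rather than by the output mesh, the cost of this step does not grow with the final vertex count, which is the decoupling the abstract describes.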

Architecture

Figure: SHELLS architecture overview.

A shared DinoV2 backbone with LoRA adaptation extracts per-view feature maps from the input images (left). The graph stage (top) projectively samples features for a sparse graph and processes them alongside a downsampled tokenized template using an XCiT-based transformer. From the transformer output, a coarse mesh is regressed as an attention-weighted sum over the sampling graph coordinates. This coarse prediction is displaced along its normals to construct sampling shells for surface-aware feature sampling. Finally, the shared transformer aggregates these shell-based features with a full-resolution tokenized template to predict the high-fidelity mesh as an attention-weighted sum of dynamic shell coordinates (bottom).
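The two geometric operations in this description, displacing the coarse mesh along its normals to form shells and regressing vertices as an attention-weighted sum of candidate coordinates, can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the authors' implementation: the names `build_shells` and `attention_weighted_positions`, the offset values, and the choice of a single softmax over one shared candidate set are all hypothetical.

```python
import numpy as np

def build_shells(coarse_vertices, normals, offsets):
    """Displace each coarse vertex along its unit normal by every offset.

    coarse_vertices, normals: (V, 3); offsets: (L,) signed distances.
    Returns (V, L, 3): L layered candidate positions per coarse vertex.
    """
    unit = normals / np.linalg.norm(normals, axis=1, keepdims=True)
    return coarse_vertices[:, None, :] + offsets[None, :, None] * unit[:, None, :]

def attention_weighted_positions(logits, candidates):
    """Regress positions as a softmax-weighted sum of candidate coordinates.

    logits: (Q, M) scores of Q output vertices over M candidates
    (assumed here to come from a transformer); candidates: (M, 3).
    Each output is a convex combination of the candidates.
    """
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(shifted)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ candidates

# Toy example: two coarse vertices, three shell layers each.
coarse = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
normals = np.array([[0.0, 0.0, 2.0], [0.0, 0.0, 2.0]])  # not yet unit length
offsets = np.array([-0.1, 0.0, 0.1])                    # assumed layer spacing
shells = build_shells(coarse, normals, offsets)         # (2, 3, 3)
candidates = shells.reshape(-1, 3)                      # (6, 3) search space

# Random logits stand in for the transformer output for 4 dense vertices.
logits = np.random.default_rng(0).normal(size=(4, candidates.shape[0]))
dense_vertices = attention_weighted_positions(logits, candidates)
```

One consequence of this formulation is that every predicted vertex lies inside the convex hull of the candidate points, so the shells act as a bounded, discrete search space around the coarse surface.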


Performance registration

SHELLS can be applied frame-by-frame to dynamic facial performances and produces temporally smooth and expressive performance registrations.


Implicit occlusion handling

SHELLS handles occluded regions such as the inner mouth cavity implicitly. Because all vertices are regressed holistically, the model can correlate occluded regions with the visible areas around them.


Robustness to number of views

Figure: SHELLS robustness to the number of views.

Thanks to random camera dropout during training and mean-variance feature fusion, SHELLS is robust to the number of input views at inference time. Reconstructions remain plausible and detailed with as few as two input views, a setting whose large disparities challenge traditional multi-view stereo (MVS) methods, and quality scales gracefully as more views (e.g., 3, 4, or 10) are added.
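To show why mean-variance fusion is independent of the view count, here is a minimal NumPy sketch. It assumes per-view features have already been sampled into an array of shape (views, points, channels); the function names `fuse_mean_variance` and `random_camera_dropout`, the concatenation of mean and variance, and the uniform choice of how many views to keep are illustrative assumptions rather than details confirmed by the source.

```python
import numpy as np

def fuse_mean_variance(view_features, view_mask=None):
    """Fuse per-view features into a fixed-size descriptor per point.

    view_features: (N_views, P, C); view_mask: optional (N_views,) bool
    array selecting the views to use. Returns (P, 2 * C): the mean and
    the variance across the selected views, concatenated per point.
    """
    if view_mask is not None:
        view_features = view_features[view_mask]
    mean = view_features.mean(axis=0)
    var = view_features.var(axis=0)
    return np.concatenate([mean, var], axis=-1)

def random_camera_dropout(num_views, min_views, rng):
    """Return a boolean mask keeping a random subset of at least min_views."""
    keep = rng.integers(min_views, num_views + 1)  # upper bound is exclusive
    mask = np.zeros(num_views, dtype=bool)
    mask[rng.choice(num_views, size=keep, replace=False)] = True
    return mask

rng = np.random.default_rng(0)
features = rng.normal(size=(10, 5, 4))           # 10 views, 5 points, 4 channels
mask = random_camera_dropout(10, min_views=2, rng=rng)
fused_train = fuse_mean_variance(features, mask)  # random subset, as in training
fused_two = fuse_mean_variance(features[:2])      # two views at inference
```

Both calls return an array of the same shape, so a downstream network sees the same input layout whether 2 or 10 cameras are available.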


BibTeX Citation

@inproceedings{Bolkart2026SHELLS,
  author    = {Bolkart, Timo and Wang, Daoye and Chandran, Prashanth},
  title     = {Topologically Consistent Multi-view 3D Head Reconstruction via Coarse-Guided Layered Surface Sampling},
  year      = {2026},
  publisher = {Association for Computing Machinery},
  keywords  = {Registration, 3D Head Reconstruction},
  series    = {SIGGRAPH Conference Papers '26}
}