Pixels2Points: Fusing 2D and 3D Features for Facial Skin Segmentation

Google · ETH Zurich

Human head registration aligns a template head mesh to the skin region of a 3D head scan. Previous methods first segment images and subsequently project the 2D segmentation masks onto the 3D scan to differentiate between skin and non-skin areas. However, 3D scans are susceptible to reconstruction artifacts, such as interconnected hair, that are not present in the 2D image space. These artifacts persist in the projected 3D skin mask, leading to inaccurate registration results. Our model uses both the 3D scan and image ViT features to output per-vertex labels directly, resulting in a clean skin/non-skin separation and thus better registration.
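To make the baseline concrete, the sketch below shows one way such a 2D-to-3D mask projection can be implemented for a single view; the function name, data layouts, and nearest-pixel lookup are illustrative assumptions, not the exact pipeline of prior work. Note that the lookup only copies image labels: any geometry that the scan reconstructs incorrectly (e.g., hair fused to the face) still receives whatever label its projected pixel happens to carry.

import numpy as np

def project_mask_to_vertices(vertices, mask, K, R, t):
    """Label scan vertices by sampling a single-view 2D skin mask.

    vertices: (V, 3) scan vertices in world coordinates
    mask:     (H, W) binary skin mask predicted on one image
    K:        (3, 3) camera intrinsics
    R, t:     (3, 3) rotation and (3,) translation, world to camera (OpenCV convention)
    Returns a (V,) boolean array; vertices projecting outside the image
    or behind the camera are marked non-skin.
    """
    cam = vertices @ R.T + t                              # world -> camera frame
    in_front = cam[:, 2] > 1e-6                           # keep points in front of the camera
    pix = cam @ K.T
    pix = pix[:, :2] / np.clip(pix[:, 2:3], 1e-6, None)   # perspective divide

    H, W = mask.shape
    u = np.rint(pix[:, 0]).astype(int)                    # nearest-pixel lookup
    v = np.rint(pix[:, 1]).astype(int)
    valid = in_front & (u >= 0) & (u < W) & (v >= 0) & (v < H)

    labels = np.zeros(len(vertices), dtype=bool)
    labels[valid] = mask[v[valid], u[valid]] > 0
    return labels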

Abstract

Face registration deforms a template mesh to closely fit a 3D face scan. Registration quality commonly degrades in non-skin regions (e.g., hair, beard, accessories), because the optimized template-to-scan distance pulls the template mesh towards the noisy scan surface.

Improving registration quality requires a clean separation of skin and non-skin regions on the scan mesh. However, existing image-based (2D) and scan-based (3D) segmentation methods perform poorly: image-based segmentation produces multi-view-inconsistent masks and cannot account for scan inaccuracies or scan-image misalignment, while scan-based methods suffer from lower spatial resolution than images.

In this work, we introduce a novel method that accurately separates skin from non-skin geometry on 3D human head scans. Our method extracts features from multi-view images using a frozen image foundation model and aggregates these features in 3D. The lifted 2D features are then fused with 3D geometric features extracted from the scan mesh to predict a segmentation mask directly on the scan mesh. We show that our segmentations improve registration accuracy over pure 2D and pure 3D segmentation methods by 8.89% and 14.3%, respectively. Although trained only on synthetic data, our model generalizes well to real data.

Model Architecture

Given multi-view images of an expressive human face and a reconstructed 3D scan, we first extract features from the images using a frozen ViT model and upscale them back to image resolution with up-convolution layers. Then, using precomputed camera parameters, we project the features onto the mesh vertices and fuse them across views, weighted by per-vertex visibility. The combination of geometric and ViT features is fed into DiffusionNet, which outputs the final segmentation.
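A minimal PyTorch sketch of the feature-lifting and fusion step follows; the tensor shapes, the visibility-weighted averaging, and all function names are assumptions made for illustration, the up-convolved ViT features are taken as given, and the DiffusionNet stage is only indicated by a comment since its exact configuration is not reproduced here.

import torch
import torch.nn.functional as F

def lift_features_to_vertices(feat_maps, vertices, Ks, Rs, ts, visibility):
    """Fuse per-view image features onto scan vertices (illustrative sketch).

    feat_maps:  (C, D, H, W) upsampled ViT feature maps, one per camera
    vertices:   (V, 3) scan vertices in world coordinates
    Ks, Rs, ts: (C, 3, 3), (C, 3, 3), (C, 3) camera parameters (world to camera)
    visibility: (C, V) per-view vertex visibility weights in [0, 1]
    Returns (V, D) visibility-weighted average of the sampled features.
    """
    C, D, H, W = feat_maps.shape
    V = vertices.shape[0]
    fused = torch.zeros(V, D, dtype=feat_maps.dtype)
    weight = torch.zeros(V, 1, dtype=feat_maps.dtype)

    for c in range(C):
        cam = vertices @ Rs[c].T + ts[c]                    # world -> camera frame
        pix = cam @ Ks[c].T
        pix = pix[:, :2] / pix[:, 2:3].clamp(min=1e-6)      # perspective divide
        # normalise pixel coordinates to [-1, 1] for grid_sample
        grid = torch.stack([2 * pix[:, 0] / (W - 1) - 1,
                            2 * pix[:, 1] / (H - 1) - 1], dim=-1)
        sampled = F.grid_sample(feat_maps[c][None],          # (1, D, H, W)
                                grid[None, :, None, :],      # (1, V, 1, 2)
                                align_corners=True)          # -> (1, D, V, 1)
        w = visibility[c][:, None]                           # (V, 1)
        fused += w * sampled[0, :, :, 0].T                   # accumulate (V, D)
        weight += w

    return fused / weight.clamp(min=1e-6)

# The fused image features are concatenated with per-vertex geometric features
# and passed to an intrinsic surface network (DiffusionNet in the paper) that
# predicts skin / non-skin logits per vertex.

Here grid_sample performs a bilinear lookup at each vertex's projected pixel location, so a vertex seen by several cameras receives a blend of features dominated by the views in which it is most visible.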

Qualitative Comparison

Each example shows, from left to right: the RGB images, registration using the 2D mask, registration using the 3D mask, our segmentation, and registration using our mask.

Synthetic Hair

We re-render real data examples with a variety of synthetic hairstyles from our digital assets library, then run multi-view stereo (MVS) and registration. Ideally, even when different hairstyles are added, the segmentation mask should allow the downstream registration result to remain consistent. As the examples show, our method keeps the registration noticeably more consistent with the original registration than the 2D baseline does.
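Because every registration deforms the same template topology, consistency can be quantified, for example, as the mean per-vertex displacement between the original registration and the one obtained after adding synthetic hair; the sketch below is an assumed metric for illustration and not necessarily the exact measure used in the paper.

import numpy as np

def registration_consistency(reg_original, reg_with_hair):
    """Mean per-vertex displacement between two registrations that share the
    same template topology (assumed consistency measure for this experiment).

    reg_original:  (V, 3) registered template vertices on the original scan
    reg_with_hair: (V, 3) registered template vertices after adding a
                   synthetic hairstyle and re-running MVS + registration
    """
    return float(np.linalg.norm(reg_original - reg_with_hair, axis=1).mean())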

Each example shows, from left to right: the RGB image, the original registration, the scan with synthetic hair, the 2D model segmentation and its registration, and our model segmentation and its registration.

BibTeX

@inproceedings{chen2025pixels2points,
    author    = {Chen, Victoria Yue and Wang, Daoye and Garbin, Stephan and Winberg, Sebastian and Bednarik, Jan and Bolkart, Timo and Beeler, Thabo},
    title     = {{Pixels2Points: Fusing 2D and 3D Features for Facial Skin Segmentation}},
    booktitle = {Eurographics 2025 - Short Papers},
    editor    = {Cignoni, Paolo and Miguel, Eder},
    publisher = {The Eurographics Association},
    year      = {2025},
}