
EUROGRAPHICS 2025 / D. Ceylan and T.-M. Li — Short Paper

Pixels2Points: Fusing 2D and 3D Features for Facial Skin Segmentation
Supplementary Materials

V. Y. Chen², D. Wang¹, G. Stephan¹, J. Bednarik¹, S. Winberg¹, T. Bolkart¹, T. Beeler¹

¹ Google    ² ETH Zürich
1. Training Details
The trainable part of our model comprises the ViT head and the DiffusionNet. The ViT head upscales ViT features to the original image size; it consists of 3 convolution layers with 512 dimensions. The DiffusionNet consists of 4 DiffusionNet blocks, each of which first diffuses vertex features and then applies a point-wise MLP. Within each block we use a 2-layer MLP with 128 dimensions. In total, the model has 3.7M trainable parameters.
We use a weighted cross-entropy loss that puts more weight on the skin class (0.7 on skin and 0.3 on non-skin). We train our model with the AdamW optimizer, a learning rate of 1e-3, and a weight decay of 1e-5, on a single H100 GPU for 60 epochs. Our scan meshes differ in vertex count, resulting in spectral operators of non-uniform size, so we use a batch size of 1.
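A short sketch of this loss and optimizer setup is given below. The class weights, learning rate, and weight decay follow the text; the logit/label shapes and the `model` object are assumptions for illustration.

```python
# Sketch of the training setup described above (per-scan batch of size 1).
import torch

# logits: (N_v, 2) per-vertex class scores; labels: (N_v,), 1 = skin, 0 = non-skin
class_weights = torch.tensor([0.3, 0.7])  # 0.3 non-skin, 0.7 skin
criterion = torch.nn.CrossEntropyLoss(weight=class_weights)

# `model` is the ViT head + DiffusionNet (assumed to be defined elsewhere)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-5)
```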
We generate a synthetic dataset of 3000 heads. Each head is rendered from 13 camera views at 1K resolution, mimicking the capture system of the real data. The input scan mesh is reconstructed by running a multi-view reconstruction algorithm on the rendered images and has around 250K vertices.
2. Alpha Blending between Real Images and Rendered Hairstyles
To further evaluate our model against 2D and 3D baselines on real data, we augment real data with synthetic hairstyles (Fig. 1). For subjects with minimal hair, we render various hairstyles and alpha-blend them onto the real captures. The registration computed from the clean images serves as ground truth: high-quality segmentation masks allow the downstream registration on the alpha-blended images to remain consistent with the original registration result. Fig. 1 shows that the 2D method fails on hairstyles with many hair strands and creates erroneous bumps in the registration result.
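The compositing step is standard alpha blending; a minimal NumPy sketch follows, with the array layout (RGBA hair render over an RGB capture, values in [0, 1]) assumed rather than taken from the paper.

```python
# Minimal alpha-compositing sketch: the rendered hairstyle's RGBA layer
# is blended over the real capture per pixel.
import numpy as np

def alpha_blend(real_rgb, hair_rgba):
    """real_rgb: (H, W, 3) float in [0, 1]; hair_rgba: (H, W, 4)."""
    alpha = hair_rgba[..., 3:4]  # (H, W, 1) opacity of the rendered hair
    return alpha * hair_rgba[..., :3] + (1.0 - alpha) * real_rgb
```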
3. Additional Qualitative Comparison on Real Data
In Fig. 2, we share additional results on real scans. For some scans there are no significant differences between the three models, especially when few regions are occluded. However, when hair strands are near the face or forehead, the downstream registration results differ.
In Fig. 3, we show that the variance of image features across views helps the model identify ambiguous regions on the scan mesh. Poorly reconstructed vertices can fall onto the skin region in certain camera views and confuse the model; the cross-view variance helps rule them out.
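One way to realize this cue is a visibility-masked per-vertex variance over the features sampled from each camera. The sketch below assumes shapes and a visibility mask; the exact aggregation used in the paper may differ.

```python
# Hedged sketch: per-vertex variance of image features across camera views.
# feats: (V_cams, N_v, C) features sampled per view; vis: (V_cams, N_v)
# visibility mask (1 = vertex seen by that camera). High variance marks
# view-inconsistent (ambiguous or poorly reconstructed) vertices.
import torch

def per_vertex_feature_variance(feats, vis):
    w = vis.unsqueeze(-1)                       # (V, N, 1)
    n = w.sum(0).clamp(min=1)                   # views seeing each vertex
    mean = (feats * w).sum(0) / n               # (N, C) per-vertex mean
    var = (((feats - mean) ** 2) * w).sum(0) / n
    return var                                  # (N_v, C)
```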
4. 3D Morphable Model (3DMM)
As explained in Section 2.3 of the main paper, we rely on a custom 3DMM to generate synthetic training data. The model closely follows the definition introduced in [WBH*21]. Specifically, the model represents a function $M(\Theta): \mathbb{R}^{|\Theta|} \rightarrow \mathbb{R}^{3 \times N_V}$, which takes model parameters $\Theta$ as input and produces the $N_V$ vertices of a human head mesh with a given fixed topology.
Following [WBH*21], the model parameters $\Theta$ consist of identity parameters $\beta \in \mathbb{R}^{|\beta|}$, expression parameters $\phi \in \mathbb{R}^{|\phi|}$, joint rotation parameters $\theta \in \mathbb{R}^{K \times 3}$, and a global translation vector $\tau \in \mathbb{R}^{3}$. The model is linear, i.e., both the identity and expression parameters $\beta, \phi$ control the additive contribution of the identity and expression bases, represented as a PCA computed over an in-house dataset of registered expressive faces.
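Schematically, the linear part of such a model adds identity and expression offsets to a template before posing. The sketch below illustrates this structure only; the template, bases, and the omitted skinning step are hypothetical stand-ins, not the actual model of [WBH*21].

```python
# Schematic linear 3DMM: template plus PCA identity/expression offsets,
# then global translation. Joint rotations theta via the artist-made
# skinning weights are omitted in this sketch.
import numpy as np

def linear_3dmm(template, B_id, beta, B_exp, phi, tau):
    """template: (N_v, 3); B_id: (|beta|, N_v, 3); B_exp: (|phi|, N_v, 3);
    beta: (|beta|,); phi: (|phi|,); tau: (3,)."""
    verts = (template
             + np.tensordot(beta, B_id, axes=1)   # identity offsets
             + np.tensordot(phi, B_exp, axes=1))  # expression offsets
    # (posing with theta and skinning weights would happen here)
    return verts + tau  # global translation
```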
Our model differs from that of [WBH*21] in the following two aspects: (1) we define a custom mesh topology consisting of $N_V = 17821$ vertices, and (2) we rely on custom skinning weights prepared manually by an artist.
5. Surface Variation
As discussed in Section 2.1 of the main paper, one of the geometric features we rely on is surface variation [PGK02]. This feature represents a second-order surface property akin to curvature. As opposed to standard mean or Gaussian curvature, surface variation tends to produce less noisy results when computed on a discretized 3D surface (e.g., a triangle mesh), it is parameterizable in terms of the span of the surface area it considers for a given surface point, and it has been successfully used in ML-based approaches to deformable 3D surface analysis [GCP*22].
For any given surface point $p_i \in \mathbb{R}^3$, the surface variation $\sigma_i$ is computed by forming the covariance matrix $C$ of the $k$ nearest neighbors of $p_i$, performing an eigendecomposition, and taking the difference between the smallest and the largest eigenvalue.
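The following sketch implements the computation exactly as defined above; the use of scipy's cKDTree for the k-NN query and the default $k$ are assumed choices, not prescribed by the paper.

```python
# Per-vertex surface variation via k-NN covariance eigenvalues.
import numpy as np
from scipy.spatial import cKDTree

def surface_variation(points, k=30):
    """points: (N, 3) mesh vertices; returns sigma: (N,) per-vertex values.
    k controls the span of surface area considered around each point."""
    tree = cKDTree(points)
    _, idx = tree.query(points, k=k)          # k nearest neighbors per point
    sigma = np.empty(len(points))
    for i, nbrs in enumerate(idx):
        C = np.cov(points[nbrs].T)            # 3x3 covariance of the patch
        eigvals = np.linalg.eigvalsh(C)       # sorted ascending
        # per the definition above: difference between the smallest
        # and the largest eigenvalue
        sigma[i] = eigvals[-1] - eigvals[0]
    return sigma
```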