EUROGRAPHICS 2025/ D. Ceylan and T.-M. Li Short Paper
Pixels2Points: Fusing 2D and 3D Features for Facial Skin Segmentation
Supplementary Materials
V. Y. Chen², D. Wang¹, G. Stephan¹, J. Bednarik¹, S. Winberg¹, T. Bolkart¹, T. Beeler¹
¹Google   ²ETH Zürich
1. Training Details
The trainable part of our model includes the ViT head and the DiffusionNet. The ViT head upscales ViT features to the original image size; it consists of 3 convolution layers with 512 dimensions. The DiffusionNet consists of 4 DiffusionNet blocks. Each block first diffuses vertex features and then applies a point-wise MLP; within each block we use a 2-layer MLP with 128 dimensions. In total, the model has 3.7M trainable parameters.
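For concreteness, a minimal PyTorch sketch of the two trainable components is given below. The layer layout of the ViT head (kernel sizes, bilinear upsampling) and the internals of the DiffusionNet block (learned per-channel diffusion times, residual connection) are illustrative assumptions, not our exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ViTHead(nn.Module):
    """Upscales patch-level ViT features back to the input image resolution.
    Three 512-dim convolution layers as described above; kernel sizes and the
    bilinear upsampling are assumptions."""
    def __init__(self, vit_dim=768, width=512, out_dim=512):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(vit_dim, width, 3, padding=1), nn.ReLU(),
            nn.Conv2d(width, width, 3, padding=1), nn.ReLU(),
            nn.Conv2d(width, out_dim, 3, padding=1),
        )

    def forward(self, vit_feats, image_size):
        # vit_feats: (B, C, H/patch, W/patch) -> (B, out_dim, H, W)
        x = self.convs(vit_feats)
        return F.interpolate(x, size=image_size, mode="bilinear", align_corners=False)

class DiffusionBlock(nn.Module):
    """One block: diffuse per-vertex features over the mesh, then apply a
    point-wise 2-layer MLP with 128 dimensions. The diffusion is sketched in
    the spectral basis and requires precomputed Laplacian eigenvalues/vectors
    and per-vertex mass, as in DiffusionNet."""
    def __init__(self, dim=128):
        super().__init__()
        self.log_t = nn.Parameter(torch.zeros(dim))  # learned diffusion times
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def diffuse(self, x, evals, evecs, mass):
        # Project onto the spectral basis, attenuate by exp(-lambda * t), project back.
        coeffs = evecs.T @ (mass[:, None] * x)                      # (K, dim)
        decay = torch.exp(-evals[:, None] * torch.exp(self.log_t))  # (K, dim)
        return evecs @ (coeffs * decay)                             # (V, dim)

    def forward(self, x, evals, evecs, mass):
        # x: (V, dim) per-vertex features; residual connection is an assumption.
        return x + self.mlp(self.diffuse(x, evals, evecs, mass))
```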
We use a weighted cross-entropy loss in which more weight is applied to the skin class (0.7 on skin and 0.3 on non-skin). We train our model using the AdamW optimizer with a learning rate of 1e-3 and a weight decay of 1e-5, on a single H100 GPU for 60 epochs. Our scan meshes differ in vertex count, resulting in spectral operators of non-uniform size, so we use a batch size of 1.
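The loss and optimizer setup can be summarized with the following sketch. The class ordering (index 0 = non-skin, index 1 = skin) and the toy stand-ins for the model and dataset are assumptions for illustration only.

```python
import torch
import torch.nn as nn

# Toy stand-in for the full model described above; 2 output classes.
model = nn.Linear(128, 2)

# Weighted cross-entropy: 0.7 on the skin class, 0.3 on non-skin.
# Class ordering (0 = non-skin, 1 = skin) is an assumption.
criterion = nn.CrossEntropyLoss(weight=torch.tensor([0.3, 0.7]))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-5)

# Toy dataset: per-vertex features and labels for a few meshes.
dataset = [(torch.randn(1000, 128), torch.randint(0, 2, (1000,))) for _ in range(4)]

# Batch size 1: each step processes one scan mesh (variable vertex count).
for epoch in range(60):
    for verts_feats, labels in dataset:
        logits = model(verts_feats)        # (num_vertices, 2)
        loss = criterion(logits, labels)   # labels: (num_vertices,)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```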
We generate a synthetic dataset of 3,000 heads. Each head is rendered from 13 camera views at 1K resolution, mimicking the capture system of the real data. The input scan mesh is reconstructed by running a multi-view reconstruction algorithm on the rendered images; it has around 250K vertices.
2. Alpha Blending between Real Images and Rendered Hairstyles
To further evaluate our model against the 2D and 3D baselines on real data, we augment real data with synthetic hairstyles in Fig. 1. For subjects with minimal hair, we render various hairstyles and alpha-blend them onto the real captures. The registration computed from the clean images serves as ground truth. High-quality segmentation masks should allow the downstream registration result on the alpha-blended images to remain consistent with the original registration result. Fig. 1 shows that the 2D method fails on hairstyles with many hair strands and creates erroneous bumps in the registration result.
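The compositing step is standard "over" alpha blending; a minimal sketch is shown below, with array shapes and names chosen only for illustration.

```python
import numpy as np

def alpha_blend(real_image, hair_rgba):
    """Composite a rendered hairstyle (RGBA, floats in [0, 1]) over a real
    capture (RGB, floats in [0, 1]) using standard 'over' alpha blending."""
    rgb, alpha = hair_rgba[..., :3], hair_rgba[..., 3:4]
    return alpha * rgb + (1.0 - alpha) * real_image

# Example with random data standing in for a real capture and a hair render.
real = np.random.rand(1024, 1024, 3).astype(np.float32)
hair = np.random.rand(1024, 1024, 4).astype(np.float32)
blended = alpha_blend(real, hair)
```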
3. Additional Qualitative Comparison on Real Data
In Fig. 2, we share additional results on real scans. For some scans,
there are no significant differences between the three models, espe-
cially when there are not many occluded regions. However, when
hair strands are near the face or forehead, the downstream registra-
tion result differs.
In Fig. 3, we show that the variance of image features across views helps the model identify ambiguous regions on the scan mesh. Poorly reconstructed vertices can fall onto the skin region in certain camera views and confuse the model; the variance helps to rule such vertices out.
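A sketch of how per-vertex feature statistics across views could be aggregated is given below. The visibility handling and tensor shapes are simplifying assumptions, not our exact implementation.

```python
import torch

def per_vertex_feature_stats(view_feats, visibility):
    """Aggregate per-view image features projected onto mesh vertices.

    view_feats: (num_views, num_vertices, feat_dim) features sampled at each
                vertex's projection in every camera view.
    visibility: (num_views, num_vertices) boolean mask (vertex visible in view).
    Returns the per-vertex mean and variance over the views in which the
    vertex is visible; the visibility handling here is a simplification.
    """
    vis = visibility.unsqueeze(-1).float()                       # (views, N, 1)
    count = vis.sum(dim=0).clamp(min=1.0)                        # (N, 1)
    mean = (view_feats * vis).sum(dim=0) / count                 # (N, D)
    var = (((view_feats - mean) ** 2) * vis).sum(dim=0) / count  # (N, D)
    return mean, var

# 13 camera views, toy mesh with 1000 vertices, 512-dim features.
feats = torch.randn(13, 1000, 512)
vis = torch.rand(13, 1000) > 0.3
mean, var = per_vertex_feature_stats(feats, vis)
```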
4. 3D Morphable Model (3DMM)
As explained in Section 2.3 of the main paper, in order to generate synthetic training data we rely on a custom 3DMM. The model closely follows the definition introduced in [WBH*21]. Specifically, the model represents a function $M(\Theta) : \mathbb{R}^{|\Theta|} \rightarrow \mathbb{R}^{3 \times N_V}$, which takes model parameters $\Theta$ as input and produces the $N_V$ vertices of a human head mesh with a given fixed topology.
Following [WBH*21], the model parameters $\Theta$ consist of identity parameters $\beta \in \mathbb{R}^{|\beta|}$, expression parameters $\phi \in \mathbb{R}^{|\phi|}$, joint rotation parameters $\theta \in \mathbb{R}^{K \times 3}$, and a global translation vector $\tau \in \mathbb{R}^{3}$. The model is linear, i.e. both the identity and expression parameters $\beta, \phi$ control the additive contribution of the identity and expression basis, represented as a PCA computed over an in-house dataset of registered expressive faces.
Our model differs from that of [WBH*21] in the following two aspects: (1) we define a custom mesh topology consisting of $N_V = 17821$ vertices, and (2) we rely on custom skinning weights prepared manually by an artist.
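A minimal NumPy sketch of the linear part of such a model is given below. It omits the joint rotations and skinning of the full model, and the basis sizes and toy data are illustrative assumptions.

```python
import numpy as np

def linear_3dmm(template, id_basis, expr_basis, beta, phi, tau):
    """Minimal linear shape model: template plus identity and expression
    offsets from PCA bases, plus a global translation. Joint rotations and
    skinning from the full model are omitted here.

    template:   (N_V, 3) mean head mesh vertices
    id_basis:   (|beta|, N_V, 3) identity PCA basis
    expr_basis: (|phi|, N_V, 3) expression PCA basis
    """
    verts = (template
             + np.tensordot(beta, id_basis, axes=1)
             + np.tensordot(phi, expr_basis, axes=1))
    return verts + tau

# Toy example with random bases (N_V = 17821 as in our topology).
N_V = 17821
template = np.zeros((N_V, 3))
id_basis = np.random.randn(50, N_V, 3) * 0.01   # basis sizes are assumptions
expr_basis = np.random.randn(30, N_V, 3) * 0.01
verts = linear_3dmm(template, id_basis, expr_basis,
                    beta=np.random.randn(50), phi=np.random.randn(30),
                    tau=np.array([0.0, 0.0, 0.0]))
```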
5. Surface Variation
As discussed in Section 2.1 of the main paper, one of the geometric features we rely on is surface variation [PGK02]. This feature represents a notion of a second-order surface property akin to curvature. As opposed to standard mean or Gaussian curvature, surface variation tends to produce less noisy results when computed on a discretized 3D surface (e.g. a triangular mesh), it is parameterizable in terms of the span of the surface area it considers for a given surface point, and it has been successfully used in ML-based approaches to deformable 3D surface analysis [GCP*22].
For any given surface point $p_i \in \mathbb{R}^3$, the surface variation $\sigma_i$ is computed by forming the covariance matrix $C$ of the $k$ nearest neighbors of $p_i$, performing an eigendecomposition, and taking the ratio of the smallest eigenvalue to the sum of all eigenvalues.
Figure 1: We re-render real data examples with a variety of synthetic hairstyles from our digital assets library. Thereafter, we run MVS and registration. Ideally, even when adding different hairstyles, the segmentation mask should allow the downstream registration result to remain consistent. It is clear here that our method allows the downstream registration result to remain more consistent with the original registration result.
Geometrically, the smallest eigenvalue corresponds to the point variance along the surface normal of $p_i$, and therefore the larger the smallest eigenvalue is, the more the surface is locally curved at $p_i$.
Formally, let $P = \{p_i\}$ be a set of surface points and let $N_{p_i} \in \mathbb{R}^{k \times 3}$ be a mean-centered matrix collecting the $k$ nearest points of $p_i$ (including $p_i$ itself). Let $C = \frac{1}{k-1} N_{p_i}^{\top} N_{p_i}$ be the associated covariance matrix, $C \in \mathbb{R}^{3 \times 3}$, and let $\boldsymbol{\lambda} = [\lambda_1, \lambda_2, \lambda_3]$ be the vector of the three eigenvalues of $C$. The surface variation of the point $p_i$ is then computed as shown in Eq. (1) below.
Figure 2: We compare our method with the 2D method and the best-performing pure 3D method on the real data. On the left we show the input RGB images; we capture 13 views in total. Then, we show both the segmentation and the registration results. Dark orange is the predicted skin. These additional examples show that our model works on a diversity of subjects.
Figure 3: Feature Variance. Left to right: raw scan, without feature variance, with feature variance. The model can better segment regions with noisy geometry when using the variance of image features across views.
\[
\sigma_i = \frac{\min(\boldsymbol{\lambda})}{\sum_{j=1}^{3} \lambda_j}. \qquad (1)
\]
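A straightforward NumPy/SciPy sketch of Eq. (1) is given below. The neighbourhood size $k = 30$ follows the $\sigma_{30}$ shown in Fig. 4; elsewhere the exact neighbourhood size is an assumption.

```python
import numpy as np
from scipy.spatial import cKDTree

def surface_variation(points, k=30):
    """Surface variation (Eq. 1) per point: smallest eigenvalue of the local
    covariance over the k nearest neighbours, divided by the eigenvalue sum."""
    tree = cKDTree(points)
    _, idx = tree.query(points, k=k)           # neighbour indices, incl. the point itself
    neighbours = points[idx]                   # (N, k, 3)
    centered = neighbours - neighbours.mean(axis=1, keepdims=True)
    cov = np.einsum("nki,nkj->nij", centered, centered) / (k - 1)  # (N, 3, 3)
    evals = np.linalg.eigvalsh(cov)            # eigenvalues in ascending order
    return evals[:, 0] / evals.sum(axis=1)

# Toy example: noisy points sampled on a unit sphere.
pts = np.random.randn(5000, 3)
pts /= np.linalg.norm(pts, axis=1, keepdims=True)
sigma = surface_variation(pts, k=30)
```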
6. Feature Visualization
In Fig. 4, we visualize input features on a scan mesh of a real subject. The feature mean shows one ViT feature dimension projected onto the mesh, whereas the feature variance denotes the corresponding variance among the 13 views for that specific feature. Surface variation is shown on the far right. Note that even though the mesh consists of 250K vertices, the values along the mesh are fairly smooth.

Figure 4: Feature visualization, from left to right: colored raw scan, single ViT feature mean, corresponding feature variance, and surface variation $\sigma_{30}$. Notice how the feature variance highlights the ambiguous region on the scan mesh, close to the boundary of skin and hair.
7. Segmentation Quality of Custom 2D Segmenter
In Fig. 5, we compare our custom 2D segmentation model to the Meta Sapiens model [KBM*24].
Figure 5: Comparison of 2D segmentation using our custom 2D segmenter and Meta Sapiens. Only one frontal camera view is shown here.
Figure 6: Spa Mask. We use a limited region of the template mesh
to compute distance to surface costs for synthetic data examples.
Our custom 2D segmenter is fine-tuned to recognise eyes, eyebrows, nose, and mouth. The default Meta Sapiens model, on the other hand, only distinguishes skin from hair. This demonstrates that the quality of our custom 2D segmenter is on par with the state of the art.
8. Labelling Real Capture Data
Real scan meshes were exported to Blender and annotated by color-
ing the mesh using brush strokes. The labels come from one anno-
tator, due to the time and costs that labelling requires. Furthermore,
we did not use vertex color during labelling and mostly focused on
identifying geometry that does not belong to the skin region.
9. Spa Mask Region
Since it is mostly the frontal face region that affects registration quality, we do not use the full template mesh for computing $d_{\text{surface}}$. We focus only on the regions directly on the temples, cheeks, and nose. Fig. 6 shows the spa mask region of a synthetic data example.
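As an illustration of such a masked distance term, the sketch below uses a nearest-vertex distance restricted to the spa-mask vertices. The actual $d_{\text{surface}}$ of the main paper may be defined differently (e.g. point-to-triangle distances), so this is only a stand-in under stated assumptions.

```python
import numpy as np
from scipy.spatial import cKDTree

def masked_surface_distance(template_verts, spa_mask, scan_verts):
    """Illustrative stand-in for a masked distance-to-surface cost: for each
    template vertex inside the spa mask (temples, cheeks, nose), take the
    distance to the nearest scan vertex and average.

    template_verts: (N_V, 3) registered template vertices
    spa_mask:       (N_V,) boolean mask selecting the spa region
    scan_verts:     (M, 3) raw scan vertices
    """
    tree = cKDTree(scan_verts)
    dists, _ = tree.query(template_verts[spa_mask])
    return dists.mean()

# Toy example with random geometry; the mask region is hypothetical.
template = np.random.rand(17821, 3)
mask = np.zeros(17821, dtype=bool)
mask[:5000] = True
scan = np.random.rand(250000, 3)
cost = masked_surface_distance(template, mask, scan)
```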
Figure 7: Examples from our synthetic dataset. We have a variety of identities, skin tones, and hairstyles.

Figure 8: Experiment on how the dataset size affects mIoU on the synthetic test set.
10. Additional Synthetic Data Examples
More examples from our synthetic dataset can be seen in Fig. 7.
11. Ablation Study on Dataset Size
Finally, we study how the dataset size affects model performance in
Fig. 8. Due to the capacity of the ViT backbone, our method does
require many synthetic examples to achieve competitive perfor-
mance. However, generalization performance may differ depending
on the diversity of the training set.
References
[GCP*22] GUNDOGDU E., CONSTANTIN V., PARASHAR S., SEIFODDINI A., DANG M., SALZMANN M., FUA P.: GarNet++: Improving fast and accurate static 3D cloth draping by curvature loss. TPAMI (2022).
[KBM*24] KHIRODKAR R., BAGAUTDINOV T., MARTINEZ J., ZHAOEN S., JAMES A., SELEDNIK P., ANDERSON S., SAITO S.: Sapiens: Foundation for human vision models. arXiv preprint arXiv:2408.12569 (2024).
[PGK02] PAULY M., GROSS M., KOBBELT L. P.: Efficient simplification of point-sampled surfaces. In IEEE Visualization (2002), IEEE, pp. 163–170.
[WBH*21] WOOD E., BALTRUŠAITIS T., HEWITT C., DZIADZIO S., JOHNSON M., ESTELLERS V., CASHMAN T. J., SHOTTON J.: Fake it till you make it: Face analysis in the wild using synthetic data alone. In ICCV (2021).
© 2025 The Authors. Proceedings published by Eurographics - The European Association for Computer Graphics. This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited.