EUROGRAPHICS 2025/ D. Ceylan and T.-M. Li Short Paper
Pixels2Points: Fusing 2D and 3D Features for Facial Skin Segmentation
Supplementary Materials
V. Y. Chen², D. Wang¹, G. Stephan¹, J. Bednarik¹, S. Winberg¹, T. Bolkart¹, T. Beeler¹
¹Google   ²ETH Zürich
1. Training Details
The trainable part of our model includes the ViT head and the DiffusionNet. The ViT head upscales ViT features to the original image size; it consists of 3 convolution layers with 512 dimensions. The DiffusionNet consists of 4 DiffusionNet blocks. Each block first diffuses vertex features and then applies a point-wise MLP; within each block we use a 2-layer MLP with 128 dimensions. In total, the model has 3.7M trainable parameters.
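For concreteness, a minimal PyTorch sketch of the two trainable components is given below. The layer layout of the ViT head (kernel sizes, bilinear upsampling) and the internals of the DiffusionNet block (learned per-channel diffusion times, residual connection) are illustrative assumptions, not our exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ViTHead(nn.Module):
    """Upscales patch-level ViT features back to the input image resolution.
    Three 512-dim convolution layers as described above; kernel sizes and the
    bilinear upsampling are assumptions."""
    def __init__(self, vit_dim=768, width=512, out_dim=512):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(vit_dim, width, 3, padding=1), nn.ReLU(),
            nn.Conv2d(width, width, 3, padding=1), nn.ReLU(),
            nn.Conv2d(width, out_dim, 3, padding=1),
        )

    def forward(self, vit_feats, image_size):
        # vit_feats: (B, C, H/patch, W/patch) -> (B, out_dim, H, W)
        x = self.convs(vit_feats)
        return F.interpolate(x, size=image_size, mode="bilinear", align_corners=False)

class DiffusionBlock(nn.Module):
    """One block: diffuse per-vertex features over the mesh, then apply a
    point-wise 2-layer MLP with 128 dimensions. The diffusion is sketched in
    the spectral basis and requires precomputed Laplacian eigenvalues/vectors
    and per-vertex mass, as in DiffusionNet."""
    def __init__(self, dim=128):
        super().__init__()
        self.log_t = nn.Parameter(torch.zeros(dim))  # learned diffusion times
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def diffuse(self, x, evals, evecs, mass):
        # Project onto the spectral basis, attenuate by exp(-lambda * t), project back.
        coeffs = evecs.T @ (mass[:, None] * x)                      # (K, dim)
        decay = torch.exp(-evals[:, None] * torch.exp(self.log_t))  # (K, dim)
        return evecs @ (coeffs * decay)                             # (V, dim)

    def forward(self, x, evals, evecs, mass):
        # x: (V, dim) per-vertex features; residual connection is an assumption.
        return x + self.mlp(self.diffuse(x, evals, evecs, mass))
```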
We use a weighted cross-entropy loss in which more weight is applied to the skin class (0.7 on skin and 0.3 on non-skin). We train our model using the AdamW optimizer with a learning rate of 1e-3 and a weight decay of 1e-5, on a single H100 GPU for 60 epochs. Our scan meshes differ in vertex count, resulting in spectral operators of non-uniform size, so we use a batch size of 1.
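The loss and optimizer setup can be summarized with the following sketch. The class ordering (index 0 = non-skin, index 1 = skin) and the toy stand-ins for the model and dataset are assumptions for illustration only.

```python
import torch
import torch.nn as nn

# Toy stand-in for the full model described above; 2 output classes.
model = nn.Linear(128, 2)

# Weighted cross-entropy: 0.7 on the skin class, 0.3 on non-skin.
# Class ordering (0 = non-skin, 1 = skin) is an assumption.
criterion = nn.CrossEntropyLoss(weight=torch.tensor([0.3, 0.7]))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-5)

# Toy dataset: per-vertex features and labels for a few meshes.
dataset = [(torch.randn(1000, 128), torch.randint(0, 2, (1000,))) for _ in range(4)]

# Batch size 1: each step processes one scan mesh (variable vertex count).
for epoch in range(60):
    for verts_feats, labels in dataset:
        logits = model(verts_feats)        # (num_vertices, 2)
        loss = criterion(logits, labels)   # labels: (num_vertices,)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```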
We generate a synthetic dataset of 3,000 heads. Each head is rendered from 13 camera views at 1K resolution, mimicking the capture system of the real data. The input scan mesh is reconstructed by running a multi-view reconstruction algorithm on the rendered images; it has around 250K vertices.
2. Alpha Blending between Real Images and Rendered Hairstyles
To further evaluate our model against the 2D and 3D baselines on real data, we augment real data with synthetic hairstyles in Fig. 1. For subjects with minimal hair, we render various hairstyles and alpha-blend them onto the real captures. The registration computed from the clean images serves as ground truth. High-quality segmentation masks should allow the downstream registration result on the alpha-blended images to remain consistent with the original registration result. Fig. 1 shows that the 2D method fails on hairstyles with many hair strands and creates erroneous bumps in the registration result.
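The compositing step is standard "over" alpha blending; a minimal sketch is shown below, with array shapes and names chosen only for illustration.

```python
import numpy as np

def alpha_blend(real_image, hair_rgba):
    """Composite a rendered hairstyle (RGBA, floats in [0, 1]) over a real
    capture (RGB, floats in [0, 1]) using standard 'over' alpha blending."""
    rgb, alpha = hair_rgba[..., :3], hair_rgba[..., 3:4]
    return alpha * rgb + (1.0 - alpha) * real_image

# Example with random data standing in for a real capture and a hair render.
real = np.random.rand(1024, 1024, 3).astype(np.float32)
hair = np.random.rand(1024, 1024, 4).astype(np.float32)
blended = alpha_blend(real, hair)
```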
3. Additional Qualitative Comparison on Real Data
In Fig. 2, we share additional results on real scans. For some scans,
there are no significant differences between the three models, espe-
cially when there are not many occluded regions. However, when
hair strands are near the face or forehead, the downstream registra-
tion result differs.
In Fig. 3, we show that the variance of image features across views helps the model identify ambiguous regions on the scan mesh. Poorly reconstructed vertices can fall onto the skin region in certain camera views and confuse the model; the variance helps to rule such vertices out.
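A sketch of how per-vertex feature statistics across views could be aggregated is given below. The visibility handling and tensor shapes are simplifying assumptions, not our exact implementation.

```python
import torch

def per_vertex_feature_stats(view_feats, visibility):
    """Aggregate per-view image features projected onto mesh vertices.

    view_feats: (num_views, num_vertices, feat_dim) features sampled at each
                vertex's projection in every camera view.
    visibility: (num_views, num_vertices) boolean mask (vertex visible in view).
    Returns the per-vertex mean and variance over the views in which the
    vertex is visible; the visibility handling here is a simplification.
    """
    vis = visibility.unsqueeze(-1).float()                       # (views, N, 1)
    count = vis.sum(dim=0).clamp(min=1.0)                        # (N, 1)
    mean = (view_feats * vis).sum(dim=0) / count                 # (N, D)
    var = (((view_feats - mean) ** 2) * vis).sum(dim=0) / count  # (N, D)
    return mean, var

# 13 camera views, toy mesh with 1000 vertices, 512-dim features.
feats = torch.randn(13, 1000, 512)
vis = torch.rand(13, 1000) > 0.3
mean, var = per_vertex_feature_stats(feats, vis)
```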
4. 3D Morphable Model (3DMM)
As explained in Section 2.3 of the main paper, in order to generate synthetic training data we rely on a custom 3DMM. The model closely follows the definition introduced in [WBH*21]. Specifically, the model represents a function $M(\Theta) : \mathbb{R}^{|\Theta|} \rightarrow \mathbb{R}^{3 \times N_V}$, which takes model parameters $\Theta$ as input and produces the $N_V$ vertices of a human head mesh with a given fixed topology.
Following [WBH*21], the model parameters $\Theta$ consist of identity parameters $\beta \in \mathbb{R}^{|\beta|}$, expression parameters $\phi \in \mathbb{R}^{|\phi|}$, joint rotation parameters $\theta \in \mathbb{R}^{K \times 3}$, and a global translation vector $\tau \in \mathbb{R}^{3}$. The model is linear, i.e. both the identity and expression parameters $\beta, \phi$ control the additive contribution of the identity and expression basis, represented as a PCA computed over an in-house dataset of registered expressive faces.
Our model differs from that of [WBH*21] in the following two aspects: (1) we define a custom mesh topology consisting of $N_V = 17821$ vertices, and (2) we rely on custom skinning weights prepared manually by an artist.
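A minimal NumPy sketch of the linear part of such a model is given below. It omits the joint rotations and skinning of the full model, and the basis sizes and toy data are illustrative assumptions.

```python
import numpy as np

def linear_3dmm(template, id_basis, expr_basis, beta, phi, tau):
    """Minimal linear shape model: template plus identity and expression
    offsets from PCA bases, plus a global translation. Joint rotations and
    skinning from the full model are omitted here.

    template:   (N_V, 3) mean head mesh vertices
    id_basis:   (|beta|, N_V, 3) identity PCA basis
    expr_basis: (|phi|, N_V, 3) expression PCA basis
    """
    verts = (template
             + np.tensordot(beta, id_basis, axes=1)
             + np.tensordot(phi, expr_basis, axes=1))
    return verts + tau

# Toy example with random bases (N_V = 17821 as in our topology).
N_V = 17821
template = np.zeros((N_V, 3))
id_basis = np.random.randn(50, N_V, 3) * 0.01   # basis sizes are assumptions
expr_basis = np.random.randn(30, N_V, 3) * 0.01
verts = linear_3dmm(template, id_basis, expr_basis,
                    beta=np.random.randn(50), phi=np.random.randn(30),
                    tau=np.array([0.0, 0.0, 0.0]))
```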
5. Surface Variation
As discussed in Section 2.1 of the main paper, one of the geometric features we rely on is surface variation [PGK02]. This feature represents a notion of a second-order surface property akin to curvature. As opposed to standard mean or Gaussian curvature, surface variation tends to produce less noisy results when computed on a discretized 3D surface (e.g. a triangular mesh), it is parameterizable in terms of the span of the surface area it considers for a given surface point, and it has been successfully used in ML-based approaches to deformable 3D surface analysis [GCP*22].
For any given surface point $p_i \in \mathbb{R}^3$, the surface variation $\sigma_i$ is computed by forming the covariance matrix $C$ of the $k$ nearest neighbors of $p_i$, performing an eigendecomposition, and taking the ratio of the smallest eigenvalue to the sum of all eigenvalues.
Figure 1: We re-render real data examples with a variety of synthetic hairstyles from our digital assets library. Thereafter, we run MVS and registration. Ideally, even when adding different hairstyles, the segmentation mask should allow the downstream registration result to remain consistent. It is clear here that our method allows the downstream registration result to remain more consistent with the original registration result.
Geometrically, the smallest eigenvalue corresponds to the point variance along the surface normal of $p_i$, and therefore the larger the smallest eigenvalue is, the more the surface is locally curved at $p_i$.
Formally, let $P = \{p_i\}$ be a set of surface points and let $N_{p_i} \in \mathbb{R}^{k \times 3}$ be a mean-centered matrix collecting the $k$ nearest points of $p_i$ (including $p_i$ itself). Let $C = \frac{1}{k-1} N_{p_i}^{\top} N_{p_i}$ be the associated covariance matrix, $C \in \mathbb{R}^{3 \times 3}$, and let $\boldsymbol{\lambda} = [\lambda_1, \lambda_2, \lambda_3]$ be the vector of the three eigenvalues of $C$. The surface variation of the point $p_i$ is then computed as shown in Eq. (1) below.
Figure 2: We compare our method with the 2D method and the best-performing pure 3D method on the real data. On the left we show the input RGB images; we capture 13 views in total. Then, we show both the segmentation and the registration results. Dark orange is the predicted skin. These additional examples show that our model works on a diversity of subjects.
Figure 3: Feature Variance. Left to right: raw scan, without feature variance, with feature variance. The model can better segment regions with noisy geometry when using the variance of image features across views.
\[
\sigma_i = \frac{\min(\boldsymbol{\lambda})}{\sum_{j=1}^{3} \lambda_j}. \qquad (1)
\]
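A straightforward NumPy/SciPy sketch of Eq. (1) is given below. The neighbourhood size $k = 30$ follows the $\sigma_{30}$ shown in Fig. 4; elsewhere the exact neighbourhood size is an assumption.

```python
import numpy as np
from scipy.spatial import cKDTree

def surface_variation(points, k=30):
    """Surface variation (Eq. 1) per point: smallest eigenvalue of the local
    covariance over the k nearest neighbours, divided by the eigenvalue sum."""
    tree = cKDTree(points)
    _, idx = tree.query(points, k=k)           # neighbour indices, incl. the point itself
    neighbours = points[idx]                   # (N, k, 3)
    centered = neighbours - neighbours.mean(axis=1, keepdims=True)
    cov = np.einsum("nki,nkj->nij", centered, centered) / (k - 1)  # (N, 3, 3)
    evals = np.linalg.eigvalsh(cov)            # eigenvalues in ascending order
    return evals[:, 0] / evals.sum(axis=1)

# Toy example: noisy points sampled on a unit sphere.
pts = np.random.randn(5000, 3)
pts /= np.linalg.norm(pts, axis=1, keepdims=True)
sigma = surface_variation(pts, k=30)
```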
6. Feature Visualization
In Fig. 4, we visualize input features on a scan mesh of a real subject. The feature mean shows one ViT feature dimension projected onto the mesh, whereas the feature variance denotes the corresponding variance among the 13 views for that specific feature. Surface variation is shown on the far right. Note that even though the mesh consists of 250K vertices, the values along the mesh are fairly smooth.

Figure 4: Feature visualization, from left to right: colored raw scan, single ViT feature mean, corresponding feature variance, and surface variation $\sigma_{30}$. Notice how the feature variance highlights the ambiguous region on the scan mesh, close to the boundary of skin and hair.
7. Segmentation Quality of Custom 2D Segmenter
In Fig. 5, we compare our custom 2D segmentation model to the Meta Sapiens model [KBM*24].
Figure 5: Comparison of 2D segmentation using our custom 2D segmenter and Meta Sapiens. Only one frontal camera view is shown here.
Figure 6: Spa Mask. We use a limited region of the template mesh
to compute distance to surface costs for synthetic data examples.
Our custom 2D segmenter is fine-tuned to recognise eyes, eyebrows, nose, and mouth. The default Meta Sapiens model, on the other hand, only distinguishes skin from hair. This demonstrates that the quality of our custom 2D segmenter is on par with the state of the art.
8. Labelling Real Capture Data
Real scan meshes were exported to Blender and annotated by color-
ing the mesh using brush strokes. The labels come from one anno-
tator, due to the time and costs that labelling requires. Furthermore,
we did not use vertex color during labelling and mostly focused on
identifying geometry that does not belong to the skin region.
9. Spa Mask Region
Since it is mostly the frontal face region that affects registration quality, we do not use the full template mesh for computing $d_{\text{surface}}$. We focus only on the regions directly on the temples, cheeks, and nose. Fig. 6 shows the spa mask region of a synthetic data example.
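As an illustration of such a masked distance term, the sketch below uses a nearest-vertex distance restricted to the spa-mask vertices. The actual $d_{\text{surface}}$ of the main paper may be defined differently (e.g. point-to-triangle distances), so this is only a stand-in under stated assumptions.

```python
import numpy as np
from scipy.spatial import cKDTree

def masked_surface_distance(template_verts, spa_mask, scan_verts):
    """Illustrative stand-in for a masked distance-to-surface cost: for each
    template vertex inside the spa mask (temples, cheeks, nose), take the
    distance to the nearest scan vertex and average.

    template_verts: (N_V, 3) registered template vertices
    spa_mask:       (N_V,) boolean mask selecting the spa region
    scan_verts:     (M, 3) raw scan vertices
    """
    tree = cKDTree(scan_verts)
    dists, _ = tree.query(template_verts[spa_mask])
    return dists.mean()

# Toy example with random geometry; the mask region is hypothetical.
template = np.random.rand(17821, 3)
mask = np.zeros(17821, dtype=bool)
mask[:5000] = True
scan = np.random.rand(250000, 3)
cost = masked_surface_distance(template, mask, scan)
```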
Figure 7: Examples from our synthetic dataset. We have a variety of identities, skin tones, and hairstyles.

Figure 8: Experiment on how the dataset size affects mIoU on the synthetic test set.
10. Additional Synthetic Data Examples
More examples from our synthetic dataset can be seen in Fig. 7.
11. Ablation Study on Dataset Size
Finally, we study how the dataset size affects model performance in
Fig. 8. Due to the capacity of the ViT backbone, our method does
require many synthetic examples to achieve competitive perfor-
mance. However, generalization performance may differ depending
on the diversity of the training set.
References
[GCP*22] GUNDOGDU E., CONSTANTIN V., PARASHAR S., SEIFODDINI A., DANG M., SALZMANN M., FUA P.: GarNet++: Improving fast and accurate static 3D cloth draping by curvature loss. TPAMI (2022).
[KBM*24] KHIRODKAR R., BAGAUTDINOV T., MARTINEZ J., ZHAOEN S., JAMES A., SELEDNIK P., ANDERSON S., SAITO S.: Sapiens: Foundation for human vision models. arXiv preprint arXiv:2408.12569 (2024).
[PGK02] PAULY M., GROSS M., KOBBELT L. P.: Efficient simplification of point-sampled surfaces. In IEEE Visualization (2002), IEEE, pp. 163–170.
[WBH*21] WOOD E., BALTRUŠAITIS T., HEWITT C., DZIADZIO S., JOHNSON M., ESTELLERS V., CASHMAN T. J., SHOTTON J.: Fake it till you make it: Face analysis in the wild using synthetic data alone. In ICCV (2021).
© 2025 The Authors. Proceedings published by Eurographics - The European Association for Computer Graphics. This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited.