TeGA: Texture Space Gaussian Avatars for High-Resolution Dynamic Head Modeling

¹ETH Zürich, ²Google

Abstract

Sparse volumetric reconstruction and rendering via 3D Gaussian splatting have recently enabled animatable 3D head avatars that can be rendered from arbitrary viewpoints with impressive photorealism. Today, such photoreal avatars are seen as a key component of emerging applications in telepresence, extended reality, and entertainment. Building a photoreal avatar requires estimating the complex non-rigid motion of different facial components as seen in the input video images; due to inaccurate motion estimation, animatable models typically suffer a loss of fidelity and detail compared to their non-animatable counterparts, built from an individual facial expression. Also, recent state-of-the-art models are often affected by memory limitations that reduce the number of 3D Gaussians used for modeling, leading to lower detail and quality. To address these problems, we present a new high-detail 3D head avatar model that improves upon the state of the art, greatly increasing the number of 3D Gaussians and the modeling quality for rendering at 4K resolution. Our high-quality model is reconstructed from multiview input video and builds on top of a mesh-based 3D morphable model, which provides a coarse deformation layer for the head. Photoreal appearance is modeled by 3D Gaussians embedded within the continuous UVD tangent space of this mesh, allowing for more effective densification where it is most needed. Additionally, these Gaussians are warped by a novel UVD deformation field to capture subtle, localized motion. Our key contribution is the novel deformable Gaussian encoding and overall fitting procedure, which allow our head model to preserve appearance detail while capturing facial motion and other transient high-frequency features such as skin wrinkling.
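To make the texture-space encoding concrete, below is a minimal NumPy sketch of how Gaussians parameterized by UVD coordinates could be placed on the posed mesh: (u, v) selects a surface point through the mesh's UV parameterization and barycentric interpolation, and d offsets it along the surface normal. Function and argument names are illustrative assumptions, not the released implementation.

```python
import numpy as np

def uvd_to_world(uvd, tri_uv, tri_xyz, tri_normals):
    """Map Gaussians from UVD tangent space onto a posed 3DMM mesh.

    uvd:         (N, 3) per-Gaussian texture coords (u, v) and normal offset d.
    tri_uv:      (T, 3, 2) per-triangle UV coordinates of the template mesh.
    tri_xyz:     (T, 3, 3) per-triangle vertex positions of the posed mesh.
    tri_normals: (T, 3) per-triangle normals of the posed mesh.

    Simplified, hypothetical sketch: for each Gaussian we find the triangle
    whose UV footprint contains (u, v), compute barycentric weights there, and
    place the point on the deformed surface plus an offset along the normal.
    A real implementation would cache the triangle assignment.
    """
    out = np.zeros((uvd.shape[0], 3))
    for i, (u, v, d) in enumerate(uvd):
        for t in range(tri_uv.shape[0]):
            a, b, c = tri_uv[t]
            # Barycentric coordinates of (u, v) in the UV triangle (a, b, c).
            m = np.stack([b - a, c - a], axis=1)          # 2x2 system
            try:
                w1, w2 = np.linalg.solve(m, np.array([u, v]) - a)
            except np.linalg.LinAlgError:
                continue                                  # degenerate UV triangle
            w0 = 1.0 - w1 - w2
            if min(w0, w1, w2) < 0:                       # outside this triangle
                continue
            surf = w0 * tri_xyz[t, 0] + w1 * tri_xyz[t, 1] + w2 * tri_xyz[t, 2]
            out[i] = surf + d * tri_normals[t]
            break
    return out
```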

* We recommend using the Chrome or Safari browser to correctly display all visuals.


Novel View Synthesis

We can render the same (training) expression from continuously varying camera views. As can be seen below, when trained under a static lighting setup, RGCA does not maintain multiview consistency across motion, which manifests as semitransparent surfaces, especially visible in the last example. GHA, which relies on CNN-based upscaling, suffers from noticeable aliasing. While GaussianAvatars and MVP do not suffer from these issues, their overall quality is lower.
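Rendering these orbits amounts to keeping the expression fixed while sweeping the camera around the head. Below is a small, illustrative NumPy helper (not part of the method) that generates world-to-camera poses for such an orbit, assuming an OpenCV-style camera convention.

```python
import numpy as np

def orbit_cameras(center, radius, height, n_views=120):
    """Generate world-to-camera extrinsics along a circular orbit around `center`."""
    poses = []
    for theta in np.linspace(0.0, 2.0 * np.pi, n_views, endpoint=False):
        eye = center + np.array([radius * np.cos(theta), height, radius * np.sin(theta)])
        forward = center - eye
        forward /= np.linalg.norm(forward)                # camera z-axis: look-at direction
        up = np.array([0.0, 1.0, 0.0])                    # world up
        right = np.cross(forward, up)
        right /= np.linalg.norm(right)                    # camera x-axis
        down = np.cross(forward, right)                   # camera y-axis (OpenCV convention)
        R = np.stack([right, down, forward])              # rows: camera axes in world frame
        t = -R @ eye                                      # so that p_cam = R @ p_world + t
        poses.append((R, t))
    return poses
```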

[Video comparison panels (four examples): Ours, RGCA, GaussianAvatars, MVP, GHA]

Reenactment

We show results on expressions unseen during training. In these examples, we extract 3DMM coefficients from the Expression Target (left) and apply them to our target subject on the right. Note how both MVP and RGCA often fail to properly reconstruct the expressions. Furthermore, although overall quality is fairly comparable, we reconstruct certain finer details better than RGCA. Both our method and RGCA significantly outperform GaussianAvatars, MVP, and GHA in terms of quality. Note that since the Multiface dataset does not contain eyeball meshes, no method reconstructs gaze correctly. To show that our method does not improve purely through the use of additional Gaussians, we additionally provide comparisons with our method using a roughly equal number of Gaussians: RGCA was trained with 1M Gaussians, and GaussianAvatars typically densifies up to roughly 200K Gaussians. Note how, despite some additional artifacts and blurrier hair, our method retains quality even with only 200K Gaussians.
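The driving signal for these reenactments is simply the tracked 3DMM parameters of the source sequence. The hedged sketch below illustrates this setup with placeholder names (`flame_like_model` and `avatar.render` are assumptions, not the released API): the target subject's identity is kept fixed while the source's expression and pose coefficients are swapped in.

```python
import torch

def reenact(source_tracker_params, target_identity, flame_like_model, avatar):
    """Transfer tracked expression/pose coefficients onto a different subject.

    Hypothetical sketch of the reenactment setup described above: the 3DMM
    (a FLAME-like model here, purely for illustration) is evaluated with the
    source expression and pose but the target identity, and the resulting
    mesh drives the target avatar.
    """
    frames = []
    for params in source_tracker_params:              # one dict per tracked frame
        verts = flame_like_model(
            shape=target_identity,                    # keep target geometry/identity
            expression=params["expression"],          # borrow source expression
            pose=params["pose"],                      # borrow source jaw/head pose
        )
        frames.append(avatar.render(verts))           # Gaussians follow the mesh
    return torch.stack(frames)
```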

[Video comparison panels (eight examples): GT, Ours, Ours (1M Gaussians), RGCA, Ours (200K Gaussians), GaussianAvatars, MVP, GHA]

Warp Ablation

We show results on test-expression reenactments. Note how the CNN warp field is necessary to reconstruct the fine details.
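As a rough illustration of what such a texture-space warp can look like, the PyTorch sketch below broadcasts an expression code over a coarse UV grid, predicts a per-texel UVD offset map with a small CNN, and samples it at each Gaussian's (u, v) location. The architecture and dimensions are assumptions for illustration, not the released model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UVDWarpCNN(nn.Module):
    """Illustrative expression-conditioned warp field in texture space."""

    def __init__(self, expr_dim=64, grid=128):
        super().__init__()
        self.grid = grid
        self.net = nn.Sequential(
            nn.Conv2d(expr_dim, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3, 3, padding=1),            # per-texel (du, dv, dd)
        )

    def forward(self, expr, gaussian_uv):
        # expr: (B, expr_dim); gaussian_uv: (B, N, 2) in [0, 1].
        feat = expr[:, :, None, None].expand(-1, -1, self.grid, self.grid)
        offset_map = self.net(feat)                    # (B, 3, H, W) UVD offset map
        uv = gaussian_uv * 2.0 - 1.0                   # to grid_sample coords in [-1, 1]
        offsets = F.grid_sample(offset_map, uv[:, None], align_corners=True)
        return offsets[:, :, 0].permute(0, 2, 1)       # (B, N, 3) per-Gaussian UVD offsets
```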

[Video comparison panels (six examples): GT, Ours, No Warp (With Shading), MLP Warp]

Shading Ablation

We show results on expressions unseen during training, as well as interpolated expressions with camera orbits, for all our ablation settings.
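For reference, the sketch below shows one plausible form such a shading component could take: a small MLP that maps the expression code and a Gaussian's UV location to a multiplicative gain on its base color, letting transient effects such as wrinkle shadows appear without moving the Gaussians. This is purely an assumed illustration of the kind of module being ablated, not the actual architecture.

```python
import torch
import torch.nn as nn

class GaussianShading(nn.Module):
    """Hypothetical per-Gaussian shading gain conditioned on expression and UV."""

    def __init__(self, expr_dim=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(expr_dim + 2, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, expr, gaussian_uv, base_rgb):
        # expr: (expr_dim,), gaussian_uv: (N, 2), base_rgb: (N, 3)
        x = torch.cat([expr.expand(gaussian_uv.shape[0], -1), gaussian_uv], dim=-1)
        gain = torch.sigmoid(self.mlp(x)) * 2.0        # monochrome gain in (0, 2)
        return base_rgb * gain                         # shaded per-Gaussian color
```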

[Video comparison panels (six examples): GT, Ours, No Shading (With Warp), No Warp or Shading]

Other Ablations

Here, we show how our method performs without adaptive densification, with binding inheritance instead of UV-based triangle updates, and without the LPIPS loss. Densification is critical to overall quality, the UV-based triangle updates are critical to avoid artifacts where Gaussians are bound to the "wrong" triangle, and the LPIPS loss significantly reduces blurriness.
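As an illustration of the UV-based triangle updates, the sketch below rebinds a Gaussian to whichever triangle contains its (u, v) point whenever it drifts outside its current triangle's UV footprint. The brute-force search and helper names are illustrative only, not the released implementation.

```python
import numpy as np

def rebind_gaussians(gaussian_uv, tri_uv, current_binding):
    """Rebind each Gaussian to the mesh triangle that contains its UV point.

    gaussian_uv:     (N, 2) per-Gaussian (u, v) coordinates.
    tri_uv:          (T, 3, 2) per-triangle UV coordinates.
    current_binding: (N,) int indices of the currently bound triangles.
    """
    new_binding = current_binding.copy()
    for i, (u, v) in enumerate(gaussian_uv):
        if _contains(tri_uv[current_binding[i]], u, v):
            continue                                   # still inside; keep binding
        for t in range(tri_uv.shape[0]):               # brute-force fallback search
            if _contains(tri_uv[t], u, v):
                new_binding[i] = t
                break
    return new_binding

def _contains(tri, u, v, eps=1e-8):
    """Point-in-triangle test in UV space via barycentric coordinates."""
    a, b, c = tri
    m = np.stack([b - a, c - a], axis=1)
    if abs(np.linalg.det(m)) < eps:
        return False                                   # degenerate UV triangle
    w1, w2 = np.linalg.solve(m, np.array([u, v]) - a)
    return w1 >= -eps and w2 >= -eps and (w1 + w2) <= 1.0 + eps
```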

[Video comparison panels (six examples): GT, Ours, No Densification, No Triangle Updates, No VGG Loss]

Temporal Instability

Some of our results show temporal instability in the form of flickering. This is not an inherent part of our method, but rather a result of imperfect 3DMM registration. Below, we show our renders side by side with the underlying 3DMM mesh. As can be seen, the flickering is mostly caused by instability of the underlying mesh.
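One simple way to quantify this, sketched below purely as a diagnostic (not part of the method), is the mean frame-to-frame vertex displacement of the tracked mesh; per the observation above, spikes in this curve should correlate with the frames where the render flickers.

```python
import numpy as np

def mesh_jitter(vertex_tracks):
    """Per-frame registration jitter of the tracked 3DMM mesh.

    vertex_tracks: (F, V, 3) tracked vertex positions over F frames.
    Returns an (F-1,) curve of mean frame-to-frame vertex displacement.
    """
    deltas = np.linalg.norm(np.diff(vertex_tracks, axis=0), axis=-1)  # (F-1, V)
    return deltas.mean(axis=1)
```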

[Video comparison panels (two examples): Render, Mesh]