ShellNeRF: Learning a Controllable High-resolution Model of the Eye and Periocular Region

ETH Zürich, Google

We present ShellNeRF - a novel method for high-resolution novel view synthesis and animation of the periocular face region. Our method allows for controlling expressions and eye gaze and renders novel views at an unprecedented level of detail.

Abstract

Eye gaze and expressions are crucial non-verbal signals in face-to-face communication. Visual effects and telepresence demand significant improvements in personalized tracking, animation, and synthesis of the eye region to achieve true immersion. Morphable face models, in combination with coordinate-based neural volumetric representations, show promise in solving the difficult problem of reconstructing intricate geometry (eyelashes) and synthesizing photorealistic appearance variations (wrinkles and specularities) of eye performances. We propose a novel hybrid representation - ShellNeRF - that builds a discretized volume around a 3DMM face mesh using concentric surfaces to model the deformable ‘periocular’ region. We define a canonical space using the UV layout of the shells that constrains the space of dense correspondence search. Combined with an explicit eyeball mesh for modeling corneal light-transport, our model allows for animatable photorealistic 3D synthesis of the whole eye region. Using multi-view video input, we demonstrate significant improvements over state-of-the-art in expression re-enactment and transfer for high-resolution close-up views of the eye region.
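
To make the shell construction concrete, here is a minimal sketch of how concentric shells can be built by offsetting a 3DMM mesh along its vertex normals, so that every shell inherits the base mesh's UV layout and a point on shell k gets canonical coordinates (u, v, k). This is an illustration, not the paper's code; the shell count, offset range, and use of area-weighted normals are our assumptions.

import numpy as np

def vertex_normals(verts, faces):
    # Area-weighted per-vertex normals of a triangle mesh.
    normals = np.zeros_like(verts)
    tris = verts[faces]                                       # (F, 3, 3)
    face_n = np.cross(tris[:, 1] - tris[:, 0], tris[:, 2] - tris[:, 0])
    for i in range(3):                                        # accumulate onto corners
        np.add.at(normals, faces[:, i], face_n)
    return normals / np.linalg.norm(normals, axis=-1, keepdims=True)

def build_shells(verts, faces, n_shells=16, max_offset=0.02):
    # Concentric surfaces around the base 3DMM mesh. Every shell shares
    # the topology, and hence the UV layout, of the base mesh, so a
    # sample on shell k has canonical coordinates (u, v, k).
    normals = vertex_normals(verts, faces)
    offsets = np.linspace(-max_offset, max_offset, n_shells)  # inward and outward
    return np.stack([verts + d * normals for d in offsets])   # (n_shells, V, 3)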

BibTeX

@article{10.1111:cgf.15041,
  journal = {Computer Graphics Forum},
  title = {{ShellNeRF: Learning a Controllable High-resolution Model of the Eye and Periocular Region}},
  author = {Li, Gengyan and Sarkar, Kripasindhu and Meka, Abhimitra and Buehler, Marcel and Mueller, Franziska and Gotardo, Paulo and Hilliges, Otmar and Beeler, Thabo},
  year = {2024},
  publisher = {The Eurographics Association and John Wiley & Sons Ltd.},
  ISSN = {1467-8659},
  DOI = {10.1111/cgf.15041}
}

Novel View Synthesis

We can render the same scene from continuous camera viewpoints. Unlike other methods, ours maintains multi-view consistency and does not "hide" wrinkles and shadows beneath the skin surface. Nerface suffers from high instability and often diverges, as seen in the second subject.

[Video comparison, two subjects: Ours | MVP | Nerface + EyeNeRF | 3DMM cond. EyeNeRF | MonoAvatar]
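
The continuous camera sweeps above can be produced by interpolating between two calibrated camera poses. Below is a minimal sketch assuming camera-to-world 4x4 matrices, using SciPy's SLERP for the rotations; the pose convention is our assumption, not the paper's.

import numpy as np
from scipy.spatial.transform import Rotation, Slerp

def camera_path(pose_a, pose_b, n_frames=60):
    # Interpolate two camera-to-world poses (4x4) into a smooth sweep:
    # rotations via SLERP on SO(3), translations linearly.
    rots = Rotation.from_matrix(np.stack([pose_a[:3, :3], pose_b[:3, :3]]))
    slerp = Slerp([0.0, 1.0], rots)
    ts = np.linspace(0.0, 1.0, n_frames)
    poses = np.tile(np.eye(4), (n_frames, 1, 1))
    poses[:, :3, :3] = slerp(ts).as_matrix()
    poses[:, :3, 3] = (1 - ts)[:, None] * pose_a[:3, 3] + ts[:, None] * pose_b[:3, 3]
    return poses  # render one frame per pose for a multi-view-consistent sweep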

Decomposition

We can also decompose each video into albedo, diffuse shading and specular shading.

[Video decomposition, two subjects: Albedo | Diffuse | Specular]
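
As an illustration of how such layers recombine, the sketch below assumes the common intrinsic-image convention in which view-independent albedo is modulated by diffuse shading and view-dependent specular shading is additive; the paper's exact blending may differ.

import numpy as np

def recombine(albedo, diffuse, specular):
    # Recompose a frame from its intrinsic layers (all HxWx3 in [0, 1]):
    # C = albedo * diffuse + specular, clamped to the displayable range.
    return np.clip(albedo * diffuse + specular, 0.0, 1.0)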

Regazing

Our method can control eye gaze and synthesize novel gaze directions. Note how our results are significantly sharper and move more smoothly than those of all competing methods. Mixture of Volumetric Primitives (MVP) requires a projected texture as input, which is not available for such manipulations; we therefore use a neutral texture as a placeholder. MVP is unable to regaze properly and instead blends from one eye pose to the next. EyeNeRF, on the other hand, lacks the overall quality needed to learn the eyeball correctly for one pose of the first subject, and fails more generally for the second subject.

[Video comparison, two subjects: Ours | MVP | Nerface + EyeNeRF | 3DMM cond. EyeNeRF | MonoAvatar]
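
Since ShellNeRF models the eyeball with an explicit mesh, regazing amounts to rigidly rotating that mesh about its center. The sketch below is a hypothetical illustration; the pitch/yaw parameterization and rotation order are our assumptions.

import numpy as np

def regaze(eyeball_verts, center, pitch, yaw):
    # Rigidly rotate an eyeball mesh about its center for a new gaze
    # direction. Angles in radians; the order (yaw about Y, then pitch
    # about X) is an assumption, as conventions vary between eye models.
    cp, sp = np.cos(pitch), np.sin(pitch)
    cy, sy = np.cos(yaw), np.sin(yaw)
    rot_x = np.array([[1, 0, 0], [0, cp, -sp], [0, sp, cp]])
    rot_y = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    return (eyeball_verts - center) @ (rot_x @ rot_y).T + center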

Regazing with a Moving Camera

We can do the same with a moving camera.

[Video comparison, two subjects: Ours | MVP | Nerface + EyeNeRF | 3DMM cond. EyeNeRF | MonoAvatar]

Interpolating Expressions

We interpolate between 13 expressions. Related works struggle to render convincing expressions, whereas with our method the periocular region smoothly adapts, showing naturally deforming wrinkles and highly detailed reflections on the eyeball. Again, for MVP we use the neutral texture as a placeholder.

[Video comparison, two subjects: Ours | MVP | Nerface + EyeNeRF | 3DMM cond. EyeNeRF | MonoAvatar]
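
A timeline like the one above can be built by blending the 3DMM expression coefficients of consecutive keyframes. The sketch below uses smoothstep easing between the captured expressions; the easing choice is ours, not necessarily the paper's.

import numpy as np

def expression_timeline(keyframes, frames_per_segment=30):
    # Blend through a (K, D) array of 3DMM expression coefficient
    # vectors, e.g. the 13 captured expressions, one row per keyframe.
    segments = []
    for a, b in zip(keyframes[:-1], keyframes[1:]):
        t = np.linspace(0.0, 1.0, frames_per_segment, endpoint=False)
        t = t * t * (3.0 - 2.0 * t)                 # smoothstep easing
        segments.append((1 - t)[:, None] * a + t[:, None] * b)
    segments.append(keyframes[-1:])                 # hold the final expression
    return np.concatenate(segments)                 # feed each row to the model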

Interpolating Expressions with a Moving Camera

We can also interpolate expressions while moving the camera, maintaining 3D consistency throughout the motion. State-of-the-art methods struggle with gaze and expression changes and produce significant floaters at novel camera viewpoints.

[Video comparison, two subjects: Ours | MVP | Nerface + EyeNeRF | 3DMM cond. EyeNeRF | MonoAvatar]

3DMM Expressions

Our shell-based formulation enables fine-grained control via 3DMM parameters. We show slow-motion renderings of a highly complex expression: a closing eyelid. Only our method and MonoAvatar are capable of this.

[Video comparison, two subjects: Ours | MVP | Nerface + EyeNeRF | 3DMM cond. EyeNeRF | MonoAvatar]
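
For a linear 3DMM, such fine-grained control reduces to evaluating V = V_mean + sum_i w_i * B_i while ramping a single coefficient over time. The sketch below is hypothetical; which basis component drives the eyelid depends on the specific 3DMM.

import numpy as np

def eval_blendshapes(mean_verts, basis, weights):
    # Linear 3DMM expression model: V = V_mean + sum_i w_i * B_i,
    # with basis of shape (D, V, 3) and weights of shape (D,).
    return mean_verts + np.tensordot(weights, basis, axes=1)

def blink_sequence(mean_verts, basis, blink_idx, n_frames=120):
    # Slow-motion eyelid close: ramp one (hypothetical) blink coefficient
    # from 0 to 1; blink_idx depends on the 3DMM's expression basis.
    weights = np.zeros(basis.shape[0])
    for t in np.linspace(0.0, 1.0, n_frames):
        weights[blink_idx] = t
        yield eval_blendshapes(mean_verts, basis, weights)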

Reenactment

We show results on expressions unseen during training. In these examples, we extract 3DMM coefficients from the Expression Target (left) and apply them to our target subject on the right. Note how our method faithfully reproduces the desired expressions. Our 3DMM fitting does not enforce temporal smoothness, which causes jitter; because we crop each frame based on the estimated eye pose, this jitter also appears in the GT video. Although the strongest baseline performs comparably in most situations, it is unable to handle certain expressions it did not directly observe, such as the half-open eye in the first example, resulting in strong artifacts.

[Video comparison, two subjects: Expression Target | Ours | MVP | Nerface + EyeNeRF | 3DMM cond. EyeNeRF | MonoAvatar]
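
Conceptually, reenactment feeds the coefficients fitted to the source actor straight into the target subject's model. The sketch below is a hypothetical driver loop; render_fn stands in for the trained model, and the optional moving-average smoothing addresses the fitting jitter noted above.

import numpy as np

def reenact(source_coeffs, render_fn, smooth_window=1):
    # source_coeffs: (T, D) per-frame expression (and gaze) parameters
    # fitted to the expression target; render_fn maps one coefficient
    # vector to a frame of the target subject. smooth_window > 1 applies
    # a moving average, since the 3DMM fitting enforces no temporal
    # smoothness on its own.
    coeffs = np.asarray(source_coeffs, dtype=np.float64)
    if smooth_window > 1:
        kernel = np.ones(smooth_window) / smooth_window
        coeffs = np.stack([np.convolve(coeffs[:, j], kernel, mode='same')
                           for j in range(coeffs.shape[1])], axis=1)
    return [render_fn(c) for c in coeffs]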