TL;DR:
We train a synthetic face prior for few-shot, personalized 3D face modeling in the wild. Although the model never sees a single real image during pre-training, it generalizes to real and stylized face images during inference. Our new model and few-shot fine-tuning method can robustly reconstruct expressive faces under challenging in-the-wild lighting and synthesize photorealistic novel views from only three inputs.
We train an implicit prior model on synthetic renderings of a 3DMM combined with assets such as hair, beards, and clothing. The prior takes as input a frozen 3DMM identity code β, a frozen 3DMM expression code ψ, and an optimizable latent code w, which models appearance, clothing, and other out-of-3DMM characteristics. The prior model F is a conditional radiance field optimized to reconstruct the training identities. Please see Sec. 3.2 in the paper for details.
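To make the conditioning concrete, here is a minimal PyTorch-style sketch of a conditional radiance field that maps a 3D point, a view direction, and the three codes β, ψ, and w to density and color. The layer sizes, code dimensions, and module structure are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class ConditionalRadianceField(nn.Module):
    """Toy conditional NeRF F(x, d | beta, psi, w) -> (density, rgb).

    beta: frozen 3DMM identity code, psi: frozen 3DMM expression code,
    w: optimizable latent for out-of-3DMM properties (hair, clothing, appearance).
    All dimensions and the architecture are illustrative assumptions.
    """
    def __init__(self, dim_beta=256, dim_psi=100, dim_w=128, hidden=256):
        super().__init__()
        in_dim = 3 + dim_beta + dim_psi + dim_w        # sample point + condition codes
        self.trunk = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.density_head = nn.Linear(hidden, 1)
        self.rgb_head = nn.Sequential(                  # view-dependent color branch
            nn.Linear(hidden + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid(),
        )

    def forward(self, x, d, beta, psi, w):
        # x, d: (N, 3) sample points and view directions.
        # The condition codes are shared by all samples, so broadcast them.
        cond = torch.cat([beta, psi, w], dim=-1).expand(x.shape[0], -1)
        h = self.trunk(torch.cat([x, cond], dim=-1))
        sigma = torch.relu(self.density_head(h))
        rgb = self.rgb_head(torch.cat([h, d], dim=-1))
        return sigma, rgb

# Usage sketch:
# model = ConditionalRadianceField()
# sigma, rgb = model(points, dirs, beta, psi, w)
```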
For inference, we capture in-the-wild face images and recover camera and 3DMM parameters from the uncalibrated input. We follow the approach of previous work [Wood et al. 2022] and fit to dense 2D landmarks. The outputs of this step are the camera parameters, a shared identity code, and per-image expression codes. The per-image expression codes are crucial for high-quality in-the-wild results because they compensate for small subject motion during capture (Sec. 3.3.2 in the paper).
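The fitting step can be pictured as a joint optimization of a shared identity code, per-image expression codes, and per-image cameras against the detected dense 2D landmarks. The sketch below assumes a differentiable 3DMM landmark function and a projection function are supplied as placeholders; the code dimensions, optimizer, and step counts are illustrative, not the paper's settings.

```python
import torch

def fit_3dmm_to_landmarks(landmarks_2d, landmarks_3d, project, cameras_init,
                          n_steps=500, lr=1e-2):
    """Recover a shared identity code and per-image expression codes by fitting
    projected 3DMM landmarks to detected dense 2D landmarks.

    landmarks_2d: (num_images, L, 2) detected landmarks per input image.
    landmarks_3d: callable (beta, psi) -> (L, 3), a differentiable 3DMM (placeholder).
    project:      callable (points_3d, camera) -> (L, 2) pinhole projection (placeholder).
    cameras_init: list of per-image camera parameter tensors to refine.
    """
    num_images = landmarks_2d.shape[0]
    beta = torch.zeros(256, requires_grad=True)                                 # shared identity code
    psis = [torch.zeros(100, requires_grad=True) for _ in range(num_images)]    # per-image expressions
    cams = [c.detach().clone().requires_grad_(True) for c in cameras_init]      # refined cameras
    opt = torch.optim.Adam([beta, *psis, *cams], lr=lr)

    for _ in range(n_steps):
        opt.zero_grad()
        loss = 0.0
        for i in range(num_images):
            pred_2d = project(landmarks_3d(beta, psis[i]), cams[i])
            loss = loss + ((pred_2d - landmarks_2d[i]) ** 2).mean()
        loss.backward()
        opt.step()
    return beta.detach(), [p.detach() for p in psis], [c.detach() for c in cams]
```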
Fine-tuning begins with a short warm-up phase in which only the latent code w_t for the target is optimized. After that, fine-tuning optimizes all model parameters under additional constraints on the geometry and the appearance weights. We randomly sample rays from all available inputs, typically three images, and restrict them to the foreground by multiplying with an estimated foreground mask. Please see Sec. 3.3.2 in the paper for more details.
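A minimal sketch of this two-stage schedule, assuming a hypothetical model.render call that volume-renders a batch of rays conditioned on β, the per-image ψ, and w_t; the hyper-parameters and the paper's geometry and appearance regularizers are omitted.

```python
import torch

def finetune(model, w_t, beta, psis, rays, colors, fg_masks, img_ids,
             warmup_steps=500, total_steps=3000, rays_per_step=1024, lr=1e-3):
    """Two-stage few-shot fine-tuning sketch (hyper-parameters and render API are assumptions).

    Stage 1 (warm-up): only the target latent code w_t is optimized.
    Stage 2: all prior-model parameters are optimized as well; the paper adds
    constraints on the geometry and the appearance weights (omitted here).

    rays, colors, fg_masks, img_ids: per-ray data pooled over all (typically three) inputs;
    psis: (num_images, dim_psi) frozen per-image expression codes;
    img_ids: (num_rays,) index of the source image for each ray.
    """
    opt = torch.optim.Adam([w_t], lr=lr)
    for step in range(total_steps):
        if step == warmup_steps:
            # After the warm-up, unfreeze the prior model itself.
            opt = torch.optim.Adam([w_t, *model.parameters()], lr=lr)
        idx = torch.randint(0, rays.shape[0], (rays_per_step,))
        psi = psis[img_ids[idx]]                        # per-image (frozen) expression codes
        rgb = model.render(rays[idx], beta, psi, w_t)   # volume rendering (assumed call)
        mask = fg_masks[idx].unsqueeze(-1)              # restrict the loss to foreground pixels
        loss = (mask * (rgb - colors[idx]) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model, w_t
```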
We demonstrate the generalization capability of our method on in-the-wild mobile camera captures. With just three input views, our method generates highly consistent and photorealistic free-view renders of a subject. It not only reconstructs coherent geometry but also learns to interpolate view-dependent specularities, such as on the hair and skin.
Synthesis from a single view naturally suffers from the bas-relief ambiguity, making it much more challenging, but our prior model still enables plausible results. We show smartphone images, selfies, an Internet image from the Rodin website, and stylized inputs from LP3D (Trevithick et al., 2023). Note that our prior model has never seen any stylization or strong face paint.
We train a synthetic prior model to capture the distribution of human heads with arbitrary facial expressions. The prior model is a conditional NeRF with three additional inputs: the identity β, the expression ψ, and out-of-3DMM properties w such as hair, clothing, and appearance.
We show interpolations for each of the latent spaces.
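Such interpolations amount to linearly blending two codes in one latent space while keeping the other inputs fixed. The helper below is a minimal sketch of that idea; the rendering call in the usage comment is hypothetical.

```python
import torch

def interpolate_code(z_a, z_b, num_steps=8):
    """Linearly interpolate between two codes (identity beta, expression psi,
    or latent w) to traverse one latent space while the other codes stay fixed."""
    alphas = torch.linspace(0.0, 1.0, num_steps)
    return [(1.0 - a) * z_a + a * z_b for a in alphas]

# Example: sweep the expression space with identity and latent fixed
# (render_views is a placeholder for rendering the conditional NeRF):
# for psi in interpolate_code(psi_a, psi_b):
#     render_views(model, beta, psi, w)
```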
Our synthetic prior is trained with glasses; however, only 20% of the synthetic identities wear them. In some cases, the fine-tuned results incorrectly paint the frames of glasses onto the skin (see the example in the first row). This happens when only a single input view shows the frames from a side angle. Furthermore, our method tends to produce poor results when the shoulders are not observed in the input (bottom left), and it fails for heavily out-of-distribution faces (bottom right). Note how the prior tries to reconstruct Gollum by adding glasses. Please see the paper for a discussion of limitations.
The bottom left image is from the Rodin website. The Gollum image (bottom right) is from The Lord of the Rings © New Line Cinema / Warner Bros.
While our focus is on novel view synthesis of static faces, future work could leverage similar prior models for facial animation. To demonstrate this, we test our model by fine-tuning to 12 smartphone images and interpolating random expressions from the synthetic pre-training set. Fine-tuning to multiple expressions helps preserve the expression space of the prior model, and enables the rendering of expressions that are not seen in the input images.
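As a rough sketch of how such an animation could be driven after fine-tuning, the snippet below interpolates between expression codes sampled from the synthetic pre-training set while keeping the fine-tuned identity β and latent w_t fixed; render_view, the key count, and the frame counts are assumptions.

```python
import torch

def render_expression_sweep(model, beta, w_t, psi_bank, cam, frames_per_segment=30):
    """Animate the fine-tuned subject by interpolating between expression codes
    drawn from the synthetic pre-training set; identity beta and latent w_t stay
    fixed (render_view is an assumed rendering call, psi_bank a (M, dim) tensor)."""
    keys = psi_bank[torch.randperm(len(psi_bank))[:4]]   # a few random key expressions
    frames = []
    for psi_a, psi_b in zip(keys[:-1], keys[1:]):
        for t in torch.linspace(0.0, 1.0, frames_per_segment):
            psi = (1.0 - t) * psi_a + t * psi_b
            frames.append(model.render_view(cam, beta, psi, w_t))
    return frames
```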
@incollection{buehler2024cafca,
  title     = {Cafca: High-quality Novel View Synthesis of Expressive Faces from Casual Few-shot Captures},
  author    = {Marcel C. Buehler and Gengyan Li and Erroll Wood and Leonhard Helminger and Xu Chen and Tanmay Shah and Daoye Wang and Stephan Garbin and Sergio Orts-Escolano and Otmar Hilliges and Dmitry Lagun and Jérémy Riviere and Paulo Gotardo and Thabo Beeler and Abhimitra Meka and Kripasindhu Sarkar},
  year      = {2024},
  booktitle = {ACM SIGGRAPH Asia 2024 Conference Papers},
  doi       = {10.1145/3680528.3687580},
  url       = {https://doi.org/10.1145/3680528.3687580}
}