
Abstract

Volumetric modeling and neural radiance field representations have revolutionized 3D face capture and photorealistic novel view synthesis. However, these methods often require hundreds of multi-view input images and are thus inapplicable to cases with fewer than a handful of inputs. We present a novel volumetric prior on human faces that allows for high-fidelity expressive face modeling from as few as three input views captured in the wild. Our key insight is that an implicit prior trained on synthetic data alone can generalize to extremely challenging real-world identities and expressions and render novel views with fine idiosyncratic details like wrinkles and eyelashes. We leverage a 3D Morphable Face Model to synthesize a large training set, rendering each identity with different expressions, hair, clothing, and other assets. We then train a conditional Neural Radiance Field prior on this synthetic dataset and, at inference time, fine-tune the model on a very sparse set of real images of a single subject. On average, the fine-tuning requires only three inputs to cross the synthetic-to-real domain gap. The resulting personalized 3D model reconstructs strong idiosyncratic facial expressions and outperforms the state of the art in high-quality novel view synthesis of faces from sparse inputs in terms of perceptual and photometric quality.
Please use the links in the navigation bar to quickly jump to results.
We recommend using Chrome for playing the videos.







Method

TL;DR

We train a synthetic face prior for few-shot, personalized 3D face modeling in the wild. Although the model never sees a single real image during pre-training, it generalizes to real and stylized face images during inference.
Our new model and few-shot fine-tuning method can robustly reconstruct expressive faces under challenging in-the-wild lighting and synthesize photorealistic novel views from only three inputs.
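
To make the few-shot fine-tuning step concrete, the following is a minimal, self-contained PyTorch sketch of the idea, not the paper's implementation: the pre-trained conditional field and the per-subject latent codes are optimized jointly against a photometric loss on pixels sampled from the sparse input views. The tiny stand-in network, code dimensions, and variable names are illustrative assumptions.

import torch
import torch.nn as nn

class TinyConditionalField(nn.Module):
    """Stand-in for the conditional NeRF prior: maps an encoded 3D sample
    plus identity/expression/latent codes to an RGB value."""
    def __init__(self, pos_dim=63, id_dim=32, expr_dim=16, latent_dim=32):
        super().__init__()
        in_dim = pos_dim + id_dim + expr_dim + latent_dim
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, 3), nn.Sigmoid(),
        )

    def forward(self, x, beta, psi, w):
        codes = torch.cat([beta, psi, w], dim=-1).expand(x.shape[0], -1)
        return self.mlp(torch.cat([x, codes], dim=-1))

# Pre-trained prior (randomly initialized here, for illustration only).
model = TinyConditionalField()

# Per-subject latent codes, also optimized during fine-tuning.
beta = nn.Parameter(torch.zeros(1, 32))   # identity
psi = nn.Parameter(torch.zeros(1, 16))    # expression
w = nn.Parameter(torch.zeros(1, 32))      # out-of-3DMM properties

opt = torch.optim.Adam([{"params": model.parameters(), "lr": 1e-4},
                        {"params": [beta, psi, w], "lr": 1e-3}])

# Dummy batch standing in for ray samples / pixels from the three input views.
encoded_samples = torch.randn(1024, 63)
target_rgb = torch.rand(1024, 3)

for step in range(100):
    pred_rgb = model(encoded_samples, beta, psi, w)
    loss = torch.mean((pred_rgb - target_rgb) ** 2)  # photometric L2 loss
    opt.zero_grad()
    loss.backward()
    opt.step()

In an actual system, the batch would come from rays cast through the registered input images and rendered with volume rendering; the toy batch above only stands in for that step.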















In-the-Wild Results

We demonstrate the generalization capability of our method to in-the-wild mobile camera captures. With just three input views, our method generates highly consistent and photorealistic free-view renders of a subject. It not only reconstructs coherent geometry but also learns to interpolate view-dependent specularities, such as on the hair and skin.








One-shot Synthesis In-the-wild

Synthesis from a single view naturally suffers from the bas-relief ambiguity, making it much more challenging, but our prior model still enables plausible results. We show smartphone images, selfies, an Internet image from the Rodin website, and stylized inputs from LP3D (Trevithick et al., 2023). Note that our prior model has never seen any stylization or strong face paint.








Synthetic Prior Model

We train a synthetic prior model to capture the distribution of human heads with arbitrary facial expressions. The prior model is a conditional NeRF with three additional inputs: an identity code β, an expression code ψ, and a latent code w for out-of-3DMM properties like hair, clothing, and appearance.

We show interpolations for each of the latent spaces.
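
As a rough illustration of how such latent-space interpolations can be produced (an assumption about the procedure, not the paper's code), one can linearly blend two codes in one latent space while holding the others fixed and render the prior at each blended code. render_view below is a hypothetical placeholder for a full volume renderer.

import torch

def lerp(a: torch.Tensor, b: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """Linear interpolation between two latent codes."""
    return (1.0 - t) * a + t * b

def render_view(beta, psi, w):
    # Placeholder: a real implementation would volume-render the conditional
    # NeRF for a chosen camera, conditioned on these codes.
    return None

beta_a, beta_b = torch.randn(1, 32), torch.randn(1, 32)  # two identity codes
psi, w = torch.zeros(1, 16), torch.zeros(1, 32)          # fixed expression and assets

# One frame per interpolation step between the two identities.
frames = [render_view(lerp(beta_a, beta_b, t), psi, w)
          for t in torch.linspace(0.0, 1.0, steps=30)]

The same recipe applies to the expression code ψ and the out-of-3DMM code w.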








Limitations

Our synthetic prior is trained with glasses; however, only 20% of the synthetic identities wear glasses. In some cases, the fine-tuned results incorrectly paint the frame of the glasses onto the skin (see the example in the first row). This happens when only a single input view shows the frame from a side angle. Furthermore, our method tends to produce poor results when the shoulders are not observed in the input (bottom left), and it fails for heavily out-of-distribution faces (bottom right). Note how the prior tries to reconstruct Gollum by adding glasses. Please see the paper for a discussion of limitations.

The bottom left image is from the Rodin website. The Gollum image (bottom right) is from The Lord of the Rings © New Line Cinema / Warner Bros.

(Figure: input/result pairs for the three failure cases described above; the first example is labeled "Inconsistent Frame for Glasses".)








Future Work: Animation

While our focus is on novel view synthesis of static faces, future work could leverage similar prior models for facial animation. To demonstrate this, we test our model by fine-tuning on 12 smartphone images and interpolating random expressions from the synthetic pre-training set. Fine-tuning on multiple expressions helps preserve the expression space of the prior model and enables the rendering of expressions that are not seen in the input images.
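
A small, illustrative sketch of this parameterization (assumed, not taken from the paper): all input images share one identity code and one out-of-3DMM code, while each image receives its own expression code, so the prior's expression space stays usable for novel expressions at render time.

import torch
import torch.nn as nn

num_images = 12
beta = nn.Parameter(torch.zeros(1, 32))           # identity code, shared by all images
w = nn.Parameter(torch.zeros(1, 32))              # out-of-3DMM code, shared
psis = nn.Parameter(torch.zeros(num_images, 16))  # one expression code per image

# During fine-tuning, rays from image i would be rendered with psis[i:i+1];
# at test time, expression codes from the synthetic set can be plugged in instead.
opt = torch.optim.Adam([beta, w, psis], lr=1e-3)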

(Figure: input images and renders of novel expressions.)

If you find this work useful, please consider citing:
@incollection{buehler2024cafca,
    title={Cafca: High-quality Novel View Synthesis of Expressive Faces from Casual Few-shot Captures},
    author={Marcel C. Buehler and Gengyan Li and Erroll Wood and Leonhard Helminger and Xu Chen and Tanmay Shah and Daoye Wang and Stephan Garbin and Sergio Orts-Escolano and Otmar Hilliges and Dmitry Lagun and Jérémy Riviere and Paulo Gotardo and Thabo Beeler and Abhimitra Meka and Kripasindhu Sarkar},
    year={2024},
    booktitle={ACM SIGGRAPH Asia 2024 Conference Papers},
    doi={10.1145/3680528.3687580},
    url={https://doi.org/10.1145/3680528.3687580}
}