TL;DR:
We train a synthetic face prior for few-shot, personalized 3D face modeling in the wild. Although the model never sees a single real image during pre-training, it generalizes to real and stylized face images during inference. Our new model and few-shot fine-tuning method can robustly reconstruct expressive faces under challenging in-the-wild lighting and synthesize photorealistic novel views from only three inputs.
We train an implicit prior model on synthetic renderings of a 3DMM combined with assets such as hair, beards, and clothing. The prior takes as input a frozen 3DMM identity code β, a frozen 3DMM expression code ψ, and an optimizable latent code w, which models appearance, clothing, and other out-of-3DMM characteristics. The prior model F is a conditional radiance field optimized to reconstruct the training identities. Please see Sec. 3.2 in the paper for details.
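To make the conditioning concrete, here is a minimal PyTorch-style sketch of a conditional radiance field that maps a 3D point, a view direction, and the three codes β, ψ, and w to density and color. The layer sizes, code dimensions, and module structure are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class ConditionalRadianceField(nn.Module):
    """Toy conditional NeRF F(x, d | beta, psi, w) -> (density, rgb).

    beta: frozen 3DMM identity code, psi: frozen 3DMM expression code,
    w: optimizable latent for out-of-3DMM properties (hair, clothing, appearance).
    All dimensions and the architecture are illustrative assumptions.
    """
    def __init__(self, dim_beta=256, dim_psi=100, dim_w=128, hidden=256):
        super().__init__()
        in_dim = 3 + dim_beta + dim_psi + dim_w        # sample point + condition codes
        self.trunk = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.density_head = nn.Linear(hidden, 1)
        self.rgb_head = nn.Sequential(                  # view-dependent color branch
            nn.Linear(hidden + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid(),
        )

    def forward(self, x, d, beta, psi, w):
        # x, d: (N, 3) sample points and view directions.
        # The condition codes are shared by all samples, so broadcast them.
        cond = torch.cat([beta, psi, w], dim=-1).expand(x.shape[0], -1)
        h = self.trunk(torch.cat([x, cond], dim=-1))
        sigma = torch.relu(self.density_head(h))
        rgb = self.rgb_head(torch.cat([h, d], dim=-1))
        return sigma, rgb

# Usage sketch:
# model = ConditionalRadianceField()
# sigma, rgb = model(points, dirs, beta, psi, w)
```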
For inference, we capture in-the-wild face images and recover camera and 3DMM parameters from the uncalibrated input. We follow the approach of previous work [Wood et al. 2022] and fit to dense 2D landmarks. The outputs of this step are the camera parameters, a shared identity code, and per-image expression codes. The per-image expression codes are crucial for high-quality in-the-wild results because they compensate for small subject motion during capture (Sec. 3.3.2 in the paper).
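The fitting step can be pictured as a joint optimization of a shared identity code, per-image expression codes, and per-image cameras against the detected dense 2D landmarks. The sketch below assumes a differentiable 3DMM landmark function and a projection function are supplied as placeholders; the code dimensions, optimizer, and step counts are illustrative, not the paper's settings.

```python
import torch

def fit_3dmm_to_landmarks(landmarks_2d, landmarks_3d, project, cameras_init,
                          n_steps=500, lr=1e-2):
    """Recover a shared identity code and per-image expression codes by fitting
    projected 3DMM landmarks to detected dense 2D landmarks.

    landmarks_2d: (num_images, L, 2) detected landmarks per input image.
    landmarks_3d: callable (beta, psi) -> (L, 3), a differentiable 3DMM (placeholder).
    project:      callable (points_3d, camera) -> (L, 2) pinhole projection (placeholder).
    cameras_init: list of per-image camera parameter tensors to refine.
    """
    num_images = landmarks_2d.shape[0]
    beta = torch.zeros(256, requires_grad=True)                                 # shared identity code
    psis = [torch.zeros(100, requires_grad=True) for _ in range(num_images)]    # per-image expressions
    cams = [c.detach().clone().requires_grad_(True) for c in cameras_init]      # refined cameras
    opt = torch.optim.Adam([beta, *psis, *cams], lr=lr)

    for _ in range(n_steps):
        opt.zero_grad()
        loss = 0.0
        for i in range(num_images):
            pred_2d = project(landmarks_3d(beta, psis[i]), cams[i])
            loss = loss + ((pred_2d - landmarks_2d[i]) ** 2).mean()
        loss.backward()
        opt.step()
    return beta.detach(), [p.detach() for p in psis], [c.detach() for c in cams]
```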
Fine-tuning begins with a short warm-up phase in which only the latent code w_t for the target is optimized. After that, fine-tuning optimizes all model parameters under additional constraints on the geometry and the appearance weights. We randomly sample rays from all available inputs, typically three images, and restrict them to the foreground by multiplying with an estimated foreground mask. Please see Sec. 3.3.2 in the paper for more details.
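A minimal sketch of this two-stage schedule, assuming a hypothetical model.render call that volume-renders a batch of rays conditioned on β, the per-image ψ, and w_t; the hyper-parameters and the paper's geometry and appearance regularizers are omitted.

```python
import torch

def finetune(model, w_t, beta, psis, rays, colors, fg_masks, img_ids,
             warmup_steps=500, total_steps=3000, rays_per_step=1024, lr=1e-3):
    """Two-stage few-shot fine-tuning sketch (hyper-parameters and render API are assumptions).

    Stage 1 (warm-up): only the target latent code w_t is optimized.
    Stage 2: all prior-model parameters are optimized as well; the paper adds
    constraints on the geometry and the appearance weights (omitted here).

    rays, colors, fg_masks, img_ids: per-ray data pooled over all (typically three) inputs;
    psis: (num_images, dim_psi) frozen per-image expression codes;
    img_ids: (num_rays,) index of the source image for each ray.
    """
    opt = torch.optim.Adam([w_t], lr=lr)
    for step in range(total_steps):
        if step == warmup_steps:
            # After the warm-up, unfreeze the prior model itself.
            opt = torch.optim.Adam([w_t, *model.parameters()], lr=lr)
        idx = torch.randint(0, rays.shape[0], (rays_per_step,))
        psi = psis[img_ids[idx]]                        # per-image (frozen) expression codes
        rgb = model.render(rays[idx], beta, psi, w_t)   # volume rendering (assumed call)
        mask = fg_masks[idx].unsqueeze(-1)              # restrict the loss to foreground pixels
        loss = (mask * (rgb - colors[idx]) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model, w_t
```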
We demonstrate the generalization capability of our method on in-the-wild mobile camera captures. With just three input views, our method generates highly consistent and photorealistic free-view renders of a subject. It not only reconstructs coherent geometry but also learns to interpolate view-dependent specularities, such as on the hair and skin.
Synthesis from a single view naturally suffers from the bas-relief ambiguity, making it much more challenging, but our prior model still enables plausible results. We show smartphone images, selfies, an Internet image from the Rodin website, and stylized inputs from LP3D (Trevithick et al., 2023). Note that our prior model has never seen any stylization or strong face paint.
We train a synthetic prior model to capture the distribution of human heads with arbitrary facial expressions. The prior model is a conditional NeRF with three additional inputs: the identity β, the expression ψ, and out-of-3DMM properties w such as hair, clothing, and appearance.
We show interpolations for each of the latent spaces.
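Such interpolations amount to linearly blending two codes in one latent space while keeping the other inputs fixed. The helper below is a minimal sketch of that idea; the rendering call in the usage comment is hypothetical.

```python
import torch

def interpolate_code(z_a, z_b, num_steps=8):
    """Linearly interpolate between two codes (identity beta, expression psi,
    or latent w) to traverse one latent space while the other codes stay fixed."""
    alphas = torch.linspace(0.0, 1.0, num_steps)
    return [(1.0 - a) * z_a + a * z_b for a in alphas]

# Example: sweep the expression space with identity and latent fixed
# (render_views is a placeholder for rendering the conditional NeRF):
# for psi in interpolate_code(psi_a, psi_b):
#     render_views(model, beta, psi, w)
```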
Our synthetic prior is trained with glasses; however, only 20% of the synthetic identities wear them. In some cases, the fine-tuned results incorrectly paint the frames of glasses onto the skin (see the example in the first row). This happens when only a single input view shows the frames from a side angle. Furthermore, our method tends to produce poor results when the shoulders are not observed in the input (bottom left), and it fails for heavily out-of-distribution faces (bottom right). Note how the prior tries to reconstruct Gollum by adding glasses. Please see the paper for a discussion of limitations.
The bottom left image is from the Rodin website. The Gollum image (bottom right) is from The Lord of the Rings © New Line Cinema / Warner Bros.
While our focus is on novel view synthesis of static faces, future work could leverage similar prior models for facial animation. To demonstrate this, we test our model by fine-tuning to 12 smartphone images and interpolating random expressions from the synthetic pre-training set. Fine-tuning to multiple expressions helps preserve the expression space of the prior model, and enables the rendering of expressions that are not seen in the input images.
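As a rough sketch of how such an animation could be driven after fine-tuning, the snippet below interpolates between expression codes sampled from the synthetic pre-training set while keeping the fine-tuned identity β and latent w_t fixed; render_view, the key count, and the frame counts are assumptions.

```python
import torch

def render_expression_sweep(model, beta, w_t, psi_bank, cam, frames_per_segment=30):
    """Animate the fine-tuned subject by interpolating between expression codes
    drawn from the synthetic pre-training set; identity beta and latent w_t stay
    fixed (render_view is an assumed rendering call, psi_bank a (M, dim) tensor)."""
    keys = psi_bank[torch.randperm(len(psi_bank))[:4]]   # a few random key expressions
    frames = []
    for psi_a, psi_b in zip(keys[:-1], keys[1:]):
        for t in torch.linspace(0.0, 1.0, frames_per_segment):
            psi = (1.0 - t) * psi_a + t * psi_b
            frames.append(model.render_view(cam, beta, psi, w_t))
    return frames
```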
@incollection{buehler2024cafca,
  title     = {Cafca: High-quality Novel View Synthesis of Expressive Faces from Casual Few-shot Captures},
  author    = {Marcel C. Buehler and Gengyan Li and Erroll Wood and Leonhard Helminger and Xu Chen and Tanmay Shah and Daoye Wang and Stephan Garbin and Sergio Orts-Escolano and Otmar Hilliges and Dmitry Lagun and Jérémy Riviere and Paulo Gotardo and Thabo Beeler and Abhimitra Meka and Kripasindhu Sarkar},
  year      = {2024},
  booktitle = {ACM SIGGRAPH Asia 2024 Conference Papers},
  doi       = {10.1145/3680528.3687580},
  url       = {https://doi.org/10.1145/3680528.3687580}
}