Abstract

NeRFs have enabled highly realistic synthesis of human faces, including complex appearance and reflectance effects of hair and skin. These methods typically require a large number of multi-view input images, making the process hardware-intensive and cumbersome, and limiting applicability in unconstrained settings. We propose a novel volumetric human face prior that enables the synthesis of ultra high-resolution novel views of subjects that are not part of the prior's training distribution. This prior model consists of an identity-conditioned NeRF, trained on a dataset of low-resolution multi-view images of diverse humans with known camera calibration. A simple sparse landmark-based 3D alignment of the training dataset allows our model to learn a smooth latent space of geometry and appearance despite a limited number of training identities. A high-quality volumetric representation of a novel subject can be obtained by model fitting to two or three camera views of arbitrary resolution. Importantly, our method requires as few as two views of casually captured images as input at inference time.

We recommend using Chrome to display the videos correctly. If a video does not play automatically, click the play button.

Please note that we can only show a limited number of high-resolution video results on this website. Please download the extended website for more results.


Ultra High-resolution Synthesis from Studio Captures

Given 3 views of a held-out test subject from our dataset, we show high-quality novel view synthesis. Note the 3D-consistent rendering of details such as hair strands and eyelashes, and view-dependent effects, for example on the forehead.

Input Views (4K×6K)

Novel View Synthesis (cropped to center)

Novel View Synthesis (2K×2K)





High-resolution Synthesis on the FaceScape Dataset

We show novel view synthesis results at 2K resolution using only two input views of subjects from the FaceScape dataset. Note that our prior model was trained on a different dataset, so these results represent an out-of-distribution setting.

Input Views

Novel View Synthesis



In-the-wild Captures

We demonstrate the generalisation capability of our method on in-the-wild mobile camera captures. With just two input views, our method is able to generate highly consistent and photorealistic free-view renders of a subject. It not only reconstructs coherent geometry but also learns to interpolate view-dependent specularities, such as on the hair and skin.

Input Views

Novel View Synthesis

Input Views

Novel View Synthesis



Method

We show that, given a good initialization, a NeRF can be fine-tuned from only a few images. Our key idea is to train a “prior model” on a large collection of low-resolution, multi-view images. The prior model learns a smooth latent space over face geometry and appearance that can be fine-tuned to ultra-high resolution. Our approach generalises to images captured in the wild using a mobile camera.
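As a concrete illustration of this idea, the sketch below shows one plausible two-stage fitting loop in PyTorch: first optimising only a new identity code against the frozen prior, then fine-tuning all weights together with the code. Everything here (PriorNeRF, the dummy ray batches, the stage lengths) is an illustrative placeholder under our assumptions, not the released implementation.

import torch

# Toy stand-in for the identity-conditioned prior: maps a 3D point and an
# identity code z to (density, RGB). Placeholder architecture, not the paper's.
class PriorNeRF(torch.nn.Module):
    def __init__(self, z_dim=64, hidden=128):
        super().__init__()
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(3 + z_dim, hidden), torch.nn.ReLU(),
            torch.nn.Linear(hidden, 4),
        )

    def forward(self, x, z):
        return self.mlp(torch.cat([x, z.expand(x.shape[0], -1)], dim=-1))

model = PriorNeRF()  # assume weights were trained on the multi-view prior dataset
z = torch.zeros(1, 64, requires_grad=True)  # identity code for the new subject

def dummy_batch():
    # Stand-ins for ray samples and supervision from the 2-3 input views;
    # in practice one volume-renders rays and compares against pixel colours.
    return torch.rand(1024, 3), torch.rand(1024, 4)

# Stage 1: fit only the identity code, keeping the prior weights frozen.
opt_z = torch.optim.Adam([z], lr=1e-2)
for _ in range(200):
    x, target = dummy_batch()
    opt_z.zero_grad()
    torch.nn.functional.mse_loss(model(x, z), target).backward()
    opt_z.step()

# Stage 2: jointly fine-tune network weights and code at the target resolution.
opt_all = torch.optim.Adam(list(model.parameters()) + [z], lr=1e-4)
for _ in range(200):
    x, target = dummy_batch()
    opt_all.zero_grad()
    torch.nn.functional.mse_loss(model(x, z), target).backward()
    opt_all.step()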



Prior Model Latent Space

Alignment of faces in our dataset allows us to learn a continuous latent space, where the embeddings of training identities can be interpolated to achieve plausible intermediate identities. Note that we do not train our model in an adversarial manner but only with reconstruction losses.
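To make the alignment step concrete, the snippet below sketches a standard Umeyama similarity fit that maps a subject's sparse 3D landmarks onto a canonical template; the 68-point layout and the dummy arrays are assumptions for illustration, not our exact pipeline.

import numpy as np

def umeyama_similarity(src, dst):
    """Least-squares similarity transform with dst ≈ s * R @ src + t."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    src_c, dst_c = src - mu_s, dst - mu_d
    cov = dst_c.T @ src_c / len(src)
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:  # guard against reflections
        S[2, 2] = -1.0
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) / src_c.var(0).sum()
    t = mu_d - s * R @ mu_s
    return s, R, t

# Align a subject's detected landmarks to a canonical template (dummy data).
template = np.random.rand(68, 3)  # canonical landmark positions
detected = np.random.rand(68, 3)  # subject landmarks in world space
s, R, t = umeyama_similarity(detected, template)
aligned = (s * (R @ detected.T)).T + t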

Random Sampling

Interpolation



Geometry

We visualise the image-space geometry estimated by our method. Note the 3D-consistent depth and normals. The normals in the hair appear grey due to semi-transparent density.
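The quantities visualised here follow from standard volume rendering along each camera ray: the depth is the weight-averaged sample depth, the foreground matte is the accumulated weight, and normals can be taken as the negative normalised density gradient. A minimal single-ray sketch, where sigma_fn is a toy placeholder for the fitted density field:

import torch

def sigma_fn(x):
    # Toy density field (a soft unit sphere); stands in for the NeRF's density head.
    return torch.relu(1.0 - x.norm(dim=-1))

# Samples along one ray with origin o, direction d, and depths t.
o = torch.tensor([0.0, 0.0, -2.0])
d = torch.tensor([0.0, 0.0, 1.0])
t = torch.linspace(0.5, 3.5, 128)
x = (o + t[:, None] * d).requires_grad_(True)

sigma = sigma_fn(x)
delta = torch.cat([t[1:] - t[:-1], t.new_tensor([1e10])])
alpha = 1.0 - torch.exp(-sigma * delta)            # per-sample opacity
T = torch.cumprod(torch.cat([alpha.new_ones(1), 1.0 - alpha[:-1]]), dim=0)
w = T * alpha                                      # volume-rendering weights

depth = (w * t).sum()                              # expected termination depth
matte = w.sum()                                    # accumulated foreground alpha
grad = torch.autograd.grad(sigma.sum(), x)[0]      # d(sigma)/dx at each sample
normals = -torch.nn.functional.normalize(grad, dim=-1)
surface_normal = (w[:, None] * normals).sum(0)     # weight-averaged normal

Semi-transparent regions such as hair spread the weights over many samples, which averages the normals towards grey in the visualisation.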

Input Views

Colour

Depth

Normals

Foreground Matte



Comparison with Related Works

We compare with related works at 1K resolution given two input views.

Input Views

EG3D-based prior

Ours

*We made a considerable effort to train KeypointNeRF at 1K resolution, but found that its results at 256×256 resolution are of much higher quality than its results at 1K. The video therefore presents its results at 256×256 resolution.


Limitations

While our method achieves state-of-the-art results in high-resolution face synthesis, it struggles with strong expressions and large accessories. This is due to limited coverage in our training dataset: it contains only neutral faces, and none of the subjects wore voluminous clothing such as sweaters or jackets. This limitation could be mitigated by training a more diverse prior model that includes these modalities.

Grin

Input Views

Novel View Synthesis

Cap and Eyeglasses

Input Views

Novel View Synthesis



If you find this work useful, please consider citing:
@inproceedings{buhler2023preface,
  title={Preface: A Data-driven Volumetric Prior for Few-shot Ultra High-resolution Face Synthesis},
  author={B{\"u}hler, Marcel C and Sarkar, Kripasindhu and Shah, Tanmay and Li, Gengyan and Wang, Daoye and Helminger, Leonhard and Orts-Escolano, Sergio and Lagun, Dmitry and Hilliges, Otmar and Beeler, Thabo and others},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={3402--3413},
  year={2023}
}




Acknowledgements

We thank Emre Aksan for insightful discussions and Malte Prinzler for sharing DINER results.

We are grateful to NeRFies for providing the website template.