GANtlitz: Ultra High Resolution Generative Model for Multi-Modal Face Textures

Aurel Gruber¹ (aurel.gruber@inf.ethz.ch), Edo Collins² (edocollins@google.com), Abhimitra Meka² (abhim@google.com), Franziska Mueller² (franziskamu@google.com), Kripasindhu Sarkar² (krsarkar@google.com), Sergio Orts-Escolano² (sorts@google.com), Luca Prasso² (lprasso@google.com), Jay Busch² (jbusch@google.com), Markus Gross¹ (grossm@inf.ethz.ch), Thabo Beeler² (tbeeler@google.com)

¹ETH Zurich, Switzerland · ²Google

Abstract

High-resolution texture maps are essential to render photoreal digital humans for visual effects or to generate data for machine learning. The acquisition of high-resolution assets at scale is cumbersome: it involves enrolling a large number of human subjects, using expensive multi-view camera setups, and significant manual artistic effort to align the textures. To alleviate these problems, we introduce GANtlitz, a generative model that can synthesize multi-modal ultra-high-resolution face appearance maps for novel identities. Our method solves three distinct challenges: 1) the unavailability of the very large data corpus generally required for training generative models, 2) the memory and computational limitations of training a GAN at ultra-high resolutions, and 3) the consistency of appearance features such as skin color, pores, and wrinkles in high-resolution textures across different modalities. We introduce dual-style blocks, an extension to the style blocks of the StyleGAN2 architecture, which improve multi-modal synthesis. Our patch-based architecture is trained only on image patches obtained from a small set of face textures (<100) and yet allows us to generate seamless appearance maps of novel identities at 6k × 4k resolution. Extensive qualitative and quantitative evaluations and baseline comparisons show the efficacy of our proposed system.

Method

During training we sample overlapping 512 × 512 patches from our dataset and prompt our StyleGAN2-based generator to produce patches for the same locations. Since our UV textures are aligned across the training set, we can ease the generator's task by multiplying the UV layout mask onto the generated samples. These real and fake samples are augmented before being fed to a patch-location-aware discriminator. While we apply several augmentations at an adaptive rate based on discriminator performance, in line with prior work, we additionally apply a frequency mid-band filter with random frequency bands at a constant rate. This augmentation directs the discriminator's attention equally to all frequencies and yields a more frequency-diverse generator.
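As a concrete illustration, the following PyTorch sketch shows one way such a mid-band filter could be implemented: a randomly placed band-pass in the Fourier domain, applied identically to real and generated patches. The band parametrization and width are assumptions for illustration, not the paper's exact settings.

import torch

def midband_filter(images, band_width=0.2):
    # Band-pass a batch of (B, C, H, W) patches with a randomly placed
    # radial frequency band: shift the spectrum so low frequencies are
    # centered, zero everything outside the band, and transform back.
    _, _, H, W = images.shape
    spectrum = torch.fft.fftshift(torch.fft.fft2(images), dim=(-2, -1))

    # Normalized radial frequency grid (0 at the center, ~0.71 in corners).
    fy = torch.linspace(-0.5, 0.5, H, device=images.device).view(H, 1)
    fx = torch.linspace(-0.5, 0.5, W, device=images.device).view(1, W)
    radius = torch.sqrt(fx ** 2 + fy ** 2)

    # Random band location per call, constant width (assumed scheme).
    lo = torch.rand((), device=images.device) * (radius.max() - band_width)
    mask = ((radius >= lo) & (radius < lo + band_width)).float()

    filtered = torch.fft.ifft2(torch.fft.ifftshift(spectrum * mask, dim=(-2, -1)))
    return filtered.real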
Multi-modal generation is achieved with a mechanism we term dual-style blocks. The generator is conditioned on the desired modality, and multi-modal samples are generated over successive evaluations, one per modality. The conditioning input is a learned per-modality embedding, which is used both as an additional style input and as a learned bias offset in each convolutional style block.
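The sketch below shows one plausible reading of a dual-style block, based on StyleGAN2's modulated convolution: the per-channel style combines an affine transform of the latent with an affine transform of the learned modality embedding, and a per-modality bias offset is added to the output. The class and member names, and the additive combination of the two styles, are our assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DualStyleBlock(nn.Module):
    # One modulated 3x3 convolution in the style of StyleGAN2, extended
    # with a second style derived from a learned per-modality embedding
    # and a learned per-modality bias offset.
    def __init__(self, in_ch, out_ch, w_dim, emb_dim, num_modalities, k=3):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k))
        self.affine_w = nn.Linear(w_dim, in_ch)    # style from latent w
        self.affine_m = nn.Linear(emb_dim, in_ch)  # style from modality embedding
        self.modality_emb = nn.Embedding(num_modalities, emb_dim)
        self.bias = nn.Parameter(torch.zeros(out_ch))
        self.bias_offset = nn.Embedding(num_modalities, out_ch)
        self.pad = k // 2

    def forward(self, x, w, modality):
        B = x.shape[0]
        emb = self.modality_emb(modality)               # (B, emb_dim)
        style = self.affine_w(w) + self.affine_m(emb)   # combined per-channel scales
        weight = self.weight[None] * style[:, None, :, None, None]
        # Demodulation, as in StyleGAN2.
        demod = torch.rsqrt((weight ** 2).sum(dim=(2, 3, 4)) + 1e-8)
        weight = weight * demod[:, :, None, None, None]
        # Grouped-convolution trick to apply per-sample weights in one call.
        x = x.reshape(1, -1, *x.shape[2:])
        weight = weight.reshape(-1, *weight.shape[2:])
        out = F.conv2d(x, weight, padding=self.pad, groups=B)
        out = out.reshape(B, -1, *out.shape[2:])
        bias = self.bias + self.bias_offset(modality)   # (B, out_ch)
        return out + bias[:, :, None, None]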

Results


Sample Gallery

The following samples were generated with our method. Note the high level of detail and the good global consistency. For more samples and interactive zooming, see the section Full Samples.

Latent Interpolation

Despite training on fewer than 100 samples, we obtain a smooth latent space, as conveyed by this animation. Note that the noise maps remain fixed; see the next section for noise map sampling.
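For concreteness, a minimal sketch of such an interpolation, written against a hypothetical generator interface (the sample_noise helper and the modality keyword are assumed, not the paper's actual API):

import torch

@torch.no_grad()
def interpolate_identities(G, z0, z1, num_modalities=3, steps=8):
    # Lerp between two latents while keeping the per-layer noise maps
    # fixed, so only identity changes between frames.
    noise = G.sample_noise(seed=0)  # hypothetical helper: fixed noise maps
    frames = []
    for t in torch.linspace(0.0, 1.0, steps):
        z = torch.lerp(z0, z1, t)
        # One generator evaluation per modality, all sharing z and noise.
        frames.append([G(z, modality=m, noise=noise) for m in range(num_modalities)])
    return frames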

Noise Sampling

The StyleGAN architecture additionally enables resampling of the noise map inputs: while the coarse structure is preserved, novel variations of wrinkle and freckle patterns appear.
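Under the same hypothetical interface as above, noise resampling holds the latent fixed and redraws only the noise inputs:

import torch

@torch.no_grad()
def resample_noise(G, z, num_variants=4, modality=0):
    # A fixed latent preserves the coarse structure; fresh noise maps
    # vary fine detail such as wrinkle and freckle patterns.
    return [G(z, modality=modality, noise=G.sample_noise(seed=s))
            for s in range(num_variants)]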


Full Samples

The following offers a more interactive exploration of some generated samples. The zoom windows at the bottom allow close-up inspection of arbitrary regions.

Applications

On top of sample generation, the model allows for some intriguing applications based on inversion. We present two: modality completion and super-resolution.

Modality Completion

Given a target albedo map, we can invert the texture into our model while conditioning the generator on the albedo embedding. Subsequently, we can generate the other modalities through separate generator evaluations conditioned on the respective embeddings. A minimal sketch of this procedure follows the figure below.
(Figure: target, albedo fit, generated specular map, generated normal map.)
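The sketch assumes hypothetical helpers (G.mean_latent, G.synthesize) and integer modality indices rather than the paper's actual interface:

import torch
import torch.nn.functional as F

ALBEDO, SPECULAR, NORMAL = 0, 1, 2  # assumed modality indices

def complete_modalities(G, albedo_target, num_steps=500, lr=0.01):
    # Optimize a latent so the generator, conditioned on the albedo
    # embedding, reproduces the target texture.
    w = G.mean_latent().clone().requires_grad_(True)  # hypothetical init helper
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(num_steps):
        loss = F.l1_loss(G.synthesize(w, modality=ALBEDO), albedo_target)
        opt.zero_grad()
        loss.backward()
        opt.step()
    # Reuse the fitted latent for the remaining modalities.
    with torch.no_grad():
        specular = G.synthesize(w, modality=SPECULAR)
        normal = G.synthesize(w, modality=NORMAL)
    return specular, normal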

Super Resolution

We can also fit to a low-resolution target. Some additional care must be taken to avoid degenerate pixels, since many high-resolution pixels map to each low-resolution target pixel. Afterwards, we can again generate the other modalities, as sketched below the figure.
(Figure: target, low-resolution albedo fit, generated albedo, generated specular map, generated normal map.)
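The fitting loop is similar to the previous sketch, except the generated texture is downsampled before comparison with the target. The latent regularizer shown is one possible safeguard against degenerate pixels; the paper's exact safeguard is not reproduced here.

import torch
import torch.nn.functional as F

ALBEDO = 0  # assumed albedo modality index, as above

def fit_lowres(G, lowres_target, scale=8, num_steps=500, lr=0.01):
    # Optimize the latent so the *downsampled* output matches the target;
    # many high-res pixels contribute to each low-res pixel, so the fit
    # is underconstrained without extra regularization.
    w_mean = G.mean_latent()
    w = w_mean.clone().requires_grad_(True)
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(num_steps):
        highres = G.synthesize(w, modality=ALBEDO)
        pred = F.avg_pool2d(highres, kernel_size=scale)
        loss = F.l1_loss(pred, lowres_target)
        loss = loss + 1e-3 * (w - w_mean).square().mean()  # keep w plausible
        opt.zero_grad()
        loss.backward()
        opt.step()
    return G.synthesize(w.detach(), modality=ALBEDO)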

Citation

@inproceedings{gruber2024gantlitz,
    title={GANtlitz: Ultra High Resolution Generative Model for Multi-Modal Face Textures},
    author={Aurel Gruber and Edo Collins and Abhimitra Meka and Franziska Mueller and 
    Kripasindhu Sarkar and Sergio Orts-Escolano and Luca Prasso and Jay Busch and 
    Markus Gross and Thabo Beeler},
    booktitle={Computer Graphics Forum},
    year={2024}
}