Learning to Stabilize Faces


Abstract

Modern facial scanning produces high-quality meshes, but these often require subsequent stabilization to remove rigid head movement. This is crucial for game and movie character development, where deformation due to expressions needs to be isolated from global head motion. Manual stabilization is tedious, which has motivated automated approaches. However, existing methods are flawed: they require manual input, are imprecise or slow, or need temporally consistent data. We propose a new learning-based approach which is precise and fully automatic. We frame stabilization as a regression problem, use a 3DMM to generate training data, and predict rigid transforms that stabilize pairs of misaligned face meshes. Experiments confirm our method's effectiveness on both random expression sets and facial performances.

High-level Method Overview

The core idea of our approach is to exploit a 3D morphable model (3DMM) to generate a diverse dataset of misaligned facial expression pairs with known ground-truth (GT) rigid transformations, and to train an ML model to regress the rigid alignment. As shown before [1], a high-quality synthetic dataset suffices for the model to naturally generalize to real-world samples. We use a standard registration pipeline [2, 3] to produce a large dataset of registered facial meshes containing ~2,500 identities and ~38,000 expressions. An example is shown below.
[Figure: examples of registered facial meshes from our dataset]
We use a custom 3DMM with a formulation similar to [1]. The 3DMM is used to sample identities, expressions, and neck rotations. Specifically, we construct intra-subject pairs with the corresponding GT rigid transformations to form the training dataset; a sketch of this sampling is shown below.
[Figure: construction of intra-subject training pairs]
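
To make the pair construction concrete, below is a minimal sketch of how one training sample could be generated. The tdmm wrapper with its sample_identity/sample_expression methods is a hypothetical stand-in (our 3DMM and its API are not public), and the misalignment ranges are illustrative only:

    import numpy as np
    from scipy.spatial.transform import Rotation

    def make_training_pair(tdmm, rng, max_angle_deg=15.0, max_trans_m=0.02):
        # `tdmm` is a hypothetical wrapper around a 3DMM, exposing
        # `sample_identity()` and `sample_expression(identity)`, both of
        # which return stabilized (V, 3) vertex arrays in a common frame.
        identity = tdmm.sample_identity()
        source = tdmm.sample_expression(identity)
        target = tdmm.sample_expression(identity)

        # Random rigid misalignment applied to the source; its inverse is
        # the GT transform the network must regress.
        angles = rng.uniform(-max_angle_deg, max_angle_deg, size=3)
        R = Rotation.from_euler("xyz", angles, degrees=True).as_matrix()
        t = rng.uniform(-max_trans_m, max_trans_m, size=3)
        source_misaligned = source @ R.T + t

        # GT alignment maps the misaligned source back onto the target's
        # stabilized frame: x -> R_gt @ x + t_gt.
        R_gt = R.T
        t_gt = -R.T @ t
        return source_misaligned, target, R_gt, t_gt
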
We devise a pre-processing scheme which simplifies the alignment task for the ML model. Specifically, we mask out the facial parts that carry no signal useful for stabilization, and we pre-align the source to the target using Procrustes alignment; see the sketch after the figure below.
[Figure: pre-processing — masking and Procrustes pre-alignment]
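
The pre-alignment can be solved in closed form with the classic Kabsch solution to the orthogonal Procrustes problem, sketched below on the masked vertex sets. Which vertices the boolean mask keeps is an assumption here, not a detail taken from the text above:

    import numpy as np

    def procrustes_prealign(source, target, mask):
        # Rigidly pre-align `source` to `target` (both (V, 3)) using only
        # the vertices selected by the boolean `mask` (the regions kept
        # after masking out uninformative facial parts).
        # Classic Kabsch solution of the orthogonal Procrustes problem.
        a = source[mask] - source[mask].mean(axis=0)
        b = target[mask] - target[mask].mean(axis=0)
        U, _, Vt = np.linalg.svd(a.T @ b)
        d = np.sign(np.linalg.det(U @ Vt))   # guard against reflections
        R = (U @ np.diag([1.0, 1.0, d]) @ Vt).T
        t = target[mask].mean(axis=0) - source[mask].mean(axis=0) @ R.T
        return source @ R.T + t              # pre-aligned source
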
Finally, we train an ML model which takes the preprocessed pair of source and target facial meshes as input and predicts a rigid transformation that aligns the source to the target.
[Figure: overview of the ML model]
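
For illustration only, the following sketch shows one plausible input/output contract for such a regressor in PyTorch. The plain-MLP architecture and the 6D rotation parameterization (Zhou et al., CVPR 2019) are our assumptions; the text above only fixes what goes in (a pre-processed mesh pair) and what comes out (a rigid transform):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def rot6d_to_matrix(x):
        # Gram-Schmidt on the 6D rotation parameterization of
        # Zhou et al. 2019, yielding a valid rotation matrix.
        a1, a2 = x[:, :3], x[:, 3:]
        b1 = F.normalize(a1, dim=-1)
        b2 = F.normalize(a2 - (b1 * a2).sum(-1, keepdim=True) * b1, dim=-1)
        b3 = torch.cross(b1, b2, dim=-1)
        return torch.stack([b1, b2, b3], dim=-2)

    class RigidRegressor(nn.Module):
        # Illustrative stand-in for the actual network: the paper text
        # does not prescribe this plain-MLP architecture, only the
        # contract of mesh pair in, rigid transform out.
        def __init__(self, num_vertices):
            super().__init__()
            self.mlp = nn.Sequential(
                nn.Linear(num_vertices * 6, 1024), nn.ReLU(),
                nn.Linear(1024, 256), nn.ReLU(),
                nn.Linear(256, 9),   # 6D rotation + 3D translation
            )

        def forward(self, source, target):
            # source, target: (B, V, 3) pre-processed vertex arrays.
            x = torch.cat([source, target], dim=-1).flatten(1)
            out = self.mlp(x)
            return rot6d_to_matrix(out[:, :6]), out[:, 6:]
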

Results

Below are videos of stabilized facial performances and comparisons to state-of-the-art (SotA) methods. The text highlights the typical failure modes of the competing methods. On the left of each video is the original frontal camera view from our multi-camera capture studio. Second from the left is our original unstabilized registered mesh, and the remaining columns show the stabilized results produced by our method (OUR) and each of the compared methods.

[Video columns, left to right: Input video · Original · OUR · CMAP · UNPOSE · PROCupper]

This performance contains small but abrupt head rotations about the frontal axis, i.e. side-to-side head tilting in the coronal plane. Note that, especially when the subject turns their lips down and tilts their head at the same time, the method UNPOSE introduces a spurious global motion in which the chin appears to draw inwards to compensate for the neck rotation.

[Video]

The subject continuously moves and rotates their head as they speak, and this global motion is correctly undone by all the methods. However, note that when the subject blinks, the method CMAP subtly yet visibly rotates the head upwards, which, when focusing on the texture, appears as if the subject raised their eyebrows or forehead. It is evident from the input RGB video that no such motion happened in reality.

[Video]

Over the course of the utterance, the subject mildly but continuously rotates and moves their head. The performance is neutral speech with no extreme expressions, so no strong motion of the ears should appear. However, focusing on the right ear of the stabilized sequences, one can notice that, compared to the other methods, UNPOSE exhibits much more visible global motion.

[Video]

Over the course of the utterance, the subject rotates and moves their head quite abruptly. Note that as the head tilts side-to-side in the coronal plane, the method UNPOSE wrongly compensates with a global head rotation. Throughout the performance, focusing on the forehead region and the occlusion boundary of the scalp reveals subtle but continuous spurious global motion introduced by the method CMAP.

[Video]

The performance contains high-frequency, low-amplitude global head motion. Focusing on the nose bridge area of the stabilized meshes, a small spurious leftover motion is visible for the method UNPOSE.

[Video]

Similar to the video above, the nose bridge area of the stabilized meshes reveals a subtle unwanted motion left by the method UNPOSE.

Limitations

While our method operates on meshes only at test time, training it requires a 3DMM which produces stable faces by design (in our case, thanks to the use of third-party FACS-like blendshapes) and which is computed from a large and diverse dataset of registered meshes.

Citation

    @inproceedings{bednarik2024stabilization,
        author    = {Jan Bednarik and Erroll Wood and Vassilis Choutas and Timo Bolkart and Daoye Wang and Chenglei Wu and Thabo Beeler},
        title     = {Learning to Stabilize Faces},
        booktitle = {Eurographics},
        year      = {2024}
    }