Visual dubbing pipeline with localized lip-sync and two-pass identity transfer
Dhyey Patel, Houssem Zouaghi, Sudhir Mudur, Eric Paquette, Serge Laforest, Martin Rouillard, Tiberiu Popa.
Computers & Graphics, Volume 110, pages 19-27, 2023. Presented at the ACM SIGGRAPH Conference on Motion, Interaction and Games (MIG), Guanajuato, Mexico, November 3-5, 2022.
Visual dubbing uses visual computing and deep learning to alter the actor's lip and mouth articulations so that they sync with the dubbed speech. It has the potential to greatly improve the content produced by the dubbing industry, for which the quality of the dubbed result is of primary importance. An important requirement is that visual lip-sync changes be localized to the mouth region and not affect the rest of the actor's face or the rest of the video frame. Current methods can create realistic-looking fake faces with expressions. However, many fail to localize lip sync and suffer from quality problems such as identity loss, low resolution, blurring, loss of facial skin features or color, and temporal jitter. These problems arise mainly because end-to-end training of networks to correctly disentangle the different visual dubbing parameters (pose, skin color, identity, lip movements, etc.) is very difficult to achieve. Our main contribution is a new visual dubbing pipeline in which, instead of end-to-end training, we apply different disentangling techniques incrementally for each parameter. Our pipeline is composed of three main steps: pose alignment, identity transfer, and video reassembly. Expert models in each step are fine-tuned for the actor. We propose an identity transfer network with an added style block which, with pre-training, is able to decouple face components, specifically identity and expression, and also works with short video clips such as TV ads. Our pipeline also includes novel stages for temporal smoothing of the reenacted face, actor-specific super-resolution to retain fine facial details, and a second pass through the identity transfer network to preserve actor identity. Localization of lip-sync is achieved by restricting changes in the original video frame to just the actor's mouth region. The results are convincing, and a user survey confirms their quality. Relevant quantitative metrics are included.
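The abstract states that lip-sync changes are localized by restricting edits to the actor's mouth region. As an illustration only, not the authors' code, the sketch below shows one way such masked compositing could be done with OpenCV and NumPy; the function name, the feathered-mask blending, and the `mouth_polygon` input (assumed to come from an external landmark detector) are all assumptions for the example.

```python
# Minimal sketch (not the paper's implementation): composite only the mouth
# region of the reenacted face back into the original frame, leaving every
# other pixel of the frame untouched.
import cv2
import numpy as np

def composite_mouth_region(original_frame: np.ndarray,
                           reenacted_frame: np.ndarray,
                           mouth_polygon: np.ndarray,
                           feather_px: int = 15) -> np.ndarray:
    """Blend the reenacted mouth region into the original frame.

    original_frame, reenacted_frame: uint8 images of identical size.
    mouth_polygon: (N, 2) pixel coordinates outlining the mouth region
                   (hypothetical input from a landmark detector).
    """
    # Binary mask covering only the mouth polygon.
    mask = np.zeros(original_frame.shape[:2], dtype=np.float32)
    cv2.fillConvexPoly(mask, mouth_polygon.astype(np.int32), 1.0)

    # Feather the mask edges so the seam between the edited and the
    # untouched pixels is not visible.
    k = 2 * feather_px + 1  # kernel size must be odd
    mask = cv2.GaussianBlur(mask, (k, k), 0)
    mask = mask[..., None]  # broadcast over the color channels

    # Alpha-blend: mouth pixels come from the reenacted frame,
    # everything else stays exactly as in the original video.
    out = (mask * reenacted_frame.astype(np.float32)
           + (1.0 - mask) * original_frame.astype(np.float32))
    return out.astype(np.uint8)
```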
Visual dubbing, Reenactment, Style transfer
@article{Patel2023,
  author  = {Dhyey Patel and Houssem Zouaghi and Sudhir Mudur and Eric Paquette and Serge Laforest and Martin Rouillard and Tiberiu Popa},
  title   = {Visual dubbing pipeline with localized lip-sync and two-pass identity transfer},
  journal = {Computers \& Graphics},
  volume  = {110},
  pages   = {19--27},
  year    = {2023},
}
Official published paper: https://www.sciencedirect.com/science/article/abs/pii/S0097849322001984
Preliminary version of the paper.
Pre-print version of the video.