Paper Link: https://arxiv.org/abs/2010.16078

Dataset Link: Here

Code Link: https://github.com/midas-research/linguistically-informed-frame-interpolation/

Abstract

In this work, we explore the linguistic aspects of video reconstruction, which are pivotal to the reliable reconstruction of speech videos. We develop a test suite to check the robustness of common techniques and to stress-test video reconstruction. Furthermore, we introduce a simple methodology that improves the quality of reconstruction by adding a Region of Interest (ROI) unit better suited to these linguistic tasks.

1. Introduction

Video frame interpolation and extrapolation is the task of synthesizing new video frames conditioned on the context of a given video. Contemporary applications of such interpolation include video playback software for increasing frame rates, video editing software for creating slow-motion effects, and virtual reality software for decreasing resource usage. From early per-pixel, phase-based shifting approaches, current methods have shifted to generating video frames using techniques such as optical flow or stereo methods. These approaches typically involve two steps: motion estimation and pixel synthesis. While motion estimation is needed to understand the movement of objects across frames, pixel synthesis focuses on generating the new pixel data.
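As a rough illustration of this two-step pipeline (not the method of any particular paper), the sketch below estimates dense optical flow between two frames with OpenCV's Farneback method and then backward-warps the first frame halfway along the flow to synthesize an intermediate frame; the file names are placeholders.

```python
import cv2
import numpy as np

def interpolate_midframe(frame0, frame1):
    """Synthesize a rough middle frame: (1) motion estimation via dense
    optical flow, (2) pixel synthesis by warping frame0 halfway along it."""
    gray0 = cv2.cvtColor(frame0, cv2.COLOR_BGR2GRAY)
    gray1 = cv2.cvtColor(frame1, cv2.COLOR_BGR2GRAY)

    # Step 1: motion estimation (Farneback dense optical flow, frame0 -> frame1).
    flow = cv2.calcOpticalFlowFarneback(gray0, gray1, None, pyr_scale=0.5,
                                        levels=3, winsize=15, iterations=3,
                                        poly_n=5, poly_sigma=1.2, flags=0)

    # Step 2: pixel synthesis -- crude backward warp of frame0 by half the flow.
    h, w = gray0.shape
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    map_x = (grid_x - 0.5 * flow[..., 0]).astype(np.float32)
    map_y = (grid_y - 0.5 * flow[..., 1]).astype(np.float32)
    return cv2.remap(frame0, map_x, map_y, interpolation=cv2.INTER_LINEAR)

# Hypothetical usage with two consecutive frames of a speech video.
mid = interpolate_midframe(cv2.imread("frame_000.png"), cv2.imread("frame_002.png"))
cv2.imwrite("frame_001_interp.png", mid)
```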

A related task to video frame interpolation is talking face generation: given an audio waveform, the task is to automatically synthesize a talking face. In recent times, these approaches have become popular for both academic and non-academic purposes. While, on one hand, they are being used to extend speechreading models to low-resource languages, on the other, many of them are also used to generate fake news and paid content. The literature on talking face generation, as well as on video interpolation and extrapolation, uses conventional video quality metrics such as root mean squared (RMS) distance and the structural similarity index to measure the quality of generated videos. Although these metrics evaluate the quality of the generated pixels well, none of these works shows how that quality translates into linguistically plausible video frames. This matters because speech is a linguistic problem and cannot be addressed solely by non-linguistic metrics such as RMS distance. The goal of this paper is to investigate how well different video interpolation and extrapolation algorithms capture linguistic differences between the generated videos.

Speech as a natural signal is composed of three parts: the visual modality, the audio modality, and the context in which it was spoken (crudely, the role played by language). Correspondingly, there are three tasks for modeling speech: speechreading (popularly known as lipreading), speech recognition (ASR), and language modeling. The part of speech closest to the speech video generation task is the visual modality, whose fundamental units are visemes. Calculating metrics such as mean squared error (MSE) over the whole video does not directly yield any information about these aspects of speech, which makes us question the faithfulness of the resulting reconstruction. Therefore, the focus of this work is to investigate a video generation model's understanding of the visual speech modality. To this end, we propose five tasks that probe different aspects of the visual speech modality.

Hence, with this work, we try to do the following:

  1. We explore different video interpolation and extrapolation networks for use on speechreading videos and propose a new auxiliary ROI loss that uses a visemic ROI unit to attain faithful reconstructions (a sketch of such a loss is given after this list).

  2. We test the video frame generation algorithms on different (linguistic) aspects of speech, such as visemic completion, generating prefixes and suffixes from context, and word-level understanding. These facets of language are critical to a model's language understanding.

  3. We release seven new challenge datasets corresponding to different aspects of language. These have been verified both automatically and manually and are meant to facilitate reproduction, follow-up testing, and interpretation.
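As a sketch of what such an ROI-weighted reconstruction loss could look like (the mask source, tensor shapes, and weighting factor below are our assumptions for illustration, not the paper's exact formulation), one can simply upweight the per-pixel error inside a mouth-region mask:

```python
import torch
import torch.nn.functional as F

def roi_reconstruction_loss(pred, target, roi_mask, roi_weight=5.0):
    """Reconstruction loss with extra weight on a visemic region of interest.

    pred, target: (B, T, C, H, W) generated and ground-truth video frames.
    roi_mask:     (B, T, 1, H, W) binary mouth-region mask, e.g. derived
                  from facial landmarks; values in {0, 1}.
    roi_weight:   how much more ROI pixels count than the rest (assumed value).
    """
    per_pixel = F.mse_loss(pred, target, reduction="none")  # full-frame error
    weights = 1.0 + (roi_weight - 1.0) * roi_mask           # 1 outside ROI, roi_weight inside
    return (weights * per_pixel).mean()
```

In practice, a term like this would be added to whatever full-frame reconstruction loss the base interpolation network already uses.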

2. Why don't current metrics suffice?

To show the ineffectiveness of existing metrics such as PSNR, SSIM, and MSE, we create a synthetic video clip of 32 frames by replicating the first frame 16 times and the last frame 16 times (shown in the second image). Even though the clip is contextually incorrect, since only two frames are used to construct the complete video, the metrics show high values, indicating that the synthetic video is a faithful reconstruction of the original. This result reinforces the fact that we need better metrics that account for the underlying context, and the values are on par with the metrics obtained by training FCN3D. The proposed method, in contrast, takes visemic reconstruction into account, enabling a more faithful reconstruction.

For the figure above, obtained by simple replication, we get a high PSNR of 20.2734, an SSIM of 0.6174879, and an MSE of 0.0093898. The one below has a PSNR of 19.6407, an SSIM of 0.761964, and an MSE of 0.01086.
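A minimal sketch of this replication check, assuming a clip stored as a float array in [0, 1] and using scikit-image's standard metric implementations (the paper's exact evaluation code may aggregate the metrics differently):

```python
import numpy as np
from skimage.metrics import (mean_squared_error, peak_signal_noise_ratio,
                             structural_similarity)

def replicate_endpoints(video):
    """Build the synthetic clip: first frame repeated 16 times, then the last frame 16 times."""
    half = len(video) // 2
    return np.concatenate([np.repeat(video[:1], half, axis=0),
                           np.repeat(video[-1:], len(video) - half, axis=0)])

def clip_metrics(original, synthetic):
    """MSE and PSNR over the whole clip; SSIM averaged per frame."""
    mse = mean_squared_error(original, synthetic)
    psnr = peak_signal_noise_ratio(original, synthetic, data_range=1.0)
    ssim = np.mean([structural_similarity(o, s, channel_axis=-1, data_range=1.0)
                    for o, s in zip(original, synthetic)])
    return psnr, ssim, mse

# Placeholder for a real 32-frame speech clip of shape (T, H, W, C), scaled to [0, 1].
video = np.random.rand(32, 64, 64, 3).astype(np.float32)
print(clip_metrics(video, replicate_endpoints(video)))
```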

For reference, the standard metrics used to judge the quality of image reconstruction are defined below.
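For a reference image $I$ and a reconstruction $K$, both of size $m \times n$ with maximum pixel value $\mathrm{MAX}_I$, the standard definitions are:

$$\mathrm{MSE} = \frac{1}{mn} \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} \bigl(I(i,j) - K(i,j)\bigr)^2$$

$$\mathrm{PSNR} = 10 \log_{10}\!\left(\frac{\mathrm{MAX}_I^2}{\mathrm{MSE}}\right)$$

$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x\mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)}$$

where $x$ and $y$ are corresponding local windows of the two images, $\mu$, $\sigma^2$, and $\sigma_{xy}$ denote their means, variances, and covariance, and $c_1$, $c_2$ are small constants that stabilize the division.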