
KeyVID: Keyframe-Aware Video Diffusion for Audio-Synchronized Visual Animation

¹Advanced Micro Devices, ²Johns Hopkins University

KeyVID generates intensive motion at a high frame rate from conditioning audio, while being trained only on low-frame-rate video data.

Abstract

Generating video from conditions such as text, image, and audio enables both spatial and temporal control, leading to high-quality generation results. Videos with dramatic motion often require a higher frame rate to ensure smooth motion. Currently, most audio-to-visual animation models use uniformly sampled frames from video clips. However, at low frame rates these uniformly sampled frames fail to capture the significant key moments in dramatic motions, while directly increasing the number of frames demands significantly more memory.

In this paper, we propose KeyVID, a keyframe-aware audio-to-visual animation framework that significantly improves generation quality for the key moments in audio signals while maintaining computational efficiency. Given an image and an audio input, we first localize keyframe time steps from the audio. Then, we use a keyframe generator to produce the corresponding visual keyframes. Finally, we generate all intermediate frames using a motion interpolator. Through extensive experiments, we demonstrate that KeyVID significantly improves audio-video synchronization and video quality across multiple datasets, particularly for highly dynamic motions.

Audio-Synchronized Visual Animation

Audio-synchronized visual animation (ASVA) aims to animate objects in static images into videos whose motion dynamics are semantically and temporally aligned with the input audio. The audio condition provides fine-grained control over the video generation, requiring each key action to be precisely aligned with the exact moment in the audio signal.

However, this synchronization is constrained by the frame rate of video generation models. Video diffusion models are typically trained to generate videos at a fixed FPS. Since audio carries fine-grained temporal information, the key moments in the audio can be lost in uniformly sampled low-frame-rate videos, compromising audio-video synchronization.
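To see why uniform low-frame-rate sampling hurts synchronization, consider the following toy example (NumPy only; the motion curve and numbers are made up for illustration, not taken from the paper): a sharp motion peak in a 48-frame clip is missed by a 12-frame uniform subsample but retained when frames are chosen at motion-score maxima.

import numpy as np

num_frames = 48                                   # e.g. 2 s at 24 fps
t = np.arange(num_frames)
motion = np.exp(-0.5 * ((t - 19) / 1.5) ** 2)     # sharp motion spike around frame 19

uniform_idx = np.arange(0, num_frames, 4)         # 12 uniformly sampled frames
key_idx = np.sort(np.argsort(motion)[-12:])       # 12 highest-motion frames

print("peak frame:", motion.argmax())                                # 19
print("uniform sample hits peak:", motion.argmax() in uniform_idx)   # False
print("keyframe sample hits peak:", motion.argmax() in key_idx)      # True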

Keyframe-Aware Video Diffusion for ASVA

A straightforward solution is to train a video generation model on high-frame-rate data to match the fine-grained temporal information in audio. However, this approach incurs substantial computational costs in GPU memory and training time. To ensure accurate audio-visual synchronization while maintaining computational efficiency, we propose KeyVID, a Keyframe-aware Video Diffusion framework that generates audio-synchronized video from an input image and audio. Our model generates high-frame-rate videos while training only on a low frame rate, using a three-step generation process:

Pipeline

I. Keyframe selection

We first develop a keyframe selection strategy that identifies critical moments in the video sequence based on an optical-flow motion score. We then train a keyframe localizer that predicts these keyframe positions directly from the input audio.
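A minimal sketch of such a training-time motion score and keyframe picker, assuming OpenCV's Farneback optical flow and SciPy peak detection (the function names and the peak-plus-padding heuristic are ours, not the paper's exact procedure; at inference, the keyframe localizer predicts these positions from audio alone):

import cv2
import numpy as np
from scipy.signal import find_peaks

def motion_scores(frames):
    """Mean optical-flow magnitude between consecutive grayscale frames."""
    gray = [cv2.cvtColor(f, cv2.COLOR_BGR2GRAY) for f in frames]
    scores = [0.0]                                # first frame has no predecessor
    for prev, nxt in zip(gray[:-1], gray[1:]):
        flow = cv2.calcOpticalFlowFarneback(
            prev, nxt, None, 0.5, 3, 15, 3, 5, 1.2, 0)
        scores.append(float(np.linalg.norm(flow, axis=-1).mean()))
    return np.asarray(scores)

def select_keyframes(scores, k=12):
    """Pick the strongest local maxima of the motion score, then pad to k."""
    peaks, _ = find_peaks(scores)
    order = peaks[np.argsort(scores[peaks])[::-1]]    # strongest peaks first
    chosen = set(int(p) for p in order[:k])
    # Pad with uniformly spaced frames if fewer than k peaks exist.
    for i in np.linspace(0, len(scores) - 1, k).astype(int):
        if len(chosen) >= k:
            break
        chosen.add(int(i))
    return sorted(chosen)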


II. Keyframe generator

Next, instead of uniformly downsampling the video frames, we select the keyframes to train a keyframe generator. The keyframe generator explicitly captures crucial moments of dynamic motion that uniform sampling might otherwise miss, without requiring an excessively large number of frames.
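Continuing the sketch above, one plausible way to assemble a training example for the keyframe generator is to gather the motion-selected frames instead of a uniform stride and pass their normalized positions as conditioning (the positional-conditioning detail and helper names are our assumptions, not the paper's interface):

import numpy as np

def make_keyframe_sample(video, audio_feats, k=12):
    """Gather k motion-selected keyframes for one training example.

    video:       (T, H, W, C) uint8 array of decoded frames
    audio_feats: (T, D) per-frame audio features aligned with the video
    Uses motion_scores/select_keyframes from the sketch above.
    """
    idx = select_keyframes(motion_scores(list(video)), k=k)
    frames = video[idx]                              # motion-aware, non-uniform subset
    positions = np.asarray(idx) / (len(video) - 1)   # normalized to [0, 1]
    return frames, audio_feats[idx], positions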


III. Keyframe interpolator

Then, we train a specialized motion interpolator that synthesizes the intermediate frames between keyframes to produce high-frame-rate videos. The motion interpolator ensures smooth motion transitions and precise audio-visual synchronization throughout the sequence.
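One plausible way to set up the interpolator's input, assuming a masked-frame conditioning interface (a common design for diffusion-based interpolation; KeyVID's exact interface may differ):

import numpy as np

def build_interpolation_input(keyframes, key_idx, total_frames=48):
    """Scatter generated keyframes into a full-length clip plus a known-frame mask.

    keyframes: (k, H, W, C) frames from the keyframe generator
    key_idx:   sorted frame indices of those keyframes in the full clip
    The masked (unknown) positions are what the motion interpolator synthesizes.
    """
    h, w, c = keyframes.shape[1:]
    canvas = np.zeros((total_frames, h, w, c), dtype=keyframes.dtype)
    known = np.zeros(total_frames, dtype=bool)
    canvas[key_idx] = keyframes
    known[key_idx] = True
    return canvas, known    # e.g. conditioning for a masked video diffusion model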


Generation Results

We generate audio-synchronized videos on the AVSync15 dataset using KeyVID at 24 FPS with 48 frames (2-second clips). Our keyframe-aware approach first selects 12 keyframes based on audio-driven motion analysis, then interpolates to the full 48-frame sequence, achieving precise audio-visual synchronization for dynamic actions such as "Dog barking," "Hammering," and "Playing cello."
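Putting the three stages together under the settings above, a hypothetical inference entry point (localize_keyframes, generate_keyframes, and interpolate stand in for the three trained models; this is an illustrative sketch, not a real API):

def keyvid_generate(image, audio, fps=24, seconds=2, k=12):
    """Hypothetical three-stage KeyVID inference (placeholder model calls)."""
    total_frames = fps * seconds                            # 48 frames for a 2 s clip
    key_idx = localize_keyframes(audio, total_frames, k)    # I.   audio -> keyframe positions
    keyframes = generate_keyframes(image, audio, key_idx)   # II.  keyframes via diffusion
    return interpolate(keyframes, key_idx, total_frames)    # III. fill intermediate frames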

The video results (from left to right) compare: (a) our KeyVID with keyframe awareness, (b) our uniform-sampling baseline (KeyVID-Uniform), (c) the state-of-the-art AVSyncD method, and (d) the DynamiCrafter image-to-video baseline.

Open Domain Generation

We generate open-domain videos by applying KeyVID to frames from Sora-generated content, controlling the motion dynamics through different audio inputs. Using the same initial frame with two hammering audio clips, one of strikes on a wooden surface and another on a metal object, our model adapts the generated motion to the material properties inferred from the audio. This demonstrates KeyVID's ability to animate diverse scenarios beyond its training data while maintaining semantic consistency between audio cues and visual dynamics.

1. With audio of hitting a wooden surface

2. With audio of hitting a metal surface

BibTeX

@article{wang2025keyvid,
  author    = {Wang, Xingrui and Liu, Jiang and Wang, Ze and Yu, Xiaodong and Wu, Jialian and Sun, Ximeng and Su, Yusheng and Yuille, Alan and Liu, Zicheng and Barsoum, Emad},
  title     = {KeyVID: Keyframe-Aware Video Diffusion for Audio-Synchronized Visual Animation},
  journal   = {arXiv preprint arXiv:2504.09656},
  year      = {2025},
}