Enhancing Motion in Text-to-Video Generation with Decomposed Encoding and Conditioning

  • Penghui Ruan
  • Pichao Wang
  • Divya Saxena
  • Jiannong Cao
  • Yuhui Shi

Research output: Chapter in book / Conference proceeding; Conference article published in proceeding or book; Academic research; peer-review

Abstract

Despite advancements in Text-to-Video (T2V) generation, producing videos with realistic motion remains challenging. Current models often yield static or minimally dynamic outputs, failing to capture complex motions described by text. This issue stems from internal biases in text encoding, which overlooks motion, and from inadequate conditioning mechanisms in T2V generation models. To address this, we propose a novel framework called DEcomposed MOtion (DEMO), which enhances motion synthesis in T2V generation by decomposing both text encoding and conditioning into content and motion components. Our method includes a content encoder for static elements and a motion encoder for temporal dynamics, alongside separate content and motion conditioning mechanisms. Crucially, we introduce text-motion and video-motion supervision to improve the model's understanding and generation of motion. Evaluations on benchmarks such as MSR-VTT, UCF-101, WebVid-10M, EvalCrafter, and VBench demonstrate DEMO's superior ability to produce videos with enhanced motion dynamics while maintaining high visual quality. Our approach significantly advances T2V generation by integrating comprehensive motion understanding directly from textual descriptions.
Original language: English
Title of host publication: NIPS '24: Proceedings of the 38th International Conference on Neural Information Processing Systems
Publisher: Curran Associates Inc.
Pages: 70101-70129
Volume: 37
ISBN (Print): 979-8-3313-1438-5
Publication status: Published - Dec 2024
Event: 38th International Conference on Neural Information Processing Systems - Vancouver Convention Center, Vancouver, Canada
Duration: 10 Dec 2024 to 15 Dec 2024
Conference number: 38
https://neurips.cc/Conferences/2024

Competition

Competition: 38th International Conference on Neural Information Processing Systems
Abbreviated title: NeurIPS 2024
Country/Territory: Canada
City: Vancouver
Period: 10/12/24 to 15/12/24
Internet address
