MotionBooth: Motion-Aware Customized Text-to-Video Generation

1 Peking University; 2 S-Lab, Nanyang Technological University; 3 Shanghai AI Laboratory;
4 Zhejiang University; 5 Shanghai Jiao Tong University

TL;DR: MotionBooth is an innovative framework designed for animating customized subjects with precise control over both object and camera movements.

Abstract

In this work, we present MotionBooth, an innovative framework designed for animating customized subjects with precise control over both object and camera movements. By leveraging a few images of a specific object, we efficiently fine-tune a text-to-video model to capture the object's shape and attributes accurately. Our approach introduces a subject region loss and a video preservation loss to improve subject learning, along with a subject token cross-attention loss to integrate the customized subject with motion control signals. Additionally, we propose training-free techniques for managing subject and camera motions during inference. In particular, we use cross-attention map manipulation to govern subject motion and introduce a novel latent shift module for camera movement control. MotionBooth excels at preserving the appearance of subjects while simultaneously controlling the motions in generated videos. Extensive quantitative and qualitative evaluations demonstrate the superiority and effectiveness of our method. Models and code will be made publicly available.
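As an illustration of the subject region loss mentioned in the abstract, the following is a minimal sketch assuming the loss is a denoising mean squared error restricted to a binary subject mask. The function name `subject_region_loss`, the tensor shapes, and the normalization by the masked element count are assumptions made for this example, not details taken from this page.

```python
import numpy as np

def subject_region_loss(noise_pred, noise_target, subject_mask):
    """Mean squared error computed only inside the subject region.

    noise_pred, noise_target: float arrays of shape (C, H, W).
    subject_mask: binary array of shape (H, W), 1 inside the subject.
    The normalization below is an assumption for illustration.
    """
    # Broadcast the (H, W) mask over the channel dimension.
    mask = subject_mask[None, :, :].astype(noise_pred.dtype)
    sq_err = (noise_pred - noise_target) ** 2 * mask
    # Number of masked elements across all channels (guard against empty masks).
    denom = max(float(mask.sum()) * noise_pred.shape[0], 1.0)
    return float(sq_err.sum() / denom)

# Hypothetical usage: errors outside the mask do not contribute.
pred = np.zeros((2, 4, 4))
target = np.ones((2, 4, 4))
mask = np.zeros((4, 4))
mask[:2, :2] = 1
loss_inside = subject_region_loss(pred, target, mask)
```

Restricting the loss to the mask means background pixels in the few subject images do not influence the fine-tuning signal, which is one plausible way to keep the model from memorizing backgrounds.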

The Framework

The overall pipeline of MotionBooth. We first fine-tune a T2V model on the subject. This procedure incorporates subject region loss, video preservation loss, and subject token cross-attention loss. During inference, we control the camera movement with a novel latent shift module. At the same time, we manipulate the cross-attention maps to govern the subject motion.
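The idea of a training-free latent shift for camera control can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' module: each frame's latent is translated by a cumulative offset, and the vacated positions are filled with values sampled from random positions of the same frame. The function names, the `(F, C, H, W)` layout, the fill strategy, and the sign convention for `dx` and `dy` are all hypothetical.

```python
import numpy as np

def shift_latent(latent, dx, dy, rng=None):
    """Shift one latent frame of shape (C, H, W) by dx columns and dy rows.

    Positive dx moves content to the right, positive dy moves it down.
    Vacated cells are filled from random positions of the original frame
    (an assumed fill strategy, chosen only for illustration).
    """
    rng = np.random.default_rng(0) if rng is None else rng
    _, h, w = latent.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    src_y, src_x = ys - dy, xs - dx
    # Cells whose source lies inside the frame keep the shifted content.
    valid = (src_y >= 0) & (src_y < h) & (src_x >= 0) & (src_x < w)
    rand_y = rng.integers(0, h, size=(h, w))
    rand_x = rng.integers(0, w, size=(h, w))
    src_y = np.where(valid, src_y, rand_y)
    src_x = np.where(valid, src_x, rand_x)
    return latent[:, src_y, src_x]

def shift_video_latents(latents, dx_per_frame, dy_per_frame, rng=None):
    """Apply a cumulative shift to latents of shape (F, C, H, W)."""
    rng = np.random.default_rng(0) if rng is None else rng
    frames = []
    for i, frame in enumerate(latents):
        dx = int(round(i * dx_per_frame))
        dy = int(round(i * dy_per_frame))
        frames.append(shift_latent(frame, dx, dy, rng))
    return np.stack(frames)

# Hypothetical usage: 3 frames, content drifts one latent cell right per frame.
latents = np.arange(3 * 1 * 4 * 4, dtype=float).reshape(3, 1, 4, 4)
shifted = shift_video_latents(latents, dx_per_frame=1, dy_per_frame=0)
```

Because the shift acts directly on latents at inference time, a sketch like this requires no additional training; how the shift interacts with the denoising schedule is not specified here.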

Results Visualization

MotionBooth compared with baseline models for motion-aware customized video generation.

MotionBooth compared with baseline models for camera control.

More results of MotionBooth.

BibTeX

@article{wu2024motionbooth,
      title={MotionBooth: Motion-Aware Customized Text-to-Video Generation},
      author={Jianzong Wu and Xiangtai Li and Yanhong Zeng and Jiangning Zhang and Qianyu Zhou and Yining Li and Yunhai Tong and Kai Chen},
      journal={NeurIPS},
      year={2024},
    }