Towards Language-Driven Video Inpainting via Multimodal Large Language Models

Jianzong Wu¹, Xiangtai Li², Chenyang Si², Shangchen Zhou², Jingkang Yang², Jiangning Zhang³, Yining Li⁴, Kai Chen⁴, Yunhai Tong¹, Ziwei Liu², Chen Change Loy²

¹Peking University ²S-Lab, Nanyang Technological University ³Zhejiang University ⁴Shanghai AI Laboratory

We propose Language-Driven Video Inpainting, which comprises two sub-tasks distinguished by the type of expression. Referring video inpainting takes a simple referring expression as input, while interactive video inpainting takes a chat-style conversation. A conversation may contain implicit requests, so the model must reason about the user's intent to interpret them correctly.

Abstract

In video inpainting, traditional methods rely on pre-computed binary masks to identify the regions to restore. Labeling these masks is time-consuming and labor-intensive, particularly in applications such as object removal. This paper introduces a language-driven video inpainting task that uses natural language instructions, including chat-style conversations, to guide the inpainting process. To support this task, we develop the Remove Objects from Videos by Instructions (ROVI) dataset, comprising 5,650 videos and 9,091 inpainting results, for training and evaluation. We present a diffusion-based language-driven video inpainting framework, the first end-to-end baseline for this task. The framework integrates Multimodal Large Language Models, which enable it to understand and process complex language-based inpainting requests. Our evaluation, covering both quantitative metrics and qualitative analysis, demonstrates the robustness and versatility of the dataset and the effectiveness of the proposed model across a wide range of inpainting scenarios driven by natural language instructions.