6-min Introduction Video
Abstract (TL;DR)
The Problem: T2V models often produce hallucinations, exaggerated motion, or "contact avoidance" (where objects don't actually touch) because they lack a grounding in physical laws.
The Hypothesis: Audio is inherently causal (e.g., a thud implies a collision). Training on audio should regularize video dynamics.
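The Method: We propose AVFullDiT, a parameter-efficient framework that reuses pre-trained T2V and T2A towers and stacks joint audio-video blocks trained under a unified flow-matching loss.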
The Result: Compared to a T2V-only baseline, our T2AV model generates more physically plausible motion and better object interactions, showing that hearing does help seeing!
The Framework
The architecture of AVFullDiT and Audio-Video Full-Attention. (a) AVFullDiT reuses pre-trained T2V/T2A early towers and stacks joint blocks that predict video/audio velocities under a unified flow-matching loss. (b) AVFull-Attention performs symmetric MHSA over the concatenated audio–video token sequence using the video width as the joint dimension; audio projections are expanded with small adapter matrices. The attended sequence is split and projected back per modality.
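The attention step in panel (b) can be sketched in a few lines of NumPy. This is only an illustrative reading of the caption, not the released implementation: the parameter names (`video_q`, `audio_q`, `adapter_q`, and so on), the shapes, and the choice to apply each adapter as a `d_a x d_v` matrix after the audio projection are all assumptions, and the toy sizes are arbitrary.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def init_params(rng, d_v, d_a):
    """Random weights; every name and shape here is a hypothetical choice."""
    shapes = {"video_o": (d_v, d_v), "audio_o": (d_v, d_a)}
    for n in "qkv":
        shapes[f"video_{n}"] = (d_v, d_v)    # video projection at the joint width
        shapes[f"audio_{n}"] = (d_a, d_a)    # audio projection at its own width
        shapes[f"adapter_{n}"] = (d_a, d_v)  # assumed adapter: expands audio to d_v
    return {k: rng.normal(scale=0.1, size=s) for k, s in shapes.items()}


def split_heads(x, num_heads):
    """(N, D) -> (H, N, D / H)."""
    n, d = x.shape
    return x.reshape(n, num_heads, d // num_heads).transpose(1, 0, 2)


def av_full_attention(video, audio, p, num_heads):
    """Symmetric self-attention over the concatenated video + audio tokens."""
    n_v, d_v = video.shape
    # Video tokens are projected directly at the joint (video) width.
    q_v, k_v, v_v = (video @ p[f"video_{n}"] for n in "qkv")
    # Audio tokens are projected, then expanded to the video width by adapters.
    q_a, k_a, v_a = (audio @ p[f"audio_{n}"] @ p[f"adapter_{n}"] for n in "qkv")
    # Concatenate along the token axis so every token attends to every other.
    q, k, v = (split_heads(np.concatenate(pair), num_heads)
               for pair in ((q_v, q_a), (k_v, k_a), (v_v, v_a)))
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    out = softmax(scores) @ v                      # (H, N_v + N_a, d_v / H)
    out = out.transpose(1, 0, 2).reshape(-1, d_v)  # merge heads
    # Split the attended sequence and project back per modality.
    return out[:n_v] @ p["video_o"], out[n_v:] @ p["audio_o"]


rng = np.random.default_rng(0)
params = init_params(rng, d_v=8, d_a=4)
video = rng.normal(size=(6, 8))  # 6 video tokens of width 8
audio = rng.normal(size=(3, 4))  # 3 audio tokens of width 4
video_out, audio_out = av_full_attention(video, audio, params, num_heads=2)
print(video_out.shape, audio_out.shape)  # (6, 8) (3, 4)
```

Because the attention runs over one concatenated sequence, changing the audio tokens changes the video outputs and vice versa, which is the mechanism by which audio could regularize video dynamics.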
Results Visualization
(T2AV) Tempered Motion -- (T2V) Excessive Motion
In T2V video, the gun's movement amplitude is excessively large.
Audio prompt: The audio features a person firing a gun, producing a sharp, loud gunshot followed by a series of rapid, sharp clicks.
T2AV
T2V
In T2V video, the gun's movement amplitude is excessively large.
Audio prompt: The audio features a distinct gunshot sound, followed by a series of rapid, sharp clicks, accompanied by a background of muted, ambient noise that suggests an outdoor setting with trees.
T2AV
T2V
In T2V video, the strings are played too fast, which makes the video look unnatural.
Audio prompt: The audio features a violin playing a melody in the key of E major, with a tempo of 100.0 bpm, and a 4/4 time signature, accompanied by a background of music.
T2AV
T2V
In T2V video, the strings are played too fast, which makes the video look unnatural.
Audio prompt: The audio features a man playing the violin with a bow, producing a melodic tune, accompanied by background music that enhances the overall sound, creating a harmonious and pleasant auditory experience.
T2AV
T2V
In T2V video, the windmill is spinning too fast.
Audio prompt: The audio features a consistent, rhythmic sound of a horse trotting, with a steady pace and a clear, resonant tone that suggests a hard surface beneath its hooves.
T2AV
T2V
(T2AV) Tempered Motion -- (T2V) Static Video
In T2V video, the waves stirred up by the water flow are relatively inconspicuous.
Audio prompt: The audio features the sound of a large wave crashing, creating a powerful and intense auditory experience with a deep, resonant tone.
T2AV
T2V
In T2V video, the waves stirred up by the water flow are relatively inconspicuous.
Audio prompt: The audio features a continuous, low-pitched humming sound, reminiscent of a distant engine.
T2AV
T2V
In T2V video, the waves stirred up by the water flow are relatively inconspicuous.
Audio prompt: The audio features the sound of a large wave crashing, creating a powerful and immersive auditory experience.
T2AV
T2V
In T2V video, the waves stirred up by the water flow are relatively inconspicuous.
Audio prompt: The audio features the sound of waves crashing, creating a rhythmic and soothing tone with a low pitch, accompanied by the distant sound of a train horn in the background.
T2AV
T2V
In T2V video, the water flow is very small, and the dynamics are barely perceptible.
Audio prompt: The audio features a continuous sound of water flowing steadily, creating a soothing and rhythmic ambiance.
T2AV
T2V
In T2V video, the background scene is too static and unnatural.
Audio prompt: The audio features a repetitive, high-pitched squeaking sound, likely from a small animal, with a consistent rhythm and a background of faint, indistinct noises.
T2AV
T2V
In T2V video, the scene is overly static, with no changes in the hand gestures for playing saxophone; in T2AV video, the hand gestures roughly match the audio.
Audio prompt: The audio features a saxophone playing a slow, melancholic melody with a low pitch, accompanied by a background of a soft, ambient electronic music that has a rhythmic pattern, creating a somber and reflective atmosphere.
T2AV
T2V
In T2V video, the camera movement is relatively static; in T2AV video, the camera movement is relatively natural.
Audio prompt: The audio features a male voice speaking in Portuguese, with a tone that suggests he is instructing or explaining something, possibly in a casual or conversational manner. The pitch of his voice is moderate, and the rhythm is steady, indicating a normal speaking pace. There are no significant variations in
T2AV
T2V
(T2AV) Contact with Sound -- (T2V) Contact Avoidance
In T2V video, the hammer does not properly make contact with the desktop.
Audio prompt: The audio features a distinct hammering sound, characterized by a sharp, resonant tone, and a rhythmic pattern, set against a backdrop of faint, ambient noise.
T2AV
T2V
In T2V video, the knife and the whetstone are not in proper contact.
Audio prompt: The audio features a continuous, high-pitched, metallic scraping sound, accompanied by a soft, ambient background noise that suggests a workshop or industrial setting.
T2AV
T2V
In T2V video, the knife and the whetstone are not in proper contact.
Audio prompt: The audio features a continuous, rhythmic scraping sound, likely from a knife being sharpened on a whetstone, with a consistent tone and pitch, accompanied by faint background noises that suggest a workshop setting.
T2AV
T2V
In T2V video, the stick and the chair are not in proper contact.
Audio prompt: The audio features a series of sharp, repetitive whip sounds, creating a rhythmic pattern against a relatively quiet background.
T2AV
T2V
In T2V video, the stick and the keyboard are not in proper contact.
Audio prompt: The audio features a series of distinct, sharp snapping sounds, each with a brief echo, occurring at regular intervals, set against a relatively quiet background.
T2AV
T2V
In T2V video, the stick and the chair are not in proper contact.
Audio prompt: The audio features a series of distinct, sharp thumping sounds, likely from a hard object hitting a surface, with a consistent rhythm and a slightly dark, reverberant tone.
T2AV
T2V
In T2V video, the stick and the table are not in proper contact.
Audio prompt: The audio features a series of rhythmic, repetitive sounds, likely from a hammer striking a hard surface, creating a consistent and steady beat.
T2AV
T2V
In T2V video, the stick and the cabinet are not in proper contact.
Audio prompt: The audio features a distinct metallic clanging sound, which is sharp and resonant, occurring at regular intervals, creating a rhythmic pattern against a relatively quiet background.
T2AV
T2V
In T2V video, the stick and the game console are not in proper contact.
Audio prompt: The audio features a series of distinct, sharp clapping sounds, each with a brief pause, creating a rhythmic pattern against a quiet background.
T2AV
T2V
In T2V video, the stick and the box are not in proper contact.
Audio prompt: The audio features a series of distinct, sharp thuds against a hard surface, likely wood, with a clear and resonant tone, maintaining a consistent rhythm throughout.
T2AV
T2V
In T2V video, the stick and the water fountain are not in proper contact.
Audio prompt: The audio features a series of distinct, sharp metallic clanks, each with a clear and resonant tone, occurring at regular intervals, creating a rhythmic pattern against a relatively quiet background.
T2AV
T2V
In T2V video, the bowl and the hammer are not in proper contact.
Audio prompt: The audio features a single, resonant bell sound with a clear, high-pitched tone that rings out and gradually fades, accompanied by a soft, ambient background noise that remains constant throughout.
T2AV
T2V
In T2V video, the hand and the disc are not in proper contact.
Audio prompt: The audio features a rhythmic scratching sound, likely from a DJ, with a consistent tempo and a background of soft, indistinct music.
T2AV
T2V
In T2V video, the electric chainsaw and the tree are not in proper contact.
Audio prompt: The audio features a continuous, high-pitched whine from a chainsaw cutting through wood, with a background of wind and wind sounds, creating a sense of an outdoor work environment.
T2AV
T2V
In T2V video, the electric chainsaw and the tree are not in proper contact.
Audio prompt: The audio features a continuous, high-pitched whirring sound, characteristic of a chainsaw in operation, with a consistent rhythm and a background of faint, indistinct noise.
T2AV
T2V
In T2V video, the knife and vegetables are not in proper contact.
Audio prompt: A woman speaks with a neutral tone and moderate pitch, saying "Can I say hmm it's not beautiful but it is beautiful", while in the background, there are sounds of cutl cutl, a woman's voice, and a man's voice.
T2AV
T2V
In T2V video, the stick and guiro are not in proper contact.
Audio prompt: The audio features a lively Latin American music track with a fast tempo, characterized by a prominent horn section, likely trumpets, playing a rhythmic melody, accompanied by a male vocalist singing in a high-pitched, energetic style, with a background of a steady, upbeat drum rhythm and occasional guitar strums.
T2AV
T2V
In T2V video, the hand didn't press the button properly.
Audio prompt: The audio features a male speaker with a neutral tone and moderate pitch, delivering a speech in German about maintaining a cubic meter per hour without exceeding the 75 decibel mark, with a consistent rhythm and no notable variations in background noises.
T2AV
T2V
Physical Commonsense
In T2V video, the vertical knife-sharpening method violates the laws of physics.
Audio prompt: The audio features the sound of a knife being rubbed against a sharpening surface, accompanied by a faint background hum.
T2AV
T2V
In T2V video, sharpening a knife on a hand violates the laws of physics.
Audio prompt: The audio features a continuous, high-pitched, and rhythmic scratching sound, likely produced by a tool or object being scraped against a hard surface, accompanied by a faint background noise that could be a distant conversation or ambient kitchen sounds.
T2AV
T2V
In T2V video, the ball abruptly disappears, which violates the laws of physics.
Audio prompt: The audio features a bowling ball rolling down the lane, followed by the satisfying sound of pins crashing, all accompanied by a lively background of music and chatter.
T2AV
T2V
In T2V video, the car suddenly reverses, which violates everyday common sense.
Audio prompt: An emergency vehicle siren is heard, characterized by a high-pitched, urgent tone, with a consistent rhythm, and a background of urban street noise.
T2AV
T2V
In T2V video, the position of the welding torch is incorrect.
Audio prompt: The audio features a continuous, high-pitched hissing sound, likely from a spray can, followed by a brief, sharp metallic clink, possibly a can hitting a hard surface.
T2AV
T2V
In T2V video, the rotating fan is distorted.
Audio prompt: The audio features a distinct click sound, likely from a switch or button being pressed, followed by a brief period of silence, then another similar click, all throughout the audio.
T2AV
T2V
In T2AV video, the video and audio are relatively consistent, achieving good joint audio-video generation.
Audio prompt: The audio features a beatboxing performance with a consistent rhythm and a low-pitched, resonant tone, accompanied by a faint background noise that seems to be a soft, ambient hum.
T2AV
T2V
In T2V video, the bell is small, and its movement violates the laws of physics. In contrast, in the T2AV video, the bell is clear and its movement is synchronized with audio.
Audio prompt: The audio features footsteps on gravel, a man speaking, and a bicycle bell ringing, with the bell's sound becoming more prominent over time.
T2AV
T2V
In T2V video, the human hand undergoes deformation.
Audio prompt: A male voice with a German accent speaks in a calm tone, delivering the line 'Was den -0 tut dir das nicht, bekommst du kein Signal.' The background is quiet with no noticeable ambient noise.
T2AV
T2V
In T2V video, the generated chainsaw shape is incorrect.
Audio prompt: The audio features a consistent, high-pitched, and rhythmic sound of a bicycle wheel spinning, accompanied by a faint background noise that seems to be the ambient sound of the environment.
T2AV
T2V
In T2AV video, the girl's hand movements when pinching the plastic are more accurate and synchronized with the sound.
Audio prompt: The audio features a child speaking in English, saying "on to our last basket", with a neutral tone and a pitch typical of a young child, accompanied by the sound of crumpling paper in the background.
T2AV
T2V
In T2V video, the meat hasn't been cut properly.
Audio prompt: The audio features a conversation in a language that sounds like it could be Chinese, with a woman speaking in a higher pitch and a man responding in a lower pitch, both maintaining a steady rhythm, and there are no noticeable background noises.
T2AV
T2V
In T2V video, the bell didn't sway when struck, which violates the laws of physics; as for the T2AV video, the bell's sway is synchronized with the bell sound.
Audio prompt: The audio features a distinct church bell ringing, accompanied by a soft, ambient background noise, creating a serene and rhythmic atmosphere.
T2AV
T2V
In T2V video, the position of the electric trimmer is incorrect.
Audio prompt: The audio features a man speaking in English in a neutral tone, accompanied by the continuous sound of an electric shaver buzzing in the background.
T2AV
T2V
In T2V video, continuous firing violates the laws of physics.
Audio prompt: The audio features a series of distinct impact sounds, likely from objects being dropped or hit, interspersed with a brief, low-pitched male voice speaking in the background.
T2AV
T2V
In T2V video, the model failed to correctly generate the button and the action of pressing it. In T2AV video, by contrast, the model correctly generated the phrase "turn on", indicating that it possesses a certain ability to understand the world.
Audio prompt: The audio features a male voice speaking in English with a neutral tone, saying "up for four seconds" at a moderate pitch and rhythm, accompanied by the sound of a car engine running in the background.
T2AV
T2V
In T2V video, the paper generated by the model is shapeless and somewhat distorted.
Audio prompt: The audio features a consistent humming sound throughout, likely from an electrical device, with a brief, sharp click occurring near the end, possibly from a switch or button being pressed.
T2AV
T2V
In T2V video, the paper generated by the model violates the laws of physics (it is soft like silk).
Audio prompt: The audio features the distinct sound of paper being crumpled, with a consistent rhythm and a slightly high-pitched tone, set against a relatively quiet background.
T2AV
T2V
In T2V video, the generated flame only burns on the right side, which does not conform to the laws of physics.
Audio prompt: The audio features a male voice speaking in English with a neutral tone, saying 'pretty much out of fluid', in a clear and straightforward manner without notable variations or background noises.
T2AV
T2V