Does Hearing Help Seeing?
Investigating Audio-Video Joint Denoising for Video Generation

Jianzong Wu1,2, Hao Lian1, Dachao Hao1, Ye Tian1, Qingyu Shi1, Biaolong Chen2, Hao Jiang2, Yunhai Tong1
1Peking University 2Alibaba Group

6-min Introduction Video

Abstract (TL;DR)

The Problem: Text-to-video (T2V) models often produce hallucinations, exaggerated motion, or "contact avoidance" (where objects don't actually touch) because they lack grounding in physical laws.
The Hypothesis: Audio is inherently causal (e.g., a thud implies a collision). Training on audio should regularize video dynamics.
The Method: We propose AVFullDiT, a parameter-efficient framework that:

  • Fuses pre-trained T2V and text-to-audio (T2A) models.
  • Uses AVFull-Attention for bidirectional information flow.
  • Uses AVSyncRoPE to align the different time scales of audio and video latents.

The Result: Compared to a T2V-only baseline, our joint text-to-audio-video (T2AV) model generates more physically plausible motion and better object interactions, indicating that hearing does help seeing!
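To make the time-scale alignment idea concrete, here is a minimal sketch of one way it could work. It is not the authors' AVSyncRoPE implementation: it assumes that alignment means placing audio and video latent tokens on a shared temporal axis before computing standard rotary position embedding (RoPE) angles, and it covers only the temporal axis. The function names, the clip length, and the latent frame counts are all hypothetical.

```python
import math

def synced_positions(num_tokens, duration_s, ref_rate_hz):
    """Map token indices to positions on a shared time axis.

    Each token's timestamp (in seconds) is expressed in units of a reference
    rate, so tokens from different modalities that cover the same instant
    receive the same (possibly fractional) position.
    """
    token_rate_hz = num_tokens / duration_s
    return [i / token_rate_hz * ref_rate_hz for i in range(num_tokens)]

def rope_angles(position, dim, base=10000.0):
    """Standard RoPE rotation angles for one position (dim must be even)."""
    return [position / (base ** (2 * k / dim)) for k in range(dim // 2)]

def apply_rope(vec, position, base=10000.0):
    """Rotate consecutive pairs of `vec` by the RoPE angles of `position`."""
    out = []
    for k, theta in enumerate(rope_angles(position, len(vec), base)):
        x, y = vec[2 * k], vec[2 * k + 1]
        out += [x * math.cos(theta) - y * math.sin(theta),
                x * math.sin(theta) + y * math.cos(theta)]
    return out

# Hypothetical 5-second clip: 20 video latent frames, 50 audio latent frames.
# The video latent rate (4 Hz) is used as the shared reference rate.
video_pos = synced_positions(20, 5.0, ref_rate_hz=4.0)  # integer steps
audio_pos = synced_positions(50, 5.0, ref_rate_hz=4.0)  # fractional steps of 0.4
```

Under these assumptions, video frame 4 and audio frame 10 both sit at the 1-second mark and therefore receive the same position, so attention between them sees a relative temporal offset of zero even though their raw indices differ.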

    The Framework

    The architecture of AVFullDiT and Audio-Video Full-Attention. (a) AVFullDiT reuses pre-trained T2V/T2A early towers and stacks joint blocks that predict video/audio velocities under a unified flow-matching loss. (b) AVFull-Attention performs symmetric MHSA over the concatenated audio–video token sequence using the video width as the joint dimension; audio projections are expanded with small adapter matrices. The attended sequence is split and projected back per modality.
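The joint attention in (b) can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the authors' code: it assumes each adapter is a plain matrix that lifts an audio projection from the audio width to the video width, and that the audio output projection maps the joint width back to the audio width. All parameter names, sizes, and the random initialization are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def av_full_attention(video, audio, p, num_heads):
    """video: (Nv, Dv) tokens, audio: (Na, Da) tokens; joint width is Dv."""
    n_v, d_v = video.shape
    # Video tokens use their own Q/K/V projections (Dv -> Dv).
    qv, kv, vv = (video @ p[f"v_{n}"] for n in "qkv")
    # Audio Q/K/V projections (Da -> Da) are lifted to the video width
    # by adapter matrices (Da -> Dv). This composition is an assumption.
    qa, ka, va = (audio @ p[f"a_{n}"] @ p[f"adapt_{n}"] for n in "qkv")
    # Concatenate along the token axis: every token can attend to every other.
    q = np.concatenate([qv, qa])
    k = np.concatenate([kv, ka])
    v = np.concatenate([vv, va])

    def heads(x):  # (N, Dv) -> (H, N, Dv / H)
        return x.reshape(x.shape[0], num_heads, -1).transpose(1, 0, 2)

    qh, kh, vh = heads(q), heads(k), heads(v)
    scores = qh @ kh.transpose(0, 2, 1) / np.sqrt(qh.shape[-1])
    out = softmax(scores) @ vh                       # (H, N, Dv / H)
    out = out.transpose(1, 0, 2).reshape(q.shape[0], d_v)
    # Split the attended sequence and project back per modality.
    video_out = out[:n_v] @ p["v_o"]                 # Dv -> Dv
    audio_out = out[n_v:] @ p["a_o"]                 # Dv -> Da
    return video_out, audio_out

# Hypothetical sizes: 6 video tokens of width 16, 10 audio tokens of width 8.
rng = np.random.default_rng(0)
d_v, d_a = 16, 8
params = {f"v_{n}": rng.normal(size=(d_v, d_v)) * 0.1 for n in "qkvo"}
params.update({f"a_{n}": rng.normal(size=(d_a, d_a)) * 0.1 for n in "qkv"})
params.update({f"adapt_{n}": rng.normal(size=(d_a, d_v)) * 0.1 for n in "qkv"})
params["a_o"] = rng.normal(size=(d_v, d_a)) * 0.1
video = rng.normal(size=(6, d_v))
audio = rng.normal(size=(10, d_a))
video_out, audio_out = av_full_attention(video, audio, params, num_heads=4)
```

Because the softmax runs over the concatenated sequence, changing the audio tokens changes the video output and vice versa, which is the bidirectional information flow the caption describes.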

    Results Visualization

    (T2AV) Tempered Motion -- (T2V) Excessive Motion

    In the T2V video, the gun's movement amplitude is excessively large.

    Video prompt: A person is holding a black handgun and firing it towards the right side of the frame, with a bright flash visible from the muzzle. The gun is then held steady without firing for a few moments. The person fires the gun again, causing the muzzle to flash once more. The gun is held steady again without firing. The background consists of a wooden cabinet with shelves and a plain wall.
    Audio prompt: The audio features a person firing a gun, producing a sharp, loud gunshot followed by a series of rapid, sharp clicks.

    T2AV

    Video prompt: A person is holding a black handgun and firing it towards the right side of the frame, with a bright flash visible from the muzzle. The gun is then held steady without firing for a few moments. The person fires the gun again, causing the muzzle to flash once more. The gun is held steady again without firing. The background consists of a wooden cabinet with shelves and a plain wall.

    T2V

    In the T2V video, the gun's movement amplitude is excessively large.

    Video prompt: A person is holding a black handgun in their right hand, pointing it towards the left side of the frame, with a background of leafless trees and a gray sky. The person pulls the trigger, and a bright flash of light is seen coming from the gun's barrel. The gun is then held steady again, with no visible changes in the background. Another flash of light is seen from the gun's barrel. The gun is held steady once more, with no visible changes in the background. The gun is held steady again, with no visible changes in the background. A small flash of light is seen from the gun's barrel. The gun is held steady once more, with no visible changes in the background.
    Audio prompt: The audio features a distinct gunshot sound, followed by a series of rapid, sharp clicks, accompanied by a background of muted, ambient noise that suggests an outdoor setting with trees.

    T2AV

    Video prompt: A person is holding a black handgun in their right hand, pointing it towards the left side of the frame, with a background of leafless trees and a gray sky. The person pulls the trigger, and a bright flash of light is seen coming from the gun's barrel. The gun is then held steady again, with no visible changes in the background. Another flash of light is seen from the gun's barrel. The gun is held steady once more, with no visible changes in the background. The gun is held steady again, with no visible changes in the background. A small flash of light is seen from the gun's barrel. The gun is held steady once more, with no visible changes in the background.

    T2V

    In the T2V video, the strings are played too fast, which makes the motion look unnatural.

    Video prompt: A person wearing a blue shirt is standing in a room with a window in the background, playing a violin. The person holds the violin with their left hand and uses a bow with their right hand to play the instrument. The room appears to be well-lit with natural light coming through the window. The person's movements are consistent as they play the violin, indicating a continuous performance.
    Audio prompt: The audio features a violin playing a melody in the key of E major, with a tempo of 100.0 bpm, and a 4/4 time signature, accompanied by a background of music.

    T2AV

    Video prompt: A person wearing a blue shirt is standing in a room with a window in the background, playing a violin. The person holds the violin with their left hand and uses a bow with their right hand to play the instrument. The room appears to be well-lit with natural light coming through the window. The person's movements are consistent as they play the violin, indicating a continuous performance.

    T2V

    In the T2V video, the strings are played too fast, which makes the motion look unnatural.

    Video prompt: An elderly man with a beard is seated in a room, playing the violin. He is wearing a striped sweater and is holding the violin with his left hand while using a bow with his right hand to play. The background includes a bookshelf with various items and a closed door. The man continues to play the violin throughout the video, maintaining a consistent posture and focus on his instrument.
    Audio prompt: The audio features a man playing the violin with a bow, producing a melodic tune, accompanied by background music that enhances the overall sound, creating a harmonious and pleasant auditory experience.

    T2AV

    Video prompt: An elderly man with a beard is seated in a room, playing the violin. He is wearing a striped sweater and is holding the violin with his left hand while using a bow with his right hand to play. The background includes a bookshelf with various items and a closed door. The man continues to play the violin throughout the video, maintaining a consistent posture and focus on his instrument.

    T2V

    In the T2V video, the windmill spins too fast.

    Video prompt: The video shows a wind turbine mounted on a pole against a clear blue sky. The turbine's propeller starts spinning slowly and gradually increases in speed. The propeller continues to spin faster, reaching a more rapid pace. The propeller spins at a high speed, maintaining its rotation. The background remains consistent with a clear blue sky and a tree branch visible on the right side.
    Audio prompt: The audio features a consistent, rhythmic sound of a horse trotting, with a steady pace and a clear, resonant tone that suggests a hard surface beneath its hooves.

    T2AV

    Video prompt: The video shows a wind turbine mounted on a pole against a clear blue sky. The turbine's propeller starts spinning slowly and gradually increases in speed. The propeller continues to spin faster, reaching a more rapid pace. The propeller spins at a high speed, maintaining its rotation. The background remains consistent with a clear blue sky and a tree branch visible on the right side.

    T2V

    (T2AV) Tempered Motion -- (T2V) Static Video

    In the T2V video, the waves stirred up by the water are barely noticeable.

    Video prompt: The video captures a coastal scene with a lighthouse standing on a rocky cliff, surrounded by a clear sky transitioning from day to dusk. The ocean waves are seen crashing against the rocks in the foreground, creating white foam and splashes. The lighthouse, with a white tower and a red-roofed building beside it, remains a constant focal point throughout the frames. The sky gradually darkens, and a faint moon is visible in the upper right part of the sky.
    Audio prompt: The audio features the sound of a large wave crashing, creating a powerful and intense auditory experience with a deep, resonant tone.

    T2AV

    Video prompt: The video captures a coastal scene with a lighthouse standing on a rocky cliff, surrounded by a clear sky transitioning from day to dusk. The ocean waves are seen crashing against the rocks in the foreground, creating white foam and splashes. The lighthouse, with a white tower and a red-roofed building beside it, remains a constant focal point throughout the frames. The sky gradually darkens, and a faint moon is visible in the upper right part of the sky.

    T2V

    In the T2V video, the waves stirred up by the water are barely noticeable.

    Video prompt: The video begins with a black screen. The scene gradually brightens, revealing a rocky coastline with waves crashing against the rocks. The waves continue to crash against the rocks, creating white foam and splashes. The background shows a dark sky, indicating it is either early morning or late evening. The waves persistently hit the rocks, showcasing the dynamic interaction between the sea and the shoreline.
    Audio prompt: The audio features a continuous, low-pitched humming sound, reminiscent of a distant engine.

    T2AV

    Video prompt: The video begins with a black screen. The scene gradually brightens, revealing a rocky coastline with waves crashing against the rocks. The waves continue to crash against the rocks, creating white foam and splashes. The background shows a dark sky, indicating it is either early morning or late evening. The waves persistently hit the rocks, showcasing the dynamic interaction between the sea and the shoreline.

    T2V

    In the T2V video, the waves stirred up by the water are barely noticeable.

    Video prompt: The video shows a rocky shoreline with various sizes of stones and boulders, with waves continuously crashing against the rocks, creating white foam. The background features a calm sea extending to the horizon under a cloudy sky. The waves consistently hit the rocks, causing the foam to spread and recede with each wave.
    Audio prompt: The audio features the sound of a large wave crashing, creating a powerful and immersive auditory experience.

    T2AV

    Video prompt: The video shows a rocky shoreline with various sizes of stones and boulders, with waves continuously crashing against the rocks, creating white foam. The background features a calm sea extending to the horizon under a cloudy sky. The waves consistently hit the rocks, causing the foam to spread and recede with each wave.

    T2V

    In the T2V video, the waves stirred up by the water are barely noticeable.

    Video prompt: The video shows a series of waves crashing against a rocky shoreline. The waves are foamy and turbulent as they approach the rocks, creating splashes and white foam. The rocks are jagged and covered with some seaweed, indicating a natural coastal environment. The waves continue to hit the rocks, causing water to splash up and spread out. The background remains consistent with the rocky shoreline and the continuous movement of the waves.
    Audio prompt: The audio features the sound of waves crashing, creating a rhythmic and soothing tone with a low pitch, accompanied by the distant sound of a train horn in the background.

    T2AV

    Video prompt: The video shows a series of waves crashing against a rocky shoreline. The waves are foamy and turbulent as they approach the rocks, creating splashes and white foam. The rocks are jagged and covered with some seaweed, indicating a natural coastal environment. The waves continue to hit the rocks, causing water to splash up and spread out. The background remains consistent with the rocky shoreline and the continuous movement of the waves.

    T2V

    In the T2V video, the water flow is very weak, and its dynamics are barely perceptible.

    Video prompt: The video shows a serene natural setting with a small stream flowing over moss-covered rocks, surrounded by lush green foliage. The text 'Nature Sounds' is prominently displayed in the center of the frame, overlaying the scene. The water gently cascades down the rocks, creating a soothing and calming atmosphere.
    Audio prompt: The audio features a continuous sound of water flowing steadily, creating a soothing and rhythmic ambiance.

    T2AV

    Video prompt: The video shows a serene natural setting with a small stream flowing over moss-covered rocks, surrounded by lush green foliage. The text 'Nature Sounds' is prominently displayed in the center of the frame, overlaying the scene. The water gently cascades down the rocks, creating a soothing and calming atmosphere.

    T2V

    In the T2V video, the background scene is too static and looks unnatural.

    Video prompt: A small, striped animal, likely a chipmunk, is perched on a branch surrounded by green leaves and foliage. The animal remains mostly still, with only slight movements of its head and body. The background is a blurred mix of green and brown, indicating a natural, outdoor setting. The animal's fur is detailed with dark stripes running along its back and sides. The animal's eyes are large and black, and it appears to be observing its surroundings.
    Audio prompt: The audio features a repetitive, high-pitched squeaking sound, likely from a small animal, with a consistent rhythm and a background of faint, indistinct noises.

    T2AV

    Video prompt: A small, striped animal, likely a chipmunk, is perched on a branch surrounded by green leaves and foliage. The animal remains mostly still, with only slight movements of its head and body. The background is a blurred mix of green and brown, indicating a natural, outdoor setting. The animal's fur is detailed with dark stripes running along its back and sides. The animal's eyes are large and black, and it appears to be observing its surroundings.

    T2V

    In the T2V video, the scene is overly static, with no change in the hand gestures for playing the saxophone;
    In the T2AV video, the hand gestures roughly match the audio.

    Video prompt: A person dressed in black is holding a white, intricately designed musical instrument, which appears to be a flute, and is playing it. The background shows a room with a bed and some objects, including a box and a bag, placed on a surface. The person's fingers are moving along the instrument, pressing the keys as they play.
    Audio prompt: The audio features a saxophone playing a slow, melancholic melody with a low pitch, accompanied by a background of a soft, ambient electronic music that has a rhythmic pattern, creating a somber and reflective atmosphere.

    T2AV

    Video prompt: A person dressed in black is holding a white, intricately designed musical instrument, which appears to be a flute, and is playing it. The background shows a room with a bed and some objects, including a box and a bag, placed on a surface. The person's fingers are moving along the instrument, pressing the keys as they play.

    T2V

    In the T2V video, the camera is nearly static;
    In the T2AV video, the camera movement looks more natural.

    Video prompt: The video begins with a close-up of a car's speedometer and digital display, showing the odometer reading 135560 km and the trip meter at 145.2 km. The speedometer and trip meter remain constant throughout the video. The oil pressure and battery warning lights turn on, followed by the temperature warning light. The fuel gauge starts to move, indicating a change in the fuel level. The camera gradually zooms in on the illuminated warning lights and the moving fuel gauge.
    Audio prompt: The audio features a male voice speaking in Portuguese, with a tone that suggests he is instructing or explaining something, possibly in a casual or conversational manner. The pitch of his voice is moderate, and the rhythm is steady, indicating a normal speaking pace. There are no significant variations in

    T2AV

    Video prompt: The video begins with a close-up of a car's speedometer and digital display, showing the odometer reading 135560 km and the trip meter at 145.2 km. The speedometer and trip meter remain constant throughout the video. The oil pressure and battery warning lights turn on, followed by the temperature warning light. The fuel gauge starts to move, indicating a change in the fuel level. The camera gradually zooms in on the illuminated warning lights and the moving fuel gauge.

    T2V

    (T2AV) Contact with Sound -- (T2V) Contact Avoidance

    In the T2V video, the hammer does not properly make contact with the desktop.

    Video prompt: On a wooden surface, there are several nails scattered on the left side, and a person wearing gloves is holding a hammer in their right hand. The person repeatedly strikes a nail on the wooden surface, causing wood chips and dust to scatter around. As the hammering continues, the nail gradually bends and eventually falls to the left side of the surface.
    Audio prompt: The audio features a distinct hammering sound, characterized by a sharp, resonant tone, and a rhythmic pattern, set against a backdrop of faint, ambient noise.

    T2AV

    Video prompt: On a wooden surface, there are several nails scattered on the left side, and a person wearing gloves is holding a hammer in their right hand. The person repeatedly strikes a nail on the wooden surface, causing wood chips and dust to scatter around. As the hammering continues, the nail gradually bends and eventually falls to the left side of the surface.

    T2V

    In the T2V video, the knife and the whetstone are not in proper contact.

    Video prompt: A person is seen sharpening a knife on a sharpening stone, moving the knife back and forth with both hands. The sharpening stone is secured on a table with a white cloth underneath. The person continues to sharpen the knife, maintaining a consistent motion. After completing the sharpening process, the person lifts the knife away from the stone. The sharpening stone remains on the table as the person moves the knife out of the frame.
    Audio prompt: The audio features a continuous, high-pitched, metallic scraping sound, accompanied by a soft, ambient background noise that suggests a workshop or industrial setting.

    T2AV

    Video prompt: A person is seen sharpening a knife on a sharpening stone, moving the knife back and forth with both hands. The sharpening stone is secured on a table with a white cloth underneath. The person continues to sharpen the knife, maintaining a consistent motion. After completing the sharpening process, the person lifts the knife away from the stone. The sharpening stone remains on the table as the person moves the knife out of the frame.

    T2V

    In the T2V video, the knife and the whetstone are not in proper contact.

    Video prompt: A person is seen sharpening a knife on a whetstone labeled 'Whetstone #400 Grit' in a tray filled with water. The person moves the knife back and forth across the whetstone, applying consistent pressure and motion. The background includes various sharpening tools and a wooden block. After several passes, the person lifts the knife and inspects it closely.
    Audio prompt: The audio features a continuous, rhythmic scraping sound, likely from a knife being sharpened on a whetstone, with a consistent tone and pitch, accompanied by faint background noises that suggest a workshop setting.

    T2AV

    Video prompt: A person is seen sharpening a knife on a whetstone labeled 'Whetstone #400 Grit' in a tray filled with water. The person moves the knife back and forth across the whetstone, applying consistent pressure and motion. The background includes various sharpening tools and a wooden block. After several passes, the person lifts the knife and inspects it closely.

    T2V

    In the T2V video, the stick and the chair are not in proper contact.

    Video prompt: The video shows a yellow chair with perforated holes in the backrest and a black armrest, set against a wooden background. A wooden stick appears and hits the armrest of the chair. The stick then moves across the backrest of the chair, continuing to hit it. The stick moves back and forth across the backrest multiple times.
    Audio prompt: The audio features a series of sharp, repetitive whip sounds, creating a rhythmic pattern against a relatively quiet background.

    T2AV

    Video prompt: The video shows a yellow chair with perforated holes in the backrest and a black armrest, set against a wooden background. A wooden stick appears and hits the armrest of the chair. The stick then moves across the backrest of the chair, continuing to hit it. The stick moves back and forth across the backrest multiple times.

    T2V

    In the T2V video, the stick and the keyboard are not in proper contact.

    Video prompt: The video shows a desk with two computer monitors, a black keyboard, a black mouse, a box of tissues, a ceramic mug, and some papers. In the fifth frame, a hand holding a wooden stick appears from the left side of the frame. The hand starts moving the stick in a conducting motion over the keyboard and mouse. The stick continues to move back and forth over the keyboard and mouse, simulating conducting motions. The hand and stick move across the desk, continuing the conducting motion. The stick moves back to the left side of the frame, and the hand is no longer visible.
    Audio prompt: The audio features a series of distinct, sharp snapping sounds, each with a brief echo, occurring at regular intervals, set against a relatively quiet background.

    T2AV

    Video prompt: The video shows a desk with two computer monitors, a black keyboard, a black mouse, a box of tissues, a ceramic mug, and some papers. In the fifth frame, a hand holding a wooden stick appears from the left side of the frame. The hand starts moving the stick in a conducting motion over the keyboard and mouse. The stick continues to move back and forth over the keyboard and mouse, simulating conducting motions. The hand and stick move across the desk, continuing the conducting motion. The stick moves back to the left side of the frame, and the hand is no longer visible.

    T2V

    In the T2V video, the stick and the chair are not in proper contact.

    Video prompt: The video shows a light blue office chair with black armrests and wheels, positioned in a room with a blue floor and other office chairs in the background. A hand holding a wooden stick appears and starts hitting the seat of the chair repeatedly. The hand continues to hit the chair in a consistent manner, moving the stick back and forth across the seat.
    Audio prompt: The audio features a series of distinct, sharp thumping sounds, likely from a hard object hitting a surface, with a consistent rhythm and a slightly dark, reverberant tone.

    T2AV

    Video prompt: The video shows a light blue office chair with black armrests and wheels, positioned in a room with a blue floor and other office chairs in the background. A hand holding a wooden stick appears and starts hitting the seat of the chair repeatedly. The hand continues to hit the chair in a consistent manner, moving the stick back and forth across the seat.

    T2V

    In the T2V video, the stick and the table are not in proper contact.

    Video prompt: The video shows a round wooden table with a metal base placed on a blue carpeted floor. A hand holding a stick appears from the right side of the frame and begins to hit the table surface repeatedly. The stick makes contact with the table multiple times, creating a rhythmic pattern. The background remains consistent throughout the video, with no other objects or changes in the environment.
    Audio prompt: The audio features a series of rhythmic, repetitive sounds, likely from a hammer striking a hard surface, creating a consistent and steady beat.

    T2AV

    Video prompt: The video shows a round wooden table with a metal base placed on a blue carpeted floor. A hand holding a stick appears from the right side of the frame and begins to hit the table surface repeatedly. The stick makes contact with the table multiple times, creating a rhythmic pattern. The background remains consistent throughout the video, with no other objects or changes in the environment.

    T2V

    In the T2V video, the stick and the cabinet are not in proper contact.

    Video prompt: The video shows a close-up of a metallic cabinet with a smooth surface and a drawer. A wooden stick appears and repeatedly hits the cabinet surface from the right side. The stick continues to hit the cabinet in a consistent manner, creating a rhythmic motion. The background includes a wooden floor and a metal structure on the left side.
    Audio prompt: The audio features a distinct metallic clanging sound, which is sharp and resonant, occurring at regular intervals, creating a rhythmic pattern against a relatively quiet background.

    T2AV

    Video prompt: The video shows a close-up of a metallic cabinet with a smooth surface and a drawer. A wooden stick appears and repeatedly hits the cabinet surface from the right side. The stick continues to hit the cabinet in a consistent manner, creating a rhythmic motion. The background includes a wooden floor and a metal structure on the left side.

    T2V

    In the T2V video, the stick and the game console are not in proper contact.

    Video prompt: The video shows a game console with a controller on top, placed on a wooden shelf. A hand holding a drumstick appears and starts tapping the buttons on the controller. The hand continues to tap different buttons on the controller with the drumstick. The background remains consistent with the wooden shelf and the game console throughout the video.
    Audio prompt: The audio features a series of distinct, sharp clapping sounds, each with a brief pause, creating a rhythmic pattern against a quiet background.

    T2AV

    Video prompt: The video shows a game console with a controller on top, placed on a wooden shelf. A hand holding a drumstick appears and starts tapping the buttons on the controller. The hand continues to tap different buttons on the controller with the drumstick. The background remains consistent with the wooden shelf and the game console throughout the video.

    T2V

    In the T2V video, the stick and the box are not in proper contact.

    Video prompt: A cardboard box labeled 'SCOTTFOLD' is placed on a wooden surface next to a wall. A hand holding a wooden stick appears from the left side of the frame and starts tapping the box. The hand continues to tap the box with the stick in a rhythmic manner.
    Audio prompt: The audio features a series of distinct, sharp thuds against a hard surface, likely wood, with a clear and resonant tone, maintaining a consistent rhythm throughout.

    T2AV

    Video prompt: A cardboard box labeled 'SCOTTFOLD' is placed on a wooden surface next to a wall. A hand holding a wooden stick appears from the left side of the frame and starts tapping the box. The hand continues to tap the box with the stick in a rhythmic manner.

    T2V

    In the T2V video, the stick and the water fountain are not in proper contact.

    Video prompt: The video shows a wall-mounted water fountain with two drinking fountains, one above the other, in a room with a speckled floor and beige walls. A person enters the frame from the left, holding a wooden stick. The person uses the stick to push the button located between the two drinking fountains. The person continues to push the stick against the button, ensuring it is pressed. The person then moves the stick to the button on the lower drinking fountain. The person presses the button on the lower drinking fountain with the stick.
    Audio prompt: The audio features a series of distinct, sharp metallic clanks, each with a clear and resonant tone, occurring at regular intervals, creating a rhythmic pattern against a relatively quiet background.

    T2AV

    Video prompt: The video shows a wall-mounted water fountain with two drinking fountains, one above the other, in a room with a speckled floor and beige walls. A person enters the frame from the left, holding a wooden stick. The person uses the stick to push the button located between the two drinking fountains. The person continues to push the stick against the button, ensuring it is pressed. The person then moves the stick to the button on the lower drinking fountain. The person presses the button on the lower drinking fountain with the stick.

    T2V

    In the T2V video, the bowl and the hammer are not in proper contact.

    Video prompt: A hand is holding a brass bowl with small perforations on its surface against a plain white background. A red object, likely a mallet, appears from the right side and moves towards the bowl. The red object makes contact with the bowl and begins to strike it. The red object continues to strike the bowl, producing sound. The red object is then moved away from the bowl. The hand continues to hold the bowl steady as the red object is no longer in contact with it.
    Audio prompt: The audio features a single, resonant bell sound with a clear, high-pitched tone that rings out and gradually fades, accompanied by a soft, ambient background noise that remains constant throughout.

    T2AV

    Video prompt: A hand is holding a brass bowl with small perforations on its surface against a plain white background. A red object, likely a mallet, appears from the right side and moves towards the bowl. The red object makes contact with the bowl and begins to strike it. The red object continues to strike the bowl, producing sound. The red object is then moved away from the bowl. The hand continues to hold the bowl steady as the red object is no longer in contact with it.

    T2V

    In the T2V video, the hand and the disc are not in proper contact.

    Video prompt: A pair of hands is seen operating a Pioneer DJ controller on a wooden surface, with the left hand adjusting the knobs and sliders while the right hand manipulates the jog wheels. The controller features two jog wheels, various buttons, and a display screen, with the brand 'Pioneer DJ' and 'rekordbox' visible on the front. The hands continue to move rhythmically, adjusting the controls and spinning the jog wheels, indicating a DJ mixing session. The background remains consistent, showing a wooden surface with some cables connected to the controller.
    Audio prompt: The audio features a rhythmic scratching sound, likely from a DJ, with a consistent tempo and a background of soft, indistinct music.

    T2AV

    Video prompt: A pair of hands is seen operating a Pioneer DJ controller on a wooden surface, with the left hand adjusting the knobs and sliders while the right hand manipulates the jog wheels. The controller features two jog wheels, various buttons, and a display screen, with the brand 'Pioneer DJ' and 'rekordbox' visible on the front. The hands continue to move rhythmically, adjusting the controls and spinning the jog wheels, indicating a DJ mixing session. The background remains consistent, showing a wooden surface with some cables connected to the controller.

    T2V

    In the T2V video, the electric chainsaw and the tree are not in proper contact.

    Video prompt: A man wearing a green cap, brown sweater, and blue jeans is using a red chainsaw to cut a log on a wooden platform in a forested area. He continues to saw through the log, making steady progress. After completing the cut, he lifts the chainsaw and steps back, moving away from the log. The background consists of trees and foliage, indicating a forest setting.
    Audio prompt: The audio features a continuous, high-pitched whine from a chainsaw cutting through wood, with a background of wind sounds, creating a sense of an outdoor work environment.

    T2AV

    Video prompt: A man wearing a green cap, brown sweater, and blue jeans is using a red chainsaw to cut a log on a wooden platform in a forested area. He continues to saw through the log, making steady progress. After completing the cut, he lifts the chainsaw and steps back, moving away from the log. The background consists of trees and foliage, indicating a forest setting.

    T2V

    In T2V video, the electric chainsaw and the tree are not in proper contact.

    Video prompt: A person wearing orange protective pants and a gray shirt is seen on a grassy area, holding a chainsaw. The person positions the chainsaw on a log that is lying on the ground, with a cut section already visible at the end of the log. The person begins to saw the log, moving the chainsaw forward steadily. Sawdust is visible as the chainsaw cuts through the log. The person continues to saw the log, making progress through the wood.
    Audio prompt: The audio features a continuous, high-pitched whirring sound, characteristic of a chainsaw in operation, with a consistent rhythm and a background of faint, indistinct noise.

    T2AV

    Video prompt: A person wearing orange protective pants and a gray shirt is seen on a grassy area, holding a chainsaw. The person positions the chainsaw on a log that is lying on the ground, with a cut section already visible at the end of the log. The person begins to saw the log, moving the chainsaw forward steadily. Sawdust is visible as the chainsaw cuts through the log. The person continues to saw the log, making progress through the wood.

    T2V

    In T2V video, the knife and vegetables are not in proper contact.

    Video prompt: A person wearing a watch and a patterned shirt is slicing a cylindrical food item on a white cutting board using a knife. The cutting board is placed on a wooden countertop, and there is a bamboo sushi rolling mat next to it. Various ingredients and containers, including a jar of seasoning and a bowl of chopped items, are visible on the countertop. The person continues to slice the food item into smaller pieces, making precise cuts with the knife.
    Audio prompt: A woman speaks with a neutral tone and moderate pitch, saying "Can I say hmm it's not beautiful but it is beautiful", while in the background, there are sounds of cutl cutl, a woman's voice, and a man's voice.

    T2AV

    Video prompt: A person wearing a watch and a patterned shirt is slicing a cylindrical food item on a white cutting board using a knife. The cutting board is placed on a wooden countertop, and there is a bamboo sushi rolling mat next to it. Various ingredients and containers, including a jar of seasoning and a bowl of chopped items, are visible on the countertop. The person continues to slice the food item into smaller pieces, making precise cuts with the knife.

    T2V

    In T2V video, the stick and guiro are not in proper contact.

    Video prompt: In a room with tiled walls and a wooden door, a person wearing a green shirt and shorts stands behind a makeshift drum setup, which includes a metal stand and a white circular object with a black pattern. The person is seen playing the drums with both hands, moving them rhythmically up and down. The room has a bed with a blue blanket and a pillow with red and white text on it, and there are coats hanging on the door.
    Audio prompt: The audio features a lively Latin American music track with a fast tempo, characterized by a prominent horn section, likely trumpets, playing a rhythmic melody, accompanied by a male vocalist singing in a high-pitched, energetic style, with a background of a steady, upbeat drum rhythm and occasional guitar strums.

    T2AV

    Video prompt: In a room with tiled walls and a wooden door, a person wearing a green shirt and shorts stands behind a makeshift drum setup, which includes a metal stand and a white circular object with a black pattern. The person is seen playing the drums with both hands, moving them rhythmically up and down. The room has a bed with a blue blanket and a pillow with red and white text on it, and there are coats hanging on the door.

    T2V

    In T2V video, the hand didn't press the button properly.

    Video prompt: A hand is seen turning the knob on a fan's control panel, which has a dial with markings from 0 to 3. The hand rotates the knob from the 0 position to the 3 position. The knob is then left in the 3 position, and the hand moves away. The background consists of the fan's metal blades and a logo 'creoven TV' in the top right corner.
    Audio prompt: The audio features a male speaker with a neutral tone and moderate pitch, delivering a speech in German about maintaining a cubic meter per hour without exceeding the 75 decibel mark, with a consistent rhythm and no notable variations in background noises.

    T2AV

    Video prompt: A hand is seen turning the knob on a fan's control panel, which has a dial with markings from 0 to 3. The hand rotates the knob from the 0 position to the 3 position. The knob is then left in the 3 position, and the hand moves away. The background consists of the fan's metal blades and a logo 'creoven TV' in the top right corner.

    T2V

    Physical Commonsense

    In T2V video, the vertical knife-sharpening method violates the laws of physics.

    Video prompt: A person is holding a large knife with a black handle and is sharpening it on a yellow and black sharpening stone placed on a beige countertop. The person moves the knife back and forth across the sharpening stone, applying consistent pressure and maintaining a steady hand. The sharpening stone has markings indicating different grit levels, and the person continues to sharpen the knife in a repetitive motion. The background consists of a light-colored wall with vertical panels, and there is another knife placed on the countertop to the left of the sharpening stone.
    Audio prompt: The audio features the sound of a knife being rubbed against a sharpeninging surface, accompanied by a faint background hum.

    T2AV

    Video prompt: A person is holding a large knife with a black handle and is sharpening it on a yellow and black sharpening stone placed on a beige countertop. The person moves the knife back and forth across the sharpening stone, applying consistent pressure and maintaining a steady hand. The sharpening stone has markings indicating different grit levels, and the person continues to sharpen the knife in a repetitive motion. The background consists of a light-colored wall with vertical panels, and there is another knife placed on the countertop to the left of the sharpening stone.

    T2V

    In T2V video, the knife is sharpened against the hand, which violates the laws of physics.

    Video prompt: In a tiled kitchen environment, a person wearing a red apron is seen holding a large knife on a white cutting board with their left hand, which has a gold ring on the ring finger. The person uses their right hand to repeatedly scrape the blade of the knife with a sharpening tool, moving the tool back and forth along the edge of the knife.
    Audio prompt: The audio features a continuous, high-pitched, and rhythmic scratching sound, likely produced by a tool or object being scraped against a hard surface, accompanied by a faint background noise that could be a distant conversation or ambient kitchen sounds.

    T2AV

    Video prompt: In a tiled kitchen environment, a person wearing a red apron is seen holding a large knife on a white cutting board with their left hand, which has a gold ring on the ring finger. The person uses their right hand to repeatedly scrape the blade of the knife with a sharpening tool, moving the tool back and forth along the edge of the knife.

    T2V

    In T2V video, the ball abruptly disappears, which violates the laws of physics.

    Video prompt: A man dressed in black is seen at a bowling alley, preparing to bowl. He takes a few steps forward, positioning himself to throw the bowling ball. He bends down and swings the ball back. He then releases the ball, releasing it towards the bowling lane. The ball rolls down the lane as the man follows through with his arm. He stands up straight, watching the ball's progress. The background shows multiple bowling lanes with banners overhead, and a score display is visible in the bottom right corner of the screen.
    Audio prompt: The audio features a bowling ball rolling down the lane, followed by the satisfying sound of pins crashing, all by a lively background of music and chatter.

    T2AV

    Video prompt: A man dressed in black is seen at a bowling alley, preparing to bowl. He takes a few steps forward, positioning himself to throw the bowling ball. He bends down and swings the ball back. He then releases the ball, releasing it towards the bowling lane. The ball rolls down the lane as the man follows through with his arm. He stands up straight, watching the ball's progress. The background shows multiple bowling lanes with banners overhead, and a score display is visible in the bottom right corner of the screen.

    T2V

    In T2V video, the vehicle suddenly reverses, which violates everyday commonsense.

    Video prompt: An ambulance with the word 'AMBULANZA' and emergency symbols on its back is driving down a street lined with parked cars on the left and a sidewalk on the right. The ambulance continues to move forward, passing by various parked vehicles, including sedans and compact cars. The camera follows the ambulance as it progresses down the street, maintaining a steady focus on the vehicle. The background consists of residential buildings with balconies and windows, and the street is marked with a crosswalk at the beginning. The ambulance's license plate and emergency contact number are visible on the back. The camera gradually zooms in on the ambulance as it moves further down the street. Throughout the video, the text 'CAGLIARIMEGGENCY118' is displayed in the bottom right corner of the screen.
    Audio prompt: An emergency vehicle sots is heard, characterized by a high-pitched, urgent tone, with a consistent rhythm, and a background of urban street noise.

    T2AV

    Video prompt: An ambulance with the word 'AMBULANZA' and emergency symbols on its back is driving down a street lined with parked cars on the left and a sidewalk on the right. The ambulance continues to move forward, passing by various parked vehicles, including sedans and compact cars. The camera follows the ambulance as it progresses down the street, maintaining a steady focus on the vehicle. The background consists of residential buildings with balconies and windows, and the street is marked with a crosswalk at the beginning. The ambulance's license plate and emergency contact number are visible on the back. The camera gradually zooms in on the ambulance as it moves further down the street. Throughout the video, the text 'CAGLIARIMEGGENCY118' is displayed in the bottom right corner of the screen.

    T2V

    In T2V video, the position of the welding torch is incorrect.

    Video prompt: A person wearing red gloves is welding a metal joint, producing bright sparks and intense light. The welding process continues with the sparks and light gradually diminishing. The person then removes the welding tool, revealing a glowing weld on the metal joint. The person proceeds to remove a bolt from the metal joint using their gloved hand.
    Audio prompt: The audio features a continuous, high-pitched hissing sound, likely from a spray can, followed by a brief, sharp metallic clink, possibly a can hitting a hard surface.

    T2AV

    Video prompt: A person wearing red gloves is welding a metal joint, producing bright sparks and intense light. The welding process continues with the sparks and light gradually diminishing. The person then removes the welding tool, revealing a glowing weld on the metal joint. The person proceeds to remove a bolt from the metal joint using their gloved hand.

    T2V

    In T2V video, the rotating fan is distorted.

    Video prompt: A hand is holding a white vent with the brand name 'WOLKER' on it, positioned above a white surface. Below the vent, there is a white switch and an orange electrical box with wires. The hand presses the switch, and the vent starts rotating. The hand continues to press the switch, and the vent keeps rotating. The hand releases the switch, and the vent slows down and eventually stops rotating.
    Audio prompt: The audio features a distinct click sound, likely from a switch or button being pressed, followed by a brief period of silence, then another similar click, all throughout the audio.

    T2AV

    Video prompt: A hand is holding a white vent with the brand name 'WOLKER' on it, positioned above a white surface. Below the vent, there is a white switch and an orange electrical box with wires. The hand presses the switch, and the vent starts rotating. The hand continues to press the switch, and the vent keeps rotating. The hand releases the switch, and the vent slows down and eventually stops rotating.

    T2V

    In T2AV video, the video and audio are relatively consistent, achieving good joint audio-video generation.

    Video prompt: A man wearing a white t-shirt with a black graphic on the left side is in a park with trees and benches in the background. He is speaking and gesturing with his hands, occasionally pointing and moving his arms. The background shows people sitting on benches and walking around. The man continues to speak and gesture throughout the video.
    Audio prompt: The audio features a beatboxing performance with a consistent rhythm and a low-pitched, resonant tone, accompanied by a faint background noise that seems to be a soft, ambient hum.

    T2AV

    Video prompt: A man wearing a white t-shirt with a black graphic on the left side is in a park with trees and benches in the background. He is speaking and gesturing with his hands, occasionally pointing and moving his arms. The background shows people sitting on benches and walking around. The man continues to speak and gesture throughout the video.

    T2V

    In T2V video, the bell is small, and its movement violates the laws of physics.
    In contrast, in the T2AV video, the bell is clearly visible and its movement is synchronized with the audio.

    Video prompt: A brown and white cow with a bell around its neck is walking to the right in a grassy field, passing by a barbed wire fence. In the background, there is a hotel building with a sign that reads 'HOTEL'. The cow continues to walk, moving past a wooden post and further into the field. The camera follows the cow's movement, capturing the scenic background of mountains and greenery.
    Audio prompt: The audio features footsteps on gravel, a man speaking, and a bicycle bell ringing, with the bell's sound becoming more prominent over time.

    T2AV

    Video prompt: A brown and white cow with a bell around its neck is walking to the right in a grassy field, passing by a barbed wire fence. In the background, there is a hotel building with a sign that reads 'HOTEL'. The cow continues to walk, moving past a wooden post and further into the field. The camera follows the cow's movement, capturing the scenic background of mountains and greenery.

    T2V

    In T2V video, the human hand is deformed.

    Video prompt: A person is holding a green circuit board with both hands, adjusting wires connected to it. The background shows an open electronic device with various wires and components. The person continues to adjust the wires, ensuring they are properly connected. The person then picks up an object with their right hand while still holding the circuit board with their left hand.
    Audio prompt: A male voice with a German accent speaks in a calm tone, delivering the line 'Was den -0 tut dir das nicht, bekommst du kein Signal.' The background is quiet with no noticeable ambient noise.

    T2AV

    Video prompt: A person is holding a green circuit board with both hands, adjusting wires connected to it. The background shows an open electronic device with various wires and components. The person continues to adjust the wires, ensuring they are properly connected. The person then picks up an object with their right hand while still holding the circuit board with their left hand.

    T2V

    In T2V video, the generated chainsaw shape is incorrect.

    Video prompt: A person is holding an orange and white chainsaw while surrounded by green foliage and a building with a tiled roof in the background. Smoke begins to appear around the chainsaw, and small flames start to ignite. The flames grow larger, and the person uses their right hand to try to extinguish the fire. The fire continues to burn intensely, and the person keeps attempting to put it out. The flames and smoke become more intense, and the person struggles to control the fire. The fire starts to diminish slightly, but it is still burning. The person continues to try to extinguish the remaining flames.
    Audio prompt: The audio features a consistent, high-pitched, and rhythmic sound of a bicycle wheel spinning, accompanied by a faint background noise that seems to be the ambient sound of the environment.

    T2AV

    Video prompt: A person is holding an orange and white chainsaw while surrounded by green foliage and a building with a tiled roof in the background. Smoke begins to appear around the chainsaw, and small flames start to ignite. The flames grow larger, and the person uses their right hand to try to extinguish the fire. The fire continues to burn intensely, and the person keeps attempting to put it out. The flames and smoke become more intense, and the person struggles to control the fire. The fire starts to diminish slightly, but it is still burning. The person continues to try to extinguish the remaining flames.

    T2V

    In T2AV video, the girl's hand movements when pinching the plastic package are more accurate and are synchronized with the sound.

    Video prompt: A girl is sitting at a round table in a living room, which has various small toys and a playset labeled 'Fruit & Veg' on it. She is talking while looking at the toys on the table. She reaches out and picks up a blue toy package from the table. She examines the toy package closely, turning it around in her hands. The background includes a striped armchair, a dining table with chairs, and some scattered toys on the floor.
    Audio prompt: The audio features a child speaking in English, saying "on to our last basket", with a neutral tone and a pitch typical of a young child, accompanied by the sound of crumpling paper in the background.

    T2AV

    Video prompt: A girl is sitting at a round table in a living room, which has various small toys and a playset labeled 'Fruit & Veg' on it. She is talking while looking at the toys on the table. She reaches out and picks up a blue toy package from the table. She examines the toy package closely, turning it around in her hands. The background includes a striped armchair, a dining table with chairs, and some scattered toys on the floor.

    T2V

    In T2V video, the meat hasn't been cut properly.

    Video prompt: A chef wearing a white uniform and gloves is seen chopping a piece of meat into minced meat using a knife, placing the minced meat into a white bowl on a countertop covered with a cloth. The kitchen background includes a refrigerator, microwave, and various kitchen utensils. A woman in a red top and pink headband stands next to the chef, observing the process. The chef continues to mince the meat while explaining, and the subtitles on the screen read '我們用的量不需要很大' and '你要用多少就刨多少'. The camera then focuses on the chef's face as he continues to speak, with the subtitle '對 你要多少就刨多少' appearing on the screen.
    Audio prompt: The audio features a conversation in a language that sounds like it could be Chinese, with a woman speaking in a higher pitch and a man respondinging in a lower pitch, both maintaining a steady rhythm, and there are no noticeable background noises.

    T2AV

    Video prompt: A chef wearing a white uniform and gloves is seen chopping a piece of meat into minced meat using a knife, placing the minced meat into a white bowl on a countertop covered with a cloth. The kitchen background includes a refrigerator, microwave, and various kitchen utensils. A woman in a red top and pink headband stands next to the chef, observing the process. The chef continues to mince the meat while explaining, and the subtitles on the screen read '我們用的量不需要很大' and '你要用多少就刨多少'. The camera then focuses on the chef's face as he continues to speak, with the subtitle '對 你要多少就刨多少' appearing on the screen.

    T2V

    In T2V video, the bell does not sway when struck, which violates the laws of physics.
    In the T2AV video, the bell's sway is synchronized with the bell sound.

    Video prompt: A large bell is suspended and swinging back and forth in a structure with wooden beams and metal supports. The bell is attached to a metal frame, and a smaller metal object, likely a striker, is also moving in sync with the bell's motion. The background shows a brick wall and various metal and wooden components, indicating an industrial or historical setting. The bell's motion is continuous and rhythmic, with the striker occasionally coming into view as it moves past the bell.
    Audio prompt: The audio features a distinct church bell ringing, accompanied by a soft, ambient background noise, creating a serene and rhythmic atmosphere.

    T2AV

    Video prompt: A large bell is suspended and swinging back and forth in a structure with wooden beams and metal supports. The bell is attached to a metal frame, and a smaller metal object, likely a striker, is also moving in sync with the bell's motion. The background shows a brick wall and various metal and wooden components, indicating an industrial or historical setting. The bell's motion is continuous and rhythmic, with the striker occasionally coming into view as it moves past the bell.

    T2V

    In T2V video, the position of the electric trimmer is incorrect.

    Video prompt: A man wearing a red polo shirt and a white undershirt is standing in a tiled room, holding an electric ear trimmer to his right ear. He moves the trimmer back and forth in his ear, carefully grooming the hair. The background consists of white tiled walls, and the man continues to trim his ear hair methodically.
    Audio prompt: The audio features a man speaking in a neutral tone over English, accompanied by the continuous sound of an electric shaver buzzing in the background.

    T2AV

    Video prompt: A man wearing a red polo shirt and a white undershirt is standing in a tiled room, holding an electric ear trimmer to his right ear. He moves the trimmer back and forth in his ear, carefully grooming the hair. The background consists of white tiled walls, and the man continues to trim his ear hair methodically.

    T2V

    In T2V video, the continuous firing violates the laws of physics.

    Video prompt: In a barren, desert-like environment, a person is seen handling a large rifle mounted on a wooden platform. The person removes the magazine from the rifle and places it on the platform. The person then lies prone on the ground, aiming the rifle equipped with a scope, surrounded by spent bullet casings. The person fires the rifle, with visible recoil and dust being kicked up from the ground.
    Audio prompt: The audio features a series of distinct impact sounds, likely from objects being dropped or hit, interspersed with a brief, low-pitched male voice speaking in the background.

    T2AV

    Video prompt: In a barren, desert-like environment, a person is seen handling a large rifle mounted on a wooden platform. The person removes the magazine from the rifle and places it on the platform. The person then lies prone on the ground, aiming the rifle equipped with a scope, surrounded by spent bullet casings. The person fires the rifle, with visible recoil and dust being kicked up from the ground.

    T2V

    In T2V video, the model failed to correctly generate the button and the action of pressing it.
    In T2AV video, by contrast, the model correctly generated the spoken phrase "turn on", suggesting that it has some ability to understand the world.

    Video prompt: A hand is seen pressing a button located on the car door panel, which is next to a circular speaker and a rectangular handle. The hand continues to press the button repeatedly, applying pressure in a consistent manner. The background shows a blurred view of greenery outside the car window.
    Audio prompt: The audio features a male voice speaking in English with a neutral tone, saying "up for four seconds" at a moderate pitch and rhythm, accompanied by the sound of a car engine running in the background.

    T2AV

    Video prompt: A hand is seen pressing a button located on the car door panel, which is next to a circular speaker and a rectangular handle. The hand continues to press the button repeatedly, applying pressure in a consistent manner. The background shows a blurred view of greenery outside the car window.

    T2V

    In T2V video, the paper generated by the model is misshapen and somewhat distorted.

    Video prompt: The video shows a printer on a desk with a yellow sticky note attached to it, and a computer monitor in the background displaying various images. The printer is actively printing a document, with the paper gradually moving from the top to the bottom of the printer. The document continues to be fed through the printer, and the paper starts to curl slightly as it progresses. The paper is almost completely through the printer, with only a small portion still visible.
    Audio prompt: The audio features a consistent humming sound throughout, likely from an electrical device, with a brief, sharp click occurring near the end, possibly from a switch or button being pressed.

    T2AV

    Video prompt: The video shows a printer on a desk with a yellow sticky note attached to it, and a computer monitor in the background displaying various images. The printer is actively printing a document, with the paper gradually moving from the top to the bottom of the printer. The document continues to be fed through the printer, and the paper starts to curl slightly as it progresses. The paper is almost completely through the printer, with only a small portion still visible.

    T2V

    In T2V video, the paper generated by the model violates the laws of physics (it behaves as if it were soft like silk).

    Video prompt: A pair of hands is seen holding a piece of brown paper on a flat surface. The hands begin to lift the paper from the top edge, causing the paper to start bending and folding. As the paper is lifted further, the fold becomes more pronounced, and the paper starts to curl. The paper continues to be lifted and folded, with the bottom edge now visible and the fold extending further down. The paper is almost completely folded, with the bottom edge now fully visible and the fold extending to the bottom of the frame. The paper is now fully folded, with the bottom edge completely visible and the fold extending to the bottom of the frame.
    Audio prompt: The audio features the distinct sound of paper being crumpled, with a consistent rhythm and a slightly high-pitched tone, set against a relatively quiet background.

    T2AV

    Video prompt: A pair of hands is seen holding a piece of brown paper on a flat surface. The hands begin to lift the paper from the top edge, causing the paper to start bending and folding. As the paper is lifted further, the fold becomes more pronounced, and the paper starts to curl. The paper continues to be lifted and folded, with the bottom edge now visible and the fold extending further down. The paper is almost completely folded, with the bottom edge now fully visible and the fold extending to the bottom of the frame. The paper is now fully folded, with the bottom edge completely visible and the fold extending to the bottom of the frame.

    T2V

    In T2V video, the generated flame only burns on the right side, which does not conform to the laws of physics.

    Video prompt: A bundle of sticks is burning with a bright flame on a white surface, with a washing machine in the background. The flame grows taller and more intense as the sticks continue to burn. The camera moves slightly, capturing different angles of the burning sticks and the flame. The flame remains steady and bright, with the sticks gradually burning down.
    Audio prompt: The audio features a male voice speaking in English with a neutral tone, saying 'pretty much out of fluid', in a clear and straightforward manner without notable variations or background noises.

    T2AV

    Video prompt: A bundle of sticks is burning with a bright flame on a white surface, with a washing machine in the background. The flame grows taller and more intense as the sticks continue to burn. The camera moves slightly, capturing different angles of the burning sticks and the flame. The flame remains steady and bright, with the sticks gradually burning down.

    T2V