Can Text-to-Video Generation help Video-Language Alignment?

Abstract

Recent video-language alignment models are trained on sets of videos, each with an associated positive caption and a negative caption generated by large language models. A problem with this procedure is that negative captions may introduce linguistic biases, i.e., concepts are seen only as negatives and are never associated with a video. While one solution would be to collect videos for the negative captions, existing databases lack the fine-grained variations needed to cover all possible negatives. In this work, we study whether synthetic videos can help overcome this issue. Our preliminary analysis with multiple generators shows that, while promising on some tasks, synthetic videos harm the performance of the model on others. We hypothesize that this issue is linked to noise (both semantic and visual) in the generated videos and develop a method, SynViTA, that accounts for both. SynViTA dynamically weights the contribution of each synthetic video based on how similar its target caption is to the real counterpart. Moreover, a semantic consistency loss makes the model focus on fine-grained differences across captions rather than on differences in video appearance. Experiments show that, on average, SynViTA improves over existing methods on the VideoCon test sets and on the SSv2-Temporal, SSv2-Events, and ATP-Hard benchmarks, making it a promising first step toward using synthetic videos when learning video-language models.
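The per-video weighting described above can be illustrated with a minimal sketch. The abstract does not give the exact weighting function, so everything here is an assumption made for illustration: the function name `weighted_training_loss`, the use of a similarity score in [0, 1] directly as the weight, and the way real and synthetic terms are summed. It is not SynViTA's actual objective.

```python
def weighted_training_loss(real_losses, synth_losses, similarities):
    """Combine per-sample losses, down-weighting less reliable synthetic videos.

    All names and the weighting rule are hypothetical: each synthetic sample's
    loss is scaled by a similarity score clamped to [0, 1].
    """
    if not real_losses:
        raise ValueError("at least one real sample is required")
    if len(synth_losses) != len(similarities):
        raise ValueError("one similarity score is needed per synthetic video")

    # Average loss over the real videos (always fully trusted in this sketch).
    real_term = sum(real_losses) / len(real_losses)
    if not synth_losses:
        return real_term

    # Clamp each similarity so it can act as a weight in [0, 1].
    weights = [min(1.0, max(0.0, s)) for s in similarities]

    # Low-similarity synthetic videos contribute less to the total loss.
    synth_term = sum(w * l for w, l in zip(weights, synth_losses)) / len(synth_losses)
    return real_term + synth_term
```

With this rule, a synthetic video whose score is 0 is ignored entirely, while one whose score is 1 counts as much as a real video.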

Can synthetic videos help video-language alignment (VLA)?

We first conduct a preliminary study to evaluate whether videos generated by text-to-video models can augment the training set of real videos and improve performance on various video-related tasks. Our analysis shows that adding synthetic videos is promising on some tasks but does not consistently improve performance on temporally challenging downstream tasks, regardless of the generator.

We also analyze the effects of different misalignment types (i.e., semantically plausible changes in the video captions) on the generated videos.

We notice that videos generated from captions modified by, for example, introducing hallucinations or reversing the event order align more closely with the positive captions than with their target captions. Such noisy supervision signals may lead to ineffective learning, limiting improvements on downstream tasks.
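A check of this kind can be sketched as follows. The record layout, the function name `noisy_fraction_by_type`, and the idea of comparing two precomputed alignment scores per synthetic video are assumptions for illustration; the source does not specify how the analysis was implemented.

```python
from collections import defaultdict


def noisy_fraction_by_type(records):
    """Per misalignment type, return the fraction of synthetic videos that
    score higher with the positive caption than with their target caption.

    `records` is an iterable of hypothetical tuples:
    (misalignment_type, score_with_target, score_with_positive).
    """
    totals = defaultdict(int)
    noisy = defaultdict(int)
    for mtype, target_score, positive_score in records:
        totals[mtype] += 1
        # A video is counted as noisy when it matches the original (positive)
        # caption better than the caption it was generated from.
        if positive_score > target_score:
            noisy[mtype] += 1
    return {mtype: noisy[mtype] / totals[mtype] for mtype in totals}


# Toy placeholder scores, not results from the paper.
example = [
    ("hallucination", 0.3, 0.7),
    ("hallucination", 0.6, 0.4),
    ("event_order", 0.2, 0.9),
]
fractions = noisy_fraction_by_type(example)
```

A high fraction for a given misalignment type would suggest that videos generated for that type provide unreliable supervision.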

Qualitative Results

Figure: Examples of video-language alignment scores assigned by SynViTA (mPLUG-Owl 7B) and SynViTA (Video-LLaVA), compared to baselines trained without synthetic videos, on the video-language entailment task for VideoCon Human and VideoCon Human Hard.

Figure: Examples of video-language alignment scores assigned by SynViTA (mPLUG-Owl 7B) and SynViTA (Video-LLaVA), compared to baselines trained without synthetic videos, on the video question answering task on ATP-Hard.

Figure: Rankings based on video-language alignment scores for the text-to-video retrieval task on SSv2-Events, using SynViTA (mPLUG-Owl 7B) against baselines trained without synthetic videos.

Figure: Rankings based on video-language alignment scores for the text-to-video retrieval task on SSv2-Temporal, using SynViTA (Video-LLaVA) against baselines trained without synthetic videos.