Training-free Online Video Step Grounding

1 University of Trento 2 Fondazione Bruno Kessler 3 Google
NeurIPS 2025
Code

Video Step Grounding (VSG) aims to identify which procedural steps appear in a video. We tackle this task with BaGLM, a training-free approach that combines Bayesian filtering with Large Multimodal Models to enable online inference over video streams.

Abstract

Given a task and a set of steps composing it, Video Step Grounding (VSG) aims to detect which steps are performed in a video. Standard approaches for this task require a labeled training set (e.g., with step-level annotations or narrations), which may be costly to collect. Moreover, they process the full video offline, limiting their applicability in scenarios requiring online decisions. Thus, in this work, we explore how to perform VSG online and without training. We achieve this by exploiting the zero-shot capabilities of recent Large Multimodal Models (LMMs). In particular, we use LMMs to predict the step associated with a restricted set of frames, without access to the whole video. We show that this online strategy without task-specific tuning outperforms offline and training-based models. Motivated by this finding, we develop Bayesian Grounding with Large Multimodal Models (BaGLM), further injecting knowledge of past frames into the LMM-based predictions. BaGLM exploits Bayesian filtering principles, modeling step transitions via (i) a dependency matrix extracted through large language models and (ii) an estimation of step progress. Experiments on three datasets show superior performance of BaGLM over state-of-the-art training-based offline methods.

Large Multimodal Models are strong baselines for VSG


Off-the-shelf Large Multimodal Models (LMMs) achieve performance competitive with or superior to state-of-the-art training-based methods on Video Step Grounding (VSG) benchmarks. We evaluate them by framing VSG as a multiple-choice question answering problem: given a video segment and a list of candidate steps, the model must identify which step (if any) is being performed. For each segment, the LMM is prompted with the task and step descriptions and produces a probability distribution over all step options plus a "none" class.
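A minimal sketch of this multiple-choice formulation is shown below, assuming the LMM exposes a per-option score (e.g., the log-likelihood of each answer letter). The prompt template, function names, and the numeric scores are illustrative placeholders, not the paper's implementation.

```python
import numpy as np

def build_prompt(task, steps, none_label="None of the above"):
    """Format the VSG query for one segment as a multiple-choice question (illustrative template)."""
    options = steps + [none_label]
    lines = [f"Task: {task}", "Which step is being performed in this video segment?"]
    lines += [f"({chr(65 + i)}) {opt}" for i, opt in enumerate(options)]
    return "\n".join(lines), options

def option_probabilities(option_scores):
    """Softmax over the LMM's per-option scores to get a distribution over steps + 'none'."""
    z = np.asarray(option_scores, dtype=float)
    z -= z.max()                      # numerical stability
    p = np.exp(z)
    return p / p.sum()

# Toy example: 3 candidate steps plus the "none" class, with made-up LMM scores.
prompt, options = build_prompt(
    "Make a latte",
    ["Grind the coffee beans", "Steam the milk", "Pour the espresso shot"],
)
probs = option_probabilities([2.1, 0.3, -0.5, -1.0])  # placeholder scores for (A)-(D)
for opt, p in zip(options, probs):
    print(f"{p:.2f}  {opt}")
```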

Method overview


While LMMs are effective at Video Step Grounding, they operate without memory of the past, predicting the step by looking only at the current segment. As a result, uncertain predictions (e.g., due to noisy segment acquisition) cannot benefit from past evidence (e.g., the step performed in the previous segment), leading to potential mistakes. To address this limitation, we introduce Bayesian Grounding with Large Multimodal Models (BaGLM), a Bayesian filtering framework that integrates step dependencies and progress estimates into LMM-based predictions. Specifically, we use a Large Language Model (LLM) to estimate a dependency matrix among procedural steps, which defines the step transition probabilities used in the predict step of a Bayesian filter. As the video progresses, these transition probabilities are modulated by step progress estimates from the LMM. The update step of the filter then fuses the resulting prior with the LMM's predictions for the current segment, refining the output. A sketch of this predict/update recursion is given below.
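The sketch below illustrates one possible predict/update recursion under simplifying assumptions: the dependency matrix, the progress-based modulation of the transitions, and all numbers are illustrative placeholders rather than the paper's exact formulation.

```python
import numpy as np

def predict(belief, dependency, progress):
    """Predict step: propagate the belief through a progress-modulated transition matrix.

    `dependency[i, j]` is an (assumed) LLM-derived score that step j can follow step i;
    `progress[i]` in [0, 1] estimates how far step i has progressed, so transitions away
    from a step become more likely as it nears completion (one possible modulation).
    """
    stay = np.diag(1.0 - progress)                 # remain in the current step
    move = dependency * progress[:, None]          # move on once the step is (nearly) done
    transition = stay + move
    transition /= transition.sum(axis=1, keepdims=True)  # make rows stochastic
    return belief @ transition

def update(prior, likelihood):
    """Update step: fuse the propagated belief with the LMM's per-step probabilities."""
    posterior = prior * likelihood
    return posterior / posterior.sum()

# Toy example with 3 mostly sequential steps; step 0 is almost finished.
belief = np.array([0.8, 0.15, 0.05])               # posterior from the previous segment
dependency = np.array([[0.1, 0.8, 0.1],
                       [0.0, 0.2, 0.8],
                       [0.0, 0.0, 1.0]])
progress = np.array([0.9, 0.1, 0.0])
lmm_probs = np.array([0.3, 0.6, 0.1])              # placeholder LMM output for this segment
print(update(predict(belief, dependency, progress), lmm_probs))
```

In an online setting, this recursion would be applied once per incoming segment, with the posterior of one segment becoming the belief for the next.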

Qualitative Results