Video anomaly detection (VAD) aims to temporally locate abnormal events in a video. Existing works mostly rely on training deep models to learn the distribution of normality with video-level supervision, one-class supervision, or in an unsupervised setting. Training-based methods tend to be domain-specific, making them costly for practical deployment, as any domain change requires new data collection and model training. In this paper, we radically depart from previous efforts and propose LAnguage-based VAD (LAVAD), a method tackling VAD in a novel, training-free paradigm, exploiting the capabilities of pre-trained Large Language Models (LLMs) and existing Vision-Language Models (VLMs). We leverage VLM-based captioning models to generate textual descriptions for each frame of any test video. Given these textual scene descriptions, we devise a prompting mechanism to unlock the capability of LLMs in terms of temporal aggregation and anomaly score estimation, turning LLMs into an effective video anomaly detector. We further leverage modality-aligned VLMs and propose effective techniques based on cross-modal similarity for cleaning noisy captions and refining the LLM-based anomaly scores. We evaluate LAVAD on two large datasets featuring real-world surveillance scenarios (UCF-Crime and XD-Violence), showing that it outperforms both unsupervised and one-class methods without requiring any training or data collection.
The architecture of our proposed LAVAD for addressing training-free VAD. For each test video \(\mathbf{V}\), we first employ a captioning model to generate a caption \(C_i\) for each frame \(\mathbf{I}_i \in \mathbf{V}\), forming a caption sequence \(\mathbf{C}\). Our Image-Text Caption Cleaning component addresses noisy and incorrect raw captions based on cross-modal similarity: we replace each raw caption with the caption \(\hat{C}_i \in \mathbf{C}\) whose textual embedding \( \mathcal{E}_T(\hat{C}_i) \) is most aligned with the image embedding \( \mathcal{E}_I(\mathbf{I}_i) \), resulting in a cleaned caption sequence \( \hat{\mathbf{C}} \). To account for scene context and dynamics, our LLM-based Anomaly Scoring component aggregates the cleaned captions within a temporal window centered around each \(\mathbf{I}_i\) by prompting the LLM to produce a temporal summary \(S_i\), forming a summary sequence \( \mathbf{S} \). The LLM is then queried to provide an anomaly score for each frame based on its \(S_i\), yielding the initial anomaly scores \( \mathbf{a} \) for all frames. Finally, our Video-Text Score Refinement component refines each \(a_i\) by aggregating the initial anomaly scores of the frames whose summary embeddings are most aligned with the representation \( \mathcal{E}_V(\mathbf{V}_i) \) of the video snippet \(\mathbf{V}_i \) centered around \(\mathbf{I}_i \), leading to the final anomaly scores \( \tilde{\mathbf{a}} \) used to detect the anomalies (anomalous frames are highlighted) within the video.
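To make the two cross-modal similarity steps concrete, the following is a minimal sketch of Image-Text Caption Cleaning and Video-Text Score Refinement. It assumes precomputed embeddings from a modality-aligned VLM (the encoders \( \mathcal{E}_I \), \( \mathcal{E}_T \), \( \mathcal{E}_V \)); the function names, the top-K selection, and the mean aggregation are illustrative assumptions rather than the exact formulation used in the paper.

```python
import numpy as np

def cosine_sim(a, b):
    # Pairwise cosine similarity between rows of a (N, d) and rows of b (M, d).
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def clean_captions(image_emb, caption_emb, captions):
    # Image-Text Caption Cleaning: for each frame, keep the caption from the
    # video's own caption pool whose text embedding E_T(C_j) is most aligned
    # with the frame's image embedding E_I(I_i).
    sim = cosine_sim(image_emb, caption_emb)       # (N_frames, N_captions)
    best = sim.argmax(axis=1)
    return [captions[j] for j in best]             # cleaned sequence C_hat

def refine_scores(snippet_emb, summary_emb, scores, k=10):
    # Video-Text Score Refinement: replace each initial LLM score a_i with an
    # aggregate (here, a plain mean as an assumption) of the scores of the K
    # frames whose summary embeddings E_T(S_j) are most aligned with the
    # embedding E_V(V_i) of the snippet centered on frame i.
    sim = cosine_sim(snippet_emb, summary_emb)     # (N_frames, N_frames)
    topk = np.argsort(-sim, axis=1)[:, :k]
    return np.asarray(scores)[topk].mean(axis=1)   # refined scores a_tilde
```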