Harnessing Large Language Models for Training-free Video Anomaly Detection

1 University of Trento 2 Snap Inc. 3 Fondazione Bruno Kessler

Abstract

Video anomaly detection (VAD) aims to temporally locate abnormal events in a video. Existing works mostly rely on training deep models to learn the distribution of normality with either video-level supervision, one-class supervision, or in an unsupervised setting. Training-based methods tend to be domain-specific, making them costly for practical deployment, as any domain change entails new data collection and model training. In this paper, we radically depart from previous efforts and propose LAnguage-based VAD (LAVAD), a method tackling VAD in a novel, training-free paradigm, exploiting the capabilities of pre-trained Large Language Models (LLMs) and existing Vision-Language Models (VLMs). We leverage VLM-based captioning models to generate a textual description for each frame of any test video. Given these textual scene descriptions, we then devise a prompting mechanism to unlock the capability of LLMs for temporal aggregation and anomaly score estimation, turning LLMs into an effective video anomaly detector. We further leverage modality-aligned VLMs and propose effective techniques based on cross-modal similarity for cleaning noisy captions and refining the LLM-based anomaly scores. We evaluate LAVAD on two large datasets featuring real-world surveillance scenarios (UCF-Crime and XD-Violence), showing that it outperforms both unsupervised and one-class methods without requiring any training or data collection.

Task Definition

Banner Image

We introduce the first training-free method for Video Anomaly Detection (VAD), diverging from state-of-the-art methods, all of which are training-based with different degrees of supervision. Our proposal, LAVAD, leverages modality-aligned Vision and Language Models (VLMs) to query and enhance the anomaly scores generated by Large Language Models (LLMs).

Method Overview

Banner Image

The architecture of our proposed LAVAD for addressing training-free VAD. For each test video \(\mathbf{V}\), we first employ a captioning model to generate a caption \(C_i\) for each frame \(\mathbf{I}_i \in \mathbf{V}\), forming a caption sequence \(\mathbf{C}\). Our Image-Text Caption Cleaning component addresses noisy and incorrect raw captions based on cross-modal similarity: we replace each raw caption with a caption \(\hat{C}_i \in \mathbf{C} \) whose textual embedding \( \mathcal{E}_T(\hat{C}_i) \) is most aligned with the image embedding \( \mathcal{E}_I(\mathbf{I}_i) \), resulting in a cleaned caption sequence \( \hat{\mathbf{C}} \). To account for scene context and dynamics, our LLM-based Anomaly Scoring component further aggregates the cleaned captions within a temporal window centered around each \(\mathbf{I}_i\) by prompting the LLM to produce a temporal summary \(S_i\), forming a summary sequence \( \mathbf{S} \). The LLM is then queried to provide an anomaly score for each frame based on its summary \(S_i\), yielding the initial anomaly scores \( \mathbf{a} \) for all frames. Finally, our Video-Text Score Refinement component refines each \(a_i\) by aggregating the initial anomaly scores of the frames whose summary textual embeddings are most aligned with the representation \( \mathcal{E}_V(\mathbf{V}_i) \) of the video snippet \(\mathbf{V}_i \) centered around \(\mathbf{I}_i \), leading to the final anomaly scores \( \mathbf{\tilde{a}} \) for detecting the anomalies (anomalous frames are highlighted) within the video.
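The two cross-modal similarity components above can be sketched with plain cosine similarities over precomputed embeddings. Everything here is a minimal illustration, not the authors' implementation: the function names, the top-K mean aggregation, and the use of raw numpy arrays in place of actual VLM embeddings \( \mathcal{E}_I, \mathcal{E}_T, \mathcal{E}_V \) are assumptions.

```python
import numpy as np

def cosine_sim(a, b):
    # Row-wise cosine similarity between two embedding matrices.
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def clean_captions(img_emb, cap_emb):
    """Image-Text Caption Cleaning (sketch): assign to each frame the
    caption, from the whole sequence, whose text embedding is most
    similar to that frame's image embedding.
    img_emb: (N, D) image embeddings; cap_emb: (N, D) caption text
    embeddings. Returns the index of the cleaned caption per frame."""
    sim = cosine_sim(img_emb, cap_emb)      # (N, N) cross-modal similarity
    return sim.argmax(axis=1)               # index of \hat{C}_i per frame

def refine_scores(a, sum_emb, vid_emb, k=3):
    """Video-Text Score Refinement (sketch): replace each initial LLM
    score a_i with the mean score over the K frames whose summary text
    embeddings best match the snippet embedding of frame i.
    (A mean over top-K is an assumption; other aggregations fit too.)"""
    a = np.asarray(a, dtype=float)
    sim = cosine_sim(vid_emb, sum_emb)      # (N, N) snippet-vs-summary
    topk = np.argsort(-sim, axis=1)[:, :k]  # K most aligned summaries
    return a[topk].mean(axis=1)             # refined scores per frame
```

With identity embeddings and `k=1`, each frame keeps its own score, which makes the aggregation easy to sanity-check before plugging in real VLM features.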

Qualitative Results

Each of the following illustrations presents a video from the test set of either UCF-Crime or XD-Violence. For UCF-Crime, the threshold used to classify a frame as normal or abnormal is derived from the receiver operating characteristic (ROC) curve: it is the point that maximizes the difference between the true positive rate and the false positive rate. For XD-Violence, the threshold is derived from the precision-recall curve: it is the point that maximizes the product of precision and recall. When a frame's anomaly score exceeds this threshold, the frame is classified as anomalous, which is visually indicated by a red bounding box around the video. In the lower figure, red shaded areas depict temporal ground-truth anomalies, complemented by a red slider indicating the progression of time within the video.
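The two threshold criteria above can be computed with a simple sweep over candidate thresholds. This is a numpy-only sketch of the selection rules as described, with hypothetical helper names; it is not the evaluation code used for the figures.

```python
import numpy as np

def roc_threshold(scores, labels):
    """Threshold maximizing TPR - FPR (the ROC-based rule used for
    UCF-Crime above). scores: per-frame anomaly scores; labels: 0/1."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = labels.sum(), (~labels).sum()
    best_t, best_j = None, -np.inf
    for t in np.unique(scores):             # each score is a candidate cut
        pred = scores >= t
        tpr = (pred & labels).sum() / pos
        fpr = (pred & ~labels).sum() / neg
        if tpr - fpr > best_j:
            best_j, best_t = tpr - fpr, t
    return best_t

def pr_threshold(scores, labels):
    """Threshold maximizing precision * recall (the PR-based rule used
    for XD-Violence above)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    best_t, best_v = None, -np.inf
    for t in np.unique(scores):
        pred = scores >= t
        if pred.sum() == 0:
            continue                        # precision undefined: skip
        prec = (pred & labels).sum() / pred.sum()
        rec = (pred & labels).sum() / labels.sum()
        if prec * rec > best_v:
            best_v, best_t = prec * rec, t
    return best_t
```

A frame is then flagged anomalous whenever its score is at or above the selected threshold.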

The video shows a man assaulting a woman in an attempt to steal her handbag, leading to the woman fighting back. LAVAD accurately identifies the anomalous segments in the video, and the captions precisely convey the content of the scene.

The video shows an office environment where individuals are working at their desks without any unusual incidents. LAVAD consistently assigns a low anomaly score throughout the entire video duration.

The video portrays a police-suspect shootout, and LAVAD accurately assigns a high anomaly score to the segment labeled abnormal. However, in the initial and final segments, which are labeled normal, LAVAD also assigns high anomaly scores. In the initial segment, introductory text overlaid on the video leads the LLM to produce elevated anomaly scores. In the final segment, by contrast, the high score arguably reflects genuine abnormality, as a person who has been shot is lying on the ground.

The video shows a group of men attempting a burglary. In the portion labeled abnormal, LAVAD assigns a high anomaly score. However, false positives occur in the normal portion because the summary caption suggests a man stealing a car: although the video only shows a man acting suspiciously near a car, this caption leads to an inflated anomaly score.

The video shows a hockey game with players engaged in a fight. LAVAD accurately assigns a high anomaly score to the segments of the video that capture the fighting scenes.

The video shows a fight between martial artists. LAVAD accurately assigns a high anomaly score to the segments capturing the fighting scenes.

The video shows a small airplane flying over a snowy mountain, during which a building explodes. LAVAD assigns a higher anomaly score to the segments whose summary caption indicates an explosion.

The video shows a driving simulation. LAVAD consistently assigns low anomaly scores for more than 17,500 frames.