🎥 LAVAD: Training-Free Video Anomaly Detection with LLMs

Posted Sep 8, 2025

By DrFirst

🎥 LAVAD: Training-free Video Anomaly Detection with LLM!

LAVAD stands for LAnguage-based Video Anomaly Detection. In other words, video anomaly detection driven by language models!

Title: Harnessing Large Language Models for Training-free Video Anomaly Detection
Conference: CVPR 2024
Code/Checkpoints: GitHub – LAVAD, Project
Key Keywords: Training-Free, Video Anomaly Detection, LLM, VLM, Zero-Shot
Summary: Combines pre-trained Vision-Language Models (VLM) and Large Language Models (LLM) to perform video anomaly detection without any additional training (training-free) — the first approach of its kind!

🚀 LAVAD Core Summary

One-liner: “Even without a training dataset, LLMs and VLMs can catch anomalies in video!”

1) Training-Free Zero-Shot VAD

Previous VAD required one-class training or unsupervised training.
LAVAD performs anomaly detection without any retraining, purely with pre-trained models.

2) VLM-based Caption Generation

Each video frame is captioned by a VLM → text description.
Example: “A person is walking”, “A person is fighting”.

3) LLM for Anomaly Detection

LLM takes the caption sequence and classifies normal vs anomalous behavior.
Uses temporal reasoning + prompts to calculate anomaly scores.

4) Noise Reduction & Refinement

Cleans noisy captions via cross-modal similarity.
Stabilizes frame-level scores through temporal smoothing.

🔍 Background in Previous Work

Video Anomaly Detection (VAD): usually relies on one-class training with only normal videos, or weak supervision.
- A video V is a sequence of images (frames).
- Traditional VAD computes an anomaly score for each frame I.
- Dataset labels take the form (V, y):
  - Fully-supervised: y is a vector like [0,0,0,1,0…] (per-frame labels).
  - Weakly-supervised: y is a single 0/1 label for the entire video.
  - One-Class: y only includes normal data; unseen behaviors are considered anomalous.
- Limitations: hard to collect datasets, and retraining is needed whenever adapting to new domains/environments.
Rise of LLMs (e.g., ChatGPT) and strong VLMs opened a new path!

So what if we just plug VLM + LLM into VAD??

Instead of binary classification, ask the LLM to assign an anomaly score: one of 11 values from 0.0 to 1.0 in steps of 0.1.
Equation (1): ΦLLM(P_C∘P_F∘ΦC(I))
- P_C: context prompt
  - example: “If you were a law enforcement agency, how would you rate the scene described on a scale from 0 to 1, with 0 representing a standard scene and 1 denoting a scene with suspicious activities?”
- P_F: formatting prompt to control output format
  - example: “Please provide the response in the form of a Python list … choose only one number from [0, 0.1, …, 1.0].”
- ∘ : text concatenation
- ΦC: captions from the VLM
Does the Equation (1) baseline work?
- Experiments across different VLMs (x-axis) and LLMs (bar colors) give results that are better than random but far from SOTA.
- Why?
  a. Too noisy at frame-level.
  b. Lacks global temporal context.
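The Equation (1) baseline can be sketched in a few lines. This is a minimal illustration, not the authors' code: the prompts are the ones quoted above (the formatting prompt stays elided exactly as in the text), while `build_prompt`, `parse_score`, and the newline separator are hypothetical choices made here. The actual LLM call is left out; any chat model that returns text could be plugged in.

```python
import ast
import re

# P_C and P_F as quoted in the text. The "…" parts are elided in the source
# and are deliberately left elided here.
CONTEXT_PROMPT = (
    "If you were a law enforcement agency, how would you rate the scene "
    "described on a scale from 0 to 1, with 0 representing a standard scene "
    "and 1 denoting a scene with suspicious activities?"
)
FORMAT_PROMPT = (
    "Please provide the response in the form of a Python list … "
    "choose only one number from [0, 0.1, …, 1.0]."
)


def build_prompt(caption: str) -> str:
    """Text concatenation P_C ∘ P_F ∘ Φ_C(I); the newline separator is an assumption."""
    return "\n".join([CONTEXT_PROMPT, FORMAT_PROMPT, caption])


def parse_score(reply: str):
    """Extract the first Python-style list from the reply and return its first
    number, clamped to [0, 1] and snapped to the nearest 0.1.
    Returns None when no usable list is found."""
    match = re.search(r"\[[^\[\]]*\]", reply)
    if match is None:
        return None
    try:
        values = ast.literal_eval(match.group(0))
    except (ValueError, SyntaxError):
        return None
    if not values or not isinstance(values[0], (int, float)):
        return None
    score = min(max(float(values[0]), 0.0), 1.0)
    return round(score * 10) / 10
```

Parsing defensively matters here because LLM replies often wrap the list in extra words; returning `None` lets the caller decide how to handle frames whose reply could not be parsed.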

🧱 LAVAD Architecture

Five key components:
1. ΦC : VLM-generated captions (image → text)
2. ΦLLM : LLM (text reasoning)
3. E_I : image encoder (image → vector)
4. E_T : text encoder (text → vector)
5. E_V : video encoder (video → vector)
Notation:
- D_C : set of captions across the whole video
- \hat{C}_i : captions collected around frame i
- S : temporally enriched caption summary
Three modules address unsupervised VAD’s main issues:
i) Image-Text Caption Cleaning — denoise captions using E_I & E_T similarity
ii) LLM-based Anomaly Scoring — use ΦLLM with temporal windows for anomaly score
iii) Video-Text Score Refinement — refine anomaly scores via E_V–E_T similarity

i) Image-Text Caption Cleaning

ΦC generates one caption per frame, but these raw captions C are noisy.
Embed every caption in the video with E_T.
For each frame, pick the caption whose embedding is most similar (cosine similarity) to the frame’s E_I embedding.
Note: this is offline VAD (the full video is available), so captions from anywhere in the video can be used.
The selected captions form the refined caption set (\hat{C}).
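A small sketch of the cleaning step, assuming the frame and caption embeddings have already been computed and live in the same space. The function names and the toy 2-D vectors are illustrative only; real embeddings would come from E_I and E_T.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def clean_captions(frame_embs, caption_embs, captions):
    """For each frame embedding, return the caption (from the whole video)
    whose text embedding is closest to it."""
    cleaned = []
    for frame in frame_embs:
        best = max(range(len(captions)), key=lambda j: cosine(frame, caption_embs[j]))
        cleaned.append(captions[best])
    return cleaned


# Toy data: the third frame got a poor caption from the captioner.
captions = ["a man walks down a street", "a car is parked", "a blurry image"]
caption_embs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
frame_embs = [[0.9, 0.1], [0.2, 0.95], [0.95, 0.05]]

cleaned = clean_captions(frame_embs, caption_embs, captions)
```

In the toy data the third frame looks like the first one in embedding space, so its weak caption is replaced by the first frame's caption.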

ii) LLM-based Anomaly Scoring

Captions (\hat{C}_i) describe the scene but lack temporal context.
Solution: use the LLM to summarize the captions from a T-second window around frame i (\hat{C}_i).
Prompt: “Please summarize what happened in few sentences, based on the following temporal description of a scene. Do not include any unnecessary details or descriptions.”
The resulting summary S_i is fed to the LLM again to estimate anomaly score a_i.
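The windowing step can be sketched as follows. The summary prompt is the one quoted above; everything else (function names, the centered window, the `step` subsampling, and the timestamp prefix on each caption line) is an assumption made for illustration.

```python
SUMMARY_PROMPT = (
    "Please summarize what happened in few sentences, based on the following "
    "temporal description of a scene. Do not include any unnecessary details "
    "or descriptions."
)


def window_indices(i, num_frames, fps, window_sec=10.0, step=1):
    """Indices of frames inside a window of `window_sec` seconds centered on
    frame i, clipped to the video bounds and subsampled every `step` frames."""
    half = int(window_sec * fps / 2)
    start = max(0, i - half)
    stop = min(num_frames, i + half + 1)
    return list(range(start, stop, step))


def build_summary_prompt(cleaned_captions, i, fps, window_sec=10.0, step=1):
    """Concatenate the summary prompt with the time-ordered captions in the window."""
    idx = window_indices(i, len(cleaned_captions), fps, window_sec, step)
    lines = [f"{j / fps:.1f}s: {cleaned_captions[j]}" for j in idx]
    return SUMMARY_PROMPT + "\n" + "\n".join(lines)
```

The returned prompt would be sent to the LLM to obtain S_i, and S_i would then go through the scoring prompt to obtain a_i.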

iii) Video-Text Score Refinement

Now we have frame-level anomaly scores a_i, but only from LLM reasoning.
Encode the video snippet V_i (around frame i) with E_V.
Encode summary S_i with E_T.
Use their similarity to refine anomaly score (\tilde{a}_i).

Essentially: find top-K most similar (video, text) pairs and weight their scores.
Weighted averaging of similar neighbors → stable final anomaly scores.
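The refinement can be sketched like this. It assumes precomputed video-snippet and summary embeddings in a shared space, and it uses exponentiated cosine similarity as the neighbor weight, which is one plausible weighting chosen here rather than something the text above specifies. All function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def refine_scores(scores, video_embs, text_embs, k=9):
    """For each video snippet, find the K summaries most similar to it and
    return the similarity-weighted average of those summaries' LLM scores."""
    refined = []
    for video in video_embs:
        sims = [cosine(video, text) for text in text_embs]
        top = sorted(range(len(sims)), key=lambda j: sims[j], reverse=True)[:k]
        weights = [math.exp(sims[j]) for j in top]  # assumed weighting scheme
        total = sum(w * scores[j] for w, j in zip(weights, top))
        refined.append(total / sum(weights))
    return refined
```

With `k=1` each frame simply takes the score of its best-matching summary; larger `k` pulls each score toward those of semantically similar moments, which is what smooths out isolated LLM misjudgments.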

🧪 Experiments & Results

Setup

VLM: BLIP-2
LLM: Llama-2-13b-chat
Multimodal encoders (E_I, E_T, E_V): ImageBind
Temporal window: T = 10 s
Datasets:
- UCF-Crime: 1900 surveillance videos, 13 anomaly classes. Evaluated with AUC-ROC & AP.
- XD-Violence: 4754 videos, 6 anomaly classes.
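For reference, here is a plain-Python sketch of the two metrics computed over frame-level labels and scores. It is meant to show what the numbers measure, not to reproduce any benchmark's official evaluation script; ties in AP are broken by sort order, which may differ slightly from library implementations.

```python
def auc_roc(labels, scores):
    """Probability that a random anomalous frame scores higher than a random
    normal frame (ties count as half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values measured at the rank of each positive,
    with frames ranked by descending score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank
    if hits == 0:
        raise ValueError("need at least one positive label")
    return total / hits


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc_roc(labels, scores))            # 0.75
print(average_precision(labels, scores))  # about 0.833
```

The pairwise AUC loop is quadratic in the number of frames, so it is only suitable for small examples like this one.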

Results

Achieves SOTA among training-free methods.
Still lags behind supervised baselines.

Ablation

A. All three components are crucial:
i) Image-Text Caption Cleaning
ii) LLM-based Anomaly Scoring
iii) Video-Text Score Refinement
B. Prompt priors experiment:
- Base prompt: “How would you rate the scene described on a scale from 0 to 1, with 0 representing a standard scene and 1 denoting a scene with suspicious activities?”
- Add anomaly prior: “or potentially criminal activities” → little effect.
- Add impersonation: “If you were a law enforcement agency…” → best (+0.96% AUC).
- Both priors → no further improvement.
C. Effect of K (neighbors in refinement):
- Larger K improves performance until it saturates around K=9.
D. Other observations:
- Removing caption cleaning causes a sharp performance drop.
- LLM reasoning clearly improves anomaly score quality.
- LAVAD significantly outperforms keyword-matching baselines.

✅ Conclusion

LAVAD is presented as the first training-free video anomaly detection framework.
It combines a VLM and an LLM to detect anomalies without any training data.
Promising for surveillance, security, and safety-critical systems.
As multimodal LLMs advance, performance and scalability should improve further.

💡 My Takeaways

Extending LAVAD to real-time video streams would be very valuable.
Inference speed (fps) may be a bottleneck.
A fun and impactful training-free research direction!

🎥 (Korean version) LAVAD: LLMs Handle Even Video Anomaly Detection Without Training!

LA-VAD = LAnguage-based Video Anomaly Detection. That is, language-model-based video anomaly detection!!

Title: Harnessing Large Language Models for Training-free Video Anomaly Detection
Conference: CVPR 2024
Code/Checkpoints: GitHub – LAVAD, Project
Key Keywords: Training-Free, Video Anomaly Detection, LLM, VLM, Zero-Shot
Summary: Combines a pre-trained Vision-Language Model (VLM) and a Large Language Model (LLM) to detect anomalous behavior in video without additional training (training-free), the first approach of its kind!

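🚀 LAVAD Core Summary

One-liner: “Even without a training dataset, LLMs and VLMs catch anomalous situations in video!”

1) Training-Free Zero-Shot VAD

Previous VAD required one-class or unsupervised training.
LAVAD performs anomaly detection with no additional training at all, using only pre-trained models.

2) VLM-based Caption Generation

A VLM describes each input video frame → a text caption is generated.
Example: “A person is walking”, “A person is fighting”

3) Anomaly Detection with an LLM

The LLM receives the caption sequence and classifies normal vs. anomalous behavior.
It computes the anomaly score through temporal-context understanding + prompt-based reasoning.

4) Refinement Stage (Noise Reduction)

Caption noise → removed via cross-modal similarity
Frame-level anomaly scores → stabilized via temporal smoothing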

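🔍 The Flow of Previous Work

Video Anomaly Detection (VAD): mainly one-class training on large-scale normal videos, or weakly supervised approaches.
- A video is a continuous run of images!
- Traditional VAD computes an anomaly score for the image I of each frame.
- Accordingly, dataset labels take the form (V, y) (video, label).
- Fully-supervised: y gives per-frame results, like [0,0,0,1,0…]
- Weakly-supervised: y is 0 or 1, giving only a judgment for the whole video
- One-Class: y contains only normal cases (train on all the normal ones, and treat any unseen situation as anomalous!!)
- Limitations of these approaches: datasets are hard to obtain, and applying them to a new environment/domain always requires retraining.
Progress of LLMs: starting with ChatGPT, many LLMs have advanced. VLMs perform well too!!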

So! What if we just take a VLM and an LLM and use them directly for VAD??

Give the LLM the captions and, instead of a normal/anomalous classification, ask it to measure an anomaly score. The score has 11 levels (0.0 to 1.0).
Equation (1): ΦLLM(P_C∘P_F∘ΦC(I))
- P_C : context prompt
  - example: “If you were a law enforcement agency, how would you rate the scene described on a scale from 0 to 1, with 0 representing a standard scene and 1 denoting a scene with suspicious activities?”
- P_F : prompt asking the LLM for output in a fixed format
  - example: “Please provide the response in the form of a Python list … choose only one number from [0, 0.1, …, 1.0].”
- ∘ : text concatenation
- ΦC : captions from the VLM
Would the Equation (1) approach work well!?
- Tested by swapping the VLM (x-axis) and the LLM (bar colors): results are better than random but fall far short of SOTA!
- Why?
  a. Frame-level processing is very noisy!
  b. It cannot grasp the overall context of the video!