
🎥 LAVAD: Training-Free Video Anomaly Detection with LLMs

LA-VAD = LAnguage-based Video Anomaly Detection. In other words, language model-based video anomaly detection!!


  • Title: Harnessing Large Language Models for Training-free Video Anomaly Detection
  • Conference: CVPR 2024
  • Code/Checkpoints: GitHub – LAVAD, Project
  • Key Keywords: Training-Free, Video Anomaly Detection, LLM, VLM, Zero-Shot
  • Summary: Combines pre-trained Vision-Language Models (VLM) and Large Language Models (LLM) to perform video anomaly detection without any additional training (training-free), the first approach of its kind!

🚀 LAVAD Core Summary

One-liner: “Even without a training dataset, LLMs and VLMs can catch anomalies in video!”

1) Training-Free Zero-Shot VAD

  • Previous VAD required one-class training or unsupervised training.
  • LAVAD performs anomaly detection without any retraining, purely with pre-trained models.

2) VLM-based Caption Generation

  • Each video frame is captioned by a VLM → a text description.
  • Example: “A person is walking”, “A person is fighting” (see the captioning sketch below).
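
For concreteness, a minimal per-frame captioning sketch with BLIP-2 through Hugging Face transformers might look like this (the checkpoint name is my assumption; the exact BLIP-2 variant used in the paper may differ):

```python
# Minimal sketch: caption one video frame with BLIP-2 (checkpoint name is an assumption).
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

def caption_frame(frame: Image.Image) -> str:
    """Return a short text description of a single frame."""
    inputs = processor(images=frame, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=30)
    return processor.batch_decode(out, skip_special_tokens=True)[0].strip()
```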

3) LLM for Anomaly Detection

  • LLM takes the caption sequence and classifies normal vs anomalous behavior.
  • Uses temporal reasoning + prompts to calculate anomaly scores.

4) Noise Reduction & Refinement

  • Cleans noisy captions via cross-modal similarity.
  • Stabilizes frame-level scores through temporal smoothing.

๐Ÿ” Background in Previous Work

  • Video Anomaly Detection (VAD): usually relies on one-class training with only normal videos, or weak supervision.
    • Videos are sequences of images (frames).
    • Traditional VAD computes an anomaly score for each frame image I.
    • Dataset labels take the form (V, y), i.e., (video, label); a tiny example follows this list:
      • Fully-supervised: y is a per-frame label vector like [0, 0, 0, 1, 0, …].
      • Weakly-supervised: y is a single 0/1 label for the entire video.
      • One-Class: y only includes normal data; unseen behaviors are considered anomalous.
    • Limitations: hard to collect datasets, and retraining is needed whenever adapting to new domains/environments.
  • Rise of LLMs (e.g., ChatGPT) and strong VLMs opened a new path!
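
For concreteness, here is a tiny illustration of the three supervision settings above for a hypothetical 5-frame video V (the values are made up):

```python
# Hypothetical labels for one 5-frame video V, illustrating the supervision settings above.
y_fully_supervised  = [0, 0, 0, 1, 0]   # per-frame labels: only frame 4 is anomalous
y_weakly_supervised = 1                  # a single 0/1 label for the whole video
# One-class: there are no anomaly labels at all; the model is trained on normal videos only,
# and anything it has not seen during training is flagged as anomalous.
```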

So what if we just plug VLM + LLM into VAD??

  • Instead of binary classification, ask the LLM to assign anomaly scores (11 levels: 0.0–1.0).
  • Equation (1): Φ_LLM(P_C ∘ P_F ∘ Φ_C(I)) (see the prompt sketch after this list)

    • P_C: context prompt
      • example: “If you were a law enforcement agency, how would you rate the scene described on a scale from 0 to 1, with 0 representing a standard scene and 1 denoting a scene with suspicious activities?”
    • P_F: formatting prompt to control output format
      • example: “Please provide the response in the form of a Python list … choose only one number from [0, 0.1, …, 1.0].”
    • ∘ : text concatenation
    • Φ_C: captions from the VLM
  • Does this Equation (1) method work?
    • Experiments with different VLMs (x-axis) and LLMs (bar colors): results are better than random but far from SOTA.
      [Figure: AUC of this naive baseline across different VLMs (x-axis) and LLMs (bar colors)]

    • Why?
      a. Too noisy at frame-level.
      b. Lacks global temporal context.
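
To make the baseline concrete, below is a minimal sketch of the Equation (1) prompting scheme. `query_llm` is a hypothetical stand-in for whatever LLM client is used (e.g., a Llama-2 chat wrapper), the formatting prompt is abridged, and the score parsing is deliberately simplistic:

```python
# Minimal sketch of the Equation (1) baseline: Φ_LLM(P_C ∘ P_F ∘ Φ_C(I)).
# `query_llm` is a hypothetical helper standing in for a real LLM client.

P_C = ("If you were a law enforcement agency, how would you rate the scene described "
       "on a scale from 0 to 1, with 0 representing a standard scene and 1 denoting "
       "a scene with suspicious activities?")
P_F = ("Please provide the response in the form of a Python list. "  # abridged formatting prompt
       "Choose only one number from [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0].")

def frame_anomaly_score(caption: str) -> float:
    """Concatenate context prompt, formatting prompt, and the VLM caption, then parse the score."""
    prompt = f"{P_C} {P_F} Scene description: {caption}"
    reply = query_llm(prompt)                     # e.g. "[0.8]"
    return float(reply.strip().strip("[]"))

# frame_anomaly_score("A person is fighting")    # -> e.g. 0.8
```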


🧱 LAVAD Architecture

[Figure: LAVAD architecture overview]

  • Five key components:
    1. Φ_C : VLM captioning model (image → text)
    2. Φ_LLM : LLM (text reasoning)
    3. E_I : image encoder (image → vector)
    4. E_T : text encoder (text → vector)
    5. E_V : video encoder (video → vector)
      • D_C : set of captions across the whole video
      • Ĉ_i : cleaned captions collected in a temporal window around frame i
      • S_i : temporally enriched caption summary for frame i
  • Three modules address the issues of the naive baseline above:
    i) Image-Text Caption Cleaning: denoise captions using E_I and E_T similarity
    ii) LLM-based Anomaly Scoring: use Φ_LLM over temporal windows to produce anomaly scores
    iii) Video-Text Score Refinement: refine anomaly scores via E_V and E_T similarity

i) Image-Text Caption Cleaning

  • Generate a caption for every frame with Φ_C → as seen above, these raw captions are noisy.
  • Embed all captions from the whole video with E_T.
  • For each frame, pick the caption whose text embedding has the highest cosine similarity with the frame's E_I embedding.
  • Note: this is offline VAD (the full video is available), so captions from the entire video can be used.
  • This yields the cleaned caption set Ĉ (see the sketch below).
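
A minimal sketch of this step, assuming the frame and caption embeddings have already been computed with E_I and E_T (e.g., ImageBind) and L2-normalized:

```python
# Minimal caption-cleaning sketch over precomputed, L2-normalized embeddings.
import numpy as np

def clean_captions(img_emb: np.ndarray, txt_emb: np.ndarray, captions: list[str]) -> list[str]:
    """img_emb: (N, d) E_I embeddings, one per frame; txt_emb: (N, d) E_T embeddings of the raw
    captions; captions: the N raw captions. For each frame, return the caption (from anywhere in
    the video) whose embedding is most cosine-similar to that frame's image embedding."""
    sim = img_emb @ txt_emb.T          # (N, N) cosine similarities (embeddings are normalized)
    best = sim.argmax(axis=1)          # index of the best-matching caption for each frame
    return [captions[j] for j in best]
```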

ii) LLM-based Anomaly Scoring

  • The cleaned captions ĉ_i describe each scene but still lack temporal context.
  • Solution: have the LLM summarize the cleaned captions from a T-second window around frame i (Ĉ_i).
  • Prompt: “Please summarize what happened in few sentences, based on the following temporal description of a scene. Do not include any unnecessary details or descriptions.”
  • The resulting summary S_i is fed to the LLM again to estimate anomaly score a_i.
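
A minimal sketch of this step; it reuses the hypothetical `query_llm` helper and the `frame_anomaly_score` prompt from the Equation (1) sketch above, and the window handling is my simplification:

```python
# Minimal sketch of temporal summarization + LLM scoring (reuses query_llm / frame_anomaly_score).
SUMMARY_PROMPT = ("Please summarize what happened in few sentences, based on the following "
                  "temporal description of a scene. Do not include any unnecessary details "
                  "or descriptions.")

def temporal_anomaly_score(i: int, cleaned_captions: list[str], fps: float, T: float = 10.0) -> float:
    half = int(T * fps / 2)                                         # frames on each side of frame i
    window = cleaned_captions[max(0, i - half): i + half + 1]       # Ĉ_i: cleaned captions around i
    summary = query_llm(SUMMARY_PROMPT + "\n" + "\n".join(window))  # temporal summary S_i
    return frame_anomaly_score(summary)                             # score S_i with the Eq. (1) prompt
```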

iii) Video-Text Score Refinement

  • Now we have frame-level anomaly scores a_i, but only from LLM reasoning.
  • Encode the video snippet V_i (around frame i) with E_V.
  • Encode summary S_i with E_T.
  • Use their similarity to refine the anomaly score into ã_i.

[Figure: video-text score refinement]

  • Essentially: find top-K most similar (video, text) pairs and weight their scores.
  • Weighted averaging over these similar neighbors → stable final anomaly scores.
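
A minimal sketch of the refinement, assuming L2-normalized E_V snippet embeddings and E_T summary embeddings; the exact aggregation weights in the paper may differ (a softmax over similarities is used here as an assumption):

```python
# Minimal score-refinement sketch over precomputed, L2-normalized embeddings.
import numpy as np

def refine_scores(vid_emb: np.ndarray, sum_emb: np.ndarray, scores: np.ndarray, K: int = 9) -> np.ndarray:
    """vid_emb: (N, d) E_V embeddings of snippets centered on each frame; sum_emb: (N, d) E_T
    embeddings of the summaries S_i; scores: (N,) LLM anomaly scores a_i. Returns refined ã_i."""
    sim = vid_emb @ sum_emb.T                          # (N, N) cosine similarities
    refined = np.empty_like(scores, dtype=float)
    for i in range(len(scores)):
        topk = np.argsort(sim[i])[-K:]                 # the K summaries most similar to snippet i
        w = np.exp(sim[i, topk])                       # similarity weights (softmax-style, assumption)
        refined[i] = np.dot(w / w.sum(), scores[topk]) # similarity-weighted average of their scores
    return refined
```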

🧪 Experiments & Results

Setup

  • VLM: BLIP-2
  • LLM: Llama-2-13b-chat
  • Multimodal encoders (E_I, E_T, E_V): ImageBind, temporal window T=10s
  • Datasets:
    • UCF-Crime: 1900 surveillance videos, 13 anomaly classes. Evaluated with AUC-ROC & AP.
    • XD-Violence: 4754 videos, 6 anomaly classes.
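
The same setup collected into a plain Python dict for quick reference (values are copied from the bullets above; this is not an official LAVAD config file):

```python
# Experimental setup from the post, gathered in one place (not an official config file).
LAVAD_SETUP = {
    "captioner_vlm": "BLIP-2",
    "llm": "Llama-2-13b-chat",
    "multimodal_encoders": "ImageBind",       # provides E_I, E_T, and E_V
    "temporal_window_seconds": 10,            # T
    "datasets": ["UCF-Crime", "XD-Violence"],
}
```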

Results

[Results table: LAVAD vs. prior training-free and supervised methods on UCF-Crime and XD-Violence]

  • Achieves SOTA among training-free methods.
  • Still lags behind supervised baselines.

Ablation

[Ablation tables: components, prompt priors, and number of neighbors K]

  • A. All three components are crucial:
    i) Image-Text Caption Cleaning
    ii) LLM-based Anomaly Scoring
    iii) Video-Text Score Refinement

  • B. Prompt priors experiment:
    • Base prompt: “How would you rate the scene described on a scale from 0 to 1, with 0 representing a standard scene and 1 denoting a scene with suspicious activities?”
    • Add anomaly prior: “or potentially criminal activities” → little effect.
    • Add impersonation: “If you were a law enforcement agency…” → best (+0.96% AUC).
    • Both priors together → no further improvement; perhaps over-constraining the prompt has its limits.
  • C. Effect of K (neighbors in refinement):
    • Larger K improves performance until saturating around K=9.
    • No caption cleaning = sharp performance drop.
    • LLM reasoning clearly improves anomaly score quality.
    • Outperforms keyword-matching baselines significantly.

✅ Conclusion

  • LAVAD is the first training-free video anomaly detection framework.
  • Combines VLM + LLM to detect anomalies without any training data.
  • Promising for surveillance, security, and safety-critical systems.
  • As multimodal LLMs advance, performance and scalability will improve further.

💡 My Takeaways

  • Extending LAVAD to real-time video streams would be very valuable.
  • Inference speed (fps) may be a bottleneck.
  • A fun and impactful training-free research direction!
