
🖼️ Qwen2.5-VL: Next-Gen Vision-Language Model with Dynamic Resolution & Long Video Understanding


🖼️ Qwen2.5-VL: From Dynamic Resolution to Ultra-Long Video Understanding!


  • Title: Qwen2.5-VL Technical Report
  • Venue: arXiv (February 2025, Alibaba Qwen Team)
  • Code/Checkpoints: GitHub (Qwen2.5-VL)
  • Keywords: Vision-Language Model, Dynamic Resolution, Long-Video, Document Parsing, Grounding, Agent
  • Summary: Qwen2.5-VL is the next-generation VLM in the Qwen series. It combines precise object recognition and localization, strong document and chart parsing, and understanding of hours-long videos with improved training efficiency. It is released as open source with SOTA performance on par with GPT-4o and Claude 3.5 Sonnet!

🚀 Qwen2.5-VL Key Takeaways

One-line summary: "A general-purpose VLM that handles everything multimodal: images, documents, video, and agents!"

1) Precise Object Localization (Grounding)

  • Recognition at the bounding-box and point level
  • Supports JSON and absolute-coordinate output formats → enables precise spatial reasoning
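To make the JSON, absolute-coordinate format more concrete, here is a minimal sketch of how such an output might be consumed. The key names `bbox_2d` and `label`, the sample values, and the helper `normalize_boxes` are assumptions for illustration; this post does not specify the exact schema.

```python
import json

# Hypothetical model output: absolute pixel coordinates [x1, y1, x2, y2].
# The key names "bbox_2d" and "label" are assumptions for this sketch.
raw_output = (
    '[{"bbox_2d": [100, 50, 300, 250], "label": "dog"},'
    ' {"bbox_2d": [0, 0, 640, 480], "label": "sofa"}]'
)


def normalize_boxes(raw, width, height):
    """Parse the JSON and convert absolute boxes to [0, 1] relative boxes."""
    boxes = []
    for item in json.loads(raw):
        x1, y1, x2, y2 = item["bbox_2d"]
        boxes.append({
            "label": item["label"],
            "box": [x1 / width, y1 / height, x2 / width, y2 / height],
        })
    return boxes


boxes = normalize_boxes(raw_output, width=640, height=480)
for b in boxes:
    print(b["label"], [round(v, 3) for v in b["box"]])
```

Because the coordinates are absolute pixels, a downstream consumer needs the image size only if it wants relative values, as in this conversion.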

2) Document Parsing (Omni-Parsing)

  • Goes beyond OCR: unified handling of multilingual text, math formulas, chemical formulas, and even sheet music
  • Learns HTML-based layouts to understand the structure of an entire document

3) Ultra-Long Video Understanding

  • Dynamic FPS Sampling + Absolute Time Encoding (MRoPE)
  • Can extract events at second-level granularity from hours-long videos
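The idea behind absolute time encoding can be sketched as follows: temporal position IDs follow each frame's timestamp rather than its frame index, so the same moment gets the same ID at any sampling rate. This is a toy illustration, not the model's actual implementation; the function name and the `ids_per_second` parameter are hypothetical.

```python
def temporal_position_ids(num_frames, fps, ids_per_second=2):
    """Map each sampled frame to an ID proportional to its absolute timestamp.

    ids_per_second is a hypothetical knob for this sketch: how many
    position IDs correspond to one second of video.
    """
    ids = []
    for i in range(num_frames):
        t = i / fps  # absolute timestamp of frame i, in seconds
        ids.append(round(t * ids_per_second))
    return ids


# The same 3-second clip sampled at two different rates.
slow = temporal_position_ids(num_frames=4, fps=1)  # frames at 0, 1, 2, 3 s
fast = temporal_position_ids(num_frames=7, fps=2)  # frames at 0, 0.5, ..., 3 s

print(slow)  # [0, 2, 4, 6]
print(fast)  # [0, 1, 2, 3, 4, 5, 6]
```

In both cases the frame at 3 s receives ID 6, which is what lets the position signal stay consistent when the FPS changes.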

4) Stronger Agent Capabilities

  • UI grounding and operation on PC and mobile
  • Handles real-world tasks through multi-step reasoning and function calls

🔍 Limitations of Prior Work and What Sets Qwen2.5-VL Apart

  • Earlier VLMs: resolution constraints, limits on long video, fragmented document parsing
  • Qwen2.5-VL:
    • Window-attention ViT → lower compute cost while preserving resolution
    • Native Dynamic Resolution → processes inputs at their original size
    • Absolute Time Encoding → consistent understanding of time regardless of FPS
    • 4.1T-token pretraining → covers document, video, OCR, and agent data
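One consequence of native dynamic resolution is that the number of visual tokens scales with image size instead of being fixed. The sketch below estimates that count under two assumptions not stated in this post: a 14-pixel patch and a 2×2 patch merge, so one token covers roughly 28×28 pixels. The function name is hypothetical.

```python
def visual_token_count(height, width, patch=14, merge=2):
    """Estimate LLM-side visual tokens for an image of the given size.

    patch=14 and merge=2 are assumptions for this sketch.
    """
    unit = patch * merge  # pixels per token along each side
    # Snap each side to the nearest multiple of `unit`, with a minimum of one unit.
    h = max(unit, round(height / unit) * unit)
    w = max(unit, round(width / unit) * unit)
    return (h // unit) * (w // unit)


print(visual_token_count(448, 448))  # 256
print(visual_token_count(280, 560))  # 200
```

A small thumbnail and a full-page scan therefore cost very different numbers of tokens, rather than both being resized to one fixed grid.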

🧱 Qwen2.5-VL Architecture


1) Vision Encoder (improved ViT)

  • Window Attention + 2D/3D RoPE
  • Preserves native resolution + groups consecutive video frames
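To see why window attention reduces compute, the following back-of-the-envelope sketch counts query-key pairs for full attention versus attention restricted to non-overlapping windows. The window size of 8×8 patches and the function names are assumptions for illustration, and the grid sides are assumed to be multiples of the window.

```python
def full_attention_pairs(grid_h, grid_w):
    """Every patch attends to every patch: N^2 pairs."""
    n = grid_h * grid_w
    return n * n


def window_attention_pairs(grid_h, grid_w, window=8):
    """Each patch attends only to patches in its own window.

    Assumes grid_h and grid_w are multiples of `window`.
    """
    n_windows = (grid_h // window) * (grid_w // window)
    per_window = (window * window) ** 2
    return n_windows * per_window


full = full_attention_pairs(64, 64)
windowed = window_attention_pairs(64, 64)
print(full, windowed, full // windowed)  # 16777216 262144 64
```

Full attention grows quadratically with the number of patches, while the windowed count grows linearly, so the gap widens as resolution increases.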

2) MLP-based Vision-Language Merger

  • Groups patch features → more efficient input to the LLM
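The grouping step can be sketched in plain Python: neighboring patch features are concatenated into one longer vector, which cuts the number of tokens handed to the LLM. The 2×2 group size and the function name are assumptions, and the MLP projection that would follow is omitted.

```python
def group_patches(grid, merge=2):
    """Concatenate each merge x merge block of patch features into one vector.

    `grid` is a rows x cols nested list of feature vectors; both sides are
    assumed to be multiples of `merge`.
    """
    rows, cols = len(grid), len(grid[0])
    merged = []
    for r in range(0, rows, merge):
        for c in range(0, cols, merge):
            vec = []
            for dr in range(merge):
                for dc in range(merge):
                    vec.extend(grid[r + dr][c + dc])
            merged.append(vec)
    return merged


# A 4 x 4 grid of 3-dimensional toy features; patch (r, c) holds the value r*4 + c.
grid = [[[float(r * 4 + c)] * 3 for c in range(4)] for r in range(4)]
tokens = group_patches(grid)
print(len(tokens), len(tokens[0]))  # 4 12
```

Sixteen patches become four tokens, each four times as wide, so the sequence the LLM sees is a quarter of the original length.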

3) Qwen2.5 LM Decoder

  • Built on the Qwen2.5 LLM
  • Multimodal Rotary Position Embedding (MRoPE) → alignment to absolute time

🧪 Experimental Results

🎯 Multimodal Benchmarks

  • MMBench-EN: 88.6 (surpasses InternVL2.5 and Claude-3.5 Sonnet)
  • MMStar: 70.8 (best score)
  • RealWorldQA: 78.7 (strong adaptation to real-world scenarios)

🎯 OCR / Document Understanding

  • CC-OCR, OmniDocBench: achieves SOTA
  • OCRBench_v2: +9.6% (EN) and +20.6% (ZH) over Gemini 1.5 Pro

🎯 Grounding

  • Performance close to GroundingDINO across all of RefCOCO/+/g
  • ODinW-13 (open-vocab detection): 43.1 mAP
  • CountBench: 93.6 (detect-then-count approach)
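The detect-then-count approach amounts to localizing every instance first and then tallying the detections. A minimal sketch, assuming the detections arrive as a JSON list with hypothetical `bbox_2d` and `label` keys (the sample data and the function name are invented for illustration):

```python
import json
from collections import Counter

# Hypothetical detection output; key names and values are assumptions.
raw = (
    '[{"bbox_2d": [10, 10, 50, 50], "label": "apple"},'
    ' {"bbox_2d": [60, 12, 98, 52], "label": "apple"},'
    ' {"bbox_2d": [5, 70, 40, 110], "label": "banana"}]'
)


def count_by_detection(raw_json):
    """Count objects per label by tallying one detection per instance."""
    return Counter(item["label"] for item in json.loads(raw_json))


counts = count_by_detection(raw)
print(counts["apple"], counts["banana"])  # 2 1
```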

🎯 Video Understanding

  • Charades-STA: mIoU 50.9 (surpasses GPT-4o)
  • LVBench, MLVU: best performance on long-video QA
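For readers unfamiliar with the mIoU metric used for temporal grounding, here is a minimal sketch of how it is typically computed: the intersection-over-union of predicted and ground-truth time segments, averaged over queries. The segments below are made up for illustration and are not benchmark data.

```python
def temporal_iou(pred, gt):
    """IoU of two (start, end) segments given in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def mean_iou(pairs):
    """Average temporal IoU over (prediction, ground truth) pairs."""
    return sum(temporal_iou(p, g) for p, g in pairs) / len(pairs)


# Made-up examples: partial overlap, exact match, no overlap.
pairs = [
    ((2.0, 6.0), (4.0, 8.0)),
    ((0.0, 5.0), (0.0, 5.0)),
    ((1.0, 2.0), (3.0, 4.0)),
]
score = mean_iou(pairs)
print(round(score, 4))  # 0.4444
```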

🎯 Agent

  • ScreenSpot Pro: 43.6 (a large jump from Qwen2-VL's 1.6)
  • Android Control, MobileMiniWob++: surpasses GPT-4o and Gemini 2.0

👀 Qualitative Comparison

  • Documents: not just plain text extraction, but structured parsing of layouts, tables, charts, and formulas
  • Video: event grounding based on absolute time → can explain "what happened, and when"
  • Agent: UI grounding plus reasoning on real devices → enables automated operation

🧪 Ablation Analysis

  • Without Absolute Time Encoding → video event alignment performance drops sharply
  • Without Window Attention → higher compute cost and slower speed
  • Without Dynamic Resolution → unstable performance across varied input resolutions

✅ Conclusion

  • Qwen2.5-VL is an all-in-one multimodal model:
    • Covers the full range: images, documents, video, and agents
    • An open-source VLM that competes with GPT-4o and Claude 3.5
  • Key contributions:
    1. Window Attention ViT
    2. Dynamic Resolution + Absolute Time Encoding
    3. Document Omni-Parsing
    4. Long-Video + Agent support
  • → Positioned as a standard for next-generation multimodal AI! 🚀

This post is licensed under CC BY 4.0 by the author.