🖼️ Qwen2.5-VL: Next-Gen Vision-Language Model with Dynamic Resolution & Long Video Understanding

Posted Sep 8, 2025

By DrFirst

5 min read

🖼️ Qwen2.5-VL: Dynamic Resolution and Ultra-Long Video Understanding

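Title: Qwen2.5-VL Technical Report
Venue: arXiv (February 2025, Alibaba Qwen Team)
Code/Checkpoints: GitHub – Qwen2.5-VL
Keywords: Vision-Language Model, Dynamic Resolution, Long-Video, Document Parsing, Grounding, Agent
Summary: Qwen2.5-VL is the next-generation VLM in the Qwen series. It combines precise object recognition and localization, strong document and chart parsing, and understanding of videos that run for several hours, together with improved training efficiency. It is released as open source, with SOTA performance reported to be on par with GPT-4o and Claude 3.5 Sonnet.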

🚀 Qwen2.5-VL Key Takeaways

One-line summary: "A general-purpose VLM that handles everything multimodal: images, documents, videos, and agents!"

1) Precise Object Localization (Grounding)

Recognition at the bounding-box and point level
Supports JSON output with absolute coordinates → enables precise spatial reasoning
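
To make the JSON-with-absolute-coordinates idea concrete, here is a minimal sketch of how such an output might be consumed. The `bbox_2d` and `label` keys, the sample string, and the `parse_boxes` helper are assumptions for illustration, not a confirmed output schema.

```python
import json

# Hypothetical model output. The "bbox_2d" / "label" keys are an assumed schema,
# and the coordinates are absolute pixels in the original image.
raw_output = """[
  {"bbox_2d": [120, 80, 360, 400], "label": "dog"},
  {"bbox_2d": [400, 150, 620, 390], "label": "cat"}
]"""


def parse_boxes(text, image_width, image_height):
    """Parse absolute-pixel boxes and keep only those that lie inside the image."""
    boxes = []
    for item in json.loads(text):
        x1, y1, x2, y2 = item["bbox_2d"]
        inside = 0 <= x1 < x2 <= image_width and 0 <= y1 < y2 <= image_height
        if inside:
            boxes.append({"label": item["label"], "box": (x1, y1, x2, y2)})
    return boxes


boxes = parse_boxes(raw_output, image_width=640, image_height=480)
for b in boxes:
    print(b["label"], b["box"])
```

Because the coordinates are absolute, no rescaling step is needed before drawing or cropping; the only check here is that each box fits inside the image.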

2) Document Parsing (Omni-Parsing)

Goes beyond OCR to handle multilingual text, formulas, chemical formulas, and even sheet music in a unified way
Learns HTML-based layout to understand the structure of the whole document
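
A rough sketch of what reading an HTML-based layout could look like downstream. The `data-bbox` attribute, the sample markup, and the `LayoutParser` class are assumptions made for this example; only Python's standard `html.parser` module is used.

```python
from html.parser import HTMLParser

# Hypothetical parsed-document output. The data-bbox attribute is an assumed
# convention for attaching a pixel box to each layout element.
sample_html = (
    "<html><body>"
    '<h1 data-bbox="40 30 560 70">Quarterly Report</h1>'
    '<p data-bbox="40 90 560 180">Revenue grew this quarter.</p>'
    "</body></html>"
)


class LayoutParser(HTMLParser):
    """Collect tag, box, and text for every element that carries data-bbox."""

    def __init__(self):
        super().__init__()
        self.elements = []
        self._current = None  # element currently receiving text

    def handle_starttag(self, tag, attrs):
        bbox = dict(attrs).get("data-bbox")
        if bbox is not None:
            box = tuple(int(v) for v in bbox.split())
            self._current = {"tag": tag, "bbox": box, "text": ""}
            self.elements.append(self._current)

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        # Flat sketch: nested elements with their own boxes are not handled.
        self._current = None


def parse_layout(html_text):
    parser = LayoutParser()
    parser.feed(html_text)
    parser.close()
    return parser.elements


for element in parse_layout(sample_html):
    print(element["tag"], element["bbox"], element["text"])
```

Keeping text and position in one markup is what lets a single output describe both content and layout.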

3) Ultra-Long Video Understanding

Dynamic FPS Sampling + Absolute Time Encoding (MRoPE)
Can extract events at second-level granularity from videos that run for several hours
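
The point of tying temporal positions to absolute time can be shown with a toy example: if the temporal ID depends on the timestamp rather than the frame index, a moment in the video gets the same ID at any sampling rate. The helper names and the `ids_per_second` parameter are hypothetical, and this is a simplification, not the model's actual encoding.

```python
def sample_timestamps(duration_s, fps):
    """Timestamps (in seconds) of frames sampled uniformly at the given rate."""
    n_frames = int(duration_s * fps)
    return [i / fps for i in range(n_frames)]


def temporal_ids(timestamps, ids_per_second=2):
    """Map each timestamp to a temporal position ID tied to absolute time."""
    return [round(t * ids_per_second) for t in timestamps]


slow = sample_timestamps(duration_s=6, fps=1)  # one frame per second
fast = sample_timestamps(duration_s=6, fps=2)  # two frames per second

slow_ids = dict(zip(slow, temporal_ids(slow)))
fast_ids = dict(zip(fast, temporal_ids(fast)))

# The frame at 4.0 s receives the same temporal ID under both sampling rates.
print(slow_ids[4.0], fast_ids[4.0])
```

With frame-index positions, the frame at 4.0 s would be index 4 at 1 FPS and index 8 at 2 FPS, so the model would have to relearn time for every sampling rate.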

4) Stronger Agent Capabilities

UI grounding and operation on PC and mobile
Handles real-world tasks through multi-step reasoning and function calls
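
As a sketch of the function-call pattern, the snippet below dispatches a list of model-emitted calls to handlers. The action names (`click`, `type`), the JSON shape, and the handlers are all hypothetical; a real agent would drive an actual device instead of appending to a log.

```python
import json


def click(state, x, y):
    """Simulated tap or click at absolute screen coordinates."""
    state["log"].append(f"click at ({x}, {y})")


def type_text(state, text):
    """Simulated keyboard input."""
    state["log"].append(f"type {text!r}")


# Hypothetical action registry; the names are illustrative only.
HANDLERS = {"click": click, "type": type_text}


def run_steps(calls_json):
    """Execute a JSON list of {"name": ..., "arguments": {...}} calls in order."""
    state = {"log": []}
    for call in json.loads(calls_json):
        handler = HANDLERS.get(call["name"])
        if handler is None:
            raise ValueError(f"unknown action: {call['name']}")
        handler(state, **call["arguments"])
    return state["log"]


steps = """[
  {"name": "click", "arguments": {"x": 312, "y": 88}},
  {"name": "type", "arguments": {"text": "weather"}}
]"""
print(run_steps(steps))
```

UI grounding supplies the coordinates, and multi-step reasoning decides which call to emit next.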

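🔍 Limitations of Prior Work and What Sets Qwen2.5-VL Apart

Prior VLMs: resolution constraints, limits on long videos, fragmented document parsing
Qwen2.5-VL:
- Window-attention ViT → keeps resolution while cutting compute cost
- Native Dynamic Resolution → processes inputs at their original size
- Absolute Time Encoding → consistent understanding of time regardless of FPS
- 4.1T-token pretraining → covers document, video, OCR, and agent data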
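
A back-of-the-envelope sketch of why window attention helps at native resolution: full self-attention scales with the square of the patch count, while windowed attention scales roughly linearly. The 14-pixel patch size and 8×8-patch window are illustrative assumptions, and the function names are hypothetical.

```python
import math


def patch_grid(height, width, patch=14):
    """Number of patch rows and columns for an image kept at its native size."""
    return math.ceil(height / patch), math.ceil(width / patch)


def full_attention_pairs(rows, cols):
    """Query-key pairs when every patch attends to every other patch."""
    n = rows * cols
    return n * n


def window_attention_pairs(rows, cols, window=8):
    """Query-key pairs when attention is restricted to non-overlapping windows."""
    total = 0
    for r0 in range(0, rows, window):
        for c0 in range(0, cols, window):
            h = min(window, rows - r0)
            w = min(window, cols - c0)
            total += (h * w) ** 2
    return total


for height, width in [(448, 448), (896, 448)]:
    rows, cols = patch_grid(height, width)
    full = full_attention_pairs(rows, cols)
    windowed = window_attention_pairs(rows, cols)
    print(f"{height}x{width}: {rows * cols} patches, full/windowed = {full / windowed:.0f}x")
```

Doubling the image area doubles the windowed cost but quadruples the full-attention cost, which is why the gap widens as resolution grows.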

🧱 Qwen2.5-VL Architecture

1) Vision Encoder (improved ViT)

Window Attention + 2D/3D RoPE
Keeps the original resolution + groups consecutive video frames

2) MLP-based Vision-Language Merger

Groups patch features → efficient input to the LLM
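
A minimal sketch of the grouping step only, assuming neighboring patches are merged in 2×2 blocks by concatenating their feature vectors (the block size and the `merge_patches` name are assumptions). In the model an MLP would then project each concatenated vector to the LLM's embedding size; that part is omitted here.

```python
def merge_patches(grid, group=2):
    """Concatenate features of each group x group block of neighboring patches.

    grid is a rows x cols nested list of feature vectors (lists of numbers).
    """
    rows, cols = len(grid), len(grid[0])
    if rows % group or cols % group:
        raise ValueError("grid size must be divisible by the group size")
    merged = []
    for r in range(0, rows, group):
        merged_row = []
        for c in range(0, cols, group):
            feature = []
            for dr in range(group):
                for dc in range(group):
                    feature.extend(grid[r + dr][c + dc])
            merged_row.append(feature)
        merged.append(merged_row)
    return merged


# Toy 4x4 patch grid whose 2-dimensional "feature" is just (row, col).
grid = [[[r, c] for c in range(4)] for r in range(4)]
merged = merge_patches(grid)
print(len(merged) * len(merged[0]), "tokens, each of dimension", len(merged[0][0]))
```

The sequence handed to the LLM shrinks by a factor of four while each token becomes four times wider, which is where the efficiency gain comes from.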

3) Qwen2.5 LM Decoder

Built on the Qwen2.5 LLM
Multimodal Rotary Position Embedding (MRoPE) → aligned to absolute time
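
A simplified sketch of how MRoPE-style position IDs could be laid out, assuming each token carries a (temporal, height, width) triple: text tokens use the same value in all three slots, while image patches share one temporal value and vary along height and width. The segment format and the `mrope_ids` function are hypothetical, and video handling (where the temporal component follows absolute time) is left out.

```python
def mrope_ids(segments):
    """Build (t, h, w) position IDs for a mixed text and image sequence.

    segments is a list of ("text", n_tokens) or ("image", rows, cols) tuples.
    """
    ids = []
    next_id = 0
    for segment in segments:
        if segment[0] == "text":
            for _ in range(segment[1]):
                ids.append((next_id, next_id, next_id))
                next_id += 1
        elif segment[0] == "image":
            _, rows, cols = segment
            for r in range(rows):
                for c in range(cols):
                    ids.append((next_id, next_id + r, next_id + c))
            # Following text resumes just after the largest ID used by the image.
            next_id += max(rows, cols)
        else:
            raise ValueError(f"unknown segment type: {segment[0]}")
    return ids


position_ids = mrope_ids([("text", 2), ("image", 2, 3), ("text", 1)])
for triple in position_ids:
    print(triple)
```

For pure text the three components coincide, so the scheme reduces to ordinary 1D rotary positions.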

🧪 Experimental Results

🎯 Multimodal Benchmarks

MMBench-EN: 88.6 (surpasses InternVL2.5 and Claude-3.5 Sonnet)
MMStar: 70.8 (best result)
RealWorldQA: 78.7 (strong adaptation to real-world scenarios)

🎯 OCR / Document Understanding

CC-OCR, OmniDocBench: achieves SOTA
OCRBench_v2: +9.6% (EN) and +20.6% (ZH) over Gemini 1.5 Pro

🎯 Grounding

Performance close to GroundingDINO across RefCOCO/+/g
ODinW-13 (open-vocab detection): 43.1 mAP
CountBench: 93.6 (detect-then-count approach)
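
For readers unfamiliar with how grounding results are scored, here is a small sketch of box IoU and a hit-rate at an IoU threshold, the kind of measure commonly used for referring-expression benchmarks. The 0.5 threshold and the function names are assumptions; this is not the paper's evaluation code.

```python
def box_iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def accuracy_at_iou(predictions, targets, threshold=0.5):
    """Fraction of predicted boxes whose IoU with the target meets the threshold."""
    hits = sum(box_iou(p, t) >= threshold for p, t in zip(predictions, targets))
    return hits / len(targets)


predictions = [(0, 0, 10, 10), (5, 0, 15, 10)]
targets = [(0, 0, 10, 10), (0, 0, 10, 10)]
print(accuracy_at_iou(predictions, targets))
```

The first prediction matches its target exactly, while the second overlaps only a third of the union and counts as a miss.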

🎯 Video Understanding

Charades-STA: mIoU 50.9 (surpasses GPT-4o)
LVBench, MLVU: best performance on long-video QA
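
The mIoU figure for temporal grounding averages the overlap between predicted and ground-truth time spans. A minimal sketch, with hypothetical function names and made-up segments (times in seconds):

```python
def temporal_iou(pred, gt):
    """IoU of two (start, end) time spans given in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def mean_iou(pairs):
    """Average temporal IoU over (prediction, ground truth) pairs."""
    return sum(temporal_iou(p, g) for p, g in pairs) / len(pairs)


pairs = [
    ((2.0, 8.0), (4.0, 10.0)),  # partial overlap
    ((0.0, 5.0), (0.0, 5.0)),   # exact match
]
print(mean_iou(pairs))
```

Multiplying the result by 100 gives a score on the same scale as the benchmark numbers above.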

🎯 Agent

ScreenSpot Pro: 43.6 (a large jump from Qwen2-VL's 1.6)
Android Control, MobileMiniWob++: surpasses GPT-4o and Gemini 2.0

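👀 Qualitative Comparison

Documents: not just plain text extraction; layout, tables, charts, and formulas are parsed structurally
Video: event grounding based on absolute time → can explain "what happened and when"
Agent: UI grounding on real devices + reasoning → enables automated operation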

🧪 Ablation Analysis

Without Absolute Time Encoding → video event alignment performance drops sharply
Without Window Attention → more computation and slower speed
Without Dynamic Resolution → unstable performance across input resolutions

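✅ Conclusion

Qwen2.5-VL is an all-in-one multimodal model:
- Covers the full range: images, documents, videos, and agents
- An open-source VLM that competes with GPT-4o and Claude 3.5
Key contributions:
1. Window Attention ViT
2. Dynamic Resolution + Absolute Time Encoding
3. Document Omni-Parsing
4. Long-Video + Agent support
→ Positioned to become a standard for next-generation multimodal AI! 🚀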

AI, Research

This post is licensed under CC BY 4.0 by the author.