
๐Ÿ“Understanding YOLO-World - ์‹ค์‹œ๊ฐ„ Open-Vocabulary Object Detection์˜ ํ˜์‹ !!!

๐Ÿ“Understanding YOLO-World - ์‹ค์‹œ๊ฐ„ Open-Vocabulary Object Detection์˜ ํ˜์‹ !!!

🧠 Understanding YOLO-World!!

๐Ÿ” YOLO finally enters the Zero-Shot world!!!


Paper: YOLO-World: Real-Time Open-Vocabulary Object Detection
Conference: CVPR 2024 (Tencent AI Lab, Cheng, Tianheng, et al.)
Code: AILab-CVC/YOLO-World


🔎 Key Summary

  • 💡 Open-Vocabulary capability added to YOLO - it can detect anything you name!!
  • ⚡ Real-time Zero-shot detection while keeping YOLO's inference speed!!!
  • ✅ Prompt-then-Detect - encode the prompts once and reuse the embeddings for speed (see the usage sketch below)!
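
Below is a minimal usage sketch of the prompt-then-detect idea, assuming the Ultralytics packaging of YOLO-World (the YOLOWorld class, the set_classes method, and the yolov8s-world.pt checkpoint name come from that package, not from the paper): the vocabulary is encoded once, then reused for every image.

# Sketch: prompt-then-detect with the Ultralytics YOLO-World wrapper (assumed API)
from ultralytics import YOLOWorld

model = YOLOWorld("yolov8s-world.pt")                    # pretrained open-vocabulary weights
model.set_classes(["person", "red hat", "guide dog"])    # encode the prompt vocabulary once

# The text embeddings are now cached, so detection runs at normal YOLO speed per image.
for path in ["street1.jpg", "street2.jpg"]:
    results = model.predict(path)
    results[0].show()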

🤔 Problems with Existing Research


1๏ธโƒฃ Fatal Limitations of Closed-Set Detectors

  • Fixed vocabulary: can only detect predefined classes, e.g. the 80 COCO categories or the 365 Objects365 categories 😔
  • Zero scalability: detecting new objects requires data collection + labeling + retraining 🔄
  • Lack of practicality: cannot cope with the unbounded variety of objects in real environments 🌍

2๏ธโƒฃ Heavy Reality of Open-Vocabulary Models

  • Massive backbones: GLIP, Grounding DINO, and similar models rely on large backbones such as Swin-L 💥
  • Slow inference: encoding text prompts at inference time makes them extremely slow 🐌
  • Deployment hell: nearly impossible to use on edge devices or in real-time applications 📱❌
  • Computational explosion: the dilemma of sacrificing practicality for high accuracy ⚖️

💡 Dilemma: "Either fast but limited (closed-set), or flexible (open-vocabulary) but extremely slow!"


๐Ÿ—๏ธ How Does It Work?


YOLO-World has a structure that organically connects image and text encoders:

1๏ธโƒฃ YOLO Detector (Image encoder, using YOLOv8)

Input Image → Darknet Backbone → Multi-scale Features {C3, C4, C5}
                    ↓
        Feature Pyramid Network (PAN) → {P3, P4, P5}
                    ↓
          Detection Head → Bounding Boxes + Object Embeddings
  • YOLOv8-based: builds on the proven real-time object detection architecture YOLOv8 ⚡
  • Multi-scale processing: handles all object sizes, from small to large (C3, C4, C5) 📏
  • Object Embeddings: represents each object as a D-dimensional vector (for text matching) 🔗

2๏ธโƒฃ Text Encoder (Language Understanding, using CLIP)

User Text → n-gram Noun Extraction → CLIP Text Encoder → Text Embeddings W
  • CLIP utilization: strong text understanding from vision-language pre-training 🧠
  • Noun phrase extraction: "person with red hat" → ["person", "red hat"] 🎯
  • Embedding conversion: maps the text into a D-dimensional vector space (see the sketch below) 📊
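
As a concrete reference, here is a small sketch of how noun phrases could be turned into L2-normalized text embeddings with an off-the-shelf CLIP text encoder from Hugging Face (the checkpoint name and the use of CLIPTextModelWithProjection are assumptions for illustration, not the paper's exact setup):

import torch
import torch.nn.functional as F
from transformers import CLIPTokenizer, CLIPTextModelWithProjection

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModelWithProjection.from_pretrained("openai/clip-vit-base-patch32")

phrases = ["person", "red hat"]                      # extracted noun phrases
inputs = tokenizer(phrases, padding=True, return_tensors="pt")
with torch.no_grad():
    W = text_encoder(**inputs).text_embeds           # (C, D) text embeddings
W = F.normalize(W, dim=-1)                           # L2-normalize for cosine matching
print(W.shape)                                       # e.g. torch.Size([2, 512])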

3๏ธโƒฃ RepVL-PAN (Vision-Language Fusion Engine) ๐Ÿ”ฅ

Image Features ←→ Cross-Modal Fusion ←→ Text Embeddings
       ↓                                      ↓
Text-guided CSPLayer              Image-Pooling Attention
       ↓                                      ↓
Enhanced Visual Features ←────── Enhanced Text Features

🎯 Text-guided CSPLayer (injecting text information into images)

  • Max-Sigmoid Attention: focuses on the image regions related to the text
  • Formula: X'ₗ = Xₗ · σ(max(Xₗ Wᵀ)) (sketched below)
  • Effect: given the text "cat", the model attends more strongly to cat regions in the image! 🐱
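
A minimal PyTorch sketch of the max-sigmoid re-weighting above (the flattened spatial layout and tensor shapes are my simplification; the real T-CSPLayer wraps this inside YOLOv8's C2f block):

import torch

def max_sigmoid_attention(X, W):
    """X: (B, HW, D) image features at one scale, W: (C, D) text embeddings.
    Implements X' = X * sigmoid(max over classes of X @ W^T)."""
    sim = X @ W.T                                 # (B, HW, C) region-to-text similarity
    attn = torch.sigmoid(sim.max(dim=-1).values)  # (B, HW) strongest text response per region
    return X * attn.unsqueeze(-1)                 # emphasize prompt-related regions

# toy check
X, W = torch.randn(1, 64, 256), torch.randn(3, 256)
print(max_sigmoid_attention(X, W).shape)          # torch.Size([1, 64, 256])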

๐Ÿ–ผ๏ธ Image-Pooling Attention (Injecting image information into text)

  • 27 patch tokens: the multi-scale image features are pooled into 3×3 regions per scale (3 scales × 9 = 27 tokens) for efficient processing
  • Multi-Head Attention: the text embeddings attend to these tokens and absorb the image's visual context (see the sketch below)
  • Effect: the "cat" text embedding now reflects the actual cat's appearance and color! 🎨
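
A rough sketch of image-pooling attention, assuming max pooling to 3×3 per pyramid level and a standard multi-head attention layer with the text embeddings as queries (layer sizes are illustrative):

import torch
import torch.nn as nn
import torch.nn.functional as F

D = 256
mha = nn.MultiheadAttention(embed_dim=D, num_heads=8, batch_first=True)

def image_pooling_attention(text_emb, pyramid_feats):
    """text_emb: (B, C, D) text embeddings; pyramid_feats: list of 3 maps, each (B, D, H, W)."""
    tokens = [F.adaptive_max_pool2d(f, 3).flatten(2).transpose(1, 2)   # (B, 9, D) per scale
              for f in pyramid_feats]
    tokens = torch.cat(tokens, dim=1)                                  # (B, 27, D) patch tokens
    updated, _ = mha(text_emb, tokens, tokens)                         # text attends to the image
    return text_emb + updated                                          # residual update of text

text = torch.randn(1, 2, D)
feats = [torch.randn(1, D, s, s) for s in (80, 40, 20)]
print(image_pooling_attention(text, feats).shape)                      # torch.Size([1, 2, 256])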

4๏ธโƒฃ Text Contrastive Head (Matching Engine)

Object Embedding eₖ ←→ Similarity Score ←→ Text Embedding wⱼ
                        ↓
              s_{k,j} = α·cos(eₖ, wⱼ) + β
                        ↓
                 Final Detection Result!
  • Contrastive learning: separates positive and negative pairs, in the spirit of InfoNCE ⚖️
  • L2 normalization: similarity depends on direction (meaning), not magnitude 🧭
  • Affine transformation: α (scaling) + β (shift) stabilize training (see the sketch below) 📈
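
The matching score above is easy to write down directly; here is a short sketch with α and β as placeholder scalars:

import torch
import torch.nn.functional as F

def contrastive_scores(obj_emb, txt_emb, alpha=2.0, beta=0.0):
    """obj_emb: (K, D) object embeddings, txt_emb: (C, D) text embeddings.
    Returns s[k, j] = alpha * cos(e_k, w_j) + beta."""
    e = F.normalize(obj_emb, dim=-1)   # direction only (meaning), magnitude removed
    w = F.normalize(txt_emb, dim=-1)
    return alpha * (e @ w.T) + beta    # (K, C) similarity logits

scores = contrastive_scores(torch.randn(5, 512), torch.randn(3, 512))
print(scores.shape)                    # torch.Size([5, 3])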

How Was the Data Collected?

YOLO-World was trained on three types of large-scale datasets! 🎯

๐Ÿ—‚๏ธ 3 Data Sources

| Data Type | Example | Features |
| --- | --- | --- |
| Detection Data | COCO, Objects365 | Accurate BBox + class labels ✅ |
| Grounding Data | Visual Genome | Natural-language descriptions + BBox 🔗 |
| Image-Text Data | CC3M | Image + caption (no BBox) ❌ |

🎭 Core Problem: The Image-Text Data Dilemma

Image-Text data: massive in scale, but... no BBox! 😱
"A red car driving on the highway" + 🖼️ = where's the BBox?

🤖 Brilliant Solution: 3-Step Pseudo Labeling

Step 1: Noun Phrase Extraction 🔍
# Extract object words using n-gram algorithm
caption = "A red car driving on the highway"
noun_phrases = extract_nouns(caption)
# Result: ["red car", "highway"]
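
The extract_nouns call above is pseudocode; a runnable approximation using spaCy noun chunks is sketched below (the paper uses a simple n-gram noun-phrase extraction, so spaCy is a stand-in here, and the en_core_web_sm model must be downloaded separately):

import spacy  # pip install spacy && python -m spacy download en_core_web_sm

nlp = spacy.load("en_core_web_sm")

def extract_nouns(caption: str) -> list[str]:
    """Approximate noun-phrase extraction via spaCy noun chunks."""
    return [chunk.text.lower() for chunk in nlp(caption).noun_chunks]

print(extract_nouns("A red car driving on the highway"))
# e.g. ['a red car', 'the highway']
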
Step 2: Pseudo Box Generation 📦
# Generate fake BBox using open-vocabulary models like GLIP
for phrase in noun_phrases:
    pseudo_boxes = GLIP_model.detect(image, phrase)
    
# Result: "red car" โ†’ [x1, y1, x2, y2] coordinates generated!
Step 3: Quality Verification & Filtering ✅
# Calculate relevance score using CLIP
relevance_score = CLIP.similarity(image_region, text_phrase)

if relevance_score > threshold:
    keep_annotation()  # Keep only high-quality ones
else:
    discard_annotation()  # Discard poor ones

# + Remove duplicate BBox with NMS
final_boxes = non_maximum_suppression(pseudo_boxes)
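
The CLIP.similarity call above is also pseudocode; the sketch below shows one way the region-text relevance score could be computed with a Hugging Face CLIP checkpoint (cropping the pseudo box and comparing image/text embeddings is my assumption of this step; the paper's scoring combines additional confidence terms):

import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def region_text_score(image: Image.Image, box, phrase: str) -> float:
    """Cosine similarity between a cropped pseudo-box region and its noun phrase."""
    region = image.crop(box)                     # box = (x1, y1, x2, y2)
    inputs = processor(text=[phrase], images=region, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = clip(**inputs)
    img = F.normalize(out.image_embeds, dim=-1)
    txt = F.normalize(out.text_embeds, dim=-1)
    return float((img @ txt.T).item())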

📊 Final Dataset Scale

From the CC3M dataset:
├── Sampled images: 246,000 📸
├── Generated pseudo labels: 821,000 🏷️
└── Average: 3.3 objects per image

🔥 Training Strategy: Region-Text Contrastive Loss

🎯 Step-by-Step Understanding of the Overall Training Process

Step 1: What the Model Predicts 📦

# What YOLO-World predicts from images:
predictions = {
    'boxes': [B1, B2, ..., BK],      # K bounding boxes
    'scores': [s1, s2, ..., sK],     # Confidence for each box
    'embeddings': [e1, e2, ..., eK]  # Feature vector for each object
}

# Actual ground truth data:
ground_truth = {
    'boxes': [B1_gt, B2_gt, ..., BN_gt],    # N ground truth boxes
    'texts': [t1, t2, ..., tN]              # Text label for each box
}

Step 2: Matching Predictions with Ground Truth 🔗

# Using Task-aligned Assignment
# "Which prediction box corresponds to which ground truth box?"

for prediction_k in predictions:
    best_match = find_best_groundtruth(prediction_k)
    if IoU(prediction_k, best_match) > threshold:
        positive_pairs.append((prediction_k, best_match))
        assign_text_label(prediction_k, best_match.text)

Step 3: Contrastive Loss Calculation ⚖️

# Calculate similarity between object embeddings and text embeddings
for object_embedding, text_embedding in positive_pairs:
    similarity = cosine_similarity(object_embedding, text_embedding)
    
    # Calculate Loss with Cross Entropy
    # Positive: High similarity with actual matching text
    # Negative: Low similarity with other texts
    contrastive_loss += cross_entropy(similarity, true_text_index)
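
Putting the pieces together, a compact sketch of the region-text contrastive term for one image might look like this (assigned_text_idx is assumed to come from the assignment in Step 2; α and β are as in the contrastive head):

import torch
import torch.nn.functional as F

def region_text_contrastive_loss(obj_emb, txt_emb, assigned_text_idx, alpha=2.0, beta=0.0):
    """obj_emb: (K, D) embeddings of positive predictions,
    txt_emb: (C, D) vocabulary text embeddings,
    assigned_text_idx: (K,) ground-truth text index for each positive box."""
    logits = alpha * (F.normalize(obj_emb, dim=-1) @ F.normalize(txt_emb, dim=-1).T) + beta
    return F.cross_entropy(logits, assigned_text_idx)  # pull matching text up, push others down

loss = region_text_contrastive_loss(torch.randn(4, 512), torch.randn(10, 512),
                                     torch.tensor([0, 3, 3, 7]))
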
Loss Function Composition
# Total Loss = Contrastive + Regression
total_loss = L_contrastive + ฮป * (L_IoU + L_focal)

# Role of each Loss:
# - L_contrastive: "Is this object a 'cat' or 'dog'?" (semantic learning)
# - L_IoU: "Is the bounding box location accurate?" (location learning)  
# - L_focal: "Is there an object or not?" (existence learning)

# ฮป (lambda) value training strategy:
# - Detection/Grounding data: ฮป = 1 (use all losses)
# - Image-Text data: ฮป = 0 (contrastive only)
Why λ = 0? 🤔
Situation 1: Detection data (COCO, Objects365)
├── Accurate BBox ✅ → location learning possible
├── Accurate labels ✅ → semantic learning possible
└── λ = 1, so all losses are used!

Situation 2: Image-Text data (CC3M + Pseudo Labels)
├── Inaccurate BBox ❌ → location learning would be harmful
├── Accurate text ✅ → semantic learning still possible
└── λ = 0, so only the contrastive loss is used!

Conclusion: "Learn location only from accurate data, learn semantics from all data!" 🎯
๐Ÿ” Real Training Examples
# Example: Cat image training

# Case 1: COCO data (accurate BBox)
image = "cat_photo.jpg"
ground_truth = {
    'box': [100, 50, 200, 150],  # Accurate coordinates
    'text': "cat"
}
→ λ = 1 to learn both location + semantics! ✅

# Case 2: CC3M data (Pseudo Box)  
image = "cat_photo.jpg"
pseudo_labels = {
    'box': [90, 45, 210, 160],   # Inaccurate coordinates made by GLIP
    'text': "cat"
}
→ λ = 0 to learn semantics only! (ignore location) ✅

🎨 Leveraging Mosaic Augmentation

Learning from multiple images combined into one:
┌─────────┬─────────┐
│ 🐱 cat  │ 🚗 car   │
├─────────┼─────────┤
│ 🐕 dog  │ 👤 person│
└─────────┴─────────┘
→ Learning 4 objects at once boosts efficiency! ⚡

💡 Core Idea of the Data Collection

"Accurate data is small but guarantees quality; massive data is labeled automatically and put to use!"

  1. Small, precise data: Detection + Grounding (accurate BBox)
  2. Large, automatic data: Image-Text → Pseudo Labeling (scale acquisition)
  3. Balanced learning: mix both types for optimal performance! 🎯

With this clever data strategy, YOLO-World achieved fast yet accurate Open-Vocabulary detection! 🚀


Experimental Results!! ✨

| Item | Description |
| --- | --- |
| Real-time Performance | Achieves 35.4 AP @ 52.0 FPS on LVIS (V100 GPU) |
| Prompt-then-Detect | Offline vocabulary embeddings remove the need for real-time text encoding |
| Zero-Shot Ability | Detects objects unseen during training from text prompts alone |
| Lightweight | 20× faster and 5× smaller than existing Open-Vocabulary models |

🎯 Key Technical Innovations

Core Components of RepVL-PAN

  • 🎯 Text-guided CSPLayer (T-CSPLayer)
    Adds text guidance to YOLOv8's C2f layers
    Focuses on text-related regions via Max-Sigmoid Attention

  • 🖼️ Image-Pooling Attention
    Compresses multi-scale image features into 27 patch tokens
    Enhances the text embeddings with visual context


📊 Performance Comparison

Zero-shot LVIS Benchmark

| Model | Backbone | FPS | AP | AP_r | AP_c | AP_f |
| --- | --- | --- | --- | --- | --- | --- |
| GLIP-T | Swin-T | 0.12 | 26.0 | 20.8 | 21.4 | 31.0 |
| Grounding DINO-T | Swin-T | 1.5 | 27.4 | 18.1 | 23.3 | 32.7 |
| DetCLIP-T | Swin-T | 2.3 | 34.4 | 26.9 | 33.9 | 36.3 |
| YOLO-World-L | YOLOv8-L | 52.0 | 35.4 | 27.6 | 34.1 | 38.0 |
  • High FPS! Extremely fast - it can process 52 images per second!
  • While still achieving the highest accuracy (AP) in the comparison!

โš ๏ธ Limitations

  • 🎭 Limited expression of complex interactions
    Simple text prompts may struggle to express complex relationships between objects

  • 📏 Resolution dependency
    Small-object detection may require high-resolution input

  • 💾 Memory usage
    The re-parameterization process adds extra memory overhead


✅ Summary

YOLO-World is a groundbreaking object detection model that simultaneously achieves real-time performance and Open-Vocabulary capabilities.

📌 YOLO's speed + CLIP's language understanding!
It solves the heavy, slow problems of existing Open-Vocabulary models
and offers a practical solution ready for immediate use in industrial settings!

With YOLO-World's emergence, Zero-shot Object Detection on edge devices has become a reality! 🎉



This post is licensed under CC BY 4.0 by the author.