Understanding YOLO-World – The Birth of Real-Time Open-Vocabulary Object Detection!!!
🧠 Understanding YOLO-World!!
🚀 YOLO finally enters the Zero-Shot world!!!
Paper: YOLO-World: Real-Time Open-Vocabulary Object Detection
Conference: CVPR 2024 (Tencent AI Lab; Cheng, Tianheng, et al.)
Code: AILab-CVC/YOLO-World
📌 Key Summary
- 💡 YOLO with Open-Vocabulary capabilities added – it can detect anything!!
- ⚡ Real-time zero-shot detection while maintaining inference speed!!!
- ✅ Prompt-then-Detect – encode prompts once and keep reusing the embeddings for speed!
🤔 Problems with Existing Research
1️⃣ Fatal Limitations of Closed-Set Detectors
- Fixed vocabulary: can only detect predefined objects, e.g., the 80 COCO or 365 Objects365 classes
- Zero scalability: new objects require data collection + labeling + retraining
- Lack of practicality: cannot handle the infinitely diverse objects of real environments
2️⃣ The Heavy Reality of Open-Vocabulary Models
- Massive backbones: GLIP, Grounding DINO, etc. rely on large backbones such as Swin-L
- Slow inference: encoding text prompts at inference time makes them extremely slow
- Deployment hell: nearly impossible to use on edge devices or in real-time applications
- Computational explosion: practicality is sacrificed for high accuracy
💡 The dilemma: "Either fast but limited (closed-set), or flexible (open-vocabulary) but extremely slow!"
🏗️ How Does It Work?
YOLO-World organically connects an image encoder and a text encoder:
1️⃣ YOLO Detector (Image Encoder, based on YOLOv8)
Input Image → Darknet Backbone → Multi-scale Features {C3, C4, C5}
        ↓
Feature Pyramid Network (PAN) → {P3, P4, P5}
        ↓
Detection Head → Bounding Boxes + Object Embeddings
- YOLOv8-based: builds on the proven real-time object detection architecture YOLOv8 ⚡
- Multi-scale processing: handles objects of all sizes, from small to large (C3, C4, C5)
- Object Embeddings: represents each detected object as a D-dimensional vector (for text matching)
2️⃣ Text Encoder (Language Understanding, using CLIP)
User Text → n-gram Noun Extraction → CLIP Text Encoder → Text Embeddings W
- CLIP utilization: strong text understanding from vision-language pre-training 🧠
- Noun phrase extraction: "person with red hat" → ["person", "red hat"] 🎯
- Embedding conversion: maps text into the same D-dimensional vector space (a minimal sketch follows below)
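To make this concrete, here is a minimal sketch of turning noun phrases into L2-normalized text embeddings with an off-the-shelf CLIP text encoder. The HuggingFace checkpoint and preprocessing here are illustrative assumptions, not the paper's exact pipeline:

import torch
from transformers import CLIPTokenizer, CLIPTextModelWithProjection

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_model = CLIPTextModelWithProjection.from_pretrained("openai/clip-vit-base-patch32")

phrases = ["person", "red hat"]                 # noun phrases extracted from the prompt
inputs = tokenizer(phrases, padding=True, return_tensors="pt")

with torch.no_grad():
    text_embeds = text_model(**inputs).text_embeds      # (2, 512)

# L2-normalize so that dot products become cosine similarities
W = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
print(W.shape)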
3️⃣ RepVL-PAN (Vision-Language Fusion Engine) 🔥
Image Features  ←→  Cross-Modal Fusion  ←→  Text Embeddings
       ↓                                         ↓
Text-guided CSPLayer                  Image-Pooling Attention
       ↓                                         ↓
Enhanced Visual Features   ←──────   Enhanced Text Features
🎯 Text-guided CSPLayer (injecting text information into image features)
- Max-Sigmoid Attention: focuses on image regions related to the text
- Formula:
X'_l = X_l · σ(max_j(X_l W_j^T))
- Effect: if the text contains "cat", the model attends more strongly to cat regions in the image! 🐱 (a PyTorch sketch follows below)
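For intuition, here is a minimal PyTorch sketch of the Max-Sigmoid attention above. Tensor shapes are assumptions for illustration; the real T-CSPLayer wires this into YOLOv8's C2f blocks:

import torch

B, HW, D = 2, 400, 256   # batch, flattened spatial positions, embedding dim
C = 3                    # number of text prompts / classes

X = torch.randn(B, HW, D)   # image features X_l (flattened H*W positions)
W = torch.randn(C, D)       # text embeddings, one per prompt

# Similarity of every position to every text embedding: (B, HW, C)
sim = X @ W.t()

# Max over the text dimension, squashed to (0, 1): (B, HW, 1)
attn = torch.sigmoid(sim.max(dim=-1).values).unsqueeze(-1)

# Re-weight the visual features so text-relevant regions are emphasized
X_enhanced = X * attn
print(X_enhanced.shape)  # torch.Size([2, 400, 256])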
🖼️ Image-Pooling Attention (injecting image information into text)
- 27 patch tokens: multi-scale features are pooled into 3×3 regions per scale (3 scales × 9 = 27 tokens) for efficient processing
- Multi-Head Attention: the text embeddings attend to the visual context of the image
- Effect: the "cat" embedding absorbs the actual cat's appearance and color information! 🎨 (a sketch follows below)
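Likewise, a minimal sketch of Image-Pooling Attention. The pooling operator, layer sizes, and the residual connection are assumptions for illustration; the official implementation may differ in details:

import torch
import torch.nn as nn
import torch.nn.functional as F

D = 256
# Three pyramid levels of image features (B, D, H, W) at different resolutions
feats = [torch.randn(2, D, 80, 80), torch.randn(2, D, 40, 40), torch.randn(2, D, 20, 20)]
text = torch.randn(2, 5, D)  # (B, num_prompts, D) text embeddings

# Pool each scale to 3x3 -> 9 tokens, giving 3 * 9 = 27 patch tokens in total
tokens = torch.cat(
    [F.adaptive_max_pool2d(f, 3).flatten(2).transpose(1, 2) for f in feats], dim=1
)  # (B, 27, D)

# Text embeddings attend to the pooled image tokens (query = text, key/value = image)
mha = nn.MultiheadAttention(embed_dim=D, num_heads=8, batch_first=True)
ctx, _ = mha(query=text, key=tokens, value=tokens)
text_enhanced = text + ctx  # visually-informed text embeddings
print(text_enhanced.shape)  # torch.Size([2, 5, 256])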
4️⃣ Text Contrastive Head (Matching Engine)
Object Embedding e_k  ←→  Similarity Score  ←→  Text Embedding w_j
        ↓
s_{k,j} = α·cos(e_k, w_j) + β
        ↓
Final Detection Result!
- Contrastive Learning: separates positive/negative pairs, in the spirit of InfoNCE ⚖️
- L2 Normalization: similarity is computed by direction (meaning), not magnitude 🧭
- Affine Transformation: α (scaling) + β (shift) stabilize training (a scoring sketch follows below)
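A minimal sketch of how this head turns embeddings into scores. The α and β values are made-up examples for illustration; in the model they are learned parameters:

import torch
import torch.nn.functional as F

K, C, D = 8, 3, 256                 # K object queries, C text prompts, D-dim embeddings
obj = torch.randn(K, D)             # object embeddings e_k from the detection head
txt = torch.randn(C, D)             # text embeddings w_j from the CLIP text encoder
alpha, beta = 2.0, -0.5             # learnable affine parameters in the real model

# L2-normalize so the dot product becomes cosine similarity (direction, not magnitude)
obj_n = F.normalize(obj, dim=-1)
txt_n = F.normalize(txt, dim=-1)

# s[k, j] = alpha * cos(e_k, w_j) + beta
scores = alpha * (obj_n @ txt_n.t()) + beta   # (K, C)
labels = scores.argmax(dim=-1)                # best-matching prompt per object
print(scores.shape, labels)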
How Was the Data Collected?
YOLO-World is trained on three types of large-scale data! 🎯
🗂️ 3 Data Sources
Data Type | Example | Features |
---|---|---|
Detection Data | COCO, Objects365 | Accurate BBoxes + class labels ✅ |
Grounding Data | Visual Genome | Natural-language descriptions + BBoxes 📝 |
Image-Text Data | CC3M | Image + caption (no BBox) ❌ |
🎭 Core Problem: The Image-Text Data Dilemma
Image-Text data: massive, but... no BBox! 😱
"A red car driving on the highway" + 🖼️ = where's the BBox?
🤖 A Brilliant Solution: 3-Step Pseudo-Labeling
Step 1: Noun Phrase Extraction 🔍
# Extract object words with an n-gram algorithm (extract_nouns is illustrative pseudocode)
caption = "A red car driving on the highway"
noun_phrases = extract_nouns(caption)
# Result: ["red car", "highway"]
Step 2: Pseudo Box Generation 📦
# Generate pseudo BBoxes with an open-vocabulary model such as GLIP (GLIP_model is illustrative pseudocode)
pseudo_boxes = []
for phrase in noun_phrases:
    pseudo_boxes += GLIP_model.detect(image, phrase)
# Result: "red car" → [x1, y1, x2, y2] coordinates generated!
Step 3: Quality Verification & Filtering ✅
# Score each pseudo box's relevance to its phrase with CLIP (illustrative pseudocode)
relevance_score = CLIP.similarity(image_region, text_phrase)
if relevance_score > threshold:
    keep_annotation()      # keep only high-quality pairs
else:
    discard_annotation()   # discard poor ones
# + remove duplicate BBoxes with NMS
final_boxes = non_maximum_suppression(pseudo_boxes)
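As a rough, runnable illustration of the CLIP-based filtering step (the checkpoint, the synthetic stand-in image, and the 0.25 threshold are assumptions, not the paper's exact settings):

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.new("RGB", (640, 480), "gray")   # stand-in for a real CC3M image
phrase = "red car"
box = (40, 60, 220, 180)                       # a pseudo box (x1, y1, x2, y2) from GLIP
region = image.crop(box)

inputs = processor(text=[phrase], images=region, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)

img_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
txt_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
relevance = (img_emb @ txt_emb.t()).item()

keep = relevance > 0.25                        # made-up example threshold
print(f"relevance={relevance:.3f}, keep={keep}")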
📊 Final Dataset Scale
From the CC3M dataset:
├── Sampled images: 246,000 📸
├── Generated pseudo labels: 821,000 🏷️
└── Average: 3.3 objects per image
🔥 Training Strategy: Region-Text Contrastive Loss
🎯 Step-by-Step View of the Overall Training Process
Step 1: What the Model Predicts 📦
# What YOLO-World predicts from an image:
predictions = {
    'boxes':      [B1, B2, ..., BK],   # K bounding boxes
    'scores':     [s1, s2, ..., sK],   # confidence for each box
    'embeddings': [e1, e2, ..., eK],   # feature vector for each object
}
# The ground-truth annotations:
ground_truth = {
    'boxes': [B1_gt, B2_gt, ..., BN_gt],   # N ground-truth boxes
    'texts': [t1, t2, ..., tN],            # text label for each box
}
Step 2: Matching Predictions with Ground Truth 🔗
# Task-aligned assignment (simplified pseudocode)
# "Which predicted box corresponds to which ground-truth box?"
positive_pairs = []
for prediction_k in predictions:
    best_match = find_best_groundtruth(prediction_k)
    if IoU(prediction_k, best_match) > threshold:
        positive_pairs.append((prediction_k, best_match))
        assign_text_label(prediction_k, best_match.text)
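A reduced, runnable sketch of the matching step using plain IoU. The real task-aligned assigner also factors in the classification score; this only shows the geometric part, with made-up boxes and an example threshold:

import torch
from torchvision.ops import box_iou

pred_boxes = torch.tensor([[10., 10., 60., 60.], [100., 100., 180., 170.]])
gt_boxes   = torch.tensor([[12.,  8., 58., 62.], [300., 300., 340., 350.]])

iou = box_iou(pred_boxes, gt_boxes)       # (num_pred, num_gt) IoU matrix
best_iou, best_gt = iou.max(dim=1)        # best ground truth for each prediction
positive = best_iou > 0.5                 # example threshold
print(best_gt[positive])                  # indices of the matched ground-truth boxes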
Step 3: Contrastive Loss Calculation ⚖️
# Compare object embeddings to text embeddings (simplified pseudocode)
contrastive_loss = 0
for object_embedding, text_embedding in positive_pairs:
    similarity = cosine_similarity(object_embedding, text_embedding)
    # Cross-entropy loss:
    #   positive: high similarity with the matched text
    #   negative: low similarity with all other texts
    contrastive_loss += cross_entropy(similarity, true_text_index)
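To make the loop above concrete, here is a runnable sketch of the region-text contrastive loss. The shapes and the α/β values are assumptions; actual training also applies the assignment weights:

import torch
import torch.nn.functional as F

K, C, D = 8, 4, 256                             # K matched objects, C vocabulary texts
obj = F.normalize(torch.randn(K, D), dim=-1)    # object embeddings e_k
txt = F.normalize(torch.randn(C, D), dim=-1)    # text embeddings w_j
target = torch.randint(0, C, (K,))              # index of the matched text for each object
alpha, beta = 2.0, -0.5                         # learned affine parameters in the real head

logits = alpha * (obj @ txt.t()) + beta         # (K, C) similarity scores s_{k,j}
contrastive_loss = F.cross_entropy(logits, target)
print(contrastive_loss.item())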
Loss Function Composition
# Total loss = contrastive + regression
total_loss = L_contrastive + lam * (L_IoU + L_focal)   # lam = λ
# Role of each loss term:
# - L_contrastive: "Is this object a 'cat' or a 'dog'?"     (semantic learning)
# - L_IoU:         "Is the bounding-box location accurate?" (localization learning)
# - L_focal:       "Is there an object here or not?"        (objectness learning)
# λ depends on the data source:
# - Detection / Grounding data: λ = 1 (use all losses)
# - Image-Text data:            λ = 0 (contrastive only)
Why λ = 0? 🤔
Situation 1: Detection data (COCO, Objects365)
├── Accurate BBoxes ✅ → localization learning possible
├── Accurate labels ✅ → semantic learning possible
└── λ = 1: use all losses!
Situation 2: Image-Text data (CC3M + pseudo labels)
├── Inaccurate BBoxes ❌ → localization learning would be harmful
├── Accurate text ✅ → semantic learning possible
└── λ = 0: use only the contrastive loss!
Conclusion: "Learn location only from accurate data; learn semantics from all data!" 🎯
📚 A Concrete Training Example
# Example: training on a cat image
# Case 1: COCO data (accurate BBox)
image = "cat_photo.jpg"
ground_truth = {
    'box': [100, 50, 200, 150],   # accurate coordinates
    'text': "cat",
}
# → λ = 1: learn both location and semantics! ✅

# Case 2: CC3M data (pseudo box)
image = "cat_photo.jpg"
pseudo_labels = {
    'box': [90, 45, 210, 160],    # imprecise coordinates produced by GLIP
    'text': "cat",
}
# → λ = 0: learn semantics only (ignore location)! ✅
🎨 Mosaic Augmentation
Learning from several images combined into one:
┌─────────┬─────────┐
│ 🐱 cat  │ 🚗 car  │
├─────────┼─────────┤
│ 🐕 dog  │ 👤 person│
└─────────┴─────────┘
→ Four objects learned at once for better efficiency! ⚡
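A tiny NumPy sketch of the 2×2 mosaic idea. The image sizes are assumptions; real pipelines also rescale the images and shift each image's box coordinates into the combined canvas:

import numpy as np

def mosaic_2x2(imgs):
    """Stitch four equally sized HxWx3 images into one 2Hx2W mosaic."""
    top = np.concatenate([imgs[0], imgs[1]], axis=1)
    bottom = np.concatenate([imgs[2], imgs[3]], axis=1)
    return np.concatenate([top, bottom], axis=0)

# Four random stand-in images (e.g., cat / car / dog / person crops)
imgs = [np.random.randint(0, 255, (320, 320, 3), dtype=np.uint8) for _ in range(4)]
mosaic = mosaic_2x2(imgs)
print(mosaic.shape)  # (640, 640, 3)
# Boxes from image i must be shifted by 0 or 320 in x/y to land in the right quadrant.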
💡 Core Idea of the Data Collection
"Accurate data is small but high quality; massive data is labeled automatically and put to use!"
- Small, precise data: Detection + Grounding (accurate BBoxes)
- Large, automatic data: Image-Text → pseudo-labeling (for scale)
- Balanced learning: mixing both types gives the best performance! 🎯
With this clever data strategy, YOLO-World achieves fast yet accurate open-vocabulary detection! 🚀
Experimental Results!! ✨
Item | Description |
---|---|
Real-time Performance | 35.4 AP at 52.0 FPS on LVIS (V100 GPU) |
Prompt-then-Detect | Offline vocabulary embeddings remove the need for text encoding at inference time |
Zero-Shot Ability | Detects objects unseen during training from text prompts alone |
Lightweight | About 20× faster and 5× smaller than existing Open-Vocabulary models |
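If you just want to try the prompt-then-detect workflow, the Ultralytics integration exposes it conveniently. This sketch assumes the ultralytics package with YOLO-World support is installed; weight file names and APIs may differ by version, and this is not the official AILab-CVC codebase:

from ultralytics import YOLOWorld

# Load a pretrained YOLO-World model (weights are fetched automatically)
model = YOLOWorld("yolov8s-world.pt")

# "Prompt" step: set the vocabulary once; the text encoder runs here, offline
model.set_classes(["person", "red hat", "dog"])

# "Detect" step: every subsequent prediction reuses the cached text embeddings
results = model.predict("street.jpg")   # path to your own image
results[0].show()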
🎯 Key Technical Innovations
Core Components of RepVL-PAN
🎯 Text-guided CSPLayer (T-CSPLayer)
- Adds text guidance to YOLOv8's C2f layers
- Focuses on text-related regions with Max-Sigmoid Attention
🖼️ Image-Pooling Attention
- Compresses multi-scale image features into 27 patch tokens
- Enhances text embeddings with visual context
📊 Performance Comparison
Zero-shot LVIS Benchmark
Model | Backbone | FPS | AP | AP_r | AP_c | AP_f |
---|---|---|---|---|---|---|
GLIP-T | Swin-T | 0.12 | 26.0 | 20.8 | 21.4 | 31.0 |
Grounding DINO-T | Swin-T | 1.5 | 27.4 | 18.1 | 23.3 | 32.7 |
DetCLIP-T | Swin-T | 2.3 | 34.4 | 26.9 | 33.9 | 36.3 |
YOLO-World-L | YOLOv8-L | 52.0 | 35.4 | 27.6 | 34.1 | 38.0 |
- High FPS! Extremely fast - can process 52 images per second!
- While maintaining high accuracy (AP)!
⚠️ Limitations
- 🎭 Complex interactions: simple text prompts may struggle to express complex relationships between objects
- 🔍 Resolution dependency: small-object detection may require high-resolution input
- 💾 Memory usage: the re-parameterization process adds extra memory overhead
✅ Summary
YOLO-World is a groundbreaking object detection model that achieves real-time performance and open-vocabulary capability at the same time.
🚀 YOLO's speed + CLIP's language understanding!
It solves the heavy, slow-inference problems of existing open-vocabulary models and offers a practical solution ready for immediate use in industrial settings.
With YOLO-World, zero-shot object detection on edge devices has become a reality! 🚀