
๐Ÿ“Understanding YOLO-World - ์‹ค์‹œ๊ฐ„ Open-Vocabulary Object Detection์˜ ํ˜์‹ !!!

๐Ÿ“Understanding YOLO-World - ์‹ค์‹œ๊ฐ„ Open-Vocabulary Object Detection์˜ ํ˜์‹ !!!

🧠 Understanding YOLO-World!!

๐Ÿ” YOLO finally enters the Zero-Shot world!!!


Paper: YOLO-World: Real-Time Open-Vocabulary Object Detection
Conference: CVPR 2024 (Tencent AI Lab, Cheng, Tianheng, et al.)
Code: AILab-CVC/YOLO-World


🔎 Key Summary

  • 💡 Open-Vocabulary capability added to YOLO - it can detect anything you name!!
  • ⚡ Real-time Zero-shot detection while keeping YOLO's inference speed!!!
  • ✅ Prompt-then-Detect - encode the prompts once and reuse the embeddings for speed (see the usage sketch below)!
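
Below is a minimal usage sketch of the prompt-then-detect idea, assuming the Ultralytics packaging of YOLO-World (the YOLOWorld class, the set_classes method, and the yolov8s-world.pt checkpoint name come from that package, not from the paper): the vocabulary is encoded once, then reused for every image.

# Sketch: prompt-then-detect with the Ultralytics YOLO-World wrapper (assumed API)
from ultralytics import YOLOWorld

model = YOLOWorld("yolov8s-world.pt")                    # pretrained open-vocabulary weights
model.set_classes(["person", "red hat", "guide dog"])    # encode the prompt vocabulary once

# The text embeddings are now cached, so detection runs at normal YOLO speed per image.
for path in ["street1.jpg", "street2.jpg"]:
    results = model.predict(path)
    results[0].show()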

🤔 Problems with Existing Research


1๏ธโƒฃ Fatal Limitations of Closed-Set Detectors

  • Fixed vocabulary: can only detect predefined classes, e.g. the 80 COCO categories or the 365 Objects365 categories 😔
  • Zero scalability: detecting new objects requires data collection + labeling + retraining 🔄
  • Lack of practicality: cannot cope with the unbounded variety of objects in real environments 🌍

2๏ธโƒฃ Heavy Reality of Open-Vocabulary Models

  • Massive backbones: GLIP, Grounding DINO, and similar models rely on large backbones such as Swin-L 💥
  • Slow inference: encoding text prompts at inference time makes them extremely slow 🐌
  • Deployment hell: nearly impossible to use on edge devices or in real-time applications 📱❌
  • Computational explosion: the dilemma of sacrificing practicality for high accuracy ⚖️

💡 Dilemma: "Either fast but limited (closed-set), or flexible (open-vocabulary) but extremely slow!"


๐Ÿ—๏ธ How Does It Work?


YOLO-World has a structure that organically connects image and text encoders:

1๏ธโƒฃ YOLO Detector (Image encoder, using YOLOv8)

Input Image → Darknet Backbone → Multi-scale Features {C3, C4, C5}
                    ↓
        Feature Pyramid Network (PAN) → {P3, P4, P5}
                    ↓
          Detection Head → Bounding Boxes + Object Embeddings
  • YOLOv8-based: builds on the proven real-time object detection architecture YOLOv8 ⚡
  • Multi-scale processing: handles all object sizes, from small to large (C3, C4, C5) 📏
  • Object Embeddings: represents each object as a D-dimensional vector (for text matching) 🔗

2๏ธโƒฃ Text Encoder (Language Understanding, using CLIP)

User Text → n-gram Noun Extraction → CLIP Text Encoder → Text Embeddings W
  • CLIP utilization: strong text understanding from vision-language pre-training 🧠
  • Noun phrase extraction: "person with red hat" → ["person", "red hat"] 🎯
  • Embedding conversion: maps the text into a D-dimensional vector space (see the sketch below) 📊
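
As a concrete reference, here is a small sketch of how noun phrases could be turned into L2-normalized text embeddings with an off-the-shelf CLIP text encoder from Hugging Face (the checkpoint name and the use of CLIPTextModelWithProjection are assumptions for illustration, not the paper's exact setup):

import torch
import torch.nn.functional as F
from transformers import CLIPTokenizer, CLIPTextModelWithProjection

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModelWithProjection.from_pretrained("openai/clip-vit-base-patch32")

phrases = ["person", "red hat"]                      # extracted noun phrases
inputs = tokenizer(phrases, padding=True, return_tensors="pt")
with torch.no_grad():
    W = text_encoder(**inputs).text_embeds           # (C, D) text embeddings
W = F.normalize(W, dim=-1)                           # L2-normalize for cosine matching
print(W.shape)                                       # e.g. torch.Size([2, 512])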

3๏ธโƒฃ RepVL-PAN (Vision-Language Fusion Engine) ๐Ÿ”ฅ

Image Features ←→ Cross-Modal Fusion ←→ Text Embeddings
       ↓                                      ↓
Text-guided CSPLayer              Image-Pooling Attention
       ↓                                      ↓
Enhanced Visual Features ←────── Enhanced Text Features

🎯 Text-guided CSPLayer (injecting text information into images)

  • Max-Sigmoid Attention: focuses on the image regions related to the text
  • Formula: X'ₗ = Xₗ · σ(max(Xₗ Wᵀ)) (sketched below)
  • Effect: given the text "cat", the model attends more strongly to cat regions in the image! 🐱
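
A minimal PyTorch sketch of the max-sigmoid re-weighting above (the flattened spatial layout and tensor shapes are my simplification; the real T-CSPLayer wraps this inside YOLOv8's C2f block):

import torch

def max_sigmoid_attention(X, W):
    """X: (B, HW, D) image features at one scale, W: (C, D) text embeddings.
    Implements X' = X * sigmoid(max over classes of X @ W^T)."""
    sim = X @ W.T                                 # (B, HW, C) region-to-text similarity
    attn = torch.sigmoid(sim.max(dim=-1).values)  # (B, HW) strongest text response per region
    return X * attn.unsqueeze(-1)                 # emphasize prompt-related regions

# toy check
X, W = torch.randn(1, 64, 256), torch.randn(3, 256)
print(max_sigmoid_attention(X, W).shape)          # torch.Size([1, 64, 256])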

๐Ÿ–ผ๏ธ Image-Pooling Attention (Injecting image information into text)

  • 27 patch tokens: the multi-scale image features are pooled into 3×3 regions per scale (3 scales × 9 = 27 tokens) for efficient processing
  • Multi-Head Attention: the text embeddings attend to these tokens and absorb the image's visual context (see the sketch below)
  • Effect: the "cat" text embedding now reflects the actual cat's appearance and color! 🎨
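
A rough sketch of image-pooling attention, assuming max pooling to 3×3 per pyramid level and a standard multi-head attention layer with the text embeddings as queries (layer sizes are illustrative):

import torch
import torch.nn as nn
import torch.nn.functional as F

D = 256
mha = nn.MultiheadAttention(embed_dim=D, num_heads=8, batch_first=True)

def image_pooling_attention(text_emb, pyramid_feats):
    """text_emb: (B, C, D) text embeddings; pyramid_feats: list of 3 maps, each (B, D, H, W)."""
    tokens = [F.adaptive_max_pool2d(f, 3).flatten(2).transpose(1, 2)   # (B, 9, D) per scale
              for f in pyramid_feats]
    tokens = torch.cat(tokens, dim=1)                                  # (B, 27, D) patch tokens
    updated, _ = mha(text_emb, tokens, tokens)                         # text attends to the image
    return text_emb + updated                                          # residual update of text

text = torch.randn(1, 2, D)
feats = [torch.randn(1, D, s, s) for s in (80, 40, 20)]
print(image_pooling_attention(text, feats).shape)                      # torch.Size([1, 2, 256])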

4๏ธโƒฃ Text Contrastive Head (Matching Engine)

Object Embedding eₖ ←→ Similarity Score ←→ Text Embedding wⱼ
                        ↓
              s_{k,j} = α·cos(eₖ, wⱼ) + β
                        ↓
                 Final Detection Result!
  • Contrastive learning: separates positive and negative pairs, in the spirit of InfoNCE ⚖️
  • L2 normalization: similarity depends on direction (meaning), not magnitude 🧭
  • Affine transformation: α (scaling) + β (shift) stabilize training (see the sketch below) 📈
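
The matching score above is easy to write down directly; here is a short sketch with α and β as placeholder scalars:

import torch
import torch.nn.functional as F

def contrastive_scores(obj_emb, txt_emb, alpha=2.0, beta=0.0):
    """obj_emb: (K, D) object embeddings, txt_emb: (C, D) text embeddings.
    Returns s[k, j] = alpha * cos(e_k, w_j) + beta."""
    e = F.normalize(obj_emb, dim=-1)   # direction only (meaning), magnitude removed
    w = F.normalize(txt_emb, dim=-1)
    return alpha * (e @ w.T) + beta    # (K, C) similarity logits

scores = contrastive_scores(torch.randn(5, 512), torch.randn(3, 512))
print(scores.shape)                    # torch.Size([5, 3])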

How Was the Data Collected?

YOLO-World was trained on three types of large-scale datasets! 🎯

๐Ÿ—‚๏ธ 3 Data Sources

| Data Type | Example | Features |
| --- | --- | --- |
| Detection Data | COCO, Objects365 | Accurate BBox + class labels ✅ |
| Grounding Data | Visual Genome | Natural-language descriptions + BBox 🔗 |
| Image-Text Data | CC3M | Image + caption (no BBox) ❌ |

🎭 Core Problem: The Image-Text Data Dilemma

Image-Text data: massive in scale, but... no BBox! 😱
"A red car driving on the highway" + 🖼️ = where's the BBox?

🤖 Brilliant Solution: 3-Step Pseudo Labeling

Step 1: Noun Phrase Extraction 🔍
# Extract object words using n-gram algorithm
caption = "A red car driving on the highway"
noun_phrases = extract_nouns(caption)
# Result: ["red car", "highway"]
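
The extract_nouns call above is pseudocode; a runnable approximation using spaCy noun chunks is sketched below (the paper uses a simple n-gram noun-phrase extraction, so spaCy is a stand-in here, and the en_core_web_sm model must be downloaded separately):

import spacy  # pip install spacy && python -m spacy download en_core_web_sm

nlp = spacy.load("en_core_web_sm")

def extract_nouns(caption: str) -> list[str]:
    """Approximate noun-phrase extraction via spaCy noun chunks."""
    return [chunk.text.lower() for chunk in nlp(caption).noun_chunks]

print(extract_nouns("A red car driving on the highway"))
# e.g. ['a red car', 'the highway']
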
Step 2: Pseudo Box Generation 📦
# Generate fake BBox using open-vocabulary models like GLIP
for phrase in noun_phrases:
    pseudo_boxes = GLIP_model.detect(image, phrase)
    
# Result: "red car" โ†’ [x1, y1, x2, y2] coordinates generated!
Step 3: Quality Verification & Filtering ✅
# Calculate relevance score using CLIP
relevance_score = CLIP.similarity(image_region, text_phrase)

if relevance_score > threshold:
    keep_annotation()  # Keep only high-quality ones
else:
    discard_annotation()  # Discard poor ones

# + Remove duplicate BBox with NMS
final_boxes = non_maximum_suppression(pseudo_boxes)
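
The CLIP.similarity call above is also pseudocode; the sketch below shows one way the region-text relevance score could be computed with a Hugging Face CLIP checkpoint (cropping the pseudo box and comparing image/text embeddings is my assumption of this step; the paper's scoring combines additional confidence terms):

import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def region_text_score(image: Image.Image, box, phrase: str) -> float:
    """Cosine similarity between a cropped pseudo-box region and its noun phrase."""
    region = image.crop(box)                     # box = (x1, y1, x2, y2)
    inputs = processor(text=[phrase], images=region, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = clip(**inputs)
    img = F.normalize(out.image_embeds, dim=-1)
    txt = F.normalize(out.text_embeds, dim=-1)
    return float((img @ txt.T).item())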

📊 Final Dataset Scale

From the CC3M dataset:
├── Sampled images: 246,000 📸
├── Generated pseudo labels: 821,000 🏷️
└── Average: 3.3 objects per image

🔥 Training Strategy: Region-Text Contrastive Loss

🎯 Step-by-Step Understanding of the Overall Training Process

Step 1: What the Model Predicts 📦

# What YOLO-World predicts from images:
predictions = {
    'boxes': [B1, B2, ..., BK],      # K bounding boxes
    'scores': [s1, s2, ..., sK],     # Confidence for each box
    'embeddings': [e1, e2, ..., eK]  # Feature vector for each object
}

# Actual ground truth data:
ground_truth = {
    'boxes': [B1_gt, B2_gt, ..., BN_gt],    # N ground truth boxes
    'texts': [t1, t2, ..., tN]              # Text label for each box
}

Step 2: Matching Predictions with Ground Truth 🔗

# Using Task-aligned Assignment
# "Which prediction box corresponds to which ground truth box?"

for prediction_k in predictions:
    best_match = find_best_groundtruth(prediction_k)
    if IoU(prediction_k, best_match) > threshold:
        positive_pairs.append((prediction_k, best_match))
        assign_text_label(prediction_k, best_match.text)

Step 3: Contrastive Loss Calculation ⚖️

# Calculate similarity between object embeddings and text embeddings
for object_embedding, text_embedding in positive_pairs:
    similarity = cosine_similarity(object_embedding, text_embedding)
    
    # Calculate Loss with Cross Entropy
    # Positive: High similarity with actual matching text
    # Negative: Low similarity with other texts
    contrastive_loss += cross_entropy(similarity, true_text_index)
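
Putting the pieces together, a compact sketch of the region-text contrastive term for one image might look like this (assigned_text_idx is assumed to come from the assignment in Step 2; α and β are as in the contrastive head):

import torch
import torch.nn.functional as F

def region_text_contrastive_loss(obj_emb, txt_emb, assigned_text_idx, alpha=2.0, beta=0.0):
    """obj_emb: (K, D) embeddings of positive predictions,
    txt_emb: (C, D) vocabulary text embeddings,
    assigned_text_idx: (K,) ground-truth text index for each positive box."""
    logits = alpha * (F.normalize(obj_emb, dim=-1) @ F.normalize(txt_emb, dim=-1).T) + beta
    return F.cross_entropy(logits, assigned_text_idx)  # pull matching text up, push others down

loss = region_text_contrastive_loss(torch.randn(4, 512), torch.randn(10, 512),
                                     torch.tensor([0, 3, 3, 7]))
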
Loss Function Composition
# Total Loss = Contrastive + Regression
total_loss = L_contrastive + ฮป * (L_IoU + L_focal)

# Role of each Loss:
# - L_contrastive: "Is this object a 'cat' or 'dog'?" (semantic learning)
# - L_IoU: "Is the bounding box location accurate?" (location learning)  
# - L_focal: "Is there an object or not?" (existence learning)

# ฮป (lambda) value training strategy:
# - Detection/Grounding data: ฮป = 1 (use all losses)
# - Image-Text data: ฮป = 0 (contrastive only)
Why λ = 0? 🤔
Situation 1: Detection data (COCO, Objects365)
├── Accurate BBox ✅ → location learning possible
├── Accurate labels ✅ → semantic learning possible
└── λ = 1, so all losses are used!

Situation 2: Image-Text data (CC3M + Pseudo Labels)
├── Inaccurate BBox ❌ → location learning would be harmful
├── Accurate text ✅ → semantic learning still possible
└── λ = 0, so only the contrastive loss is used!

Conclusion: "Learn location only from accurate data, learn semantics from all data!" 🎯
๐Ÿ” Real Training Examples
# Example: Cat image training

# Case 1: COCO data (accurate BBox)
image = "cat_photo.jpg"
ground_truth = {
    'box': [100, 50, 200, 150],  # Accurate coordinates
    'text': "cat"
}
→ λ = 1 to learn both location + semantics! ✅

# Case 2: CC3M data (Pseudo Box)  
image = "cat_photo.jpg"
pseudo_labels = {
    'box': [90, 45, 210, 160],   # Inaccurate coordinates made by GLIP
    'text': "cat"
}
→ λ = 0 to learn semantics only! (ignore location) ✅

🎨 Leveraging Mosaic Augmentation

Learning from multiple images combined into one:
┌─────────┬─────────┐
│ 🐱 cat  │ 🚗 car   │
├─────────┼─────────┤
│ 🐕 dog  │ 👤 person│
└─────────┴─────────┘
→ Learning 4 objects at once boosts efficiency! ⚡

💡 Core Idea of the Data Collection

"Accurate data is small but guarantees quality; massive data is labeled automatically and put to use!"

  1. Small, precise data: Detection + Grounding (accurate BBox)
  2. Large, automatic data: Image-Text → Pseudo Labeling (scale acquisition)
  3. Balanced learning: mix both types for optimal performance! 🎯

With this clever data strategy, YOLO-World achieved fast yet accurate Open-Vocabulary detection! 🚀


Experimental Results!! ✨

| Item | Description |
| --- | --- |
| Real-time Performance | Achieves 35.4 AP @ 52.0 FPS on LVIS (V100 GPU) |
| Prompt-then-Detect | Offline vocabulary embeddings remove the need for real-time text encoding |
| Zero-Shot Ability | Detects objects unseen during training from text prompts alone |
| Lightweight | 20× faster and 5× smaller than existing Open-Vocabulary models |

🎯 Key Technical Innovations

Core Components of RepVL-PAN

  • 🎯 Text-guided CSPLayer (T-CSPLayer)
    Adds text guidance to YOLOv8's C2f layers
    Focuses on text-related regions via Max-Sigmoid Attention

  • 🖼️ Image-Pooling Attention
    Compresses multi-scale image features into 27 patch tokens
    Enhances the text embeddings with visual context


📊 Performance Comparison

Zero-shot LVIS Benchmark

| Model | Backbone | FPS | AP | AP_r | AP_c | AP_f |
| --- | --- | --- | --- | --- | --- | --- |
| GLIP-T | Swin-T | 0.12 | 26.0 | 20.8 | 21.4 | 31.0 |
| Grounding DINO-T | Swin-T | 1.5 | 27.4 | 18.1 | 23.3 | 32.7 |
| DetCLIP-T | Swin-T | 2.3 | 34.4 | 26.9 | 33.9 | 36.3 |
| YOLO-World-L | YOLOv8-L | 52.0 | 35.4 | 27.6 | 34.1 | 38.0 |
  • High FPS! Extremely fast - it can process 52 images per second!
  • While still achieving the highest accuracy (AP) in the comparison!

โš ๏ธ Limitations

  • 🎭 Limited expression of complex interactions
    Simple text prompts may struggle to express complex relationships between objects

  • 📏 Resolution dependency
    Small-object detection may require high-resolution input

  • 💾 Memory usage
    The re-parameterization process adds extra memory overhead


✅ Summary

YOLO-World is a groundbreaking object detection model that simultaneously achieves real-time performance and Open-Vocabulary capabilities.

📌 YOLO's speed + CLIP's language understanding!
It solves the heavy, slow problems of existing Open-Vocabulary models
and offers a practical solution ready for immediate use in industrial settings!

With YOLO-World's emergence, Zero-shot Object Detection on edge devices has become a reality! 🎉



This post is licensed under CC BY 4.0 by the author.