
Understanding Grounding DINO!! - Studying the Grounding DINO Paper!

๐Ÿ“ Understanding Grounding DINO!!

Studying ใ€ŽGrounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detectionใ€ (ECCV, 2024)

(Figure: manhwa)

๐Ÿ“– Paper Title: Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection
โœ๏ธ Authors: Xinyu Chen, Xueyan Zou, Ze Liu, et al.
๐ŸŒŸ One-line Summary: A text-prompt-based object detector!


๐Ÿง  Core Ideas Summary

1๏ธโƒฃ DINO-based Structure with Enhanced Modality Fusion

(Figure: detector structure)

  • Grounding DINO is based on the Transformer-based object detector DINO.
  • Unlike Faster R-CNN, DINOโ€™s structure naturally allows layer-level fusion between text and image.
  • Grounding DINO performs cross-modality fusion in Neck (Phase A), Query Initialization (Phase B), and Head (Phase C) stages to boost text-conditioned detection performance.

2๏ธโƒฃ Generalization through Grounded Pretraining

  • CLIP excels at global image-text alignment but struggles with region-level grounding.
  • To overcome CLIP-style zero-shot limitations, Grounding DINO introduces contrastive pretraining on region-text pairs.
  • It enhances GLIPโ€™s phrase grounding approach with sub-sentence level text processing to reduce category interference.
  • This allows Grounding DINO to become a true โ€œtext โ†’ detectionโ€ open-set detector, setting new zero-shot baselines on benchmarks such as COCO and ODinW.

๐Ÿ” Background of the Grounding DINO Research

Grounding DINO was proposed to go beyond the limitations of fixed class object detection.
Here is how related models evolved before it:


๐Ÿงฉ From DETR to DINO โ€” Still Bound by Fixed Classes

  • DETR (2020, Facebook AI)
    The first Transformer-based end-to-end object detector
    โ†’ But it only detects predefined classes, like those in COCO.

  • DINO (ICLR 2023)
    Improves DETRโ€™s training stability and accuracy
    โ†’ Great detection, but still limited to fixed class tokens

โžก๏ธ DINO detects well, but only if you already know what to detect.


๐Ÿงฉ Open-Set Object Detection โ€” Breaking Free from Fixed Classes

๐Ÿ” GLIP, OV-DETR, etc.

Traditional detectors are closed-set, trained only to recognize predefined classes via bounding box annotations.

To break that limitation, Microsoft proposed GLIP (Grounded Language-Image Pretraining):

  • Open-set Object Detection
  • Detecting arbitrary categories
  • Using natural language generalization to understand new objects

Similarly, OV-DETR uses a Transformer-based structure in which language-aware object queries are injected into the decoder for open-vocabulary detection.

โš ๏ธ Limitations of Prior Work

These models mostly fused image and text features at limited stages, leading to sub-optimal generalization.

๐Ÿ“Š Multimodal Fusion Comparison
| Model | Fusion Location | Description | Limitation |
|---|---|---|---|
| GLIP | Phase A (Feature Enhancement) | Fuses text-image features in the neck module | Lacks fusion in later modules |
| OV-DETR | Phase B (Decoder Input) | Injects language-aware queries into the decoder | Limited fusion with early vision features |

โžก๏ธ These limited fusions can lead to weaker alignment and lower performance in open-vocabulary detection.


๐Ÿ—ฃ๏ธ SAMโ€™s Possibility and Limitation: Segmentation by Prompt

  • SAM (Segment Anything Model, 2023)
    A universal segmentation model based on point, box, and mask prompts
    โ†’ Can โ€œsegment anythingโ€ as the name implies

  • But SAM couldnโ€™t take natural language prompts directly
    (Text prompts were only conceptually proposed โ€” no actual interpretation)


๐Ÿ’ก Enter Grounding DINO!

Grounding DINO bridges both worlds:

  • Detection power of DINO + text understanding from CLIP/GLIP-style language-image pretraining
  • โ†’ Resulting in a text-prompt-based open-vocabulary object detector

It is then combined with SAM into Grounded-SAM, completing a full pipeline of:
โ€œText โ†’ Detection โ†’ Segmentationโ€
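
If you want to try the โ€œtext โ†’ detectionโ€ step yourself, the GroundingDINO repository linked in the references exposes a small inference API. The sketch below follows the usage pattern from the repository's README as I remember it; treat the exact function names, config path, and checkpoint name as assumptions to verify against the current repo.

```python
# Sketch of text-prompted detection with the IDEA-Research/GroundingDINO package.
# Function names and file paths follow the repo README and may change between versions.
import cv2
from groundingdino.util.inference import load_model, load_image, predict, annotate

model = load_model(
    "groundingdino/config/GroundingDINO_SwinT_OGC.py",   # model config shipped with the repo
    "weights/groundingdino_swint_ogc.pth",               # downloaded checkpoint
)

image_source, image = load_image("assets/demo.jpg")       # hypothetical local image path

boxes, logits, phrases = predict(
    model=model,
    image=image,
    caption="chair . person . dog .",   # categories separated by " . " as in the paper's prompts
    box_threshold=0.35,
    text_threshold=0.25,
)

annotated = annotate(image_source=image_source, boxes=boxes, logits=logits, phrases=phrases)
cv2.imwrite("annotated.jpg", annotated)
```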


๐Ÿงช Grounding DINO Architecture

(Figure: full architecture)

๐Ÿ“ Architecture Overview

Grounding DINO uses a dual-encoder + single-decoder design:

  1. Image Backbone: Extracts visual features
  2. Text Backbone: Extracts language features
  3. Feature Enhancer: Fuses image-text features (Sec. 3.1)
  4. Language-Guided Query Selection: Initializes decoder queries (Sec. 3.2)
  5. Cross-Modality Decoder: Refines detections (Sec. 3.3)

3.1 ๐Ÿ”ง Feature Extraction and Enhancer
  • Image features via Swin Transformer (multi-scale)
  • Text features via BERT
  • Fusion includes:
    • Deformable Self-Attention for image
    • Vanilla Self-Attention for text
    • Image-to-Text and Text-to-Image Cross-Attention
  • Multiple stacked fusion layers
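
To make the stacked fusion layers concrete, here is a minimal PyTorch sketch of one feature-enhancer layer. It is only an illustration: the real model uses deformable self-attention on the image side (replaced here by vanilla multi-head attention), and all module names and dimensions are my own choices.

```python
import torch
import torch.nn as nn

class FeatureEnhancerLayer(nn.Module):
    """Simplified sketch of one image-text fusion layer (illustrative names).
    The paper uses deformable self-attention for the image branch; plain
    multi-head attention is used here to keep the example self-contained."""
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.img_self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt_self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.img_cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)  # image tokens attend to text
        self.txt_cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)  # text tokens attend to image
        self.img_ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
        self.txt_ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))

    def forward(self, img: torch.Tensor, txt: torch.Tensor):
        # intra-modality self-attention
        img = img + self.img_self_attn(img, img, img)[0]
        txt = txt + self.txt_self_attn(txt, txt, txt)[0]
        # bidirectional cross-modality attention (the actual fusion step)
        img = img + self.img_cross_attn(img, txt, txt)[0]
        txt = txt + self.txt_cross_attn(txt, img, img)[0]
        # per-modality feed-forward networks
        img, txt = img + self.img_ffn(img), txt + self.txt_ffn(txt)
        return img, txt

# toy usage: one image with 1000 tokens and a prompt with 16 tokens, both 256-d
layer = FeatureEnhancerLayer()
img_feats, txt_feats = layer(torch.randn(1, 1000, 256), torch.randn(1, 16, 256))
```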

3.2 ๐ŸŽฏ Language-Guided Query Selection

Grounding DINO dynamically selects decoder queries based on the input text.
Unlike fixed queries in DETR, it scores the similarity between text and image patches.

๐Ÿ” Process Overview:

  1. ๐Ÿ“ธ Extract image patch features
  2. ๐Ÿ“ Extract text features from the sentence
  3. ๐Ÿ” Measure how well each image patch matches text tokens
  4. โญ Select top 900 image patches as detection queries
  5. โ†’ Used to predict bounding boxes and labels

Each query consists of two parts:

  • Positional Part: Anchor box information
  • Content Part: Learnable feature vector
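
The selection step above can be written in a few lines. This is a minimal sketch under the assumption that the fused image and text tokens are plain tensors; the 900-query budget comes from the paper, while the function name and the missing normalization details are illustrative.

```python
import torch

def language_guided_query_selection(img_feats, txt_feats, num_queries=900):
    """Pick the image tokens most relevant to the text prompt as decoder queries.
    img_feats: (N_img, d) fused image token features
    txt_feats: (N_txt, d) fused text token features
    Returns the indices of the selected tokens (their locations initialize the
    anchor boxes, i.e. the positional part of each query)."""
    logits = img_feats @ txt_feats.T              # (N_img, N_txt) patch-to-token similarity
    scores = logits.max(dim=-1).values            # best-matching text token per image token
    k = min(num_queries, scores.numel())
    return scores.topk(k).indices

# toy usage: 17,000 multi-scale image tokens, 16 text tokens, 256-d features
selected = language_guided_query_selection(torch.randn(17000, 256), torch.randn(16, 256))
print(selected.shape)  # torch.Size([900])
```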

3.3 ๐Ÿ”„ Cross-Modality Decoder

Each decoder layer includes:

  1. Self-Attention
  2. Image Cross-Attention
  3. Text Cross-Attention (added)
  4. Feed-Forward Network

โžค Added text cross-attention allows better text-image fusion during decoding.
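
A compact sketch of one such decoder layer is shown below. As with the enhancer sketch, this is a simplification under stated assumptions: the real image cross-attention is deformable, and normalization details are omitted.

```python
import torch
import torch.nn as nn

class CrossModalityDecoderLayer(nn.Module):
    """Sketch of one decoder layer: self-attn -> image cross-attn -> text cross-attn -> FFN."""
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.img_cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt_cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)  # the block added on top of DINO
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))

    def forward(self, queries, img_feats, txt_feats):
        queries = queries + self.self_attn(queries, queries, queries)[0]
        queries = queries + self.img_cross_attn(queries, img_feats, img_feats)[0]  # look at the image
        queries = queries + self.txt_cross_attn(queries, txt_feats, txt_feats)[0]  # look at the text prompt
        return queries + self.ffn(queries)
```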


3.4 โœ‚๏ธ Sub-Sentence Level Text Feature

(Figure: sub-sentence level text features)

Existing approaches:

  • Sentence-level: Encodes whole sentence โ†’ loses fine details
  • Word-level: Encodes all class names together โ†’ unintended word interference

Grounding DINO proposes:
โžก๏ธ Sub-sentence level encoding with attention masks
โ†’ Removes interference between unrelated words
โ†’ Preserves fine-grained per-word features
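
One concrete way to realize this is a block-diagonal attention mask over the concatenated category names, so that tokens attend only within their own phrase. The sketch below assumes the prompt has already been tokenized into per-phrase chunks.

```python
import torch

def sub_sentence_attention_mask(phrase_lengths):
    """Block-diagonal mask: True where attention is allowed, i.e. only between
    tokens that belong to the same category phrase."""
    total = sum(phrase_lengths)
    mask = torch.zeros(total, total, dtype=torch.bool)
    start = 0
    for length in phrase_lengths:
        mask[start:start + length, start:start + length] = True  # attend within the phrase only
        start += length
    return mask

# e.g. the prompt "cat . dog . traffic light" with phrases of 1, 1, and 2 tokens
print(sub_sentence_attention_mask([1, 1, 2]).int())
# tensor([[1, 0, 0, 0],
#         [0, 1, 0, 0],
#         [0, 0, 1, 1],
#         [0, 0, 1, 1]])
```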


๐ŸŽฏ Loss Function Design


๐Ÿ”ง 3.5 Loss Function

Grounding DINO combines multiple loss components:

๐Ÿ“ฆ 1. Bounding Box Regression
  • L1 Loss
  • GIoU Loss for location accuracy

๐Ÿท๏ธ 2. Classification (Text-based)
  • Contrastive Loss: Matches text tokens with predicted boxes
  • Uses:
    • Dot product between queries and text features
    • Focal Loss on the logits for robust learning (see the sketch at the end of this section)

๐Ÿ”„ 3. Matching and Final Loss
  • Bipartite matching aligns predictions with ground truth
  • Final loss = Box loss + Classification loss

๐Ÿงฑ 4. Auxiliary Loss
  • Added at:
    • Each decoder layer
    • Encoder outputs
  • Helps stabilize early-stage training and convergence
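
Tying the classification pieces above together, here is a minimal sketch of the text-aligned classification term: dot-product logits between queries and text tokens, trained with a sigmoid focal loss. Shapes, weights, and the exact focal-loss variant are my assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def grounding_classification_loss(query_feats, txt_feats, targets, alpha=0.25, gamma=2.0):
    """Contrastive query-to-token classification with a sigmoid focal loss.
    query_feats: (N_q, d) decoder outputs; txt_feats: (N_txt, d) text token features;
    targets: (N_q, N_txt) with 1 where a matched query should fire on a token."""
    logits = query_feats @ txt_feats.T                       # dot-product alignment scores
    prob = logits.sigmoid()
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = prob * targets + (1 - prob) * (1 - targets)        # probability of the correct label
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()        # focal term down-weights easy tokens

# toy usage: 900 queries, 16 text tokens, no positive targets
loss = grounding_classification_loss(torch.randn(900, 256), torch.randn(16, 256),
                                     torch.zeros(900, 16))
print(loss.item())
```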

๐Ÿ“Š Ablation Study Summary

Grounding DINO evaluates the importance of each design by removing or altering modules.
Evaluated on COCO (minival) and LVIS (minival) under Zero-Shot and Fine-Tune settings.


๐Ÿ“‹ Results (Table 7)

| ID | Model Variant | COCO (Zero-Shot) | COCO (Fine-Tune) | LVIS (Zero-Shot) |
|---|---|---|---|---|
| 0 | โœ… Full Model | 46.7 | 56.9 | 16.1 |
| 1 | โŒ No Encoder Fusion | 45.8 | 56.1 | 13.1 |
| 2 | โŒ Static Query Selection | 46.3 | 56.6 | 13.6 |
| 3 | โŒ No Text Cross-Attention | 46.1 | 56.3 | 14.3 |
| 4 | โŒ Word-Level Prompt (vs. Sub-sentence) | 46.4 | 56.6 | 15.6 |

๐Ÿ” Interpretation

  1. Encoder Fusion (ID #1) is most critical
    • Drops of 0.9 AP on COCO and 3.0 AP on LVIS
  2. Static Query Selection (ID #2) hurts zero-shot performance
  3. Text Cross-Attention (ID #3) improves grounding
  4. Sub-sentence Prompt is more effective than word-level

โœ… Conclusion

  • Encoder Fusion is the biggest performance booster
  • Query Selection & Text Attention matter especially in open-vocabulary settings
  • Sub-sentence prompts improve fine-grained alignment

๐Ÿ’ก Takeaways

Grounding DINO is not just a better detector โ€”
itโ€™s a model that connects language and vision meaningfully.

I was especially impressed by how it finds objects not from fixed labels,
but from free-form text prompts!


๐Ÿ“š References

  1. Paper: https://arxiv.org/abs/2303.05499
  2. Code: https://github.com/IDEA-Research/GroundingDINO
  3. Thanks to ChatGPT for summarization ๐Ÿ™

(Korean version) ๐Ÿ“ Understanding Grounding DINO!!

Studying ใ€ŽGrounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detectionใ€ (ECCV, 2024)

(Figure: manhwa)

๐Ÿ“– Paper Title: Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection
โœ๏ธ Authors: Shilong Liu, Zhaoyang Zeng, Tianhe Ren, et al.
๐ŸŒŸ One-line Summary: An object detector driven by text prompts!


๐Ÿง  Core Ideas Summary

1๏ธโƒฃ DINO-Based Structure with Enhanced Modality Fusion

(Figure: detector structure)

  • Grounding DINO is designed on top of DINO, a Transformer-based object detector.
  • Unlike the Faster R-CNN architecture, DINO's structure naturally allows layer-level fusion between text and image.
  • Cross-modality fusion is performed at the Neck (Phase A), Query Initialization (Phase B), and Head (Phase C) stages, which improves text-conditioned detection performance.

2๏ธโƒฃ Open-Set Generalization through Grounded Pretraining

  • CLIP is strong at whole-image level alignment, but has limits when matching text to regions.
  • To overcome this limitation of CLIP-style zero-shot methods, contrastive pretraining on region-text pairs is introduced.
  • GLIP's phrase grounding approach is improved with sub-sentence level text processing, which reduces interference between classes.
  • As a result, Grounding DINO becomes an open-set object detector capable of โ€œtext โ†’ detectionโ€, setting a new bar for zero-shot performance on COCO, ODinW, and other benchmarks.

๐Ÿ” Grounding DINO ์—ฐ๊ตฌ์˜ ๋ฐฐ๊ฒฝ

Grounding DINO๋Š” ๊ธฐ์กด์˜ ๊ฐ์ฒด ํƒ์ง€(Object Detection) ๋ชจ๋ธ๋“ค์ด ๊ฐ€์ง„ ๊ณ ์ •๋œ ํด๋ž˜์Šค ์ œํ•œ์„ ๋›ฐ์–ด๋„˜๊ธฐ ์œ„ํ•ด ์ œ์•ˆ๋œ ๋ชจ๋ธ์ž…๋‹ˆ๋‹ค.
์ด์ „๊นŒ์ง€์˜ ํ๋ฆ„์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:


๐Ÿงฉ DETR ์ดํ›„ DINO, ํ•˜์ง€๋งŒ ์—ฌ์ „ํžˆ ํด๋ž˜์Šค๋Š” ๊ณ ์ •

  • DETR (2020, Facebook AI)
    Transformer ๊ธฐ๋ฐ˜์œผ๋กœ ๊ฐ์ฒด ํƒ์ง€๋ฅผ ์ˆ˜ํ–‰ํ•œ ์ตœ์ดˆ์˜ end-to-end ๋ชจ๋ธ
    โ†’ ํ•˜์ง€๋งŒ ํด๋ž˜์Šค๋Š” COCO์ฒ˜๋Ÿผ ์‚ฌ์ „ ์ •์˜๋œ ํด๋ž˜์Šค์…‹์— ํ•œ์ •๋จ

  • DINO (ICLR 2023)
    DETR ๊ตฌ์กฐ๋ฅผ ๊ฐœ์„ ํ•ด ํ•™์Šต ์•ˆ์ •์„ฑ๊ณผ ์ •ํ™•๋„๋ฅผ ๋†’์ธ ๋ชจ๋ธ
    โ†’ ๋›ฐ์–ด๋‚œ ์„ฑ๋Šฅ์„ ๋ณด์˜€์ง€๋งŒ ์—ฌ์ „ํžˆ ๊ณ ์ •๋œ ํด๋ž˜์Šค(class token)๋งŒ ํƒ์ง€ ๊ฐ€๋Šฅ

์ฆ‰, DINO๋Š” ํƒ์ง€๋Š” ์ž˜ํ•˜์ง€๋งŒ โ€˜๋ฌด์—‡์„ ํƒ์ง€ํ• ์ง€โ€™๋Š” ์ด๋ฏธ ์ •ํ•ด์ ธ ์žˆ์–ด์•ผ ํ–ˆ์Šต๋‹ˆ๋‹ค.


๐Ÿงฉ Open-Set Object Detection: Moving Past the Fixed-Class Limit

๐Ÿ” GLIP, OV-DETR, and related work

Traditional object detection was limited to closed-set methods that respond only to
classes predefined through bounding box annotations.

In response, GLIP (Grounded Language-Image Pre-training, Microsoft) proposed the following direction:

  • Open-set object detection
  • Detection of arbitrary classes
  • Understanding and detecting new objects through language-based generalization

In other words, the goal is the ability to detect objects from text prompts, without a fixed label set.

Meanwhile, OV-DETR is a Transformer-based object detector that
injects language-aware queries directly into the decoder to perform open-vocabulary detection.

โš ๏ธ ๊ธฐ์กด ์—ฐ๊ตฌ๋“ค์˜ ํ•œ๊ณ„์ 

์ด๋Ÿฌํ•œ ๋ชจ๋ธ๋“ค์€ ๋ชจ๋‘ ์ด๋ฏธ์ง€์™€ ์–ธ์–ด๋ผ๋Š” ๋ฉ€ํ‹ฐ๋ชจ๋‹ฌ ์ •๋ณด๋ฅผ
์ผ๋ถ€ ๋ชจ๋“ˆ์—๋งŒ ๊ตญํ•œํ•˜์—ฌ ์œตํ•ฉ(fusion)ํ•จ์— ๋”ฐ๋ผ,
์–ธ์–ด ๊ธฐ๋ฐ˜ ์ผ๋ฐ˜ํ™” ์„ฑ๋Šฅ์ด ์ตœ์ ๋ณด๋‹ค ๋‚ฎ๊ฒŒ(sub-optimal) ์ž‘๋™ํ•  ๊ฐ€๋Šฅ์„ฑ์ด ์กด์žฌํ•ฉ๋‹ˆ๋‹ค.

๐Ÿ“Š ์˜ˆ์‹œ: ๋ฉ€ํ‹ฐ๋ชจ๋‹ฌ ๊ฒฐํ•ฉ ์œ„์น˜ ๋น„๊ต
๋ชจ๋ธ๋ฉ€ํ‹ฐ๋ชจ๋‹ฌ ๊ฒฐํ•ฉ ์œ„์น˜์„ค๋ช…ํ•œ๊ณ„์ 
GLIPPhase A (Feature Enhancement)๋ฐฑ๋ณธ ์ดํ›„ neck ๋‹จ๊ณ„์—์„œ ์ด๋ฏธ์ง€-ํ…์ŠคํŠธ ํŠน์ง• ์œตํ•ฉ์ดํ›„ ๋””์ฝ”๋”์™€์˜ ์—ฐ๊ฒฐ์„ฑ ๋ถ€์กฑ
OV-DETRPhase B (Decoder Input)๋””์ฝ”๋”์— ์–ธ์–ด ์ฟผ๋ฆฌ(query)๋ฅผ ์ง์ ‘ ์‚ฝ์ž…์ดˆ๊ธฐ ์‹œ๊ฐ ์ •๋ณด์™€์˜ ๊นŠ์€ ์œตํ•ฉ ๋ถ€์กฑ

โžก๏ธ ์ด๋Ÿฌํ•œ ๊ตฌ์กฐ์  ์ œ์•ฝ์€,
ํ…์ŠคํŠธ์™€ ์ด๋ฏธ์ง€ ๊ฐ„์˜ ๊นŠ์ด ์žˆ๋Š” ์ •๋ ฌ(alignment)์ด ์š”๊ตฌ๋˜๋Š” open-vocabulary ํƒ์ง€์—์„œ
์„ฑ๋Šฅ ์ €ํ•˜ ๋˜๋Š” ์ผ๋ฐ˜ํ™” ํ•œ๊ณ„๋กœ ์ด์–ด์งˆ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.


๐Ÿ—ฃ๏ธ SAM์ด ์ œ์‹œํ•œํ•œ ๊ฐ€๋Šฅ์„ฑ๊ณผ ํ•œ๊ณ„: ํ…์ŠคํŠธ ํ”„๋กฌํ”„ํŠธ ๊ธฐ๋ฐ˜ ๋ถ„ํ•  ์•„๋””์ด๋””์–ด

  • SAM (Segment Anything Model, 2023)
    ํฌ์ธํŠธ, ๋ฐ•์Šค, ๋งˆ์Šคํฌ ๊ธฐ๋ฐ˜์˜ ๋ฒ”์šฉ ์„ธ๊ทธ๋ฉ˜ํ…Œ์ด์…˜ ๋ชจ๋ธ
    โ†’ Segment Anything์ด๋ผ๋Š” ์ด๋ฆ„์— ๊ฑธ๋งž๊ฒŒ ์–ด๋–ค ๊ฐ์ฒด๋“  ์ž˜๋ผ๋‚ผ ์ˆ˜ ์žˆ์Œ

  • ๊ทธ๋Ÿฌ๋‚˜ SAM์€ ํ…์ŠคํŠธ๋ฅผ ์ง์ ‘ ์ž…๋ ฅํ•ด segmentation์„ ์ˆ˜ํ–‰ํ•  ์ˆ˜๋Š” ์—†์—ˆ์Œ
    (ํ…์ŠคํŠธ๋Š” ๊ฐœ๋…์ ์œผ๋กœ ์ œ์‹œ๋˜์—ˆ์ง€๋งŒ, ์‹ค์ œ ํ…์ŠคํŠธ ์ธ์‹์„ ํ•˜์ง€ ์•Š์Œ)


๐Ÿ’ก ๊ทธ๋ž˜์„œ ๋“ฑ์žฅํ•œ Grounding DINO!

Grounding DINO๋Š” ์ด๋Ÿฌํ•œ ๋‘ ํ๋ฆ„์„ ์ž์—ฐ์Šค๋Ÿฝ๊ฒŒ ์—ฐ๊ฒฐํ•ฉ๋‹ˆ๋‹ค:

  • DINO์˜ ๊ฐ์ฒด ํƒ์ง€ ๋Šฅ๋ ฅ + ํ…์ŠคํŠธ ํ”„๋กฌํ”„ํŠธ ํ•ด์„ ๋Šฅ๋ ฅ(CLIP ๊ธฐ๋ฐ˜)
  • โ†’ ๊ฒฐ๊ตญ โ€œ๋ง๋กœ ํƒ์ง€ํ•˜๋Š”(open-vocabulary) ๊ฐ์ฒด ํƒ์ง€๊ธฐโ€๊ฐ€ ๋œ ๊ฒƒ!!

์ดํ›„ SAM๊ณผ ๊ฒฐํ•ฉํ•˜์—ฌ Grounded SAM์œผ๋กœ ํ™•์žฅ๋˜๋ฉฐ,
โ€œํ…์ŠคํŠธ โ†’ ํƒ์ง€ โ†’ ๋ถ„ํ• โ€์ด๋ผ๋Š” ์ „์ฒด ํŒŒ์ดํ”„๋ผ์ธ์ด ์™„์„ฑ๋ฉ๋‹ˆ๋‹ค.


๐Ÿงช Grounding DINO์˜ ๊ตฌ์„ฑ

full_structure

๐Ÿ“ ์•„ํ‚คํ…์ฒ˜ ๊ฐœ์š”

Grounding DINO๋Š” dual-encoder + single-decoder ๊ตฌ์กฐ๋ฅผ ์ฑ„ํƒํ•ฉ๋‹ˆ๋‹ค.

๊ตฌ์„ฑ ์š”์†Œ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

  1. Image Backbone: ์ด๋ฏธ์ง€ ํŠน์ง• ์ถ”์ถœ
  2. Text Backbone: ํ…์ŠคํŠธ ํŠน์ง• ์ถ”์ถœ
  3. Feature Enhancer: ์ด๋ฏธ์ง€-ํ…์ŠคํŠธ ํŠน์ง• ์œตํ•ฉ (Sec. 3.1)
  4. Language-Guided Query Selection: ์ฟผ๋ฆฌ ์ดˆ๊ธฐํ™” (Sec. 3.2)
  5. Cross-Modality Decoder: ๋ฐ•์Šค refinement ์ˆ˜ํ–‰ (Sec. 3.3)

3.1 ๐Ÿ”ง Feature Extraction and Enhancer
  • ์ด๋ฏธ์ง€ Feature: Swin Transformer์™€ ๊ฐ™์€ ๋ฐฑ๋ณธ์„ ํ†ตํ•ด ๋‹ค์ค‘ ์Šค์ผ€์ผ ํŠน์ง• ์ถ”์ถœ
  • ํ…์ŠคํŠธ Feature: BERT ๊ธฐ๋ฐ˜์˜ ๋ฐฑ๋ณธ์œผ๋กœ ์ถ”์ถœ
  • ์œตํ•ฉ ๋ฐฉ์‹:
    • ์ด๋ฏธ์ง€: Deformable self-attention
    • ํ…์ŠคํŠธ: Vanilla self-attention
    • ํฌ๋กœ์Šค๋ชจ๋‹ฌ ์œตํ•ฉ:
      • Image-to-Text Cross-Attention
      • Text-to-Image Cross-Attention
    • ๋‹ค์ˆ˜์˜ Feature Enhancer Layer๋กœ ๊ตฌ์„ฑ

๐Ÿ‘‰ ์„œ๋กœ ๋‹ค๋ฅธ ๋ชจ๋‹ฌ๋ฆฌํ‹ฐ์˜ ํŠน์ง• ์ •๋ ฌ(alignment)์„ ์œ„ํ•œ ํ•ต์‹ฌ ๋ชจ๋“ˆ


3.2 ๐ŸŽฏ Language-Guided Query Selection

Grounding DINO selects its detection queries dynamically according to the input text.
Unlike earlier DETR-family models that use fixed queries,
it computes image-text similarities and selects the most relevant queries.

  • ๐Ÿ” How it works
  1. ๐Ÿ“ธ Split the image into patches and extract a feature for each patch,
  2. ๐Ÿ“ Extract features for the input sentence (e.g., โ€œa red umbrellaโ€) token by token!
  3. ๐Ÿ” Score how well each image patch matches each text token,
  4. โญ Select the 900 highest-scoring image patches as โ€œdetection queriesโ€
  5. These queries enter the decoder and predict bounding boxes and labels
  • What makes up a โ€œqueryโ€? Each query consists of two pieces of information:

  • Positional part: where in the image the query points (an anchor box initialized from the encoder output)
  • Content part: what kind of object the query is looking for

3.3 ๐Ÿ”„ Cross-Modality Decoder
  • Each decoder layer consists of the following blocks:
    1. Self-Attention
    2. Image Cross-Attention
    3. Text Cross-Attention
    4. Feed-Forward Network (FFN)
  • Compared to DINO's decoder, a Text Cross-Attention block is added
    → text information is reflected more strongly when the queries are updated

3.4 โœ‚๏ธ Sub-Sentence Level Text Feature

(Figure: sub-sentence level text features)

  • Existing text encoding approaches:
    • Sentence-level: encodes the whole sentence as one vector → loses precision
    • Word-level: encodes all words together → unnecessary interactions between words

  • Problem: when the text contains several class names, unrelated words attend to each other

  • Solution:
    Introduce a sub-sentence level representation
    → mask the attention between different class names to remove unnecessary interactions
    → keeps precise per-word representations while preventing mutual interference

๐ŸŽฏ Loss์˜ ๊ตฌ์„ฑ


๐Ÿ”ง 3.5 Loss Function

Grounding DINO๋Š” ๊ธฐ์กด์˜ DETR ๊ณ„์—ด ๋ชจ๋ธ๋“ค๊ณผ ์œ ์‚ฌํ•˜๊ฒŒ,
๋‹ค์Œ ์„ธ ๊ฐ€์ง€ ์ฃผ์š” ์†์‹ค ํ•จ์ˆ˜๋ฅผ ์กฐํ•ฉํ•˜์—ฌ ํ•™์Šตํ•ฉ๋‹ˆ๋‹ค:

๐Ÿ“ฆ 1. Bounding Box Regression
  • L1 Loss
  • GIoU Loss (Generalized Intersection over Union)
  • โ†’ ๋ฐ•์Šค ์œ„์น˜ ์˜ˆ์ธก ์ •๋ฐ€๋„ ํ–ฅ์ƒ์— ์‚ฌ์šฉ
  • ์ฐธ๊ณ : DETR, Deformable DETR ๋“ฑ์—์„œ ์‚ฌ์šฉ๋œ ๋ฐฉ์‹๊ณผ ๋™์ผ

๐Ÿท๏ธ 2. Classification (ํ…์ŠคํŠธ ๊ธฐ๋ฐ˜ ๋ถ„๋ฅ˜)
  • Contrastive Loss (GLIP ๋ฐฉ์‹ ์ฑ„ํƒ)
    • ์˜ˆ์ธก๋œ ๊ฐ์ฒด์™€ ํ…์ŠคํŠธ ํ† ํฐ ๊ฐ„์˜ ๋Œ€์‘ ๊ด€๊ณ„ ํ•™์Šต
  • ๋ฐฉ์‹:
    • ๊ฐ ์ฟผ๋ฆฌ์™€ ํ…์ŠคํŠธ ํŠน์ง• ๊ฐ„์˜ dot product โ†’ logits ๊ณ„์‚ฐ
    • ๊ฐ ํ…์ŠคํŠธ ํ† ํฐ๋ณ„๋กœ Focal Loss ์ ์šฉํ•˜์—ฌ ๋ถ„๋ฅ˜ ํ•™์Šต

๐Ÿ”„ 3. ๋งค์นญ ๋ฐ ์ดํ•ฉ ๊ณ„์‚ฐ
  • ์˜ˆ์ธก๊ฐ’๊ณผ ์ •๋‹ต ๊ฐ„ ์ด์ค‘ ์ด๋ถ„ ๋งค์นญ (bipartite matching) ์ˆ˜ํ–‰
    โ†’ ๋ฐ•์Šค regression cost + classification cost ๊ธฐ๋ฐ˜
  • ๋งค์นญ ํ›„ ์ตœ์ข… ์†์‹ค์€ ๋‹ค์Œ์„ ํ•ฉ์‚ฐํ•˜์—ฌ ๊ณ„์‚ฐ:
    • Bounding Box Loss (L1 + GIoU)
    • Classification Loss (Focal + Contrastive)

๐Ÿงฑ 4. Auxiliary Loss
  • DETR ๊ณ„์—ด ๊ตฌ์กฐ๋ฅผ ๋”ฐ๋ฅด๊ธฐ ๋•Œ๋ฌธ์—, ๋‹ค์Œ ๋‘ ์œ„์น˜์— ๋ณด์กฐ ์†์‹ค(auxiliary loss)์„ ์ถ”๊ฐ€ํ•ฉ๋‹ˆ๋‹ค:
    • ๊ฐ ๋””์ฝ”๋” ๋ ˆ์ด์–ด ์ถœ๋ ฅ
    • ์ธ์ฝ”๋” ์ถœ๋ ฅ (encoder outputs)

โžก๏ธ ์ด ๋ณด์กฐ ์†์‹ค์€ ํ•™์Šต ์ดˆ๊ธฐ ์•ˆ์ •์„ฑ๊ณผ ์ˆ˜๋ ด ๊ฐ€์†์— ๊ธฐ์—ฌํ•ฉ๋‹ˆ๋‹ค.


๐Ÿ“Š Grounding DINO Ablation Study Summary

To analyze how each major design choice of Grounding DINO affects actual performance,
ablation experiments were run in which individual components are removed or modified.
The results are evaluated on the COCO (minival) and LVIS (minival) datasets
under Zero-Shot and Fine-Tune settings.


๐Ÿ“‹ Results Summary (Table 7)

| ID | Model Variant | COCO (Zero-Shot) | COCO (Fine-Tune) | LVIS (Zero-Shot) |
|---|---|---|---|---|
| 0 | โœ… Grounding DINO (Full Model) | 46.7 | 56.9 | 16.1 |
| 1 | โŒ w/o Encoder Fusion | 45.8 | 56.1 | 13.1 |
| 2 | โŒ Static Query Selection | 46.3 | 56.6 | 13.6 |
| 3 | โŒ w/o Text Cross-Attention | 46.1 | 56.3 | 14.3 |
| 4 | โŒ Word-Level Text Prompt (vs. Sub-sentence) | 46.4 | 56.6 | 15.6 |

๐Ÿ” ํ•ด์„ ๋ฐ ๊ตฌ์„ฑ ์š”์†Œ๋ณ„ ์˜ํ–ฅ ๋ถ„์„

  1. Encoder Fusion ์ œ๊ฑฐ (๋ชจ๋ธ #1)
    • COCO: -0.9 AP
    • LVIS: -3.0 AP
    • โžค ๊ฐ€์žฅ ํฐ ์„ฑ๋Šฅ ์ €ํ•˜ โ†’ ํ…์ŠคํŠธ-์ด๋ฏธ์ง€ ๊นŠ์€ ์œตํ•ฉ์ด ํ•ต์‹ฌ ์—ญํ• 
  2. Static Query Selection (๋ชจ๋ธ #2)
    • ์ฟผ๋ฆฌ๋ฅผ ๋™์ ์œผ๋กœ ์„ ํƒํ•˜์ง€ ์•Š๊ณ  ๊ณ ์ •๋œ ๋ฐฉ์‹ ์‚ฌ์šฉ
    • LVIS ์„ฑ๋Šฅ -2.5 AP ํ•˜๋ฝ
    • โžค ๋™์  ์ฟผ๋ฆฌ ์„ ํƒ์ด ์ œ๋กœ์ƒท ํƒ์ง€์— ์œ ์˜๋ฏธํ•œ ๊ธฐ์—ฌ
  3. Text Cross-Attention ์ œ๊ฑฐ (๋ชจ๋ธ #3)
    • COCO/Fine-Tune ์˜ํ–ฅ ์ž‘์ง€๋งŒ, LVIS์—์„œ๋Š” -1.8 AP
    • โžค ํ…์ŠคํŠธ ์ •๋ณด๊ฐ€ ๋””์ฝ”๋”์— ์ง์ ‘ ๋ฐ˜์˜๋  ๋•Œ ํšจ๊ณผ ์กด์žฌ
  4. Word-level Prompt ์‚ฌ์šฉ (๋ชจ๋ธ #4)
    • Sub-sentence ๋Œ€์‹  ์ „์ฒด ๋ฌธ์žฅ์„ ๋‹จ์–ด ๋‹จ์œ„๋กœ ์ฒ˜๋ฆฌ
    • LVIS ์„ฑ๋Šฅ -0.5 AP
    • โžค Sub-sentence ๋ฐฉ์‹์ด fine-grained ํ‘œํ˜„์— ์œ ๋ฆฌ

โœ… ๊ฒฐ๋ก  ์š”์•ฝ

  • Encoder Fusion์ด ๊ฐ€์žฅ ํฐ ์„ฑ๋Šฅ ํ–ฅ์ƒ์„ ์ฃผ๋Š” ํ•ต์‹ฌ ๊ตฌ์„ฑ ์š”์†Œ์ž„์ด ํ™•์ธ๋จ
  • Query Selection๊ณผ Text Cross-Attention์€ ํŠนํžˆ LVIS์™€ ๊ฐ™์€ ์„ธ๋ถ„ํ™”๋œ ์˜คํ”ˆ์…‹ ๋ฐ์ดํ„ฐ์…‹์—์„œ ํšจ๊ณผ์ 
  • Sub-sentence ํ…์ŠคํŠธ ์ฒ˜๋ฆฌ๋Š” Word-level ๋ฐฉ์‹๋ณด๋‹ค ์ •๋ฐ€ํ•œ ํ‘œํ˜„๋ ฅ์„ ์ œ๊ณต

๐Ÿ’ก Takeaways

Beyond simply being a good detector, Grounding DINO is a paper that clearly shows
how to connect text and visual information effectively. What impressed me most is that
it searches for objects from text prompts, going beyond the fixed set of classes a model was trained on!


๐Ÿ“š References

  1. Grounding DINO paper: https://arxiv.org/abs/2303.05499
  2. Grounding DINO GitHub: https://github.com/IDEA-Research/GroundingDINO
  3. Thanks to ChatGPT's summarization skills!!
This post is licensed under CC BY 4.0 by the author.