
๐Ÿ“ The First Transformer-based Image Detection Model!! DETR! - Transformer๋กœ ๊ฐ์ฑ„ ํƒ์ง€๊นŒ์ง€!! DETR์˜ ๋“ฑ์žฅ!! (CVPR 2020)

๐Ÿ“ The First Transformer-based Image Detection Model!! DETR! - Transformer๋กœ ๊ฐ์ฑ„ ํƒ์ง€๊นŒ์ง€!! DETR์˜ ๋“ฑ์žฅ!! (CVPR 2020)

📌 What is DETR?

In NLP, Transformers are already the dominant force!
But in vision (object detection), CNN-based models like Faster R-CNN and YOLO still prevailed.
Facebook developed DETR, a Transformer-based image detection model,
and demonstrated that Transformers work well in vision tasks too!


DETR (DEtection TRansformer) was introduced by Facebook AI in 2020.
Unlike traditional CNN-based detectors, it is the first object detection model using a Transformer.


๐Ÿ” Motivation: Limitations of Traditional Object Detectors

  • Anchor Box Design: Requires manual tuning
  • Complex Post-processing: Needs Non-Maximum Suppression (NMS)
  • Modular Design: Not truly end-to-end
  • Region Proposal: Required as a separate stage

1. Anchor Box Design: Manual Tuning Required

  • Anchor boxes were once an innovative solution for object detection.
    • In earlier models, the entire image had to be scanned with sliding windows of various sizes and aspect ratios to detect objects.
    • This resulted in high computational cost and issues with varying scales and shapes of objects.
    • Anchor boxes were introduced to solve this: by predefining boxes of several shapes, matches could be computed far more efficiently.
  • Most detectors at the time used multiple predefined anchor boxes to estimate object locations.
    • For example: using 3 scales × 3 aspect ratios = 9 anchors per location (see the sketch after the problem list below).

🔴 Problems:

  • Anchor sizes, ratios, and quantities must be manually tuned.
  • Optimal anchors vary per dataset → limited generalization.
  • Poor alignment between anchors and objects reduces detection accuracy.
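
To make this concrete, here is a minimal sketch of the 3 scales × 3 aspect ratios recipe mentioned above (the function name and default values are illustrative, not taken from any particular detector):

```python
import numpy as np

# Hypothetical anchor generator: 3 scales x 3 aspect ratios = 9 anchors
# per feature-map location, in the style of classic anchor-based detectors.
def make_anchors(scales=(32, 64, 128), ratios=(0.5, 1.0, 2.0)):
    anchors = []
    for s in scales:
        for r in ratios:
            w = s * np.sqrt(r)   # width grows with the aspect ratio
            h = s / np.sqrt(r)   # height shrinks correspondingly
            anchors.append((-w / 2, -h / 2, w / 2, h / 2))  # box centered at (0, 0)
    return np.array(anchors)

print(make_anchors().shape)  # (9, 4): nine boxes whose sizes/ratios must be hand-tuned
```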

2. Complex Post-processing: Non-Maximum Suppression (NMS)

  • Traditional detectors tend to predict multiple boxes for the same object.
  • NMS is used to select the box with the highest confidence and remove overlapping ones (a minimal sketch follows the problem list below).

🔴 Problems:

  • Performance is sensitive to the NMS threshold.
  • May mistakenly suppress nearby true objects.
  • Hard to parallelize on GPU, limiting inference speed.
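
For contrast with DETR's NMS-free design below, a minimal greedy NMS sketch (illustrative; production implementations are vectorized):

```python
import numpy as np

def iou(a, b):
    # Intersection-over-union of two boxes given as (x1, y1, x2, y2).
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda t: (t[2] - t[0]) * (t[3] - t[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def nms(boxes, scores, thresh=0.5):
    order = list(np.argsort(scores)[::-1])  # highest confidence first
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        # The sequential dependence on already-kept boxes is what makes NMS
        # hard to parallelize, and `thresh` is the sensitive knob.
        order = [j for j in order if iou(boxes[i], boxes[j]) < thresh]
    return keep
```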

3. Modular Design: Hard to Train End-to-End
  • Traditional detection models consist of multiple modules:
    • Backbone (CNN feature extractor)
    • Region Proposal Network (RPN)
    • RoI Pooling
    • Classifier & Box Regressor

🔴 Problems:

  • Modules operate independently, making end-to-end learning difficult.
  • Complex pipelines, harder debugging, and risk of performance bottlenecks.

4. Region Proposal Required
  • Models like Faster R-CNN generate thousands of region proposals first, then classify which ones contain real objects.
  • This step is performed by the Region Proposal Network (RPN).

🔴 Problems:

  • Region proposal introduces extra computation and training.
  • Slows down processing and increases architectural complexity.

These challenges laid the foundation for the creation of DETR.


🧠 Key Ideas Behind DETR

✅ How does DETR solve these traditional problems?

| Traditional Problem | DETR's Solution |
| --- | --- |
| Anchor boxes | ❌ Not used; boxes are predicted via object queries |
| NMS | ❌ Not used; each GT is assigned exactly one prediction |
| Modular structure | ✅ Unified end-to-end Transformer-based model |
| Region proposals | ❌ Not needed; the Transformer predicts box locations directly |

Summary of Advantages

  • ✅ Fully End-to-End Training Structure
  • 🧹 Anchor-Free & NMS-Free
  • 💬 Global Context with Transformer Attention

Advantage 1: Fully End-to-End Training

  • DETR consists of a single integrated Transformer model:
    • Input image → predicted object boxes + classes
    • Trained using a single loss function
  • No need for region proposal, anchor configuration, or NMS post-processing.

👉 Result: Simplified code, easier debugging and maintenance

Advantage 2: 🧹 Anchor-Free & NMS-Free

  • Anchor-Free:
    • No predefined anchor boxes
    • Object queries directly predict positions
    • Automatically adapts to dataset characteristics; no anchor tuning
  • NMS-Free:
    • Each object query learns to handle only one object
    • No need to remove overlaps via NMS
    • Accurate predictions even without post-processing

👉 Enables a cleaner and simpler training/inference pipeline

Advantage 3: Global Context from Transformers

  • CNN-based detectors focus on local features
  • DETR's Transformer learns:
    • Global relationships between image patches
    • Robustness to distant parts or occluded views of an object

👉 Useful for detecting objects in cluttered scenes or with structural context


✅ DETR Architecture

DETR frames object detection as a direct set prediction problem, which lets the Transformer be applied directly.

(DETR architecture diagram)

  • Backbone: CNN (e.g., ResNet)
  • Transformer Encoder-Decoder
  • Object Queries: Fixed-size set of learnable queries
  • Hungarian Matching: One-to-one match with ground truth
  • No Post-processing: NMS not needed

DETR Pipeline Summary:

Input Image
 → CNN Backbone (e.g., ResNet)
   → Transformer Encoder-Decoder
     → Object Query Set
       → Predictions {Class, Bounding Box}₁~ₙ

DETR Component 1: Backbone

  • Extracts image features
  • Uses a CNN-based backbone (e.g., ResNet-50, ResNet-101) to process the input image
  • Outputs a 2D feature map that retains spatial layout and semantics
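
A rough sketch of this step, assuming torchvision's stock ResNet-50 (the wiring is simplified relative to the official facebookresearch/detr code):

```python
import torch
import torchvision

# ResNet-50 with the classification head (avgpool + fc) removed,
# leaving a stride-32 feature map that keeps spatial layout.
resnet = torchvision.models.resnet50(weights=None)
backbone = torch.nn.Sequential(*list(resnet.children())[:-2])

x = torch.randn(1, 3, 800, 1066)   # a typical DETR-scale input
feat = backbone(x)                 # -> (1, 2048, 25, 34), roughly H/32 x W/32
print(feat.shape)
```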

DETR Component 2: Transformer Encoder-Decoder

  • Encoder:
    • Flattens CNN feature map into a sequence of tokens
    • Processes them with self-attention in the Transformer encoder
    • Learns global context across the entire image
  • Decoder:
    • Input: a fixed set of learnable object queries
    • Each query is responsible for predicting one object
    • Uses cross-attention to interact with encoder outputs and predicts bounding boxes and classes
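
A minimal sketch of the flatten-then-attend step using PyTorch's stock nn.Transformer; dimensions follow the paper, while positional encodings and the learned query embedding are stubbed out for brevity:

```python
import torch
import torch.nn as nn

d_model, num_queries = 256, 100
proj = nn.Conv2d(2048, d_model, kernel_size=1)   # reduce backbone channels
transformer = nn.Transformer(d_model=d_model, nhead=8,
                             num_encoder_layers=6, num_decoder_layers=6,
                             batch_first=True)

feat = torch.randn(1, 2048, 25, 34)              # CNN feature map from the backbone
src = proj(feat).flatten(2).permute(0, 2, 1)     # -> (1, 850, 256) token sequence
# (the real model adds 2D positional encodings to `src` here)
queries = torch.zeros(1, num_queries, d_model)   # stand-in for learnable object queries
out = transformer(src, queries)                  # decoder output: (1, 100, 256)
print(out.shape)
```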

DETR Component 3: Object Query

  • DETR uses a fixed number of learnable object queries
  • For example, with 100 queries, the model always outputs 100 predictions
  • Some predictions correspond to real objects, others are classified as “no object”

📌 Unlike anchor-based approaches, this enables direct and interpretable position learning
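
In code, the fixed query set and prediction heads are tiny; this is a sketch (DETR's actual box head is a small MLP rather than a single linear layer):

```python
import torch.nn as nn

num_queries, d_model, num_classes = 100, 256, 91   # 91 class slots for COCO in DETR
query_embed = nn.Embedding(num_queries, d_model)   # the 100 learnable object queries
class_head = nn.Linear(d_model, num_classes + 1)   # +1 logit for the "no object" label
bbox_head = nn.Linear(d_model, 4)                  # normalized (cx, cy, w, h)
# Every decoder output is scored, so the model always emits exactly 100 predictions.
```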


DETR Component 4: Hungarian Matching

🧠 What is Hungarian Matching?

  • A classic algorithm for solving the assignment problem, which finds the optimal one-to-one pairing between two sets based on a cost matrix
  • Yes, it's named after Hungarian mathematicians!

  • Goal:
    • Given a set of jobs and workers with assignment costs,
    • Find the minimum-cost one-to-one matching
  • Example:
    • If you have 3 workers and 3 tasks,
    • How should they be assigned to minimize total cost?
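
The 3-workers/3-tasks example in a few lines, using SciPy's solver for the assignment problem (the cost numbers are made up):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# cost[i][j] = cost of assigning worker i to task j.
cost = np.array([[4, 1, 3],
                 [2, 0, 5],
                 [3, 2, 2]])
rows, cols = linear_sum_assignment(cost)  # optimal one-to-one matching
print(list(zip(rows, cols)))              # [(0, 1), (1, 0), (2, 2)]
print(cost[rows, cols].sum())             # minimum total cost: 1 + 2 + 2 = 5
```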

🧠 Hungarian Matching in DETR

This algorithm is used only during training!

  • During training, DETR uses Hungarian Matching to match predicted boxes to ground-truth objects (GT)
    • DETR's 100 queries may predict many different boxes for the same object
    • But only the best match is used for each GT object
    • It calculates a cost matrix combining classification error, L1 box distance, and IoU
  • Matching ensures that:
    • Each GT is paired with the best fitting query
    • Overlapping or redundant boxes are discouraged
  • This enables clean and duplicate-free training without needing NMS

📌 However, during inference no such matching is used,
so a poorly trained model can still predict multiple boxes for the same object.
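
A simplified sketch of how such a matching step could look; the weights are illustrative, and the generalized-IoU term from the paper is omitted:

```python
import torch
from scipy.optimize import linear_sum_assignment

def match(pred_logits, pred_boxes, gt_labels, gt_boxes, w_cls=1.0, w_l1=5.0):
    # pred_logits: (100, num_classes+1), pred_boxes: (100, 4)
    prob = pred_logits.softmax(-1)
    cost_cls = -prob[:, gt_labels]                    # cheap if the GT class is likely
    cost_l1 = torch.cdist(pred_boxes, gt_boxes, p=1)  # pairwise L1 box distance
    cost = w_cls * cost_cls + w_l1 * cost_l1          # (100, num_gt) cost matrix
    q_idx, gt_idx = linear_sum_assignment(cost.detach().numpy())
    return q_idx, gt_idx  # one query per GT; only these pairs receive box/class loss
```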


โš ๏ธ DETR Limitations

Summary in One Line:

It's slow to train, and not great at small object detection.

  • ๐Ÿข Very slow convergence (training requires hundreds of thousands of steps)
  • ๐Ÿ“ Poor performance on small objects
  • ๐Ÿง  High computational cost of Transformer self-attention

๐Ÿข Slow Convergence

  • DETR takes a long time to learn meaningful assignments between object queries and GT
  • Compared to models like Faster R-CNN, it converges very slowly

    500 epochs!? That's a lot!

  • On the COCO dataset, it typically needs 500+ epochs for strong performance

📌 Reason:

  • Object queries are initialized randomly
  • Early predictions are meaningless
  • Weak supervision signal in the beginning (many queries just predict background)

๐Ÿ“ Weakness on Small Objects

  • While Transformers capture global context, they may overlook fine local details
  • Small objects often get lost in the low-resolution feature map
  • Object queries may struggle to lock onto such small targets

📌 Traditional CNN detectors often use FPNs and multi-scale tricks;
DETR (in its original form) lacked these enhancements.


🧠 Transformer Compute Cost

  • Transformer self-attention has O(N²) complexity
    • (N: number of patches/tokens, proportional to image resolution)
  • High-resolution inputs lead to huge compute and memory demands

📌 As a result:

  • Inference is slower
  • Large memory use and limited batch size
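
Back-of-the-envelope numbers for a stride-32 ResNet feature map (illustrative):

```python
# With stride 32, an H x W image becomes roughly (H/32) * (W/32) tokens.
H, W = 800, 1066
N = (H // 32) * (W // 32)   # 25 * 33 = 825 tokens
print(N, N * N)             # 825 tokens -> ~681k attention scores per head, per layer
# Doubling the resolution roughly quadruples N and grows N^2 by ~16x.
```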

I thought ViT was the model that brought Transformers to vision?
But turns out DETR came first!

In short: ViT uses the Transformer encoder for classification,
while DETR uses CNN features with a Transformer for object detection.

  • ViT (Vision Transformer) was released after DETR (Oct 2020)
  • DETR was one of the first applications of Transformers in vision
  • DETR is a CNN + Transformer hybrid,
    while ViT is a pure Transformer vision model

  • After ViT, many DETR variants started using ViT backbones
    (e.g., DINOv2)

🧠 Core Differences

| Item | DETR | ViT |
| --- | --- | --- |
| Published | May 2020 (ECCV) | Oct 2020 (arXiv) |
| Transformer use | Encoder-decoder for object detection | Encoder-only for image classification |
| Input format | CNN feature map fed to the Transformer | Raw image patches fed to the Transformer |
| Model purpose | Predict bounding boxes + classes | Predict a class label |

ViTโ€™s Impact on DETR Evolution

  • ViT popularized Transformer backbones for vision
  • Later DETR variants began using ViT:
    • e.g., DINO + Swin Transformer
    • e.g., Grounding DINO + CLIP-ViT
    • e.g., DINOv2 + ViT-L

📌 ViT made DETR variants more expressive and opened new paths:
open-vocabulary detection, grounding, and multimodal vision models


💬 Final Thoughts

Transformers are amazing, and DETR is proof that
even in vision, we're moving from CNNs to attention-based models!

It's a bit disappointing that, like most object detectors, DETR can only detect the classes it was trained on.
Thankfully, newer research addresses this with the grounding family of models.
And as DETR merges with ViT, we now see many successors that push the field forward.
Iโ€™m excited to continue learning from here!



This post is licensed under CC BY 4.0 by the author.