📝 The First Transformer-based Image Detection Model!! DETR! - Transformer로 객채 탐지까지!! DETR의 등장!! (CVPR 2020)

Posted May 7, 2025

By DrFirst

26 min read

📌 What is DETR?

In NLP, Transformers are already the dominant force!
But in vision (image detection), CNN-based models like AlexNet and YOLO still prevailed.
Facebook developed DETR, a Transformer-based image detection model,
and demonstrated that Transformers work well in vision tasks too!

DETR (DEtection TRansformer) was introduced by Facebook AI in 2020.
Unlike traditional CNN-based detectors, it is the first object detection model using a Transformer.

🔍 Motivation: Limitations of Traditional Object Detectors

Anchor Box Design: Requires manual tuning
Complex Post-processing: Needs Non-Maximum Suppression (NMS)
Modular Design: Not truly end-to-end
Region Proposal: Required as a separate stage

1. Anchor Box Design: Manual Tuning Required

Anchor boxes were once an innovative solution for object detection.
- In earlier models, the entire image had to be scanned with sliding windows of various sizes and aspect ratios to detect objects.
- This resulted in high computational cost and issues with varying scales and shapes of objects.
- Anchor boxes were introduced to solve this — by predefining different shapes of boxes and calculating matches more efficiently.
Most detectors at the time used multiple predefined anchor boxes to estimate object locations.
- For example: using 3 scales × 3 aspect ratios = 9 anchors per location.

🔴 Problems:

Anchor sizes, ratios, and quantities must be manually tuned.
Optimal anchors vary per dataset → limited generalization.
Poor alignment between anchors and objects reduces detection accuracy.

2. Complex Post-processing: Non-Maximum Suppression (NMS)

Traditional detectors tend to predict multiple boxes for the same object.
NMS is used to select the box with the highest confidence and remove overlapping ones.

🔴 Problems:

Performance is sensitive to the NMS threshold.
May mistakenly suppress nearby true objects.
Hard to parallelize on GPU, limiting inference speed.

3. Modular Design: Hard to Train End-to-End

Traditional detection models consist of multiple modules:
- Backbone (CNN feature extractor)
- Region Proposal Network (RPN)
- RoI Pooling
- Classifier & Box Regressor

🔴 Problems:

Modules operate independently, making end-to-end learning difficult.
Complex pipelines, harder debugging, and risk of performance bottlenecks.

4. Region Proposal Required

Models like Faster R-CNN generate thousands of region proposals first, then classify which ones contain real objects.
This step is performed by the Region Proposal Network (RPN).

🔴 Problems:

Region proposal introduces extra computation and training.
Slows down processing and increases architectural complexity.

These challenges laid the foundation for the creation of DETR.

🧠 Key Ideas Behind DETR

✅ How does DETR solve these traditional problems?

Traditional Problems	DETR’s Solution
Anchor Boxes	❌ Not used — boxes are predicted via object queries
NMS	❌ Not used — each GT is assigned one prediction only
Modular Structure	✅ Unified End-to-End Transformer-based model
Region Proposal	❌ Not needed — Transformer directly predicts box location

Summary of Advantages

✅ Fully End-to-End Training Structure
🧹 Anchor-Free & NMS-Free
💬 Global Context with Transformer Attention

Advantage 1: Fully End-to-End Training

DETR consists of a single integrated Transformer model:
- Input image → predicted object boxes + classes
- Trained using a single loss function
No need for region proposal, anchor configuration, or NMS post-processing.

👉 Result: Simplified code, easier debugging and maintenance

Advantage 2: 🧹 Anchor-Free & NMS-Free

Anchor-Free:
- No predefined anchor boxes
- Object queries directly predict positions
- Automatically adapts to dataset characteristics — no anchor tuning
NMS-Free:
- Each object query learns to handle only one object
- No need to remove overlaps via NMS
- Accurate predictions even without post-processing

👉 Enables a cleaner and simpler training/inference pipeline

Advantage 3: Global Context from Transformers

CNN-based detectors focus on local features
DETR’s Transformer models learn:
- Global relationships between image patches
- Robustness to distant parts or occluded views of an object

👉 Useful for detecting objects in cluttered scenes or with structural context

✅ DETR Architecture

DETR frames object detection as a sequence prediction task, enabling direct application of the Transformer.

Backbone: CNN (e.g., ResNet)
Transformer Encoder-Decoder
Object Queries: Fixed-size set of learnable queries
Hungarian Matching: One-to-one match with ground truth
No Post-processing: NMS not needed

DETR Pipeline Summary:

Input Image 
 → CNN Backbone (e.g., ResNet)
   → Transformer Encoder-Decoder
     → Object Query Set
       → Predictions {Class, Bounding Box}₁~ₙ

DETR Component 1: Backbone

Extracts image features
Uses a CNN-based backbone (e.g., ResNet-50, ResNet-101) to process the input image
Outputs a 2D feature map that retains spatial layout and semantics

DETR Component 2: Transformer Encoder-Decoder

Encoder:
- Flattens CNN feature map into a sequence of tokens
- Processes them with self-attention in the Transformer encoder
- Learns global context across the entire image
Decoder:
- Input: a fixed set of learnable object queries
- Each query is responsible for predicting one object
- Uses cross-attention to interact with encoder outputs and predicts bounding boxes and classes

DETR Component 3: Object Query

DETR uses a fixed number of learnable object queries
For example, with 100 queries, the model always outputs 100 predictions
Some predictions correspond to real objects, others are classified as “no object”

📌 Unlike anchor-based approaches, this enables direct and interpretable position learning

DETR Component 4: Hungarian Matching

🧠 What is Hungarian Matching?

A classic algorithm for solving the assignment problem, which finds the optimal one-to-one pairing between two sets based on a cost matrix
Yes — it’s named after Hungarian mathematicians!
Goal:
- Given a set of jobs and workers with assignment costs,
- Find the minimum-cost one-to-one matching
Example:
- If you have 3 workers and 3 tasks,
- How should they be assigned to minimize total cost?

🧠 Hungarian Matching in DETR

This algorithm is used only during training!

During training, DETR uses Hungarian Matching to match predicted boxes to ground-truth objects (GT)
- DETR’s 100 queries may predict many different boxes for the same object
- But only the best match is used for each GT object
- It calculates a cost matrix combining classification error, L1 box distance, and IoU
Matching ensures that:
- Each GT is paired with the best fitting query
- Overlapping or redundant boxes are discouraged
This enables clean and duplicate-free training without needing NMS

📌 However, during inference, no such matching is used —
so if the model is not well trained, it can predict multiple boxes for the same object.

⚠️ DETR Limitations

Summary in One Line:

It’s slow to train, and not great at small object detection.

🐢 Very slow convergence (training requires hundreds of thousands of steps)
📏 Poor performance on small objects
🧠 High computational cost of Transformer self-attention

🐢 Slow Convergence

DETR takes a long time to learn meaningful assignments between object queries and GT
Compared to models like Faster R-CNN, it converges very slowly
500 epochs!? That’s a lot!
On the COCO dataset, it typically needs 500+ epochs for strong performance

📌 Reason:

Object queries are initialized randomly
Early predictions are meaningless
Weak supervision signal in the beginning (many queries just predict background)

📏 Weakness on Small Objects

While Transformers capture global context, they may overlook fine local details
Small objects often get lost in the low-resolution feature map
Object queries may struggle to lock onto such small targets

📌 Traditional CNN detectors often use FPNs and multi-scale tricks
DETR (in its original form) lacked these enhancements

🧠 Transformer Compute Cost

Transformer self-attention has O(N²) complexity
- (N: number of patches/tokens, proportional to image resolution)
High-resolution inputs lead to huge compute and memory demands

📌 As a result:

Inference is slower
Large memory use and limited batch size

I thought ViT was the model that brought Transformers to vision?
But turns out DETR came first!

In short: ViT uses the Transformer encoder for classification,
while DETR uses CNN features with a Transformer for object detection.

ViT (Vision Transformer) was released after DETR (Oct 2020)
DETR was one of the first applications of Transformers in vision
DETR is a CNN + Transformer hybrid,
while ViT is a pure Transformer vision model
After ViT, many DETR variants started using ViT backbones
(e.g., DINOv2)

🧠 Core Differences

Item	DETR	ViT
Published	May 2020 (ECCV)	Oct 2020 (arXiv)
Transformer Use	Encoder-Decoder for object detection	Encoder-only for image classification
Input Format	CNN feature map to Transformer	Raw image patches to Transformer
Model Purpose	Predict bounding boxes + classes	Predict class label

ViT’s Impact on DETR Evolution

ViT popularized Transformer backbones for vision
Later DETR variants began using ViT:
- e.g., DINO + Swin Transformer
- e.g., Grounding DINO + CLIP-ViT
- e.g., DINOv2 + ViT-L

📌 ViT made DETR variants more expressive and opened new paths:
open-vocabulary detection, grounding, and multimodal vision models

💬 Final Thoughts

Transformers are amazing — and DETR is proof that
even in vision, we’re moving from CNNs to attention-based models!

It’s a bit disappointing that like most object detectors, DETR can only detect pretrained classes.
Thankfully, newer research addresses this with the grounding family of models.
And as DETR merges with ViT, we now see many successors that push the field forward.
I’m excited to continue learning from here!

(한국어) 🧠 트랜스포머 기반의 첫 Image Detection 모델!! DETR!

『End-to-End Object Detection with Transformers』(CVPR, 2020) 공부

논문: End-to-End Object Detection with Transformers
발표: CVPR 2020
코드: facebookresearch/detr

✍️ 저자: Meta AI Research (Carion, Massa et al.)
🌟 한줄 요약: Transformer를 사용해 객채 탐지를 하는 첫 모델!!!

📌 DETR란 무엇인가?

텍스트 세계에서는 이미 Transformer가 왕성하게 활동중!!
그러나 이미지 세계는 (이미지 Detection) Alexnet, Yolo 등 여전히 CNN 기반의 모델들만 존재!
페이스북에서 Transformer 기반의 이미지 Detection 모델 DETR을 개발했고,
이미지의 영역에서도 Transformer가 잘 작동함을 보여줌!!

DETR (DEtection TRansformer)는 Facebook AI에서 2020년에 발표한 객체 탐지 모델로,
기존 CNN 기반의 탐지기와 달리 Transformer를 사용한 최초의 객체 탐지 모델!!

🔍 연구의 배경 : 기존 객체 탐지(Image Detection) 방식의 한계

Anchor Box 설계: 수동 튜닝 필요
복잡한 후처리: Non-Maximum Suppression(NMS)
모듈 분리 구조: End-to-End 학습이 어려움
Region Proposal 필요

1. Anchor Box 설계: 수동 튜닝 필요

원래 Anchor Box 도 객체 감지에 있어 참신한 해결책이었음
- 기존에는 객채 인식을 위헤 이미지 내의 모든 가능한 위치와 다양한 크기의 윈도우를 슬라이딩하며 객체의 존재 여부를 확인했습니다.
- 이는 높은 계산비용, 다양한 종횡비, 객체 크기의 변화 등의 문제가있었습니다.
- 이 문제를 해결하기 위해 Anchor Box이 등장쓰!!
- Anchor Box는 미리 다양한 형태의 박스를 정의해서 효율적으로 계산하게하는것임!!
이에, 이시절 대부분의 객체 탐지기는 사전에 정의된 여러 개의 Anchor Box를 사용하여 물체의 위치 추정
- 예: 3개의 스케일과 3개의 비율을 조합하여 총 9개의 anchor를 하나의 위치에 배치

🔴 문제점

Anchor의 크기, 비율, 수량 등을 수동 설계 및 튜닝해야 함
데이터셋마다 최적의 anchor 구성이 달라 범용성이 떨어짐
Anchor box가 실제 객체와 맞지 않으면 탐지 성능 저하

2. 복잡한 후처리: Non-Maximum Suppression (NMS)

기존 모델들은 같은 객체에 대해 여러 개의 박스를 예측 하게됨
이를 해결하기 위해 NMS 알고리즘을 사용, 겹치는 박스 중에서 가장 신뢰도 높은 것만 남기고 나머지를 제거

🔴 문제점

NMS는 임계값 설정이 민감하여 성능에 영향을 줌
근접한 객체를 잘못 제거할 위험이 있음
GPU 병렬화에 적합하지 않아 연산 속도에 제약

3. 모듈 분리 구조: End-to-End 학습이 어려움

기존 탐지기는 여러 개의 모듈로 구성됨:
- Backbone (CNN 기반 특징 추출기)
- Region Proposal Network (RPN)
- RoI Pooling
- Classifier & Box Regressor

🔴 문제점

각 모듈이 분리된 방식으로 동작하여 전체 모델을 하나로 학습하기 어려움
학습 파이프라인이 복잡하고, 디버깅이 어렵고, 성능 병목이 발생할 수 있음

4. Region Proposal 필요

Faster R-CNN과 같은 모델은 먼저 수천 개의 후보 박스(region proposals)를 생성한 후, 이 중에서 진짜 객체를 분류
이 과정은 RPN(Region Proposal Network)을 통해 수행됨

🔴 문제점

Region Proposal 단계는 추가적인 연산과 학습이 필요
전체 처리 속도를 느리게 만들고 구조 복잡도를 증가시킴

이러한 문제점들은 DETR가 등장하게 된 계기이자 배경입니다.

🧠 DETR의 핵심 아이디어

✅ DETR는 기존의 문제들을 어떻게 해결했을까?

기존 문제점	DETR의 접근 방식
Anchor Box 필요	❌ 없음 — object query를 통해 직접 box 예측
NMS 필요	❌ 없음 — 하나의 GT에 하나의 예측 box만 할당
모듈 분리 구조	✅ End-to-End transformer 구조로 통합
Region Proposal 필요	❌ 없음 — transformer가 직접 위치 예측 수행

장점 요약

✅ 완전한 End-to-End 학습 구조
🧹 Anchor-Free & NMS-Free
💬 Transformer의 글로벌 컨텍스트 활용

장점1 : 완전한 End-to-End 학습 구조

DETR은 하나의 통합된 transformer 네트워크로 구성,
- 이미지 입력 → 객체 위치(box) + 클래스까지
- 단 하나의 손실 함수로 직접 학습 가능
별도의 region proposal, anchor 설정, NMS 등의 후처리가 필요 없음

👉 결과적으로 코드가 단순해지고, 디버깅과 유지보수가 쉬워짐

장점2 : 🧹 Anchor-Free & NMS-Free

Anchor-Free:
- 사전 정의된 anchor box가 없고
- transformer의 object query가 객체 위치를 직접 예측
- anchor 튜닝 필요 없이, 데이터 특성에 자동 적응
NMS-Free:
- DETR는 각 object query가 하나의 객체만 담당하도록 학습됨
- 겹치는 박스를 제거하기 위한 NMS가 불필요
- 후처리 없이도 정확한 예측이 가능

👉 이로 인해 학습/추론 파이프라인이 더 간단하고 깔끔해짐

장점3 : Transformer의 글로벌 컨텍스트 활용

기존 CNN 기반 탐지기는 로컬 특징(local pattern) 중심
반면 DETR는 transformer를 사용하여
- 이미지 전체의 패치 간 관계(global context)를 학습함
- 객체가 떨어져 있거나 부분적으로 보일 때도 강력함

👉 복잡한 배경, 겹친 객체, 구조적 관계 탐지에 유리

✅ DETR의 구조!

DETR는 객체 탐지를 sequence prediction 문제로 바꾸어 Transformer를 적용합니다.

Backbone: ResNet 등 CNN 사용
Transformer Encoder-Decoder 구조
Object Query: 예측할 객체 개수만큼 learnable query 사용
Hungarian Matching: ground truth와 예측 결과를 일대일 대응
Post-processing 없음: NMS 없이 end-to-end로 학습

DETR의 구조 요약!

Input Image 
 → CNN Backbone (e.g., ResNet)
   → Transformer Encoder-Decoder
     → Object Query Set
       → Predictions {Class, Bounding Box}₁~ₙ

DETR의 요소1 : Backbone

이미지 특징 추출
입력 이미지를 처리하기 위해 CNN 기반 backbone 사용 (예: ResNet-50, ResNet-101)
이 단계에서는 고수준의 시각 특징(feature map)을 추출
출력: 이미지의 공간적 정보를 담고 있는 2D feature map

DETR의 요소2 : Transformer Encoder-Decoder

Encoder:
- CNN의 feature map을 flatten하여 sequence로 변환
- 이 sequence를 self-attention 기반 transformer encoder에 입력
- 이미지의 전역 컨텍스트(global context)를 학습
Decoder:
- 입력: learnable한 object query 집합
- 각 query는 예측해야 할 객체 한 개를 담당
- cross-attention을 통해 이미지와 상호작용하며 객체 위치(box)와 클래스(class)를 예측

DETR의 요소3 : Object Query

DETR는 고정된 개수의 learnable object query를 사용
예: object query가 100개라면, 항상 100개의 예측이 출력됨
이 중 일부는 실제 객체, 일부는 “no object”로 분류됨

📌 기존 anchor box 기반 접근과 달리, 직접적인 위치 학습이 가능

DETR의 요소4 : Hungarian Matching

🧠 Hungarian Matching이란?

Hungarian Matching 알고리즘은 두 집합 간의 “최적의 일대일 매칭”을 찾는, 할당 문제 (Assignment Problem)를 해결하는데 최적화된 알고리즘
이름처럼 헝가리 수학자에 의하여 개발된 것이 맞음!!
목표:
- 작업(Job)과 작업자(Worker) 사이의 비용(cost)이 주어졌을 때,
- 총 비용이 최소가 되는 1:1 매칭을 찾는 것
예:
- 3명의 작업자와 3개의 작업이 있을 때,
- 각각을 어떻게 배정하면 총 비용이 최소화될까?

🧠 DETR에서의 Hungarian Matching!!

이 헝가리안 알고리즘은 학습에서만 사용됨!!

DETR에서는 학습시에 예측된 객체와 실제 객체(GT) 간의 매칭에 이 알고리즘을 활용
- DETR의 초안 결과에서는 하나의 객채를 여러 방식으로 예측 할 것이고!
- 그 중 실제 결과 하나만을 매칭해야 하며
- “모든 GT 객체를 예측 중 가장 잘 맞는 하나의 query와 매칭”하는 최적의 조합을 계산
이를 통해 NMS 없이도 중복 없이 깔끔한 탐지 결과 제안 가능
물론, 추론시에는 별도 알고리즘이 없기에 모델의 성능에 따라 한개의 객체를 여러개로 탐지할 위험도 존재

⚠️ DETR의 한계

한계를 요약하면!

한계 한줄요약! : 속도가 느리고 작은 객체에 약하다!!

🐢 수렴 속도 매우 느림 (학습 시간이 수십만 스텝 필요)
📏 작은 객체 탐지 성능 저하
🧠 Transformer 연산량 문제 (고해상도 이미지 처리 어려움)

🐢 수렴 속도 매우 느림

DETR는 학습 초기에는 object query와 ground truth 간의 의미 있는 연결을 찾는 데 시간이 오래 걸림
기존 모델(Faster R-CNN 등)에 비해 수렴 속도가 현저히 느림
500 epoch라니!! 엄청 많지요!?
COCO dataset 기준, 500 epoch 이상 학습해야 안정적인 성능 도달

📌 이유:

object query가 랜덤한 상태로 시작되며, 지도 학습 이전까지 의미 없는 예측이 지속됨
초기에는 많은 query가 background로 매핑되며, 학습 신호가 부족함

📏 작은 객체 탐지 성능 저하

Transformer는 global attention을 기반으로 작동하지만, 이로 인해 세밀한 국소적 정보(local details)가 약해질 수 있음
작은 객체는 feature map에서 잘 표현되지 않아 query가 해당 객체를 잡아내기 어려움
특히, 해상도가 낮은 feature map 위에서 작은 객체는 더 희미해짐

📌 기존 CNN 기반 탐지기들은 작은 객체 처리를 위한 FPN, multi-scale strategy 등을 활용하지만
DETR 초기 버전은 그런 보완이 부족했음

🧠 Transformer 연산량 문제

Transformer 구조는 self-attention 연산의 복잡도가 O(N²)
(N: 입력 시퀀스 길이 → 즉, 픽셀 수에 비례)
고해상도 이미지일수록 연산량과 메모리 사용이 급격히 증가

📌 결과적으로:

추론 속도가 느림
GPU 메모리 사용량이 크고, batch size도 제한적

공부하면서 궁금증!! - DETR와 ViT 관계는??

ViT가 이미지를 Transformer로 적용시킨 거라고 공부했는디!?
그런데 DETR이 이 ViT보다 먼저 나와서,,
결론은 ViT는 Transformer를 인코더로 쓴 거고,
DETR은 CNN으로 벡터화한 결과를 Transformer에 넣어 객체 탐지를 한 거였다!

ViT(Vision Transformer)는 DETR 이후 발표됨 (2020.10)
DETR는 Transformer를 vision에 적용한 최초의 시도 중 하나
즉, DETR는 CNN + Transformer의 하이브리드, ViT는 순수 Transformer 기반 비전 모델
ViT 등장 이후 DETR 계열에 ViT를 backbone으로 사용하는 모델들이 등장 (예: DINOv2)

🧠 핵심 차이 요약

항목	DETR	ViT
발표 시기	2020년 5월 (ECCV)	2020년 10월 (arXiv)
Transformer 용도	Encoder-Decoder 구조로 객체 탐지에 사용	Encoder만 사용하여 이미지 분류에 사용
입력 방식	CNN backbone의 feature map을 transformer에 입력	이미지를 patch로 잘라 직접 transformer에 입력
구조 목적	객체 탐지 (bounding box + class)	이미지 전체에 대한 label 분류

ViT의 영향: DETR 구조의 발전

ViT의 등장은 vision 분야에서 Transformer 기반 백본 사용을 대중화시킴
이후 DETR 계열 모델들도 ViT를 백본으로 채택하기 시작
- 예: DINO + Swin Transformer
- 예: Grounding DINO + CLIP-ViT
- 예: DINOv2 + ViT-L

📌 ViT 덕분에 DETR 계열은 더 강력한 표현력을 가지게 되었고,
open-vocabulary detection, grounding, multimodal 분야로 확장 가능해짐

💬 정리 및 개인 의견

Transformer 의 구조가 얼마나 대단한지 다시금 느끼게됩니다!!
DETR연구가 이미지 분석 역시 CNN의 시대에서 Transformer 시대로의 문을 연 대표적 연구인것 같네요~!

그리고! 기존 Object Detection과 마찬가지로 학습된 객채만 인식할 수 있어 아쉬운것 같습니다!!
추후 연구 중에는 이런 문제를 해결하는 연구도 있고, gounding이라는 이름의 연구들이라고합니다
뿐만 아니라 DETR이 ViT와 결합하며 여러 후속 연구가 나왔다고하니
앞으로 차근차근 공부해봐야겠습니다!
—

AI, Research

DETR Object Detection Transformer 객체 탐지 딥러닝 CV CVPR CVPR 2020

This post is licensed under CC BY 4.0 by the author.

📌 What is DETR?

🔍 Motivation: Limitations of Traditional Object Detectors

1. Anchor Box Design: Manual Tuning Required

2. Complex Post-processing: Non-Maximum Suppression (NMS)

3. Modular Design: Hard to Train End-to-End

4. Region Proposal Required

🔴 Problems:

🧠 Key Ideas Behind DETR

✅ How does DETR solve these traditional problems?

Summary of Advantages

Advantage 1: Fully End-to-End Training

Advantage 2: 🧹 Anchor-Free & NMS-Free

Advantage 3: Global Context from Transformers

✅ DETR Architecture

DETR Pipeline Summary:

DETR Component 1: Backbone

DETR Component 2: Transformer Encoder-Decoder

DETR Component 3: Object Query

DETR Component 4: Hungarian Matching

⚠️ DETR Limitations

Summary in One Line:

🐢 Slow Convergence

📏 Weakness on Small Objects

🧠 Transformer Compute Cost

Curiosity While Studying – How is DETR related to ViT?

🧠 Core Differences

ViT’s Impact on DETR Evolution

💬 Final Thoughts

(한국어) 🧠 트랜스포머 기반의 첫 Image Detection 모델!! DETR!

📌 DETR란 무엇인가?

🔍 연구의 배경 : 기존 객체 탐지(Image Detection) 방식의 한계

1. Anchor Box 설계: 수동 튜닝 필요

2. 복잡한 후처리: Non-Maximum Suppression (NMS)

3. 모듈 분리 구조: End-to-End 학습이 어려움

4. Region Proposal 필요

🔴 문제점

🧠 DETR의 핵심 아이디어

✅ DETR는 기존의 문제들을 어떻게 해결했을까?

장점 요약

장점1 : 완전한 End-to-End 학습 구조

장점2 : 🧹 Anchor-Free & NMS-Free

장점3 : Transformer의 글로벌 컨텍스트 활용

✅ DETR의 구조!

DETR의 구조 요약!

DETR의 요소1 : Backbone

DETR의 요소2 : Transformer Encoder-Decoder

DETR의 요소3 : Object Query

DETR의 요소4 : Hungarian Matching

⚠️ DETR의 한계

한계를 요약하면!

🐢 수렴 속도 매우 느림

📏 작은 객체 탐지 성능 저하

🧠 Transformer 연산량 문제

공부하면서 궁금증!! - DETR와 ViT 관계는??

🧠 핵심 차이 요약

ViT의 영향: DETR 구조의 발전

💬 정리 및 개인 의견

Trending Tags