DINO: The Evolutionary Object Detection Model of DETR!! (ICLR 2023)
A powerful alternative that solves the slow training and small-object detection issues of DETR-based models!
Paper: DINO: DETR with Improved DeNoising Anchor Boxes
Presentation: ICLR 2023 (by IDEA Research)
Code: IDEA-Research/DINO
Comment: After DETR was released, follow-ups such as Deformable DETR, DAB-DETR, and DN-DETR came out one after another, and DINO combines their ideas with its own. It can be hard to follow if you have only studied the original DETR!
โ What is DINO?
DINO is an object detection model that overcomes the limitations of the DETR family
Designed with a focus on improving training speed and small object performance
- DINO = DETR with Improved DeNoising Anchor Boxes
- Basic structure is DETR-based, but performance is enhanced through various strategies
- Achieves performance comparable to two-stage detectors with a one-stage, end-to-end DETR-style structure!
Background of DINO's Emergence: Major Limitations of DETR
- Training is too slow (hundreds of thousands of steps)
- In the early stages of training, DETR's object queries predict boxes at random locations
- This makes effective matching between queries and GT difficult, resulting in sparse learning signals
- → Consequently, convergence is very slow, requiring dozens of times more epochs than typical detectors (the original DETR trains for 500 epochs!?)
- Weak at detecting small objects
- DETR uses only the final feature map of the CNN backbone, so the resolution is low
- (e.g., using the C5-level features of ResNet → reduced resolution)
- Information about small objects almost disappears or is only faintly represented in this coarse feature map
- In addition, the Transformer focuses on global attention, making it weak at local details
- → As a result, box predictions for small objects are inaccurate
- Object queries perform poorly in the early stages of training
- DETR's object queries are randomly initialized at the start
- Early in training, it is not yet determined which query is responsible for predicting which object
- Hungarian Matching forces a 1:1 assignment, but this assignment is inconsistent across training iterations
- → Early in training, queries often overlap or predict irrelevant locations, leading to low performance
A Brief Look at DETR-Family Research Before DINO
Here is a brief summary of the major DETR-family works that preceded DINO!!
Each of these works is worth studying on its own!!
The following studies kept DETR's basic framework while attempting to improve convergence speed, training stability, and localization accuracy.
Deformable DETR (2021, Zhu et al.)
- Core Idea: Deformable Attention
- Performs attention only on a few significant locations instead of the entire image.
- Advantages:
- Significantly improved training speed (more than 10 times)
- Introduction of a two-stage structure enables coarse-to-fine detection
Anchor DETR (2021, Wang et al.)
- Redefined Query in an Anchor-based manner.
- Enables better local search by having Query possess location information.
DAB-DETR (2022, Liu et al.)
- Initializes Query as a Dynamic Anchor Box and refines it progressively in the decoder.
- Improves convergence by providing stable location information from the early stages of learning.
DN-DETR (2022, Zhang et al.)
- Introduced DeNoising Training to stabilize learning.
- By including noised copies of the ground-truth (GT) boxes as extra queries during training, it helps resolve the instability of bipartite matching.
Core Ideas of DINO
This is why understanding DAB-DETR, DN-DETR, and Deformable DETR is necessary!!
DINO successfully combines its own new ideas (CDN, Mixed Query Selection, Look Forward Twice) with proven techniques from earlier DETR research!
Main Components | Description | Introduced Paper (Source) |
---|---|---|
DeNoising Training (+CDN) | Intentionally generates noise boxes around GT during training to quickly converge Queries. DINO extends this contrastively to perform Contrastive DeNoising (CDN) to distinguish between correct and incorrect predictions. | DN-DETR [G. Zhang et al., 2022] + DINO [Zhang et al., 2022] |
Matching Queries | Formulates queries as anchor boxes that carry explicit positional information, which stabilizes learning and matching. | DAB-DETR [Liu et al., 2022] |
Adding Two-stage Structure | The Encoder extracts coarse object candidates, and the Decoder performs refinement. | Deformable DETR [Zhu et al., 2021] |
Look Forward Twice | Improves box accuracy by letting each decoder layer's prediction also receive gradient from the next layer's loss during iterative box refinement. | DINO [Zhang et al., 2022] |
Mixed Query Selection | Uses only the top-K locations selected from the Encoder as Anchors, and the Content remains static to balance stability and expressive power. | DINO [Zhang et al., 2022] |
Idea 1: DeNoising Training (+ CDN)
DINO additionally uses intentionally noisy training samples (denoising query) to help object queries quickly recognize information around the ground truth (GT) in the early stages of training. This strategy alleviates the existing unstable bipartite matching issue and leads to DINOโs unique extension, CDN (Contrastive DeNoising).
Basic DeNoising Training Method
- GT Replication & Noise Addition
- Replicates the ground truth box and label
- Adds position noise (e.g., coordinate jitter 5~10%) and class noise (e.g., person โ dog)
- Denoising Query Generation
- Designates some object queries as denoising queries
- Induces learning to predict the noisy boxes
- Loss Calculation
- Calculates the prediction error for noise queries separately from normal matching queries and includes it in the training
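To make the recipe above concrete, here is a minimal PyTorch sketch of how denoising queries can be built from GT annotations. It is an illustration under assumed conventions (normalized cxcywh boxes, the noise scales shown), not the official IDEA-Research/DINO code.

```python
import torch

def make_denoising_queries(gt_boxes, gt_labels, num_classes,
                           box_noise_scale=0.1, label_noise_prob=0.2):
    """gt_boxes: (N, 4) normalized cxcywh; gt_labels: (N,) long."""
    # 1. Replicate the GT boxes and labels
    noisy_boxes = gt_boxes.clone()
    noisy_labels = gt_labels.clone()

    # 2. Positional noise: jitter the center and size by up to ~10% of the box size
    wh = gt_boxes[:, 2:]
    noise = (torch.rand_like(noisy_boxes) * 2 - 1) * box_noise_scale
    noisy_boxes[:, :2] += noise[:, :2] * wh       # shift the center
    noisy_boxes[:, 2:] *= (1 + noise[:, 2:])      # scale width/height
    noisy_boxes = noisy_boxes.clamp(1e-4, 1.0)

    # 3. Class noise: flip a fraction of labels to a random class (e.g., person -> dog)
    flip = torch.rand(noisy_labels.shape) < label_noise_prob
    random_labels = torch.randint_like(noisy_labels, num_classes)
    noisy_labels = torch.where(flip, random_labels, noisy_labels)

    # These noised copies are fed to the decoder as extra (denoising) queries,
    # which are trained to reconstruct the original GT box and label.
    return noisy_boxes, noisy_labels

# Example: two GT objects
boxes = torch.tensor([[0.5, 0.5, 0.2, 0.3], [0.3, 0.7, 0.1, 0.1]])
labels = torch.tensor([1, 3])
print(make_denoising_queries(boxes, labels, num_classes=80))
```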
CDN (Contrastive DeNoising): DINOโs Extension
Extending the existing denoising technique, DINO introduces a contrastive strategy that simultaneously trains positive / negative query pairs.
Query Type | Generation Method | Learning Objective |
---|---|---|
Positive Query | Slight noise added to GT (position/class) | Induce accurate prediction |
Negative Query | Larger noise added to GT (box pushed farther away, or a wrong class) | Induce a definite "no object" prediction |
- Both types are put into the same decoder, and a different learning objective (loss) is assigned to each.
Main Components
Component | Description |
---|---|
Positive Query | Slight noise added to GT box |
Negative Query | GT box with larger noise / wrong class (a hard negative near the GT) |
Matching Head | Generates prediction results for each |
Loss | Induces Positive to match GT, Negative to no-object |
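A minimal sketch of the CDN idea, assuming normalized cxcywh boxes and illustrative noise scales (lambda1 < lambda2): positives get small noise and keep their GT class as the target, while negatives get larger noise and are assigned the "no object" class. The helper names here are hypothetical, not from the official code.

```python
import torch

def make_cdn_queries(gt_boxes, gt_labels, num_classes, lambda1=0.1, lambda2=0.4):
    """gt_boxes: (N, 4) normalized cxcywh; gt_labels: (N,) long."""
    def jitter(boxes, scale):
        wh = boxes[:, 2:]
        noise = (torch.rand_like(boxes) * 2 - 1) * scale
        out = boxes.clone()
        out[:, :2] = out[:, :2] + noise[:, :2] * wh
        out[:, 2:] = out[:, 2:] * (1 + noise[:, 2:])
        return out.clamp(1e-4, 1.0)

    pos_boxes = jitter(gt_boxes, lambda1)    # small noise  -> should reconstruct the GT
    neg_boxes = jitter(gt_boxes, lambda2)    # larger noise -> should predict "no object"

    pos_targets = gt_labels                                # positives keep their GT class
    neg_targets = torch.full_like(gt_labels, num_classes)  # index num_classes = "no object"

    noisy_boxes = torch.cat([pos_boxes, neg_boxes], dim=0)
    targets = torch.cat([pos_targets, neg_targets], dim=0)
    return noisy_boxes, targets

# Example: one GT object of class 7
boxes, targets = make_cdn_queries(torch.tensor([[0.5, 0.5, 0.2, 0.2]]),
                                  torch.tensor([7]), num_classes=80)
print(boxes, targets)   # 2 queries: one positive (target 7), one negative (target 80)
```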
Summary of CDN Effects
Reduced false positives → Prevents false detections in similar backgrounds/small objects/overlap situations
Induced faster convergence → Queries that were random in the early stages quickly move closer to the correct answer
Improved model's discrimination ability → Strengthens the ability to distinguish between correct answers and similar incorrect answers
Key Summary
Item | Description |
---|---|
Purpose | Enhance the ability to distinguish correct answers from similar incorrect answers |
Strategy | Extend DeNoising query to positive/negative |
Learning Effect | Fast convergence + high accuracy + robust detection |
CDN is not just a simple learning stabilization technique; it is a core technology that makes DINO the fastest and most robust DETR-based model to train.
Idea 2: Matching Queries (Anchor-Based Queries)
Unlike DETR, DINO's object queries do not search for locations completely at random: each matching query is tied to an explicit anchor box, and during training additional queries are anchored near the GT locations from the very beginning.
How it Works
- GT Center Anchor Generation
- Generates a fixed number of query anchors based on GT locations during training
- Query Assignment to Each Anchor
- These anchors are assigned as responsible queries to predict specific GTs
- Matching Process Stabilization
- Hungarian Matching can assign these anchor queries to GTs 1:1 more consistently
Effects
- Queries start near GT, leading to faster convergence
- Reduces the matching instability issues that occurred in the early stages
- Improved performance and convergence speed due to each GT having a clearly corresponding query
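The sketch below shows, roughly, how an anchor-box query can be turned into a positional embedding via sinusoidal encoding, in the DAB-DETR style that DINO inherits. Dimensions and function names are illustrative assumptions, not the reference implementation.

```python
import math
import torch

def sine_embed(coord, num_feats=128, temperature=10000):
    """Map each scalar coordinate in [0, 1] to a num_feats-dim sinusoidal embedding."""
    dim_t = torch.arange(num_feats, dtype=torch.float32)
    dim_t = temperature ** (2 * (dim_t // 2) / num_feats)
    pos = coord[..., None] * 2 * math.pi / dim_t
    return torch.stack((pos[..., 0::2].sin(), pos[..., 1::2].cos()), dim=-1).flatten(-2)

def anchor_to_pos_query(anchors, num_feats=128):
    """anchors: (num_queries, 4) normalized (cx, cy, w, h) -> (num_queries, 4 * num_feats)."""
    return torch.cat([sine_embed(anchors[:, i], num_feats) for i in range(4)], dim=-1)

anchors = torch.rand(900, 4)                 # 900 anchor-box queries
pos_queries = anchor_to_pos_query(anchors)   # positional part of each decoder query
print(pos_queries.shape)                     # torch.Size([900, 512])
```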
Idea 3: Two-stage Structure
DINO extends the existing one-stage structure of DETR by applying a two-stage structure consisting of Encoder → Decoder.
How it Works
- Stage 1 (Encoder)
- Extracts dense object candidates (anchors) through a CNN + Transformer encoder
- Selects Top-K scoring anchors
- Stage 2 (Decoder)
- Performs refined prediction based on the anchors selected from the Encoder
- Adjusts class and accurate box
Effects
- Coarsely identifies locations in the first stage and accurately adjusts them in the second stage → Improved precision
- Increased detection stability in small objects or complex backgrounds
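Here is a minimal sketch of the two-stage step under assumed tensor shapes: score every encoder token, keep the top-K, and use their predicted boxes as the decoder's initial anchors. The head names and shapes are illustrative, not the official API.

```python
import torch
import torch.nn as nn

hidden_dim, num_classes, num_queries = 256, 80, 900
enc_class_head = nn.Linear(hidden_dim, num_classes)   # per-token class / objectness scores
enc_box_head = nn.Linear(hidden_dim, 4)               # per-token box proposal

memory = torch.randn(1, 10000, hidden_dim)            # encoder output tokens (B, HW, C)

logits = enc_class_head(memory)                       # (B, HW, num_classes)
scores = logits.max(dim=-1).values                    # best class score per token
topk = scores.topk(num_queries, dim=1).indices        # (B, K) indices of the best tokens

# Gather the corresponding box proposals -> initial anchors for the decoder
proposals = enc_box_head(memory).sigmoid()            # (B, HW, 4), normalized cxcywh
idx = topk.unsqueeze(-1).expand(-1, -1, 4)
init_anchors = proposals.gather(1, idx)               # (B, K, 4), refined by the decoder
print(init_anchors.shape)                             # torch.Size([1, 900, 4])
```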
Idea 4: Look Forward Twice (LFT)
In DETR-family decoders with iterative box refinement (e.g., Deformable DETR), the box predicted at each layer is detached before being passed on, so each layer's parameters are updated only by its own layer's loss ("look forward once"). DINO instead builds the supervised box of layer i on the undetached box from layer i-1, so the loss of layer i also back-propagates into the previous layer's refinement ("look forward twice").
How it Works
- Look forward once (Deformable DETR)
- The refined box of layer i is built on a detached copy of layer i-1's box, so gradients from layer i's loss stop at layer i
- Look forward twice (DINO)
- The box supervised at layer i is built on the undetached box from layer i-1, so layer i's loss also improves layer i-1's offset prediction
- The box passed forward through the decoder is still detached, which keeps training stable
Effects
- Each decoder layer's box prediction receives an extra round of supervision from the next layer's loss
- More accurate box regression without any extra inference cost
- Particularly helpful for precise localization, including small and overlapping objects
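A conceptual sketch contrasting the two gradient schemes; variable and function names are illustrative, and the update is shown in logit (inverse-sigmoid) space as a simplification of the reference code.

```python
import torch

def iterative_refinement(init_ref, deltas, look_forward_twice=True):
    """init_ref: (K, 4) initial anchors in logit space; deltas: per-layer (K, 4) offsets."""
    supervised_boxes = []                # one prediction per decoder layer, each gets a loss
    ref_detached = init_ref.detach()
    ref_with_grad = init_ref
    for delta in deltas:
        new_ref = ref_detached + delta   # the box that is actually propagated forward
        if look_forward_twice:
            # DINO: supervise a box built on the *undetached* previous reference,
            # so this layer's loss also back-propagates into the previous layer's delta.
            supervised_boxes.append((ref_with_grad + delta).sigmoid())
        else:
            # Deformable DETR: gradients stop at the detached reference,
            # so each layer is optimized only by its own loss.
            supervised_boxes.append(new_ref.sigmoid())
        ref_with_grad = new_ref          # keeps gradient w.r.t. this layer's delta only
        ref_detached = new_ref.detach()  # the next layer starts from a detached box
    return supervised_boxes

# Example: 3 decoder layers refining 5 boxes
deltas = [torch.zeros(5, 4, requires_grad=True) for _ in range(3)]
outputs = iterative_refinement(torch.rand(5, 4), deltas)
print(len(outputs), outputs[0].shape)    # 3 torch.Size([5, 4])
```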
Idea 5: Mixed Query Selection (MQS)
Most existing DETR-family models use the same static queries for every image. Deformable DETR (in its two-stage variant) instead initializes queries dynamically from encoder features, but changing the content part as well can cause confusion, since the selected features may be ambiguous. DINO introduces a Mixed Query Selection strategy that combines the advantages of both.
How it Works
- Select Top-K Important Encoder Features
- Selects the features with high objectness scores from the encoder output
- Anchor (Location Information) is Dynamically Set
- Sets the initial anchor box of each query based on the selected Top-K locations
- Content Remains Static
- The content information of the query remains the learned fixed vector as is
In other words, it is a structure where "where to look" changes depending on the image, while "what to look for" remains as the model has learned.
Effects
- Starts searching from more accurate locations (anchors) suited for each image
- Prevents confusion caused by ambiguous encoder features by maintaining content information
- Achieves fast convergence + high precision simultaneously
Summary
Component | Method |
---|---|
Anchor (Location) | Initialized with the location of the Top-K features extracted from the Encoder |
Content (Meaning) | Maintains a static learned vector |
Expected Effect | Adapts to the location of each image + maintains stable search content |
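A minimal sketch of Mixed Query Selection under assumed shapes and module names: anchors come from per-image top-K encoder proposals (dynamic), while the content embeddings are learned parameters shared across images (static). This is an illustration, not the official module.

```python
import torch
import torch.nn as nn

class MixedQuerySelection(nn.Module):
    def __init__(self, hidden_dim=256, num_queries=900):
        super().__init__()
        self.num_queries = num_queries
        self.enc_score = nn.Linear(hidden_dim, 1)            # objectness per encoder token
        self.enc_box = nn.Linear(hidden_dim, 4)              # proposal box per encoder token
        # static, learned content queries (shared by every image)
        self.content = nn.Embedding(num_queries, hidden_dim)

    def forward(self, memory):                               # memory: (B, HW, C)
        scores = self.enc_score(memory).squeeze(-1)          # (B, HW)
        topk = scores.topk(self.num_queries, dim=1).indices  # (B, K)
        boxes = self.enc_box(memory).sigmoid()               # (B, HW, 4)
        idx = topk.unsqueeze(-1).expand(-1, -1, 4)
        anchors = boxes.gather(1, idx)                       # dynamic, image-specific anchors
        content = self.content.weight.unsqueeze(0).expand(memory.size(0), -1, -1)
        return anchors, content                              # (B, K, 4), (B, K, C)

anchors, content = MixedQuerySelection()(torch.randn(2, 5000, 256))
print(anchors.shape, content.shape)  # torch.Size([2, 900, 4]) torch.Size([2, 900, 256])
```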
DINO Architecture
```
Input Image
↓ CNN Backbone (e.g., ResNet or Swin)
↓ Transformer Encoder
↓ Candidate Object Proposals (Two-stage)
↓ Transformer Decoder
↓ Predictions: {Class, Bounding Box} × N
```
Explanation of Main Architecture Stages
DINO maintains the simplicity of the existing DETR while also being one of the definitive DETR models that enhances training speed, accuracy, and stability.
1. Input Image
- The input image is typically entered into the model in 3-channel RGB format.
2. CNN Backbone
- e.g., ResNet-50, Swin Transformer, etc.
- Role of extracting low-level feature maps from the image
3. Transformer Encoder
- Receives features extracted from the CNN and learns global context information
- Enables each position to relate to other parts of the entire image
4. Candidate Object Proposals (Two-stage)
- Selects the Top-K locations with high objectness from the Encoder output
- Configures the initial anchor of the query based on this (including Mixed Query Selection)
5. Transformer Decoder
- Queries are refined layer by layer; with Look Forward Twice, each layer's box prediction also receives gradient from the next layer's loss
- Denoising queries are also processed together to induce stable learning (including CDN)
6. Predictions
- Finally predicts the object class and box location for each query → Result: N {class, bounding box} pairs are output
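To tie the stages together, here is a compact, runnable PyTorch skeleton that mirrors the six stages above. It is deliberately far simpler than the real DINO (single-scale features, plain attention, no denoising / Look Forward Twice / Mixed Query Selection machinery), and all module names are illustrative.

```python
import torch
import torch.nn as nn

class MiniDETRStyleDetector(nn.Module):
    def __init__(self, num_classes=80, hidden_dim=256, num_queries=100):
        super().__init__()
        # 2. tiny CNN "backbone" standing in for ResNet / Swin
        self.backbone = nn.Sequential(
            nn.Conv2d(3, hidden_dim, 7, stride=4, padding=3), nn.ReLU(),
            nn.Conv2d(hidden_dim, hidden_dim, 3, stride=2, padding=1))
        # 3. Transformer encoder over flattened feature tokens
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(hidden_dim, 8, batch_first=True), num_layers=2)
        # 4-5. learned queries + Transformer decoder
        self.queries = nn.Embedding(num_queries, hidden_dim)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(hidden_dim, 8, batch_first=True), num_layers=2)
        # 6. prediction heads
        self.class_head = nn.Linear(hidden_dim, num_classes + 1)  # +1 for "no object"
        self.box_head = nn.Linear(hidden_dim, 4)

    def forward(self, images):                          # images: (B, 3, H, W)
        feats = self.backbone(images)                   # (B, C, H', W')
        tokens = feats.flatten(2).transpose(1, 2)       # (B, H'W', C)
        memory = self.encoder(tokens)                   # global context
        q = self.queries.weight.unsqueeze(0).expand(images.size(0), -1, -1)
        hs = self.decoder(q, memory)                    # query refinement
        return self.class_head(hs), self.box_head(hs).sigmoid()

logits, boxes = MiniDETRStyleDetector()(torch.randn(1, 3, 256, 256))
print(logits.shape, boxes.shape)  # torch.Size([1, 100, 81]) torch.Size([1, 100, 4])
```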
Final Summary: DINO vs DETR
Item | DETR | DINO (Improved) |
---|---|---|
Training Convergence Speed | Slow | Fast (DeNoising) |
Small Object Detection | Low | Improved |
Object Query Structure | Simple | GT-based matching added |
Stage Structure | One-stage | Includes two-stage structure |
Summary
- DINO keeps the overall structure of DETR while improving it to be fast and accurate enough for practical use.
- A core model that forms the basis of various subsequent studies (Grounding DINO, DINOv2, etc.)
- A highly scalable model that combines well with the latest vision research such as open-vocabulary detection and Segment Anything!! :)
Personal Thoughts
DINO strikes me as an excellent piece of improvement research: it solves DETR's training-efficiency and performance issues by skillfully combining prior work with its own new ideas! Since its core concepts carry over to extensions such as Grounding DINO and DINOv2, it is a model you must remember in order to understand DETR-based Transformer detection models!