DINO: The Evolutionary Object Detection Model of DETR!! (ICLR 2023)
A powerful alternative that solves the slow training and small-object detection issues of DETR-based models!
Paper: DINO: DETR with Improved DeNoising Anchor Boxes
Presentation: ICLR 2023 (by IDEA Research)
Code: IDEA-Research/DINO
Comment: After DETR was released, follow-ups such as Deformable DETR, DAB-DETR, and DN-DETR came out one after another, and DINO combines their ideas with its own. It can be hard to follow if you have only studied the original DETR!
โ What is DINO?
DINO is an object detection model that overcomes the limitations of the DETR family
Designed with a focus on improving training speed and small object performance
- DINO = DETR with Improved DeNoising Anchor Boxes
- Basic structure is DETR-based, but performance is enhanced through various strategies
- Achieves performance comparable to two-stage detectors with a one-stage, end-to-end DETR-style structure!
Background of DINO's Emergence: Major Limitations of DETR
- Training is too slow (hundreds of thousands of steps)
- In the early stages of training, DETR's object queries predict boxes at random locations
- This makes effective matching between queries and GT difficult, resulting in sparse learning signals
- → Consequently, convergence is very slow, requiring dozens of times more epochs than typical detectors (the original DETR trains for 500 epochs!?)
- Weak at detecting small objects
- DETR uses only the final feature map of the CNN backbone, so the resolution is low
- (e.g., using the C5-level features of ResNet → reduced resolution)
- Information about small objects almost disappears or is only faintly represented in this coarse feature map
- In addition, the Transformer focuses on global attention, making it weak at local details
- → As a result, box predictions for small objects are inaccurate
- Object queries perform poorly in the early stages of training
- DETR's object queries are randomly initialized at the start
- Early in training, it is not yet determined which query is responsible for predicting which object
- Hungarian Matching forces a 1:1 assignment, but this assignment is inconsistent across training iterations
- → Early in training, queries often overlap or predict irrelevant locations, leading to low performance
A Brief Look at DETR-Family Research Before DINO
Here is a brief summary of the major DETR-family works that preceded DINO!!
Each of these works is worth studying on its own!!
The following studies kept DETR's basic framework while attempting to improve convergence speed, training stability, and localization accuracy.
Deformable DETR (2021, Zhu et al.)
- Core Idea: Deformable Attention
- Performs attention only on a few significant locations instead of the entire image.
- Advantages:
- Significantly improved training speed (more than 10 times)
- Introduction of a two-stage structure enables coarse-to-fine detection
Anchor DETR (2021, Wang et al.)
- Redefined Query in an Anchor-based manner.
- Enables better local search by having Query possess location information.
DAB-DETR (2022, Liu et al.)
- Initializes Query as a Dynamic Anchor Box and refines it progressively in the decoder.
- Improves convergence by providing stable location information from the early stages of learning.
DN-DETR (2022, Zhang et al.)
- Introduced DeNoising Training to stabilize learning.
- By including noised copies of the ground-truth (GT) boxes as extra queries during training, it helps resolve the instability of bipartite matching.
Core Ideas of DINO
This is why understanding DAB-DETR, DN-DETR, and Deformable DETR is necessary!!
DINO successfully combines its own new ideas (CDN, Mixed Query Selection, Look Forward Twice) with proven techniques from earlier DETR research!
Main Components | Description | Introduced Paper (Source) |
---|---|---|
DeNoising Training (+CDN) | Intentionally generates noise boxes around GT during training to quickly converge Queries. DINO extends this contrastively to perform Contrastive DeNoising (CDN) to distinguish between correct and incorrect predictions. | DN-DETR [G. Zhang et al., 2022] + DINO [Zhang et al., 2022] |
Matching Queries | Formulates queries as anchor boxes that carry explicit positional information, which stabilizes learning and matching. | DAB-DETR [Liu et al., 2022] |
Adding Two-stage Structure | The Encoder extracts coarse object candidates, and the Decoder performs refinement. | Deformable DETR [Zhu et al., 2021] |
Look Forward Twice | Improves box accuracy by letting each decoder layer's prediction also receive gradient from the next layer's loss during iterative box refinement. | DINO [Zhang et al., 2022] |
Mixed Query Selection | Uses only the top-K locations selected from the Encoder as Anchors, and the Content remains static to balance stability and expressive power. | DINO [Zhang et al., 2022] |
Idea 1: DeNoising Training (+ CDN)
DINO additionally uses intentionally noisy training samples (denoising query) to help object queries quickly recognize information around the ground truth (GT) in the early stages of training. This strategy alleviates the existing unstable bipartite matching issue and leads to DINOโs unique extension, CDN (Contrastive DeNoising).
Basic DeNoising Training Method
- GT Replication & Noise Addition
- Replicates the ground truth box and label
- Adds position noise (e.g., coordinate jitter 5~10%) and class noise (e.g., person โ dog)
- Denoising Query Generation
- Designates some object queries as denoising queries
- Induces learning to predict the noisy boxes
- Loss Calculation
- Calculates the prediction error for noise queries separately from normal matching queries and includes it in the training
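To make the recipe above concrete, here is a minimal PyTorch sketch of how denoising queries can be built from GT annotations. It is an illustration under assumed conventions (normalized cxcywh boxes, the noise scales shown), not the official IDEA-Research/DINO code.

```python
import torch

def make_denoising_queries(gt_boxes, gt_labels, num_classes,
                           box_noise_scale=0.1, label_noise_prob=0.2):
    """gt_boxes: (N, 4) normalized cxcywh; gt_labels: (N,) long."""
    # 1. Replicate the GT boxes and labels
    noisy_boxes = gt_boxes.clone()
    noisy_labels = gt_labels.clone()

    # 2. Positional noise: jitter the center and size by up to ~10% of the box size
    wh = gt_boxes[:, 2:]
    noise = (torch.rand_like(noisy_boxes) * 2 - 1) * box_noise_scale
    noisy_boxes[:, :2] += noise[:, :2] * wh       # shift the center
    noisy_boxes[:, 2:] *= (1 + noise[:, 2:])      # scale width/height
    noisy_boxes = noisy_boxes.clamp(1e-4, 1.0)

    # 3. Class noise: flip a fraction of labels to a random class (e.g., person -> dog)
    flip = torch.rand(noisy_labels.shape) < label_noise_prob
    random_labels = torch.randint_like(noisy_labels, num_classes)
    noisy_labels = torch.where(flip, random_labels, noisy_labels)

    # These noised copies are fed to the decoder as extra (denoising) queries,
    # which are trained to reconstruct the original GT box and label.
    return noisy_boxes, noisy_labels

# Example: two GT objects
boxes = torch.tensor([[0.5, 0.5, 0.2, 0.3], [0.3, 0.7, 0.1, 0.1]])
labels = torch.tensor([1, 3])
print(make_denoising_queries(boxes, labels, num_classes=80))
```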
CDN (Contrastive DeNoising): DINOโs Extension
Extending the existing denoising technique, DINO introduces a contrastive strategy that simultaneously trains positive / negative query pairs.
Query Type | Generation Method | Learning Objective |
---|---|---|
Positive Query | Slight noise added to GT (position/class) | Induce accurate prediction |
Negative Query | Larger noise added to GT (box pushed farther away, or a wrong class) | Induce a definite "no object" prediction |
- Both types are put into the same decoder, and a different learning objective (loss) is assigned to each.
Main Components
Component | Description |
---|---|
Positive Query | Slight noise added to GT box |
Negative Query | GT box with larger noise / wrong class (a hard negative near the GT) |
Matching Head | Generates prediction results for each |
Loss | Induces Positive to match GT, Negative to no-object |
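A minimal sketch of the CDN idea, assuming normalized cxcywh boxes and illustrative noise scales (lambda1 < lambda2): positives get small noise and keep their GT class as the target, while negatives get larger noise and are assigned the "no object" class. The helper names here are hypothetical, not from the official code.

```python
import torch

def make_cdn_queries(gt_boxes, gt_labels, num_classes, lambda1=0.1, lambda2=0.4):
    """gt_boxes: (N, 4) normalized cxcywh; gt_labels: (N,) long."""
    def jitter(boxes, scale):
        wh = boxes[:, 2:]
        noise = (torch.rand_like(boxes) * 2 - 1) * scale
        out = boxes.clone()
        out[:, :2] = out[:, :2] + noise[:, :2] * wh
        out[:, 2:] = out[:, 2:] * (1 + noise[:, 2:])
        return out.clamp(1e-4, 1.0)

    pos_boxes = jitter(gt_boxes, lambda1)    # small noise  -> should reconstruct the GT
    neg_boxes = jitter(gt_boxes, lambda2)    # larger noise -> should predict "no object"

    pos_targets = gt_labels                                # positives keep their GT class
    neg_targets = torch.full_like(gt_labels, num_classes)  # index num_classes = "no object"

    noisy_boxes = torch.cat([pos_boxes, neg_boxes], dim=0)
    targets = torch.cat([pos_targets, neg_targets], dim=0)
    return noisy_boxes, targets

# Example: one GT object of class 7
boxes, targets = make_cdn_queries(torch.tensor([[0.5, 0.5, 0.2, 0.2]]),
                                  torch.tensor([7]), num_classes=80)
print(boxes, targets)   # 2 queries: one positive (target 7), one negative (target 80)
```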
Summary of CDN Effects
Reduced false positives → Prevents false detections in similar backgrounds/small objects/overlap situations
Induced faster convergence → Queries that were random in the early stages quickly move closer to the correct answer
Improved model's discrimination ability → Strengthens the ability to distinguish between correct answers and similar incorrect answers
Key Summary
Item | Description |
---|---|
Purpose | Enhance the ability to distinguish correct answers from similar incorrect answers |
Strategy | Extend DeNoising query to positive/negative |
Learning Effect | Fast convergence + high accuracy + robust detection |
CDN is not just a simple learning stabilization technique; it is a core technology that makes DINO the fastest and most robust DETR-based model to train.
Idea 2: Matching Queries (Anchor-Based Queries)
Unlike DETR, DINO's object queries do not search for locations completely at random: each matching query is tied to an explicit anchor box, and during training additional queries are anchored near the GT locations from the very beginning.
How it Works
- GT Center Anchor Generation
- Generates a fixed number of query anchors based on GT locations during training
- Query Assignment to Each Anchor
- These anchors are assigned as responsible queries to predict specific GTs
- Matching Process Stabilization
- Hungarian Matching can assign these anchor queries to GTs 1:1 more consistently
Effects
- Queries start near GT, leading to faster convergence
- Reduces the matching instability issues that occurred in the early stages
- Improved performance and convergence speed due to each GT having a clearly corresponding query
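The sketch below shows, roughly, how an anchor-box query can be turned into a positional embedding via sinusoidal encoding, in the DAB-DETR style that DINO inherits. Dimensions and function names are illustrative assumptions, not the reference implementation.

```python
import math
import torch

def sine_embed(coord, num_feats=128, temperature=10000):
    """Map each scalar coordinate in [0, 1] to a num_feats-dim sinusoidal embedding."""
    dim_t = torch.arange(num_feats, dtype=torch.float32)
    dim_t = temperature ** (2 * (dim_t // 2) / num_feats)
    pos = coord[..., None] * 2 * math.pi / dim_t
    return torch.stack((pos[..., 0::2].sin(), pos[..., 1::2].cos()), dim=-1).flatten(-2)

def anchor_to_pos_query(anchors, num_feats=128):
    """anchors: (num_queries, 4) normalized (cx, cy, w, h) -> (num_queries, 4 * num_feats)."""
    return torch.cat([sine_embed(anchors[:, i], num_feats) for i in range(4)], dim=-1)

anchors = torch.rand(900, 4)                 # 900 anchor-box queries
pos_queries = anchor_to_pos_query(anchors)   # positional part of each decoder query
print(pos_queries.shape)                     # torch.Size([900, 512])
```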
Idea 3: Two-stage Structure
DINO extends the existing one-stage structure of DETR by applying a two-stage structure consisting of Encoder → Decoder.
How it Works
- Stage 1 (Encoder)
- Extracts dense object candidates (anchors) through a CNN + Transformer encoder
- Selects Top-K scoring anchors
- Stage 2 (Decoder)
- Performs refined prediction based on the anchors selected from the Encoder
- Adjusts class and accurate box
Effects
- Coarsely identifies locations in the first stage and accurately adjusts them in the second stage → Improved precision
- Increased detection stability in small objects or complex backgrounds
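Here is a minimal sketch of the two-stage step under assumed tensor shapes: score every encoder token, keep the top-K, and use their predicted boxes as the decoder's initial anchors. The head names and shapes are illustrative, not the official API.

```python
import torch
import torch.nn as nn

hidden_dim, num_classes, num_queries = 256, 80, 900
enc_class_head = nn.Linear(hidden_dim, num_classes)   # per-token class / objectness scores
enc_box_head = nn.Linear(hidden_dim, 4)               # per-token box proposal

memory = torch.randn(1, 10000, hidden_dim)            # encoder output tokens (B, HW, C)

logits = enc_class_head(memory)                       # (B, HW, num_classes)
scores = logits.max(dim=-1).values                    # best class score per token
topk = scores.topk(num_queries, dim=1).indices        # (B, K) indices of the best tokens

# Gather the corresponding box proposals -> initial anchors for the decoder
proposals = enc_box_head(memory).sigmoid()            # (B, HW, 4), normalized cxcywh
idx = topk.unsqueeze(-1).expand(-1, -1, 4)
init_anchors = proposals.gather(1, idx)               # (B, K, 4), refined by the decoder
print(init_anchors.shape)                             # torch.Size([1, 900, 4])
```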
Idea 4: Look Forward Twice (LFT)
In DETR-family decoders with iterative box refinement (e.g., Deformable DETR), the box predicted at each layer is detached before being passed on, so each layer's parameters are updated only by its own layer's loss ("look forward once"). DINO instead builds the supervised box of layer i on the undetached box from layer i-1, so the loss of layer i also back-propagates into the previous layer's refinement ("look forward twice").
How it Works
- Look forward once (Deformable DETR)
- The refined box of layer i is built on a detached copy of layer i-1's box, so gradients from layer i's loss stop at layer i
- Look forward twice (DINO)
- The box supervised at layer i is built on the undetached box from layer i-1, so layer i's loss also improves layer i-1's offset prediction
- The box passed forward through the decoder is still detached, which keeps training stable
Effects
- Each decoder layer's box prediction receives an extra round of supervision from the next layer's loss
- More accurate box regression without any extra inference cost
- Particularly helpful for precise localization, including small and overlapping objects
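A conceptual sketch contrasting the two gradient schemes; variable and function names are illustrative, and the update is shown in logit (inverse-sigmoid) space as a simplification of the reference code.

```python
import torch

def iterative_refinement(init_ref, deltas, look_forward_twice=True):
    """init_ref: (K, 4) initial anchors in logit space; deltas: per-layer (K, 4) offsets."""
    supervised_boxes = []                # one prediction per decoder layer, each gets a loss
    ref_detached = init_ref.detach()
    ref_with_grad = init_ref
    for delta in deltas:
        new_ref = ref_detached + delta   # the box that is actually propagated forward
        if look_forward_twice:
            # DINO: supervise a box built on the *undetached* previous reference,
            # so this layer's loss also back-propagates into the previous layer's delta.
            supervised_boxes.append((ref_with_grad + delta).sigmoid())
        else:
            # Deformable DETR: gradients stop at the detached reference,
            # so each layer is optimized only by its own loss.
            supervised_boxes.append(new_ref.sigmoid())
        ref_with_grad = new_ref          # keeps gradient w.r.t. this layer's delta only
        ref_detached = new_ref.detach()  # the next layer starts from a detached box
    return supervised_boxes

# Example: 3 decoder layers refining 5 boxes
deltas = [torch.zeros(5, 4, requires_grad=True) for _ in range(3)]
outputs = iterative_refinement(torch.rand(5, 4), deltas)
print(len(outputs), outputs[0].shape)    # 3 torch.Size([5, 4])
```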
Idea 5: Mixed Query Selection (MQS)
Most existing DETR-family models use the same static queries for every image. Deformable DETR (in its two-stage variant) instead initializes queries dynamically from encoder features, but changing the content part as well can cause confusion, since the selected features may be ambiguous. DINO introduces a Mixed Query Selection strategy that combines the advantages of both.
How it Works
- Select Top-K Important Encoder Features
- Selects the features with high objectness scores from the encoder output
- Anchor (Location Information) is Dynamically Set
- Sets the initial anchor box of each query based on the selected Top-K locations
- Content Remains Static
- The content information of the query remains the learned fixed vector as is
In other words, it is a structure where "where to look" changes depending on the image, while "what to look for" remains as the model has learned.
Effects
- Starts searching from more accurate locations (anchors) suited for each image
- Prevents confusion caused by ambiguous encoder features by maintaining content information
- Achieves fast convergence + high precision simultaneously
Summary
Component | Method |
---|---|
Anchor (Location) | Initialized with the location of the Top-K features extracted from the Encoder |
Content (Meaning) | Maintains a static learned vector |
Expected Effect | Adapts to the location of each image + maintains stable search content |
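A minimal sketch of Mixed Query Selection under assumed shapes and module names: anchors come from per-image top-K encoder proposals (dynamic), while the content embeddings are learned parameters shared across images (static). This is an illustration, not the official module.

```python
import torch
import torch.nn as nn

class MixedQuerySelection(nn.Module):
    def __init__(self, hidden_dim=256, num_queries=900):
        super().__init__()
        self.num_queries = num_queries
        self.enc_score = nn.Linear(hidden_dim, 1)            # objectness per encoder token
        self.enc_box = nn.Linear(hidden_dim, 4)              # proposal box per encoder token
        # static, learned content queries (shared by every image)
        self.content = nn.Embedding(num_queries, hidden_dim)

    def forward(self, memory):                               # memory: (B, HW, C)
        scores = self.enc_score(memory).squeeze(-1)          # (B, HW)
        topk = scores.topk(self.num_queries, dim=1).indices  # (B, K)
        boxes = self.enc_box(memory).sigmoid()               # (B, HW, 4)
        idx = topk.unsqueeze(-1).expand(-1, -1, 4)
        anchors = boxes.gather(1, idx)                       # dynamic, image-specific anchors
        content = self.content.weight.unsqueeze(0).expand(memory.size(0), -1, -1)
        return anchors, content                              # (B, K, 4), (B, K, C)

anchors, content = MixedQuerySelection()(torch.randn(2, 5000, 256))
print(anchors.shape, content.shape)  # torch.Size([2, 900, 4]) torch.Size([2, 900, 256])
```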
DINO Architecture
```
Input Image
↓ CNN Backbone (e.g., ResNet or Swin)
↓ Transformer Encoder
↓ Candidate Object Proposals (Two-stage)
↓ Transformer Decoder
↓ Predictions: {Class, Bounding Box} × N
```
Explanation of Main Architecture Stages
DINO maintains the simplicity of the existing DETR while also being one of the definitive DETR models that enhances training speed, accuracy, and stability.
1. Input Image
- The input image is typically entered into the model in 3-channel RGB format.
2. CNN Backbone
- e.g., ResNet-50, Swin Transformer, etc.
- Role of extracting low-level feature maps from the image
3. Transformer Encoder
- Receives features extracted from the CNN and learns global context information
- Enables each position to relate to other parts of the entire image
4. Candidate Object Proposals (Two-stage)
- Selects the Top-K locations with high objectness from the Encoder output
- Configures the initial anchor of the query based on this (including Mixed Query Selection)
5. Transformer Decoder
- Queries are refined layer by layer; with Look Forward Twice, each layer's box prediction also receives gradient from the next layer's loss
- Denoising queries are also processed together to induce stable learning (including CDN)
6. Predictions
- Finally predicts the object class and box location for each query → Result: N {class, bounding box} pairs are output
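To tie the stages together, here is a compact, runnable PyTorch skeleton that mirrors the six stages above. It is deliberately far simpler than the real DINO (single-scale features, plain attention, no denoising / Look Forward Twice / Mixed Query Selection machinery), and all module names are illustrative.

```python
import torch
import torch.nn as nn

class MiniDETRStyleDetector(nn.Module):
    def __init__(self, num_classes=80, hidden_dim=256, num_queries=100):
        super().__init__()
        # 2. tiny CNN "backbone" standing in for ResNet / Swin
        self.backbone = nn.Sequential(
            nn.Conv2d(3, hidden_dim, 7, stride=4, padding=3), nn.ReLU(),
            nn.Conv2d(hidden_dim, hidden_dim, 3, stride=2, padding=1))
        # 3. Transformer encoder over flattened feature tokens
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(hidden_dim, 8, batch_first=True), num_layers=2)
        # 4-5. learned queries + Transformer decoder
        self.queries = nn.Embedding(num_queries, hidden_dim)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(hidden_dim, 8, batch_first=True), num_layers=2)
        # 6. prediction heads
        self.class_head = nn.Linear(hidden_dim, num_classes + 1)  # +1 for "no object"
        self.box_head = nn.Linear(hidden_dim, 4)

    def forward(self, images):                          # images: (B, 3, H, W)
        feats = self.backbone(images)                   # (B, C, H', W')
        tokens = feats.flatten(2).transpose(1, 2)       # (B, H'W', C)
        memory = self.encoder(tokens)                   # global context
        q = self.queries.weight.unsqueeze(0).expand(images.size(0), -1, -1)
        hs = self.decoder(q, memory)                    # query refinement
        return self.class_head(hs), self.box_head(hs).sigmoid()

logits, boxes = MiniDETRStyleDetector()(torch.randn(1, 3, 256, 256))
print(logits.shape, boxes.shape)  # torch.Size([1, 100, 81]) torch.Size([1, 100, 4])
```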
Final Summary: DINO vs DETR
Item | DETR | DINO (Improved) |
---|---|---|
Training Convergence Speed | Slow | Fast (DeNoising) |
Small Object Detection | Low | Improved |
Object Query Structure | Simple | GT-based matching added |
Stage Structure | One-stage | Includes two-stage structure |
Summary
- DINO keeps the overall structure of DETR while improving it to be fast and accurate enough for practical use.
- A core model that forms the basis of various subsequent studies (Grounding DINO, DINOv2, etc.)
- A highly scalable model that combines well with the latest vision research such as open-vocabulary detection and Segment Anything!! :)
Personal Thoughts
DINO strikes me as an excellent piece of improvement research: it solves DETR's training-efficiency and performance issues by skillfully combining prior work with its own new ideas! Since its core concepts carry over to extensions such as Grounding DINO and DINOv2, it is a model you must remember in order to understand DETR-based Transformer detection models!