
๐Ÿ“ DINO: The Evolutionary Object Detection Model of DETR!! - DINO: DETR์˜ ์ง„ํ™”ํ˜• ๊ฐ์ฒด ํƒ์ง€ ๋ชจ๋ธ!! (ICLR 2023)

๐Ÿ“ DINO: The Evolutionary Object Detection Model of DETR!! - DINO: DETR์˜ ์ง„ํ™”ํ˜• ๊ฐ์ฒด ํƒ์ง€ ๋ชจ๋ธ!! (ICLR 2023)

🦖 DINO: The Evolutionary Object Detection Model of DETR!!

🔍 A powerful alternative that solves the slow training and small-object detection issues of DETR-based models!

Paper: DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection
Presentation: ICLR 2023 (by IDEA Research)
Code: IDEA-Research/DINO
Comment: After DETR was released, follow-ups such as Deformable DETR, DAB-DETR, and DN-DETR appeared in quick succession; DINO combines their concepts with its own. It is hard to follow if you have only studied the original DETR!


✅ What is DINO?

(Figure: intro comic)

DINO is an object detection model that overcomes the limitations of the DETR family.
It is designed with a focus on improving training speed and small-object performance.

  • DINO = DETR with Improved deNoising anchOr boxes (the paper's own acronym)
  • Basic structure is DETR-based, but performance is enhanced through various strategies
  • Achieves performance comparable to two-stage detectors while keeping an end-to-end structure!

🚨 Background of DINO's Emergence: Major Limitations of DETR

  1. โŒ Training is too slow (hundreds of thousands of steps)
    • In the early stages of training, DETRโ€™s object queries predict boxes at random locations
    • This makes effective matching between queries and GT difficult, resulting in sparse learning signals
    • โ†’ Consequently, the convergence speed is very slow, requiring dozens of times more epochs than typical models (500 epochs!?)
  2. โŒ Weak at detecting small objects
    • DETR uses only the final feature map of the CNN backbone, resulting in low resolution
      • (e.g., using C5 level features of ResNet โ†’ resolution reduction)
    • Information about small objects almost disappears or is faintly represented in this coarse feature map
    • Also, Transformer focuses on global attention, making it weak in local details
    • โ†’ As a result, box predictions for small objects are not accurate
  3. โŒ Low performance of Object Query in the early stages of learning
    • DETRโ€™s object queries are randomly initialized in the beginning
    • The role of which query will predict which object is not determined in the early stages of learning
    • Hungarian Matching forcibly performs 1:1 matching, but this matching is inconsistent
    • โ†’ In the early stages of learning, queries often overlap or predict irrelevant locations, leading to low performance
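The 1:1 assignment mentioned above is computed with the Hungarian algorithm; DETR's official implementation uses `scipy.optimize.linear_sum_assignment`. A minimal sketch with toy boxes, using plain L1 distance as the cost (the real matching cost also mixes in classification probability and GIoU terms):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Toy example: 4 predicted boxes vs. 2 GT boxes, (cx, cy, w, h) in [0, 1].
pred = np.array([[0.10, 0.10, 0.2, 0.2],
                 [0.80, 0.80, 0.3, 0.3],
                 [0.50, 0.50, 0.1, 0.1],
                 [0.20, 0.90, 0.2, 0.2]])
gt = np.array([[0.12, 0.11, 0.2, 0.2],
               [0.78, 0.82, 0.3, 0.3]])

# Cost matrix: L1 distance between every (prediction, GT) pair.
cost = np.abs(pred[:, None, :] - gt[None, :, :]).sum(-1)   # shape (4, 2)

# Hungarian algorithm: globally optimal 1:1 assignment (minimum total cost).
rows, cols = linear_sum_assignment(cost)
print([(int(r), int(c)) for r, c in zip(rows, cols)])  # [(0, 0), (1, 1)]
```

Queries left unmatched (2 and 3 here) are supervised toward the no-object class. Early in training, when predictions are near-random, this cost matrix changes wildly between iterations, which is exactly the instability DN-DETR and DINO attack.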

Briefly Looking at Additional Research in the DETR Family Before DINO

Here's a brief summary of the major DETR-family research before DINO!!
Each of these works is worth studying on its own!!

The following studies kept DETR's basic framework while attempting to improve convergence speed, training stability, and localization accuracy.


🔹 Deformable DETR (2021, Zhu et al.)

  • Core Idea: Deformable Attention
    • Performs attention only on a few significant locations instead of the entire image.
  • Advantages:
    • Significantly faster training (roughly 10× fewer epochs)
    • Introduction of a two-stage structure enables coarse-to-fine detection

🔹 Anchor DETR (2021, Wang et al.)

  • Redefines the query in an anchor-based manner.
  • Giving each query positional information enables better local search.

🔹 DAB-DETR (2022, Liu et al.)

  • Initializes Query as a Dynamic Anchor Box and refines it progressively in the decoder.
  • Improves convergence by providing stable location information from the early stages of learning.

🔹 DN-DETR (2022, Zhang et al.)

  • Introduced DeNoising Training to stabilize learning.
  • Including extra queries built by adding noise to the ground-truth (GT) boxes helps resolve the instability of bipartite matching.

💡 Core Ideas of DINO

This is why understanding Deformable DETR, DAB-DETR, and DN-DETR is necessary!!
DINO successfully combines its own additional ideas (CDN, Mixed Query Selection) with the proven ingredients of this earlier DETR research!

Main components, what they do, and where each idea comes from:

  • DeNoising Training (+CDN): intentionally generates noise boxes around the GT during training so queries converge quickly; DINO extends this contrastively (Contrastive DeNoising, CDN) to distinguish correct from incorrect predictions. (DN-DETR [Zhang et al., 2022] + DINO [Zhang et al., 2022])
  • Matching Queries: places fixed query anchors at locations close to the GT to induce stable learning. (DAB-DETR [Liu et al., 2022])
  • Two-stage structure: the encoder extracts coarse object candidates, and the decoder refines them. (Deformable DETR [Zhu et al., 2021])
  • Look Forward Twice: lets the next decoder layer's loss also refine the current layer's box prediction, improving accuracy. (DINO [Zhang et al., 2022])
  • Mixed Query Selection: uses only the top-K locations selected from the encoder as anchors while keeping the content static, balancing stability and expressive power. (DINO [Zhang et al., 2022])

Idea 1: DeNoising Training (+ CDN)

DINO additionally uses intentionally noised training samples (denoising queries) to help object queries quickly learn information around the ground truth (GT) early in training. This strategy alleviates the unstable bipartite matching problem and leads to DINO's own extension, CDN (Contrastive DeNoising).


Basic DeNoising Training Method
  1. GT Replication & Noise Addition
    • Replicates the ground truth box and label
    • Adds position noise (e.g., coordinate jitter of 5~10%) and class noise (e.g., person → dog)
  2. Denoising Query Generation
    • Designates some object queries as denoising queries and feeds them the noised boxes/labels
    • Trains them to reconstruct the original GT from the noised input
  3. Loss Calculation
    • Computes the prediction error for denoising queries separately from the normally matched queries and includes it in the training loss
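The three steps above can be sketched in NumPy (the actual model is PyTorch; the noise scales and function names below are illustrative, not the paper's):

```python
import numpy as np

def make_denoising_queries(gt_boxes, gt_labels, num_classes,
                           box_noise=0.1, label_flip_p=0.2, seed=0):
    """Replicate GT boxes/labels and perturb them into denoising queries.

    gt_boxes: (N, 4) array, (cx, cy, w, h) normalized to [0, 1].
    Function and argument names are illustrative, not the paper's.
    """
    rng = np.random.default_rng(seed)
    noised = gt_boxes.copy()
    jitter = rng.uniform(-box_noise, box_noise, size=noised.shape)
    noised[:, :2] += jitter[:, :2] * noised[:, 2:]  # shift center by a fraction of box size
    noised[:, 2:] *= 1 + jitter[:, 2:]              # scale width/height
    noised = noised.clip(0.0, 1.0)

    labels = gt_labels.copy()
    flip = rng.random(labels.shape) < label_flip_p  # class noise, e.g. person -> dog
    labels[flip] = rng.integers(0, num_classes, size=int(flip.sum()))
    # The denoising loss then asks the model to recover the ORIGINAL GT
    # (gt_boxes, gt_labels) from these perturbed queries.
    return noised, labels

boxes = np.array([[0.5, 0.5, 0.2, 0.3], [0.2, 0.8, 0.1, 0.1]])
labels = np.array([1, 7])
nb, nl = make_denoising_queries(boxes, labels, num_classes=80)
print(nb.shape, nl.shape)  # (2, 4) (2,)
```

Because the targets of these queries are known in advance, no bipartite matching is needed for them, which is what makes the denoising branch such a stable learning signal.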

CDN (Contrastive DeNoising): DINO's Extension

Extending the existing denoising technique, DINO introduces a contrastive strategy that simultaneously trains positive / negative query pairs.

Each query type, how it is generated, and its training objective:

  • 🎯 Positive Query: slight noise added to the GT (position/class); trained to predict the GT accurately
  • ❌ Negative Query: a random location or an incorrect class; trained to be confidently rejected as 'no object'
  • Both types are fed into the same decoder, and each is given its own training objective (loss).
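A minimal sketch of how one CDN group could be built, pairing each GT with one positive and one negative query. The two noise scales (λ1 < λ2 in the paper) and all names here are illustrative:

```python
import numpy as np

NO_OBJECT = -1  # stand-in index for the "no object" class

def make_cdn_group(gt_boxes, gt_labels, pos_noise=0.05, neg_noise=0.4, seed=0):
    """One CDN group (sketch): a positive and a negative query per GT.

    Positives get small jitter and keep the GT as their target;
    negatives get large jitter and are targeted at "no object".
    Concrete noise values and names are made up for illustration.
    """
    rng = np.random.default_rng(seed)

    def jitter(scale):
        noise = rng.uniform(-scale, scale, size=gt_boxes.shape)
        out = gt_boxes.copy()
        out[:, :2] += noise[:, :2] * out[:, 2:]   # shift center
        out[:, 2:] *= 1 + noise[:, 2:]            # scale size
        return out.clip(0.0, 1.0)

    queries = np.concatenate([jitter(pos_noise), jitter(neg_noise)])
    targets = np.concatenate([gt_labels,                           # positives -> GT class
                              np.full(len(gt_labels), NO_OBJECT)]) # negatives -> no object
    return queries, targets

q, t = make_cdn_group(np.array([[0.5, 0.5, 0.2, 0.2]]), np.array([3]))
print(q.shape, t)  # (2, 4) [ 3 -1]
```

Training the decoder to accept the near-miss and reject the close-but-wrong box is what sharpens the model's discrimination around each GT.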

โš™๏ธ Main Components
ComponentDescription
Positive QuerySlight noise added to GT box
Negative QueryIncorrect box/class unrelated to GT
Matching HeadGenerates prediction results for each
LossInduces Positive to match GT, Negative to no-object

Summary of CDN Effects
  • Fewer false positives → prevents false detections with similar backgrounds, small objects, and overlapping instances
  • Faster convergence → queries that start out random quickly move toward the correct answers
  • Better discrimination → strengthens the ability to separate correct answers from similar-looking wrong ones


Key Summary
  • Purpose: strengthen the ability to distinguish correct answers from similar incorrect ones
  • Strategy: extend denoising queries into positive/negative pairs
  • ✅ Learning effect: fast convergence + high accuracy + robust detection

CDN is not just a training-stabilization trick; it is a core technique that makes DINO the fastest-converging and most robust DETR-family model to train.


Idea 2: Matching Queries (Fixed Anchor Based)

Unlike DETR, DINO's object queries do not search for locations completely at random; instead, pre-defined query anchors are placed near GT locations from the beginning.


How it Works
  1. GT Center Anchor Generation
    • Generates a fixed number of query anchors based on GT locations during training
  2. Query Assignment to Each Anchor
    • These anchors are assigned as responsible queries to predict specific GTs
  3. Matching Process Stabilization
    • Hungarian Matching can then assign these anchor queries to GTs 1:1 much more easily
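A tiny cost-matrix sketch (toy 2-D centers, made-up numbers) shows why this stabilizes matching: with anchors placed near the GT, the assignment cost matrix is near-diagonal from the start, so each GT keeps the same query:

```python
import numpy as np

rng = np.random.default_rng(0)
gt = np.array([[0.2, 0.3], [0.7, 0.8]])              # two GT centers (toy 2-D)

random_q = rng.random((2, 2))                        # DETR-style random query start
anchored_q = gt + rng.normal(0.0, 0.02, gt.shape)    # query anchors placed near GT

for name, q in [("random", random_q), ("anchored", anchored_q)]:
    cost = np.abs(q[:, None] - gt[None]).sum(-1)     # L1 matching cost matrix
    print(name, cost.round(2).tolist())
```

With the anchored start, the diagonal costs are tiny compared to the off-diagonal ones, so Hungarian matching returns the same assignment iteration after iteration instead of flipping between queries.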

Effects
  • Queries start near GT, leading to faster convergence
  • Reduces the matching instability issues that occurred in the early stages
  • Improved performance and convergence speed due to each GT having a clearly corresponding query

Idea 3: Two-stage Structure

DINO extends the existing one-stage structure of DETR by applying a two-stage structure consisting of Encoder → Decoder.


How it Works
  1. Stage 1 (Encoder)
    • Extracts dense object candidates (anchors) through a CNN + Transformer encoder
    • Selects Top-K scoring anchors
  2. Stage 2 (Decoder)
    • Performs refined prediction based on the anchors selected from the Encoder
    • Refines the class and the precise box coordinates
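Stage 1's selection step can be sketched as follows; the random linear "objectness head" is a stand-in for the learned class/box heads attached to the encoder output:

```python
import numpy as np

def select_topk_proposals(enc_tokens, k, seed=0):
    """Stage 1 (sketch): score every encoder token for objectness and
    keep the top-K as coarse proposals. The linear scoring head is a
    random stand-in; in the real model it is a learned head.
    """
    rng = np.random.default_rng(seed)
    w = rng.normal(size=enc_tokens.shape[1])   # stand-in objectness head
    scores = enc_tokens @ w                    # one score per token
    topk = np.argsort(scores)[::-1][:k]        # indices of the K best tokens
    return topk, scores[topk]

tokens = np.random.default_rng(1).normal(size=(100, 32))  # 100 tokens, dim 32
idx, sc = select_topk_proposals(tokens, k=5)
print(idx.shape, bool((sc[:-1] >= sc[1:]).all()))  # (5,) True
```

The decoder (Stage 2) then only has to refine these K coarse candidates instead of localizing from scratch, which is where the coarse-to-fine precision gain comes from.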

Effects
  • Coarsely identifies locations in the first stage and precisely adjusts them in the second → improved precision
  • Increased detection stability in small objects or complex backgrounds

Idea 4: Look Forward Twice (LFT)

(Figure: Look Forward Twice)

In DETR variants with iterative box refinement (e.g., Deformable DETR), each decoder layer's box prediction is supervised only by its own loss: the gradient is detached between layers, so each layer "looks forward" only once. DINO instead lets the loss of the next layer flow back into the current layer's box update as well, looking forward twice for a deeper refinement signal.


How it Works
  1. First look
    • Decoder layer i refines the box passed from layer i-1 and is supervised by its own auxiliary loss
  2. Second look
    • The refined box is passed on to layer i+1 without detaching the gradient
    • That is, layer i+1's loss also updates layer i's box prediction
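A toy autograd sketch of the once-vs-twice difference (not DINO's actual decoder): two "layers" each add a learnable delta to a scalar box coordinate, with a per-layer loss toward GT = 1.0. Detaching between layers blocks the second layer's loss from reaching the first layer's parameter:

```python
import torch

def first_layer_grad(detach_between: bool) -> float:
    # Two box-refinement "layers", each adding a learnable delta.
    deltas = [torch.zeros(1, requires_grad=True) for _ in range(2)]
    box = torch.tensor([0.0])
    loss = torch.tensor([0.0])
    for d in deltas:
        box = box + d                      # this layer's refinement
        loss = loss + (box - 1.0) ** 2     # per-layer auxiliary loss (GT = 1.0)
        if detach_between:
            box = box.detach()             # "look forward once": block the gradient
    loss.backward()
    return deltas[0].grad.item()

print(first_layer_grad(True))   # -2.0: layer 0 learns only from its own loss
print(first_layer_grad(False))  # -4.0: layer 1's loss also updates layer 0
```

Without the detach, the first layer's gradient doubles because the next layer's loss also flows back into it; that extra signal is what "look forward twice" provides.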

Effects
  • Each layer's box prediction benefits from an extra round of supervision from the layer ahead
  • More accurate class and location prediction, even in complex scenes
  • Particularly strong for overlapping objects and small objects

Idea 5: Mixed Query Selection (MQS)

Most existing DETR variants use the same static queries for every image. Dynamic queries, as in Deformable DETR, adapt to the image, but making the content dynamic as well can introduce confusion from ambiguous encoder features. DINO introduces a Mixed Query Selection strategy that combines the advantages of both.


How it Works

(Figure: Mixed Query Selection)

  1. Select Top-K Important Encoder Features
    • Selects the features with high objectness scores from the encoder output
  2. Anchor (Location Information) is Dynamically Set
    • Sets the initial anchor box of each query based on the selected Top-K locations
  3. Content Remains Static
    • The content part of each query stays as a learned, fixed vector

In other words, it is a structure where "where to look" changes with each image, while "what to look for" stays as the model has learned.
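That split can be sketched directly; every name below is illustrative, and the stand-in scores replace the encoder's learned objectness predictions:

```python
import numpy as np

def mixed_query_selection(proposal_boxes, proposal_scores, content_embed, k):
    """Mixed Query Selection (sketch, illustrative names).

    Anchors are DYNAMIC: taken from this image's top-K encoder proposals.
    Content is STATIC: every query reuses the same learned embedding.
    """
    topk = np.argsort(proposal_scores)[::-1][:k]
    anchors = proposal_boxes[topk]               # changes image by image
    content = np.tile(content_embed, (k, 1))     # fixed, image-independent
    return anchors, content

rng = np.random.default_rng(0)
boxes = rng.random((100, 4))        # stand-in encoder proposals (cx, cy, w, h)
scores = rng.random(100)            # stand-in objectness scores
embed = rng.normal(size=(1, 256))   # stand-in learned content vector
anchors, content = mixed_query_selection(boxes, scores, embed, k=10)
print(anchors.shape, content.shape)  # (10, 4) (10, 256)
```

All content rows are identical (static), while the anchors change with every image's proposals (dynamic) — exactly the "where varies, what stays" compromise described above.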


Effects
  • Starts searching from more accurate locations (anchors) suited for each image
  • Prevents confusion caused by ambiguous encoder features by maintaining content information
  • Achieves fast convergence + high precision simultaneously

✅ Summary
  • Anchor (location): initialized from the positions of the top-K features extracted by the encoder
  • Content (meaning): kept as a static, learned vector
  • Expected effect: adapts to each image's object locations while keeping the search content stable

DINO Architecture

(Figure: DINO architecture)

Input Image
 → CNN Backbone (e.g., ResNet or Swin)
   → Transformer Encoder
     → Candidate Object Proposals (Two-stage)
       → Transformer Decoder
         → Predictions {Class, Bounding Box}₁~ₙ

Explanation of Main Architecture Stages

DINO maintains the simplicity of the original DETR while strengthening training speed, accuracy, and stability, making it one of the definitive DETR-family models.

1. ๏ธ Input Image
  • The input image is typically entered into the model in 3-channel RGB format.
2. CNN Backbone
  • e.g., ResNet-50, Swin Transformer, etc.
  • Role of extracting low-level feature maps from the image
3. Transformer Encoder
  • Receives features extracted from the CNN and learns global context information
  • Enables each position to relate to other parts of the entire image
4. Candidate Object Proposals (Two-stage)
  • Selects the Top-K locations with high objectness from the Encoder output
  • Configures the initial anchor of the query based on this (including Mixed Query Selection)
5. Transformer Decoder
  • Queries attend to the encoder features and are refined layer by layer, with Look Forward Twice supervision on the box updates
  • Denoising queries are processed alongside to stabilize training (including CDN)
6. Predictions
  • Finally predicts the object class and box location for each query → output: N {class, bounding box} pairs

Final Summary: DINO vs DETR

  • Training convergence speed: DETR slow → DINO ✅ fast (DeNoising)
  • Small-object detection: DETR low → DINO ✅ improved
  • Object query structure: DETR simple → DINO ✅ adds GT-based matching
  • Stage structure: DETR one-stage → DINO ✅ includes a two-stage structure

Summary

  • DINO maintains the structure of DETR while improving it to be fast and accurate enough for practical use.
  • A core model that forms the basis of various follow-up studies (Grounding DINO, DINO-DETR, DINOv2)
  • A highly extensible model that combines well with the latest vision research, such as open-vocabulary detection and Segment Anything!! :)

Personal Thoughts

DINO seems to be an excellent improvement study that solved DETR's training-efficiency and performance problems by skillfully combining prior research with its own new contributions! Since its core concepts carry over when extending to Grounding DINO or DINOv2, it is a model you must remember in order to understand DETR-family Transformer detection models!



This post is licensed under CC BY 4.0 by the author.