🦖 DINO: The Evolved DETR Object Detection Model (ICLR 2023)

🔍 A strong answer to the slow training and weak small-object detection of DETR-family models!

Paper: DINO: DETR with Improved DeNoising Anchor Boxes
Venue: ICLR 2023 (by IDEA Research)
Code: IDEA-Research/DINO


✅ What is DINO?

DINO is an object detection model that overcomes the limitations of the DETR family.
Its design focuses on faster training convergence and better small-object performance.

  • DINO = DETR with Improved DeNoising Anchor Boxes
  • The basic structure follows DETR, strengthened with several training and query-design strategies
  • Keeps DETR's end-to-end, one-stage pipeline while reaching the accuracy of two-stage detectors!

🚨 Why DINO appeared: the main limitations of DETR

  1. ❌ Training is too slow (hundreds of thousands of steps)
    • Early in training, DETR's object queries predict boxes at essentially random locations
    • This makes it hard to match queries to GT effectively, so the training signal is sparse
    • → Convergence is very slow, needing tens of times more epochs than conventional detectors (the original DETR uses a 500-epoch schedule!)
  2. ❌ Small-object detection is weak
    • DETR uses only the last feature map of the CNN backbone, so the resolution is low
      • (e.g., using ResNet's C5-level features → heavily downsampled)
    • On this coarse feature map, small objects are represented only faintly or disappear almost entirely
    • In addition, the Transformer concentrates on global attention, so local detail is weakened
    • → As a result, box predictions for small objects are inaccurate
  3. ❌ Object queries perform poorly early in training
    • DETR's object queries are randomly initialized
    • Early in training, no query has a fixed role in terms of which object it should predict
    • Hungarian Matching enforces a 1:1 assignment, but the assignment is not consistent from step to step
    • → Early in training, queries often duplicate each other or predict irrelevant locations, so performance is low
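
To make the resolution argument in point 2 concrete, the sketch below counts how many feature-map cells an object spans at different backbone strides. The helper `cells_covered` and the 32x32-pixel example are illustrative assumptions, not something from the paper; stride 32 corresponds to ResNet's C5 level, and strides 8 and 16 stand in for the finer levels that multi-scale variants use.

```python
def cells_covered(obj_w: int, obj_h: int, stride: int) -> float:
    """Approximate number of feature-map cells spanned by an object.

    A feature map with the given stride has one cell per
    stride x stride pixel patch, so an obj_w x obj_h object spans
    roughly (obj_w / stride) * (obj_h / stride) cells.
    """
    return (obj_w / stride) * (obj_h / stride)


# COCO treats objects below 32x32 pixels as "small"; use that upper bound.
for stride in (8, 16, 32):
    cells = cells_covered(32, 32, stride)
    print(f"stride {stride:>2}: {cells:.2f} cells")
```

At stride 32 even the largest "small" object collapses into about one cell, which is why a single C5 map leaves so little evidence for the decoder to work with.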

💡 Core ideas of DINO

| Component | Description |
| --- | --- |
| 🔧 DeNoising Training (+CDN) | During training, noised boxes are deliberately generated around each GT so that queries converge quickly. DINO extends this contrastively (CDN) so the model also learns to separate correct boxes from incorrect ones |
| 🧲 Matching Queries | Each query carries an explicit anchor box, initialized from image-dependent proposals, which stabilizes training |
| 🧠 Two-stage structure | The encoder proposes coarse object candidates and the decoder refines them |
| Look Forward Twice | Each decoder layer's box refinement receives gradients from the next layer's loss as well as its own, improving box accuracy |

💡 Solution 1: DeNoising Training (+ CDN)

To help object queries pick up information around the ground truth (GT) early in training,
DINO trains on “samples with deliberately injected noise”.

🔧 How it works
  1. Copy the ground truth
    • GT boxes and labels are copied; the untouched originals serve as targets for the denoising queries.
  2. Add noise on purpose
    • Positional noise (coordinate jittering) and class noise (a wrong label) are added to the copied boxes.
    • Examples:
      • Shift the box coordinates slightly (e.g., 5~10% jitter)
      • Replace the class label with another one (e.g., person → dog)
  3. Train the queries separately
    • The noised copies enter the decoder as denoising queries, alongside the regular matching queries.
    • These queries skip Hungarian matching: each one is trained to reconstruct the original GT box and label it was derived from, not the noised version.
  4. Use them in the loss
    • In addition to the matching loss against GT, a reconstruction loss on the denoising queries is added.
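
The four steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the official implementation: the function names, the (cx, cy, w, h) box format, the 10% jitter scale, and the 20% label-flip probability are all hypothetical choices made for the example.

```python
import random


def jitter_box(box, scale, rng):
    """Shift the centre and rescale the size of a (cx, cy, w, h) box.

    Each quantity changes by at most `scale` relative to the box size,
    so scale=0.1 corresponds to roughly a 10% jitter.
    """
    cx, cy, w, h = box
    cx += rng.uniform(-scale, scale) * w
    cy += rng.uniform(-scale, scale) * h
    w *= 1.0 + rng.uniform(-scale, scale)
    h *= 1.0 + rng.uniform(-scale, scale)
    return (cx, cy, w, h)


def flip_label(label, num_classes, flip_prob, rng):
    """With probability flip_prob, replace the label by a different class."""
    if rng.random() < flip_prob:
        others = [c for c in range(num_classes) if c != label]
        return rng.choice(others)
    return label


def make_denoising_samples(gt_boxes, gt_labels, num_classes,
                           box_scale=0.1, flip_prob=0.2, seed=0):
    """Return (noised_inputs, targets) for one denoising group.

    The noised boxes/labels are what the decoder receives as queries;
    the targets are the untouched ground truth it must reconstruct.
    """
    rng = random.Random(seed)
    noised = [
        (jitter_box(box, box_scale, rng),
         flip_label(label, num_classes, flip_prob, rng))
        for box, label in zip(gt_boxes, gt_labels)
    ]
    targets = list(zip(gt_boxes, gt_labels))
    return noised, targets


gt_boxes = [(0.50, 0.50, 0.20, 0.40), (0.25, 0.70, 0.10, 0.10)]
gt_labels = [0, 17]
noised, targets = make_denoising_samples(gt_boxes, gt_labels, num_classes=80)
for (n_box, n_label), (t_box, t_label) in zip(noised, targets):
    print("input", [round(v, 3) for v in n_box], n_label,
          "-> target", t_box, t_label)
```

Because each denoising query already knows which GT it came from, its loss can be computed directly against that GT without any bipartite matching.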

🧠 📌 The CDN (Contrastive DeNoising) extension

DINO extends this denoising strategy into Contrastive DeNoising (CDN), which builds positive and negative queries at the same time.

  • Positive query:
    • A noised box generated from a GT (position/class changed only slightly, so still close to the real object)
  • Negative query:
    • A box generated from the same GT with larger noise: near enough to be confusing, but a “wrong prediction” candidate
  • Both kinds of query are fed to the decoder during training, so that
    • the model learns not only to recover the correct answer,
    • but also “to tell apart wrong answers that resemble the correct one”.

💡 In other words, CDN goes beyond faster convergence:
it is a contrastive training component that strengthens the model's representations and its discriminative ability.


⚙️ Components
| Element | Description |
| --- | --- |
| 🎯 Positive query | Denoising sample made by adding small noise to a GT box |
| ❌ Negative query | Sample made by adding larger noise to a GT box, so that it no longer counts as that object |
| 🧲 Matching Head | Both kinds pass through the same decoder and prediction heads as the regular queries |
| 🧪 Loss | Positives are pushed to predict the GT exactly; negatives are pushed to predict “no object” |

💡 How it works
  1. Copy a GT box → Positive Query
    • Add a little noise so the query starts near the GT
  2. Add larger noise to the same GT box → Negative Query
    • Deliberately confusing: close to the object, but too far off to count as it
  3. Feed both queries to the same decoder
  4. In the loss, positives are trained to align with the ground truth, and negatives to be classified as no-object
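
A small sketch of how one positive/negative pair might be built. In the paper, positives use a noise scale below a threshold λ1 and negatives a scale between λ1 and λ2. Everything else here is an assumption made for illustration: the names `shift_centre`, `make_cdn_pair`, and `NO_OBJECT`, the values 0.4 and 0.8, and the simplification of perturbing only the box centre.

```python
import random

NO_OBJECT = -1  # hypothetical id standing in for the "no object" class


def shift_centre(box, lo, hi, rng):
    """Move the centre of a (cx, cy, w, h) box.

    The shift magnitude along each axis is a fraction of the box size
    drawn from [lo, hi], with a random sign.
    """
    cx, cy, w, h = box
    dx = rng.choice((-1, 1)) * rng.uniform(lo, hi) * w
    dy = rng.choice((-1, 1)) * rng.uniform(lo, hi) * h
    return (cx + dx, cy + dy, w, h)


def make_cdn_pair(gt_box, gt_label, lam1=0.4, lam2=0.8, seed=0):
    """Build one positive and one negative denoising query for a GT box."""
    rng = random.Random(seed)
    positive = {
        "box": shift_centre(gt_box, 0.0, lam1, rng),   # small noise
        "target_box": gt_box,                          # must recover the GT
        "target_label": gt_label,
    }
    negative = {
        "box": shift_centre(gt_box, lam1, lam2, rng),  # larger noise
        "target_box": None,                            # no box to regress
        "target_label": NO_OBJECT,                     # must say "no object"
    }
    return positive, negative


pos, neg = make_cdn_pair((0.5, 0.5, 0.2, 0.4), gt_label=3)
print("positive:", pos)
print("negative:", neg)
```

The negative sits in a ring just outside the positive's noise range, which is what makes it a hard example rather than a random background box.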

🧠 Contrastive effect
  • The model learns to judge clearly between “this is a real object!” and “this is confusing, but it is fake!”
  • Helps reduce false detections, especially with look-alike backgrounds, small objects, and overlapping objects

✅ Summary
| Item | Description |
| --- | --- |
| Purpose of CDN | Strengthen the ability to tell apart wrong answers that resemble the correct one |
| Positive samples | Queries made by adding small noise around a GT |
| Negative samples | Queries whose box/class has been perturbed too far to count as the GT |
| Training effect | Fewer false positives, faster early convergence, more robust detection |

📌 CDN extends DeNoising Training into a contrastive form,
and it is one of the key techniques that let DINO converge faster and more accurately than the original DETR.

📈 Shown as a table:

| Query Type | Input | Goal |
| --- | --- | --- |
| Matching Query | Learned query with an anchor box (no GT information) | Accurate object prediction |
| Denoising Query | GT + noise (jittered box) | Learn predictions that are robust to noise |

🎯 Effects
  • Guides queries to learn in the neighbourhood of the GT
  • Improves the ability to handle “predictions that are near the answer but not exact”
  • Queries that made meaningless predictions early on quickly converge to locations related to the answer
  • Faster overall training + more stable performance
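
Keeping denoising queries separate from matching queries is usually done with a self-attention mask in the decoder, as in DN-DETR: matching queries must not see denoising queries (which carry GT information), and denoising groups must not see each other. The sketch below builds such a mask with plain lists; the function name and the query layout are assumptions for illustration.

```python
def build_dn_attention_mask(num_groups, group_size, num_matching):
    """Return mask[i][j] == True where query i must NOT attend to query j.

    Assumed query layout: [group 0 | group 1 | ... | matching queries].
    """
    num_dn = num_groups * group_size
    total = num_dn + num_matching
    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            i_is_dn = i < num_dn
            j_is_dn = j < num_dn
            if not i_is_dn and j_is_dn:
                # Matching queries never see denoising queries,
                # otherwise GT information would leak into them.
                mask[i][j] = True
            elif i_is_dn and j_is_dn and i // group_size != j // group_size:
                # Denoising groups are isolated from one another.
                mask[i][j] = True
    return mask


mask = build_dn_attention_mask(num_groups=2, group_size=2, num_matching=3)
for row in mask:
    print("".join("x" if blocked else "." for blocked in row))
```

Denoising queries are still allowed to attend to matching queries, since those contain no GT information.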

💡 Solution 2: Matching Queries (anchor-based)

Unlike DETR, DINO does not let object queries search for positions from scratch.
Each query starts from an explicit anchor box. GT boxes are unavailable at inference, so the anchors come from the image itself; the paper calls this mixed query selection.

🧲 How it works
  1. Generate anchors
    • The highest-scoring encoder features each propose a box, and these boxes become the initial query anchors
  2. Attach a query to each anchor
    • The anchor supplies the positional part of the query; the content part stays a learned embedding
  3. Stabilize matching
    • With anchors already near likely objects, Hungarian Matching can pair queries and GTs 1:1 more consistently

🎯 Effects
  • Queries start near likely object locations, so they converge quickly
  • Reduces the matching instability seen early in training
  • Each GT tends to keep the same responsible query, which improves both accuracy and convergence speed
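
To show what the 1:1 matching step computes, here is a brute-force sketch that assigns each GT box to a distinct prediction with minimal total cost. It is a toy under stated assumptions: only an L1 box cost is used (DETR-style matchers also include class and GIoU terms), and permutations replace the Hungarian algorithm, which is only practical for a handful of boxes.

```python
from itertools import permutations


def l1_cost(pred, gt):
    """Sum of absolute coordinate differences between two boxes."""
    return sum(abs(p - g) for p, g in zip(pred, gt))


def match_one_to_one(pred_boxes, gt_boxes):
    """Return (assignment, total_cost).

    assignment[g] is the index of the prediction matched to GT g.
    Every ordered choice of distinct predictions is tried.
    """
    best_cost, best_assign = float("inf"), None
    for perm in permutations(range(len(pred_boxes)), len(gt_boxes)):
        cost = sum(l1_cost(pred_boxes[p], gt_boxes[g])
                   for g, p in enumerate(perm))
        if cost < best_cost:
            best_cost, best_assign = cost, perm
    return list(best_assign), best_cost


preds = [
    (0.90, 0.90, 0.10, 0.10),  # far from every GT
    (0.52, 0.48, 0.20, 0.40),  # near GT 0
    (0.20, 0.70, 0.10, 0.10),  # near GT 1
]
gts = [(0.50, 0.50, 0.20, 0.40), (0.25, 0.70, 0.10, 0.10)]
assignment, cost = match_one_to_one(preds, gts)
print(assignment, round(cost, 3))
```

When predictions already sit near the objects, as here, the cheapest assignment is unambiguous; when they are scattered randomly, small changes in the predictions can flip the assignment from one step to the next.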

💡 Solution 3: Two-stage structure

DINO extends DETR's one-stage design
into a two-step Encoder → Decoder scheme.

🧠 How it works
  1. Stage 1 (Encoder)
    • The CNN + Transformer encoder produces dense object candidates (anchors)
    • The Top-K scoring anchors are selected
  2. Stage 2 (Decoder)
    • Refined predictions are made starting from the anchors selected in the encoder
    • Classes and precise boxes are adjusted

🎯 Effects
  • The first stage locates objects coarsely,
  • and the second stage adjusts them precisely → higher precision
  • More stable detection of small objects and in cluttered backgrounds
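
The Top-K selection in stage 1 amounts to ranking the encoder's candidates by score and keeping the best ones. The sketch below shows only that step; the `(score, box)` tuple format, the function name, and the example values are assumptions, and a real model would produce thousands of candidates from encoder features.

```python
def select_top_k(proposals, k):
    """Keep the k proposals with the highest objectness score.

    proposals: list of (score, box) tuples, box = (cx, cy, w, h).
    """
    ranked = sorted(proposals, key=lambda p: p[0], reverse=True)
    return ranked[:k]


proposals = [
    (0.10, (0.10, 0.10, 0.20, 0.20)),
    (0.92, (0.50, 0.50, 0.20, 0.40)),
    (0.35, (0.80, 0.30, 0.10, 0.10)),
    (0.80, (0.25, 0.70, 0.10, 0.10)),
]
# The selected boxes become the starting anchors for the decoder queries.
anchors = [box for _, box in select_top_k(proposals, k=2)]
print(anchors)
```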

💡 Solution 4: Look Forward Twice

In earlier DETR variants with iterative box refinement (e.g., Deformable DETR), each decoder layer refines the box from the previous layer, but the gradient is cut between layers, so a layer is trained only by its own box loss.
DINO lets each layer's refinement also be trained by the loss of the next layer (Look Forward Twice). The name refers to gradient flow, not to running attention twice.

🔁 How it works
  1. Layer i refines the box
    • It predicts an offset and applies it to the (detached) box coming from layer i-1
  2. Layer i+1 predicts from the undetached box
    • Its prediction is built on layer i's refined box with the gradient path kept
    • So the loss at layer i+1 updates the offsets of both layer i+1 and layer i

🎯 Effects
  • Earlier layers are optimized with the later, more refined boxes in view
  • More accurate box predictions, even in complex scenes
  • Particularly helpful for overlapping objects and small objects
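
In the DINO paper, Look Forward Twice is defined in terms of gradient flow through box refinement: the loss at one decoder layer also updates the offset predicted by the previous layer. The toy below illustrates that with a tiny hand-written scalar autograd class. All names are hypothetical, boxes are reduced to one coordinate, and the update is a plain addition, whereas the real model refines boxes in inverse-sigmoid space.

```python
class Var:
    """Minimal scalar autograd node supporting +, -, * and detach()."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # tuples of (parent Var, local gradient)
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __sub__(self, other):
        return Var(self.value - other.value, ((self, 1.0), (other, -1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

    def detach(self):
        """Same value, but with the gradient path removed."""
        return Var(self.value)

    def backward(self, upstream=1.0):
        # Accumulate the gradient arriving along this path, then pass it on.
        self.grad += upstream
        for parent, local in self.parents:
            parent.backward(upstream * local)


def layer2_grad_wrt_layer1_offset(look_forward_twice: bool) -> float:
    """Gradient of the layer-2 box loss w.r.t. the layer-1 offset."""
    b0 = Var(0.2)  # initial anchor coordinate
    d1 = Var(0.1)  # offset predicted by decoder layer 1
    d2 = Var(0.1)  # offset predicted by decoder layer 2
    gt = Var(0.5)  # ground-truth coordinate

    b1_live = b0 + d1          # layer-1 output, gradient path intact
    b1_cut = b1_live.detach()  # layer-1 output with the gradient cut

    base = b1_live if look_forward_twice else b1_cut
    pred2 = base + d2          # layer-2 prediction
    diff = pred2 - gt
    loss2 = diff * diff        # squared error at layer 2
    loss2.backward()
    return d1.grad


print("look forward once :", layer2_grad_wrt_layer1_offset(False))
print("look forward twice:", layer2_grad_wrt_layer1_offset(True))
```

With the gradient cut, the layer-2 loss leaves the layer-1 offset untouched (gradient 0); with the path kept, the same loss also pulls the layer-1 offset toward the ground truth.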

🧱 DINO architecture summary

Input Image
 → CNN Backbone (e.g., ResNet or Swin)
   → Transformer Encoder
     → Candidate Object Proposals (Two-stage)
       → Transformer Decoder
         → Predictions {Class, Bounding Box}₁~ₙ

✅ Summary

| Item | Description |
| --- | --- |
| Goal | Speed up the early convergence of object queries |
| Method | Add noise to GT boxes and train queries to recover the originals |
| Effect | More stable training; predictions that stay sensitive to small objects |
| Contribution to final performance | Faster training + higher AP |

DeNoising Training is the key technique that makes DINO a far more practical and faster-training object detector than DETR.


📊 Performance comparison (COCO)

| Model | AP (val) | FPS | Backbone |
| --- | --- | --- | --- |
| DETR | 42.0 | 10 | ResNet-50 |
| DAB-DETR | 44.9 | 11 | ResNet-50 |
| DINO | 49.0+ | 12 | ResNet-50 |
| DINO | ~54.0 | n/a | Swin-L |

🧠 DINO vs DETR

| Item | DETR | DINO (Improved) |
| --- | --- | --- |
| Training convergence speed | Slow | ✅ Fast (DeNoising) |
| Small-object detection | Low | ✅ Improved |
| Object Query design | Simple | ✅ Anchor-based queries plus denoising queries |
| Stage structure | One-stage | ✅ Includes a two-stage structure |

📌 Summary

  • DINO keeps DETR's structure while making it fast and accurate enough for practical use
  • A core model underlying follow-up work such as Grounding DINO and Mask DINO (DINOv2, despite the name, is an unrelated self-supervised model from Meta)
  • 🔥 Combines well with recent vision research such as open-vocabulary detection, grounding, and Segment Anything

💬 Personal notes

DINO is an excellent improvement that addresses DETR's training efficiency and accuracy problems.
Its practical value is very high, particularly for small objects, fast training convergence, and compatibility with ViT-style backbones!
Extensions such as Grounding DINO share the same core concepts,
so it is a must-know model for understanding DETR-family Transformer detectors!

This post is licensed under CC BY 4.0 by the author.