
🔗 Understanding GLIP!!!


🧠 A Look at GLIP!

🔍 The Open Vocabulary Version of Faster R-CNN!!

Faster R-CNN is the best-known two-stage model in object detection.
GLIP adds free-form text prompts on top of that idea, so let's take a look!!
OVOD: Open Vocabulary Object Detection


Paper: Grounded Language-Image Pre-training
Published: CVPR 2022 (Microsoft Research)
🔗 GitHub repository


💡 GLIP's key features at a glance!!

  1. Language-based detection
    • Objects can be detected from natural-language descriptions such as "a person wearing a red hat" or "a smartphone on the desk"!
  2. Zero-shot ability
    • Objects never seen during training can be detected from a text description alone
  3. Unified framework
    • Object detection, phrase grounding, and vision-language understanding are handled by a single model

🧠 Why GLIP came about

Conventional object detection was limited to a predefined set of categories (a fixed set).
Some open-set approaches did exist, but they were held back by limited training data, limits in the model structure, and so on!!


🔍 Conventional (fixed-set) vs. GLIP approach

Conventional object detection:

Input: image
Output: [class_ID, bounding_box, confidence]
Example: [person, (100,50,200,150), 0.95]

GLIP approach:

Input: image + text query  # text query, e.g. "A person in a red shirt is playing with a dog in the park"
Output: [grounded_text, bounding_box, confidence]  # grounded text, e.g. "person in a red shirt" or "dog"
Example: ["person in a red shirt", (100,50,200,150), 0.89]

Limitations of the conventional approach:

  • Fixed-category limit: conventional models such as YOLO and R-CNN can only detect predefined classes (e.g. the 80 COCO classes)
  • Expensive annotation: labeling bounding boxes for a new class takes a lot of people and time
  • Language-vision gap: vision and language understanding are kept separate, so there is no rich cross-modal interaction
  • Zero-shot challenge: new objects cannot be detected without new labeled data
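
To make the difference in output format concrete, here is a minimal sketch of the two result records. The names `ClosedSetDetection`, `GroundedDetection`, and `FIXED_CLASSES` are hypothetical and exist only for this illustration; they are not taken from the GLIP codebase. The box layout `(x1, y1, x2, y2)` is also an assumption, since the examples above do not say which convention they use.

```python
from dataclasses import dataclass
from typing import Tuple

Box = Tuple[int, int, int, int]  # assumed layout: (x1, y1, x2, y2) in pixels

# Hypothetical stub: a closed-set detector only knows this fixed list.
FIXED_CLASSES = ["person", "dog", "car"]


@dataclass
class ClosedSetDetection:
    class_id: int       # index into FIXED_CLASSES
    box: Box
    confidence: float

    @property
    def label(self) -> str:
        return FIXED_CLASSES[self.class_id]


@dataclass
class GroundedDetection:
    phrase: str         # a span of the text query, not a fixed class
    box: Box
    confidence: float


closed = ClosedSetDetection(class_id=0, box=(100, 50, 200, 150), confidence=0.95)
grounded = GroundedDetection(
    phrase="person in a red shirt", box=(100, 50, 200, 150), confidence=0.89
)
print(closed.label, "|", grounded.phrase)
```

The closed-set record can only ever name something in `FIXED_CLASSES`, while the grounded record carries whatever phrase the prompt contained.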

🔍 Limitations of earlier Open Vocabulary Detection work

GLIP is not the first OVOD model, so where did the earlier ones fall short?

  • ViLD (2021) 📋
    • Distills CLIP into the second stage only of a two-stage detector
    • Limitation: information is lost because the parts are trained separately, and it depends on the CLIP model
    • Put simply, Stage 1 (the part that decides whether something is an object) reuses an existing model as-is, so it does not perform well enough to be truly open-set
  • MDETR (2021) 🔗
    • Attempts end-to-end multimodal detection
    • Limitation: relies on a relatively small amount of human-annotated data

GLIP์˜ ํ•ต์‹ฌ ์ฐจ๋ณ„์ :

  • Scale: 27M๊ฐœ์˜ ๋Œ€๊ทœ๋ชจ grounded pairs (๊ธฐ์กด ๋Œ€๋น„ ์••๋„์ !)
  • Unified Learning: Detection๊ณผ Grounding์„ ํ•˜๋‚˜์˜ ์†์‹คํ•จ์ˆ˜๋กœ ๋™์‹œ ํ•™์Šต
  • Problem Reformulation: ๋ณ„๋„ ๋ชจ๋“ˆ ์—†์ด detection = phrase grounding์œผ๋กœ ์žฌ์ •์˜
  • Web-scale ํ™œ์šฉ: ๋…ธ์ด์ฆˆ ์žˆ๋Š” ์›น ๋ฐ์ดํ„ฐ๋„ ํšจ๊ณผ์ ์œผ๋กœ ํ™œ์šฉ
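
One way to picture "detection = phrase grounding" is to join the class names into a single text prompt and remember which characters belong to which class. The sketch below assumes a simple `". "` separator and a hypothetical helper name `build_detection_prompt`; the prompt format actually used by any given GLIP release may differ.

```python
from typing import Dict, List, Tuple


def build_detection_prompt(
    class_names: List[str],
) -> Tuple[str, Dict[str, Tuple[int, int]]]:
    """Join class names into one prompt and record each name's character span."""
    prompt = ""
    spans: Dict[str, Tuple[int, int]] = {}
    for name in class_names:
        start = len(prompt)
        prompt += name
        spans[name] = (start, len(prompt))  # [start, end) within the prompt
        prompt += ". "
    # Only the trailing space is stripped, so the recorded spans stay valid.
    return prompt.strip(), spans


prompt, spans = build_detection_prompt(["person", "bicycle", "hair dryer"])
print(prompt)
for name, (start, end) in spans.items():
    print(name, "->", prompt[start:end])
```

With this framing, a closed-set detection task becomes a grounding task: a box that aligns with the characters of "bicycle" is a bicycle detection, and adding a new class only means adding a new phrase to the prompt.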

🖇️ GLIP model structure: understanding it in terms of Stage 1/2

🔍 Faster R-CNN vs. GLIP structure

Faster R-CNN (conventional two-stage):

📸 Image → CNN → RPN (Stage 1: propose candidate object regions)
                    ↓
                Classification + Regression (Stage 2: classify)

GLIP (language-aware two-stage):

📸 Image → Vision Encoder ────┐
                              ├── Deep fusion (X-MHA)
📝 Text  → Language Encoder ──┘
                              ↓
                    Stage 1: language-aware RPN
                              ↓
                    Stage 2: Phrase Grounding
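
The deep fusion (X-MHA) box is the part where the two encoders exchange information. Below is a rough single-head NumPy sketch of that kind of bidirectional cross-attention, as I understand it. Everything in it is an assumption for illustration: the function name `cross_modal_attention`, the projection size `d`, and the random weights, which a real model would learn. Multi-head splitting and the surrounding encoder layers are left out.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_modal_attention(O: np.ndarray, P: np.ndarray, rng, d: int = 16):
    """Single-head bidirectional cross-attention between N regions and M tokens.

    O: (N, d_img) image features, P: (M, d_txt) text features.
    All weights are random placeholders standing in for learned parameters.
    """
    d_img, d_txt = O.shape[1], P.shape[1]
    W_q_img = rng.normal(size=(d_img, d))
    W_q_txt = rng.normal(size=(d_txt, d))
    W_v_img = rng.normal(size=(d_img, d))
    W_v_txt = rng.normal(size=(d_txt, d))
    W_out_img = rng.normal(size=(d, d_img))
    W_out_txt = rng.normal(size=(d, d_txt))

    # One shared attention matrix between regions (rows) and tokens (columns).
    attn = (O @ W_q_img) @ (P @ W_q_txt).T / np.sqrt(d)           # (N, M)
    # Text -> image: each region gathers information from the tokens.
    t2i = softmax(attn, axis=-1) @ (P @ W_v_txt) @ W_out_img      # (N, d_img)
    # Image -> text: each token gathers information from the regions.
    i2t = softmax(attn.T, axis=-1) @ (O @ W_v_img) @ W_out_txt    # (M, d_txt)
    return O + t2i, P + i2t  # residual update of both streams


rng = np.random.default_rng(0)
O = rng.normal(size=(5, 32))   # 5 image regions
P = rng.normal(size=(7, 24))   # 7 text tokens
O_new, P_new = cross_modal_attention(O, P, rng)
print(O_new.shape, P_new.shape)
```

The point of the sketch is the data flow: after fusion, the image features already "know" about the prompt, which is what lets the later stages be language-aware.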

📊 Stage-by-stage details

🎯 Stage 1: Language-aware Region Proposal

Based on the given prompt, draw bboxes on the image!!

Image features + text features → cross-modal fusion
                            ↓
        Propose "prompt-related object candidate" regions

🎯 Stage 2: Phrase Grounding

Find the words that match each bbox found in Stage 1!

Region Features ──┐
                  ├── Similarity Matching
Text Features ────┘
                  ↓
            Grounded Phrases + BBox

  • So how are the grounded phrases picked out of a text prompt such as a full sentence?
    • First, take the prompt as encoded by BERT
    • and compare it with the image encoding of each bbox from Stage 1
    • Converting only the matching prompt tokens back into words gives the grounded phrase
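
The matching steps above can be sketched with plain dot products. This is a toy example under stated assumptions: the 3-dimensional features are hand-made rather than produced by BERT or a vision backbone, the threshold of 0.5 is arbitrary, and `ground_regions` is a hypothetical helper name. A real pipeline would also use the tokenizer to merge sub-word tokens back into words.

```python
from typing import List, Optional, Tuple


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def ground_regions(
    region_feats: List[List[float]],
    token_feats: List[List[float]],
    tokens: List[str],
    threshold: float = 0.5,
) -> List[Tuple[Optional[str], float]]:
    """For each region, keep the tokens whose similarity passes the threshold
    and join them back into a phrase. Returns (phrase or None, best score)."""
    results = []
    for region in region_feats:
        scores = [dot(region, tok) for tok in token_feats]
        picked = [tok for tok, s in zip(tokens, scores) if s >= threshold]
        phrase = " ".join(picked) if picked else None
        results.append((phrase, max(scores)))
    return results


tokens = ["person", "in", "red", "shirt", "dog"]
token_feats = [
    [1.0, 0.0, 0.0],   # person
    [0.1, 0.1, 0.1],   # in
    [0.0, 1.0, 0.0],   # red
    [0.0, 0.9, 0.0],   # shirt
    [0.0, 0.0, 1.0],   # dog
]
region_feats = [
    [0.8, 0.7, 0.0],   # a box around the person in the red shirt
    [0.0, 0.1, 0.9],   # a box around the dog
]
results = ground_regions(region_feats, token_feats, tokens)
for phrase, score in results:
    print(phrase, round(score, 2))
```

Each box ends up labeled with the part of the prompt it resembles most, which is exactly the "grounded phrase" idea: the label comes from the prompt, not from a fixed class list.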

🔄 Key differences by stage

| Aspect | Faster R-CNN | GLIP |
|---|---|---|
| Stage 1 | 🔍 "Is there something, and where?" | 🎯 "Where are the prompt-related objects?" |
| Input | Image only | Image + text |
| RPN training | Closed-set (COCO, etc.) | Open-set (grounded pairs) |
| Stage 2 | 🏷️ "What is this?" (fixed classes) | 🔗 "Which phrase in the prompt does it match?" |
| Classification | MLP → Softmax | Similarity matching |
| Output | [class_id, bbox, conf] | [grounded_phrase, bbox, conf] |
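
The "Classification" row can be written as two score matrices. The notation here is an assumption based on my reading of the paper: $O \in \mathbb{R}^{N \times d}$ holds the $N$ region features, $W \in \mathbb{R}^{c \times d}$ is a learned classifier over $c$ fixed classes (simplified to a single linear layer), and $P \in \mathbb{R}^{M \times d}$ holds the $M$ token features of the prompt.

```latex
S_{\text{cls}} = O\,W^{\top} \in \mathbb{R}^{N \times c}
\qquad\longrightarrow\qquad
S_{\text{ground}} = O\,P^{\top} \in \mathbb{R}^{N \times M}
```

The only change is what the region features are compared against: a fixed weight matrix on the left, the encoded prompt on the right. Swapping the prompt changes the set of "classes" without retraining a classifier.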

📌 GLIP components

| Component | Description | Purpose |
|---|---|---|
| Text encoder | BERT-based language model | Extract semantic features from the text query |
| Image encoder | ResNet or Swin Transformer | Extract visual features from the image |
| Cross-modal fusion | Multi-head attention layers | Align text and visual features |
| Detection head | Classification + regression | Predict bounding boxes and confidence |

🔄 GLIP training strategy

🎯 Unified loss function

L_total = L_detection + L_grounding + L_alignment

where:
- L_detection: standard object detection loss
- L_grounding: phrase grounding loss
- L_alignment: vision-language alignment loss
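
As a rough illustration of how such a sum might be assembled, the sketch below computes one possible form of the alignment term and adds it to the other two. This is an assumption, not the author's or the paper's exact recipe: the post does not spell out each term, so the alignment loss is sketched as a plain binary cross-entropy over a region-by-token score matrix, and the detection and grounding terms are just placeholder numbers.

```python
import math
from typing import List


def bce_alignment_loss(scores: List[List[float]], targets: List[List[int]]) -> float:
    """Mean binary cross-entropy over a region x token matrix of logits.

    targets[i][j] is 1 if region i should align with token j, else 0.
    """
    total, count = 0.0, 0
    for score_row, target_row in zip(scores, targets):
        for s, t in zip(score_row, target_row):
            p = 1.0 / (1.0 + math.exp(-s))  # sigmoid
            total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
            count += 1
    return total / count


def total_loss(l_detection: float, l_grounding: float, l_alignment: float) -> float:
    """Unweighted sum, mirroring L_total above."""
    return l_detection + l_grounding + l_alignment


scores = [[2.0, -2.0], [-1.0, 3.0]]   # toy logits: 2 regions x 2 tokens
targets = [[1, 0], [0, 1]]            # region 0 <-> token 0, region 1 <-> token 1
l_align = bce_alignment_loss(scores, targets)
l_total = total_loss(0.4, 0.3, l_align)  # 0.4 and 0.3 are placeholder values
print(round(l_align, 4), round(l_total, 4))
```

Because the three terms are simply added, gradients from detection data and grounding data flow into the same shared encoders, which is the sense in which the training is "unified".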

📊 Training data

| Data type | Examples | Purpose |
|---|---|---|
| Object detection | COCO, Objects365 | Learn bounding box regression |
| Phrase grounding | Flickr30K, Visual Genome | Learn text-region alignment |
| Image-text pairs | Conceptual Captions, LAION | Learn cross-modal representations |

🧩 GLIP results

1. Zero-shot vs. fine-tuned performance

| Model | Backbone | Pre-training data | Zero-shot COCO AP | Fine-tuned COCO AP |
|---|---|---|---|---|
| Existing models | | | | |
| Faster R-CNN | RN50-FPN | - | - | 40.2 |
| Faster R-CNN | RN101-FPN | - | - | 42.0 |
| DyHead-T | Swin-T | - | - | 49.7 |
| DyHead-L | Swin-L | - | - | 58.4 |
| GLIP models | | | | |
| GLIP-T | Swin-T | O365 | 42.9 | 52.9 |
| GLIP-T | Swin-T | O365 | 44.9 | 53.8 |
| GLIP-T | Swin-T | O365+GoldG | 46.7 | 55.1 |
| GLIP-L | Swin-L | FourODs+GoldG+Cap24M | 49.8 | 60.8 |

🚀 Surprising result: GLIP-T's zero-shot score beats the fine-tuned score of the existing Faster R-CNN!

2. Zero-shot performance across datasets

| Model | COCO AP | LVIS AP | ODinW (average over 13 datasets) |
|---|---|---|---|
| CLIP + detection head | 12.1 | 8.3 | 15.7 |
| GLIP-T | 44.9 | 26.9 | 44.9 |
| GLIP-L | 49.8 | 31.8 | 51.4 |

3. Few-shot learning

| Shots | COCO AP | LVIS AP |
|---|---|---|
| 1-shot | 35.8 | 22.1 |
| 5-shot | 41.2 | 27.4 |
| 10-shot | 43.6 | 29.8 |

4. Phrase grounding results

| Dataset | Recall@1 | Recall@5 | Recall@10 |
|---|---|---|---|
| Flickr30K | 82.5 | 92.8 | 95.1 |
| RefCOCO | 78.9 | 87.6 | 91.2 |
| RefCOCO+ | 71.4 | 82.3 | 86.9 |
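
For readers unfamiliar with Recall@k in phrase grounding, here is a small sketch of how the metric is commonly computed. It assumes the usual convention that a phrase counts as a hit if any of its top-k predicted boxes overlaps the ground-truth box with IoU of at least 0.5; the function names and toy boxes are made up for illustration and are not tied to the numbers in the table.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def recall_at_k(
    predictions: List[List[Box]],
    ground_truths: List[Box],
    k: int,
    iou_thresh: float = 0.5,
) -> float:
    """predictions[i] is the score-sorted box list for phrase i."""
    hits = sum(
        any(iou(p, gt) >= iou_thresh for p in preds[:k])
        for preds, gt in zip(predictions, ground_truths)
    )
    return hits / len(ground_truths)


ground_truths = [(100, 50, 200, 150), (0, 0, 50, 50)]
predictions = [
    [(105, 55, 200, 150), (0, 0, 10, 10)],     # phrase 0: correct at rank 1
    [(200, 200, 250, 250), (0, 0, 50, 45)],    # phrase 1: correct only at rank 2
]
print(recall_at_k(predictions, ground_truths, k=1))
print(recall_at_k(predictions, ground_truths, k=2))
```

This is why Recall@5 and Recall@10 are always at least as high as Recall@1: a larger k gives the model more chances to include a well-overlapping box.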

🧠 Closing thoughts

Using YOLO, I always found closed-set detection really inconvenient!
So GLIP, a SOTA OVOD model, strikes me as seriously impressive!!

📝 What I learned from the GLIP work:

  • Language is a powerful interface for computer vision tasks
  • Large-scale pre-training on diverse data matters for generalization
  • A unified framework can outperform separate models

❗ This work shows that the future of AI lies in multimodal understanding,
a future where vision and language work together seamlessly!


This post is licensed under CC BY 4.0 by the author.