🧩 PartCLIPSeg: Open-Vocabulary Part-level Segmentation with CLIP Guidance

🧩 PartCLIPSeg: Open-Vocabulary Segmentation That Recognizes Down to the Part Level with CLIP!

  • Title: PartCLIPSeg: Understanding Multi-Granularity for Open-Vocabulary Part Segmentation
  • Venue: NeurIPS 2024
  • Code/Checkpoints: GitHub – PartCLIPSeg
  • Keywords: Open-Vocabulary, Part-level Segmentation, CLIP, Recognition, Fine-grained
  • Summary: Uses CLIP's text-vision alignment to push open-vocabulary recognition and segmentation beyond the object level, down to the part level!

🚀 PartCLIPSeg Key Summary

One-line summary: "Not just the whole object: it cuts out and names parts such as 'handle', 'wheel', and 'wing'!"

1) CLIP-Based Part Recognition

  • Uses CLIP's vision-text embedding space to label regions down to fine-grained parts
  • Segments not only "car" but also individual parts such as "car-wheel" and "car-door"

2) Open-Vocabulary Extension 🎯

  • Extends existing object-level OV segmentation to the fine-grained part level
  • Recognizes "wing", "leaf", "handle", and so on from text prompts, even without predefined labels
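
To make the text-prompt idea concrete, here is a minimal sketch of how object and part names might be combined into prompts before being passed to a text encoder. The helper `build_part_prompts` and the `"{obj}'s {part}"` template are illustrative assumptions, not the prompt format used in the paper.

```python
def build_part_prompts(object_name, part_names, template="{obj}'s {part}"):
    """Return one text prompt per part of the given object category.

    The template is a hypothetical choice made for this sketch.
    """
    return [template.format(obj=object_name, part=part) for part in part_names]


prompts = build_part_prompts("car", ["wheel", "door", "window"])
print(prompts)  # ["car's wheel", "car's door", "car's window"]

# A part name that never appeared in training is just another string.
print(build_part_prompts("bird", ["wing"], template="{obj}-{part}"))  # ['bird-wing']
```

Because the vocabulary is only a list of strings, adding a new part means adding a new prompt rather than retraining a fixed classifier head.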

3) Interactive Prompt Input 🛠️

  • Supports point, bounding-box, and mask prompts
  • Combines them with text prompts for real-time part recognition + segmentation

4) Fine-Grained Visual Understanding ⚡

  • Directly applicable to fields that need precise analysis, such as robotics, AR/VR, 3D understanding, and medical imaging

🔍 How Prior Work Led Here

  • CLIP: strong zero-shot performance thanks to contrastive vision-language pre-training
  • Open-Vocabulary: extended to object detection and segmentation (OV detection/segmentation)
  • SAM: the arrival of the Segment Anything Model gave a wide range of downstream tasks a common foundation
  • However, prior work stopped at the object level and did not address fine-grained part recognition
  • → PartCLIPSeg fuses CLIP with segmentation to propose part-level open-vocabulary segmentation!

🧱 PartCLIPSeg Architecture

1) Backbone: CLIP Visual Encoder

  • Extracts multi-scale features while keeping a resolution suited to telling parts apart

2) Part-level Adapter (CLIP2PartSeg)

  • Converts CLIP features into features compatible with the segmentation decoder
  • FPN-based multi-scale alignment
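
As a rough illustration of what FPN-style multi-scale alignment involves, the sketch below shows only the top-down merge step: a coarse feature map is upsampled 2x with nearest-neighbour interpolation and added to the next finer map. It assumes single-channel maps stored as nested lists and leaves out the lateral 1x1 convolutions of a real FPN; the function names are hypothetical and this is not the adapter's actual implementation.

```python
def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of a 2D feature map (list of rows)."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # repeat each column twice
        out.append(wide)
        out.append(list(wide))                     # repeat each row twice
    return out


def top_down_merge(coarse, fine):
    """Add the upsampled coarse map to the finer map, element by element."""
    up = upsample2x(coarse)
    if len(up) != len(fine) or len(up[0]) != len(fine[0]):
        raise ValueError("fine map must be exactly 2x the size of the coarse map")
    return [[u + f for u, f in zip(up_row, fine_row)]
            for up_row, fine_row in zip(up, fine)]


coarse = [[1.0, 2.0]]                  # 1x2 low-resolution map (toy values)
fine = [[0.1, 0.2, 0.3, 0.4],
        [0.5, 0.6, 0.7, 0.8]]          # 2x4 high-resolution map (toy values)
merged = top_down_merge(coarse, fine)
print(len(merged), len(merged[0]))     # 2 4
```

The merged map keeps the finer resolution while carrying context from the coarser level, which is the property that matters when small parts need to be separated.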

3) Prompt Encoder

  • SAM-style point/box/mask prompts
  • Combined with text prompts (e.g., "car wheel")

4) Mask Decoder

  • Fuses CLIP2PartSeg features with the prompts
  • Produces part-level segmentation masks

5) Recognition Head (Part2CLIP)

  • Re-projects each segmented mask into the CLIP embedding space
  • Matches it against text embeddings by cosine similarity → decides the part label
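
The recognition step described above can be sketched as masked average pooling followed by cosine-similarity matching. This is a minimal illustration with made-up 2-D features and hand-written "text embeddings"; the helper names are hypothetical, and the actual head may pool or project features differently.

```python
import math


def masked_average(features, mask):
    """Average the feature vectors of all pixels where mask == 1.

    features: H x W x D nested lists, mask: H x W nested lists of 0/1.
    """
    dim = len(features[0][0])
    total = [0.0] * dim
    count = 0
    for feat_row, mask_row in zip(features, mask):
        for vec, m in zip(feat_row, mask_row):
            if m:
                count += 1
                for i, v in enumerate(vec):
                    total[i] += v
    if count == 0:
        raise ValueError("mask is empty")
    return [t / count for t in total]


def cosine(a, b):
    """Cosine similarity of two vectors (assumes neither is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def label_mask(features, mask, text_embeddings):
    """Return (best_label, scores) for one mask given {label: embedding}."""
    pooled = masked_average(features, mask)
    scores = {name: cosine(pooled, emb) for name, emb in text_embeddings.items()}
    return max(scores, key=scores.get), scores


# Toy 1x3 feature map with 2-D features; all values are invented.
features = [[[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]]
mask = [[1, 1, 0]]
text_embeddings = {"car wheel": [1.0, 0.0], "car door": [0.0, 1.0]}

label, scores = label_mask(features, mask, text_embeddings)
print(label)  # car wheel
```

In a real pipeline the vectors would come from the image and text encoders, but the decision rule stays the same: the label whose text embedding is most similar to the pooled mask embedding wins.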

🔧 Training Recipe

1) Pre-training Stage

  • Trains CLIP2PartSeg on COCO, LVIS, PartImageNet, and similar datasets
  • Mask supervision + Text alignment joint loss
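
One plausible shape for such a joint loss is a per-pixel mask term plus a text-alignment term, as sketched below. The specific choices here are assumptions made for illustration: binary cross-entropy for the mask, softmax cross-entropy over cosine similarities for the text term, a temperature of 0.07, and a `weight` balancing factor. The post does not state which loss functions or weights the authors use.

```python
import math


def bce_mask_loss(pred_probs, target_mask, eps=1e-7):
    """Mean binary cross-entropy between predicted probabilities and a 0/1 mask."""
    total, n = 0.0, 0
    for p_row, t_row in zip(pred_probs, target_mask):
        for p, t in zip(p_row, t_row):
            p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
            total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
            n += 1
    return total / n


def text_alignment_loss(similarities, target_index, temperature=0.07):
    """Cross-entropy over softmax(similarities / temperature)."""
    logits = [s / temperature for s in similarities]
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum_exp - logits[target_index]


def joint_loss(pred_probs, target_mask, similarities, target_index, weight=1.0):
    """Mask term plus weighted text-alignment term (weight is hypothetical)."""
    mask_term = bce_mask_loss(pred_probs, target_mask)
    text_term = text_alignment_loss(similarities, target_index)
    return mask_term + weight * text_term


# Toy example: a 1x2 mask prediction and similarities to three part names.
loss = joint_loss([[0.9, 0.2]], [[1, 0]], [0.8, 0.1, 0.3], target_index=0)
print(round(loss, 4))
```

The mask term rewards accurate boundaries, while the text term pulls the mask's embedding toward the correct part name and away from the others.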

2) Fine-grained Distillation

  • Aligns existing object-level SAM features with CLIP features
  • Distills CLIP-based text similarity for part-level annotations
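
A common way to distill similarity scores is to match the student's softened distribution over part names to the teacher's with a KL divergence. The sketch below assumes that formulation; whether PartCLIPSeg uses KL, a different divergence, or a feature-level loss is not stated in this post, and the function names and temperature are hypothetical.

```python
import math


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]


def kl_distillation(teacher_scores, student_scores, temperature=1.0):
    """KL(teacher || student) between softened similarity distributions."""
    p = softmax(teacher_scores, temperature)
    q = softmax(student_scores, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# Similarities of one region to ["wheel", "door", "window"]; toy values.
teacher = [0.9, 0.2, 0.1]
student = [0.4, 0.5, 0.1]
print(kl_distillation(teacher, student) > 0.0)  # True
```

The loss is zero when the student ranks and weights the part names exactly as the teacher does, and grows as the two distributions diverge.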

3) Zero-shot Extension

  • Uses ImageNet-22K text embeddings → part recognition across 20,000+ classes

🧪 Experimental Results

🎯 Open-Vocabulary Part Segmentation

  • PartImageNet: mIoU 76.4 / novel class IoU 72.1
  • Pascal-Part: mIoU 74.9 / novel class IoU 70.8
  • Separates fine-grained parts better than existing object-level OVS models
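
For readers less familiar with the metric, the sketch below shows how per-class IoU and mIoU are computed from a predicted and a ground-truth label map. It works on a single toy image with invented labels; benchmark protocols typically accumulate intersections and unions over the whole dataset, and the numbers here have nothing to do with the reported results.

```python
def class_iou(pred, target, class_id):
    """Intersection over union for one class; None if the class is absent."""
    inter = union = 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            in_pred, in_target = p == class_id, t == class_id
            inter += in_pred and in_target
            union += in_pred or in_target
    return inter / union if union else None


def mean_iou(pred, target, class_ids):
    """Average IoU over the classes that appear in pred or target."""
    ious = [class_iou(pred, target, c) for c in class_ids]
    ious = [v for v in ious if v is not None]
    if not ious:
        raise ValueError("none of the classes appear in pred or target")
    return sum(ious) / len(ious)


# 0 = background, 1 = wheel, 2 = door (toy label maps).
pred = [[0, 1, 1],
        [0, 2, 2]]
target = [[0, 1, 2],
          [0, 2, 2]]

print(class_iou(pred, target, 1))              # 0.5
print(round(mean_iou(pred, target, [0, 1, 2]), 4))  # 0.7222
```

A "novel class IoU" is the same computation restricted to part classes that were held out during training.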

🎯 Efficiency

  • Both FLOPs and parameter count are roughly half those of existing OV baselines
  • Keeps real-time interactive performance

👀 Qualitative Comparison

  • Car images → separates and names "door", "wheel", and "window"
  • Animal images → segments fine-grained parts such as "wing", "tail", and "head"
  • Medical images → extends recognition to "sub-parts" within organs

🧪 Ablation Analysis

  • Without Part2CLIP alignment, per-part recognition accuracy drops sharply
  • Adding the text similarity loss gives a large gain on novel part recognition
  • Class scalability: even when scaling from 1K to 20K part classes, inference cost grows only linearly
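
The linear-scaling claim is easy to sanity-check for the label-matching step alone. The sketch below counts multiply-adds for a dense table of dot products between mask embeddings and class text embeddings. The mask count of 10 and the embedding size of 512 are assumptions for illustration, and the count ignores the encoders and decoder, whose cost does not depend on the vocabulary size.

```python
def matching_multiply_adds(num_masks, num_classes, embed_dim):
    """Multiply-adds for a dense mask-by-class dot-product similarity table."""
    return num_masks * num_classes * embed_dim


small = matching_multiply_adds(num_masks=10, num_classes=1_000, embed_dim=512)
large = matching_multiply_adds(num_masks=10, num_classes=20_000, embed_dim=512)
print(large / small)  # 20.0
```

Going from 1K to 20K classes multiplies the matching cost by exactly 20, and since text embeddings can be computed once and cached, only this matching step grows with the vocabulary.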

✅ Conclusion

  • PartCLIPSeg extends open-vocabulary segmentation from the object level to the part level
  • It combines CLIP-based zero-shot recognition with SAM-style segmentation to deliver fine-grained visual understanding
  • A model well placed to become a next-generation standard in applications where understanding fine structure matters, such as robotics, AR/VR, and industrial/medical imaging!
This post is licensed under CC BY 4.0 by the author.