
🧠 EfficientSAM: Leveraged Masked Image Pretraining for Efficient Segment Anything (the standard for practical SAM)

🧠 EfficientSAM: A 'Light & Fast' Segment Anything via Leveraged Masked Image Pretraining

Image


🚀 EfficientSAM: Key Points

One-liner: "Retain SAM's capability, optimize weight and speed for deployment."

1) Efficient architecture 🧠

  • Lightweight image encoder: Replace SAM's heavy ViT-H with ViT-Tiny/Small backbones. The prompt encoder & mask decoder stay compatible with SAM, preserving the pipeline.

2) Smarter pretraining 🎯

  • SAMI (SAM-Leveraged Masked Image Pretraining): Train the lightweight encoder to reconstruct features from SAM's ViT-H with a masked pretext task → transfers SAM's representation power into a compact backbone.
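
As a rough illustration of the idea, here is a minimal sketch of a SAMI-style pretraining step, assuming a hypothetical `student` module (lightweight ViT plus a small decoder that predicts features at every patch position) and a frozen `sam_vit_h_encoder` returning per-patch features; names and shapes are illustrative, not the paper's actual API:

```python
import torch
import torch.nn.functional as F

def sami_step(student, sam_vit_h_encoder, images, mask_ratio=0.75):
    """One SAMI-style step: reconstruct frozen SAM ViT-H features from a masked view."""
    with torch.no_grad():                       # the SAM teacher stays frozen
        target = sam_vit_h_encoder(images)      # (B, N, D) per-patch features

    B, N, _ = target.shape
    num_visible = int(N * (1 - mask_ratio))     # high mask ratio, MAE-style
    noise = torch.rand(B, N, device=images.device)
    visible_idx = noise.argsort(dim=1)[:, :num_visible]  # random visible patches

    # The student encodes only the visible patches; its decoder fills in
    # feature predictions for all N positions.
    pred = student(images, visible_idx)         # (B, N, D)

    # Plain MSE on feature values (the ablation below favors MSE over cosine).
    return F.mse_loss(pred, target)
```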

3) Practical extensibility 🛠️

  • Keeps SAM's interactive prompts (points/boxes/"segment everything") and can be fine-tuned for classification, detection, and segmentation downstream.

4) Better efficiency-accuracy trade-off ⚡

  • Aims to retain segmentation quality while cutting params/FLOPs, ideal for edge/mobile/real-time scenarios.

๐Ÿ” Prior Work

  • Making SAM & ViT efficient
    • SAM is widely used; many works reduce its compute cost.
    • FastSAM uses a CNN (e.g., YOLOv8-seg) to segment all objects efficiently.
    • MobileSAM distills a light image encoder via decoupled distillation.
    • Efficient ViT variants continue to emerge: ViT/DeiT Tiny/Small, MobileViT, LeViT, EfficientViT, etc.
  • Knowledge Distillation (KD)
    • KD transfers knowledge from a large teacher to a small student without changing architecture, supervised by hard + soft labels.
      • Hard labels: one-hot targets (e.g., [cat=1, fox=0, car=0]); typically trained with CE; lack inter-class similarity.
      • Soft labels: teacher's probability distribution (e.g., [cat=0.60, fox=0.35, car=0.05]), often with temperature to reveal dark knowledge (class relations), improving generalization/calibration; see the KD sketch after this list.
    • Recent trends: stronger soft-label KD, decoupling (separate feature learning vs. classification), and Decoupled KD (split KD loss into target/non-target parts) so the student learns both confidence for the true class and relations among the rest.
    • Another line matches intermediate features directly, e.g., FitNet, SSTA for ViT students, or aligning features between an MAE teacher and student.
  • MIM (Masked Image Modeling)
    • Self-supervised pretraining: mask patches and reconstruct the missing parts.
    • BEiT predicts visual tokens; SimMIM reconstructs pixels; MaskFeat reconstructs HOG features.
    • MAE (Masked Autoencoder): high mask ratio (~75%), asymmetric encoder-decoder; the encoder sees only visible patches, and the decoder reconstructs the full image (usually pixels).
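
To make the hard-vs-soft label distinction concrete, here is a minimal sketch of classic temperature-scaled KD, assuming `student_logits` and `teacher_logits` are (B, num_classes) tensors; this is generic KD, not EfficientSAM-specific code:

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    # Hard-label term: cross-entropy against the one-hot ground truth.
    hard = F.cross_entropy(student_logits, labels)
    # Soft-label term: KL divergence between temperature-softened
    # distributions; T > 1 flattens the teacher's output and exposes
    # inter-class relations ("dark knowledge").
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # T^2 keeps the soft term's gradients comparable to the hard term
    return alpha * hard + (1 - alpha) * soft
```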

🧱 EfficientSAM Architecture

Image

  • Image Encoder
    • ViT-Tiny/Small backbones.
    • SAMI pretraining teaches them to reconstruct SAM ViT-H features, so the compact encoder inherits SAM-like representations.
    • Instead of vanilla KD on full images, the masked objective forces the encoder to infer local and occluded regions, improving awareness and robustness.
  • Prompt Encoder (same as SAM)
    • Lightweight transformer that embeds points/boxes into a unified prompt embedding.
  • Mask Decoder (same as SAM)
    • Combines image & prompt embeddings with dual cross-attention, outputs masks (+ IoU prediction).
    • Full compatibility with existing SAM tooling/interfaces.
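
Putting the three modules together, a hedged sketch of the promptable inference path looks roughly like this; `image_encoder`, `prompt_encoder`, and `mask_decoder` are placeholders for whatever EfficientSAM checkpoint/tooling you load, with SAM-like calling conventions assumed:

```python
import torch

@torch.no_grad()
def predict_masks(image_encoder, prompt_encoder, mask_decoder,
                  image, point_coords, point_labels):
    # Lightweight SAMI-pretrained ViT-T/S replaces SAM's ViT-H here.
    image_embedding = image_encoder(image)                  # (1, C, H', W')
    # Prompt encoder embeds points/boxes as in SAM.
    sparse, dense = prompt_encoder(points=(point_coords, point_labels),
                                   boxes=None, masks=None)
    # Mask decoder fuses image and prompt embeddings via dual
    # cross-attention and also predicts an IoU score per candidate mask.
    masks, iou_pred = mask_decoder(image_embedding, sparse, dense)
    best = iou_pred.argmax(dim=-1)                          # highest-confidence mask
    return masks[torch.arange(masks.shape[0]), best]
```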

🔧 Training Recipe & Results

  • 1) SAMI Pretraining
    • Teacher: SAM's ViT-H image encoder features.
    • Student: lightweight ViT-T/S.
    • Goal: via masked reconstruction, reproduce SAM features → the student learns promptable-segmentation-friendly representations.
  • 2) SA-1B Finetuning
    • The SAMI-initialized encoder + SAM decoder are finetuned on SA-1B for points/boxes/"everything".
  • 3) Downstream transfer
    • Use the SAMI encoder for classification/detection/segmentation to show broad applicability.

Image

  • Shows solid performance on Image Classification, Object Detection & Instance Segmentation, and Semantic Segmentation.

🧪 Segmentation Results & Ablations

1) Benefit of SAMI

  • Compared to vanilla MAE-like pretraining, SAMI (reconstructing SAM features) learns representations more suitable for promptable segmentation.

2) Effectiveness of lightweight backbones

  • With ViT-T/S + SAMI + finetune, EfficientSAM keeps quality while boosting efficiency, reducing reliance on ViT-H.

3) Practical compatibility

  • Maintains points/boxes/everything prompts and SAM mask decoder, minimizing replacement cost (checkpoints/examples provided).

🎯 Zero-shot single-point valid mask evaluation (1-click / 1-box)

  • Protocol: a random foreground point within the GT mask (1-click), or the tight GT bbox (1-box), as the prompt; among multiple predicted masks, evaluate only the highest-confidence one.
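
A small sketch of this protocol, under the assumption that `predict` is any SAM-style callable returning candidate masks with confidence scores and `gt_mask` is a boolean (H, W) array (all names hypothetical):

```python
import numpy as np

def single_point_iou(predict, image, gt_mask, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    ys, xs = np.nonzero(gt_mask)
    i = rng.integers(len(xs))
    point = (int(xs[i]), int(ys[i]))          # random foreground click
    # (For the 1-box setting, the prompt would instead be the tight GT
    # bbox: (xs.min(), ys.min(), xs.max(), ys.max()).)
    masks, scores = predict(image, point=point)
    best = masks[int(np.argmax(scores))]      # evaluate the highest-confidence mask only
    inter = np.logical_and(best, gt_mask).sum()
    union = np.logical_or(best, gt_mask).sum()
    return inter / union                      # averaged over instances → mIoU
```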

Image

  • Highlights
    • EfficientSAM-Ti: vs MobileSAM, +1.9 mIoU (1-click), +1.5 mIoU (1-box) at similar complexity.
    • SAMI > MAE: SAMI-pretrained weights outperform MAE-pretrained on COCO/LVIS interactive.
    • EfficientSAM-S: COCO (box) −1.5 mIoU vs SAM; LVIS (box) −3.5 mIoU, with ~20× fewer params.
    • Competitive with MobileSAM and SAM-MAE-Ti on multi-click as well.

📦 Zero-shot instance segmentation

Image

  • Protocol: Use ViTDet-generated bbox prompts; pick the candidate mask with the highest IoU against the prompt box (see the sketch after this list).
    • ViTDet-H therefore serves as a strong upper baseline for comparison.
  • Results
    • EfficientSAM-S: vs FastSAM COCO +6.5 AP, LVIS +7.8 AP.
    • EfficientSAM-Ti: vs FastSAM COCO +4.1 AP, LVIS +5.3 AP; vs MobileSAM COCO +3.6 AP, LVIS +5.5 AP.
    • Model size: Ti 9.8M vs FastSAM 68M params → much lighter.
    • The S model narrows the gap to full SAM (~0.6B params) to about ~2 AP.
  • Summary: Beats other lightweight models; slightly below the very large ViTDet-H+SAM pipeline.
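
For the box-prompt protocol above, the mask-selection step can be sketched as follows, assuming `candidate_masks` is a (K, H, W) boolean array of predictions for one ViTDet box (names illustrative; the paper does not publish this exact helper):

```python
import numpy as np

def pick_mask(candidate_masks, prompt_box):
    """Keep the candidate whose IoU with the prompt-box region is highest."""
    x1, y1, x2, y2 = (int(v) for v in prompt_box)
    box_region = np.zeros_like(candidate_masks[0], dtype=bool)
    box_region[y1:y2 + 1, x1:x2 + 1] = True   # rasterize the bbox prompt

    def iou(mask):
        inter = np.logical_and(mask, box_region).sum()
        union = np.logical_or(mask, box_region).sum()
        return inter / (union + 1e-9)

    scores = [iou(m) for m in candidate_masks]
    return candidate_masks[int(np.argmax(scores))]
```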

👀 Qualitative & Salient Instance Segmentation

Image

  • Qualitative: For points/boxes/"segment everything," EfficientSAM's boundaries & occlusion reasoning are close to SAM's.
  • Salient Instance Seg.: Generate a saliency map with U²-Net, then sample 3 points (3-click) inside the map to segment with EfficientSAM.
    → Promising for accessibility (e.g., users with limited hand mobility).
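
A hedged sketch of that saliency-to-3-click flow, where `saliency_model` (e.g., a U²-Net wrapper) returns an (H, W) map in [0, 1] and `efficient_sam` accepts point prompts; both callables are assumptions, not real APIs:

```python
import numpy as np

def salient_instance_seg(image, saliency_model, efficient_sam,
                         thresh=0.5, n_points=3, seed=0):
    saliency = saliency_model(image)               # (H, W) saliency in [0, 1]
    ys, xs = np.nonzero(saliency > thresh)         # pixels inside the salient region
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(xs), size=n_points, replace=False)
    points = np.stack([xs[idx], ys[idx]], axis=1)  # 3 clicks inside the map
    labels = np.ones(n_points, dtype=int)          # all positive (foreground) points
    return efficient_sam(image, points=points, point_labels=labels)
```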

🧪 Core Ablations

  • Reconstruction loss in SAMI: MSE > cosine → directly reconstructing SAM feature values works better.
  • Cross-attention decoder: Query only masked tokens (encoder outputs act like anchors) → +3% Top-1 vs decoding all tokens (MAE-style) on ImageNet-1K (SAMI-Ti); see the sketch after this list.
  • Mask ratio: High ratio (~75%) remains consistently strong (50/75/85% tested).
  • Reconstruction target: Using CLIP encoder features as the target still yields +0.8%p over MAE (ViT-Tiny, IN-1K) → validates guided MIM with strong teacher features.
  • Finetuning steps: Good results even at 0.1 epoch; +2.5 mIoU by 1 epoch.
    • EfficientSAM-S reaches 76.9 mIoU, only −1.5 mIoU vs SAM.
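
To illustrate the cross-attention-decoder ablation, here is a minimal, assumption-laden sketch in which only masked positions act as queries while visible encoder outputs serve as keys/values (the "anchors"); dimensions and module layout are illustrative only:

```python
import torch
import torch.nn as nn

class MaskedQueryDecoder(nn.Module):
    """Queries = learnable mask tokens; keys/values = visible encoder outputs."""
    def __init__(self, dim=192, num_heads=3):
        super().__init__()
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, visible_feats, num_masked):
        B = visible_feats.shape[0]
        queries = self.mask_token.expand(B, num_masked, -1)
        # Only masked tokens are decoded; visible tokens pass through
        # untouched, the choice reported to give +3% Top-1 over
        # decoding all tokens MAE-style.
        out, _ = self.attn(queries, visible_feats, visible_feats)
        return self.proj(out)    # predicted features at masked positions
```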

✅ Conclusion

  • EfficientSAM transfers SAM's representational power into a lightweight encoder via SAMI pretraining, achieving similar accuracy with much better efficiency.
  • With prompt compatibility (points/boxes/everything) and open checkpoints, it's highly suitable for edge, real-time, and large-scale deployment.

This post is licensed under CC BY 4.0 by the author.