
🧠 Understanding SAM2?!!

🦖 (English) 🧠 Understanding SAM2

🔍 A next-generation segmentation model with unified image & video support, real-time speed, and high accuracy!


Paper: SAM 2: SEGMENT ANYTHING IN IMAGES AND VIDEOS
Conference: ICLR 2025 (by META Research)
Code: facebookresearch/SAM2
Comment: SAM2 follows SAM, now moving beyond static images to dynamic videos!


In a previous post, we explored SAM, released by Facebook Research.
Today, let's dive into SAM2, the follow-up model released by the same team!

โ— Limitations of the Original SAM

As the era of AR/VR and video content expands, SAM, which was designed for static images, has the following limitations:

  • Designed for static images only
    → SAM does not account for temporal dynamics across frames.

  • Cannot track spatio-temporal continuity
    → Cannot handle changes due to motion, deformation, occlusion, or lighting variations.

  • Vulnerable to low video quality
    → Performance drops with blur, noise, or low resolution common in videos.

  • Processes each frame independently
    → Lacks consistent tracking or segmentation continuity across frames.

  • Inefficient on long videos
    → Cannot scale well to thousands of frames due to memory and speed limitations.

➡️ Modern video applications need a model that can handle both spatial and temporal segmentation in a unified way.


✅ Key Features of SAM2

🔍 1. Promptable Visual Segmentation (PVS)

(Figure: Promptable Visual Segmentation workflow)

  • Have you ever tried background removal in PowerPoint?
  • You can click on areas to keep or remove; SAM2's prompting works similarly, but for video frames!
  • You can provide point, box, or mask prompts at any video frame,
    and SAM2 generates a spatio-temporal mask (masklet).
  • Additional prompts help refine the segmentation over time.
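
To make the PVS workflow concrete, here is a minimal sketch of point-prompting a video with the predictor from the facebookresearch/SAM2 repository. The checkpoint and config paths are placeholders, and the function names follow the repo's README as best I recall, so treat the exact signatures as assumptions rather than verified API.

```python
import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor

# Placeholder paths; use the checkpoint/config pair that matches your download.
checkpoint = "checkpoints/sam2_hiera_large.pt"
model_cfg = "sam2_hiera_l.yaml"

predictor = build_sam2_video_predictor(model_cfg, checkpoint)

with torch.inference_mode():
    # A directory of extracted JPEG frames (newer repo versions also accept video files).
    state = predictor.init_state(video_path="./video_frames")

    # One positive click (label=1) on frame 0 selects the object to segment.
    _, obj_ids, mask_logits = predictor.add_new_points_or_box(
        inference_state=state,
        frame_idx=0,
        obj_id=1,
        points=np.array([[210, 350]], dtype=np.float32),
        labels=np.array([1], dtype=np.int32),
    )

    # Propagate the prompt through the whole clip to obtain a spatio-temporal masklet.
    masklet = {}
    for frame_idx, obj_ids, logits in predictor.propagate_in_video(state):
        masklet[frame_idx] = (logits[0] > 0.0).cpu().numpy()
```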

🧠 2. The SAM2 Model: Memory-based Streaming Architecture

SAM2 is a unified model for both images and videos,
extending the original SAM with a streaming architecture and memory module for video support.

(Figure: SAM2 model architecture)


🔧 Core Components
🔹 1. Image Encoder
  • Uses Hiera-based hierarchical encoder (MAE pre-trained)
  • Supports streaming frame-by-frame processing
  • Enables high-resolution segmentation via multiscale features
  • Outputs unconditioned embeddings
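
Since these embeddings are unconditioned, they can be computed once per frame and reused for any number of prompts. A minimal sketch of that pattern with the repo's image predictor follows; the class and config names are taken from the public README as I recall it, so treat them as assumptions.

```python
import numpy as np
from PIL import Image
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

# Placeholder config/checkpoint names.
predictor = SAM2ImagePredictor(build_sam2("sam2_hiera_l.yaml", "checkpoints/sam2_hiera_large.pt"))

image = np.array(Image.open("frame_000.jpg").convert("RGB"))
predictor.set_image(image)  # runs the Hiera encoder once and caches the embeddings

# Any number of prompts can now be decoded against the same cached embeddings.
masks, scores, _ = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),
    multimask_output=True,  # return several candidate masks for an ambiguous click
)
```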

🔹 2. Memory Attention
  • Conditions the current frame's features on memory from previous frames and prompts
  • Uses L transformer blocks
    (self-attention → cross-attention → MLP)
  • Leverages modern attention kernel optimizations
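
Below is a minimal PyTorch sketch of one such block. The dimensions, normalization placement, and class name are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class MemoryAttentionBlock(nn.Module):
    """Illustrative block: self-attention over the current frame's tokens,
    cross-attention into memory tokens (recent + prompted frames), then an MLP."""

    def __init__(self, d_model: int = 256, n_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, frame_tokens: torch.Tensor, memory_tokens: torch.Tensor) -> torch.Tensor:
        # frame_tokens: (B, T_frame, C); memory_tokens: (B, T_mem, C)
        x = frame_tokens
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h)[0]                      # self-attention
        x = x + self.cross_attn(self.norm2(x), memory_tokens,   # cross-attention to memory
                                memory_tokens)[0]
        x = x + self.mlp(self.norm3(x))                         # MLP
        return x
```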

🔹 3. Prompt Encoder & Mask Decoder
  • Same prompt encoder as SAM: supports clicks, boxes, masks
  • Uses two-way transformers to update prompt/frame embeddings
  • Can generate multiple masks for ambiguous prompts
  • Adds object presence head to detect frames where the object is absent
  • Adds high-resolution skip connections for improved decoding
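
To illustrate how these outputs could be consumed downstream, here is a hypothetical sketch: pick the candidate mask with the highest predicted quality, and suppress the output entirely when the object-presence head says the target is not in the frame. The tensor layout and the 0.5 threshold are my assumptions, not the repository's internals.

```python
import torch
from typing import Optional

def select_mask(mask_logits: torch.Tensor,        # (K, H, W) candidate mask logits
                iou_scores: torch.Tensor,         # (K,) predicted quality per candidate
                presence_logit: torch.Tensor) -> Optional[torch.Tensor]:
    """Return the best binary mask, or None when the object is judged absent."""
    if torch.sigmoid(presence_logit) < 0.5:       # illustrative presence threshold
        return None                               # object not visible in this frame
    best = int(torch.argmax(iou_scores))
    return mask_logits[best] > 0.0                # binarize the chosen candidate
```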

🔹 4. Memory Encoder
  • Downsamples predicted masks
  • Combines them with unconditioned embeddings via element-wise summation
  • Fuses features via lightweight CNN layers
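
A rough PyTorch sketch of that fusion step is below; the channel widths, downsampling factor, and layer choices are placeholder assumptions.

```python
import torch
import torch.nn as nn

class MemoryEncoder(nn.Module):
    """Illustrative only: downsample the predicted mask to feature resolution,
    add it element-wise to the unconditioned frame embedding, then fuse the
    result with a lightweight convolutional stack."""

    def __init__(self, embed_dim: int = 256, mem_dim: int = 64):
        super().__init__()
        # Project the 1-channel mask to the embedding width at 1/16 resolution.
        self.mask_downsample = nn.Conv2d(1, embed_dim, kernel_size=16, stride=16)
        self.fuser = nn.Sequential(
            nn.Conv2d(embed_dim, embed_dim, 3, padding=1), nn.GELU(),
            nn.Conv2d(embed_dim, mem_dim, 1),   # memory features are kept compact
        )

    def forward(self, pred_mask_logits: torch.Tensor, frame_embed: torch.Tensor) -> torch.Tensor:
        # pred_mask_logits: (B, 1, H, W); frame_embed: (B, C, H/16, W/16)
        mask_feat = self.mask_downsample(torch.sigmoid(pred_mask_logits))
        return self.fuser(frame_embed + mask_feat)   # element-wise sum, then CNN fusion
```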

🔹 5. Memory Bank
  • Stores memory features for:
    • N recent frames (auto-segmented)
    • M prompted frames (user-guided)
  • Each memory is a spatial feature map
  • Stores object pointers as high-level semantic vectors
  • Adds temporal position embeddings to N frames
    → Helps track short-term motion

+ Summary of Memory Encoder & Bank (like a tracker!)

  1. Segment current frame using prompts → mask decoder
  2. Encode memory → summarized memory features
  3. Store in memory bank → N auto-segmented + M prompted frames
  4. Next frame input → unconditioned embedding
  5. Compare via memory attention → cross-attend to past memories → localize object
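
Putting steps 1-5 together, a simplified version of the per-frame streaming loop could look like the sketch below. The component interfaces and the N/M handling are an illustrative reconstruction of the description above, not the repository code.

```python
from collections import deque

def stream_video(frames, prompts, image_encoder, memory_attention, mask_decoder,
                 memory_encoder, N=6, M=2):
    """frames: iterable of frames; prompts: dict {frame_idx: user_prompt}.
    Returns {frame_idx: predicted_mask}. All components are passed in as callables."""
    recent = deque(maxlen=N)   # memories of the N most recent auto-segmented frames (FIFO)
    prompted = []              # memories of up to M prompted frames, kept separately
    masks = {}

    for t, frame in enumerate(frames):
        embed = image_encoder(frame)                          # unconditioned embedding
        memory = prompted + list(recent)
        cond = memory_attention(embed, memory) if memory else embed
        mask = mask_decoder(cond, prompts.get(t))             # prompt may be None
        mem_feat = memory_encoder(mask, embed)                # summarize this frame

        if t in prompts and len(prompted) < M:
            prompted.append(mem_feat)                         # user-guided memory
        else:
            recent.append(mem_feat)                           # oldest auto memory drops out
        masks[t] = mask
    return masks
```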

๐Ÿ‹๏ธโ€โ™‚๏ธ Model Training

SAM2 is trained jointly on image and video data.
Training simulates interactive user prompting scenarios.

Each training sequence samples 8 frames, with up to 2 prompted frames.
Initial prompts are randomly selected from:

  • 50%: full mask
  • 25%: positive click
  • 25%: bounding box

Additionally, corrective clicks are generated during training to refine predictions.
The model learns to sequentially and interactively predict spatio-temporal masklets based on user guidance.
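
A hedged sketch of this prompt-sampling scheme is shown below. The paper specifies the 50/25/25 split and the use of corrective clicks; the helper functions and the exact corrective-click rule here are illustrative assumptions.

```python
import numpy as np

def random_point_inside(mask: np.ndarray):
    """Pick a random (x, y) coordinate inside a boolean mask (assumed non-empty)."""
    ys, xs = np.nonzero(mask)
    i = np.random.randint(len(ys))
    return int(xs[i]), int(ys[i])

def bounding_box(mask: np.ndarray):
    ys, xs = np.nonzero(mask)
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

def sample_initial_prompt(gt_mask: np.ndarray):
    """50% ground-truth mask, 25% positive click inside the mask, 25% bounding box."""
    r = np.random.rand()
    if r < 0.50:
        return {"type": "mask", "mask": gt_mask}
    if r < 0.75:
        return {"type": "point", "point": random_point_inside(gt_mask), "label": 1}
    return {"type": "box", "box": bounding_box(gt_mask)}

def corrective_click(pred_mask: np.ndarray, gt_mask: np.ndarray):
    """Illustrative rule: click positively on missed object pixels if any exist,
    otherwise click negatively on wrongly included background."""
    false_neg = gt_mask & ~pred_mask
    if false_neg.any():
        return {"type": "point", "point": random_point_inside(false_neg), "label": 1}
    false_pos = pred_mask & ~gt_mask
    return {"type": "point", "point": random_point_inside(false_pos), "label": 0}
```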


🧰 3. Data Engine-based Training

SAM2 uses a Human + Model collaboration approach (data engine), organized in phases:

Phase   | Description
Phase 1 | Humans annotate each frame using SAM1 → high-quality GT data → trains the initial SAM2
Phase 2 | Early SAM2 + human refinements → SAM2 retrained, faster propagation
Phase 3 | Full SAM2 with memory → humans refine via clicks only
+ QA    | Separate validators check quality → unsatisfactory samples corrected or rejected
+ Auto  | SAM2 auto-generates masklets → filtered and added if satisfactory

🚀 SAM2 Performance

As expected, SAM2 achieves strong performance across benchmarks;
no surprise it was accepted to ICLR 2025 😉 (details skipped here).


✅ SAM vs. SAM2 Comparison

Feature          | SAM (2023)                              | SAM2 (2025)
Inference Speed  | Slow (large ViT backbone)               | ✅ Up to 30× faster, real-time capable
Architecture     | Heavy ViT-H model (~632M params)        | ✅ Lightweight design with sparse attention
Accuracy         | Strong but struggles with small objects | ✅ Improved mask precision, especially for small objects
Prompt Types     | Point, Box, Mask                        | ✅ Potential for text & multimodal prompts
Input Modalities | Static images only                      | ✅ Supports video & multi-scale inputs
Deployment       | Cloud & research-focused                | ✅ Runs on mobile & edge devices

🦖 (Korean) 🧠 Understanding SAM2?!!

🔍 A next-generation segmentation model that unifies images and video, with real-time speed and high accuracy!


Paper: SAM 2: SEGMENT ANYTHING IN IMAGES AND VIDEOS
Conference: ICLR 2025 (by META Research)
Code: facebookresearch/SAM2
Comment: SAM2, the successor to SAM. Now going beyond images to video!!


In a previous post, we looked at SAM, released by Facebook Research.
Today, let's take a look at SAM2, the follow-up model released by the same Facebook Research team!

โ— ๊ธฐ์กด SAM ๋ชจ๋ธ์˜ ํ•œ๊ณ„

As the era of AR/VR and other video content arrives, SAM, a model built for image segmentation, showed the following limitations.

  • Static images only:
    SAM was designed to operate on single images and does not consider the temporal dimension.

  • No spatio-temporal tracking:
    It cannot handle changes over time such as object motion, deformation, or occlusion.

  • Vulnerable to low video quality:
    Videos often contain blur, noise, or low resolution, and SAM is not robust to such degradation.

  • Frame-by-frame independent processing:
    Because each frame is processed separately, consistent tracking or coherent mask segmentation is difficult.

  • Inefficient for large-scale video:
    Processing speed and memory efficiency drop when thousands of frames must be handled.

➡️ Therefore, modern video-centric applications need a more unified model that reflects spatio-temporal information.


✅ Key Features of SAM2!

🔍 1. Promptable Visual Segmentation (PVS)

(Figure: Promptable Visual Segmentation workflow)

  • Have you ever tried background removal on an image in PowerPoint?
  • In PowerPoint you can easily mark which parts of the image to keep and which to remove.
  • SAM2's prompting (PVS) likewise lets you mark regions to keep or remove on any frame (image) of a video,
  • and it generates a spatio-temporal mask (masklet) accordingly.
  • Giving additional prompts on further frames progressively refines the segmentation.

🧠 2. The SAM2 Model: A Memory-based Streaming Architecture!

SAM2 is a unified segmentation model applicable to both images and videos; it extends the original SAM with a streaming architecture and a memory module so that it also works flexibly on video.

(Figure: SAM2 model architecture)

🔧 Core Components
🔹 1. Image Encoder
  • Uses a Hiera-based hierarchical encoder (MAE pre-trained)
  • Supports streaming, processing frames sequentially
  • Multiscale features enable high-resolution segmentation
  • Outputs unconditioned token embeddings

🔹 2. Memory Attention
  • Conditions the current frame's features on memories built from previous frames and prompts
  • Uses L transformer blocks
    • self-attention → cross-attention (memory + object pointers) → MLP
  • Can apply modern attention kernel optimizations

🔹 3. Prompt Encoder & Mask Decoder
  • Prompt encoder with the same structure as SAM
    • clicks, boxes, and masks are handled via positional encodings + learned embeddings
  • Two-way transformer blocks mutually update the prompt and frame embeddings
  • Can predict multiple masks (to handle ambiguous prompts)
  • Adds an object-presence head: unlike SAM, the target object may be absent from a frame, so this decision is needed!!
  • Adds high-resolution skip connections (routed to the decoder without passing through memory attention)

🔹 4. Memory Encoder
  • Downsamples the currently predicted mask, then
    sums it element-wise with the image encoder's unconditioned embedding
  • The result is then fused through lightweight CNN layers

🔹 5. Memory Bank
  • Stores memories for the N most recent frames and up to M prompted frames
  • Each memory is a spatial feature map
  • Object pointers:
    • based on the mask decoder's output tokens for each frame
    • store high-level semantic information about the object as vectors
  • Temporal position embeddings:
    • applied only to the N recent frames, enabling short-term motion tracking
  • Summary of how the SAM2 memory encoder & memory bank work (it's like a tracker!)
  1. Segment the current frame: based on the prompts, the mask decoder segments the object
  2. Memory encoding: a summary of the object's features is produced → memory feature
  3. Memory bank storage: memories of the N (auto-segmented) frames are managed as a FIFO queue, and prompted frames are stored separately, up to M of them
  4. Next frame input: the image encoder produces the new frame's unconditioned embedding
  5. Memory attention comparison: the current frame embedding is compared against the past memory features in the memory bank via cross-attention to localize the object
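
As a small illustration of step 3, the memory bank's two kinds of entries can be managed as shown below; the class name and the default N/M values are placeholder assumptions.

```python
from collections import deque

class MemoryBank:
    """Illustrative memory bank: memories of the N most recent auto-segmented frames
    live in a FIFO queue, while up to M prompted-frame memories are kept separately."""

    def __init__(self, n_recent: int = 6, m_prompted: int = 2):
        self.recent = deque(maxlen=n_recent)   # oldest entry is evicted automatically
        self.prompted = []
        self.m_prompted = m_prompted

    def add(self, memory_feature, was_prompted: bool) -> None:
        if was_prompted and len(self.prompted) < self.m_prompted:
            self.prompted.append(memory_feature)
        else:
            self.recent.append(memory_feature)

    def all_memories(self):
        # Prompted memories come first so that user guidance is never evicted.
        return list(self.prompted) + list(self.recent)
```
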
๐Ÿ‹๏ธโ€โ™‚๏ธ ๋ชจ๋ธ์˜ ํ•™์Šต (Training)

SAM2 is trained jointly on image and video data!!
The training procedure is designed to simulate interactive user scenarios:
a video sequence of 8 frames is sampled, and prompts are given on up to 2 of those frames.

The initial prompt is chosen probabilistically:
a ground-truth mask with 50% probability, a click inside the mask with 25% probability, and a bounding box with 25% probability.
In addition, corrective clicks are generated by comparing the model's prediction against the ground-truth mask and fed back into training.

In this way, SAM2 learns to predict the ground-truth masklet (spatio-temporal mask)
sequentially and progressively from the prompts.


🧰 3. Data Engine-based Training

  • SAM2 generates its training data in phases through a human + model collaboration (data engine)!!

Phase        | Description
Phase 1      | Humans create masks manually on every frame using SAM1 → ground-truth data (slow, highly accurate) → used to train the initial SAM2
Phase 2      | Early SAM2 + human corrections → SAM2 retrained; faster mask propagation & refinement
Phase 3      | Prompt-based segmentation with the full SAM2 (including memory) → humans refine with clicks only
+ Validation | Separate validators assess mask quality → poor masks are corrected again; those below the bar are discarded
+ Auto       | Among the masks SAM2 generates automatically, only the 'satisfactory' ones are filtered in → added to the dataset

🚀 SAM2 Performance

  • As always, it shows off strong performance on a variety of benchmarks; with results this good, no wonder it was accepted to ICLR!? Haha, we'll skip the details here!!

✅ SAM vs. SAM2 Comparison!!

Feature          | SAM (2023)                           | SAM2 (2025)
Inference Speed  | Slow (large ViT backbone)            | ✅ Up to 30× faster, real-time capable
Architecture     | Heavy ViT-H model (~632M parameters) | ✅ Lightweight design with sparse attention
Accuracy         | Strong but weak on small objects     | ✅ Improved mask accuracy for small objects
Prompt Types     | Point, box, mask                     | ✅ Potential to extend to text & multimodal prompts
Input Modalities | Static images only                   | ✅ Supports video & multi-scale inputs
Deployment       | Cloud/research-focused               | ✅ Can also run on mobile & edge devices